This document is largely built on Jared Knowles's R Bootcamp files. Those interested in learning more about R than we cover here should refer to his web page.

Overview

R

What Does it Look Like?

The R workspace in RStudio

A Bit of HistoRy

The Philosophy

John Chambers, describing the logic behind the S language, said:

[W]e wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.

R is Born

Why Use R

Thoughts on Free

R Advantages Continued

R Can Complement Other Tools

R's Popularity

R has recently passed Stata in Google Scholar hits and is catching up to the two major players, SPSS and SAS.

See the KDnuggets poll "Languages for analytics/data mining" (Aug 2013). The question was: What programming/statistics languages you used for an analytics / data mining / data science work in 2013?

R Has an Active Web Presence

R is linked to from more and more sites

R Extensions

These links come from the explosion of add-on packages to R

R Has an Active Community

Usage of the R listserv for help has really exploded recently

Data from Bob Muenchen available online

R's Drawbacks

R Vocabulary

Components of an R Setup

Advanced R Setup

Open Source Toolchain

Some Notes about Maintaining R

Help

Help (2)

foo <- c(1, "b", 5, 7, 0)
bar <- c(1, 2, 3, 4, 5)
foo + bar
Error: non-numeric argument to binary operator
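
The error arises because c() can hold only one type: mixing numbers with "b" coerces every element to character, and + is not defined for character vectors. Below is a minimal sketch of one way to diagnose and work around this, assuming that turning the non-numeric entry into NA is acceptable (the name foo_num is introduced here only for illustration):

```r
foo <- c(1, "b", 5, 7, 0)
bar <- c(1, 2, 3, 4, 5)

class(foo)  # "character": the single string forced every element to character

# as.numeric() converts what it can; "b" becomes NA (with a warning, silenced here)
foo_num <- suppressWarnings(as.numeric(foo))
foo_num + bar  # 2 NA 8 11 5
```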

“sos” Package: Help (3)

# install.packages('sos')
require(sos)
# Functions provided by sos package
ls("package:sos")

# Find functions through RSiteSearch() function. Try:
z <- findFn("GARCH", maxPages = 1)
# print(z) or z

# Another search. Try:
z <- findFn("World Development Index", maxPages = 1)
# print(z) or z

Check RStudio

The Data Frame

data(mtcars)
mtcars
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2

A Data Frame

R As A Calculator

2 + 2  # add numbers
[1] 4
2 * pi  #multiply by a constant
[1] 6.283
7 + runif(1, min = 0, max = 1)  #add a random variable
[1] 7.394
4^4  # powers
[1] 256
sqrt(4^4)  # functions
[1] 16

Arithmetic Operators

2 + 2
[1] 4
2/2
[1] 1
2 * 2
[1] 4
2^2
[1] 4
2 == 2
[1] TRUE
23%/%2
[1] 11
23%%2
[1] 1
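
The last two operators, %/% (integer division) and %% (remainder), fit together: the quotient times the divisor plus the remainder gives back the original number. A quick check, with variable names chosen here for illustration:

```r
n <- 23
d <- 2
q <- n %/% d  # integer quotient: 11
r <- n %% d   # remainder: 1
q * d + r == n  # TRUE

# %% is handy for testing divisibility, e.g. picking the even numbers in 1:10
(1:10)[(1:10) %% 2 == 0]  # 2 4 6 8 10
```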

Other Key Symbols

foo <- 3
foo
[1] 3
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
# it increments by one
a <- 100:120
a
 [1] 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116
[18] 117 118 119 120

Comments in R

# Something I want to keep from R
# Like my secret from the R engine
# Maybe intended for a human and not the computer
# Like: Look at this cool plot!

myplot(readSS,mathSS,data=df)

R Advanced Math

Using the Workspace

Using the Workspace (2)

x <- 5  #store a variable with <-
x  #print the variable
[1] 5
z <- 3
ls()  #list all variables
 [1] "a"        "A"        "b"        "bar"      "big"      "c"       
 [7] "cleandf"  "d"        "dat.dta"  "dat.spss" "defac"    "df"      
[13] "foo"      "foo.mat"  "i"        "messdf"   "mtcars"   "myarray" 
[19] "mycorr"   "mycorr2"  "mydate"   "mydate2"  "myfac"    "myfac_o" 
[25] "mylist"   "mymat"    "mymod"    "myvec"    "newmat"   "p"       
[31] "random"   "random2"  "small"    "uniqstu"  "x"        "z"       
[37] "zap"     
# ls.str() #list and describe variables
rm(x)  # delete a variable
ls()
 [1] "a"        "A"        "b"        "bar"      "big"      "c"       
 [7] "cleandf"  "d"        "dat.dta"  "dat.spss" "defac"    "df"      
[13] "foo"      "foo.mat"  "i"        "messdf"   "mtcars"   "myarray" 
[19] "mycorr"   "mycorr2"  "mydate"   "mydate2"  "myfac"    "myfac_o" 
[25] "mylist"   "mymat"    "mymod"    "myvec"    "newmat"   "p"       
[31] "random"   "random2"  "small"    "uniqstu"  "z"        "zap"     

R as a Language

  1. Case sensitivity matters
a <- 3
A <- 4
print(c(a, A))
[1] 3 4

“c” is our friend

A <- c(3, 4)
print(A)
[1] 3 4

Language

a <- runif(100)  # Generate 100 random numbers
b <- runif(100)  # 100 more
c <- NULL  # Setup for loop (declare variables)
for (i in 1:100) {
    # Loop just like in Java or C
    c[i] <- a[i] * b[i]
}
d <- a * b
identical(c, d)  # Test equality
[1] TRUE
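
The comparison above shows that the loop and the vectorized expression a * b agree. If a loop is needed, one common refinement is to preallocate the result rather than grow it from NULL. A sketch, with set.seed() and the names prod_loop and prod_vec added here as assumptions for reproducibility and clarity:

```r
set.seed(42)       # make the random numbers reproducible
a <- runif(100)
b <- runif(100)

prod_loop <- numeric(length(a))  # preallocate a vector of 100 zeros
for (i in seq_along(a)) {
    prod_loop[i] <- a[i] * b[i]
}

prod_vec <- a * b  # vectorized: one expression, no explicit loop
identical(prod_loop, prod_vec)  # TRUE
```

The vectorized form is shorter and is usually the more idiomatic choice in R.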

More Language Bugs Features

Objects

summary(df[, 28:31])  #summary look at df object
   schoollow         readSS        mathSS           proflvl    
 Min.   :0.000   Min.   :252   Min.   :210   advanced   : 788  
 1st Qu.:0.000   1st Qu.:430   1st Qu.:418   basic      : 523  
 Median :0.000   Median :495   Median :480   below basic: 210  
 Mean   :0.242   Mean   :496   Mean   :483   proficient :1179  
 3rd Qu.:0.000   3rd Qu.:562   3rd Qu.:543                     
 Max.   :1.000   Max.   :833   Max.   :828                     
summary(df$readSS)  #summary of a single column
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    252     430     495     496     562     833 

- The $ says to look for the object readSS inside the object df
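
The $ operator is one of several equivalent ways to pull a column out of a data frame. Because the df object used above is not defined in this document, the sketch below uses the built-in mtcars data instead:

```r
data(mtcars)

by_dollar  <- mtcars$mpg        # column by name with $
by_bracket <- mtcars[["mpg"]]   # same column with [[ ]]
by_index   <- mtcars[, "mpg"]   # same column with [row, column] indexing

identical(by_dollar, by_bracket)  # TRUE
identical(by_dollar, by_index)    # TRUE

summary(mtcars$mpg)  # summary of a single column
```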

Graphics too

library(ggplot2) # Load graphics Package
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point(aes(size = cyl)) +  
      geom_smooth(aes(group=cyl), method="loess") 
print(p)

mtcars

Handling Data in R

length(unique(df$school))
[1] 173
length(unique(df$stuid))
[1] 1200
uniqstu <- length(unique(df$stuid))
uniqstu
[1] 1200

Special Operators

big <- c(9, 12, 15, 25)
small <- c(9, 3, 4, 2)
# Give us a nice vector of logical values
big > small
[1] FALSE  TRUE  TRUE  TRUE
big = small
# Oops--don't do this: = assigns the values of small to big; use == to compare
print(big)
[1] 9 3 4 2
print(small)
[1] 9 3 4 2

Special Operators II

big <- c(9, 12, 15, 25)
big[big == small]
[1] 9
# Returns values where the logical vector is true
big[big > small]
[1] 12 15 25
big[big < small]  # Returns an empty set
numeric(0)

Special operators (III)

big <- c(9, 12, 15, 25)
small <- c(9, 12, 15, 25, 9, 1, 3)
big[small %in% big]
[1]  9 12 15 25 NA
big[big %in% small]
[1]  9 12 15 25
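
The NA in the first result appears because small %in% big returns one logical value per element of small (seven here), and that seven-element index is then applied to the four-element big. It is usually safer to index the same vector that appears on the left of %in%. A short sketch:

```r
big <- c(9, 12, 15, 25)
small <- c(9, 12, 15, 25, 9, 1, 3)

small %in% big            # TRUE TRUE TRUE TRUE TRUE FALSE FALSE
small[small %in% big]     # 9 12 15 25 9: elements of small that occur in big
small[!(small %in% big)]  # 1 3: elements of small that do not occur in big
```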

Special operators (IV)

foo <- c("a", NA, 4, 9, 8.7)
!is.na(foo)  # Returns TRUE for non-NA
[1]  TRUE FALSE  TRUE  TRUE  TRUE
class(foo)
[1] "character"
a <- foo[!is.na(foo)]
a
[1] "a"   "4"   "9"   "8.7"
class(a)
[1] "character"

Special operators (V)

zap <- c(1, 4, 8, 2, 9, 11)
zap[zap > 2 | zap < 8]
[1]  1  4  8  2  9 11
zap[zap > 2 & zap < 8]
[1] 4

Regular Expressions

R Data Modes

Data Modes in R (numeric)

is.numeric(A)
[1] TRUE
class(A)
[1] "numeric"
print(A)
[1] 3 4

Data Modes (Character)

b <- c("one", "two", "three")
print(b)
[1] "one"   "two"   "three"
is.numeric(b)
[1] FALSE

Data Modes (Logical)

c <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)
is.numeric(c)
[1] FALSE
is.character(c)
[1] FALSE
is.logical(c)  # Results in a logical value
[1] TRUE

Easier way

class(A)
[1] "numeric"
class(b)
[1] "character"
class(c)
[1] "logical"

A Note on Vectors

Factor

myfac <- factor(c("basic", "proficient", "advanced", "minimal"))
class(myfac)
[1] "factor"
myfac  # What order are the factors in?
[1] basic      proficient advanced   minimal   
Levels: advanced basic minimal proficient

Ordering the Factor

myfac_o <- ordered(myfac, levels = c("minimal", "basic", "proficient", "advanced"))
myfac_o
[1] basic      proficient advanced   minimal   
Levels: minimal < basic < proficient < advanced
summary(myfac_o)
   minimal      basic proficient   advanced 
         1          1          1          1 

Reclassifying Factors

class(myfac_o)
[1] "ordered" "factor" 
unclass(myfac_o)
[1] 2 3 4 1
attr(,"levels")
[1] "minimal"    "basic"      "proficient" "advanced"  
defac <- unclass(myfac_o)
defac
[1] 2 3 4 1
attr(,"levels")
[1] "minimal"    "basic"      "proficient" "advanced"  

Defactor

defac <- function(x) {
    x <- as.character(x)
    x
}
defac(myfac_o)
[1] "basic"      "proficient" "advanced"   "minimal"   
defac <- defac(myfac_o)
defac
[1] "basic"      "proficient" "advanced"   "minimal"   

Convert to Numeric?

myfac_o
[1] basic      proficient advanced   minimal   
Levels: minimal < basic < proficient < advanced
as.numeric(myfac_o)
[1] 2 3 4 1
myfac
[1] basic      proficient advanced   minimal   
Levels: advanced basic minimal proficient
as.numeric(myfac)
[1] 2 4 1 3
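
As the output shows, as.numeric() on a factor returns the internal level codes, not the labels. That matters most when the labels themselves look like numbers. A minimal sketch, using a hypothetical factor f that is not part of the original examples:

```r
f <- factor(c("10", "5", "20"))
levels(f)                    # "10" "20" "5": sorted as text, not as numbers
as.numeric(f)                # 1 3 2: the level codes
as.numeric(as.character(f))  # 10 5 20: the values the labels represent
```

Converting through as.character() first is a common way to recover the numbers behind the labels.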

Dates

mydate <- as.Date("7/20/2012", format = "%m/%d/%Y")
# Input is a character string and a parser
class(mydate)  # this is date
[1] "Date"
weekdays(mydate)  # what day of the week is it?
[1] "Friday"
mydate + 30  # Operate on dates
[1] "2012-08-19"

More Dates

# We can parse other formats of dates
mydate2 <- as.Date("8-5-1988", format = "%d-%m-%Y")
mydate2
[1] "1988-05-08"

mydate - mydate2
Time difference of 8839 days
# We can subtract one Date from another (adding two Dates is not defined)

A few notes on dates

as.numeric(mydate)  # days since 1-1-1970
[1] 15541
as.Date(56, origin = "2013-4-29")  # we can set our own origin
[1] "2013-06-24"
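
A few further operations on Date objects, reusing mydate and mydate2 from the earlier examples (redefined here so the block runs on its own). This is only a sketch; a numeric format string is used because month and weekday names print according to the locale:

```r
mydate  <- as.Date("7/20/2012", format = "%m/%d/%Y")
mydate2 <- as.Date("8-5-1988", format = "%d-%m-%Y")

format(mydate, "%Y/%m/%d")  # "2012/07/20": a Date printed in a chosen format

# difftime() lets us choose the units of a difference
difftime(mydate, mydate2, units = "weeks")

# seq() can step through dates by day, week, month or year
next_months <- seq(mydate, by = "month", length.out = 3)
next_months  # "2012-07-20" "2012-08-20" "2012-09-20"
```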

Other Classes

b <- rnorm(5000)
c <- runif(5000)
a <- b + c
mymod <- lm(a ~ b)
class(mymod)
[1] "lm"

Why care so much about classes?
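
One reason is that many R functions are generic: they check the class of their argument and behave differently for each class. A sketch with simulated data (set.seed() and the smaller sample size are choices made here, not part of the example above):

```r
set.seed(123)
b <- rnorm(50)
a <- b + runif(50)
mymod <- lm(a ~ b)

class(a)      # "numeric"
class(mymod)  # "lm"

# summary() dispatches on class: quantiles and mean for a numeric vector,
# coefficient table and fit statistics for a fitted model
summary(a)
class(summary(mymod))  # "summary.lm"
inherits(mymod, "lm")  # TRUE
```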

Data Structures in R

Vectors

print(1)
[1] 1
# The [1] in brackets is the index of the first element printed on that line; this is a vector of length 1
print("This tutorial is awesome")
[1] "This tutorial is awesome"
# This is a vector of length 1 consisting of a single 'string of characters'

Vectors 2

print(LETTERS)
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
[18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
# This vector has 26 character elements
print(LETTERS[6])
[1] "F"
# The sixth element of this vector has length 1
length(LETTERS[6])
[1] 1
# The length of that element is a number with length 1

Matrices

mymat <- matrix(1:36, nrow = 6, ncol = 6)
rownames(mymat) <- LETTERS[1:6]
colnames(mymat) <- LETTERS[7:12]
class(mymat)
[1] "matrix"

Matrices II

rownames(mymat)
[1] "A" "B" "C" "D" "E" "F"
colnames(mymat)
[1] "G" "H" "I" "J" "K" "L"
mymat
  G  H  I  J  K  L
A 1  7 13 19 25 31
B 2  8 14 20 26 32
C 3  9 15 21 27 33
D 4 10 16 22 28 34
E 5 11 17 23 29 35
F 6 12 18 24 30 36

More Matrices

dim(mymat)  # We have 6 rows and 6 columns
[1] 6 6
myvec <- c(5, 3, 5, 6, 1, 2)
length(myvec)  # What happens when you do dim(myvec)?
[1] 6
newmat <- cbind(mymat, myvec)
newmat
  G  H  I  J  K  L myvec
A 1  7 13 19 25 31     5
B 2  8 14 20 26 32     3
C 3  9 15 21 27 33     5
D 4 10 16 22 28 34     6
E 5 11 17 23 29 35     1
F 6 12 18 24 30 36     2
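
To answer the question in the comment above: a plain vector has no dim attribute, so dim(myvec) returns NULL. The sketch below rebuilds the same matrix so that it runs on its own, and also shows a few ways to index it:

```r
mymat <- matrix(1:36, nrow = 6, ncol = 6,
                dimnames = list(LETTERS[1:6], LETTERS[7:12]))
myvec <- c(5, 3, 5, 6, 1, 2)

dim(myvec)       # NULL: vectors have a length, not dimensions
mymat["B", "H"]  # 8: element by row and column name
mymat[2, 2]      # 8: the same element by position
mymat["B", ]     # the whole row B as a named vector
rowSums(mymat)   # sum across each row
```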

Matrix Functions

foo.mat <- matrix(c(rnorm(100), runif(100), runif(100), rpois(100, 2)), ncol = 4)
head(foo.mat)
        [,1]   [,2]    [,3] [,4]
[1,]  0.1869 0.9402 0.53147    2
[2,] -0.3302 0.3644 0.04441    1
[3,] -1.1225 0.5714 0.38434    3
[4,]  1.6899 0.7589 0.03472    2
[5,] -0.2813 0.4407 0.16683    1
[6,] -2.2288 0.1938 0.17287    2
cor(foo.mat)
         [,1]    [,2]    [,3]     [,4]
[1,]  1.00000 0.03157 0.01530 -0.03358
[2,]  0.03157 1.00000 0.02744  0.06862
[3,]  0.01530 0.02744 1.00000  0.02234
[4,] -0.03358 0.06862 0.02234  1.00000

Converting Matrices

mycorr <- cor(foo.mat)
class(mycorr)
[1] "matrix"
mycorr2 <- as.data.frame(mycorr)
class(mycorr2)
[1] "data.frame"
mycorr2
        V1      V2      V3       V4
1  1.00000 0.03157 0.01530 -0.03358
2  0.03157 1.00000 0.02744  0.06862
3  0.01530 0.02744 1.00000  0.02234
4 -0.03358 0.06862 0.02234  1.00000

Arrays

myarray <- array(1:42, dim = c(7, 3, 2), dimnames = list(c("tiny", "small", 
    "medium", "medium-ish", "large", "big", "huge"), c("slow", "moderate", "fast"), 
    c("boring", "fun")))
class(myarray)
[1] "array"
dim(myarray)
[1] 7 3 2

Arrays II

dimnames(myarray)
[[1]]
[1] "tiny"       "small"      "medium"     "medium-ish" "large"     
[6] "big"        "huge"      

[[2]]
[1] "slow"     "moderate" "fast"    

[[3]]
[1] "boring" "fun"   
myarray
, , boring

           slow moderate fast
tiny          1        8   15
small         2        9   16
medium        3       10   17
medium-ish    4       11   18
large         5       12   19
big           6       13   20
huge          7       14   21

, , fun

           slow moderate fast
tiny         22       29   36
small        23       30   37
medium       24       31   38
medium-ish   25       32   39
large        26       33   40
big          27       34   41
huge         28       35   42
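
Array elements are addressed with one index per dimension, by name or by position. A sketch using the same array (rebuilt here so that the block runs on its own):

```r
myarray <- array(1:42, dim = c(7, 3, 2),
                 dimnames = list(c("tiny", "small", "medium", "medium-ish",
                                   "large", "big", "huge"),
                                 c("slow", "moderate", "fast"),
                                 c("boring", "fun")))

myarray["medium", "fast", "fun"]  # 38
myarray[3, 3, 2]                  # 38: the same element by position
myarray[, , "boring"]             # the first 7 x 3 slice, as a matrix
apply(myarray, 3, sum)            # total of each slice: boring 231, fun 672
```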

Lists

mylist <- list(vec = myvec, mat = mymat, arr = myarray, date = mydate)
class(mylist)
[1] "list"
length(mylist)
[1] 4
names(mylist)
[1] "vec"  "mat"  "arr"  "date"

Print a List

str(mylist)
List of 4
 $ vec : num [1:6] 5 3 5 6 1 2
 $ mat : int [1:6, 1:6] 1 2 3 4 5 6 7 8 9 10 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:6] "A" "B" "C" "D" ...
  .. ..$ : chr [1:6] "G" "H" "I" "J" ...
 $ arr : int [1:7, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
  ..- attr(*, "dimnames")=List of 3
  .. ..$ : chr [1:7] "tiny" "small" "medium" "medium-ish" ...
  .. ..$ : chr [1:3] "slow" "moderate" "fast"
  .. ..$ : chr [1:2] "boring" "fun"
 $ date: Date[1:1], format: "2012-07-20"

Lists (II)

mylist$vec
[1] 5 3 5 6 1 2
mylist[[2]][1, 3]
[1] 13
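
The difference between single and double brackets matters with lists: [ returns a shorter list, while [[ and $ return the element itself. A sketch with a small stand-in list (the name toy_list is hypothetical):

```r
toy_list <- list(vec = c(5, 3, 5, 6, 1, 2),
                 mat = matrix(1:36, nrow = 6, ncol = 6))

class(toy_list["vec"])    # "list": single brackets keep the list wrapper
class(toy_list[["vec"]])  # "numeric": double brackets extract the element
identical(toy_list$vec, toy_list[["vec"]])  # TRUE

toy_list[["mat"]][1, 3]   # 13: extract the matrix, then index into it
```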

So what?

attributes(mylist)
$names
[1] "vec"  "mat"  "arr"  "date"
attributes(myarray)[1:2][2]
$dimnames
$dimnames[[1]]
[1] "tiny"       "small"      "medium"     "medium-ish" "large"     
[6] "big"        "huge"      

$dimnames[[2]]
[1] "slow"     "moderate" "fast"    

$dimnames[[3]]
[1] "boring" "fun"   

Dataframes

str(df[, 25:32])
'data.frame':   2700 obs. of  8 variables:
 $ district  : int  3 3 3 3 3 3 3 3 3 3 ...
 $ schoolhigh: int  0 0 0 0 0 0 0 0 0 0 ...
 $ schoolavg : int  1 1 1 1 1 1 1 1 1 1 ...
 $ schoollow : int  0 0 0 0 0 0 0 0 0 0 ...
 $ readSS    : num  357 264 370 347 373 ...
 $ mathSS    : num  387 303 365 344 441 ...
 $ proflvl   : Factor w/ 4 levels "advanced","basic",..: 2 3 2 2 2 4 4 4 3 2 ...
 $ race      : Factor w/ 5 levels "A","B","H","I",..: 2 2 2 2 2 2 2 2 2 2 ...

Converting Between Types

Summing it Up

Other References

Books

Session Info

It is good practice to include the session info; for example, this document was produced with knitr version 1.5. Here is my session info:

print(sessionInfo(), locale = FALSE)
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sos_1.3-8            rmarkdown_0.1.90     markdown_0.6.7      
[4] knitrBootstrap_0.9.0 ggvis_0.1.0.99       ggplot2_0.9.3.1     
[7] foreign_0.8-61       brew_1.0-6           knitr_1.5           

loaded via a namespace (and not attached):
 [1] assertthat_0.1   bitops_1.0-6     caTools_1.17     codetools_0.2-8 
 [5] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_0.10.4  
 [9] grid_3.1.0       gtable_0.1.2     httpuv_1.3.0     labeling_0.2    
[13] MASS_7.3-31      munsell_0.4.2    plyr_1.8.1       proto_0.3-10    
[17] Rcpp_0.11.1      reshape2_1.4     RJSONIO_1.0-3    scales_0.2.4    
[21] shiny_0.9.1.9005 stringr_0.6.2    tools_3.1.0      whisker_0.3-2   
[25] xtable_1.7-3    

Reading Data, Loading Packages

Click for a brief introduction to R's data reading/importing facilities

Attribution and License

Public Domain Mark
This work (R Tutorial for Education, by Jared E. Knowles), in service of the Wisconsin Department of Public Instruction, is **free of known copyright restrictions**. Some pages were deleted or inserted by [I. Ozkan](http://yunus.hacettepe.edu.tr/~iozkan) to make it more suitable for economics students.