R Basics

Stat 205 refresher

Author: Dr. Irene Vrbik

Affiliation: University of British Columbia Okanagan

R basics

Calculations

Much of R works as you might expect. It can function as a basic calculator…

3 + 4

[1] 7

3 - 4

[1] -1

3/4

[1] 0.75

3^2

[1] 9

pi^2

[1] 9.869604

Creating variables

You can make objects and assign values…

x <- 3
y <- 4
x + y

[1] 7

Note that in R you can use <- or = as the assignment operator. Namely, we could have written:

x = 3
y = 4

R is case-sensitive, so asking for X when only x has been defined produces an error:

Error: object 'X' not found

And be careful about leaving out operators…

2(x+y)

Error: attempt to apply non-function

versus

2*(x+y)

[1] 14

Data Types

Numeric

The numeric data type in R represents continuous-valued variables (i.e. real numbers or decimal values). Example:

x <- 3.14
class(x)

[1] "numeric"

Note the use of class() to check the data type in R.

Integer

The integer data type in R represents discrete-valued variables (i.e. whole numbers without decimal points). Example:

y <- 42L #  L indicates integer
class(y)

[1] "integer"
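As a small illustration of the difference between the two types (the variable names a and b are made up for this sketch), note that a whole number typed without the L suffix is still stored as numeric:

```r
a <- 42    # no suffix: stored as a double, so its class is "numeric"
b <- 42L   # L suffix: stored as an integer
class(a)
class(b)
a == b          # the two values still compare as equal
is.integer(a)   # FALSE, even though 42 is a whole number
```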

Logical

The logical data type in R represents Boolean values (TRUE or FALSE). It is used for logical conditions and comparisons. Example:

ans <- x > y
ans

[1] FALSE

class(ans)

[1] "logical"

Character

The character data type in R represents text or strings. Example:

z <- "hello world!"; class(z) # ';' for statement separator

[1] "character"

Note the use of semicolon ; for statement separator.

Data Structures

Vectors

We can create a vector using the c() function (c stands for combine):

vec <- c(x,y,5,6,2,1)

Vectors can contain strings instead of numbers:

z = c("apples","bananas","oranges", "pineapples")

Note that all elements of a vector must have the same data type; when types are mixed, R converts (coerces) them to a common type. For example:

mixed <- c(3, "pillow", TRUE)
mixed # treats everything as characters

[1] "3"      "pillow" "TRUE"

eg2 <- c(3, TRUE)
eg2 # treats everything as numbers

[1] 3 1
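Coercion can also be requested explicitly. The following is a minimal sketch (the object names nums_as_text and nums are hypothetical) showing how a character vector of digits can be converted back to numbers with as.numeric():

```r
nums_as_text <- c("3", "4.5", "10")   # a character vector that happens to hold digits
class(nums_as_text)                   # "character"
nums <- as.numeric(nums_as_text)      # explicit conversion to numeric
sum(nums)                             # arithmetic now works: 17.5
as.numeric(c(TRUE, FALSE, TRUE))      # logicals become 1 and 0
```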

We can also make quick sequences…

j <- 5:300
j

  [1]   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22
 [19]  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40
 [37]  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58
 [55]  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76
 [73]  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94
 [91]  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112
[109] 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130
[127] 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148
[145] 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
[163] 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184
[181] 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202
[199] 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
[217] 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238
[235] 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256
[253] 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274
[271] 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292
[289] 293 294 295 296 297 298 299 300

We can check how many elements a vector has using the length() function:

length(j)

[1] 296
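The colon operator always steps by 1. For other step sizes, base R's seq() function is one option; the sketch below (with made-up object names by_fives and evenly) is only meant to show the idea:

```r
by_fives <- seq(5, 300, by = 5)        # 5, 10, 15, ..., 300
length(by_fives)                       # 60 elements
head(by_fives)                         # first six values
evenly <- seq(0, 1, length.out = 5)    # five equally spaced values from 0 to 1
evenly
```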

To index an element from a vector, use single square brackets:

vec[3]

[1] 5

To index multiple elements from a vector we could use:

z[2:3]    # extracts the second to third element (inclusive)

[1] "bananas" "oranges"

z[c(4,2)] # extracts the fourth and second element (in that order)

[1] "pineapples" "bananas"

z[-1]     # extracts every element *except* for the first

[1] "bananas"    "oranges"    "pineapples"

We can also extract elements of our vector that fulfill a certain condition. For example, let’s extract all of the elements of vec that are larger than 3:

vec

[1]  3.14 42.00  5.00  6.00  2.00  1.00

vec[vec>3]

[1]  3.14 42.00  5.00  6.00

Note that vec>3 is a logical vector:

vec > 3

[1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE

Here the logical vector is used to pick out all the elements for which the check returns TRUE. In this case, we are extracting elements 1, 2, 3, and 4. To obtain those indices in R we could use the which() function:

which(vec>3)

[1] 1 2 3 4
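To make the link between logical vectors and indexing concrete, here is a small self-contained sketch. The names scores_demo and keep are invented for illustration; the values are the same as in vec.

```r
scores_demo <- c(3.14, 42, 5, 6, 2, 1)
keep <- scores_demo > 3                  # logical vector, one TRUE/FALSE per element
# indexing by the logical vector or by the positions from which() gives the same result
identical(scores_demo[keep], scores_demo[which(keep)])
sum(keep)     # TRUE counts as 1, so this counts the elements larger than 3
mean(keep)    # proportion of elements larger than 3
scores_demo[scores_demo > 3 & scores_demo < 10]   # conditions can be combined with &
```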

Checks can also be done on strings. For example,

z[z=="apples"]

[1] "apples"

z[grepl("apples", z)]  # checks for any element that has the word "apples" in it

[1] "apples"     "pineapples"

There is almost always more than one way of accomplishing a task. For example, finding the average of our vector vec

# tedious/long/hard
(3.14 + 42 + 5 + 6 + 2 + 1) / 6

[1] 9.856667

# better
sum(vec) / 6

[1] 9.856667

# Generalizable
sum(vec) / length(vec)

[1] 9.856667

# easiest
mean(vec)

[1] 9.856667

Notice the use of # to indicate the start of a comment.
For another example, how about the standard deviation of vec?

#"Hard" sample std dev
sqrt((sum( vec ^ 2 - mean(vec) ^ 2)) / (6 - 1))

[1] 15.8552

#Easy
sd(vec)

[1] 15.8552

But without having manually calculated the sample standard deviation beforehand, how could we be confident that the “sd” function is using the unbiased divisor “(n - 1)”? Type the following in your console (bottom left window under default RStudio settings), then hit enter:

?sd

You should see a help document pop up in the bottom right window (again, default RStudio settings). If you read through the Details section of that document, it will specify the denominator it is using.

Caution: R is open-source, and even packages hosted on the official CRAN repository will vary in terms of the actual helpfulness of the help files.
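When the help file is unclear, a quick numerical check is another way to see which divisor a function uses. The sketch below is illustrative only (the names v, ss, sd_n_minus_1 and sd_n are made up); it compares sd() against both candidate formulas:

```r
v <- c(3.14, 42, 5, 6, 2, 1)          # same values as vec
n <- length(v)
ss <- sum((v - mean(v))^2)            # sum of squared deviations from the mean
sd_n_minus_1 <- sqrt(ss / (n - 1))    # divisor n - 1
sd_n         <- sqrt(ss / n)          # divisor n
isTRUE(all.equal(sd(v), sd_n_minus_1))   # TRUE: sd() matches the n - 1 version
isTRUE(all.equal(sd(v), sd_n))           # FALSE
```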

Matrices

By default, matrices are constructed column-wise.

(m <- matrix(1:6, nrow=2, ncol =3)) # fill columnwise

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

You can always change to row-wise by specifying the argument byrow=TRUE

(m <- matrix(1:6, nrow=2, ncol =3, byrow=TRUE)) #fill row-wise

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

The following extracts a column, a row, and a cell from matrix m:

# extract the third column of m
m[,3]

[1] 3 6

# extract the first row of m
m[1,]

[1] 1 2 3

# extract the element in the 1st row and 3rd column
m[1,3]

[1] 3
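A few related base R operations are sketched below on a copy of the same matrix (m_demo, col3 and col3_mat are hypothetical names). Note in particular that extracting a single column returns a plain vector unless drop = FALSE is supplied:

```r
m_demo <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
dim(m_demo)                              # number of rows and columns: 2 3
col3 <- m_demo[, 3]                      # a plain vector
col3_mat <- m_demo[, 3, drop = FALSE]    # keeps the result as a 2 x 1 matrix
dim(col3_mat)
t(m_demo)                                # transpose: 3 rows, 2 columns
```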

Data frames

Data frames are perhaps the most widely used data structure in statistics. They can store different data types in different columns and can carry additional attributes such as column and row names. When combining vectors into a data frame, all must have the same length. Missing observations are recorded as NA.

Here is an example of how to create a data frame and add to it:

Person=c('John', 'Jill', 'Jack')
Grade=c(45, 92, 91) # numeric grades (no quotes), so comparisons such as >= work on numbers
(Lab=data.frame(Person, Grade))

  Person Grade
1   John    45
2   Jill    92
3   Jack    91

To extract the first column, we can either reference it by number:

Lab[,1]

[1] "John" "Jill" "Jack"

OR we can reference it by name:

Lab[,"Person"]

[1] "John" "Jill" "Jack"

A third option is to extract a column of this data frame using the $ operator:

Lab$Person

[1] "John" "Jill" "Jack"

Adding a column to the data frame is as easy as typing:

Lab$Passed=c(FALSE,TRUE,TRUE)
Lab

  Person Grade Passed
1   John    45  FALSE
2   Jill    92   TRUE
3   Jack    91   TRUE

Alternatively, we could have let R figure out whether the student passed or not based on the grade and typed:

Lab$Passed = Lab$Grade >= 50
Lab

  Person Grade Passed
1   John    45  FALSE
2   Jill    92   TRUE
3   Jack    91   TRUE
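Column types matter for comparisons such as Grade >= 50: if grades are stored as text, R compares them as strings rather than as numbers. The following sketch (lab_demo is a made-up name) checks the column types and then filters rows with a logical column:

```r
lab_demo <- data.frame(
  Person = c("John", "Jill", "Jack"),
  Grade  = c(45, 92, 91)              # numeric, no quotes
)
str(lab_demo)                          # shows the type of each column
"100" >= "50"                          # FALSE: compared as text, character by character
100 >= 50                              # TRUE: compared as numbers
lab_demo$Passed <- lab_demo$Grade >= 50
passed <- lab_demo[lab_demo$Passed, ]  # keep only the rows where Passed is TRUE
passed
```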

Importing data into R

Using R data sets

Many data sets are available in “base” R and a larger collection can be accessed via R packages. For example, if we want to gain access to the Old Faithful data frame, we could type:

data("faithful")

To learn more on this data set type ?faithful into the R console. To see the first few rows of this data set type:

head(faithful)

  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

Another cool feature is we can get a summary of each column using:

summary(faithful)

   eruptions        waiting    
 Min.   :1.600   Min.   :43.0  
 1st Qu.:2.163   1st Qu.:58.0  
 Median :4.000   Median :76.0  
 Mean   :3.488   Mean   :70.9  
 3rd Qu.:4.454   3rd Qu.:82.0  
 Max.   :5.100   Max.   :96.0

As before, we can extract columns by name using the dollar sign and perform operations on those vectors, for example:

mean(faithful$eruptions) # to give the average eruption time

[1] 3.487783

Rather than calling the columns from the data frame using $, you could instead “attach” the data set (see ?attach). This essentially means the columns can be called directly by name. For example, the following code will produce an error, since no object called eruptions exists yet:

mean(eruptions) # will produce an Error if we use before calling attach(faithful)

However, once we attach the faithful data set, R can find the vectors eruptions and waiting directly:

attach(faithful)
mean(eruptions)

[1] 3.487783
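If you would rather not leave a data set attached, base R's with() function is one alternative: it evaluates a single expression using the columns of a data frame. A minimal sketch (avg_eruption and r_ew are hypothetical names):

```r
# evaluate an expression using the columns of faithful, without attaching it
avg_eruption <- with(faithful, mean(eruptions))
avg_eruption
r_ew <- with(faithful, cor(eruptions, waiting))   # correlation between the two columns
r_ew
```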

Importing from csv

In a practical setting, we will most likely be required to load our own data into R at some point. The easiest way to do this is by saving your data in csv (comma separated values) format and using the read.csv() function; see ?read.csv for more details. Assuming your data is stored in your working directory (more on this in a second), you can load your csv file into R using the read.csv function:

read.csv("name_of_file.csv")

Typically, we would like to save the data set to some object that we can perform actions on. For example, I can load a simple data matrix (stored in file datamatrix.csv) and save it to an object called dat:

dat <- read.csv("data/datamatrix.csv")
dat

The above code assumes your data is stored in a folder called data inside your working directory. Read on to see how to set your working directory in R. This data set is available on Canvas for you to test.

Alternatively, we can pull a csv file straight from the web:

penguins <- read.csv("https://irene.vrbik.ok.ubc.ca/data/penguins.csv")
head(penguins)

  X species    island bill_length_mm bill_depth_mm flipper_length_mm
1 1  Adelie Torgersen           39.1          18.7               181
2 2  Adelie Torgersen           39.5          17.4               186
3 3  Adelie Torgersen           40.3          18.0               195
4 4  Adelie Torgersen             NA            NA                NA
5 5  Adelie Torgersen           36.7          19.3               193
6 6  Adelie Torgersen           39.3          20.6               190
  body_mass_g    sex year
1        3750   male 2007
2        3800 female 2007
3        3250 female 2007
4          NA   <NA> 2007
5        3450 female 2007
6        3650   male 2007
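To experiment with read.csv() without any external file, one option is to write a small data frame to a temporary csv and read it back. Everything in this sketch (toy, csv_path, toy_back) is invented for illustration. An extra column named X, like the one in the penguins output, typically appears when row names were written to the file; row.names = FALSE avoids that.

```r
toy <- data.frame(id = 1:3, height = c(1.62, 1.75, NA))
csv_path <- tempfile(fileext = ".csv")        # throwaway file location
write.csv(toy, csv_path, row.names = FALSE)   # do not write row numbers as a column
toy_back <- read.csv(csv_path)
toy_back                                      # the missing height is read back as NA
str(toy_back)
unlink(csv_path)                              # delete the temporary file
```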

Setting your working directory

Independent of R, it is usually a good idea to put school projects into some organized folder system. For instance, in my Documents folder on my Mac I currently have the following filing organization:

data311/
|-- labs/
|   |-- Lab00.Rmd
|   |-- practice00.Rmd
|   |-- datamatrix.csv
|-- assignments/
|   |-- template.Rmd
|   |-- example.csv

Suppose I am working on this lab and I want to read in the datamatrix.csv file. I could access the file using the whole path. For example:

dat <- read.csv("/Users/ivrbik/Documents/data311/labs/datamatrix.csv")

but I don’t recommend this (and in fact this won’t work within Rmd documents; more on this below).

Alternatively (and preferably) you should save this into your working directory. The working directory is just a file path on your computer that sets the default location of any files you read (or write out) into R. To set your working directory use setwd. Following from the file organization example above, if I want my working directory to be the labs folder, I would need to specify that path as the sole argument in the setwd function:

setwd("/Users/ivrbik/Documents/data311/labs/")

Once you set your working directory to the same folder in which your data file is stored you can reference it with no path.

dat <- read.csv("datamatrix.csv")

N.B. You can check what your working directory is using getwd(). If you are working within an R Markdown document, your working directory is the folder that contains the Rmd file (more on this below).
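The following sketch shows how relative paths resolve against the working directory. It uses R's temporary folder and a made-up file called demo.csv so that it can be run anywhere; substitute your own folder in practice.

```r
old_wd <- getwd()                      # remember where we started
setwd(tempdir())                       # switch to a scratch folder for this demo
writeLines("a,b\n1,2", "demo.csv")     # create a tiny csv in the working directory
found <- file.exists("demo.csv")       # relative paths are looked up from getwd()
found
demo <- read.csv("demo.csv")           # no folder path needed
demo
file.remove("demo.csv")                # clean up
setwd(old_wd)                          # restore the original working directory
```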

R packages

Most standard statistical analyses are built-in, but a vast array of more complex analyses are available. We will often require installation of additional packages to complete assignments and labs. For example, if you want to install the ISLR2 package (the package associated with our textbook), type:

install.packages("ISLR2")

The above needs to be done only once. With every new R session/script/Rmd file, if we want access to the contents of this package, we first need to load or “attach” it using:

library("ISLR2")

By loading this library into our session, we can now access any functions and data sets within this package. The manual and details can be found on the CRAN website: https://cran.rstudio.com/web/packages/ISLR2/index.html
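Because installation is a one-time step while loading happens every session, it can be handy to check whether a package is installed before loading it. The helper below, load_if_installed, is a hypothetical name for a sketch built on base R's requireNamespace(); it is demonstrated on stats, which ships with R.

```r
# returns TRUE if the package could be loaded, FALSE (with a message) otherwise
load_if_installed <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    library(pkg, character.only = TRUE)   # pkg holds the name as a string
    TRUE
  } else {
    message("Package '", pkg, "' is not installed; run install.packages() once first.")
    FALSE
  }
}
load_if_installed("stats")   # stats is part of every R installation
```

The same call with "ISLR2" would return FALSE with a reminder if the package has not been installed yet.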

Writing Functions - basics

Let’s learn the structure of writing our own functions in R. To learn the syntax, we’ll simply make a function that adds 10 to any inputted value.

shift10 <- function(x){
  newx <- x + 10 
  return(newx)
}

Note that the value passed to return() is the function’s output; if return() is omitted, R returns the value of the last evaluated expression. Let’s test it:

shift10(40)

[1] 50
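Functions can also take more than one argument, and arguments can have default values. As a sketch of the syntax (shift_by and its argument by are names made up for this example):

```r
shift_by <- function(x, by = 10) {
  x + by   # the last evaluated expression is returned, so return() is optional
}
shift_by(40)            # uses the default shift of 10
shift_by(40, by = 5)    # overrides the default
shift_by(c(1, 2, 3))    # works element-wise on vectors
```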

Example

In many of the ‘artistic’ olympic sports, the judging panel is comprised of several countries, and any participant’s score is averaged (or totaled) after removing both the minimum and maximum values. Let’s write a function in R which performs this type of averaging for any inputted vector of numeric values. Call it olymean.

# x = vector of scores
olymean <- function(x){
  trimmed_x <- sort(x)[-c(1, length(x))] # removes the smallest and largest number from x
  mean(trimmed_x) 
}

Let’s test out this function on the fictitious scores in the cheer.csv file, which contains the scores for two cheerleading teams (Navarro and Trinity Valley) across five judges. First let’s read in the data set and view it:

scores <- read.csv("data/cheer.csv")
scores

  Judge Navarro Trinity.Valley
1     1     9.8            9.9
2     2     9.6            9.9
3     3     9.7            9.8
4     4     9.7            9.7
5     5     9.9            9.5

N.B. spaces are automatically replaced by . in column names since R does not allow spaces in variable names.

mean(scores$Navarro) # overall average

[1] 9.74

olymean(scores$Navarro) # average of the middle three scores

[1] 9.733333

mean(scores$Trinity.Valley)

[1] 9.76

olymean(scores$Trinity.Valley)

[1] 9.8

What happens if your input data includes missing values? While these are not scores, let’s try and apply this function to the Ozone values in the airquality data set which indeed contain NAs (i.e. missing values)

airquality$Ozone

  [1]  41  36  12  18  NA  28  23  19   8  NA   7  16  11  14  18  14  34   6
 [19]  30  11   1  11   4  32  NA  NA  NA  23  45 115  37  NA  NA  NA  NA  NA
 [37]  NA  29  NA  71  39  NA  NA  23  NA  NA  21  37  20  12  13  NA  NA  NA
 [55]  NA  NA  NA  NA  NA  NA  NA 135  49  32  NA  64  40  77  97  97  85  NA
 [73]  10  27  NA   7  48  35  61  79  63  16  NA  NA  80 108  20  52  82  50
 [91]  64  59  39   9  16  78  35  66 122  89 110  NA  NA  44  28  65  NA  22
[109]  59  23  31  44  21   9  NA  45 168  73  NA  76 118  84  85  96  78  73
[127]  91  47  32  20  23  21  24  44  21  28   9  13  46  18  13  24  16  13
[145]  23  36   7  14  30  NA  14  18  20

By default, we get NA (rather than a number) when we try to calculate the average of these values:

mean(airquality$Ozone)

[1] NA

If we look at the documentation for the mean function (?mean), we see that there is an argument na.rm that tells R whether NA values should be removed before the computation proceeds. By default this argument is set to FALSE. If we change this to TRUE, R will calculate the average with the NAs removed:

mean(airquality$Ozone, na.rm = TRUE)

[1] 42.12931

Interestingly, our olymean function runs without complaint and returns a number:

olymean(airquality$Ozone)

[1] 42.48696

This is because the sort function removes NAs by default (see ?sort):

sort(airquality$Ozone)

  [1]   1   4   6   7   7   7   8   9   9   9  10  11  11  11  12  12  13  13
 [19]  13  13  14  14  14  14  16  16  16  16  18  18  18  18  19  20  20  20
 [37]  20  21  21  21  21  22  23  23  23  23  23  23  24  24  27  28  28  28
 [55]  29  30  30  31  32  32  32  34  35  35  36  36  37  37  39  39  40  41
 [73]  44  44  44  45  45  46  47  48  49  50  52  59  59  61  63  64  64  65
 [91]  66  71  73  73  76  77  78  78  79  80  82  84  85  85  89  91  96  97
[109]  97 108 110 115 118 122 135 168

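Since sort drops the NAs before mean is called, this runs without error. Be careful, though: the function does not behave as intended here. The index length(x) counts the NAs in the original vector (153 elements), whereas the sorted vector has only 116 elements, and R silently ignores a negative index that is out of range. The result above is therefore the mean after removing the missing values and the smallest number, but not the largest (168).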
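A possible variant, sketched here under the hypothetical name olymean_safe, indexes by the length of the sorted vector so that both extremes are dropped even when missing values are present. The guard for very short inputs is an added assumption, not part of the original exercise.

```r
# x = vector of scores, possibly containing NAs
olymean_safe <- function(x) {
  sorted_x <- sort(x)                              # sort() drops NAs by default
  if (length(sorted_x) < 3) {
    stop("need at least three non-missing values")
  }
  trimmed_x <- sorted_x[-c(1, length(sorted_x))]   # index by the sorted length
  mean(trimmed_x)
}
olymean_safe(c(NA, 1, 2, 3, 10))   # drops NA, 1 and 10; the mean of 2 and 3 is 2.5
```

On the same input, the original olymean would keep the 10 and return 5, since length(x) is 5 while the sorted vector has only 4 elements.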