3 + 4
[1] 7
3 - 4
[1] -1
3/4
[1] 0.75
3^2
[1] 9
pi^2
[1] 9.869604
Stat 205 refresher
Dr. Irene Vrbik
University of British Columbia Okanagan
Much of R works as you might expect. It can function as a basic calculator…
You can make objects and assign values…
Note that in R you can use <- or = as the assignment operator. Namely, we could have written:
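As a minimal sketch of the two assignment forms (the object names x and y are illustrative and not taken from the lab):

```r
# Assign the value 5 using the arrow operator
x <- 5

# The same kind of assignment using the equals sign
y = 5

# Both objects now hold the same value
identical(x, y)
```

Both forms work at the top level of a script; many R style guides prefer <- for assignment and reserve = for function arguments.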
R is case-sensitive
And be careful about leaving out operators…
versus
The numeric data type in R represents continuous-valued variables (i.e. real numbers or decimal values). Example:
Note the use of class() to check the data type in R.
The integer data type in R represents discrete-valued variables (i.e. whole numbers without decimal points). Example:
The logical data type in R represents logical, i.e. Boolean, values (TRUE or FALSE). It is used for logical conditions and comparisons. Example:
The character data type in R represents text or strings. Example:
Note the use of the semicolon ; as a statement separator.
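The four data types described above can be checked with class(). The following is a small sketch; the object names and values are illustrative assumptions rather than the lab's original examples:

```r
# Numeric: real (decimal) values; this is R's default type for numbers
a <- 3.14
class(a)

# Integer: the L suffix asks R to store a whole number as an integer
b <- 7L
class(b)

# Logical: TRUE or FALSE, often the result of a comparison
d <- a > 3
class(d)

# Character: text wrapped in quotes; the semicolon separates two statements
e <- "hello"; class(e)
```

Note that a whole number typed without the L suffix (e.g. 7) is stored as numeric, not integer.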
We can create a vector using c() (for combine):
Vectors can contain strings instead of numbers:
Note that all elements of a vector must have the same data type. If they do not, R silently converts (coerces) them to a common type. For example:
[1] "3" "pillow" "TRUE"
[1] 3 1
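Calls along the following lines reproduce the two outputs above and show which type R settles on. This is a sketch; the object names mixed and nums are illustrative:

```r
# Mixing a number, a string, and a logical: everything becomes character
mixed <- c(3, "pillow", TRUE)
mixed
class(mixed)

# Mixing a number and a logical: TRUE is converted to the number 1
nums <- c(3, TRUE)
nums
class(nums)
```

The coercion happens without any warning, so it is worth checking class() whenever a vector is built from mixed inputs.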
We can also make quick sequences…
[1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
[19] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
[37] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
[55] 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
[73] 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
[91] 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
[109] 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130
[127] 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148
[145] 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166
[163] 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184
[181] 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202
[199] 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
[217] 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238
[235] 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256
[253] 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274
[271] 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292
[289] 293 294 295 296 297 298 299 300
We can check how many elements a vector has using the length() function:
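For instance, the sequence from 5 to 300 printed above can be built with the colon operator and measured with length(). The object name s and the seq() call are illustrative additions:

```r
# Build the integers 5, 6, ..., 300 with the colon operator
s <- 5:300

# Count the elements: 300 - 5 + 1
length(s)   # 296

# seq() gives finer control, e.g. a custom step size
quarters <- seq(0, 1, by = 0.25)
length(quarters)   # 5
```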
To index an element from a vector, use single square brackets
To index multiple elements from a vector we could use:
[1] "bananas" "oranges"
[1] "pineapples" "bananas"
[1] "bananas" "oranges" "pineapples"
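The three outputs above are consistent with indexing calls like those sketched below. The vector name fruit and its exact contents are assumptions made for illustration:

```r
# A hypothetical character vector of fruit names
fruit <- c("apples", "bananas", "oranges", "pineapples")

# A single element by position
fruit[2]

# A range of consecutive positions
fruit[2:3]

# Arbitrary positions, in any order
fruit[c(4, 2)]

# A negative index drops that position and keeps the rest
fruit[-1]
```

R indexes from 1, so fruit[1] is the first element.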
We can also extract elements of our vector that fulfill a certain condition. For example let’s extract all of the elements that are larger than 3 from vec
Note that vec>3 is a logical vector:
Here the logical vector is used as an index: R keeps every element for which the check returns TRUE. In this case, we are extracting elements 1, 2, 3, 4. To obtain those positions themselves, we could use the which() function:
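A short sketch of logical indexing and which(). The values in vec below are hypothetical, chosen only so that the first four elements exceed 3; they are not the lab's actual data:

```r
# Hypothetical vector whose first four elements are larger than 3
vec <- c(10, 4, 7, 5, 1, 2)

# The comparison returns one TRUE/FALSE per element
vec > 3

# Using that logical vector as an index keeps the TRUE positions
vec[vec > 3]

# which() returns the positions where the condition holds
which(vec > 3)
```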
Checks can also be done on strings. For example,
[1] "apples"
[1] "apples" "pineapples"
There is almost always more than one way of accomplishing a task. For example, finding the average of our vector vec
[1] 3.5
[1] 9.856667
[1] 9.856667
[1] 9.856667
Notice the use of # to indicate the start of a comment.
For another example, how about the standard deviation of vec?
[1] 15.8552
[1] 15.8552
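One way to compute the mean and the sample standard deviation by hand and compare them with the built-in functions is sketched here. The data in x are made up for illustration and are not the lab's vec:

```r
# Illustrative data (not the vector used in the lab)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Mean: built-in function versus sum divided by count
mean(x)
sum(x) / n

# Sample standard deviation using the (n - 1) divisor
sd_manual <- sqrt(sum((x - mean(x))^2) / (n - 1))
sd_manual

# Compare with the built-in sd(), allowing for floating-point error
all.equal(sd_manual, sd(x))
```

all.equal() is used instead of == because two computations of the same quantity can differ in the last few binary digits.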
But without having manually calculated the sample standard deviation beforehand, how could we be confident that the “sd” function is using the unbiased divisor “(n - 1)”? Type the following in your console (bottom left window under default RStudio settings), then hit enter:
You should see a help document pop up in the bottom right window (again, under default RStudio settings). If you read through the Details section of that document, it will specify the denominator it is using.
Caution: R is open-source, and even packages hosted on the official CRAN repository will vary in terms of the actual helpfulness of the help files.
By default, matrices are constructed column-wise.
You can always change to row-wise by specifying the argument byrow=TRUE
The following extracts a column, row, and cell from matrix m:
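A compact sketch of matrix construction and extraction. The entries 1 to 6 and the name m_row are illustrative choices, and this m is not necessarily the matrix used in the lab:

```r
# Default: filled column-wise, so the columns are (1,2), (3,4), (5,6)
m <- matrix(1:6, nrow = 2)
m

# byrow = TRUE fills row-wise, so the rows are (1,2,3) and (4,5,6)
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)
m_row

# Extract the second column, the first row, and the cell in row 2, column 3
m[, 2]
m[1, ]
m[2, 3]
```

Leaving the row or column position blank inside the brackets means "take all of them".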
Data frames are perhaps the most commonly used data structure in statistics. They can store different data types and can contain additional attributes such as column and row names. When combining vectors in a data frame, all must have the same size (i.e. length). Missing observations will be recorded as NA.
Here is an example of how to create a data frame and add to it:
Person Grade
1 John 45
2 Jill 92
3 Jack 91
To extract the first column, we can either reference it by number:
OR we can reference it by name:
A third option is to extract a column of this data frame using the $ operator:
Adding a column to the data frame is as easy as typing:
Alternatively, we could have let R figure out whether the student passed or not based on the grade and typed:
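The data frame steps above can be sketched as follows. The object name df, the new column name Pass, and the passing threshold of 50 are assumptions made for illustration:

```r
# Build the data frame shown above from two equal-length vectors
df <- data.frame(Person = c("John", "Jill", "Jack"),
                 Grade  = c(45, 92, 91))

# Three equivalent ways to pull out the first column
df[, 1]
df[, "Person"]
df$Person

# Let R decide pass/fail from the grade (50 is an assumed cutoff)
df$Pass <- ifelse(df$Grade >= 50, "yes", "no")
df
```

ifelse() is vectorised: it tests every grade at once and returns one result per row.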
Many data sets are available in “base” R and a larger collection can be accessed via R packages. For example, if we want to gain access to the Old Faithful data frame, we could type:
To learn more about this data set, type ?faithful into the R console. To see the first few rows of this data set type:
Another cool feature is we can get a summary of each column using:
eruptions waiting
Min. :1.600 Min. :43.0
1st Qu.:2.163 1st Qu.:58.0
Median :4.000 Median :76.0
Mean :3.488 Mean :70.9
3rd Qu.:4.454 3rd Qu.:82.0
Max. :5.100 Max. :96.0
As before, we can extract columns by name using the dollar sign and perform operations on those vectors, for example:
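A brief sketch of this kind of column operation, using the built-in faithful data. The particular calculations and the cutoff of 70 minutes are illustrative choices:

```r
# Load the built-in Old Faithful data and look at the first rows
data(faithful)
head(faithful)

# Operate on a single column pulled out with $
mean(faithful$eruptions)
range(faithful$waiting)

# Count eruptions that followed a wait longer than 70 minutes
n_long_wait <- sum(faithful$waiting > 70)
n_long_wait
```

The mean of the eruptions column agrees with the value reported in the summary table.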
Rather than calling the columns from the data frame using $, you could instead “attach” the data set (see ?attach). This essentially means the columns become available as variables that you can call directly. For example, the following code will produce an error, since no object called eruptions exists yet:
However, once we attach the faithful data set, R will have vectors called eruptions and waiting:
In a practical setting, we will most likely be required to load our own data into R at some point. The easiest way to do this is by saving your data in csv (comma separated values) format and using the read.csv() function; see ?read.csv for more details. Assuming your data is stored in your working directory (more on this in a second), you can load your csv file into R by typing read.csv("name_of_file.csv").
Typically, we would like to save the data set to some object that we can perform actions on. For example, I can load a simple data matrix (stored in the file datamatrix.csv) and save it to an object called dat:
The above code assumes your data is stored in your working directory. Read on to see how to specify this in R. This data set is available on Canvas for you to test.
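If you do not have the course file at hand, the read.csv() workflow can still be tried on a throwaway file. This sketch writes a small csv to a temporary location first; the column names and values are made up, and the object is called dat_demo so as not to be confused with the lab's dat:

```r
# Create a small csv file in a temporary location (illustrative data)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c(2.5, 3.5, 4.5)),
          tmp, row.names = FALSE)

# Read it back in and save the result to an object
dat_demo <- read.csv(tmp)

# Inspect the structure: a data frame with 3 rows and 2 columns
str(dat_demo)
```

row.names = FALSE stops write.csv() from adding an extra column of row numbers, which would otherwise be read back in as a column called X.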
Alternatively, we can pull a csv file straight from the web:
X species island bill_length_mm bill_depth_mm flipper_length_mm
1 1 Adelie Torgersen 39.1 18.7 181
2 2 Adelie Torgersen 39.5 17.4 186
3 3 Adelie Torgersen 40.3 18.0 195
4 4 Adelie Torgersen NA NA NA
5 5 Adelie Torgersen 36.7 19.3 193
6 6 Adelie Torgersen 39.3 20.6 190
body_mass_g sex year
1 3750 male 2007
2 3800 female 2007
3 3250 female 2007
4 NA <NA> 2007
5 3450 female 2007
6 3650 male 2007
Independent of R, it is usually a good idea to put school projects into some organized folder system. For instance, in my Documents folder on my Mac I currently have the following filing organization:
data311/
|-- labs/
| |-- Lab00.Rmd
| |-- practice00.Rmd
| |-- datamatrix.csv
|-- assignments/
| |-- template.Rmd
| |-- example.csv
Suppose I am working on this lab and I want to read in the datamatrix.csv file. I could access the file using the whole path. For example:
but I don’t recommend this (and in fact this won’t work within Rmd documents; more on this below).
Alternatively (and preferably) you should save this into your working directory. The working directory is just a file path on your computer that sets the default location of any files you read into (or write out of) R. To set your working directory use setwd. Following from the file organization example above, if I want my working directory to be the labs folder, I would need to specify that path as the sole argument in the setwd function:
Once you set your working directory to the same folder in which your data file is stored you can reference it with no path.
N.B. You can check what your working directory is using getwd(). If you are working within an R Markdown document, your working directory is the folder that contains the Rmd file (more on this below).
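The interplay of getwd(), setwd(), and path-free file names can be sketched without touching your own folders by using R's temporary directory. The file name example_wd.csv is hypothetical:

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# Switch to a temporary folder (stand-in for e.g. the labs folder)
setwd(tempdir())

# Files are now written and read relative to that folder, no path needed
write.csv(data.frame(id = 1:2), "example_wd.csv", row.names = FALSE)
from_wd <- read.csv("example_wd.csv")
from_wd

# Restore the original working directory
setwd(old_wd)
getwd()
```

Restoring the previous directory at the end keeps the rest of the session unaffected.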
Most standard statistical analyses are built-in, but a vast array of more complex analyses are available through add-on packages. We will often require installation of additional packages to complete assignments and labs. For example, if you want to install the ISLR2 package (the package associated with our textbook) type:
The above needs to be done only once. With every new R session/script/Rmd file, if we want to gain access to the contents of this package, we first need to load or “attach” it using:
By loading this library into our session, we can now access any functions and data sets within this package. The manual and details can be found on the CRAN website: https://cran.rstudio.com/web/packages/ISLR2/index.html
Let’s learn the structure of writing our own functions in R. To learn the syntax, we’ll simply make a function that adds 10 to any inputted value.
Note that the final line in the function will be the outputted value. Let’s test it:
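A minimal sketch of such a function. The name add10 is an assumption, since the lab's own function name is not shown here:

```r
# Define a function with one argument, x
add10 <- function(x) {
  # The last evaluated expression is what the function returns
  x + 10
}

# Test it on a single number
add10(5)   # 15

# Because + is vectorised, it also works on a whole vector
add10(c(1, 2, 3))   # 11 12 13
```

An explicit return(x + 10) would behave the same way; relying on the final line is simply the more common R idiom.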
In many of the ‘artistic’ Olympic sports, the judging panel is made up of judges from several countries, and a participant’s score is averaged (or totaled) after removing both the minimum and maximum values. Let’s write a function in R which performs this type of averaging for any inputted vector of numeric values. Call it olymean.
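One possible way to write olymean is sketched below. It is built on sort(), which is one reasonable approach but not the only one, and the five test scores are illustrative:

```r
# Average a numeric vector after dropping its smallest and largest values
olymean <- function(x) {
  sorted <- sort(x)          # ascending order; sort() drops NA by default
  n <- length(sorted)
  mean(sorted[-c(1, n)])     # remove the first (min) and last (max) entries
}

# Five judges' scores: the 9.6 and the 9.9 are discarded
scores <- c(9.8, 9.6, 9.7, 9.7, 9.9)
olymean(scores)
```

If the minimum or maximum is tied, only one copy of it is removed. With fewer than three values nothing is left to average and the result is NaN.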
Let’s test out this function on the fictitious scores in the cheer.csv file, which contains the scores for two cheerleading teams (Navarro and Trinity Valley) across five judges. First let’s read in the data set and view it:
Judge Navarro Trinity.Valley
1 1 9.8 9.9
2 2 9.6 9.9
3 3 9.7 9.8
4 4 9.7 9.7
5 5 9.9 9.5
N.B. spaces are automatically replaced by . in column names since R does not allow spaces in variable names.
[1] 9.74
[1] 9.733333
[1] 9.74
[1] 9.8
What happens if your input data includes missing values? While these are not scores, let’s try to apply this function to the Ozone values in the airquality data set, which do indeed contain NAs (i.e. missing values):
[1] 41 36 12 18 NA 28 23 19 8 NA 7 16 11 14 18 14 34 6
[19] 30 11 1 11 4 32 NA NA NA 23 45 115 37 NA NA NA NA NA
[37] NA 29 NA 71 39 NA NA 23 NA NA 21 37 20 12 13 NA NA NA
[55] NA NA NA NA NA NA NA 135 49 32 NA 64 40 77 97 97 85 NA
[73] 10 27 NA 7 48 35 61 79 63 16 NA NA 80 108 20 52 82 50
[91] 64 59 39 9 16 78 35 66 122 89 110 NA NA 44 28 65 NA 22
[109] 59 23 31 44 21 9 NA 45 168 73 NA 76 118 84 85 96 78 73
[127] 91 47 32 20 23 21 24 44 21 28 9 13 46 18 13 24 16 13
[145] 23 36 7 14 30 NA 14 18 20
By default, we will get NA rather than a number when we try to calculate the average of these values:
If we look at the documentation for the mean function (?mean), we will see that there is an argument na.rm that tells R whether NA values should be removed before the computation proceeds. By default this argument is set to FALSE. If we change this to TRUE, R will calculate the average with the NAs removed:
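The effect of na.rm can be seen directly on the built-in airquality data. This is a sketch; the object name oz is illustrative:

```r
# Pull out the Ozone column, which contains missing values
oz <- airquality$Ozone

# Count how many values are missing
sum(is.na(oz))

# With the default na.rm = FALSE, a single NA makes the result NA
mean(oz)

# With na.rm = TRUE, the NAs are dropped before averaging
mean(oz, na.rm = TRUE)
```

The second call is equivalent to averaging only the observed values, i.e. mean(oz[!is.na(oz)]).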
Interestingly, our olymean function works fine:
This is because the sort function removes NAs by default (see ?sort):
[1] 1 4 6 7 7 7 8 9 9 9 10 11 11 11 12 12 13 13
[19] 13 13 14 14 14 14 16 16 16 16 18 18 18 18 19 20 20 20
[37] 20 21 21 21 21 22 23 23 23 23 23 23 24 24 27 28 28 28
[55] 29 30 30 31 32 32 32 34 35 35 36 36 37 37 39 39 40 41
[73] 44 44 44 45 45 46 47 48 49 50 52 59 59 61 63 64 64 65
[91] 66 71 73 73 76 77 78 78 79 80 82 84 85 85 89 91 96 97
[109] 97 108 110 115 118 122 135 168
Since we are calling sort before calling the mean function, this will run without error. However, it is important to note that this function is removing the smallest number, the largest number, AND any missing values before computing the mean.
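The silent removal of missing values by sort() is easy to see on a tiny vector. The values below are illustrative; na.last is a documented argument of sort():

```r
# A small vector with one missing value
x <- c(3, NA, 1, 2)

# Default behaviour: the NA is dropped, leaving three values
sort(x)
length(sort(x))

# na.last = TRUE keeps the NA and places it at the end
sort(x, na.last = TRUE)
length(sort(x, na.last = TRUE))
```

If missing scores should not be ignored silently, one option is to check anyNA(x) inside the function and stop with an informative message.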