STAT 205: Introduction to Mathematical Statistics
University of British Columbia Okanagan
At the heart of any statistical analysis is data!

Variables can take different forms, such as:
Note
The data type will dictate what kind of visualizations we can create and what type of statistical tests can be used.
To learn from data, we can summarize them:
numerically (e.g., means, standard deviations)
visually (e.g., bar plots, violin plots)
This lecture focuses on the mechanics and construction of summary statistics and graphs using R.
⚖️ Measures of central tendency:
📐Measures of dispersion:
🏷️ Categorical (qualitative)
nominal (factor)
ordinal (ordered factor)
Oftentimes categorical variables are represented as numbers:
0 = never married, 1 = married, 2 = divorced
1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = somewhat agree, 5 = strongly agree
While it’s sometimes appropriate to treat these variables as numeric, we will often coerce them to the proper data type.
Character Data Type (character or string):
Example 1 coerce character type to numeric.
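A minimal sketch of this coercion, using a made-up character vector (the names `x_chr`, `x_num`, and `y_num` are illustrative only):

```r
# Numbers stored as text are treated as character strings
x_chr <- c("1", "2", "3")
class(x_chr)   # "character"

# Coerce to numeric so that arithmetic works
x_num <- as.numeric(x_chr)
class(x_num)   # "numeric"
mean(x_num)    # 2

# Entries that cannot be parsed become NA (R also issues a warning)
y_num <- suppressWarnings(as.numeric(c("10", "twenty", "30")))
y_num          # 10 NA 30
```

It is worth checking for unexpected `NA` values after coercing, since they usually point to entries that were not really numbers.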
R Structures
If stored in a data frame, you can see the data type of each variable using the str() function.
In my lecture slides, I will print my data frames in a paged table display, which presents the structure of the created data frame including:
Example 2 (Marital status) coerce numeric to factor
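One way this coercion might look, using the 0/1/2 coding and the 1 to 5 agreement scale described earlier. The data values below are made up for illustration:

```r
# Marital status stored as numeric codes (illustrative values)
marital_num <- c(0, 1, 1, 2, 0, 1)

# Nominal variable: attach labels to the codes with factor()
marital <- factor(marital_num,
                  levels = c(0, 1, 2),
                  labels = c("never married", "married", "divorced"))
class(marital)   # "factor"
table(marital)   # counts per category

# Ordinal variable: ordered = TRUE records that the levels have a ranking
agree_num <- c(1, 5, 3, 4, 2, 3)
agree <- factor(agree_num,
                levels = 1:5,
                labels = c("strongly disagree", "disagree", "neutral",
                           "somewhat agree", "strongly agree"),
                ordered = TRUE)
agree[1] < agree[2]   # TRUE: comparisons are meaningful for ordered factors
```

After coercion, functions such as `table()` and `summary()` report category counts rather than treating the codes as quantities.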
In R, there are several data structures for organizing and storing data.
Vectors: a collection of elements of the same data type.
iClicker
Exercise 1 (Data Types in R) What is the correct R data type for representing ordered categories like "Low", "Medium", and "High"?
The most common data structure we will be working with is a data.frame.
In R, a data frame is a two-dimensional, tabular data structure similar to a spreadsheet or a SQL table.
Rows represent individual observations or cases, and columns represent variables or attributes associated with those observations.
Columns can have different data types.
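As a small sketch, the data frame below mixes several column types. The object `students` and its values are hypothetical and only serve to illustrate the structure:

```r
# Each column is a vector of the same length; types may differ across columns
students <- data.frame(
  name   = c("Ana", "Ben", "Chloe"),                  # character
  age    = c(19, 21, 20),                             # numeric
  year   = factor(c("first", "third", "second"),
                  levels = c("first", "second", "third"),
                  ordered = TRUE),                    # ordered factor
  passed = c(TRUE, FALSE, TRUE),                      # logical
  stringsAsFactors = FALSE
)

dim(students)   # 3 rows (observations), 4 columns (variables)
str(students)   # one line per column, showing its data type
```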
In R, there are several ways to index or subset elements in a data structure. Recall the objects we created earlier:
For vectors, [i] is used to extract the element at index i.
Recall the matrix we defined earlier:
For matrices, [i, j] is used to extract the element in the \(i\)th row and \(j\)th column.
For lists, [[i]] is used to extract the \(i\)th element.
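The three indexing rules above can be sketched as follows. The objects `v`, `m`, and `lst` are stand-ins created here for illustration, not the objects defined earlier in the lecture:

```r
# Vector: [i] extracts the i-th element
v <- c(10, 20, 30, 40)
v[2]       # 20

# Matrix: [i, j] extracts row i, column j (matrix() fills column by column)
m <- matrix(1:6, nrow = 2)
m[2, 3]    # 6
m[1, ]     # entire first row: 1 3 5

# List: [[i]] extracts the i-th element itself
lst <- list(course = "STAT 205", scores = c(80, 90))
lst[[2]]   # the numeric vector 80 90
lst[2]     # single brackets return a list of length 1 instead
```

Leaving an index blank, as in `m[1, ]`, selects everything along that dimension.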
We can use [i, j] to extract the \(i\)th row and \(j\)th column from data frames
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
[37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
[55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
[73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
[91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
Similarly the $ operator can be used to extract columns from data frames
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
[37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
[55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
[73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
[91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
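The printed values above match the `Sepal.Length` column of R's built-in `iris` data. Assuming that is the data frame being indexed, the two approaches can be compared directly:

```r
# Column by position: leave the row index blank to keep all rows
first_col <- iris[, 1]

# Column by name using the $ operator
by_name <- iris$Sepal.Length

identical(first_col, by_name)   # TRUE: both return the same numeric vector

# A single cell: row 3, column 1
iris[3, 1]                      # 4.7

# Several rows and named columns at once
iris[1:3, c("Sepal.Length", "Species")]
```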
We’ll often use data sets from R packages
Steps to using a dataset from an R package
install.packages("packagename")
library(packagename)
help files
Dataset documentation is available through R’s help system via ?"objectname"
Note
Example 3 (Dataset from openintro) Let’s load the births14 data set from the openintro package and view its help file.
Install the package
Never use install.packages() in a .qmd/.rmd/.r file
Package installation should be done once, in the Console. If you include install.packages() in your Quarto documents, you will re-download the package every time you render your document.
A screenshot of my R session where I have installed the openintro package using the install.packages() function in the Console
Never use the install.packages() function in qmd/rmd files or R scripts. This function should exclusively be run in the Console.
Running the above command will make all the package’s datasets and functions available in your session.
Warning
Unlike install.packages(), which only needs to be run once, the library() command must be run every time you want to use a package.
Now that we have loaded the library, we can access its datasets (and functions).
In R, help() and ? (shortcut) are used to open the help page (documentation) for a function, dataset, or package.
For example, help(mean) or ?mean shows how the function works, its arguments, and examples.
Important
The help file documents a short description and source of the data, variable definitions, units, and known limitations. Always consult the help file before analyzing any dataset to ensure correct interpretation of variables and context.
The help file (?births14) provides the authoritative description of the dataset, including its source and variable definitions. Always consult the help file before analyzing any dataset to ensure correct interpretation of variables and context.
🖥️ Console (Interactive Mode)
📄 Quarto Render
.qmd file
Objects created during Quarto rendering are not available in the Console.
Objects created in the Console are not available when rendering a Quarto document.
Here are some other useful functions for inspecting your data:
tibble [1,000 × 13] (S3: tbl_df/tbl/data.frame)
$ fage : int [1:1000] 34 36 37 NA 32 32 37 29 30 29 ...
$ mage : num [1:1000] 34 31 36 16 31 26 36 24 32 26 ...
$ mature : chr [1:1000] "younger mom" "younger mom" "mature mom" "younger mom" ...
$ weeks : num [1:1000] 37 41 37 38 36 39 36 40 39 39 ...
$ premie : chr [1:1000] "full term" "full term" "full term" "full term" ...
$ visits : num [1:1000] 14 12 10 NA 12 14 10 13 15 11 ...
$ gained : num [1:1000] 28 41 28 29 48 45 20 65 25 22 ...
$ weight : num [1:1000] 6.96 8.86 7.51 6.19 6.75 6.69 6.13 6.74 8.94 9.12 ...
$ lowbirthweight: chr [1:1000] "not low" "not low" "not low" "not low" ...
$ sex : chr [1:1000] "male" "female" "female" "male" ...
$ habit : chr [1:1000] "nonsmoker" "nonsmoker" "nonsmoker" "nonsmoker" ...
$ marital : chr [1:1000] "married" "married" "married" "not married" ...
$ whitemom : chr [1:1000] "white" "white" "not white" "white" ...
A benefit of looking at the structure of a data frame using str() is that we learn the data type of each variable.
It is always important to check that the data are coded as we expect (this is especially true when reading data into R using read.csv()).
The data type will dictate which summary statistics and visualizations are most appropriate.
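To illustrate why this check matters, the sketch below reads a tiny made-up CSV (supplied inline through the `text` argument) in which a missing value was typed as "N/A". The column names and values are hypothetical:

```r
# A small CSV where one score was recorded as the text "N/A"
csv_text <- "id,score\n1,90\n2,N/A\n3,85"

# Default import: "N/A" is not recognised as missing, so score is not numeric
raw <- read.csv(text = csv_text)
str(raw)

# Declaring "N/A" as a missing-value code lets R read score as a number
clean <- read.csv(text = csv_text, na.strings = c("NA", "N/A"))
str(clean)

mean(clean$score, na.rm = TRUE)   # 87.5
```

Running `str()` right after import catches this kind of problem before it affects any summaries.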
In this section we will consider the loan50 dataset from the openintro package.
It represents 50 loans made through the Lending Club platform, which allows individuals to lend to other individuals.
This is a subset of the loans_full_schema dataset from the same package; see ?loans_full_schema for details.
loan50
Let’s consider the
tibble [50 × 18] (S3: tbl_df/tbl/data.frame)
$ state : Factor w/ 51 levels "","AK","AL","AR",..: 32 6 41 6 36 16 35 25 11 11 ...
$ emp_length : num [1:50] 3 10 NA 0 4 6 2 10 6 3 ...
$ term : num [1:50] 60 36 36 36 60 36 36 36 60 60 ...
$ homeownership : Factor w/ 3 levels "rent","mortgage",..: 1 1 2 1 2 2 1 2 1 2 ...
$ annual_income : num [1:50] 59000 60000 75000 75000 254000 67000 28800 80000 34000 80000 ...
$ verified_income : Factor w/ 4 levels "","Not Verified",..: 2 2 4 2 2 3 3 2 2 3 ...
$ debt_to_income : num [1:50] 0.558 1.306 1.056 0.574 0.238 ...
$ total_credit_limit : int [1:50] 95131 51929 301373 59890 422619 349825 15980 258439 87705 330394 ...
$ total_credit_utilized : int [1:50] 32894 78341 79221 43076 60490 72162 2872 28073 23715 32036 ...
$ num_cc_carrying_balance: int [1:50] 8 2 14 10 2 4 1 3 10 4 ...
$ loan_purpose : Factor w/ 14 levels "","car","credit_card",..: 4 3 4 3 5 5 4 3 3 4 ...
$ loan_amount : int [1:50] 22000 6000 25000 6000 25000 6400 3000 14500 10000 18500 ...
$ grade : Factor w/ 8 levels "","A","B","C",..: 3 3 6 3 3 3 5 2 2 4 ...
$ interest_rate : num [1:50] 10.9 9.92 26.3 9.92 9.43 ...
$ public_record_bankrupt : int [1:50] 0 1 0 0 0 0 0 0 0 1 ...
$ loan_status : Factor w/ 7 levels "","Charged Off",..: 3 3 3 3 3 3 3 3 3 3 ...
$ has_second_income : logi [1:50] FALSE FALSE FALSE FALSE FALSE FALSE ...
$ total_income : num [1:50] 59000 60000 75000 75000 254000 67000 28800 80000 34000 192000 ...
Scatterplots are useful for visualizing the relationship between two numerical variables. Create scatterplots in R using:
or alternatively:
Important
When you attach() a dataframe, the column names of that data frame become directly accessible without specifying the data frame name.
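A minimal sketch of both approaches with base R's `plot()`, assuming the openintro package is installed. The choice of `annual_income` and `loan_amount` as the plotted variables is an assumption based on the ggplot2 version of this figure:

```r
if (requireNamespace("openintro", quietly = TRUE)) {
  loan50 <- openintro::loan50

  # Option 1: refer to each column through the data frame
  plot(loan50$annual_income, loan50$loan_amount,
       xlab = "Annual Income", ylab = "Loan Amount")

  # Option 2: attach the data frame so columns can be used by name
  attach(loan50)
  plot(annual_income, loan_amount,
       xlab = "Annual Income", ylab = "Loan Amount")
  detach(loan50)   # undo attach() so the column names do not linger
}
```

Calling `detach()` afterwards keeps attached column names from silently clashing with other objects later in the session.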

Alternatively, we can create a scatterplot using the popular ggplot2 package.
# Load the ggplot2 library
library(ggplot2)
ggplot(loan50, aes(x = annual_income, y = loan_amount)) +
  geom_point() +
  scale_x_continuous(
    labels = scales::dollar_format()
  ) +
  scale_y_continuous(
    labels = scales::dollar_format()
  ) +
  labs(
    title = "Scatterplot of Annual Income vs Loan Amount",
    x = "Annual Income",
    y = "Loan Amount"
  )
A dot plot is a one-variable scatterplot
An example of a dotchart using the interest rate of 50 loans
The mean, often called the average, is a common way to measure the center of a distribution of data.
To compute the mean interest rate, we add up all the interest rates and divide by the number of observations.
In R this is accomplished using the mean() function.
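As a quick check that `mean()` matches the "add up and divide" description, here is a sketch with a short made-up vector of rates (not the loan50 data):

```r
# Illustrative interest rates (in %)
rates <- c(5.5, 7.0, 9.5, 10.0, 13.0)

# By hand: sum of the values divided by the number of observations
sum(rates) / length(rates)   # 9

# Built-in function
mean(rates)                  # 9

# Missing values propagate unless they are removed explicitly
mean(c(rates, NA))                 # NA
mean(c(rates, NA), na.rm = TRUE)   # 9
```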
Mean (Average) formula for sample vs. population
The sample mean is often labeled \(\overline{x}\). The sample mean can be computed as the sum of the observed values divided by the number of observations:
\[ \overline{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \]
where \(x_1, \dots, x_n\) represent the \(n\) observed values. The population mean is given by
\[ \mu = \frac{x_1 + x_2 + \dots + x_N}{N} \]
The population mean is also computed the same way but is denoted as \(\mu\). It is often not possible to calculate \(\mu\) since population data are rarely available.
The sample mean is a sample statistic, and serves as a point estimate of the population mean.
This estimate may not be perfect, but if the sample is good (representative of the population), it is usually a pretty good estimate.
A stacked dot plot of interest_rate for the loan50 data set. The distribution’s mean is shown as a red triangle.
The median is a measure of central tendency that represents the middle value in a dataset when it is ordered from least to greatest.
It is a robust statistic, meaning that it is not sensitive to extreme values or outliers in the data.
The median is also referred to as the 50th percentile because it divides the dataset into two equal halves. Fifty percent of the observations are below the median, and fifty percent are above it.
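The sketch below, on small made-up vectors, shows how `median()` handles odd and even sample sizes and how little it moves when one value becomes extreme:

```r
# Odd number of observations: the middle value after sorting
x_odd <- c(7, 3, 9, 5, 11)
sort(x_odd)       # 3 5 7 9 11
median(x_odd)     # 7

# Even number of observations: the average of the two middle values
x_even <- c(3, 5, 7, 9)
median(x_even)    # 6

# Robustness: replace the largest value with an extreme one
x_out <- c(3, 5, 7, 9, 110)
mean(x_odd)       # 7
mean(x_out)       # 26.8  (pulled toward the outlier)
median(x_out)     # 7     (unchanged)
```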
# Create a dot plot with mean and median using ggplot2
library(ggplot2)
mean_value <- mean(loan50$interest_rate)
median_value <- median(loan50$interest_rate)
ggplot(loan50, aes(x = interest_rate)) +
  geom_dotplot() +
  labs(x = "Interest Rate (in %)") +
  # red triangle marks the mean, green square marks the median
  geom_point(aes(x = mean_value, y = 0), color = "red", size = 5, shape = 17) +
  geom_point(aes(x = median_value, y = 0), color = "green", size = 5, shape = 15)
A stacked dot plot of interest_rate for the loan50 data set. The distribution’s mean is shown as a red triangle, the median is shown as a green square.
The mode is the value or category in a dataset that occurs most frequently.
For Numerical Data: The mode is the value that appears most often.
{2, 3, 3, 4, 4, 4, 5}, the mode is 4
For Categorical Data: The mode represents the category with the highest frequency.
{red, blue, blue, green, red, blue}, the mode is blue.
# Simple mode function
# (note: this name masks base R's mode(), which reports an object's storage mode)
mode <- function(x) {
  ux <- unique(x)                          # distinct values
  ux[which.max(tabulate(match(x, ux)))]    # most frequent one (first if tied)
}
mode_value <- mode(loan50$interest_rate)
# Dot plot with mean, median, and mode
# (mean_value and median_value are the mean and median of loan50$interest_rate)
ggplot(loan50, aes(x = interest_rate)) +
  geom_dotplot() +
  labs(x = "Interest Rate (in %)") +
  geom_point(aes(x = mean_value, y = 0),
             color = "red", size = 5, shape = 17) +
  geom_point(aes(x = median_value, y = 0),
             color = "green", size = 5, shape = 15) +
  geom_point(aes(x = mode_value, y = 0),
             color = "blue", size = 5, shape = 18)
Dot plots draw a dot for every single observation.
This is useful for small data sets, but they can become hard to read with larger samples.
Rather than showing the value of each observation, we prefer to think of the value as belonging to a bin.
| 5 to 10 | 10 to 15 | 15 to 20 | 20 to 25 | 25 to 30 |
|---|---|---|---|---|
| 26 | 12 | 9 | 2 | 1 |
Histograms provide a view of the data density. Higher bars represent where the data are relatively more common.
Usually R will do a pretty good job of picking the bin size but we can play around with it.
The breaks argument represents the suggested number of bins you want the data to be divided into. R will then automatically determine the appropriate bin width.
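The binning idea can be sketched with `cut()` and `table()`, and compared with what `hist()` computes internally. The vector `x` below is made up for illustration; with the real data one would use `loan50$interest_rate` instead:

```r
# Illustrative values (in %)
x <- c(5.3, 6.1, 7.4, 8.0, 9.9, 10.9, 12.6, 14.1, 17.1, 26.3)

# Assign each value to a bin of width 5, then count values per bin
bins <- cut(x, breaks = seq(5, 30, by = 5))
table(bins)   # counts: 5 3 1 0 1

# hist() does the same counting; plot = FALSE returns the numbers only
h <- hist(x, breaks = seq(5, 30, by = 5), plot = FALSE)
h$counts      # 5 3 1 0 1

# A single number is only a suggestion; R picks "pretty" break points
h2 <- hist(x, breaks = 10, plot = FALSE)
h2$breaks
```

Supplying a vector of break points gives exact control over the bins, whereas a single number leaves the final choice to R.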
A histogram of interest_rate. This distribution is strongly skewed to the right
| 4 to 6 | 6 to 8 | 8 to 10 | 10 to 12 | 12 to 14 | 14 to 16 | 16 to 18 | 18 to 20 | 20 to 22 | 22 to 24 | 24 to 26 | 26 to 28 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 12 | 11 | 8 | 3 | 2 | 4 | 4 | 1 | 0 | 1 | 1 |
Histograms are especially convenient for describing the shape of the data distribution.
Symmetric: The distribution is balanced on both sides of the center.
Skewed: The distribution has a longer tail on one side, either to the right (positive skew) or left (negative skew).
Uniform: The distribution is flat and lacks pronounced peaks or troughs.
The median is particularly useful when describing the center of skewed or asymmetrical datasets
For right/positively skewed distributions the median is typically less than the mean (see Dot plot with median)
For left/negatively skewed distributions, the median is greater than the mean.
We can also identify the “center” of the distribution.
By examining the width of the distribution, histograms can also be used to assess how spread out the values are.
Consider the final exam score for a STAT course:
set.seed(888)
# Simulate 500 left-skewed exam scores on a 0 to 100 scale
test_scores <- 100 * rbeta(500, shape1 = 4.5, shape2 = 2)
hist(test_scores, main = "Left-Skewed Distribution", xlab = "Final Exam Score (in %)", col = "lightblue", border = "black")
# Add vertical lines for mean and median
abline(v = mean(test_scores), col = "red", lty = 2, lwd = 2)
abline(v = median(test_scores), col = "blue", lty = 2, lwd = 2)
# Add legend
legend("topleft", legend = c("Mean", "Median"), col = c("red", "blue"), lty = 2, lwd = 2)
Modality describes the number of peaks (modes) in a distribution’s shape. These peaks correspond to local maxima in the frequency or probability density function.
A distribution can be:


Quartiles are values that divide a dataset into four equal parts, each representing 25% of the data. There are three quartiles:
First Quartile (Q1): Also known as the lower quartile, Q1 represents the 25th percentile of the data. It is the value below which 25% of the data falls.
Second Quartile (Q2): is the median of the dataset. It represents the 50th percentile, and 50% of the data falls below and 50% above this value.
Third Quartile (Q3): Also known as the upper quartile, Q3 represents the 75th percentile of the data. It is the value below which 75% of the data falls.
The interquartile range (IQR) measures the spread of the middle 50% of the data:
\[ IQR = Q_3 - Q_1 \]
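A short sketch of these quantities in R, on a made-up vector. Note that `quantile()` supports several definitions through its `type` argument, so hand calculations may differ slightly from R's default:

```r
# Illustrative data, already sorted (n = 9)
x <- c(2, 4, 4, 5, 7, 8, 9, 10, 12)

# The three quartiles
q <- quantile(x, probs = c(0.25, 0.50, 0.75))
q                     # 25%: 4, 50%: 7, 75%: 9

# Interquartile range, by hand and with the built-in function
q["75%"] - q["25%"]   # 5
IQR(x)                # 5
```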
A box plot summarizes a data set using five statistics while also plotting unusual observations
The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.
You can create one in R using the boxplot() function.
Observations are compared against the range
\[ [Q_1 - 1.5 \times IQR, Q_3 + 1.5 \times IQR] \]
Individual data points outside of the range are flagged as suspected outliers and are often plotted individually.
They can be important to identify because they may indicate unusual observations or errors.
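The flagging rule can be sketched by computing the range by hand and comparing it with what R's box plot machinery reports. The data are made up, and `boxplot.stats()` uses hinges that can differ slightly from `quantile()` for other sample sizes:

```r
# Illustrative data with one unusually large value
x <- c(5, 6, 7, 8, 9, 10, 11, 12, 30)

q1  <- unname(quantile(x, 0.25))   # 7
q3  <- unname(quantile(x, 0.75))   # 11
iqr <- q3 - q1                     # 4

lower <- q1 - 1.5 * iqr            # 1
upper <- q3 + 1.5 * iqr            # 17

# Observations outside [lower, upper] are suspected outliers
outliers <- x[x < lower | x > upper]
outliers                           # 30

# The same point is reported by boxplot.stats() and drawn individually
boxplot.stats(x)$out               # 30
boxplot(x, horizontal = TRUE, xlab = "Illustrative values")
```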
A vertical dot plot, where points have been horizontally stacked, next to a labeled box plot for the interest rates of the 50 loans. Figure 2.10 (Diez, Barr, and Çetinkaya-Rundel 2016)
Boxplots can also help to identify skewness:
Variance
Variance is roughly the average squared deviation from the mean. For a sample of size \(n\) the sample variance is calculated as
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
The population variance is given by:
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \]
Sample Standard deviation
\[ s = \sqrt{s^2} \]
Population standard deviation
\[ \sigma = \sqrt{\sigma^2} \]
The standard deviation represents the typical deviation of observations from the mean.
Standard deviation has the same units as the data.
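The formulas above can be checked against R's built-in functions. In this sketch the vector `x` is made up; note that `var()` and `sd()` use the sample versions with \(n - 1\) in the denominator:

```r
x <- c(4, 8, 6, 5, 7)   # illustrative data, mean = 6
n <- length(x)

# Sample variance by hand: squared deviations summed, divided by n - 1
s2_manual <- sum((x - mean(x))^2) / (n - 1)
s2_manual        # 2.5
var(x)           # 2.5

# Sample standard deviation: square root of the sample variance
sqrt(s2_manual)  # about 1.58
sd(x)            # same value

# Population variance (divide by N), treating x as the whole population
sigma2 <- sum((x - mean(x))^2) / n
sigma2           # 2
```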
Robustness
Which measure of central tendency is least affected by extreme values in a dataset?
Distribution shape

Which of the following best describes the shape of the histogram?
Boxplot Whiskers
In a boxplot, the whiskers extend to:
Modality

Identify the number of modes (peaks) in the distribution.
Figure 1: A density plot of the interest rates for the loan50 data using base plotting in R. Use ?plot.density and ?density to learn more.
Figure 2: A density plot of the interest rates for the loan50 data using ggplot.
🤔 Discussion Question
What difference do you notice between the two smoothed density plots?
Violin plots combine both ideas:
Figure 3: A violin plot of the interest rates for the loan50 data. The width of the violin reflects the density of values.
Figure 4: A violin plot of the interest rates for the loan50 data with the trim removed and a different theme.
Figure 5: A violin plot with an embedded boxplot and data showing the distribution of interest rates for the loan50 data.
Let’s consider the email dataset (from the openintro package), which comprises 3921 incoming emails received by one email account during the first three months of 2012:
A bar plot is a common way to display a single categorical variable.
We can either plot according to frequency/count (left) or proportion (right).

A table that summarizes data for two categorical variables is called a contingency table.
A contingency table is a tabular representation of the joint distribution of two or more categorical variables.
email:
number: what number, if any, is present in the email: none, small (under 1 million), big (a number > 1 million)
spam: indicator for whether the email was spam.
We can make nicer-looking tables using additional R packages.
Contingency table of spam (rows) by number (columns):

| | none | small | big | Total |
|---|---|---|---|---|
| not spam | 400 | 2659 | 495 | 3554 |
| spam | 149 | 168 | 50 | 367 |
| Total | 549 | 2827 | 545 | 3921 |
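A sketch of how such a table might be handled in R. To keep the example self-contained, the counts are typed in from the table above rather than computed from the data; with the dataset loaded, `table(email$spam, email$number)` would be the usual starting point:

```r
# Counts entered row by row from the contingency table
counts <- matrix(c(400, 2659, 495,
                   149,  168,  50),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(spam   = c("not spam", "spam"),
                                 number = c("none", "small", "big")))
tab <- as.table(counts)

# Add the row and column totals
addmargins(tab)

# Column proportions: within each number category, the share that is spam
round(prop.table(tab, margin = 2), 3)
```

Choosing `margin = 1` instead would give row proportions, that is, the distribution of `number` within spam and non-spam emails.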
Stacked bar plots visualize two categorical variables
Alternatively, you may prefer to situate your bars side-by-side.
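Both layouts can be sketched with base R's `barplot()`. This is only one possible approach (the original figures may have been drawn differently), and the counts are re-entered from the contingency table so the block runs on its own:

```r
# Counts of spam status (rows) by number category (columns)
tab <- as.table(matrix(c(400, 2659, 495,
                         149,  168,  50),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(spam   = c("not spam", "spam"),
                                       number = c("none", "small", "big"))))

# Stacked: one bar per column, split by row
mids_stacked <- barplot(tab, legend.text = TRUE,
                        xlab = "number", ylab = "Count")

# Side-by-side: one bar per cell, grouped by column
mids_side <- barplot(tab, beside = TRUE, legend.text = TRUE,
                     xlab = "number", ylab = "Count")
```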
Similarly, it can be useful to look at side-by-side boxplots.
formula notation
The formula notation y ~ x specifies that the variable “y” should be plotted against the levels of the “x” variable.
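As an illustration of the formula interface, the sketch below uses the built-in `iris` data (an assumption made for self-containment; any data frame with one numeric and one categorical column would work the same way):

```r
# One box per level of Species, showing the distribution of Sepal.Length
b <- boxplot(Sepal.Length ~ Species, data = iris,
             xlab = "Species", ylab = "Sepal length (cm)")

# The returned object records the group labels and the five-number summaries
b$names   # "setosa" "versicolor" "virginica"
b$stats   # one column of summary values per group
```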

Comments
In the previous visualization, the categories had a natural order (ordinal)
For unordered (nominal) categories, it’s common to order the categories from largest to smallest.
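One way to get this ordering in base R is to sort the frequency table before plotting. The vector `purpose` below is made up for illustration:

```r
# Illustrative nominal variable
purpose <- c("car", "credit_card", "debt_consolidation",
             "credit_card", "debt_consolidation", "debt_consolidation")

# Count each category, then sort from most to least frequent
counts <- table(purpose)
sorted <- sort(counts, decreasing = TRUE)
sorted

# Bars now appear from largest to smallest
barplot(sorted, ylab = "Count")

# To make the ordering stick elsewhere, reorder the factor levels
purpose_f <- factor(purpose, levels = names(sorted))
levels(purpose_f)
```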