STAT 205: Introduction to Mathematical Statistics
University of British Columbia Okanagan
January 23, 2024
This lecture focuses on the mechanics and construction of summary statistics and graphs using R.
Measures of central tendency: Mean (Average) and Median
Measures of dispersion: Variance (and Standard Deviation) and Interquartile Range (IQR)
Here’s a brief overview of how each data type is coded in R:
Numerical (quantitative): continuous (numeric) or discrete (integer)
Categorical (qualitative): nominal (factor) or ordinal (ordered factor)
Sometimes categorical variables are represented as numbers:
0 = never married, 1 = married, 2 = divorced
1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = somewhat agree, 5 = strongly agree
While it will sometimes be appropriate to treat these variables as numeric, we will often need to coerce them to the proper data type.
Character Data Type (character or string):
Logical Data Type (logical or boolean):
Represents binary or Boolean values (TRUE or FALSE).
Used for logical conditions and comparisons.
In R, you can coerce objects using functions like as.numeric(), as.character(), as.factor(), etc.
e.g. character to numeric
e.g. marital status
[1] 0 2 1 1 0
[1] 0 2 1 1 0
Levels: 0 1 2
[1] never married divorced married married never married
Levels: never married married divorced
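The printed results above are consistent with a sketch along the following lines. The object names (x_chr, status_codes, status_factor, status_labelled) and the string "3.14" are made up for illustration; the original slide code may have differed.

```r
# Character to numeric: the string "3.14" becomes the number 3.14
x_chr <- "3.14"                      # illustrative value, not from the slides
x_num <- as.numeric(x_chr)
class(x_num)                         # "numeric"

# Marital status stored as numeric codes
status_codes <- c(0, 2, 1, 1, 0)
status_codes

# Coerce the codes to a factor; the levels are the sorted unique codes
status_factor <- as.factor(status_codes)
status_factor

# Attach a descriptive label to each code
status_labelled <- factor(status_codes,
                          levels = c(0, 1, 2),
                          labels = c("never married", "married", "divorced"))
status_labelled
```

Supplying both levels and labels keeps the mapping from code to meaning explicit, which is safer than relying on the default ordering.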
In R, there are several data structures for organizing and storing data.
Vectors: a collection of elements of the same data type.
Lists: a collection of elements that can be of different data types.
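A minimal sketch of the two structures, using made-up objects (ages, mixed, person) purely for illustration:

```r
# A vector holds elements of a single data type
ages <- c(23, 35, 41)
class(ages)

# Mixing types inside c() silently coerces everything to one common type
mixed <- c(1, "a", TRUE)
class(mixed)                         # "character"

# A list can hold elements of different types (and different lengths)
person <- list(name = "Alex", age = 30, passed = c(TRUE, FALSE))
str(person)
```

The silent coercion in the second example is one reason to check the data type of an object before summarizing it.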
Now that we have had a refresher in R, let’s load the openintro package as well as the loan50 dataset, which is described here: https://www.openintro.org/data/index.php?data=loan50
The most common data structure we will be working with is a data.frame.
In R, a data frame is a two-dimensional, tabular data structure similar to a spreadsheet or a SQL table.
Rows represent individual observations or cases, and columns represent variables or attributes associated with those observations.
Columns can have different data types.
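As a small illustration, the sketch below builds a data frame by hand. The values are the first five rows of three loan50 columns as they appear in the str() output later in this lecture; the name loans_small is hypothetical.

```r
# Each argument becomes a column; each position across columns forms a row
loans_small <- data.frame(
  homeownership = factor(c("rent", "rent", "mortgage", "rent", "mortgage")),
  annual_income = c(59000, 60000, 75000, 75000, 254000),
  loan_amount   = c(22000L, 6000L, 25000L, 6000L, 25000L)
)

dim(loans_small)            # number of rows, then number of columns
sapply(loans_small, class)  # one data type per column
```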
In R there are several ways to index or subset elements in a data structure.
For vectors, [i] is used to extract the element at index i.
For data frames, [i, j] extracts the \(i\)th row and \(j\)th column; leaving i blank, as in [, j], returns the entire \(j\)th column:
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
[37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
[55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
[73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
[91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
Similarly, the $ operator can be used to extract columns from data frames:
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
[37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
[55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
[73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
[91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
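The printed values appear to be the Sepal.Length column of R's built-in iris data. Assuming so, the indexing styles described above can be sketched as follows (the vector x is made up for illustration):

```r
x <- c(10, 20, 30, 40)
x[2]                       # second element of a vector

iris[3, 1]                 # row 3, column 1 of a data frame
head(iris[, 1])            # first six values of column 1 (all rows requested)
head(iris$Sepal.Length)    # the same column, extracted by name

# Both approaches return the same numeric vector
identical(iris[, 1], iris$Sepal.Length)
```

Extracting by name with $ is usually easier to read than a column number, and it keeps working if the column order changes.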
While you can see a full description on the website, you can also obtain the help documentation for an object or function by typing ? followed by its name into the Console.
Other useful functions for inspecting your data include str().
Looking at the structure of a data frame using str() has the benefit of showing the data type of each variable.
It is always important to check that the data are coded as we expect (especially when reading data into R using read.csv).
The data type will dictate which summary statistics and visualizations are most appropriate.
loan50
tibble [50 × 18] (S3: tbl_df/tbl/data.frame)
$ state : Factor w/ 51 levels "","AK","AL","AR",..: 32 6 41 6 36 16 35 25 11 11 ...
$ emp_length : num [1:50] 3 10 NA 0 4 6 2 10 6 3 ...
$ term : num [1:50] 60 36 36 36 60 36 36 36 60 60 ...
$ homeownership : Factor w/ 3 levels "rent","mortgage",..: 1 1 2 1 2 2 1 2 1 2 ...
$ annual_income : num [1:50] 59000 60000 75000 75000 254000 67000 28800 80000 34000 80000 ...
$ verified_income : Factor w/ 4 levels "","Not Verified",..: 2 2 4 2 2 3 3 2 2 3 ...
$ debt_to_income : num [1:50] 0.558 1.306 1.056 0.574 0.238 ...
$ total_credit_limit : int [1:50] 95131 51929 301373 59890 422619 349825 15980 258439 87705 330394 ...
$ total_credit_utilized : int [1:50] 32894 78341 79221 43076 60490 72162 2872 28073 23715 32036 ...
$ num_cc_carrying_balance: int [1:50] 8 2 14 10 2 4 1 3 10 4 ...
$ loan_purpose : Factor w/ 14 levels "","car","credit_card",..: 4 3 4 3 5 5 4 3 3 4 ...
$ loan_amount : int [1:50] 22000 6000 25000 6000 25000 6400 3000 14500 10000 18500 ...
$ grade : Factor w/ 8 levels "","A","B","C",..: 3 3 6 3 3 3 5 2 2 4 ...
$ interest_rate : num [1:50] 10.9 9.92 26.3 9.92 9.43 ...
$ public_record_bankrupt : int [1:50] 0 1 0 0 0 0 0 0 0 1 ...
$ loan_status : Factor w/ 7 levels "","Charged Off",..: 3 3 3 3 3 3 3 3 3 3 ...
$ has_second_income : logi [1:50] FALSE FALSE FALSE FALSE FALSE FALSE ...
$ total_income : num [1:50] 59000 60000 75000 75000 254000 67000 28800 80000 34000 192000 ...
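The same kind of inspection can be tried without installing anything by using the built-in iris data as a stand-in for loan50. This is only a sketch; col_types is a hypothetical helper name.

```r
str(iris)          # structure: dimensions plus the type of every column
head(iris, 3)      # first three rows
summary(iris)      # per-column summaries, chosen according to data type

# Collect the class of every column into a named character vector
col_types <- sapply(iris, class)
col_types
```

Note how summary() reports quartiles for the numeric columns but counts for the factor column, which is the sense in which the data type dictates the appropriate summary.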
Scatterplots are useful for visualizing the relationship between two numerical variables. In base R they can be created with the plot() function, referring to the columns either through the data frame (with $) or, after attaching the data frame, by name alone.
Important: When you attach() a data frame, the column names of that data frame become directly accessible without specifying the data frame name.
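A hedged sketch of the base R options follows. To stay self-contained it uses only the first ten rows of annual_income and loan_amount as printed in the str() output above, stored under the hypothetical name loan_head; with the openintro package loaded, loan50 could be used in its place.

```r
loan_head <- data.frame(
  annual_income = c(59000, 60000, 75000, 75000, 254000,
                    67000, 28800, 80000, 34000, 80000),
  loan_amount   = c(22000, 6000, 25000, 6000, 25000,
                    6400, 3000, 14500, 10000, 18500)
)

# Option 1: pass the x and y vectors directly
plot(loan_head$annual_income, loan_head$loan_amount,
     xlab = "Annual Income", ylab = "Loan Amount")

# Option 2: formula interface, read as "loan_amount against annual_income"
plot(loan_amount ~ annual_income, data = loan_head)

# Option 3: with() gives by-name access without attaching the data frame
with(loan_head, plot(annual_income, loan_amount))
```

Unlike attach(), with() only exposes the column names for the duration of one call, so it cannot leave stale names on the search path.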
Alternatively, we can create a scatterplot using the popular ggplot2 package:
# Load the ggplot2 library
library(ggplot2)
ggplot(loan50, aes(x = annual_income, y = loan_amount)) +
  geom_point() +
  scale_x_continuous(
    labels = scales::dollar_format()
  ) +
  scale_y_continuous(
    labels = scales::dollar_format()
  ) +
  labs(
    title = "Scatterplot of Annual Income vs Loan Amount",
    x = "Annual Income",
    y = "Loan Amount"
  )
A dot plot is a one-variable scatterplot.
The mean, often called the average, is a common way to measure the center of a distribution of data.
To compute the mean interest rate, we add up all the interest rates and divide by the number of observations.
In R this is accomplished using the mean() function.
Mean (Average) formula for sample vs. population
The sample mean is often labeled \(\overline{x}\). The sample mean can be computed as the sum of the observed values divided by the number of observations:
\[ \overline{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \]
where \(x_1, \dots, x_n\) represent the \(n\) observed values. The population mean, denoted \(\mu\), is computed the same way over all \(N\) values in the population:
\[ \mu = \frac{x_1 + x_2 + \dots + x_N}{N} \]
It is often not possible to calculate \(\mu\) since population data are rarely available.
The sample mean is a sample statistic, and serves as a point estimate of the population mean.
This estimate may not be perfect, but if the sample is good (representative of the population), it is usually a pretty good estimate.
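The sample mean formula can be checked by hand in R. The sketch below uses only the first five interest rates as rounded in the str() output above, so the result is illustrative and is not the mean of the full loan50 sample.

```r
# First five interest rates, as rounded by str()
rates <- c(10.9, 9.92, 26.3, 9.92, 9.43)

n <- length(rates)
xbar_manual  <- sum(rates) / n   # sum of the observations divided by n
xbar_builtin <- mean(rates)      # the built-in equivalent

c(manual = xbar_manual, builtin = xbar_builtin)
```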
The median is a measure of central tendency that represents the middle value in a dataset when it is ordered from least to greatest.
It is a robust statistic, meaning that it is not sensitive to extreme values or outliers in the data.
The median is also referred to as the 50th percentile because it divides the dataset into two equal halves. Fifty percent of the observations are below the median, and fifty percent are above it.
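A small sketch of both ideas, again using the first five interest rates as rounded in the str() output. The extreme value 260 is invented to show robustness.

```r
rates <- c(10.9, 9.92, 26.3, 9.92, 9.43)

# With an odd number of observations, the median is the middle sorted value
sorted <- sort(rates)
middle <- sorted[(length(sorted) + 1) / 2]
c(by_hand = middle, builtin = median(rates))

# Replace the largest value with an invented, far more extreme one
rates_extreme <- replace(rates, which.max(rates), 260)

# The mean moves a lot; the median does not move at all
c(mean = mean(rates), median = median(rates))
c(mean = mean(rates_extreme), median = median(rates_extreme))
```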
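# Create a dot plot with mean and median using ggplot2
# Compute both summaries first; interest_rate is a column of loan50
mean_value <- mean(loan50$interest_rate)
median_value <- median(loan50$interest_rate)
ggplot(loan50, aes(x = interest_rate)) +
  geom_dotplot() +
  labs(x = "Interest Rate (in %)") +
  # Red triangle marks the mean, green square marks the median
  annotate("point", x = mean_value, y = 0, color = "red", size = 5, shape = 17) +
  annotate("point", x = median_value, y = 0, color = "green", size = 5, shape = 15)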
Dot plots show the exact value for each observation. This is useful for small data sets, but they can become hard to read with larger samples.
Rather than showing the value of each observation, we prefer to think of each value as belonging to a bin.
5 to 10 | 10 to 15 | 15 to 20 | 20 to 25 | 25 to 30 |
---|---|---|---|---|
26 | 12 | 9 | 2 | 1 |
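Counts like these can be produced with cut() and table(). The sketch below bins only the first five interest rates from the str() output, so its counts will not match the table. By default cut() treats each interval as closed on the right, which may or may not be the convention used for the table above.

```r
rates <- c(10.9, 9.92, 26.3, 9.92, 9.43)

# Assign each rate to an interval of width 5: (5,10], (10,15], ..., (25,30]
bins <- cut(rates, breaks = seq(5, 30, by = 5))

# Count how many observations fall in each bin
bin_counts <- table(bins)
bin_counts
```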
Histograms provide a view of the data density. Higher bars represent where the data are relatively more common.
Usually R will do a pretty good job of picking the bin size but we can play around with it.
The breaks argument represents the suggested number of bins you want the data to be divided into. R will then automatically determine the appropriate bin width.
4 to 6 | 6 to 8 | 8 to 10 | 10 to 12 | 12 to 14 | 14 to 16 | 16 to 18 | 18 to 20 | 20 to 22 | 22 to 24 | 24 to 26 | 26 to 28 |
---|---|---|---|---|---|---|---|---|---|---|---|
3 | 12 | 11 | 8 | 3 | 2 | 4 | 4 | 1 | 0 | 1 | 1 |
A histogram of interest_rate. This distribution is strongly skewed to the right.
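The effect of breaks can be explored with any numeric vector. The sketch below uses the built-in iris data as a stand-in, since loan50 requires the openintro package. Setting plot = FALSE returns the bin information without drawing.

```r
# Ask for roughly 5 bins, then roughly 20; R treats both as suggestions
h_coarse <- hist(iris$Sepal.Length, breaks = 5, plot = FALSE)
h_fine   <- hist(iris$Sepal.Length, breaks = 20, plot = FALSE)

h_coarse$breaks            # bin boundaries R actually chose
h_coarse$counts            # observations per bin
length(h_fine$counts)      # a larger suggestion yields more, narrower bins

# Draw the finer histogram
hist(iris$Sepal.Length, breaks = 20,
     main = "Histogram with breaks = 20", xlab = "Sepal Length (cm)")
```

Because R rounds the boundaries to convenient values, the number of bins drawn may differ from the number requested.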
Histograms are especially convenient for describing the shape of the data distribution.
Symmetric: The distribution is balanced on both sides of the center.
Skewed: The distribution has a longer tail on one side, either to the right (positive skew) or left (negative skew).
Uniform: The distribution is flat and lacks pronounced peaks or troughs.
The median is particularly useful for describing the center of a skewed or asymmetrical distribution (see Shapes of Distributions).
In a right/positively skewed distribution the median is typically less than the mean (see Dot plot with median), and in a left/negatively skewed distribution the median is typically greater than the mean.
We can also identify the “center” of the distribution.
By examining the width of the distribution, histograms also let us assess how “spread out” the values are.
Consider the final exam score for a STAT course:
set.seed(888)
# Beta(4.5, 2) puts most of its mass near the top, giving a long left tail
test_scores <- 100 * rbeta(500, shape1 = 4.5, shape2 = 2)
hist(test_scores, main = "Left-Skewed Distribution",
     xlab = "Final Exam Score (in %)", col = "lightblue", border = "black")
# Add vertical lines for mean and median
abline(v = mean(test_scores), col = "red", lty = 2, lwd = 2)
abline(v = median(test_scores), col = "blue", lty = 2, lwd = 2)
# Add legend
legend("topleft", legend = c("Mean", "Median"), col = c("red", "blue"), lty = 2, lwd = 2)
Quartiles are values that divide a dataset into four equal parts, each representing 25% of the data. There are three quartiles:
First Quartile (Q1): Also known as the lower quartile, Q1 represents the 25th percentile of the data. It is the value below which 25% of the data falls.
Second Quartile (Q2): The median of the dataset. It represents the 50th percentile; 50% of the data falls below this value and 50% above it.
Third Quartile (Q3): Also known as the upper quartile, Q3 represents the 75th percentile of the data. It is the value below which 75% of the data falls.
\[ IQR = Q3 - Q1 \]
IQR is a measure of statistical dispersion that describes the spread or variability within the middle 50% of a dataset.
It is particularly useful when dealing with skewed or non-normally distributed data, as it is less sensitive to extreme values (outliers) than the full range or Standard Deviation.
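Quartiles and the IQR can be computed with quantile() and IQR(). This sketch uses the first ten loan amounts from the str() output, so the numbers are illustrative only. R's default quantile method (type 7) interpolates between observations, and other software or textbooks may give slightly different quartiles.

```r
loan_amounts <- c(22000, 6000, 25000, 6000, 25000,
                  6400, 3000, 14500, 10000, 18500)

# Q1, Q2 (the median), and Q3
q <- quantile(loan_amounts, probs = c(0.25, 0.50, 0.75))
q

# IQR = Q3 - Q1, by hand and with the built-in function
iqr_manual  <- unname(q["75%"] - q["25%"])
iqr_builtin <- IQR(loan_amounts)
c(manual = iqr_manual, builtin = iqr_builtin)
```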
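A box plot summarizes a data set using five statistics while also plotting unusual observations.
The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.
You can create one in R using the boxplot() function. The whiskers extend to the most extreme observations that still fall within the range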
\[ [Q_1 - 1.5 \times IQR, Q_3 + 1.5 \times IQR] \]
Individual data points outside of the range are flagged as suspected outliers and are often plotted individually.
They can be important to identify because they may indicate unusual observations or errors.
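The 1.5 × IQR rule can be applied by hand, as in the sketch below. It uses the first five interest rates from the str() output, and the names lower_fence, upper_fence and suspected are hypothetical. boxplot() computes its box edges as hinges, which can differ slightly from quantile() in other samples, although the two agree here.

```r
rates <- c(10.9, 9.92, 26.3, 9.92, 9.43)

q1  <- quantile(rates, 0.25, names = FALSE)
q3  <- quantile(rates, 0.75, names = FALSE)
iqr <- q3 - q1

# Observations beyond these fences are flagged as suspected outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
suspected <- rates[rates < lower_fence | rates > upper_fence]
suspected

# The box plot draws the flagged observation as an individual point
boxplot(rates, horizontal = TRUE, xlab = "Interest Rate (in %)")
```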
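Boxplots can also help to identify skewness: a noticeably longer whisker, or a longer half of the box, on one side suggests skew in that direction.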
Variance
Variance is roughly the average squared deviation from the mean. For a sample of size \(n\) the sample variance is calculated as
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
The population variance is given by:
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \]
Sample standard deviation
\[ s = \sqrt{s^2} \]
Population standard deviation
\[ \sigma = \sqrt{\sigma^2} \]
The standard deviation represents the typical deviation of observations from the mean.
Standard deviation has the same units as the data.
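The formulas above can be verified against R's var() and sd(), which both use the \(n-1\) divisor. The sketch uses the first five interest rates from the str() output. The last step treats those five values as if they were a whole population, which is an assumption made purely to show the \(1/N\) formula.

```r
rates <- c(10.9, 9.92, 26.3, 9.92, 9.43)
n <- length(rates)

# Sample variance: sum of squared deviations divided by n - 1
s2_manual <- sum((rates - mean(rates))^2) / (n - 1)
s_manual  <- sqrt(s2_manual)

c(manual = s2_manual, builtin = var(rates))
c(manual = s_manual,  builtin = sd(rates))

# Population-style variance divides by N instead, so it is slightly smaller
sigma2 <- sum((rates - mean(rates))^2) / n
sigma2
```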
Identify the number of modes (peaks) in the distribution. A unimodal distribution has one peak, while bimodal or multimodal distributions have two or more peaks.
Is the histogram right skewed, left skewed, or symmetric?
Let’s consider the email dataset, which comprises 3921 incoming emails received by one email account during the first three months of 2012 (from the openintro package):
A table that summarizes data for two categorical variables is called a contingency table.
A contingency table is a tabular representation of the joint distribution of two or more categorical variables.
For instance, we can consider the following two categorical variables from the email dataset:
number: what type of number, if any, is present in the email: none, small (under 1 million), or big
spam: indicator for whether the email was spam
A bar plot is a common way to display a single categorical variable.
Alternatively, you may prefer to situate your bars side-by-side rather than stacked. In this side-by-side barplot we have opted to use different colours and create a legend.
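A contingency table and the bar plots described here can be sketched in base R with table() and barplot(). The eight rows below are invented to mimic the spam and number variables; they are not taken from the email dataset.

```r
# Hypothetical toy data with the same two variables
emails <- data.frame(
  spam   = factor(c(0, 0, 1, 0, 1, 0, 0, 1), labels = c("not spam", "spam")),
  number = factor(c("small", "none", "big", "small", "none", "big", "small", "small"),
                  levels = c("none", "small", "big"))
)

# Contingency table: rows are spam status, columns are number type
tab <- table(emails$spam, emails$number)
tab

barplot(table(emails$number))        # one categorical variable
barplot(tab, legend.text = TRUE)     # stacked bars, split by spam status
barplot(tab, beside = TRUE,          # side-by-side bars with a legend
        col = c("steelblue", "tomato"), legend.text = TRUE)
```

Setting the factor levels explicitly keeps the bars in the order none, small, big instead of alphabetical order.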
Similarly, it can be useful to look at side-by-side boxplots.
formula notation
The formula notation y ~ x specifies that the variable “y” should be plotted against the levels of the “x” variable.
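As a sketch of the formula notation, the built-in iris data can stand in for the lecture's datasets: Sepal.Length plays the role of "y" and the factor Species plays the role of "x".

```r
# One box per level of Species, drawn side by side
b <- boxplot(Sepal.Length ~ Species, data = iris,
             xlab = "Species", ylab = "Sepal Length (cm)")

# boxplot() invisibly returns the statistics it used for each group
b$names
```

The data argument tells R where to look up the variable names in the formula, so neither $ nor attach() is needed.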