Lecture 2: Summarizing Data

STAT 205: Introduction to Mathematical Statistics

Dr. Irene Vrbik

University of British Columbia Okanagan

Introduction

At the heart of any statistical analysis is data!

Data are most often organized in tidy format, where:

- observations are in rows
- variables are in columns

Data collected on students in a statistics class on a variety of variables.

Data types

Variables can take different forms, such as:
- Categorical (stored as factor in R)
- Quantitative (stored as numeric for continuous values or integer for discrete values in R)

Note

The data type will dictate what kind of visualizations we can create and what type of statistical tests can be used.

Learning from Data

To learn from data, we can summarize them:

numerically (e.g., means, standard deviations)
visually (e.g., bar plots, violin plots)

This lecture focuses on the mechanics and construction of summary statistics and graphs using R.

Outline

📊 Plots

Scatterplots
Dot Plot
Histograms
Boxplot
Violin Plots

⚖️ Measures of central tendency:

Mean (Average), Median, Mode

📐 Measures of dispersion:

Variance (and Standard Deviation)
Interquartile Range (IQR)

Types of Data

A brief overview of how each data type is coded in R:

🔢 Numerical (quantitative)

Continuous (numeric)
```
x <- 3.14
class(x)
```
```
[1] "numeric"
```

Discrete (integer)

y <- 5L # L indicates integer
class(y)

[1] "integer"

🏷️ Categorical (qualitative)

nominal (factor)

class(gender <- factor(c("male", "female", "male")))
gender

[1] "factor"
[1] male   female male  
Levels: female male

ordinal (ordered factor)

dose <- factor(c("High", "Medium", "Low", "High"),
        levels = c("Low", "Medium", "High"),
        ordered = TRUE)
class(dose)
dose

[1] "ordered" "factor" 
[1] High   Medium Low    High  
Levels: Low < Medium < High

Categorical as numbers

Oftentimes categorical variables are represented as numbers:

0 = never married, 1 = married, 2 = divorced
1 = Strongly disagree, 2 = disagree, 3 = neutral, 4 = somewhat agree, 5 = strongly agree

While it’s sometimes appropriate to treat these variables as numeric, we will often coerce (i.e., convert) them to the proper data type.

Other R Data Types

Character Data Type (character or string):

Represents text data.

z <- "hello world!"; class(z) # ';' for statement separator

[1] "character"

Logical Data Type (logical or boolean):
- Represents binary or Boolean values (TRUE or FALSE).
- Used for logical conditions and comparisons.
(is_male <- gender == "male")
[1] TRUE FALSE TRUE
class(is_male)
[1] "logical"

Coercion

Example 1: coerce a character value to numeric.

x <- "42"          
x

[1] "42"

class(x)

[1] "character"

y <- as.numeric(x)
y

[1] 42

class(y)

[1] "numeric"

R Structures

If stored in a data frame, you can see the data type of each column using the str() function:

df <- data.frame(x,y, z=c("apple","orange"), w=c(T,F))
str(df)

'data.frame':   2 obs. of  4 variables:
 $ x: chr  "42" "42"
 $ y: num  42 42
 $ z: chr  "apple" "orange"
 $ w: logi  TRUE FALSE

Aside

In my lecture slides, I will print my data frames in a paged table display, which presents the structure of the data frame, including:

The dimensions of the data frame (number of rows and columns).
The names and types of each column.

df

Coercion (cont’d)

Example 2 (Marital status): coerce numeric to factor.

# class: "numeric"
(mstat <- c(0, 2, 1, 1, 0))

[1] 0 2 1 1 0

# class: "factor"
(mstat <- as.factor(mstat))

[1] 0 2 1 1 0
Levels: 0 1 2

Relabeling Factor Levels

We can assign more descriptive labels to each level:

mstat <- factor(
  mstat,           # a vector of data, 
  levels = 0:2,    # the unique set of values allowed for this variable
  labels = c("never married", "married", "divorced") #  new labels (in the same order)
)
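
To check the relabeling, it can help to start again from the raw codes and tabulate the result. The following is a small self-contained sketch; the object names `mstat_raw` and `mstat_lab` are ours, chosen so that `mstat` is not overwritten.

```r
# Raw numeric codes, as in Example 2
mstat_raw <- c(0, 2, 1, 1, 0)

# Map each code to a descriptive label (labels are matched to levels by position)
mstat_lab <- factor(
  mstat_raw,
  levels = 0:2,
  labels = c("never married", "married", "divorced")
)

levels(mstat_lab)   # the three labels, in the order given
table(mstat_lab)    # counts per category: 2 never married, 2 married, 1 divorced
```

Because labels are matched to levels by position, the order of `labels` must follow the order of `levels`.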

Data Structures

In R, there are several data structures for organizing and storing data.

Vectors: a collection of elements of the same data type.

# numeric vector
nums <- c(1.5, 2.3, 3.7)
# character vector
fruit <- c("apple", "orange", "banana")

Matrices: A two-dimensional data structure with elements of the same data type.

(my_matrix <- matrix(1:6, nrow = 2, ncol = 3))

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Lists: A collection of elements that can be of different data types.

(my_list <- 
   list(nums, my_fruit = fruit, tf = TRUE, 
    integer = 42L, my_matrix = my_matrix))

[[1]]
[1] 1.5 2.3 3.7

$my_fruit
[1] "apple"  "orange" "banana"

$tf
[1] TRUE

$integer
[1] 42

$my_matrix
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

Note: vectors and matrices can hold only one data type. If you mix types, R coerces every element to the most flexible type present (logical < integer < numeric < character).

vec <- c(1, "apple", TRUE) # character class
vec

[1] "1"     "apple" "TRUE"

class(vec)

[1] "character"
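
A short sketch of how R resolves mixed types when values are combined with c(). The particular values are arbitrary; the point is the ordering logical, integer, numeric, character.

```r
class(c(TRUE, 2L))        # logical + integer   -> "integer"
class(c(2L, 3.5))         # integer + numeric   -> "numeric"
class(c(3.5, "apple"))    # numeric + character -> "character"

# Coercing back can fail: text that is not a number becomes NA (with a warning)
back <- suppressWarnings(as.numeric(c("1", "apple", "TRUE")))
back                      # 1 NA NA
```

The last line is a reminder that coercion to character is not always reversible, so it is worth checking the class of a vector before computing with it.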

iClicker: Data types

iClicker

Exercise 1 (Data Types in R) What is the correct R data type for representing ordered categories like "Low", "Medium", and "High"?

Numeric
Factor
Ordered Factor
Character

Data Frame

The most common data structure we will be working with is a data.frame.
In R, a data frame is a two-dimensional, tabular data structure similar to a spreadsheet or a SQL table.
Rows represent individual observations or cases, and columns represent variables or attributes associated with those observations.
Columns can have different data types.

Iris Data frame

class(iris)
iris

[1] "data.frame"

Indexing Vectors

In R there are several ways for indexing or subsetting elements in a data structure. Recall objects we created earlier:

fruit

[1] "apple"  "orange" "banana"

For vectors, [i] is used to extract the element at index i.

fruit[3] # extracts the third element of the vector

[1] "banana"

fruit[c(1,3)] # extracts the first and third element

[1] "apple"  "banana"

Indexing Matrices

Recall the matrix we defined earlier:

my_matrix

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

For matrices, [i, j] is used to extract the element in the $i$th row and $j$th column.

my_matrix[1,3] 
my_matrix[2,] # 2nd row
my_matrix[,3] # 3rd column

[1] 5
[1] 2 4 6
[1] 5 6

Indexing Lists

For lists, [[i]] is used to extract the $i$th element.

my_list

[[1]]
[1] 1.5 2.3 3.7

$my_fruit
[1] "apple"  "orange" "banana"

$tf
[1] TRUE

$integer
[1] 42

$my_matrix
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

my_list[[1]]

[1] 1.5 2.3 3.7

my_list[[2]]

[1] "apple"  "orange" "banana"

my_list[[3]]

[1] TRUE

my_list[[4]]

[1] 42

Alternatively, you can use the $ operator to index named elements in a list:

my_list$my_fruit

[1] "apple"  "orange" "banana"

Indexing Data Frames

We can use [i, j] to extract the $i$th row and $j$th column from data frames

iris[,1]

  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9

Similarly the $ operator can be used to extract columns from data frames

iris$Sepal.Length

  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
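
A few more indexing patterns for data frames, sketched with the built-in iris data. The object name `setosa` is ours.

```r
# Rows 1 to 3, two columns selected by name
iris[1:3, c("Sepal.Length", "Species")]

# A single cell: row 2, column 1
iris[2, 1]                       # 4.9

# Logical indexing: keep only the rows where a condition holds
setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)                     # 50

# [[ ]] with a column name is equivalent to $
identical(iris[["Sepal.Length"]], iris$Sepal.Length)   # TRUE
```

Leaving the row position empty (as in `iris[, 1]`) selects all rows, and leaving the column position empty selects all columns.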

R packages

We’ll often use data sets from R packages

Steps to using a dataset from an R package

Install the package using install.packages("packagename")
Load the package into your R session via library(packagename)
Access the dataset by name

help files

Dataset documentation is available through R’s help system via ?"objectname"

loan data from openintro

Note

Example 3 (Dataset from openintro) Let’s load the births14 data set from the openintro package and view its help file.

Installing Packages

Install the package
1. Open up RStudio
2. Navigate to the Console (i.e. “Interactive Mode”)
3. Type the following command in your Console

install.packages("openintro")

Never use install.packages() in a .qmd/.rmd/.r file

Package installation should be done once, in the Console. If you include install.packages() in your Quarto documents, you will re-download the package every time you render your document.

A screenshot of my R session where I have installed the openintro package using the install.packages() function in the Console

Never use the install.packages() function in qmd/rmd files or R scripts. This function should exclusively be run in the Console.

Loading Packages

Load the package in your R session

library(openintro) # load the library

Running the above command will make all the package’s datasets and functions available in your session.

Warning

Unlike install.packages(), which only needs to be run once, the library() command must be run every time you want to use a package.

Accessing the Data

Now that we have loaded the library, we can access its datasets (and functions).

births14
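
If a script should fail gracefully when a package is missing, one option is to test for the package without installing it. This is only a sketch of one possible pattern; the variable name `has_openintro` is ours.

```r
# requireNamespace() returns TRUE/FALSE without attaching the package
has_openintro <- requireNamespace("openintro", quietly = TRUE)

if (has_openintro) {
  library(openintro)
  print(dim(births14))   # number of rows and columns
} else {
  message("Run install.packages(\"openintro\") once in the Console first.")
}
```

This keeps the installation step out of the script while still giving a helpful message.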

Help Files

In R, help() and ? (shortcut) are used to open the help page (documentation) for a function, dataset, or package.
For example, help(mean) or ?mean shows how the function works, its arguments, and examples.

Important

The help file documents a short description and source of the data, variable definitions, units, and known limitations. Always consult the help file before analyzing any dataset to ensure correct interpretation of variables and context.

The help file (?births14) provides the authoritative description of the dataset, including its source and variable definitions.

⚠️ Two R Sessions

🖥️ Console (Interactive Mode)

Used for experimenting and testing
Resets when you restart R

📄 Quarto Render

Starts a fresh R session
Does not see Console code
Only runs what’s in the .qmd file

Objects created during Quarto rendering are not available in the Console.

Objects created in the Console are not available when rendering a Quarto document.

Data description

Here are some other useful functions for inspecting your data:

str(births14)   # prints data type and the first few values of each variable.

tibble [1,000 × 13] (S3: tbl_df/tbl/data.frame)
 $ fage          : int [1:1000] 34 36 37 NA 32 32 37 29 30 29 ...
 $ mage          : num [1:1000] 34 31 36 16 31 26 36 24 32 26 ...
 $ mature        : chr [1:1000] "younger mom" "younger mom" "mature mom" "younger mom" ...
 $ weeks         : num [1:1000] 37 41 37 38 36 39 36 40 39 39 ...
 $ premie        : chr [1:1000] "full term" "full term" "full term" "full term" ...
 $ visits        : num [1:1000] 14 12 10 NA 12 14 10 13 15 11 ...
 $ gained        : num [1:1000] 28 41 28 29 48 45 20 65 25 22 ...
 $ weight        : num [1:1000] 6.96 8.86 7.51 6.19 6.75 6.69 6.13 6.74 8.94 9.12 ...
 $ lowbirthweight: chr [1:1000] "not low" "not low" "not low" "not low" ...
 $ sex           : chr [1:1000] "male" "female" "female" "male" ...
 $ habit         : chr [1:1000] "nonsmoker" "nonsmoker" "nonsmoker" "nonsmoker" ...
 $ marital       : chr [1:1000] "married" "married" "married" "not married" ...
 $ whitemom      : chr [1:1000] "white" "white" "not white" "white" ...

head

head(births14)  # display the first 6 rows of a data frame

head(births14, n=2)  # display the first 2 rows of a data frame

View spreadsheet

View(births14)  # used in RStudio to open a data viewer (i.e. spreadsheet)

Data Structure

A benefit of inspecting a data frame with str() is that we learn the data type of each variable.
It is always important to check that the data are coded as we expect (especially when reading data into R using read.csv).
The data type will dictate which summary statistics and visualizations are most appropriate.
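
As a small illustration of checking and correcting types after reading data, consider the sketch below. The CSV content and the names `csv_text` and `scores` are made up for this example.

```r
# A tiny CSV held in a string so the example needs no external file
csv_text <- "id,grade,score
1,A,90
2,B,85
3,A,NA"

scores <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(scores)                      # grade is read as character, score as integer

# grade is categorical, so coerce it to a factor
scores$grade <- factor(scores$grade)
sapply(scores, class)
```

After the coercion, functions such as table() and barplot() treat `grade` as a categorical variable with two levels.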

Plotting data

loan example

In this section we will consider the loan50 dataset from the openintro package.
It represents 50 loans made through the Lending Club platform, which allows individuals to lend to other individuals.
This is a subset of the loans_full_schema dataset from the same package; see ?loans_full_schema for details.

Structure of `loan50`

Let’s consider the structure of the loan50 data:

str(loan50)

tibble [50 × 18] (S3: tbl_df/tbl/data.frame)
 $ state                  : Factor w/ 51 levels "","AK","AL","AR",..: 32 6 41 6 36 16 35 25 11 11 ...
 $ emp_length             : num [1:50] 3 10 NA 0 4 6 2 10 6 3 ...
 $ term                   : num [1:50] 60 36 36 36 60 36 36 36 60 60 ...
 $ homeownership          : Factor w/ 3 levels "rent","mortgage",..: 1 1 2 1 2 2 1 2 1 2 ...
 $ annual_income          : num [1:50] 59000 60000 75000 75000 254000 67000 28800 80000 34000 80000 ...
 $ verified_income        : Factor w/ 4 levels "","Not Verified",..: 2 2 4 2 2 3 3 2 2 3 ...
 $ debt_to_income         : num [1:50] 0.558 1.306 1.056 0.574 0.238 ...
 $ total_credit_limit     : int [1:50] 95131 51929 301373 59890 422619 349825 15980 258439 87705 330394 ...
 $ total_credit_utilized  : int [1:50] 32894 78341 79221 43076 60490 72162 2872 28073 23715 32036 ...
 $ num_cc_carrying_balance: int [1:50] 8 2 14 10 2 4 1 3 10 4 ...
 $ loan_purpose           : Factor w/ 14 levels "","car","credit_card",..: 4 3 4 3 5 5 4 3 3 4 ...
 $ loan_amount            : int [1:50] 22000 6000 25000 6000 25000 6400 3000 14500 10000 18500 ...
 $ grade                  : Factor w/ 8 levels "","A","B","C",..: 3 3 6 3 3 3 5 2 2 4 ...
 $ interest_rate          : num [1:50] 10.9 9.92 26.3 9.92 9.43 ...
 $ public_record_bankrupt : int [1:50] 0 1 0 0 0 0 0 0 0 1 ...
 $ loan_status            : Factor w/ 7 levels "","Charged Off",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ has_second_income      : logi [1:50] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ total_income           : num [1:50] 59000 60000 75000 75000 254000 67000 28800 80000 34000 192000 ...

state: Two-letter state code.

emp_length: Number of years in the job, rounded down. If longer than 10 years, then this is represented by the value 10.

term: The number of months of the loan the applicant received.

homeownership: The ownership status of the applicant’s residence.

annual_income: Annual income

verified_income: Type of verification of the applicant’s income.

debt_to_income: Debt-to-income ratio.

total_credit_limit: Total available credit, e.g. if only credit cards, then the total of all the credit limits. This excludes a mortgage.

total_credit_utilized: Total credit balance, excluding a mortgage.

num_cc_carrying_balance: Number of credit cards that are carrying a balance.

loan_purpose: The category for the purpose of the loan.

loan_amount: The amount of the loan the applicant received.

grade: Grade associated with the loan.

interest_rate: Interest rate of the loan the applicant received.

public_record_bankrupt: Number of bankruptcies listed in the public record for this applicant.

loan_status: Status of the loan.

has_second_income: logical

total_income: self explanatory (but not in the help file?)

Scatterplots

Scatterplots are useful for visualizing the relationship between two numerical variables. Create scatterplots in R using:

plot(x = loan50$annual_income, y = loan50$loan_amount)

or alternatively:

attach(loan50)
plot(annual_income, loan_amount, xlab = "Annual Income (in dollars $)",
     ylab = "Loan Amount (in dollars $)", 
     main = "Scatterplot of Annual Income vs Loan Amount")

Important

When you attach() a dataframe, the column names of that data frame become directly accessible without specifying the data frame name.
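
An attached data frame stays on the search path until it is detached, which can lead to name clashes. As a sketch using the built-in iris data, here is attach() paired with detach(), and with() as an alternative that exposes the columns for a single expression only. The names `m1` and `m2` are ours.

```r
attach(iris)
m1 <- mean(Sepal.Length)     # column found via the search path
detach(iris)                 # remove iris from the search path again

# with() evaluates one expression with the columns in scope
m2 <- with(iris, mean(Sepal.Length))

all.equal(m1, m2)            # TRUE
```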

Scatterplots

Scatterplot in ggplot

Alternatively, we can create a scatterplot using the popular ggplot2 package:

# Load the ggplot2 library
library(ggplot2)

ggplot(loan50, aes(x = annual_income, y = loan_amount)) +
  geom_point() +
  scale_x_continuous(
    labels = scales::dollar_format()
  ) +
  scale_y_continuous(
    labels = scales::dollar_format()
  ) +
  labs(
    title = "Scatterplot of Annual Income vs Loan Amount",
    x = "Annual Income",
    y = "Loan Amount"
  )

Scatterplot in ggplot

Dot Plot

A dot plot is a one-variable scatterplot

stripchart(interest_rate, method = "stack", pch = 19, offset = .5, at = 0, 
           col = "steelblue", xlab = "Interest Rates (in %)", cex = 2)

An example of a dot plot using the interest rates of 50 loans.

Notice that some values appear more than once, and that the values are not integers:

interest_rate[1:5]

[1] 10.90  9.92 26.30  9.92  9.43

table(interest_rate)

interest_rate
 5.31  5.32  6.08  6.71  7.34  7.35  7.96  7.97  9.43  9.44  9.92  9.93 10.42 
    2     1     3     2     1     2     3     1     2     3     4     2     2 
 10.9 10.91 11.98 12.62 14.08 15.04 16.02 17.09 18.06 18.45 19.42    20 21.45 
    2     3     1     3     1     1     1     3     1     1     1     1     1 
24.85  26.3 
    1     1

Stacked dot plot

# Create a stacked dot plot using ggplot2
ggplot(loan50, aes(x = interest_rate)) +
    geom_dotplot() + labs(x = "Interest Rate (in %)")

Mean (Average)

The mean, often called the average, is a common way to measure the center of a distribution of data.
To compute the mean interest rate, we add up all the interest rates and divide by the number of observations.
In R this is accomplished using the mean() function.
```
(mean_value = mean(interest_rate))
```
```
[1] 11.5672
```

Mean (average) formula

Mean (Average) formula for sample vs. population

The sample mean is often labeled $\overline{x}$. The sample mean can be computed as the sum of the observed values divided by the number of observations:

\[ \overline{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \]

where $x_1, \dots, x_n$ represent the $n$ observed values. The population mean is given by

\[ \mu = \frac{x_1 + x_2 + \dots + x_N}{N} \]
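
To connect the formula to R, here is the sample mean computed by hand for the first five interest rates printed earlier. The names `x`, `n`, and `xbar` are ours.

```r
x <- c(10.90, 9.92, 26.30, 9.92, 9.43)   # first five interest rates shown earlier

n <- length(x)
xbar <- sum(x) / n           # (x_1 + ... + x_n) / n
xbar                         # 13.294
all.equal(xbar, mean(x))     # TRUE
```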

Sample vs. Population Mean

The population mean is also computed the same way but is denoted as $\mu$. It is often not possible to calculate $\mu$ since population data are rarely available.
The sample mean is a sample statistic, and serves as a point estimate of the population mean.
This estimate may not be perfect, but if the sample is good (representative of the population), it is usually a pretty good estimate.
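
The idea of a point estimate can be sketched with a simulation. The population below is entirely hypothetical (generated with rnorm()), so the numbers are illustrative only; the seed and parameters are arbitrary choices.

```r
set.seed(205)   # arbitrary seed, for reproducibility

# A hypothetical population of N = 10,000 values
population <- rnorm(10000, mean = 12, sd = 5)
mu <- mean(population)            # population mean (known only because we simulated it)

# A simple random sample of n = 50
samp <- sample(population, size = 50)
xbar <- mean(samp)                # sample mean: a point estimate of mu

c(mu = mu, xbar = xbar, error = xbar - mu)
```

Rerunning the last four lines with a different seed gives a different sample and hence a different estimate, which is why the sample mean is itself a random quantity.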

Dotplot with mean

Code

# Create a dot plot with mean using ggplot2
ggplot(loan50, aes(x = interest_rate)) +
  geom_dotplot() + labs(x = "Interest Rate (in %)") +
  geom_point(aes(x = mean_value, y = 0), color = "red", size = 5, shape = 17)

A stacked dot plot of interest_rate for the loan50 data set. The distribution’s mean is shown as a red triangle.

Median

The median is a measure of central tendency that represents the middle value in a dataset when it is ordered from least to greatest.
It is a robust statistic, meaning that it is not sensitive to extreme values or outliers in the data.
The median is also referred to as the 50th percentile because it divides the dataset into two equal halves. Fifty percent of the observations are below the median, and fifty percent are above it.
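
A minimal sketch of what robustness means in practice, using two made-up vectors that differ in a single extreme value:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(1, 2, 3, 4, 500)   # same data with one extreme value

mean(x);   mean(y)        # 3 and 102: the mean is pulled toward the outlier
median(x); median(y)      # 3 and 3:   the median does not move
```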

Dot plot with median

Code

# Create a dot plot with mean and median using ggplot2

median_value = median(interest_rate)
ggplot(loan50, aes(x = interest_rate)) +
  geom_dotplot() + labs(x = "Interest Rate (in %)") +
  geom_point(aes(x = mean_value, y = 0), color = "red", size = 5, shape = 17) +
  geom_point(aes(x = median_value, y = 0), color = "green", size = 5, shape = 15)

A stacked dot plot of interest_rate for the loan50 data set. The distribution’s mean is shown as a red triangle, the median is shown as a green square.

Mode

The mode is the value or category in a dataset that occurs most frequently.

For Numerical Data: The mode is the value that appears most often.
- e.g. for {2, 3, 3, 4, 4, 4, 5}, the mode is 4.
For Categorical Data: The mode represents the category with the highest frequency.
- e.g. for {red, blue, blue, green, red, blue}, the mode is blue.
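
Base R has no built-in function for the statistical mode, so one has to be written. The sketch below uses table() and returns every value tied for the highest frequency; the function name `all_modes` is hypothetical.

```r
# Return every value tied for the highest frequency
all_modes <- function(x) {
  counts <- table(x)                      # frequency of each distinct value
  names(counts)[counts == max(counts)]    # keep the value(s) with the top count
}

all_modes(c(2, 3, 3, 4, 4, 4, 5))                           # "4"
all_modes(c("red", "blue", "blue", "green", "red", "blue")) # "blue"
all_modes(c("a", "a", "b", "b", "c"))                       # "a" "b" (a tie)
```

Note that the result is returned as character strings, since it is taken from the names of the table; wrap it in as.numeric() for numerical data if needed.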

Dot plot with mode

Code

# Simple mode function
mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

mode_value <- mode(loan50$interest_rate)

# Dot plot with mean, median, and mode
ggplot(loan50, aes(x = interest_rate)) +
  geom_dotplot() +
  labs(x = "Interest Rate (in %)") +
  geom_point(aes(x = mean_value, y = 0),
             color = "red", size = 5, shape = 17) +
  geom_point(aes(x = median_value, y = 0),
             color = "green", size = 5, shape = 15) +
  geom_point(aes(x = mode_value, y = 0),
             color = "blue", size = 5, shape = 18)

mean_value

[1] 11.5672

median_value

[1] 9.93

mode_value

[1] 9.92

Histograms

Dot plots draw a dot for every single observation.
This is useful for small data sets, but they can become hard to read with larger samples.
Rather than showing the value of each observation, we prefer to think of each value as belonging to a bin. For example, the 50 interest rates can be tabulated into bins of width 5:

5 to 10	10 to 15	15 to 20	20 to 25	25 to 30
26	12	9	2	1

Histogram in R

hist(interest_rate)

Bins

Histograms provide a view of the data density. Higher bars represent where the data are relatively more common.
Usually R will do a pretty good job of picking the bin size but we can play around with it.
The breaks argument represents the suggested number of bins you want the data to be divided into. R will then automatically determine the appropriate bin width.
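
To see the difference between suggesting a number of bins and fixing the bin edges, the sketch below rebuilds the 50 interest rates from the frequency table printed earlier, so that it runs without the openintro package. The names `rates`, `counts`, `rate_values`, `h_exact`, and `h_suggest` are ours.

```r
# Rebuild the 50 interest rates from the frequency table printed earlier
rates  <- c(5.31, 5.32, 6.08, 6.71, 7.34, 7.35, 7.96, 7.97, 9.43, 9.44,
            9.92, 9.93, 10.42, 10.90, 10.91, 11.98, 12.62, 14.08, 15.04,
            16.02, 17.09, 18.06, 18.45, 19.42, 20.00, 21.45, 24.85, 26.30)
counts <- c(2, 1, 3, 2, 1, 2, 3, 1, 2, 3,
            4, 2, 2, 2, 3, 1, 3, 1, 1,
            1, 3, 1, 1, 1, 1, 1, 1, 1)
rate_values <- rep(rates, times = counts)

# A vector of breaks fixes the bin edges exactly
h_exact <- hist(rate_values, breaks = seq(5, 30, by = 5), plot = FALSE)
h_exact$counts        # 26 12 9 2 1

# A single number is only a suggestion; R picks "pretty" edges near it
h_suggest <- hist(rate_values, breaks = 12, plot = FALSE)
h_suggest$breaks
```

With plot = FALSE, hist() returns the bin edges and counts without drawing anything, which is a convenient way to check how the data were binned. By default the bins are closed on the right, so the value 20 is counted in the 15 to 20 bin.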

More breaks

hist(interest_rate, breaks = 12, main = "", xlab = "Interest Rate (in %)")

A histogram of interest_rate. This distribution is strongly skewed to the right

4 to 6	6 to 8	8 to 10	10 to 12	12 to 14	14 to 16	16 to 18	18 to 20	20 to 22	22 to 24	24 to 26	26 to 28
3	12	11	8	3	2	4	4	1	0	1	1

Shapes of Distributions

Histograms are especially convenient for describing the shape of the data distribution.

Symmetric: The distribution is balanced on both sides of the center.
Skewed: The distribution has a longer tail on one side, either to the right (positive skew) or left (negative skew).
Uniform: The distribution is flat and lacks pronounced peaks or troughs.

Central tendencies for skewed data

The median is particularly useful when describing the center of skewed or asymmetrical datasets.
For right/positively skewed distributions, the median is typically less than the mean (see Dot plot with median).
For left/negatively skewed distributions, the median is typically greater than the mean.

Characteristics of Histograms

We can also identify the “center” of the distribution.

For a symmetric distribution, the center is typically the peak of the density.
For skewed distributions, the mean is pulled toward the longer tail, so the center may sit away from the peak.

By examining the width of the distribution, we can also use histograms to assess how “spread out” the values are.

A wider distribution indicates higher dispersion, while a narrower distribution suggests lower dispersion.

Histogram with mean/median

Consider the final exam score for a STAT course:

Code

set.seed(888)
test_scores <- 100*rbeta(500, shape1 = 4.5, shape2 = 2)
hist(test_scores, main = "Left-Skewed Distribution", xlab = "Final Exam Score (in %)", col = "lightblue", border = "black")

# Add vertical lines for mean and median
abline(v = mean(test_scores), col = "red", lty = 2, lwd = 2)
abline(v = median(test_scores), col = "blue", lty = 2, lwd = 2)

# Add legend
legend("topleft", legend = c("Mean", "Median"), col = c("red", "blue"), lty = 2, lwd = 2)

Modality

Modality describes the number of peaks (modes) in a distribution’s shape. These peaks correspond to local maxima in the frequency or probability density function.
A distribution can be:
1. Unimodal: One clear peak (e.g., Normal distribution).
2. Bimodal: Two distinct peaks (e.g., heights of males and females combined).
3. Multimodal: More than two peaks.
4. Uniform: No distinct peaks; the frequency is evenly spread.

Summary

Descriptors for the modality of a distribution

Descriptors for the skewness of a distribution

Quartiles

Quartiles are values that divide a dataset into four equal parts, each representing 25% of the data. There are three quartiles:

First Quartile (Q1): Also known as the lower quartile, Q1 represents the 25th percentile of the data. It is the value below which 25% of the data falls.
Second Quartile (Q2): is the median of the dataset. It represents the 50th percentile, and 50% of the data falls below and 50% above this value.
Third Quartile (Q3): Also known as the upper quartile, Q3 represents the 75th percentile of the data. It is the value below which 75% of the data falls.

Interquartile Range

The Interquartile Range (or IQR for short) is a measure of statistical dispersion that describes the spread or variability within the middle 50% of a dataset. It is defined:

\[ IQR = Q3 - Q1 \]

It is particularly useful when dealing with skewed or non-normally distributed data, as it is less sensitive to extreme values (outliers) than the full range or Standard Deviation.
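
The quartiles and the IQR can be computed with quantile() and IQR(). The sketch below uses made-up vectors; note that R offers several quantile algorithms, and the values in the comments are for the default one.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
q <- quantile(x, probs = c(0.25, 0.50, 0.75))   # Q1, Q2 (median), Q3
q                              # 3, 5, 7
unname(q[3] - q[1])            # 4, computed from the definition
IQR(x)                         # 4, built-in shortcut

# Replace the largest value by an extreme one
x_out <- c(1, 2, 3, 4, 5, 6, 7, 8, 900)
IQR(x_out)                     # still 4
sd(x); sd(x_out)               # the standard deviation grows substantially
```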

Boxplot

A box plot summarizes a data set using five statistics while also plotting unusual observations.
The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.

You can create one in R using:

boxplot(interest_rate)

Whiskers and Outliers

The so-called whiskers extend from the edges of the box to the minimum and maximum values within a specified range:

\[ [Q_1 - 1.5 \times IQR, Q_3 + 1.5 \times IQR] \]

Individual data points outside of the range are flagged as suspected outliers and are often plotted individually.
They can be important to identify because they may indicate unusual observations or errors.
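
The whisker rule can be worked through by hand. The sketch below uses a made-up vector with one extreme value. One caveat: boxplot() computes the box edges from hinges, which can differ slightly from the output of quantile(); for this vector the two agree.

```r
x <- c(2, 3, 4, 5, 6, 7, 8, 9, 30)

q1 <- unname(quantile(x, 0.25))    # 4
q3 <- unname(quantile(x, 0.75))    # 8
iqr <- q3 - q1                     # 4

lower_fence <- q1 - 1.5 * iqr      # -2
upper_fence <- q3 + 1.5 * iqr      # 14

# Points outside the fences are flagged as suspected outliers
x[x < lower_fence | x > upper_fence]            # 30

# Whiskers stop at the most extreme observations inside the fences
range(x[x >= lower_fence & x <= upper_fence])   # 2 and 9

# boxplot.stats() reports what boxplot() would draw
boxplot.stats(x)$out                            # 30
```

Notice that the upper whisker ends at 9, the largest observation inside the fence, and not at the fence value 14 itself.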

A vertical dot plot, where points have been horizontally stacked, next to a labeled box plot for the interest rates of the 50 loans. Figure 2.10 (Diez, Barr, and Çetinkaya-Rundel 2016)

Shape of boxplot

Boxplots can also help to identify skewness:

Variance

Variance

Variance is roughly the average squared deviation from the mean. For a sample of size $n$ the sample variance is calculated as

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

The population variance is given by:

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \]

Standard Deviation

The standard deviation is the square root of the variance

Sample Standard deviation

\[ s = \sqrt{s^2} \]

Population standard deviation

\[ \sigma = \sqrt{\sigma^2} \]

The standard deviation represents the typical deviation of observations from the mean.
Standard deviation has the same units as the data.
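
The variance formulas can be checked step by step in R. The data below are made up; for the population formula we treat the same eight values as if they were the entire population, so that $\mu$ equals their mean.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
xbar <- mean(x)                          # 5

ss <- sum((x - xbar)^2)                  # sum of squared deviations: 32
s2 <- ss / (n - 1)                       # sample variance (divides by n - 1)
sigma2 <- ss / n                         # population formula (divides by n): 4

all.equal(s2, var(x))                    # TRUE: var() uses n - 1
all.equal(sqrt(s2), sd(x))               # TRUE: sd() is the square root of var()
```

R's var() and sd() always use the sample versions, so the population versions have to be computed manually as above.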

iClicker: Robustness

Robustness

Which measure of central tendency is least affected by extreme values in a dataset?

Mean
Median
Mode
Standard Deviation

iClicker: Shape

Distribution shape

Which of the following best describes the shape of the histogram?

right skewed
left skewed
symmetric
uniform

iClicker Boxplot

Boxplot Whiskers

In a boxplot, the whiskers extend to:

The minimum and maximum values in the dataset
The 25th and 75th percentiles
The range defined by $[Q_1 - 1.5 \times IQR, Q_3 + 1.5 \times IQR]$
None of the above

iClicker Modality

Modality

Identify the number of modes (peaks) in the distribution.

unimodal
bimodal
multimodal
uniform

Smoothed Density Plots

Histograms depend on bin choices, which can obscure shape.
An alternative is to plot a smooth curve that represents how the data are distributed.
The curve describes the overall shape of the distribution.
Like a probability density function (PDF), the total area under the curve is 1.
Hence it is interpreted as a density, not a count.
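
The claim about the area can be checked numerically. This sketch uses the built-in faithful data purely for illustration and approximates the area with a simple Riemann sum; the names `d`, `step`, and `area` are ours.

```r
x <- faithful$eruptions          # built-in data, used only for illustration
d <- density(x)                  # kernel density estimate on an evenly spaced grid

# Riemann-sum approximation of the area under the curve
step <- diff(d$x)[1]             # spacing between neighbouring grid points
area <- sum(d$y) * step
area                             # close to 1
```

The result is only approximately 1 because the curve is evaluated on a finite grid.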

Code

d <- density(interest_rate)
par(mar=c(4,5,0,0))
plot(d, xlab = "Interest Rate (in %)", ylab = "Density", main = "")

Figure 1: A density plot of the interest rates for the loan50 data using base plotting in R. Use ?plot.density and ?density to learn more.

Code

ggplot(loan50, aes(x = interest_rate)) +
  geom_density() +
  labs(x = "Interest Rate (in %)", y = "Density")

Figure 2: A density plot of the interest rates for the loan50 data using ggplot.

🤔 Discussion Question

What difference do you notice between the two smoothed density plots?

Violin Plots

Boxplots summarize data, but hide distribution shape
Smoothed density plots show shape, but are hard to compare across groups

Violin plots combine both ideas:

A violin plot is a smoothed density plot, mirrored to form a symmetric shape
It shows the shape of the distribution
It is often overlaid with summaries (median and quartiles)

Code

ggplot(loan50, aes(x = "", y = interest_rate)) +
  geom_violin() +
  labs(y = "Interest Rate (in %)")

Figure 3: A violin plot of the interest rates for the loan50 data. The width of the violin reflects the density of values.

Code

ggplot(loan50, aes(x = "", y = interest_rate)) +
  geom_violin(fill = "lightgray", trim = FALSE) +
  labs(x = NULL, y = "Interest Rate (in %)") +
  theme_minimal()

Figure 4: A violin plot of the interest rates for the loan50 data with the trim removed and a different theme.

Code

ggplot(loan50, aes(x = "", y = interest_rate)) +
  geom_violin(fill = "lightgray", trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.shape = NA) +
  geom_jitter(width = 0.08, alpha = 0.5, size = 1.8) +
  labs(x = NULL, y = "Interest Rate (in %)") +
  theme_minimal()

Figure 5: A violin plot with an embedded boxplot and data showing the distribution of interest rates for the loan50 data.

Categorical Data

email

Let’s consider the email dataset (from the openintro package), which comprises 3921 incoming emails received by a single email account during the first three months of 2012:

data(email); email

Frequency Table

A table for a single variable is called a frequency table:

tab <- table(email$number)
tab


 none small   big 
  549  2827   545

# TIP: nicer printout
knitr::kable(t(tab))

none	small	big
549	2827	545

We can create a bar plot in R using barplot(height):

barplot(tab)

Bar Plot

A bar plot is a common way to display a single categorical variable.

Code

par(mar=c(4.1,4,0,0))
par(mfrow=c(1,2))

tab <- table(email$number)
barplot(tab, ylab = "Frequency", 
        xlab = "Number in email")


tabTemp <- tab/sum(tab)
barplot(tabTemp, ylab = "Proportion", 
        xlab = "Number in email")

We can either plot according to frequency/count (left) or proportion (right).
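
Converting counts to proportions can also be done with prop.table(). The sketch below re-enters the counts from the frequency table above so that it runs on its own; the names `tab_number` and `props` are ours.

```r
# Counts copied from the frequency table above
tab_number <- as.table(c(none = 549, small = 2827, big = 545))

props <- prop.table(tab_number)   # same as tab_number / sum(tab_number)
round(props, 3)                   # 0.140 0.721 0.139
sum(props)                        # 1
```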

Comments

In the previous visualization, the categories had a natural order (ordinal).
For unordered (nominal) categories, it’s common to order the categories from largest to smallest. For example:

Code

library(dplyr)
library(ggplot2)

loan50 %>%
  filter(grade != "") %>%        # drop empty level if present
  count(grade) %>%
  ggplot(aes(x = reorder(grade, -n), y = n)) +
  geom_col() +
  labs(
    x = "Loan Grade",
    y = "Count",
    title = "Distribution of Loan Grades"
  )

Comments

Contingency Tables

A table that summarizes data for two categorical variables is called a contingency table.
A contingency table is a tabular representation of the joint distribution of two or more categorical variables.

e.g. consider the following categorical variables from email:
- number: what number, if any, is present in the email: none, small (under 1 million), big (a number > 1 million)
- spam: indicator for whether the email was spam (1) or not (0).

email contingency table

(tab <- table(email$spam, email$number))

   
    none small  big
  0  400  2659  495
  1  149   168   50
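
Raw counts in a contingency table are often easier to compare after converting them to proportions within one of the variables. The sketch below re-enters the counts printed above so that it runs without the openintro package; the name `tab_spam` is ours.

```r
tab_spam <- matrix(
  c(400, 149, 2659, 168, 495, 50),   # filled column by column
  nrow = 2,
  dimnames = list(spam = c("0", "1"), number = c("none", "small", "big"))
)
tab_spam <- as.table(tab_spam)

addmargins(tab_spam)                       # adds row and column totals

# Proportion of spam within each category of number (each column sums to 1)
round(prop.table(tab_spam, margin = 2), 3)
```

Using margin = 1 instead would give proportions within each row, that is, the distribution of number among spam and among non-spam emails.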

🔍 Tip: Pretty tables

We can make nicer-looking tables using additional R packages.

Code

library(kableExtra)
contingency_table <- addmargins(tab, margin = 1:2)
rownames(contingency_table) <- c("not spam", "spam", "Total")
colnames(contingency_table)[4] <- "Total"
knitr::kable(contingency_table, "html") |> add_header_above(c(" ", "number" = 3, ""))

	number
	none	small	big	Total
not spam	400	2659	495	3554
spam	149	168	50	367
Total	549	2827	545	3921

Stacked Bar plot

Stacked bar plots visualize two categorical variables

They help us see:

How categories break down within another category
Relative contributions of each subgroup to a total

Code

barplot(tab,
        legend.text = TRUE,
        args.legend = list(
          title = "Spam status",
          x = "topright",
          bty = "n"
        ))

Side-by-side bar plot

Alternatively, you may prefer to situate your bars side-by-side.

# reduce white space in margins
par(mar=c(5.1,4.1,0.1,0.1))

# COL vector from openintro
# see ?COL for details
barplot(tab, beside = TRUE, 
        col = c(COL[1], COL[3]))

# Create a legend
legend("topleft",
  fill = COL[c(1,3)],
  legend = c("not spam", "spam"))

Side-by-side Boxplot

Similarly, it can be useful to look at side-by-side boxplots.

# plotted output on the next slide
boxplot(
  annual_income~homeownership, 
  data = loan50,
  ylab = "Annual Income")

formula notation

The formula notation y ~ x specifies that the variable “y” should be plotted against the levels of the “x” variable.
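
The same formula notation can be tried on the built-in iris data. In this sketch, plot = FALSE asks boxplot() to return the statistics it would draw rather than produce a figure; the name `b` is ours.

```r
# Sepal.Length (numeric) split by Species (factor), using the built-in iris data
b <- boxplot(Sepal.Length ~ Species, data = iris, plot = FALSE)
b$names          # one box per level of Species
b$stats[3, ]     # the median of each group

# The same formula notation works in other functions, e.g. aggregate()
aggregate(Sepal.Length ~ Species, data = iris, FUN = median)
```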

Side-by-side Boxplot

Side-by-side Violin plot

Code

library(scales)

ggplot(loan50, aes(x = homeownership, y = annual_income)) +
  geom_violin(fill = "lightgray", trim = FALSE) +
  geom_boxplot(width = 0.1, outlier.shape = NA) +
  scale_y_continuous(labels = dollar) +
  labs(
    x = "Homeownership Status",
    y = "Annual Income"
  ) +
  theme_minimal()
