Review Session

DATA 101: Making Predictions with Data

Ladan Tazik

University of British Columbia Okanagan

Introduction

Today, we will review the concepts covered in Lectures 12 through 14:

  • Regression Problem

    • KNN Regression

    • Linear Regression

  • Clustering

Regression Problem

Regression involves predicting the value of a numeric variable using one or more variables.

Regression vs. Classification:

  • Both involve using past data to predict the target value for future observations.
  • Regression predicts continuous numeric values, while classification predicts categorical labels.

There are many statistical and machine learning methods for regression; of those, we covered K-nearest neighbors (KNN) and linear regression.

KNN for Regression

Key Concepts:

  • It assumes that similar data points have similar target values.

  • Predicts the target value of a new observation by averaging the responses of its K nearest neighbors in the training dataset.

  • To determine the nearest neighbors, KNN calculates the Euclidean distance between data points in the feature space.

  • The choice of K (the number of neighbors) significantly impacts the model’s performance. Cross-validation is used to find the best value of K.
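The prediction rule described above can be sketched in a few lines of base R. This is a minimal illustration, not the kknn implementation used later in these slides; the toy data and the helper name `knn_predict` are made up for the example.

```r
# Toy training data (hypothetical values, for illustration only)
train_x <- matrix(c(1, 1,
                    2, 1,
                    3, 3,
                    6, 5,
                    7, 7), ncol = 2, byrow = TRUE)
train_y <- c(10, 12, 20, 40, 50)

# Predict the response for one new point by averaging its k nearest neighbours
knn_predict <- function(new_x, train_x, train_y, k) {
  # Euclidean distance from new_x to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # indices of the k smallest distances
  nearest <- order(dists)[seq_len(k)]
  mean(train_y[nearest])
}

pred <- knn_predict(c(2, 2), train_x, train_y, k = 3)
pred  # the three nearest responses are 12, 10 and 20, so the prediction is 14
```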

Steps for KNN Regression Model Fitting

  1. Divide the dataset into a training set and a test set. Stratified splitting helps maintain a balanced representation of target values.

  2. Perform cross-validation to determine K, using the training set only.

  3. Train the model with the chosen K.

  4. Assess the model’s performance on the test set using appropriate regression metrics such as RMSPE or MSE. Lower values indicate better predictive accuracy.

\[\begin{equation} \text{RMSPE} = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n}(y_i - \hat{y}_i)^2} \end{equation}\]
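As a quick check of the formula, RMSPE can be computed by hand in base R. The function name `rmspe` and the three observations are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Root mean squared prediction error, following the formula above
rmspe <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

y     <- c(3, 5, 7)   # observed values (toy data)
y_hat <- c(2, 5, 9)   # predicted values (toy data)

result <- rmspe(y, y_hat)
result  # sqrt((1 + 0 + 4) / 3), roughly 1.29
```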

Considerations:

  • Scaling features is crucial when using KNN, as it relies on distance measures.

  • Impact of K:

    • When K is too small, the model overfits: it follows the noise in the training data closely, which results in poor generalization to new data.

    • When K is too large, the model underfits: it lacks the flexibility to capture underlying patterns, which leads to poor performance on both training and new data.
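The two extremes of K can be seen on a small simulated data set. The sketch below uses one predictor and a hand-written helper (`knn_predict_1d`, a hypothetical name); the simulated data are an assumption for illustration only.

```r
set.seed(1)
x <- 1:20
y <- sin(x / 3) + rnorm(20, sd = 0.3)   # noisy toy data

# 1-D KNN regression: average the responses of the k closest x values
knn_predict_1d <- function(x_new, x, y, k) {
  nearest <- order(abs(x - x_new))[seq_len(k)]
  mean(y[nearest])
}

# RMSE of the model evaluated on its own training data
train_rmse <- function(k) {
  preds <- sapply(x, function(x0) knn_predict_1d(x0, x, y, k))
  sqrt(mean((y - preds)^2))
}

rmse_k1  <- train_rmse(1)    # each point is its own nearest neighbour, so the error is 0
rmse_k20 <- train_rmse(20)   # every prediction is mean(y), a flat line
c(k1 = rmse_k1, k20 = rmse_k20)
```

A training error of zero at K = 1 does not mean the model is good: it has memorized the noise. At K = 20 (all points) the model ignores x entirely.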

Pros and Cons of KNN Regression

Pros

  • Offers a straightforward and easy-to-understand approach.
  • Requires minimal assumptions about the data’s underlying distribution.
  • Excels in capturing non-linear relationships, making it suitable for data with curved or complex patterns.

Cons

  • Becomes computationally intensive as the size of the training dataset increases.
  • May exhibit suboptimal performance when dealing with a high number of predictors.
  • Could produce less accurate predictions when extrapolating beyond the range of values present in the training data.

Linear Regression

We start with the assumption that the response variable and predictors are linearly related. We find the best line/plane using our data.

  • Simple linear regression involves one predictor and one response variable. The model is a straight line: \[ y = \beta_0 + \beta_1 x \]

  • In multiple linear regression, we have more than one predictor. The model is a flat plane in higher dimensions (a hyperplane).

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \]

Model Fitting in Linear Regression

  • The “best” line (fitted line) is the one that minimizes the sum of squared differences between observed and predicted values.

  • No need for cross validation to find the regression coefficients.

  • In linear regression, as opposed to KNN, scaling the features does not affect the fit: the coefficients change, but the predictions do not.
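The point about scaling can be checked directly. The sketch below uses R's built-in `mtcars` data, which is not part of these slides and is chosen only because it ships with R; it fits the same least-squares model on raw and on standardized predictors and compares the fitted values.

```r
# Least-squares fit on the raw predictors
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)

# Same model after standardizing both predictors
mtcars_std <- mtcars
mtcars_std$wt <- as.numeric(scale(mtcars$wt))
mtcars_std$hp <- as.numeric(scale(mtcars$hp))
fit_std <- lm(mpg ~ wt + hp, data = mtcars_std)

coef(fit_raw)   # coefficients on the original units
coef(fit_std)   # different coefficients on the standardized units

# The fitted values agree up to numerical tolerance
same_fit <- isTRUE(all.equal(fitted(fit_raw), fitted(fit_std)))
same_fit
```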

Considerations:

  • The linearity assumption matters. If the true relationship is not linear, a linear model underfits, resulting in a model with high bias.

  • Collinearity occurs when two predictor variables have a strong linear relationship with each other. In such cases, the estimated regression coefficients can change greatly with small changes in the data, resulting in an unreliable fitted model.

  • Simple linear regression (SLR) is not robust to outliers: observations that deviate substantially from the rest of the data can have a large impact on the regression coefficients if not handled appropriately.
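The sensitivity to outliers is easy to demonstrate. In this sketch the data are artificial: ten points lying exactly on a line with slope 2, after which a single response is replaced by an extreme value.

```r
x <- 1:10
y <- 2 * x + 1                      # points lie exactly on a line with slope 2
slope_clean <- coef(lm(y ~ x))[["x"]]

y_out <- y
y_out[10] <- 60                     # replace the last response (21) with an outlier
slope_outlier <- coef(lm(y_out ~ x))[["x"]]

# One changed observation roughly doubles the estimated slope
c(clean = slope_clean, with_outlier = slope_outlier)
```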

Extrapolation

Predicting outside the range of the observed data is known as extrapolation.

Extrapolation should be approached with extreme caution, especially when the underlying process may change, and assumptions may no longer hold.
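The two methods behave very differently outside the training range. The sketch below uses toy data and a hand-written 1-D KNN helper (`knn_predict_1d`, a hypothetical name): KNN returns the same value for any point far beyond the data, while the linear model keeps following its fitted line. Neither is guaranteed to be right.

```r
x <- 1:10
y <- 3 * x                          # toy training data, x ranges from 1 to 10

# 1-D KNN regression: average the responses of the k closest x values
knn_predict_1d <- function(x_new, x, y, k) {
  nearest <- order(abs(x - x_new))[seq_len(k)]
  mean(y[nearest])
}

# KNN: beyond the data, the neighbours are always the same edge points
knn_at_20  <- knn_predict_1d(20,  x, y, k = 3)
knn_at_100 <- knn_predict_1d(100, x, y, k = 3)

# Linear regression: the fitted line is extended indefinitely
lm_fit    <- lm(y ~ x)
lm_at_100 <- unname(predict(lm_fit, newdata = data.frame(x = 100)))

c(knn_20 = knn_at_20, knn_100 = knn_at_100, lm_100 = lm_at_100)
```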

Linear vs KNN Regression

Linear Regression

  • Pro: Interpretability, e.g. slope: change in \(Y\) for a one-unit change in \(X\).
  • Con: Assumes linearity

KNN Regression

  • Pro: Can model non-linear relationships
  • Con: Lack of interpretability

Example

We want to predict the price of diamonds based on two predictors: carat and depth.

We are using the diamonds dataset from the ggplot2 package.

library(tidyverse)
library(tidymodels)
library(ggplot2)

data("diamonds")

head(diamonds, 3)

Example - Cont’d

# set seed for reproducibility
set.seed(123)

# select a random subset of 1000 rows and the columns of interest
my_diamonds <- diamonds |>
  sample_n(1000, replace = FALSE) |>
  dplyr::select(price, carat, depth)

# split the dataset into training and testing sets
diamonds_split <- initial_split(my_diamonds, prop = 0.75, strata = price)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

# create a recipe
diamonds_recipe <- recipe(price ~ ., data = diamonds_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

We can use the same recipe for the linear regression model, even though linear regression does not require scaling and centering the predictors.

Example - Cont’d

knn_spec <- nearest_neighbor(weight_func = "rectangular", 
                             neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")

set.seed(4123456) # before cross validation

diamond_vfold <- vfold_cv(diamonds_train, v = 5, strata = price)

## create a tibble of possible values for k
gridvals <- tibble(neighbors = seq(from = 1, to = 50, by = 3))

#perform cross validation
diamonds_cv_results <- workflow() |>
  add_recipe(diamonds_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = diamond_vfold, grid = gridvals) |>
  collect_metrics() |>
  filter(.metric == "rmse")

Choosing best k

diamonds_cv_results

Choosing best k

rmspe_vs_k <- ggplot(diamonds_cv_results, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of Nearest Neighbors (K)", y = "Cross validation RMSPE Estimate") 


rmspe_vs_k

Fit the model using the best k

# extract best k from the results
best_k <- diamonds_cv_results |>
  filter(mean == min(mean)) |>
  pull(neighbors)

best_k

[1] 19

knn_reg_new <- nearest_neighbor(weight_func = "rectangular",
                                neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("regression")

# fit the model with best k
diamonds_knn_fit <- workflow() |>
  add_recipe(diamonds_recipe) |>
  add_model(knn_reg_new) |>
  fit(data = diamonds_train)

KNN results

#evaluate on test data
diamonds_knn_summary <- diamonds_knn_fit |>
  predict(diamonds_test) |>
  bind_cols(diamonds_test) |>
  metrics(truth = price, estimate = .pred) |>
  filter(.metric == 'rmse')

diamonds_knn_summary

Linear Regression

#we are using the same training data and recipe.

# model specification
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")

# no need to perform cross validation!!
diamonds_lm_fit <- workflow() |>
  add_recipe(diamonds_recipe) |>
  add_model(lm_spec) |>
  fit(data = diamonds_train)

Fitted Model

# extracting coefficients
coeffs <- diamonds_lm_fit |>
             extract_fit_parsnip() |>
             tidy()

coeffs

The fitted model, with carat and depth standardized by the recipe, is:

\[\begin{align*} \text{diamond price} &= 3918 + 3626 \cdot(\text{carat})-137\cdot(\text{depth})\end{align*}\]
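To see how the fitted equation is used, here is the arithmetic for one made-up diamond. The coefficients are the rounded values reported in the equation; the two standardized inputs are hypothetical. Because the recipe scales and centers the predictors, an input of 1 means one standard deviation above the training mean.

```r
# Rounded coefficients from the fitted model reported in the slides
b0      <- 3918
b_carat <- 3626
b_depth <- -137

# Hypothetical diamond: carat one SD above the training mean,
# depth half an SD below the training mean
carat_std <- 1
depth_std <- -0.5

predicted_price <- b0 + b_carat * carat_std + b_depth * depth_std
predicted_price  # 3918 + 3626 + 68.5 = 7612.5
```

The intercept, 3918, is the predicted price for a diamond with average carat and average depth, since both standardized predictors are then zero.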

Evaluation

lm_test_results <- diamonds_lm_fit |>
  predict(diamonds_test) |>
  bind_cols(diamonds_test) |>
  metrics(truth = price, estimate = .pred) |>
  filter(.metric == "rmse")

lm_test_results

Which one would you choose? Why?

Clustering

  • Supervised problems (classification and regression) rely on labeled training data.

  • In real life, not all datasets have pre-assigned labels.

  • Clustering is a method for exploring unlabeled data to identify patterns and groupings (unsupervised learning).

  • Examples of clustering: customer segmentation, Spotify's music recommendation system

Clustering - Cont’d

  • In clustering, the absence of a response variable makes it challenging to assess the “quality” of the clusters.

  • For this course, we will rely on visualization to gauge the quality of clustering.

  • Similar to classification and regression, there are numerous methods available for clustering observations and identifying subgroups.

  • One of the most widely used and relatively straightforward techniques for clustering is the K-means algorithm.

K-means

  1. Initialization: Start by randomly selecting K initial cluster centroids.

  2. Assignment: For each data point in the dataset, calculate its distance to each of the K centroids and assign the point to the cluster whose centroid is closest.

  3. Update: After all data points have been assigned to clusters, calculate the new centroid of each cluster (the average of all data points in that cluster).

  4. Repeat: Steps 2 and 3 are repeated iteratively until a stopping criterion is met. Common stopping criteria include a maximum number of iterations or when the centroids no longer change significantly between iterations.

When the algorithm converges (i.e., the centroids no longer change significantly), the K-means algorithm produces a clustering solution where data points are grouped into K clusters.
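The four steps can be written out directly in base R. This is a teaching sketch under simplifying assumptions (toy data, a fixed tolerance, no special handling beyond keeping the old centroid if a cluster ends up empty); the function `simple_kmeans` is hypothetical and is not the algorithm variant that the stats engine uses by default.

```r
set.seed(42)
# Toy data: two well-separated blobs of 20 points each
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

simple_kmeans <- function(pts, k, max_iter = 100) {
  # 1. Initialization: pick k distinct data points as starting centroids
  centroids <- pts[sample(nrow(pts), k), , drop = FALSE]
  for (iter in seq_len(max_iter)) {
    # 2. Assignment: squared distance from every point to every centroid,
    #    then assign each point to the closest centroid
    d2 <- matrix(
      sapply(seq_len(k), function(j) rowSums(sweep(pts, 2, centroids[j, ])^2)),
      ncol = k
    )
    cluster <- max.col(-d2, ties.method = "first")
    # 3. Update: each centroid becomes the mean of its assigned points
    new_centroids <- centroids
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        new_centroids[j, ] <- colMeans(pts[cluster == j, , drop = FALSE])
      }
    }
    # 4. Repeat until the centroids stop moving
    if (all(abs(new_centroids - centroids) < 1e-8)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centroids = centroids)
}

fit <- simple_kmeans(pts, k = 2)
table(fit$cluster)
```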

Considerations:

K-means has some key characteristics and considerations:

  • It is sensitive to the initial placement of centroids. Different initializations can lead to different clusterings. Therefore, it’s common to run the algorithm multiple times with different initializations and keep the result with the lowest total WSSD.

  • The choice of K (the number of clusters) is crucial and often requires domain knowledge or experimentation. Techniques such as the elbow method can help in determining an appropriate value for K.

  • K-means assumes spherical, equally sized clusters, which may not apply to all data.
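Base R's `kmeans()` supports multiple random restarts through its `nstart` argument and keeps the run with the lowest total within-cluster sum of squares. The sketch below compares one start against ten starts on simulated data; the data and seeds are assumptions made for the example.

```r
set.seed(2024)
# Toy data: three overlapping blobs of 30 points each
pts <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 3), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2)
)

# A single random initialization
set.seed(1)
fit_single <- kmeans(pts, centers = 3, nstart = 1)

# Ten random initializations; the best of the ten is returned
set.seed(1)
fit_multi <- kmeans(pts, centers = 3, nstart = 10)

# Total WSSD: lower is better
c(single = fit_single$tot.withinss, multi = fit_multi$tot.withinss)
```

On easy data the two numbers are often identical; restarts matter most when clusters overlap or K is large.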

Selecting the Optimal K

When choosing the number of clusters (K) for K-means, you must strike a balance:

Underfitting: Selecting a small K can lead to broad clusters that fail to capture the data’s inherent structure.

Overfitting: Opting for a large K can result in overly specific clusters, capturing noise rather than meaningful patterns.

Elbow Method: This approach involves plotting the total within-cluster sum of squared distances (WSSD) against the number of clusters (K) and looking for an “elbow point” on the graph, where the decrease in WSSD levels off.
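The numbers behind an elbow plot can be produced with base R's `kmeans()`. This is a sketch on simulated data with three blobs, so the elbow is expected near K = 3; the data, the seed, and the range of K are assumptions for illustration.

```r
set.seed(7)
# Toy data: three well-separated blobs of 30 points each
pts <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2)
)

# Total WSSD for each candidate K
ks <- 1:6
total_wssd <- sapply(ks, function(k) {
  kmeans(pts, centers = k, nstart = 20)$tot.withinss
})

# WSSD drops sharply up to K = 3, then only slightly afterwards
data.frame(K = ks, total_WSSD = total_wssd)
```

Total WSSD almost always keeps decreasing as K grows, so the smallest WSSD is not the goal; the elbow is where adding another cluster stops paying off.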

Example: Fruit data

We are going to use two variables (mass and width) from the fruit dataset to find clusters in our data.

Example: Fruit data

# Set a seed for reproducibility
set.seed(11630447)

sampled_fruit <- fruit %>%
  dplyr::select(mass, width)

# standardize the two variables of interest
fruit_recipe <- recipe(~., data = sampled_fruit) %>% 
  step_scale(all_numeric()) %>% 
  step_center(all_numeric()) 

standardized_fruit <- fruit_recipe %>%
  prep() %>%
  bake(new_data = NULL)

# rename the columns 
standardized_fruit <- standardized_fruit %>% 
  rename(mass_standardized = mass, 
         width_standardized = width)

ggplot(standardized_fruit, aes(x = mass_standardized, 
                                  y = width_standardized)) +
  geom_point() +
  xlab("Mass (standardized)") +
  ylab("Width (standardized)") + 
  theme(text = element_text(size = 12))

K-means model tuning

library(tidyclust)
set.seed(11630447)
# candidate values for K
fruit_clust_ks <- tibble(num_clusters = 1:9)
# the number of clusters is set to be tuned 
kmeans_spec <- k_means(num_clusters = tune()) |>
  set_engine("stats")
#  Run k-means clustering for k = 1, ... 9
kmeans_results <- workflow() |>
  add_recipe(fruit_recipe) |>
  add_model(kmeans_spec) |>
  tune_cluster(resamples=apparent(sampled_fruit), grid=fruit_clust_ks) |>
  collect_metrics()
kmeans_results

K-means results

## Total WSSD
kmeans_results <- kmeans_results |> 
  filter(.metric == "sse_within_total") |>
  mutate(total_WSSD = mean) |>
  dplyr::select(num_clusters, total_WSSD)
kmeans_results

Elbow Plot

elbow_plot <- ggplot(kmeans_results, aes(x = num_clusters, y = total_WSSD)) +
  geom_point() +
  geom_line() +
  xlab("K") +
  ylab("Total within-cluster sum of squares") +
  scale_x_continuous(breaks = 1:9) + 
  theme(text = element_text(size = 12))
# "elbow" --> "best" choice for K
elbow_plot

Fit the model again with the best number of clusters

kmeans_spec <- k_means(num_clusters = 3) |>
  set_engine("stats")

kmeans_fit <- workflow() |>
  add_recipe(fruit_recipe) |>
  add_model(kmeans_spec) |>
  fit(data = sampled_fruit) 

# summarize the results at the per-cluster level
tidy(kmeans_fit)

# use augment() to add the cluster assignments to the original data set
clustered_data <- kmeans_fit |>
  augment(sampled_fruit)

Final results

clustered_data

Visualization for clusters