DATA 101: Making Prediction with Data
University of British Columbia Okanagan
Today, we will review the concepts we learned in lectures 12 through 14.
Regression Problem
KNN Regression
Linear Regression
Clustering
Regression involves predicting the value of a numeric response variable using one or more predictor variables.
Regression vs. Classification: regression predicts a numeric value, whereas classification predicts a categorical label.
There are many statistical/machine learning methods for performing regression; of those, we covered K-nearest neighbors (KNN) regression and linear regression.
Key Concepts:
KNN regression assumes that similar data points have similar target values.
Predicts the target value of a new observation by averaging the responses of its K nearest neighbors in the training dataset.
To determine the nearest neighbors, KNN calculates the Euclidean distance between data points in the feature space.
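For reference, the prediction for a new observation x is the average response of its K nearest neighbors, and the distance between two observations a and b (with p predictors) is the Euclidean distance:
\[ \hat{y} = \frac{1}{K}\sum_{i \in N_K(x)} y_i, \qquad d(a, b) = \sqrt{\sum_{j=1}^{p}\left(a_j - b_j\right)^2} \]
where N_K(x) denotes the indices of the K training observations closest to x.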
The choice of K (the number of neighbors) significantly impacts the model’s performance. Cross-validation is used to find the best value for K.
Divide the dataset into a training set and a test set. Stratified splitting helps maintain a balanced representation of target values.
Perform cross-validation to determine K, using the training set only.
Train the model with the chosen K.
Assess the model’s performance on the test set using appropriate regression metrics such as RMSPE or MSE. Lower values indicate better predictive accuracy.
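For reference, the root mean squared prediction error computed on a test set of size n is
\[ \text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \]
where the sum runs over the n test observations, comparing observed responses with the model's predictions; the same formula evaluated on the training data is the RMSE.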
Scaling features is crucial when using KNN because it relies on distance measures.
Impact of K:
When K is too small, the model overfits the training data, capturing noise and following it too closely, which results in poor generalization to new data (overfitting).
When K is too large, the model underfits the data, lacking the flexibility to capture underlying patterns. This leads to poor performance on both the training data and new data (underfitting).
We start with the assumption that the response variable and predictors are linearly related. We find the best line/plane using our data.
Simple linear regression involves one predictor and one response variable. The model is a straight line: \[ y = \beta_0 + \beta_1 x \]
In multiple linear regression, we have more than one predictor. The model is a flat, higher-dimensional plane (a hyperplane):
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k \]
The “best” (fitted) line is the one that minimizes the sum of squared differences between the observed and predicted values.
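In simple linear regression, this means choosing the intercept and slope that solve
\[ \min_{\beta_0,\, \beta_1} \sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_i\right)^2, \]
with the analogous sum over all predictors in multiple linear regression.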
Unlike KNN, no cross-validation is needed to find the regression coefficients.
In linear regression, as opposed to KNN, not scaling the features does not affect the fit.
The assumption of linearity is important: fitting a linear model when the true relationship is not linear leads to underfitting, resulting in a model with high bias.
Collinearity occurs when two predictor variables exhibit a linear relationship with each other. In such instances, the estimated regression coefficients fluctuate significantly, resulting in an unreliable fitted model.
Linear regression is not robust to outliers: observations that deviate significantly from the main data cluster can have a substantial impact on the regression coefficients if not appropriately managed.
Predicting outside the range of the observed data is known as extrapolation.
Extrapolation should be approached with extreme caution, especially when the underlying process may change, and assumptions may no longer hold.
We want to predict the price of diamonds based on two predictors: carat and depth. We are using the diamonds dataset from the ggplot2 package.
# set seed for reproducibility
set.seed(123)
# select a smaller subset of rows and only the columns of interest
my_diamonds <- diamonds |>
  sample_n(1000, replace = FALSE) |>
  dplyr::select(price, carat, depth)
## split the data into training and testing sets
diamonds_split <- initial_split(my_diamonds, prop = 0.75, strata = price)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
# create a recipe
diamonds_recipe <- recipe(price ~ ., data = diamonds_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
We can use the same recipe for the linear regression model, even though scaling and centering the predictors is not required there.
knn_spec <- nearest_neighbor(weight_func = "rectangular",
                             neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("regression")
set.seed(4123456) # before cross validation
diamond_vfold <- vfold_cv(diamonds_train, v = 5, strata = price)
## create a tibble of possible values for k
gridvals <- tibble(neighbors = seq(from = 1, to = 50, by = 3))
# perform cross-validation over the candidate values of K
diamonds_cv_results <- workflow() |>
  add_recipe(diamonds_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = diamond_vfold, grid = gridvals) |>
  collect_metrics() |>
  filter(.metric == "rmse")
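As a minimal sketch of the remaining steps (object names such as best_k, knn_best_spec, knn_fit, and knn_test_results are illustrative), the chosen K is the one with the smallest cross-validated RMSE; the finalized KNN workflow is then fit on the training set and evaluated on the test set:
# sketch: choose the K with the smallest cross-validated RMSE
best_k <- diamonds_cv_results |>
  arrange(mean) |>
  slice(1) |>
  pull(neighbors)
# refit the KNN workflow on the full training set with the chosen K
knn_best_spec <- nearest_neighbor(weight_func = "rectangular",
                                  neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("regression")
knn_fit <- workflow() |>
  add_recipe(diamonds_recipe) |>
  add_model(knn_best_spec) |>
  fit(data = diamonds_train)
# evaluate the KNN model on the held-out test set
knn_test_results <- knn_fit |>
  predict(diamonds_test) |>
  bind_cols(diamonds_test) |>
  metrics(truth = price, estimate = .pred) |>
  filter(.metric == "rmse")
For the linear regression comparison, a workflow reusing the same recipe would typically be fit in a similar way; the following is a sketch of how diamonds_lm_fit (used in the test-set evaluation below) could be created:
# sketch: fit the linear regression model with the same recipe
lm_spec <- linear_reg() |>
  set_engine("lm") |>
  set_mode("regression")
diamonds_lm_fit <- workflow() |>
  add_recipe(diamonds_recipe) |>
  add_model(lm_spec) |>
  fit(data = diamonds_train)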
The fitted model is:
\[\begin{align*} \text{diamond price} &= 3918 + 3626 \cdot(\text{carat}) - 137\cdot(\text{depth})\end{align*}\]
lm_test_results <- diamonds_lm_fit |>
  predict(diamonds_test) |>
  bind_cols(diamonds_test) |>
  metrics(truth = price, estimate = .pred) |>
  filter(.metric == "rmse")
lm_test_results
Which one would you choose? why?
Classification and regression are supervised problems: they predict a response based on labeled training data.
In real life, not all datasets have pre-assigned labels.
Clustering is a method for exploring unlabeled data to identify patterns and groupings (unsupervised learning).
Examples of clustering: customer segmentation, the Spotify music recommendation system.
In clustering, the absence of a response variable makes it challenging to assess the “quality” of the clusters.
For this course, we will rely on visualization to gauge the quality of clustering.
Similar to classification and regression, there are numerous methods available for clustering observations and identifying subgroups.
One of the most widely used and relatively straightforward techniques for clustering is the K-means algorithm.
Initialization: Start by randomly selecting K initial cluster centroids.
Assignment: For each data point in the dataset, calculate its distance to each of the K centroids and assign the point to the cluster whose centroid is closest.
Update: After all data points have been assigned to clusters, calculate the new centroids for each cluster. (average of all data points in each cluster).
Repeat: Steps 2 and 3 are repeated iteratively until a stopping criterion is met. Common stopping criteria include a maximum number of iterations or when the centroids no longer change significantly between iterations.
When the algorithm converges (i.e., the centroids no longer change significantly), the K-means algorithm produces a clustering solution where data points are grouped into K clusters.
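To make these steps concrete, here is a minimal base-R sketch of the algorithm on hypothetical toy data (illustrative only; it ignores edge cases such as empty clusters, and the clustering example later in this review uses tidyclust instead):
# illustrative K-means loop on toy data (not the course's tidyclust workflow)
set.seed(1)
X <- matrix(rnorm(40), ncol = 2)            # 20 points with 2 features
K <- 3
centroids <- X[sample(nrow(X), K), ]        # 1. random initial centroids
for (iter in 1:100) {
  # 2. assignment: label each point with its nearest centroid (Euclidean distance)
  d <- as.matrix(dist(rbind(centroids, X)))[-(1:K), 1:K]
  labels <- max.col(-d)
  # 3. update: recompute each centroid as the mean of its assigned points
  new_centroids <- t(sapply(1:K, function(k)
    colMeans(X[labels == k, , drop = FALSE])))
  # 4. stop when the centroids no longer move
  if (all(abs(new_centroids - centroids) < 1e-8)) break
  centroids <- new_centroids
}
In practice, library implementations such as stats::kmeans add refinements, for example multiple random restarts via the nstart argument, which relates to the sensitivity to initialization discussed next.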
K-means has some key characteristics and considerations:
It is sensitive to the initial placement of centroids. Different initializations can lead to different clustering results, so it is common to run the algorithm multiple times with different initializations and choose the best result.
The choice of K (the number of clusters) is crucial and often requires domain knowledge or experimentation. Various techniques, such as the elbow method, can help in determining an appropriate value for K.
K-means assumes spherical, equally sized clusters, which may not hold for all datasets.
When choosing the number of clusters (K) for K-means, you must strike a balance:
Underfitting: Selecting a small K can lead to broad clusters that fail to capture the data’s inherent structure.
Overfitting: Opting for a large K can result in overly specific clusters, capturing noise rather than meaningful patterns.
Elbow Method: This approach involves plotting the number of clusters (K) against the corresponding total within-cluster sum of squared distances (WSSD) and looking for an “elbow point” on the graph.
We are going to use two predictors (mass and width) from the fruit dataset to find clusters in our data.
# Set a seed for reproducibility
set.seed(11630447)
sampled_fruit <- fruit %>%
  dplyr::select(mass, width)
# standardize the two variables of interest
fruit_recipe <- recipe(~., data = sampled_fruit) %>%
  step_scale(all_numeric()) %>%
  step_center(all_numeric())
standardized_fruit <- fruit_recipe %>%
  prep() %>%
  bake(new_data = NULL)
# rename the columns
standardized_fruit <- standardized_fruit %>%
  rename(mass_standardized = mass,
         width_standardized = width)
ggplot(standardized_fruit, aes(x = mass_standardized,
                               y = width_standardized)) +
  geom_point() +
  xlab("Mass (standardized)") +
  ylab("Width (standardized)") +
  theme(text = element_text(size = 12))
library(tidyclust)
set.seed(11630447)
# candidate values for K
fruit_clust_ks <- tibble(num_clusters = 1:9)
# the number of clusters is set to be tuned
kmeans_spec <- k_means(num_clusters = tune()) |>
  set_engine("stats")
# run K-means clustering for K = 1, ..., 9
kmeans_results <- workflow() |>
  add_recipe(fruit_recipe) |>
  add_model(kmeans_spec) |>
  tune_cluster(resamples = apparent(sampled_fruit), grid = fruit_clust_ks) |>
  collect_metrics()
kmeans_results
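To apply the elbow method described above, we can plot the total within-cluster sum of squared distances against the number of clusters. This is a sketch that assumes kmeans_results stores tidyclust's sse_within_total metric in the mean column of the collect_metrics() output (treat the metric and column names as assumptions):
# sketch of an elbow plot: total WSSD versus the number of clusters
elbow_df <- kmeans_results |>
  filter(.metric == "sse_within_total") |>  # assumed tidyclust metric name
  select(num_clusters, mean)
ggplot(elbow_df, aes(x = num_clusters, y = mean)) +
  geom_point() +
  geom_line() +
  xlab("Number of clusters (K)") +
  ylab("Total within-cluster sum of squared distances") +
  scale_x_continuous(breaks = 1:9) +
  theme(text = element_text(size = 12))
The value of K at the “elbow”, where the decrease in WSSD starts to level off, is a reasonable choice.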