DATA 101: Making Prediction with Data
University of British Columbia Okanagan
So far we have dealt with supervised problems (classification and regression), where predictions are based on patterns learned from labeled training data.
Clustering is a method for exploring unlabeled data to identify patterns and potential groupings.
This can be helpful at the exploratory data analysis stage as a tool for understanding complex datasets.
Even when data is labeled, these clusters can be used for many purposes, such as generating new questions or improving predictive analyses (via feature engineering).
This approach has both advantages and disadvantages.
Clustering requires only unlabelled data, which is inherently easier to obtain than labelled data.
However, the task of clustering is more challenging, since we don't have labelled examples to guide us.
As with classification and regression, there are many possible methods that we could use to cluster our observations to look for subgroups.
In this lecture, we focus on one of the most popular (and simplest) methods: the K-means algorithm (Lloyd 1982).
As an illustrative example we use a data set from the palmerpenguins R package (Horst, Hill, and Gorman 2020).
This data set was collected by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research Site, and includes measurements for adult penguins found near there (Gorman, Williams, and Fraser 2014).
We focus on two variables, penguin bill length and flipper length (both in millimeters), and take a subset of 18 observations from the original data, which we standardize.
library(tidyverse)
library(tidymodels)
library(palmerpenguins)   # provides the penguins data set

# set a seed for reproducibility
set.seed(11630447)

# randomly sample 18 penguins and keep the two variables of interest
sampled_penguins <- penguins %>%
  sample_n(18) %>%
  select(flipper_length_mm, bill_length_mm)

# standardize (scale and center) the two variables of interest
penguins_recipe <- recipe(~., data = sampled_penguins) %>%
  step_scale(all_numeric()) %>%
  step_center(all_numeric())

standardized_penguins <- penguins_recipe %>%
  prep() %>%
  bake(new_data = NULL)

# rename the columns (to match the book)
standardized_penguins <- standardized_penguins %>%
  rename(flipper_length_standardized = flipper_length_mm,
         bill_length_standardized = bill_length_mm)

standardized_penguins
Next, we can create a scatter plot of this data to see if we can visually detect subtypes or groups (see the sketch below).
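For example, a basic scatter plot of the standardized measurements could be made along these lines; this sketch assumes the standardized_penguins data frame created above:

# scatter plot of the two standardized variables
ggplot(standardized_penguins,
       aes(x = flipper_length_standardized,
           y = bill_length_standardized)) +
  geom_point() +
  labs(x = "Flipper Length (standardized)",
       y = "Bill Length (standardized)") +
  theme(text = element_text(size = 12))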
We might suspect there are 3 subtypes of penguins:
library(tidyclust)

kmeans_spec <- k_means(num_clusters = 3) |>
  set_engine("stats")

kmeans_fit <- workflow() |>
  add_recipe(penguins_recipe) |>
  add_model(kmeans_spec) |>
  fit(data = sampled_penguins)

clustered_data <- kmeans_fit |>
  augment(sampled_penguins)
cluster_plot <- ggplot(clustered_data,
                       aes(x = flipper_length_mm,
                           y = bill_length_mm,
                           color = .pred_cluster)) +
  geom_point(size = 2) +
  labs(x = "Flipper Length",
       y = "Bill Length",
       color = "Cluster") +
  scale_color_manual(values = c("dodgerblue3",
                                "darkorange3",
                                "goldenrod1")) +
  theme(text = element_text(size = 12))

cluster_plot
In simple cases where we can easily visualize the data, we can give the clusters human-made labels based on their positions on the plot:
orange cluster: small flipper length and small bill length,
yellow cluster: small flipper length and large bill length,
blue cluster: large flipper length and large bill length.
K-means is a procedure that partitions a data set into K distinct, non-overlapping subsets called clusters.
The objective of this procedure is to minimize the sum of squared distances within each cluster (i.e., making the clusters as tight and compact as possible).
This within-cluster sum of squared distances (WSSD) will also serve as a metric for assessing the "quality" of a partition (i.e., the resulting cluster assignments).
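As an illustration, the total WSSD for a given cluster assignment could be computed by hand as follows. This is only a sketch: it assumes the clustered_data data frame (with its .pred_cluster column) from above, and the distances here are computed on the original (unstandardized) columns returned by augment(), whereas the fit itself works on the standardized variables.

clustered_data %>%
  group_by(.pred_cluster) %>%
  # squared distance of each point to its cluster mean, summed per cluster
  summarise(wssd = sum((flipper_length_mm - mean(flipper_length_mm))^2 +
                       (bill_length_mm - mean(bill_length_mm))^2)) %>%
  # add up the per-cluster WSSDs to get the total WSSD
  summarise(total_wssd = sum(wssd))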
See Alison Horst’s gif here
Fig 9.10 from textbook: Clustering of the penguin data for K clusters ranging from 1 to 9. Cluster centers are indicated by larger points that are outlined in black.
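To make the iterative procedure concrete, here is a minimal sketch of Lloyd's algorithm. The function name lloyd_kmeans is hypothetical and the code is illustrative only (it ignores edge cases such as a cluster becoming empty); in practice you would use kmeans() or tidyclust.

lloyd_kmeans <- function(dat, k, max_iter = 100) {
  x <- as.matrix(dat)
  # initialize: pick k observations at random as the starting centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0, nrow(x))
  for (iter in seq_len(max_iter)) {
    # assignment step: each point joins the cluster with the closest center
    dists <- sapply(seq_len(k),
                    function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_assignment <- max.col(-dists)
    if (all(new_assignment == assignment)) break  # no changes, so we stop
    assignment <- new_assignment
    # update step: recompute each center as the mean of its assigned points
    for (j in seq_len(k)) {
      centers[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centers)
}

# e.g., lloyd_kmeans(standardized_penguins, k = 3)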
For each of the K-group solutions on the previous page we could calculate the total WSSD.
If we plot the total WSSD versus the number of clusters, we see that the decrease in total WSSD levels off (or forms an “elbow shape”) when we reach roughly the “right” number of clusters
Choosing the K that coincides with the "elbow" point (i.e., where the reduction in the sum of squared distances slows down) is called the "elbow method".
Fig 9.12 from textbook: Total WSSD for K clusters ranging from 1 to 9.
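A plot like the one in this figure could be produced along the following lines; this sketch uses base R's kmeans() together with broom's glance() (loaded with tidymodels), and assumes the standardized_penguins data frame from earlier:

# fit K-means for K = 1 to 9 and collect the total WSSD for each fit
elbow_stats <- tibble(k = 1:9) %>%
  rowwise() %>%
  mutate(kclust = list(kmeans(standardized_penguins, centers = k, nstart = 10)),
         glanced = list(glance(kclust))) %>%
  unnest(cols = c(glanced))

# total WSSD (tot.withinss) versus the number of clusters
ggplot(elbow_stats, aes(x = k, y = tot.withinss)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of clusters (K)",
       y = "Total within-cluster sum of squared distances")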
Strengths
Limitations
Let’s look at the code
As discussed for K-nearest neighbours (another distance-based approach), you should always consider standardizing your variables before applying K-means.
Because the result depends on the random initial assignments, it is recommended to run K-means multiple times with different random initializations and keep the best result (the one with the lowest total WSSD).
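With base R's kmeans(), the nstart argument does exactly this: it runs the algorithm that many times from different random starts and returns the best run. A small sketch, assuming the standardized_penguins data frame from earlier:

# run K-means 10 times from different random starts;
# the run with the lowest total WSSD is kept
best_fit <- kmeans(standardized_penguins, centers = 3, nstart = 10)
best_fit$tot.withinss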