Data 311: Machine Learning

Lecture 10: Clustering

Dr. Irene Vrbik

University of British Columbia Okanagan

Recall

  • The goal in the supervised setting is to predict \(Y\) (either continuous or categorical) using \(p\) features \(X_1, X_2, \dots, X_p\) measured on \(n\) observations.

  • In unsupervised learning we do not have a response variable \(Y\), so the goal is to discover interesting things about the measurements on \(X_1, X_2, \dots, X_p\)

Goal of Unsupervised Learning

The goal is to discover interesting things about the measurements:

  • e.g. is there an informative way to visualize the data?

  • e.g. can we discover subgroups among the variables or among the observations?

  • e.g. can we discover interesting patterns, relationships, or associations among items or variables in a dataset?

Practical Applications

Techniques for unsupervised learning are of growing importance in a number of fields. For instance:

  • search for subgroups among breast cancer patients in order to gain a better understanding of the disease
  • understand customer buying behavior to identify groups of shoppers with similar browsing and purchase histories (for targeted ads)
  • representing a high-dimensional data set (e.g. gene expression) in smaller dimensions

Practical Applications

Techniques for unsupervised learning are of growing importance in a number of fields. For instance:

  • search for subgroups among breast cancer patients in order to gain a better understanding of the disease (clustering)
  • representing a high-dimensional data set (e.g. gene expression) in smaller dimensions (dimensionality reduction)
  • understand customer buying behavior to identify groups of shoppers with similar browsing and purchase histories (for targeted ads) (association)

Challenges and Advantages of Unsupervised Learning

Challenge: Unsupervised learning is more subjective than supervised learning, as there is no simple goal for the analysis, such as prediction of a response.

Advantage: It is often easier to obtain unlabeled data—from a lab instrument or a computer—than labeled data, which can require human intervention.

Types of Unsupervised Learning

Unsupervised learning is utilized for three main tasks [1]:

  1. clustering
  2. dimensionality reduction
  3. association

This lecture will focus on two clustering algorithms: Hierarchical Clustering and K-means Clustering

Clustering

The goal of clustering is to find groups (or clusters) such that all observations are

  • more similar to observations inside their group and
  • more dissimilar to observations in other groups.

Types of clustering algorithms include:

  • Hierarchical
  • Partitioning (e.g. k-means “hard” clustering)
  • Probabilistic Clustering (e.g. Finite Mixture models)

Old Faithful

Old Faithful Geyser Data is available in R (see ?faithful) and contains the following information for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

  • waiting the waiting time between eruptions and
  • eruptions the duration of the eruption
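
A quick look at the data (a small illustrative snippet I've added; faithful ships with base R):

head(faithful)     # first few (eruptions, waiting) pairs
summary(faithful)  # ranges of the two variables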

Old Faithful Plot

While there are no true “classes” in this data set, many would argue that there are two natural “clusters”.

Scatter plot with eruption time (in minutes) on the x-axis and waiting time (in minutes) on the y-axis. Points cluster in the bottom-left corner of the plot and the upper-right corner of the plot.

Wine

  • The wine data from the gclus package comprise 178 Italian wines from three different cultivars that correspond to the wine varietals: Barolo, Grignolino, Barbera

  • Recorded are 13 measurements (e.g. alcohol content, malic acid, ash, …)

  • Notably, this data set does have a known class!
  • However, it is often used to test and compare the performance of various classification algorithms.

Wine Data

library(gclus); data(wine); head(wine)

A pairwise scatterplot of the wine data set with points coloured according to wine class (1,2,3)

Comments before we begin

  • Clustering will have an element of subjectivity (some may argue that there are 3 groups in the Old Faithful data set)

  • It can therefore be hard to assess the results obtained from unsupervised learning methods

  • For these reasons, clustering algorithms are often tried and tested on data sets for which classes do exist (e.g. wine)

  • In practice unsupervised learning is often performed as part of an exploratory data analysis

Hierarchical Clustering

  • Once we have information on the distances between all observations in our data set, we’re ready to come up with ways to group those observations.

  • Hierarchical clustering (HC), or hierarchical cluster analysis (HCA), is in many ways the most straightforward method for finding groups.

  • As its name suggests, this method builds a hierarchy of clusters.

Types of HC

HC can be categorized in two ways (we’ll focus on 1.):

  1. Agglomerative “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

  2. Divisive “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

Agglomerative HC algorithm

Agglomerative HC can be boiled down to the following steps:

  1. Start with all observations in their own groups ( \(n\) unique groups)

  2. Join the two closest observations (now there are \(n-1\) groups)

  3. Recalculate distances (more on this soon)

  4. Repeat 2) and 3) until you are left with only 1 group.
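
As a rough sketch (my own illustration, assuming a numeric data matrix X with one row per observation), these steps are essentially what R's hclust() carries out on a distance matrix:

D  <- dist(X)                       # step 1 input: all pairwise distances
hc <- hclust(D, method = "single")  # steps 2-4, here with single linkage
hc$merge                            # which observations/clusters were joined at each step
hc$height                           # the distance at which each merge happened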

Simple Example

Code
a <- c(1,2)
b <- c(2,2.5)
c <- c(1.5, 2)
d <- c(5,3)
e <- c(4.5,4)
x <- rbind(a,b,c,d,e)
colors <- c("purple",
 "dodgerblue",
 "firebrick",
 "forestgreen",
 "gold")
par(mar = c(4, 4, 1.5, 0.1))
plot(x, xlab="", ylab="", pch=letters[1:5], cex=2
     , ylim=c(1,5), xlim=c(0,6))

A scatter plot of 5 points: a. at (1,2), b at (2,2.5), c at (1.5, 2), d at (5,3), and e. at (4.5,4).

If I were to ask you to group this data, what would you do?

(Again, since we are in the realm of clustering there is no “right” or “wrong” answer).

Now let’s see what HCA produces …

Step 1

A scatter plot of 5 points: a. at (1,2), b at (2,2.5), c at (1.5, 2), d at (5,3), and e. at (4.5,4).  Each letter is circled indicating that they are the sole member of their cluster.

Example distance matrix:

x <- rbind(a,b,c,d,e); dist(x)
      a     b     c     d     e
a 0.000
b 1.118 0.000
c 0.500 0.707 0.000
d 4.123 3.041 3.640 0.000
e 4.031 2.915 3.606 1.118 0.000

Step 2

Same scatter plot as previous page, only now point a and point c are enclosed in the same circle indicating that they belong to the same group.

Since a and c are the closest, we join them into a group

      a     b     c     d     e
a 0.000
b 1.118 0.000
c 0.500 0.707 0.000
d 4.123 3.041 3.640 0.000
e 4.031 2.915 3.606 1.118 0.000

Step 3: Recalculating Distances

  • Now that we’ve joined the two closest observations how do we determine the distance between groups?

  • For instance, what's the distance between the purple group containing observations a and c and the blue group containing a single observation b? (denote this \(d_{\{ac\}\{b\}}\))

  • There are a few common ways of recalculating distances between groups; these are referred to as linkages

Linkages

  1. Single linkage: calculates the minimum distance between two points in each cluster.

  2. Complete linkage: calculates the maximum distance between two points in each cluster.

  3. Average linkage: calculates the average of all pairwise distances between the observations in one cluster and the observations in the other.

  4. Centroid Linkage: calculates the distance between the centroids (mean points) of two clusters.

Linkages Example:

Same scatter plot as previous page, only now point a and point c are enclosed in the same circle indicating that they belong to the same group.

  1. Single linkage: \(d_{\{ac\}\{b\}}=d_{cb}\)
  2. Complete linkage: \(d_{\{ac\}\{b\}}=d_{ab}\).
  3. Average linkage: \(d_{\{ac\}\{b\}}=\frac{d_{ab}+d_{cb}}{2}\).

Linkages Example:

      a     b     c     d     e
a 0.000
b 1.118 0.000
c 0.500 0.707 0.000
d 4.123 3.041 3.640 0.000
e 4.031 2.915 3.606 1.118 0.000
  1. Single linkage: \[d_{\{ac\}\{b\}}=d_{cb} = 0.707\]
  2. Complete linkage: \[d_{\{ac\}\{b\}}=d_{ab} = 1.118\]
  3. Average linkage: \[d_{\{ac\}\{b\}}=\frac{d_{ab}+d_{cb}}{2}= 0.9125\]
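
These numbers can be checked directly from the distance matrix (a small verification I've added, using the x from the simple example; dm is just an illustrative name):

dm <- as.matrix(dist(x))     # x = rbind(a, b, c, d, e) from before
min(dm[c("a", "c"), "b"])    # single linkage:   0.707
max(dm[c("a", "c"), "b"])    # complete linkage: 1.118
mean(dm[c("a", "c"), "b"])   # average linkage:  ~0.913 (the 0.9125 above averages the rounded values)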

Distance recalculation

      ac     b     d     e
ac 0.000
b  0.707 0.000
d  3.640 3.041 0.000
e  3.606 2.915 1.118 0.000
  1. Single linkage: \[d_{\{ac\}\{b\}}=d_{cb} = 0.707\]

Adopting single linkage, our distance matrix would then get updated to something like what you see above.

Step 4a

The same plot from before with a and c joined.

Since group {a,c} and {b} are the closest, we merge them.

      ac     b     d     e
ac 0.000
b  0.707 0.000
d  3.640 3.041 0.000
e  3.606 2.915 1.118 0.000

Step 4b

The same plot from before.

Recalculate distance:

      abc     d     e
abc 0.000
d   3.041 0.000
e   2.915 1.118 0.000

Step 4c

The same plot from before with e and d joined.

Since group {e} and {d} are the closest, we merge them.

      abc     d     e
abc 0.000
d   3.041 0.000
e   2.915 1.118 0.000

Step 4d

The same plot from previous slide.

Recalculate Distance

      abc    de
abc 0.000
de  2.915 0.000

Step 4e

The same plot from previous slide with all members joined to a single group

Join the two remaining groups: {a,b,c} and {e,d}

      abc    de
abc 0.000
de  2.915 0.000

Hierarchy of Clusters

  • What we've done is provide a hierarchy of potential solutions to the following non-trivial questions:

    • how many groups are in this data?
    • what do they look like?
  • Also, do we really need to go through the algorithm step-by-step to see the process?

Hierarchical Clustering

  • We can summarize the algorithm graphically, using what's called a dendrogram, i.e. a diagram representing a tree.
  • The Y-axis, “Height”, is a measure of the closeness of either individual data points or clusters.

  • Horizontal bars indicate the distance at which two clusters are joined.

  • Large “jumps” between combining groups signify large distances between groups.
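
For the simple example, the dendrogram shown on the next slide can be reproduced in a couple of lines (a minimal sketch; x is the 5-point matrix defined earlier, and single linkage matches the manual steps):

hc <- hclust(dist(x), method = "single")
plot(hc, main = "", ylab = "Height")   # leaves are labelled a-e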

Dendrogram

Dendrogram of the simple example, built up step by step (Steps 1, 2, 3, 4a, 4b, 4c, 4d, 4e).

In this case, we can see that a and c are most similar, since the height of the horizontal line that joins them together is lowest on the Y-axis.

Dendrogram “jump”

In this case, the dendrogram shows us that there is a big difference between cluster {a, b, c} and cluster {d, e}, and that the distance within each of these clusters is pretty small.

Determining the Number of Clusters

  • Dendrograms can also be used to determine how many groups are present in the data.

  • We can perform a horizontal “cut” in the tree to partition the data into clusters.

  • So generally, we can “cut” at the largest jump in distance, and that will determine the number of groups, as well as each observation’s group membership.
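
In R, cutree() performs this horizontal cut, either at a chosen height h or for a requested number of groups k (a small sketch continuing the simple example):

hc <- hclust(dist(x), method = "single")
cutree(hc, k = 2)    # two groups: a, b, c -> 1 and d, e -> 2
cutree(hc, h = 1.7)  # equivalently, cut the tree at height 1.7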

Cut-points (2 group solution)

You can visualize this as an upside-down tree (I have extended the “leaves” of the dendrogram to the bottom to better explain the cutting process). Where you cut will dictate how many “branches” (i.e. clusters) you end up with.

Dendrogram cut at 1.7

Cutting the dendrogram somewhere along the large “jump” would result in two branches, i.e. clusters

2 group visualization

Which matches our intuition …

The same plot from before with e and d joined.

We could have cut this tree elsewhere however …

5-group solution

Any horizontal cut below the first merge height (0.5) would result in 5 singleton clusters, i.e. 5 clusters with only 1 member.

4-group solution

Any horizontal cut between the first and second merge heights (0.5 and 0.707) would result in 4 clusters: {b}, {a, c}, {d}, and {e}.

3-group solution

Any horizontal cut between the second and third merge heights (0.707 and 1.118) would result in 3 clusters: {a, b, c}, {d}, and {e}.

2-group solution

Any horizontal cut between the third and fourth merge heights (1.118 and 2.915) would result in 2 clusters: {a, b, c} and {d, e}.

1-group solution

Any horizontal cut above the final merge height (2.915) would result in 1 cluster comprising all the observations: {a, b, c, d, e}.

The same plot from previous slide with all members joined to a single group

HC: Old Faithful

Below is the dendrogram built from the raw (unscaled) distance matrix (calculated using single linkage) for the Old Faithful data

Old Faithful (2 Group Solution)

Here is the two group solution from that model:

Old Faithful (3 Group Solution)

Here is the three group solution from that model:

Old Faithful (4 Group Solution)

Here is the four group solution from that model:

Question

  • Hmm… These clusters are not ideal.

  • What do you think might have happened?

  • Can you think of something else to try?

HC: Old Faithful (second attempt)

Here is the dendrogram on the standardized Euclidean distance matrix (using single linkage and the scale() function)

faits <- hclust(dist(scale(faithful)), method="single")
plot(faits, main="", labels=FALSE)
abline(h=0.4, lwd=2, col=2)

hclust() Arguments:

  • d: a dissimilarity structure as produced by dist.
  • method: the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" or "centroid"

HC: Old Faithful (second attempt)

Once we perform HC on the scaled data, we get a two-group solution that much more closely matches our intuition …

par(mar = c(1.9, 1.9, 1, 0))     

# cluster memberships: a vector of
# length n (one label per observation)
cls <- cutree(faits, 2)

# my colours:
mycol <- c("purple",
 "dodgerblue")

# plot the data, coloured by cluster
plot(faithful,  
     col = mycol[cls])

HC: Wine data

In some cases, the choice of linkage method could make a big difference. Notably, in all cases the data were scaled.
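
A sketch of how such a comparison might be produced (my own code, not the original slide's; the method names are hclust()'s linkage options):

library(gclus); data(wine)
dw <- dist(scale(wine[,-1]))   # drop the Class column, then scale
par(mfrow = c(1, 3))
for (m in c("single", "average", "complete")) {
  plot(hclust(dw, method = m), main = m, labels = FALSE)
}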

Notes on Linkage

  • Single-linkage clustering is susceptible to an effect known as “chaining” whereby poorly separated, but distinct clusters are merged at an early stage

  • Average linkage may lead to a number of singleton clusters (which is generally a bad sign)

  • Complete linkage provides a reasonable partition of the data, suggesting probably 2 or 3 groups. (We’ll come back to this later.)

Comments

Cons

  • Distance matrices (of size \(n \times n\), symmetric) must be calculated.

  • For very large samples, this can be time consuming (even for computers).

  • Results are often sensitive to what distance type and what linkage method are used.

Pro

  • Easy to implement and understand

K-means Clustering

Next we look at \(k\)-means Clustering.

  • \(k\)-means clustering is a popular method that requires the user to provide \(k\), the number of groups they are looking for

  • In this algorithm, each observation will belong to the cluster whose mean is closest.

After specifying the steps of the algorithm, we will demonstrate it on the Old Faithful data from last lecture.

Algorithm

  1. Randomly select \(k\) (the number of groups) points in your data. These will serve as the first centroids
  2. Assign all observations to their closest centroid (in terms of Euclidean distance). You now have \(k\) groups.
  3. Calculate the means of observations from each group; these are your new centroids.
  4. Repeat 2) and 3) until nothing changes anymore (each loop is called an iteration).
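
The next slides implement these steps by hand; for reference, the built-in kmeans() does the same job (a quick sketch I've added; algorithm = "Lloyd" is the option closest to the steps above, while R's default is Hartigan-Wong):

km <- kmeans(scale(faithful), centers = 2, algorithm = "Lloyd")
km$centers         # final centroids (on the scaled variables)
table(km$cluster)  # how many observations fall in each group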

Step 1 (\(k\) = 2)

Randomly select 2 centroids

Code
par(mar = c(4, 4, 0.1, 0.1))
k=2
x <- scale(faithful)
set.seed(2021)

#1 start with k centroids
centrs <- x[sample(1:nrow(x), k),]
plot(x)
points(centrs, pch=17, cex=2, col=1:k)

Step 2

Assign obs. to their closest centroid

Code
par(mar = c(4, 4, 0.1, 0.1))
#2a) calculate distances between all x and centers
dists <- matrix(NA, nrow(x), k)
  for(i in 1:nrow(x)){
    for(j in 1:k){
      dists[i, j] <- sqrt(sum((x[i,] - centrs[j,])^2)) 
    }
  }

#2b) assign group memberships (nearest centroid for each observation, via row-wise which.min)
membs <- apply(dists, 1, which.min)
plot(x, col=membs)  
points(centrs, pch=17, cex=2, col=1:2)

Step 3

Recalculate the means/centroids of each group.

Code
par(mar = c(4, 4, 0.1, 0.1))
#3) calculate new group centroids
oldcentrs <- centrs #save for convergence check!
for(j in 1:k){
  centrs[j,] <- colMeans(x[membs==j, ])
}
plot(x, col=membs)
points(centrs, pch=17, cex=2, col=1:2)

Step 4 (loop)

Code
par(mar = c(4, 4, 1, 0.1))
library(gifski)
count = 2
changing <- TRUE  
while(changing){
  #2a) calculate distances between all x and centers
  dists <- matrix(NA, nrow(x), k)
  #could write as a double loop, or could utilize other built in functions probably
  for(i in 1:nrow(x)){
    for(j in 1:k){
      dists[i, j] <- sqrt(sum((x[i,] - centrs[j,])^2)) 
    }
  }
  #2b) assign group memberships (nearest centroid for each observation, via row-wise which.min)
  membs <- apply(dists, 1, which.min)
  
  #3) calculate new group centroids
  oldcentrs <- centrs #save for convergence check!
  for(j in 1:k){
    centrs[j,] <- colMeans(x[membs==j, ])
  }
  plot(x, col=membs, main=paste("Iteration", count))
  points(centrs, pch=17, cex=2, col=1:2)
  count <- count+1
  
  #4) check for convergence
  if(all(centrs==oldcentrs)){
    changing <- FALSE
    text(0.75, -1, "DONE!", cex=4)
  }
}
membs2 <- membs # save 2-group solution

Step 4 (last iteration)

par(mar = c(4, 4, 1, 0.1))
count = 2
changing <- TRUE  
while(changing){
  #2a) calculate distances between all x and centers
  dists <- matrix(NA, nrow(x), k)
  #could write as a double loop, or could utilize other built in functions probably
  for(i in 1:nrow(x)){
    for(j in 1:k){
      dists[i, j] <- sqrt(sum((x[i,] - centrs[j,])^2)) 
    }
  }
  #2b) assign group memberships (nearest centroid for each observation, via row-wise which.min)
  membs <- apply(dists, 1, which.min)
  
  #3) calculate new group centroids
  oldcentrs <- centrs #save for convergence check!
  for(j in 1:k){
    centrs[j,] <- colMeans(x[membs==j, ])
  }
  count <- count+1
  
  #4) check for convergence
  if(all(centrs==oldcentrs)){
    plot(x, col=membs, main=paste("Iteration", count))
    points(centrs, pch=17, cex=2, col=1:2)
    changing <- FALSE
    # text(0.75, -1, "DONE!", cex=4)
  }
}

Step 4 (last iteration)

membs2 <- membs # save 2-group solution

Step 1 (\(k\) = 3)

Randomly select 3 observations ( \(k\) =3). These will be your centroids for the 3 groups

Code
par(mar = c(4, 4, 0.1, 0.1))
k=3
x <- scale(faithful)
set.seed(2021)

#1 start with k centroids
centrs <- x[sample(1:nrow(x), k),]
plot(x)
points(centrs, pch=17, cex=2, col=1:k)

Step 2

Assign obs. to their closest centroid

Code
par(mar = c(4, 4, 0.1, 0.1))
#2a) calculate distances between all x and centers
dists <- matrix(NA, nrow(x), k)
  for(i in 1:nrow(x)){
    for(j in 1:k){
      dists[i, j] <- sqrt(sum((x[i,] - centrs[j,])^2)) 
    }
  }

#2b) assign group memberships (nearest centroid for each observation, via row-wise which.min)
membs <- apply(dists, 1, which.min)
plot(x, col=membs)  
points(centrs, pch=17, cex=2, col=1:k)

Step 3

Recalculate the means of each group. These are your new centroids.

Code
par(mar = c(4, 4, 0.1, 0.1))
#3) calculate new group centroids
oldcentrs <- centrs #save for convergence check!
for(j in 1:k){
  centrs[j,] <- colMeans(x[membs==j, ])
}
plot(x, col=membs)
points(centrs, pch=17, cex=2, col=1:k)

Step 4 (loop)

Repeat 2) and 3) until nothing changes anymore.

Code
par(mar = c(4, 4.1, 1, 0.1))
count = 2
changing <- TRUE  
while(changing){
  #2a) calculate distances between all x and centers
  dists <- matrix(NA, nrow(x), k)
  #could write as a double loop, or could utilize other built in functions probably
  for(i in 1:nrow(x)){
    for(j in 1:k){
      dists[i, j] <- sqrt(sum((x[i,] - centrs[j,])^2)) 
    }
  }
  #2b) assign group memberships (nearest centroid for each observation, via row-wise which.min)
  membs <- apply(dists, 1, which.min)
  
  #3) calculate new group centroids
  oldcentrs <- centrs #save for convergence check!
  for(j in 1:k){
    centrs[j,] <- colMeans(x[membs==j, ])
  }
  plot(x, col=membs, main=paste("Iteration", count))
  points(centrs, pch=17, cex=2, col=1:k)
  count <- count+1
  
  #4) check for convergence
  if(all(centrs==oldcentrs)){
    changing <- FALSE
    text(0.75, -1, "DONE!", cex=4)
  }
}
membs3 <- membs # save 3-group solution

Step 4 (last iteration)

par(mar = c(4, 4.1, 1, 0.1))
count = 2
changing <- TRUE  
while(changing){
  #2a) calculate distances between all x and centers
  dists <- matrix(NA, nrow(x), k)
  #could write as a double loop, or could utilize other built in functions probably
  for(i in 1:nrow(x)){
    for(j in 1:k){
      dists[i, j] <- sqrt(sum((x[i,] - centrs[j,])^2)) 
    }
  }
  #2b) assign group memberships (nearest centroid for each observation, via row-wise which.min)
  membs <- apply(dists, 1, which.min)
  
  #3) calculate new group centroids
  oldcentrs <- centrs #save for convergence check!
  for(j in 1:k){
    centrs[j,] <- colMeans(x[membs==j, ])
  }
  count <- count+1
  
  #4) check for convergence
  if(all(centrs==oldcentrs)){
    plot(x, col=membs, main=paste("Iteration", count))
    points(centrs, pch=17, cex=2, col=1:k)
    changing <- FALSE
    # text(0.75, -1, "DONE!", cex=4)
  }
}

Step 4 (last iteration)

membs3 <- membs # save 3-group solution

Different 3-group (loop)

Let’s do the 3-group \(k\)-means again:

Different 3-group solution

(Last Iteration)

Code
par(mar = c(4, 4, 1.1, 0.1))
set.seed(1234)
#1 start with k centroids
centrs <- x[sample(1:nrow(x), k),]
#start loop
changing <- TRUE  
while(changing){
  #2a) calculate distances between all x and centers
  dists <- matrix(NA, nrow(x), k)
  #could write as a double loop, or could utilize other built in functions probably
  for(i in 1:nrow(x)){
    for(j in 1:k){
      dists[i, j] <- sqrt(sum((x[i,] - centrs[j,])^2)) 
    }
  }
  #2b) assign group memberships (nearest centroid for each observation, via row-wise which.min)
  membs <- apply(dists, 1, which.min)
  
  #3) calculate new group centroids
  oldcentrs <- centrs #save for convergence check!
  for(j in 1:k){
    centrs[j,] <- colMeans(x[membs==j, ])
  }
  # readline(prompt="Press [enter] to continue")
  
  #4) check for convergence
  if(all(centrs==oldcentrs)){
    plot(x, col=membs)
    points(centrs, pch=17, cex=2, col=1:k)
    changing <- FALSE
    #text(mean(x), -1, "DONE!", cex=4)
  }
}
Code
membs3b <- membs

3 group comparison

Depending on our random starting points, we can get very different answers …

Why does \(k\)-means work?

  1. Randomly select \(k\) (the number of groups) points in your data. These will serve as the first centroids
  2. Assign all observations to their closest centroid (in terms of Euclidean distance). You now have \(k\) groups.
  3. Calculate the means of observations from each group; these are your new centroids.
  4. Repeat 2) and 3) until nothing changes anymore

Steps 2 and 3 each reduce (or leave unchanged) the within-group sum of squared distances between points and their centroids, so iterating them converges to a (local) minimum of that criterion.
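
A small sketch of that criterion (computing the within-group sum of squares by hand and comparing it with what kmeans() reports; xs and wss_by_hand are just illustrative names):

xs <- scale(faithful)
km <- kmeans(xs, centers = 2, nstart = 25)
wss_by_hand <- sum(sapply(1:2, function(j)
  sum(sweep(xs[km$cluster == j, , drop = FALSE], 2, km$centers[j, ])^2)))
c(by_hand = wss_by_hand, reported = km$tot.withinss)  # the two should agree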

Comments

Pros

  • Computationally efficient, even for large data sets.
  • Only an \(n \times k\) distance matrix is needed (rather than \(n \times n\))
  • Relatively straightforward
  • Often provides clearer groups than HC.

Cons

  • stochastic, i.e. random (as opposed to deterministic)
  • can return a local optimum rather than the global one
  • not suitable for categorical \(X\)
  • Groups will be found no matter what.

Wine

  • Let’s return to the wine data set

  • Let’s apply HC and \(k\)-means and then investigate the results…

  • Recall that complete linkage appeared to be the most appropriate linkage choice and it suggested 2 or 3 groups.

Code
par(mar=c(0,4,1,0))
library(gclus)
data(wine)
wc <- hclust(dist(scale(wine[,-1])),method="complete")
w2 <- cutree(wc,2)
w3 <- cutree(wc,3)
# abline(h=1, col=2, lwd=2)
plot(wc, main= "Complete Linkage", labels=FALSE)

Code
set.seed(13)
wk3 <- kmeans(scale(wine[,-1]),3,nstart=25)
par(mar = c(3.9, 3.9, 1, 1))  
pairs(wine[,-1], col=wk3$clus, main = "k-means 3-group solution")

Comparing Results

  • By looking at the plots, it seems as though the results are pretty similar between \(k\)-means and HC.

  • The resolution makes it tough to tell, but the solutions are not identical.

  • Both methods are finding most of the group structure, but…

    • \(k\)-means only misclassifies 6 of the 178 wines
    • whereas HC misclassifies 29 of them

Assessing Results

  • But remember, we actually know the true classes (we simply withheld them when clustering)

  • The wine data set, among others (e.g. Iris, crabs), is a benchmark clustering data set used to assess performance

  • With the understanding that we wouldn't be able to do this in an authentic clustering problem, in these scenarios we can build confusion matrices and calculate classification error, etc.

Confusion Matrices

wineknown <- c("Barolo","Grignolino","Barbera")[wine[,1]]
ktab <- table(wineknown,wk3$clus)
htab <- table(wineknown,w3)

HC

library(kableExtra)

htab %>%
  kable(booktabs = T) %>%
  kable_styling()
            1  2  3
Barbera     0  0 48
Barolo     51  8  0
Grignolino 18 50  3

\(k\)-means

ktab %>%
  kable(booktabs = T) %>%
  kable_styling()
            1  2  3
Barbera     0 48  0
Barolo     59  0  0
Grignolino  3  3 65

Notice how the correctly labeled observations do not necessarily fall on the main diagonal!
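
Since the cluster labels are arbitrary, one quick way to score these tables is to match each true class to the cluster that captures most of its wines (the row maxima); this simple heuristic assumes each class maps to a distinct cluster, which holds here:

sum(htab) - sum(apply(htab, 1, max))  # HC:      178 - 149 = 29 misclassified
sum(ktab) - sum(apply(ktab, 1, max))  # k-means: 178 - 172 =  6 misclassified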

Partitioning and Hierarchical

  • Neither HC nor \(k\)-means can really be considered “statistical” in the sense of:

    • What model are we assuming?
    • How can we test whether the model is accurate?
  • They are basically ad hoc methods that mostly follow from intuition and tend to perform reasonably well.

  • Eventually, we’ll see clustering via mixture models; the unsupervised equivalent of discriminant analysis.

Dealing with Random Starts

  • As mentioned, \(k\)-means minimizes the within group sum of squares (WSS); however, it does so locally (not globally)

  • Furthermore, since it’s initialized randomly, results change from run to run.

  • Importantly: it’s fast! So we can run it many times, and choose the solution with the smallest within-group sum of squares for those runs.

  • In fact, there's a built-in argument in R for this (nstart)…

Iris

  • This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the following four variables:

    • sepal length and width
    • petal length and width
  • The data set contains the measurements on 50 flowers from each of 3 species of iris.

  • The species are Iris setosa, versicolor, and virginica.

Iris plot

par(mar = c(3.9, 3.9, 0, 1))  
plot(iris[,-5], col=iris[,5])

Demo 1

set.seed(1)
kiris <- kmeans(iris[,-5], 3)
table(iris[,5], 
      kiris$cluster)
            
              1  2  3
  setosa      0  0 50
  versicolor 48  2  0
  virginica  14 36  0
set.seed(2)
kiris <- kmeans(iris[,-5], 3)
table(iris[,5], 
      kiris$cluster)
            
              1  2  3
  setosa      0 50  0
  versicolor 48  0  2
  virginica  14  0 36

These are the same solution but with swapped labels

Demo 2

set.seed(3)
kiris <- kmeans(iris[,-5], 3)
table(iris[,5], 
      kiris$cluster)
            
              1  2  3
  setosa     33  0 17
  versicolor  0 46  4
  virginica   0 50  0
set.seed(4)
kiris <- kmeans(iris[,-5], 3)
table(iris[,5], 
      kiris$cluster)
            
              1  2  3
  setosa      0  0 50
  versicolor 48  2  0
  virginica  14 36  0

These are not the same solution

Demo 3

ix <- iris[,-5]
class <- iris[,5]
ko <- kmeans(ix, 3, nstart=25)
table(class, ko$cluster)
            
class         1  2  3
  setosa      0  0 50
  versicolor  2 48  0
  virginica  36 14  0

If we run multiple starts, R will automatically report the solution which had the smallest within group sum of squares

Determining Number of Groups

  • For hierarchical clustering, we have a way of arguing for the number of groups present in the data (the largest jump in the dendrogram).

  • For \(k\)-means, we need to specify \(k\), but the data can still provide some guidance.

  • We can plot the WSS (within-group sum of squares) for differing numbers of groups

WSS

clustore <- matrix(0, 
        nrow=150, ncol=10)
wsstore <- NULL
x <- scale(iris[,-5])
for(i in 1:10){
  dum <- kmeans(x, i, nstart=25)
  clustore[, i] <- dum$cluster
  wsstore[i] <- dum$tot.withinss
}
plot(wsstore)
par(mar=c(4,4,1.1,0))
plot(wsstore, ylab="Within Sum of Squares", main= "Scree Plot")

We look for the “elbow” in the so-called “scree” plot.

No groups?

Also as previously noted, \(k\)-means can seem like it finds groups, even when they may not exist.

par(mar = c(3.9, 3.9, 1, 1))  
nogroups <- cbind(rnorm(500),rnorm(500))
plot(nogroups)

k-means “solution”

\(k\)-means with \(k\) = 7 will always return a “solution”

par(mar = c(3.9, 3.9, 1, 1))  
set.seed(311)
nogk <- kmeans(scale(nogroups),7)
plot(nogroups, col= nogk$cluster)

WSS (no groups)

clustore <- matrix(0, 
        nrow=nrow(nogroups), ncol=10)
wsstore <- NULL
x <- scale(nogroups)
for(i in 1:10){
  dum <- kmeans(x, i, nstart=25)
  clustore[, i] <- dum$cluster
  wsstore[i] <- dum$tot.withinss
}
par(mar = c(4, 4, 1.1, 0))
plot(wsstore, ylab="Within Sum of Squares", main= "Scree Plot")