 | eruptions <dbl> | waiting <dbl> |
---|---|---|
1 | 3.600 | 79 |
2 | 1.800 | 54 |
3 | 3.333 | 74 |
4 | 2.283 | 62 |
5 | 4.533 | 85 |
6 | 2.883 | 55 |
Lecture 19: Mixture Models
University of British Columbia Okanagan
So far we’ve seen a few methods for clustering:
Recall that clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set.
This lecture will talk about Gaussian Mixture Models (GMM):
Recall that clustering is an unsupervised task; we do not have any labelled data (i.e. we don't have a response variable).
It is often easier to obtain unlabeled data than labeled data (which can require human intervention).
The unsupervised task is inherently more challenging than the supervised task.
The faithful dataset in R contains measurements of the waiting time between eruptions and the duration of each eruption of the Old Faithful geyser in Yellowstone National Park.
The data frame consists of 272 observations on 2 variables:
eruptions: Eruption time in mins (numeric)
waiting: Waiting time to next eruption in mins (numeric)
N.B. To see the help file for any dataset (or function) in R, use ?name_of_thing
Jump to the iris data set.
Many would argue that there are two clusters:
Rather than creating the clusters ourselves, we look for automated ways for computers to find these natural groupings for us.
While this is convenient in low-dimensional settings, it is absolutely crucial in higher dimensions, where we can no longer visualize our data.
For instance, we could run something like mclust to obtain an optimal* partition of the data.
[MP] Mixture Model-Based Classification by Paul D. McNicholas
[FS] Finite Mixture and Markov Switching Models by Sylvia Frühwirth-Schnatter (2006)
[MG] Finite Mixture Models by Geoffrey J. McLachlan and David Peel. New York: John Wiley & Sons
[LB] Mixture Models: Theory, Geometry, and Applications by Bruce G. Lindsay
As in the previous units, by way of example we will look at some labelled data and then perform clustering (pretending that we don't know the labels).
Common “benchmark” data sets in the clustering world include: iris, crabs, wine
This exercise is quite common when trying to assess a clustering algorithm (remember, in a genuine clustering problem we couldn't calculate a confusion matrix or an accuracy measure, for example).
Here I am using "clusters" to mean groups of data that clump together, but really these are just the known classifications corresponding to the species Iris setosa, versicolor, and virginica.
Sepal.Length <dbl> | Sepal.Width <dbl> | Petal.Length <dbl> | Petal.Width <dbl> | Species <fct> |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
4.6 | 3.4 | 1.4 | 0.3 | setosa |
5.0 | 3.4 | 1.5 | 0.2 | setosa |
4.4 | 2.9 | 1.4 | 0.2 | setosa |
4.9 | 3.1 | 1.5 | 0.1 | setosa |
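As a hedged illustration of the exercise just described (cluster the four measurements while ignoring Species, then compare the result to the known labels), a minimal sketch with the mclust package:

library(mclust)
fit = Mclust(iris[, 1:4], G = 3)  # cluster, pretending the labels are unknown
table(Species = iris$Species, Cluster = fit$classification)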
Jump to the geyser data set.
Here we have fit a multivariate normal distribution to each sub-group in the data.
That is, we find the maximum likelihood estimates (MLEs) of each group's mean vector and covariance matrix.
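As a minimal sketch of what fitting a multivariate normal to each sub-group means, using the iris species as the sub-groups (the group-wise MLEs are just the sample mean vectors and the n-divisor sample covariance matrices):

# MLEs of a multivariate normal, computed separately within each sub-group
X = iris[, 1:4]
mles = lapply(split(X, iris$Species), function(Xg) {
  n = nrow(Xg)
  list(mean = colMeans(Xg),
       cov  = (n - 1) / n * cov(Xg))  # the MLE of the covariance uses divisor n
})
mles$setosa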
Mixture models are just a convex combination of statistical distributions. In general, for a random vector $\mathbf{X}$, a $G$-component finite mixture model has density
$$f(\mathbf{x}) = \sum_{g=1}^{G} \pi_g\, f_g(\mathbf{x} \mid \boldsymbol\theta_g).$$
The $\pi_g$ are the mixing proportions (or weights), with $\pi_g > 0$ and $\sum_{g=1}^{G}\pi_g = 1$.
The $f_g(\mathbf{x} \mid \boldsymbol\theta_g)$ are the component densities.
Since the mixing proportions are positive and sum to one, $f(\mathbf{x})$ is itself a valid density.
In theory, the component densities can come from any family of distributions (and need not all be the same).
In practice, however, they are usually taken to be of the same parametric form.
Most commonly, we assume a Gaussian mixture model (GMM), where each component density is a (multivariate) normal density $\phi(\mathbf{x} \mid \boldsymbol\mu_g, \boldsymbol\Sigma_g)$.
Thus the GMM has density
$$f(\mathbf{x}) = \sum_{g=1}^{G} \pi_g\, \phi(\mathbf{x} \mid \boldsymbol\mu_g, \boldsymbol\Sigma_g).$$
Other robust alternatives include:
Mixtures of Student-$t$ distributions
Mixtures of skewed distributions
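As a small sketch of this convex combination in code (dmix is a hypothetical helper with two equally weighted normal components):

# Mixture density f(x) = pro_1 * f_1(x) + pro_2 * f_2(x) with two normal components
dmix = function(x, pro = c(0.5, 0.5), mean = c(2, 4), sd = c(1, 1)) {
  pro[1] * dnorm(x, mean[1], sd[1]) + pro[2] * dnorm(x, mean[2], sd[2])
}
dmix(3)                     # mixture density evaluated at x = 3
integrate(dmix, -Inf, Inf)  # integrates to 1, so it is a valid density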
This can be viewed as a mixture of 2 univariate normals
par(mfrow=c(1,2))
# Two normal components with means two standard deviations apart
mu1 = 4; mu2 = 2
sd1 = sd2 = 1
meanvec = c(mu1, mu2)
sdvec = c(sd1, sd2)
lind = which.min(meanvec)
rind = which.max(meanvec)
xrange = c(meanvec[lind]-3.5*sdvec[lind],
meanvec[rind]+3.5*sdvec[rind])
# Two-curve plot
curve(dnorm(x, mean = mu1, sd = sd1), col = 2, xlim = xrange,
lwd = 2, ylab="Density")
curve(dnorm(x, mean = mu2, sd = sd2), add = TRUE, col = 4, lwd = 2)
# One mixture curve plot
curve(0.5*dnorm(x, mean = mu1, sd = sd1) + 0.5*dnorm(x, mean = mu2, sd = sd2),
xlim = xrange,
lwd = 2, ylab="Density")
par(mfrow=c(1,2))
# Two normal components with means only one standard deviation apart
mu1 = 3; mu2 = 2
sd1 = sd2 = 1
meanvec = c(mu1, mu2)
sdvec = c(sd1, sd2)
lind = which.min(meanvec)
rind = which.max(meanvec)
xrange = c(meanvec[lind]-3.5*sdvec[lind],
meanvec[rind]+3.5*sdvec[rind])
# Two-curve plot
curve(dnorm(x, mean = mu1, sd = sd1), col = 2, xlim = xrange,
lwd = 2, ylab="Density")
curve(dnorm(x, mean = mu2, sd = sd2), add = TRUE, col = 4, lwd = 2)
# One mixture curve plot
curve(0.5*dnorm(x, mean = mu1, sd = sd1) + 0.5*dnorm(x, mean = mu2, sd = sd2),
xlim = xrange,
lwd = 2, ylab="Density")
par(mfrow=c(1,2))
# Two normal components with different means and different standard deviations
mu1 = 5; mu2 = 2
sd1 = 1; sd2 = 2
meanvec = c(mu1, mu2)
sdvec = c(sd1, sd2)
lind = which.min(meanvec)
rind = which.max(meanvec)
xrange = c(meanvec[lind]-4*sdvec[lind],
meanvec[rind]+4*sdvec[rind])
# Two-curve plot
curve(dnorm(x, mean = mu1, sd = sd1), col = 2, xlim = xrange,
lwd = 2, ylab="Density")
curve(dnorm(x, mean = mu2, sd = sd2), add = TRUE, col = 4, lwd = 2)
# One mixture curve plot
curve(0.5*dnorm(x, mean = mu1, sd = sd1) + 0.5*dnorm(x, mean = mu2, sd = sd2),
xlim = xrange,
lwd = 2, ylab="Density")
faithful
GMM 1D fit
library(mclust)   # provides Mclust()
mod1 = Mclust(faithful$waiting)
theta = summary(mod1, parameters = TRUE)
hist(faithful$waiting, freq = FALSE, ylim = c(0,0.045), main = "MCLUST fit", xlab = "waiting time (in minutes)")
#lines(density(faithful$waiting))
# Weighted component densities (red and blue) and their sum, the mixture density (dashed)
curve(theta$pro[1]*dnorm(x, theta$mean[1], sqrt(theta$variance[1])),
      col = 2, add = TRUE, lwd = 4)
curve(theta$pro[2]*dnorm(x, theta$mean[2], sqrt(theta$variance[2])),
      col = 4, add = TRUE, lwd = 4)
curve(theta$pro[1]*dnorm(x, theta$mean[1], sqrt(theta$variance[1])) +
        theta$pro[2]*dnorm(x, theta$mean[2], sqrt(theta$variance[2])),
      col = 1, add = TRUE, lty = 2, lwd = 2)
Finite mixture models are ideally suited for clustering
Clusters correspond to the components of the mixture.
To derive the MLEs of the Gaussian mixture model, we first need a likelihood function …
The likelihood function for this mixture is
$$L(\boldsymbol\vartheta) = \prod_{i=1}^{n} f(\mathbf{x}_i) = \prod_{i=1}^{n}\sum_{g=1}^{G}\pi_g\,\phi(\mathbf{x}_i \mid \boldsymbol\mu_g, \boldsymbol\Sigma_g).$$
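As a small sketch, the (log of the) likelihood above can be computed directly; loglik_gmm below is a hypothetical helper for the univariate two-component case, evaluated at some rough, assumed parameter values:

# Observed-data log-likelihood of a univariate two-component GMM
loglik_gmm = function(x, pro, mu, sigma) {
  dens = pro[1] * dnorm(x, mu[1], sigma[1]) + pro[2] * dnorm(x, mu[2], sigma[2])
  sum(log(dens))
}
loglik_gmm(faithful$waiting, pro = c(0.36, 0.64), mu = c(54, 80), sigma = c(6, 6))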
Most commonly, we use the expectation-maximization (EM) algorithm, which helps us find parameter estimates with missing data (group memberships are missing).
More generally, the EM algorithm is an iterative method for finding maximum likelihood estimates of parameters in statistical models that depend on unobserved latent variables.
We assume that each observed data point has a corresponding unobserved latent variable specifying the mixture component to which each data point belongs.
We introduce a new latent variable that represents the group memberships:
$$z_{ig} = \begin{cases} 1 & \text{if observation } i \text{ belongs to component } g,\\ 0 & \text{otherwise.} \end{cases}$$
We call $\mathbf{z}_i = (z_{i1}, \ldots, z_{iG})$ the component membership (indicator) vector for observation $i$.
Now that we have that, let’s look at a non-technical summary of the EM algorithm…
We can now represent our mixture model in terms of the "complete" data $(\mathbf{x}_i, \mathbf{z}_i)$, $i = 1, \ldots, n$.
The "complete" likelihood takes the form:
$$L_c(\boldsymbol\vartheta) = \prod_{i=1}^{n}\prod_{g=1}^{G}\left[\pi_g\,\phi(\mathbf{x}_i \mid \boldsymbol\mu_g, \boldsymbol\Sigma_g)\right]^{z_{ig}}.$$
We'll work with the univariate normal distribution for simplicity.
The resulting estimates $\hat{z}_{ig}$, computed in the E-step given the current parameter values, are
$$\hat{z}_{ig} = \frac{\hat\pi_g\,\phi(x_i \mid \hat\mu_g, \hat\sigma_g^2)}{\sum_{h=1}^{G}\hat\pi_h\,\phi(x_i \mid \hat\mu_h, \hat\sigma_h^2)}.$$
For each observation $i$, $\hat{z}_{ig}$ is interpreted as the estimated probability that observation $i$ belongs to component $g$.
So, for example, an observation may be given appreciable probability of belonging to more than one component, rather than being assigned to a single group with certainty.
But! We usually still want a "best" guess as to which group the observation belongs to. Continuing with the example above, what do you think might happen …
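To make the E- and M-steps concrete, here is a minimal, hand-rolled EM sketch for a univariate two-component GMM fit to the waiting times (the starting values and the fixed 100 iterations are arbitrary choices for illustration; in practice we let mclust handle all of this):

# A bare-bones EM for a univariate two-component GMM (illustration only)
x = faithful$waiting
pro = c(0.5, 0.5); mu = c(50, 80); sigma = c(10, 10)  # crude starting values
for (iter in 1:100) {                                 # fixed number of iterations for simplicity
  # E-step: responsibilities z[i, g] = estimated P(observation i belongs to component g)
  d1 = pro[1] * dnorm(x, mu[1], sigma[1])
  d2 = pro[2] * dnorm(x, mu[2], sigma[2])
  z  = cbind(d1, d2) / (d1 + d2)
  # M-step: re-estimate the parameters using the responsibilities as weights
  pro      = colMeans(z)
  mu[1]    = sum(z[, 1] * x) / sum(z[, 1])
  mu[2]    = sum(z[, 2] * x) / sum(z[, 2])
  sigma[1] = sqrt(sum(z[, 1] * (x - mu[1])^2) / sum(z[, 1]))
  sigma[2] = sqrt(sum(z[, 2] * (x - mu[2])^2) / sum(z[, 2]))
}
pro; mu; sigma                # fitted mixture parameters
classification = max.col(z)   # "best guess": the most probable component for each point
table(classification)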
To see this in action, let's take a closer look at the MCLUST output when fit to our Old Faithful geyser data.
To recap:
MCLUST imposes the GMM structure.
We introduce indicator variables $z_{ig}$ for the group memberships.
The model is fitted using the EM algorithm.
Our estimates $\hat{z}_{ig}$ give the estimated probability that observation $i$ belongs to group $g$.
GMM is a type of soft clustering.
This means that a data object can exist in more than one cluster with a certain probability or degree of membership.
This is unlike hard clustering methods, which define a non-overlapping partition of the data set (e.g. k-means).
The matrix of estimated posterior probabilities $\hat{z}_{ig}$ has one row per observation and one column per mixture component.
The rows of this matrix necessarily sum to 1.
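For example, assuming mod1 is the univariate Mclust fit from earlier, the posterior probabilities live in mod1$z:

head(round(mod1$z, 3))     # one row per observation, one column per mixture component
range(rowSums(mod1$z))     # each row sums to 1
head(mod1$classification)  # the MAP (most probable) component for each observation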
The EM algorithm is a deterministic, monotonic optimization algorithm; however, it is susceptible to local maxima!
Hence if we begin the algorithm at different starting points, we risk getting different clustering results.
Image Source: CLP-1
One solution to problem 1 is to start your algorithm at several different initializations.
Another potential solution is to start in a smart location (determined by some other clustering method, e.g. k-means).
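A minimal sketch of the "smart start" idea, using k-means on the waiting times to build starting values that could then be handed to the EM algorithm (the object names here are hypothetical):

set.seed(1)
x  = faithful$waiting
km = kmeans(x, centers = 2, nstart = 25)           # an initial hard partition from k-means
pro0   = as.numeric(table(km$cluster)) / length(x) # starting mixing proportions
mu0    = tapply(x, km$cluster, mean)               # starting component means
sigma0 = tapply(x, km$cluster, sd)                 # starting component standard deviations
pro0; mu0; sigma0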
The total number of "free" parameters in a GMM is $(G-1) + Gp + G\,\dfrac{p(p+1)}{2}$: the mixing proportions, the mean vectors, and the covariance matrices, respectively.
For the EM, we have to invert these $p \times p$ covariance matrices at every iteration.
This can get quite computationally expensive…
One potential solution to problem 2 is to assume that the covariance matrices are equal across groups.
The constraint $\boldsymbol\Sigma_1 = \boldsymbol\Sigma_2 = \cdots = \boldsymbol\Sigma_G$ reduces the number of covariance parameters from $G\,\frac{p(p+1)}{2}$ to $\frac{p(p+1)}{2}$.
This however is a very strong assumption about the underlying data, so perhaps we could impose a softer constraint …
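As a quick check of these counts for the Old Faithful data used below ($p = 2$ variables, $G = 3$ components):
$$\underbrace{(G-1)}_{\text{proportions}} + \underbrace{Gp}_{\text{means}} + \underbrace{G\,\tfrac{p(p+1)}{2}}_{\text{covariances}} = 2 + 6 + 9 = 17 \quad \text{(unconstrained covariances)},$$
$$\underbrace{(G-1)}_{\text{proportions}} + \underbrace{Gp}_{\text{means}} + \underbrace{\tfrac{p(p+1)}{2}}_{\text{shared covariance}} = 2 + 6 + 3 = 11 \quad \text{(equal covariances)}.$$
The second count matches the df of 11 reported in the mclust summary output later in the lecture.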
MCLUST exploits an eigenvalue decomposition of the group covariance matrices to give a wide range of structures that use between $1$ and $G\,\frac{p(p+1)}{2}$ covariance parameters.
The eigenvalue decomposition has the form
$$\boldsymbol\Sigma_g = \lambda_g \mathbf{D}_g \mathbf{A}_g \mathbf{D}_g^\top,$$
where $\lambda_g$ is a scalar controlling the volume, $\mathbf{A}_g$ is a diagonal matrix (with determinant 1) controlling the shape, and $\mathbf{D}_g$ is an orthogonal matrix controlling the orientation of the $g$th component.
We may constrain the components of the eigenvalue decomposition of $\boldsymbol\Sigma_g$ to be equal (E) or variable (V) across groups.
Furthermore, the matrices $\mathbf{A}_g$ and $\mathbf{D}_g$ may be restricted to the identity matrix, giving diagonal or spherical components.
Fraley & Raftery (2000) describe the mclust software, which is available as an R package that efficiently performs this constrained model-based clustering.
ID | Volume | Shape | Orientation | Covariance Decomposition | Number of free covariance parameters |
---|---|---|---|---|---|
EII | Equal | Spherical | NA | $\lambda \mathbf{I}$ | $1$ |
VII | Variable | Spherical | NA | $\lambda_g \mathbf{I}$ | $G$ |
EEI | Equal | Equal | Coordinate axes | $\lambda \mathbf{A}$ | $p$ |
VEI | Variable | Equal | Coordinate axes | $\lambda_g \mathbf{A}$ | $p + G - 1$ |
EVI | Equal | Variable | Coordinate axes | $\lambda \mathbf{A}_g$ | $1 + G(p-1)$ |
VVI | Variable | Variable | Coordinate axes | $\lambda_g \mathbf{A}_g$ | $Gp$ |
EEE | Equal | Equal | Equal | $\lambda \mathbf{D}\mathbf{A}\mathbf{D}^\top$ | $p(p+1)/2$ |
EEV | Equal | Equal | Variable | $\lambda \mathbf{D}_g\mathbf{A}\mathbf{D}_g^\top$ | $p + G\,p(p-1)/2$ |
VEV | Variable | Equal | Variable | $\lambda_g \mathbf{D}_g\mathbf{A}\mathbf{D}_g^\top$ | $p + G - 1 + G\,p(p-1)/2$ |
VVV | Variable | Variable | Variable | $\lambda_g \mathbf{D}_g\mathbf{A}_g\mathbf{D}_g^\top$ | $G\,p(p+1)/2$ |
Cluster shapes that correspond to the covariance structures on the previous slide
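As a hedged sketch of how one might request a particular structure directly in mclust (here the common-covariance EEE model with three components; fitting it to the faithful data gives output like that shown below):

library(mclust)
fit_eee = Mclust(faithful, G = 3, modelNames = "EEE")  # restrict the search to this one model
summary(fit_eee, parameters = TRUE)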
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust EEE (ellipsoidal, equal volume, shape and orientation) model with 3
components:
log-likelihood n df BIC ICL
-1126.326 272 11 -2314.316 -2357.824
Clustering table:
1 2 3
40 97 135
Mixing probabilities:
1 2 3
0.1656784 0.3563696 0.4779520
Means:
[,1] [,2] [,3]
eruptions 3.793066 2.037596 4.463245
waiting 77.521051 54.491158 80.833439
Variances:
[,,1]
eruptions waiting
eruptions 0.07825448 0.4801979
waiting 0.48019785 33.7671464
[,,2]
eruptions waiting
eruptions 0.07825448 0.4801979
waiting 0.48019785 33.7671464
[,,3]
eruptions waiting
eruptions 0.07825448 0.4801979
waiting 0.48019785 33.7671464
The “best” model is 3 components
This assumes the EEE covariance structure
But how was this determined?
By default, MCLUST will fit a $G$-component mixture model for each of the 14 covariance structures, for $G = 1, \ldots, 9$.
For each of these 126 models, a BIC value is calculated.
The Bayesian Information Criterion or BIC (Schwarz, 1978) is one of the most widely known and used tools in statistical model selection. The BIC can be written:
$$\text{BIC} = 2\log L(\hat{\boldsymbol\vartheta}) - k\log n,$$
where $L(\hat{\boldsymbol\vartheta})$ is the maximized likelihood, $k$ is the number of free parameters, and $n$ is the sample size. (In this form, which is the one mclust reports, larger BIC values are preferred.)
It can be shown that the BIC gives consistent estimates of the number of components in a mixture model under certain regularity conditions (Keribin, 1998, 2000).
The BIC plot reveals that the model which obtains the largest BIC value is the 3-component mixture model with the ‘EEE’ covariance structure.
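As a quick arithmetic check against the summary output above (log-likelihood $-1126.326$, $k = 11$ free parameters, $n = 272$):

2 * (-1126.326) - 11 * log(272)  # -2314.316, matching the reported BIC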
Why Mixture Model-based Clustering?
Cons (a.k.a. open avenues for research)