Data 311: Machine Learning

Lecture 19: Mixture Models

Dr. Irene Vrbik

University of British Columbia Okanagan

Introduction

  • So far we’ve seen a few methods for clustering: \(k\)-means, hierarchical.

  • Recall that clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set.

  • This lecture will talk about Gaussian Mixture Models (GMMs).

Recall Unsupervised Learning

  • Recall that clustering is an unsupervised task; we do not have any labelled data (i.e. we don’t have a \(Y\))

  • It is often easier to obtain unlabelled data than labelled data (which can require human intervention)

  • The unsupervised task is inherently more challenging than the supervised task.

Clustering methods

  • There are various clustering techniques, each with its own strengths, assumptions, and use cases.
  • For hierarchical clustering and \(k\)-means, we relied on Euclidean distance.
  • In contrast to these distance-based algorithms, a GMM is a probabilistic, model-based clustering technique that assumes the data are generated from a mixture of multiple Gaussian distributions.

Old Faithful Data

  • The faithful dataset in R contains measurements of eruption duration and the waiting time to the next eruption for the Old Faithful geyser in Yellowstone National Park.

  • The data frame consists of 272 observations on 2 variables:

    • eruptions Eruption time in mins (numeric)

    • waiting Waiting time to next eruption in mins (numeric)

N.B. to see the help file for any dataset (or function in R) use ?name_of_thing

head(faithful)

(We will carry out a similar exercise with the iris data set later in this lecture.)

Many would argue that there are two clusters:

  1. Short eruptions followed shortly thereafter by another eruption
  2. Longer eruptions with longer delay before the next eruption
plot(faithful, 
     ylab = "Waiting time to next eruption (in mins)", 
     xlab = "Eruption duration (mins)")

Determining Clusters

  • Rather than creating the clusters ourselves, we look for automated ways for computers to find these natural groupings for us.

  • While this is convenient in lower-dimensional space, it is absolutely crucial in higher dimensions, where we can no longer visualize our data.

  • For instance, we could run something like mclust to obtain an optimal* partition of the data.

faithful clustered

library(mclust)
mod2 <- Mclust(faithful)
plot(mod2, what = "classification")

The points are coloured by group allocation.

Interestingly, mclust determines that there are 3 groups in the data.

But how did mclust determine that 3 clusters was optimal?

Mixture Models

Resources

[MP] Mixture Model-Based Classification by Paul D. McNicholas

[FS] Finite Mixture and Markov Switching Models by Fruhwirth-Schnatter, S. (2006)

[MG] Finite Mixture Models by McLachlan, Geoffrey J; Peel, David/ New York: John Wiley & Sons

[LB] Mixture models: theory, geometry, and applications by Lindsay, Bruce G

Motivating Example

  • As in the previous units, by way of example we will look at some labelled data and then perform clustering (pretending that we don’t know the labels)

  • Common “benchmark” data sets in the clustering world include: iris, crabs, wine

  • This exercise is quite common when trying to assess a clustering algorithm (remember that in a genuine clustering problem we have no labels, so we couldn’t calculate a confusion matrix or an accuracy measure, for example)

Iris

Description

Image source: kedro

Here I am using “clusters” to mean groups of data that clump together, but really these are just the known classifications corresponding to the three species: Iris setosa, versicolor, and virginica.

Data frame

iris

(Compare with the Old Faithful geyser data seen earlier.)

Plot

library(ggplot2)
library(GGally)

ggpairs(iris,                 # Data frame
        columns = 1:4,        # Columns
        aes(color = Species,  # Color by group (cat. variable)
            alpha = 0.5))     # Transparency 

We can see that these species form distinct clusters (especially in the Sepal.Width vs Petal.Width scatterplot).

Fitted GMM

Here we have fit a multivariate normal distribution to each sub-group in the data.

That is, we find the maximum likelihood estimates (MLEs) \(\hat{\boldsymbol\mu}_g\), \(\hat{\mathbf{\Sigma}}_g\) for \(g = 1, 2, 3\)

Mixture Models

Mixture models are just convex combinations of statistical distributions. In general, for a random vector \(\mathbf{X}\), a mixture distribution is defined by \[f(\mathbf{x}) = \sum_{g=1}^G \pi_g p_g(\mathbf{x} \mid \theta_g)\] where \(G\) is the number of groups, the mixing weights satisfy \(0 \leq \pi_g \leq 1\) and \(\sum_{g=1}^G \pi_g = 1\), and the \(p_g(\mathbf{x} \mid \theta_g)\) are statistical distributions with parameters \(\theta_g\).
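To build intuition, here is a small sketch (with made-up parameter values) of how data could be generated from such a mixture: first draw a component label with probabilities \(\pi_g\), then draw from that component’s distribution.

set.seed(311)
n    <- 500
pi_g <- c(0.3, 0.7)               # mixing proportions (illustrative)
mu_g <- c(2, 5); sd_g <- c(1, 1)  # component parameters (illustrative)

# step 1: draw a component label for each observation
grp <- sample(1:2, size = n, replace = TRUE, prob = pi_g)
# step 2: draw each observation from its component's normal distribution
x <- rnorm(n, mean = mu_g[grp], sd = sd_g[grp])
hist(x, freq = FALSE, breaks = 30)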

Terminology

  • The \(\pi_g\) are called mixing proportions and we will write \(\bf{\pi} = (\pi_1, \pi_2, \dots, \pi_G)\)

  • The \(p_g(\mathbf{x})\) are called component densities

  • Since the \(p_g(\mathbf{x})\) are densities, \(f(\mathbf{x})\) is itself a density, called a \(G\)-component finite mixture density.

Statistical Distributions

  • In theory, \(p_g(\mathbf{x} \mid \theta_g)\) could be a different distribution type for each \(g\).

  • In practice, however, \(p_g(\mathbf{x} \mid \theta_g)\) is assumed to be the same across groups, but with different \(\theta_g\).

  • Most commonly, we assume a Gaussian mixture model (GMM), where \(p_g(\mathbf{x} \mid \theta_g)\) is assumed to be multivariate Gaussian (aka multivariate normal), denoted \(\phi(\mathbf{x} \mid \boldsymbol\mu_g, \mathbf{\Sigma}_g)\)

Gaussian Mixture Model

Thus \(f(\mathbf{x})\) — sometimes written \(f(\mathbf{x} \mid \Theta)\) or \(f(\mathbf{x} ; \Theta)\) — for a Gaussian mixture model has the form: \[\begin{align} f(\mathbf{x} \mid \Theta) &= \sum_{g=1}^G \pi_g \phi(\mathbf{x} \mid \boldsymbol\mu_g, \mathbf{\Sigma}_g) \\ &= \sum_{g=1}^G \pi_g \frac{|\mathbf{\Sigma}_g|^{-1/2}}{(2\pi)^{p/2}} \exp \left( - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_g)^{\mathrm {T}} \mathbf{\Sigma}_g^{-1} (\mathbf{x} - \boldsymbol{\mu}_g) \right) \end{align}\] where \(\phi\) denotes the multivariate normal density, \(p\) is the dimension of \(\mathbf{x}\), \(\boldsymbol{\mu}_g\) is the mean vector and \(\mathbf{\Sigma}_g\) the covariance matrix for group \(g\), and \(\Theta = \{\boldsymbol{\mu}_1, ..., \boldsymbol{\mu}_G, \mathbf{\Sigma}_1, ..., \mathbf{\Sigma}_G, \pi_1, ..., \pi_G\}\)
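As a small sketch of evaluating this density numerically (assuming the mvtnorm package is available; the parameter values are made up for illustration):

library(mvtnorm)                  # provides dmvnorm()

pi_g   <- c(0.4, 0.6)             # mixing proportions
mu1    <- c(0, 0);  mu2    <- c(3, 3)
Sigma1 <- diag(2);  Sigma2 <- matrix(c(1, 0.5, 0.5, 1), 2, 2)

# f(x | Theta) evaluated at a single point x
x <- c(1, 1)
pi_g[1] * dmvnorm(x, mean = mu1, sigma = Sigma1) +
  pi_g[2] * dmvnorm(x, mean = mu2, sigma = Sigma2)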

Distribution Alternatives

Other robust alternatives include heavier-tailed or skewed component densities, for example mixtures of multivariate \(t\) distributions or of skew-normal/skew-\(t\) distributions.

faithful finite mixture 1D

hist(faithful$waiting, freq = FALSE)
lines(density(faithful$waiting)) 

This can be viewed as a mixture of 2 univariate normals

Two Univariate Normals I

curve(dnorm(x, mean = 2, sd = 1), col = 2, xlim = c(-2,9),
      lwd = 2, ylab="Density")
curve(dnorm(x, mean = 5, sd = 1), add = TRUE, col = 4, lwd = 2)

Mixture of 2 univariate Normals I

curve(0.5*dnorm(x, mean = 2, sd = 1) + 0.5*dnorm(x, mean = 5, sd = 1), 
      xlim = c(-2,9),
      lwd = 2, ylab="Density")

Mixture of 2 univariate Normals II

par(mfrow=c(1,2))
mu1=4; mu2 = 2
sd1 = sd2 = 1
meanvec = c(mu1, mu2)
sdvec = c(sd1, sd2)
lind = which.min(meanvec)
rind = which.max(meanvec)
xrange = c(meanvec[lind]-3.5*sdvec[lind],
           meanvec[rind]+3.5*sdvec[rind])

# Two-curve plot
curve(dnorm(x, mean = mu1, sd = sd1), col = 2, xlim = xrange,
      lwd = 2, ylab="Density")
curve(dnorm(x, mean = mu2, sd = sd2), add = TRUE, col = 4, lwd = 2)

# One mixture curve plot
curve(0.5*dnorm(x, mean = mu1, sd = sd1) + 0.5*dnorm(x, mean = mu2, sd = sd2), 
      xlim = xrange,
      lwd = 2, ylab="Density")

Mixture of 2 univariate Normals III

par(mfrow=c(1,2))
mu1=3; mu2 = 2
sd1 = sd2 = 1
meanvec = c(mu1, mu2)
sdvec = c(sd1, sd2)
lind = which.min(meanvec)
rind = which.max(meanvec)
xrange = c(meanvec[lind]-3.5*sdvec[lind],
           meanvec[rind]+3.5*sdvec[rind])

# Two-curve plot
curve(dnorm(x, mean = mu1, sd = sd1), col = 2, xlim = xrange,
      lwd = 2, ylab="Density")
curve(dnorm(x, mean = mu2, sd = sd2), add = TRUE, col = 4, lwd = 2)

# One mixture curve plot
curve(0.5*dnorm(x, mean = mu1, sd = sd1) + 0.5*dnorm(x, mean = mu2, sd = sd2), 
      xlim = xrange,
      lwd = 2, ylab="Density")

Mixture of 2 univariate Normals IV

par(mfrow=c(1,2))
mu1=5; mu2 = 2
sd1 = 1;  sd2 = 2
meanvec = c(mu1, mu2)
sdvec = c(sd1, sd2)
lind = which.min(meanvec)
rind = which.max(meanvec)
xrange = c(meanvec[lind]-4*sdvec[lind],
           meanvec[rind]+4*sdvec[rind])

# Two-curve plot
curve(dnorm(x, mean = mu1, sd = sd1), col = 2, xlim = xrange,
      lwd = 2, ylab="Density")
curve(dnorm(x, mean = mu2, sd = sd2), add = TRUE, col = 4, lwd = 2)

# One mixture curve plot
curve(0.5*dnorm(x, mean = mu1, sd = sd1) + 0.5*dnorm(x, mean = mu2, sd = sd2), 
      xlim = xrange,
      lwd = 2, ylab="Density")

faithful GMM 1D fit

mod1 = Mclust(faithful$waiting)
theta = summary(mod1, parameters = TRUE)

hist(faithful$waiting, freq = FALSE, ylim = c(0,0.045), main = "MCLUST fit", xlab = "waiting time (in minutes)")
#lines(density(faithful$waiting)) 
curve(theta$pro[1]*dnorm(x, theta$mean[1], sqrt(theta$variance[1])), col = 2, add = TRUE, lwd=4) 
curve(theta$pro[2]*dnorm(x, theta$mean[2], sqrt(theta$variance[2])), col = 4, add = TRUE, lwd=4) 
curve(theta$pro[1]*dnorm(x, theta$mean[1], sqrt(theta$variance[1])) + theta$pro[2]*dnorm(x, theta$mean[2], sqrt(theta$variance[2])), col = 1, add = TRUE, lty=2, lwd = 2) 

Mixture Models for clustering

  • Finite mixture models are ideally suited for clustering

  • Clusters correspond to the \(1,...,g,...,G\) sub-populations/groups/components in a finite mixture model.

  • To derive the MLEs of the Gaussian mixture model, we first need a likelihood function …

The Likelihood

The likelihood function for this mixture is \[\begin{align} L(\Theta) &= \prod_{i=1}^N f(\mathbf{x}_i \mid \Theta) = \prod_{i=1}^N \left[ \sum_{g=1}^G \pi_g \phi(\mathbf{x}_i \mid \boldsymbol\mu_g, \mathbf{\Sigma}_g) \right] \end{align}\] and the so-called log-likelihood is: \[\begin{align} \ell(\Theta) &= \log(L(\Theta)) = \sum_{i=1}^N \log\left[ \sum_{g=1}^G \pi_g \phi(\mathbf{x}_i \mid \boldsymbol\mu_g, \mathbf{\Sigma}_g) \right] \end{align}\]
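For example, the log-likelihood of the waiting times under a univariate two-component GMM could be computed as follows (the parameter values here are placeholders for illustration, not the MLEs):

x <- faithful$waiting
pi_g <- c(0.5, 0.5)                # placeholder mixing proportions
mu_g <- c(55, 80); sd_g <- c(6, 6) # placeholder means and sds

# sum over observations of the log of the mixture density
sum(log(pi_g[1] * dnorm(x, mu_g[1], sd_g[1]) +
        pi_g[2] * dnorm(x, mu_g[2], sd_g[2])))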

EM algorithm

  • Most commonly, we use the expectation-maximization (EM) algorithm, which helps us find parameter estimates with missing data (group memberships are missing).

  • More generally, the EM is an iterative algorithm for finding maximum likelihood parameters in statistical models, where the model depends on unobserved latent variables.

  • We assume that each observed data point has a corresponding unobserved latent variable specifying the mixture component to which each data point belongs.

Latent Variable

We introduce a new latent variable that represents the group memberships: \[\begin{equation} Z_{ig} = \begin{cases} 1 & \text{if } \mathbf{x}_i \text{ belongs to group }g \\ 0 & \text{otherwise} \end{cases} \end{equation}\]

We call \(\{X, Z\}\) our “complete” data.

Now that we have that, let’s look at a non-technical summary of the EM algorithm…

Latent Variable Representation

We can now represent our mixture model as \[\begin{align} f(\mathbf{x}_i \mid \Theta) &= \sum_{g=1}^G \pi_g \phi(\mathbf{x}_i \mid \boldsymbol\mu_g, \mathbf{\Sigma}_g) \\ &= \sum_{g=1}^G P(Z_{ig} = 1)\, p(\mathbf{x}_i \mid Z_{ig} = 1) \end{align}\] This is sometimes called the latent variable representation of the mixture model.

Complete Likelihood

The “complete” likelihood takes the form: \[\begin{equation} L_c(\Theta) = \prod_{i=1}^n \prod_{g=1}^G \pi_g^{\hat{z}_{ig}} \phi(x_i|\mu_g, \sigma_g)^{\hat{z}_{ig}} \end{equation}\] so the complete log-likelihood, \(\log \left( L_c(\Theta) \right)\), takes the form: \[\begin{align} \ell_c(\Theta) &= \sum_{i=1}^n \sum_{g=1}^G \hat{z}_{ig} \left[ \log (\pi_g) + \log (\phi(x_i|\mu_g, \sigma_g) )\right] \end{align}\]

We’ll work with the univariate normal distribution for simplicity.

EM Pseudocode

  1. Start the algorithm with random values for \(\hat z_{ig}\).
  2. M-step: assuming those \(\hat{z}_{ig}\) are correct, find the maximum likelihood estimates (MLEs) of \(\pi_g\), \(\boldsymbol{\mu}_g\) and \(\mathbf{\Sigma}_g\)
  3. E-step: assuming those parameters are correct, find the expected value of the group memberships \[\hat{z}_{ig} = \frac{\pi_g \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_g, \mathbf{\Sigma}_g) }{\sum_{h=1}^G \pi_h \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_h, \mathbf{\Sigma}_h)}\]
  4. Repeat 2. and 3. until “changes” are minimal (a rough R sketch follows).
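To make the pseudocode concrete, here is a rough, self-contained R sketch for a univariate two-component GMM. It is a teaching sketch only (the function name em_gmm2 is mine, and this is not how mclust is implemented); note that poor random starts can stall in local maxima, which is exactly Problem 1 discussed later.

em_gmm2 <- function(x, max_iter = 1000, tol = 1e-8) {
  n <- length(x)
  # 1. random start: hard-assign each point to the nearer of two
  #    randomly chosen (distinct) observations -- random values for z-hat
  centres <- sample(unique(x), 2)
  z1 <- as.numeric(abs(x - centres[1]) < abs(x - centres[2]))
  z  <- cbind(z1, 1 - z1)
  loglik_old <- -Inf
  for (iter in 1:max_iter) {
    # 2. M-step: MLEs of pi_g, mu_g, sigma_g given the current z-hat
    pi_g <- colMeans(z)
    mu_g <- sapply(1:2, function(g) sum(z[, g] * x) / sum(z[, g]))
    sd_g <- sapply(1:2, function(g)
      sqrt(sum(z[, g] * (x - mu_g[g])^2) / sum(z[, g])))
    # 3. E-step: expected group memberships given the current parameters
    dens <- cbind(pi_g[1] * dnorm(x, mu_g[1], sd_g[1]),
                  pi_g[2] * dnorm(x, mu_g[2], sd_g[2]))
    z <- dens / rowSums(dens)
    # 4. stop once the log-likelihood barely changes
    loglik <- sum(log(rowSums(dens)))
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(pi = pi_g, mu = mu_g, sd = sd_g, z = z, loglik = loglik)
}

fit <- em_gmm2(faithful$waiting)
fit$mu   # compare with the two modes of the waiting times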

Fuzzy z’s

  • The resulting \(\hat{z}_{ig}\) are not going to be 0’s and 1’s when the algorithm converges.

  • For each observation \(\mathbf{x}_i\) we end up with a vector of probabilities (\(\hat{z}_{i1}\),…,\(\hat{z}_{iG}\)), which we denote by \(\hat{\mathbf{z}}_i\)

  • So, for example, observation \(\mathbf{x}_i\) might have a corresponding estimate of \(\hat{\mathbf{z}}_i = (0.25, 0.60, 0.15)\); in this case \(\mathbf{x}_i\) would get assigned to group 2

  • But! We usually still want a “best” guess as to which group the observation belongs to. Continuing with the example above, what do you think might happen …

MCLUST example

To see this in action, let’s take a closer look at the MCLUST output when it was fit to our Old Faithful geyser data.

To recap:

  • MCLUST imposes the GMM structure.

  • We introduce indicators \(z_{ig}\) such that \(z_{ig} = 1\) if observation \(i\) belongs to group \(g\) and \(z_{ig} = 0\) otherwise.

  • The model is fitted using the EM algorithm.

Classification vector

Our estimates of \(z_{ig}\), denoted \(\hat z_{ig}\), determine our clusters: \[\begin{equation} \text{MAP}(\hat{z}_{ig}) = \begin{cases} 1 & \text{if } \hat z_{ig} = \max_h \{\hat z_{ih}\} \\ 0 & \text{otherwise} \end{cases} \end{equation}\]

# fuzzy z matrix
round(head(mod1$z) , 5)
        [,1]    [,2]
[1,] 0.00010 0.99990
[2,] 0.99991 0.00009
[3,] 0.00415 0.99585
[4,] 0.96757 0.03243
[5,] 0.00000 1.00000
[6,] 0.99981 0.00019
# "hard" zs = MAP(zhat)
head(cbind(ifelse(mod1$classification==1,1,0), ifelse(mod1$classification==2,1,0)))
     [,1] [,2]
[1,]    0    1
[2,]    1    0
[3,]    0    1
[4,]    1    0
[5,]    0    1
[6,]    1    0
# classification vector
head(mod1$classification)
[1] 2 1 2 1 2 1

Soft Clustering

  • GMM is a type of soft clustering.

  • This means that a data object can exist in more than one cluster with a certain probability or degree of membership.

  • This is unlike hard clustering methods, which define a non-overlapping partition of the data set (e.g. \(k\)-means)

  • The matrix of \(\hat z_{ig}\) is often referred to as a “fuzzy” z matrix.

  • The rows of this matrix necessarily sum to 1, as the quick check below confirms.
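A quick check on the earlier univariate fit mod1:

range(rowSums(mod1$z))  # every row of the fuzzy z matrix sums to 1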

Problem 1

  • The EM algorithm is a deterministic optimization algorithm that increases the likelihood monotonically; however, it is susceptible to local maxima!

  • Hence if we begin the algorithm at different starting points, we risk getting different clustering results.

Image Source: CLP-1

Smart/multiple starts

  • One solution to Problem 1 is to start your algorithm from several different initializations

  • Another potential solution is to start in a smart location (determined by some other clustering method, e.g. \(k\)-means)

Problem 2

The total number of “free” parameters in a GMM is: \[\begin{equation} (G-1) + Gp + Gp(p+1)/2 \end{equation}\]

  • \(Gp(p+1)/2\) of these are from the group covariance matrices \(\mathbf{\Sigma}_1,..., \mathbf{\Sigma}_G\)

  • For the EM, we have to invert these \(G\) covariance matrices for each iteration of the algorithm.

  • This can get quite computationally expensive…
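As a quick sanity check on this count, here is a tiny helper function (my own, not part of mclust):

n_free_gmm <- function(G, p) {
  (G - 1) +               # mixing proportions
    G * p +               # mean vectors
    G * p * (p + 1) / 2   # unconstrained covariance matrices
}
n_free_gmm(G = 3, p = 2)  # 17 free parameters for 3 bivariate components

With a common covariance matrix (next slide) the covariance term drops to \(p(p+1)/2 = 3\), giving \(2 + 6 + 3 = 11\) free parameters, which matches the df that mclust reports for the faithful fit shown later.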

Common Covariance

  • One potential solution to Problem 2 is to assume that the covariance matrices are equal across groups

  • Under the constraint \(\mathbf{\Sigma}_g = \mathbf{\Sigma}\) for all \(g\), this reduces to \(p(p+1)/2\) free parameters in the common covariance matrix.

  • This, however, is a very strong assumption about the underlying data, so perhaps we could impose a softer constraint …

Towards MCLUST

  • MCLUST exploits an eigenvalue decomposition of the group covariance matrices to give a wide range of structures that uses between 1 and \(Gp(p+1)/2\) free parameters.

  • The eigenvalue decomposition has the form \[\begin{equation} \mathbf{\Sigma}_g = \lambda_g \mathbf{D}_g \mathbf{A}_g \mathbf{D}'_g \end{equation}\] where \(\lambda_g\) is a constant, \(\mathbf{D}_g\) is a matrix consisting of the eigenvectors of \(\mathbf{\Sigma}_g\), and \(\mathbf{A}_g\) is a diagonal matrix with entries proportional to the eigenvalues of \(\mathbf{\Sigma}_g\). A small numerical illustration follows.
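To see the decomposition numerically, we can take any covariance matrix, extract its eigenvectors and eigenvalues with eigen(), and rebuild it. The normalization \(\det(\mathbf{A}_g) = 1\) used below (so that \(\lambda_g\) captures the volume) is one common convention and is an assumption of this sketch:

S <- cov(faithful)                      # a 2 x 2 covariance matrix to decompose
e <- eigen(S)
D <- e$vectors                          # orientation (eigenvectors)
lambda <- prod(e$values)^(1 / ncol(S))  # volume: |S|^(1/p)
A <- diag(e$values / lambda)            # shape: diagonal with det(A) = 1
lambda * D %*% A %*% t(D)               # reconstructs S (up to rounding error)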

Constraints & MCLUST

  • We may constrain the components of the eigenvalue decomposition of \(\mathbf{\Sigma}_g\) across groups of the mixture model

  • Furthermore, the matrices \(\mathbf{D}_g\) and \(\mathbf{A}_g\) may be set to the identity matrix \(\mathbf{I}\)

  • Fraley & Raftery (2000) describe the mclust software, which is available as an R package that efficiently performs this constrained model-based clustering.

MCLUST models

Notes: E = equal; S = spherical; V = variable; AA = axis-aligned

| ID  | Volume | Shape | Orientation | Covariance Decomposition | Number of free covariance parameters |
|-----|--------|-------|-------------|--------------------------|--------------------------------------|
| EII | E | S | – | \(\lambda \mathbf{I}\) | 1 |
| VII | V | S | – | \(\lambda_g \mathbf{I}\) | \(G\) |
| EEI | E | E | AA | \(\lambda \mathbf{A}\) | \(p\) |
| VEI | V | E | AA | \(\lambda_g \mathbf{A}\) | \(p+G-1\) |
| EVI | E | V | AA | \(\lambda \mathbf{A}_g\) | \(pG-G+1\) |
| VVI | V | V | AA | \(\lambda_g \mathbf{A}_g\) | \(pG\) |
| EEE | E | E | E | \(\lambda\mathbf{D}\mathbf{A}\mathbf{D}'\) | \(p(p+1)/2\) |
| EEV | E | E | V | \(\lambda\mathbf{D}_g\mathbf{A}\mathbf{D}'_g\) | \(Gp(p+1)/2-(G-1)p\) |
| VEV | V | E | V | \(\lambda_g\mathbf{D}_g\mathbf{A}\mathbf{D}'_g\) | \(Gp(p+1)/2-(G-1)(p-1)\) |
| VVV | V | V | V | \(\lambda_g\mathbf{D}_g\mathbf{A}_g\mathbf{D}'_g\) | \(Gp(p+1)/2\) |

MCLUST Cluster Shapes

Cluster shapes that correspond to the covariance structures on the previous slide

faithful finite mixture 2D

mod2 <- Mclust(faithful)
summary(mod2, parameters = TRUE)
---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust EEE (ellipsoidal, equal volume, shape and orientation) model with 3
components: 

 log-likelihood   n df       BIC       ICL
      -1126.326 272 11 -2314.316 -2357.824

Clustering table:
  1   2   3 
 40  97 135 

Mixing probabilities:
        1         2         3 
0.1656784 0.3563696 0.4779520 

Means:
               [,1]      [,2]      [,3]
eruptions  3.793066  2.037596  4.463245
waiting   77.521051 54.491158 80.833439

Variances:
[,,1]
           eruptions    waiting
eruptions 0.07825448  0.4801979
waiting   0.48019785 33.7671464
[,,2]
           eruptions    waiting
eruptions 0.07825448  0.4801979
waiting   0.48019785 33.7671464
[,,3]
           eruptions    waiting
eruptions 0.07825448  0.4801979
waiting   0.48019785 33.7671464

The “best” model is 3 components

This assumes the EEE covariance structure

But how was this determined?

BIC plot

plot(mod2$BIC)

By default, MCLUST will fit a \(g\)-component mixture model for each of the 14 covariance models, for \(g\) = 1, 2, … 9.

For each of these 126 models, a BIC value is calculated.

The BIC

  • The Bayesian Information Criterion or BIC (Schwarz, 1978) is one of the most widely known and used tools in statistical model selection. The BIC can be written: \[\begin{equation} \text{BIC} = 2 \ell(x, \hat \theta) - p_f \log{n} \end{equation}\] where \(\ell\) is the log-likelihood, \(\hat \theta\) is the MLE of \(\theta\), \(p_f\) is the number of free parameters and \(n\) is the number of observations. (We recompute this by hand for the faithful fit below.)

  • It can be shown that the BIC gives consistent estimates of the number of components in a mixture model under certain regularity conditions (Keribin, 1998, 2000)
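mclust reports BIC on this same larger-is-better scale, so (assuming the mod2 object fitted earlier) we can recompute the reported value by hand:

2 * mod2$loglik - mod2$df * log(mod2$n)  # 2*loglik - (free parameters)*log(n)
mod2$bic                                 # value reported by mclust (about -2314.3)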

BIC plot (zoomed)

plot(mod2$BIC, ylim=c(-2400, -2300))

The BIC plot reveals that the model which obtains the largest BIC value is the 3-component mixture model with the ‘EEE’ covariance structure.

Pros

Why Mixture Model-based Clustering?

  • Mathematically and statistically sound using standard theory
  • Parameter estimation is built-in, along with number of clusters via model selection
  • Provides a probabilistic answer to the question “does observation \(i\) belong to group \(g\)?”
  • Adaptable, eg. data with outliers/skewness can be handled by non-Gaussian densities

Cons

Cons (a.k.a. open avenues for research)

  • Computationally intensive (takes time to estimate and invert covariance matrices; not feasible for small \(n\) large \(p\) without some decomposition)
  • Model-selection method is still viewed as an open problem by many practitioners — BIC grudgingly accepted
  • Traditional parameter estimation via EM algorithm prone to local maxima, overconfidence in clustering probabilities, and several other issues