So far we’ve been in the supervised setting, where we know the “answers” for the response variable, which might be continuous (regression) or categorical (classification).
Simple methods for modeling a continuous response variable include: linear regression, KNN regression.
Simple methods for modeling a categorical response variable include: logistic regression, discriminant analysis, KNN classification.
Introduction
We are moving towards the realm of unsupervised learning; that is, problems for which we do not have known “answers”.
Prior to diving deep on unsupervised methods, it’s important to note that the vast majority are based on distance calculations (of some form).
Distance seems like a straightforward idea, but it is actually quite complex…
The most familiar choice is Euclidean distance, the straight-line distance between two observations $a$ and $b$:
$$d_E(a, b) = \sqrt{\sum_{j=1}^{p} (a_j - b_j)^2}$$
While simple, it is inappropriate in many settings.
Where Euclidean Fails 1
Consider the following measurements on people:
Height (in cm),
Annual salary (in $)
A $61 difference in annual salary would be considered a minuscule difference, whereas a 61 cm difference in height (approx 2 feet) would be substantial!
Plotted Euclidean Distance
Is it reasonable to assume A is equidistant (i.e., at an equal distance) from points B and C?
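A small numeric sketch of this situation (all of the heights and salaries below are made up for illustration): B differs from A only in height, by 61 cm, and C differs from A only in salary, by $61.

```r
# Hypothetical people: B is 61 cm taller than A; C earns $61 more than A
people <- data.frame(
  height = c(170, 231, 170),
  salary = c(50000, 50000, 50061),
  row.names = c("A", "B", "C")
)

# Raw Euclidean distances: d(A, B) and d(A, C) are both 61,
# i.e. a 61 cm height gap is treated exactly like a $61 salary gap
dist(people)
```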
Scale Matters
The scale and range of possible values matters!
If we use a distance-based method, for example KNN, then variables on a large scale (like salary) will have a larger effect on the distance between observations than variables on a small scale (like height).
Hence salary will drive the KNN classification results.
Standardized Euclidean Distance
One solution is to scale each variable to have mean 0 and variance 1 via
$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},$$
where $\bar{x}_j$ and $s_j$ are the sample mean and standard deviation of variable $j$. Then we can define the standardized pairwise distance between observations $a$ and $b$ as
$$d_S(a, b) = \sqrt{\sum_{j=1}^{p} \left(\frac{a_j - b_j}{s_j}\right)^2}$$
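A sketch reusing the hypothetical A/B/C data from above. With only three observations, scale() would estimate the spread from those three values alone, so purely for illustration each column is divided by an assumed “typical” standard deviation (10 cm for height, $20,000 for salary); on a real data set you would simply call scale().

```r
# Standardize by assumed typical spreads (illustration only);
# with a real data set: people_scaled <- scale(people)
people_std <- data.frame(
  height = people$height / 10,       # assumed SD of roughly 10 cm
  salary = people$salary / 20000,    # assumed SD of roughly $20,000
  row.names = rownames(people)
)

# Now A and C (a $61 salary gap) are far closer than A and B (a 61 cm height gap)
dist(people_std)
```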
How it might look after scaling
After scaling those data you might have something more along the lines of this, where C and A are much closer together than A and B.
Manhattan Distance
Manhattan distance measures pairwise distance as though one could only travel along the axes:
$$d_M(a, b) = \sum_{j=1}^{p} |a_j - b_j|$$
Similarly, one might want to consider the standardized form
$$d_{SM}(a, b) = \sum_{j=1}^{p} \frac{|a_j - b_j|}{s_j}$$
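In R, dist() computes this via method = "manhattan"; a sketch reusing the hypothetical people data from above:

```r
# Manhattan (city-block) distances on the raw data
dist(people, method = "manhattan")

# A standardized version: scale the columns first, then sum absolute differences
dist(scale(people), method = "manhattan")
```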
Mahalanobis Distance
The Mahalanobis distance additionally accounts for the covariance between variables; for an observation $x$ with mean vector $\mu$ and covariance matrix $\Sigma$,
$$d_M(x, \mu) = \sqrt{(x - \mu)^\top \Sigma^{-1} (x - \mu)}$$
In R, with mean vector mu and covariance matrix sigma, the squared distances are the quadratic forms:

```r
t(A - mu) %*% solve(sigma) %*% (A - mu)  # returns a 1x1 matrix
t(B - mu) %*% solve(sigma) %*% (B - mu)  # returns a 1x1 matrix
```
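Assuming A and B are numeric vectors and mu and sigma are the mean vector and covariance matrix as in the code above, the same squared distances can be obtained with base R’s mahalanobis():

```r
# mahalanobis() returns squared Mahalanobis distances,
# matching the quadratic forms computed above
mahalanobis(rbind(A, B), center = mu, cov = sigma)
```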
When to Standardize?
This brings us to an important question… when should we use a standardized measure, and when should we not?
It is often a good idea, and generally necessary if measurements are on vastly different scales or units.
Unless you have a good reason to believe that higher-variance measures should be weighted more heavily, you probably want to standardize in some form.
From Numeric to Mixed
What if some/all the predictors are not numeric?
We can consider several methods for calculating distances based on matching.
In the binary case, we can get a “match” either with a 1-1 agreement or a 0-0 agreement.
Matching Binary Distance
We can define a matching-based distance that is equivalent to the Manhattan/city-block distance: simply count the number of variables on which two observations disagree.
This simple sum of disagreements is the unstandardized version of the M-coefficient.
Let’s see an example…
President Example
Let’s look at some binary variables for US presidents:
Democrat: logical indicating whether they are a Democrat
Governor: logical indicating whether they were formerly a governor (1 = yes)
VP: logical indicating whether they were formerly a vice president
2nd Term: logical indicating whether they served a second term
From Iowa: logical indicating whether they are originally from Iowa
Matching Binary Distance
|        | Democrat | Governor | VP | 2nd Term | From Iowa |
|--------|----------|----------|----|----------|-----------|
| GWBush | 0        | 1        | 0  | 1        | 0         |
| Obama  | 1        | 0        | 0  | 1        | 0         |
| Trump  | 0        | 0        | 0  | 0        | 0         |
The Manhattan distance between GWBush and Obama is $|0-1| + |1-0| + |0-0| + |1-1| + |0-0| = 2$. This is equivalent to counting disagreements, as they do not match on two variables: Democrat and Governor.
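A sketch of this in R; the presidents object below is just the table above coded as a 0/1 matrix:

```r
presidents <- matrix(
  c(0, 1, 0, 1, 0,   # GWBush
    1, 0, 0, 1, 0,   # Obama
    0, 0, 0, 0, 0),  # Trump
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GWBush", "Obama", "Trump"),
                  c("Democrat", "Governor", "VP", "SecondTerm", "FromIowa"))
)

# On 0/1 data, Manhattan distance counts the number of disagreements
dist(presidents, method = "manhattan")
```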
M-coefficient
The M-coefficient (or matching coefficient) is simply the proportion of variables in disagreement between two objects, i.e. the number of disagreements divided by the total number of variable comparisons, $p$.
So in our presidents example the matching distance can be converted to a proportion by dividing the number of disagreements by $p = 5$.
Calculate M-coefficient
|        | Democrat | Governor | VP | 2nd Term | From Iowa |
|--------|----------|----------|----|----------|-----------|
| GWBush | 0        | 1        | 0  | 1        | 0         |
| Obama  | 1        | 0        | 0  | 1        | 0         |
| Trump  | 0        | 0        | 0  | 0        | 0         |
Calculating the simple matching M-coefficient for GWBush vs Obama: $M = \dfrac{2}{5} = 0.4$.
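A sketch reusing the presidents matrix defined earlier:

```r
# M-coefficient: number of disagreements divided by p (the number of variables)
dist(presidents, method = "manhattan") / ncol(presidents)
```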
Question
|        | Democrat | Governor | VP | 2nd Term | From Iowa |
|--------|----------|----------|----|----------|-----------|
| GWBush | 0        | 1        | 0  | 1        | 0         |
| Obama  | 1        | 0        | 0  | 1        | 0         |
| Trump  | 0        | 0        | 0  | 0        | 0         |
Should a 0-0 match in From Iowa really map to a binary distance of 0?
Since a 0-0 “match” does not imply that they are from similar places in the US, I would argue not.
Asymmetric Binary Distance
For reasons discussed on the previous slide, it makes sense to toss out those 0-0 matches.
In this case, what we would be considering is the asymmetric binary distance: the proportion of disagreements among only those variables for which at least one of the two observations is a 1 (0-0 comparisons are dropped from both the numerator and the denominator).
Calculate Asymmetric Distance
|        | Democrat | Governor | VP | 2nd Term | From Iowa |
|--------|----------|----------|----|----------|-----------|
| GWBush | 0        | 1        | 0  | 1        | 0         |
| Obama  | 1        | 0        | 0  | 1        | 0         |
| Trump  | 0        | 0        | 0  | 0        | 0         |
The asymmetric binary distance between Bush and Obama is $\dfrac{2}{3} \approx 0.67$: they disagree on two variables (Democrat and Governor) out of the three comparisons that remain once the 0-0 matches (VP and From Iowa) are dropped.
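R’s dist() with method = "binary" computes exactly this: it drops 0-0 comparisons and returns the proportion of the remaining variables in disagreement. A sketch with the presidents matrix from earlier:

```r
# Asymmetric binary distance: GWBush vs Obama gives 2/3
dist(presidents, method = "binary")
```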
Distance for Qualitative Variables
For categorical variables with more than two levels, a common measure is essentially standardized matching once again.
If $m$ is the number of variables that match between observations $i$ and $j$, and $p$ is the total number of variables, the measure is calculated as:
$$d(i, j) = \frac{p - m}{p}$$
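For nominal variables this is just the proportion of mismatches; a minimal sketch with two hypothetical observations on $p = 3$ nominal variables:

```r
obs_i <- c(eye = "blue", hair = "brown", region = "midwest")
obs_j <- c(eye = "blue", hair = "black", region = "south")

# (p - m) / p: the proportion of variables that do not match
mean(obs_i != obs_j)   # 2/3
```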
Distance for Mixed Variables
But often data has a mix of variable types.
Gower’s distance is a common choice for computing pairwise distances in this case.
The basic idea is to standardize each variable’s contribution to the distance so that it lies between 0 and 1, and then average these contributions over the variables that can be compared.
Gower’s Distance
Gower’s distance (or Gower’s dissimilarity) between observations $i$ and $j$ is calculated as:
$$d_G(i, j) = \frac{\sum_{k=1}^{p} \delta_{ijk}\, d_{ijk}}{\sum_{k=1}^{p} \delta_{ijk}}$$
where
- $\delta_{ijk} = 1$ if both $x_{ik}$ and $x_{jk}$ are non-missing (0 otherwise),
- $d_{ijk}$ depends on variable type:
  - Quantitative (numeric): $d_{ijk} = \dfrac{|x_{ik} - x_{jk}|}{R_k}$, where $R_k$ is the range of variable $k$
  - Qualitative (nominal, categorical): $d_{ijk} = 0$ if $x_{ik} = x_{jk}$, and $d_{ijk} = 1$ otherwise
  - Binary (asymmetric): treated like a nominal variable, except that a 0-0 comparison also sets $\delta_{ijk} = 0$, as in the asymmetric binary distance
Example of Gower’s Distance

|          | Party       | Height (cm) | Eye Colour | 2 Terms | From Iowa |
|----------|-------------|-------------|------------|---------|-----------|
| Biden    | Democratic  | 182         | Blue       | NA      | No        |
| Trump    | Republican  | 188         | Blue       | No      | No        |
| Obama    | Democratic  | 185         | Brown      | Yes     | No        |
| Fillmore | Third Party | 175         | Blue       | No      | No        |

Assume we have these data for all past and present US presidents.
Calculations
If we consider a 0-0 comparison on the asymmetric binary variable From Iowa to be missing, we can fill out the $\delta_{ijk}$ column for Biden ($i$) and Trump ($j$) as follows:

| Variable  | $\delta_{ijk}$ |
|-----------|----------------|
| Party     | 1              |
| Height    | 1              |
| EyeColour | 1              |
| 2 Terms   | 0              |
| Iowa      | 0              |

(2 Terms gets $\delta_{ijk} = 0$ because Biden’s value is missing; Iowa gets $\delta_{ijk} = 0$ because of the 0-0 match.)
Party Calculations
|       | Party      | Height (cm) | Eye Colour | 2 Terms | From Iowa |
|-------|------------|-------------|------------|---------|-----------|
| Biden | Democratic | 182         | Blue       | NA      | No        |
| Trump | Republican | 188         | Blue       | No      | No        |

Party is a nominal variable and Democratic ≠ Republican, so $d_{ijk} = 1$:

| Variable  | $\delta_{ijk}$ | $d_{ijk}$ | $\delta_{ijk}\, d_{ijk}$ |
|-----------|----------------|-----------|--------------------------|
| Party     | 1              | 1         | 1                        |
| Height    | 1              |           |                          |
| EyeColour | 1              |           |                          |
| 2 Terms   | 0              |           |                          |
| Iowa      | 0              |           |                          |
Height Calculations
To calculate $d_{ijk}$ we need to know the range of the quantitative variable Height (cm).
Tallest President: Abraham Lincoln at 193 centimeters
Shortest President: James Madison at 163 centimeters
Range of the variable: $193 - 163 = 30$
So for Biden and Trump, $d_{ijk} = \dfrac{|182 - 188|}{30} = 0.2$:

| Variable  | $\delta_{ijk}$ | $d_{ijk}$ | $\delta_{ijk}\, d_{ijk}$ |
|-----------|----------------|-----------|--------------------------|
| Party     | 1              | 1         | 1                        |
| Height    | 1              | 0.2       | 0.2                      |
| EyeColour | 1              |           |                          |
| 2 Terms   | 0              |           |                          |
| Iowa      | 0              |           |                          |
EyeColour Calculations
|       | Party      | Height (cm) | Eye Colour | 2 Terms | From Iowa |
|-------|------------|-------------|------------|---------|-----------|
| Biden | Democratic | 182         | Blue       | NA      | No        |
| Trump | Republican | 188         | Blue       | No      | No        |

Eye Colour is nominal and Blue = Blue, so $d_{ijk} = 0$:

| Variable  | $\delta_{ijk}$ | $d_{ijk}$ | $\delta_{ijk}\, d_{ijk}$ |
|-----------|----------------|-----------|--------------------------|
| Party     | 1              | 1         | 1                        |
| Height    | 1              | 0.2       | 0.2                      |
| EyeColour | 1              | 0         | 0                        |
| 2 Terms   | 0              |           |                          |
| Iowa      | 0              |           |                          |
2 Terms Calculations
|       | Party      | Height (cm) | Eye Colour | 2 Terms | From Iowa |
|-------|------------|-------------|------------|---------|-----------|
| Biden | Democratic | 182         | Blue       | NA      | No        |
| Trump | Republican | 188         | Blue       | No      | No        |

2 Terms is missing (NA) for Biden, so $\delta_{ijk} = 0$ and this variable is excluded from the comparison:

| Variable  | $\delta_{ijk}$ | $d_{ijk}$ | $\delta_{ijk}\, d_{ijk}$ |
|-----------|----------------|-----------|--------------------------|
| Party     | 1              | 1         | 1                        |
| Height    | 1              | 0.2       | 0.2                      |
| EyeColour | 1              | 0         | 0                        |
| 2 Terms   | 0              | -         | 0                        |
| Iowa      | 0              |           |                          |
From Iowa Calculations
|       | Party      | Height (cm) | Eye Colour | 2 Terms | From Iowa |
|-------|------------|-------------|------------|---------|-----------|
| Biden | Democratic | 182         | Blue       | NA      | No        |
| Trump | Republican | 188         | Blue       | No      | No        |

From Iowa is an asymmetric binary variable and both values are No (a 0-0 match), so $\delta_{ijk} = 0$ and it too is excluded:

| Variable  | $\delta_{ijk}$ | $d_{ijk}$ | $\delta_{ijk}\, d_{ijk}$ |
|-----------|----------------|-----------|--------------------------|
| Party     | 1              | 1         | 1                        |
| Height    | 1              | 0.2       | 0.2                      |
| EyeColour | 1              | 0         | 0                        |
| 2 Terms   | 0              | -         | 0                        |
| Iowa      | 0              | 0         | 0                        |
Gower’s Distance Calculation
| Variable  | $\delta_{ijk}$ | $d_{ijk}$ | $\delta_{ijk}\, d_{ijk}$ |
|-----------|----------------|-----------|--------------------------|
| Party     | 1              | 1         | 1                        |
| Height    | 1              | 0.2       | 0.2                      |
| EyeColour | 1              | 0         | 0                        |
| 2 Terms   | 0              | -         | 0                        |
| Iowa      | 0              | 0         | 0                        |
| Total     | 3              |           | 1.2                      |

Putting it all together,
$$d_G(\text{Biden}, \text{Trump}) = \frac{\sum_{k} \delta_{ijk}\, d_{ijk}}{\sum_{k} \delta_{ijk}} = \frac{1.2}{3} = 0.4$$
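The same bookkeeping in a few lines of R, with the values taken from the worked example above (the height contribution uses the 30 cm range across all presidents):

```r
# delta: 1 if the variable can be compared (non-missing and not a
# 0-0 asymmetric binary match), 0 otherwise
delta <- c(Party = 1, Height = 1, EyeColour = 1, TwoTerms = 0, FromIowa = 0)

# d: per-variable contributions (Height is |182 - 188| / 30)
d <- c(Party = 1, Height = abs(182 - 188) / 30, EyeColour = 0,
       TwoTerms = 0, FromIowa = 0)

# Gower's distance between Biden and Trump
sum(delta * d) / sum(delta)   # 1.2 / 3 = 0.4
```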
Comments
Gower distance will always be between 0.0 and 1.0
a distance of 0.0 means the two observations are identical for all non-missing predictors
a distance of 1.0 means the two observations are as far apart as possible for that data set
The Gower distance can be used with purely numeric or purely non-numeric data, but for such scenarios there are better distance metrics available.
There are several variations of the Gower distance, so if you encounter it, you should read the documentation carefully.
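In practice, Gower’s distance for mixed data is usually computed with existing software, for example daisy() in the cluster package. The sketch below recodes the four-president table; note that daisy() computes each numeric variable’s range from the rows you supply (here 188 − 175 = 13 cm rather than the 30 cm used above), and From Iowa is declared asymmetric binary, so the Biden–Trump value will differ somewhat from the hand calculation.

```r
library(cluster)

# The four-president table, recoded for daisy();
# From Iowa is 0/1 so it can be declared asymmetric binary
presidents_mixed <- data.frame(
  Party     = factor(c("Democratic", "Republican", "Democratic", "Third Party")),
  Height    = c(182, 188, 185, 175),
  EyeColour = factor(c("Blue", "Blue", "Brown", "Blue")),
  TwoTerms  = factor(c(NA, "No", "Yes", "No")),
  FromIowa  = c(0, 0, 0, 0),
  row.names = c("Biden", "Trump", "Obama", "Fillmore")
)

# Gower dissimilarities; 0-0 comparisons on FromIowa are dropped
daisy(presidents_mixed, metric = "gower", type = list(asymm = "FromIowa"))
```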