09: Clustering Methods

General Overview

Dean Adams, Iowa State University

Outline

Clustering

Look for relationships of points in multivariate spaces

Clustering focuses on the similarities (or differences) among observations

Obtain groupings (clusters) of observations based on their similarity or difference

Provides complementary view to ordination

Clustering is algorithmic, not algebraic (i.e., a set of [repeated] rules for connecting observations)

Clustering: General Logic

Consider points in multivariate space

Clustering: General Logic

Connect objects in some way based on similarity

Clustering: General Logic

Results in clusters of points

Clustering Approaches

Clustering: SAHN Methods

Sequential, Agglomerative, Hierarchical, Nested

Equivalent to starting with most similar objects and ‘connecting’ objects with lower and lower similarity as you progress (i.e., more different objects added later)

SAHN generates nested plot (dendrogram) whose axis is similarity

Different criteria are used for determining when to join objects to clusters

Single Linkage and Complete Linkage

Single Linkage (nearest neighbor clustering): Add a new object to a cluster when its similarity to the first object is reached

Complete Linkage (farthest neighbor clustering): Add a new object to a cluster when its similarity to the last object is reached

Represent the extremes of SAHN clustering

Difference largely amounts to ‘sliding’ nodes of the dendrogram towards the tips (single linkage) or towards the root (complete linkage)

Some cluster assignments can also change

Example Data

Skull shape similarity among populations of European moles

Distance matrix based on population averages

T_rom T_rom2 M_touc P_bre T_occ T_stan T_cae T_eur T_eur2 T_eur3
T_rom 0.00 4.66 8.53 8.96 6.40 5.35 9.27 8.42 8.60 8.82
T_rom2 4.66 0.00 8.19 10.98 5.43 5.20 6.96 7.24 8.21 8.26
M_touc 8.53 8.19 0.00 9.88 5.67 10.28 9.10 8.75 5.63 7.58
P_bre 8.96 10.98 9.88 0.00 9.91 10.32 11.93 12.01 10.85 12.09
T_occ 6.40 5.43 5.67 9.91 0.00 6.45 5.72 7.25 6.20 6.57
T_stan 5.35 5.20 10.28 10.32 6.45 0.00 7.66 8.63 9.93 10.32
T_cae 9.27 6.96 9.10 11.93 5.72 7.66 0.00 7.17 8.13 8.21
T_eur 8.42 7.24 8.75 12.01 7.25 8.63 7.17 0.00 5.39 5.35
T_eur2 8.60 8.21 5.63 10.85 6.20 9.93 8.13 5.39 0.00 4.64
T_eur3 8.82 8.26 7.58 12.09 6.57 10.32 8.21 5.35 4.64 0.00

Example: Single Vs. Complete Linkage

Nodes shifted back in Complete Linkage (right) and some cluster changes (e.g., position of ‘M_touc’)
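The comparison above can be sketched with R’s built-in `hclust`. The small 4-object distance matrix below is hypothetical (not the mole data), chosen so the effect of the linkage criterion is easy to see:

```r
# Hypothetical 4-object distance matrix (illustrative values only)
d <- as.dist(matrix(c( 0,  2,  6, 10,
                       2,  0,  5,  9,
                       6,  5,  0,  4,
                      10,  9,  4,  0),
                    nrow = 4,
                    dimnames = list(LETTERS[1:4], LETTERS[1:4])))

hc.single   <- hclust(d, method = "single")    # nearest-neighbor linkage
hc.complete <- hclust(d, method = "complete")  # farthest-neighbor linkage

# Complete linkage slides nodes toward the root (larger join heights)
max(hc.single$height)    # 5
max(hc.complete$height)  # 10

# Side-by-side dendrograms:
# par(mfrow = c(1, 2)); plot(hc.single); plot(hc.complete)
```

Here the root of the single-linkage tree sits at the smallest between-cluster distance (5), while complete linkage places it at the largest (10).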

Single and Complete Linkage: Thoughts

Single linkage and complete linkage represent extreme criteria for clustering (include object at first instance vs. include object only when all instances reached)

Can be sensitive to noise in data (especially single linkage)

Possible improvement: Use a measure of central tendency for groups rather than raw values

Different measures of central tendency have been used

UPGMA

Unweighted Pair-Group Method using Arithmetic Averages

The method is ‘unweighted’ because it gives equal weight to the original similarity scores (e.g., when a 3rd object is added, the new average is found by dividing by 3, etc.)

Example: UPGMA

Node heights are more intermediate as compared with single and complete linkage
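A minimal UPGMA sketch using `hclust`’s "average" method, on a small hypothetical distance matrix (not the mole data):

```r
# Hypothetical distance matrix (illustrative values only)
d <- as.dist(matrix(c( 0,  2,  6, 10,
                       2,  0,  5,  9,
                       6,  5,  0,  4,
                      10,  9,  4,  0),
                    nrow = 4,
                    dimnames = list(LETTERS[1:4], LETTERS[1:4])))

hc.upgma <- hclust(d, method = "average")  # UPGMA

# Root height is the unweighted mean of ALL between-cluster distances:
# (6 + 10 + 5 + 9) / 4 = 7.5, intermediate between single (5) and complete (10)
max(hc.upgma$height)  # 7.5
```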

Representing Object Relationships

Ordination: Large distances more accurately represented

Clustering: Small distances more accurately represented

Ordination & Clustering are complementary views of high-D relationships

WPGMA

Weighted pair-group method using arithmetic averages

Same as UPGMA, but weighted (new averages always found by dividing by 2 regardless of how many objects are in the cluster)
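In `hclust`, the WPGMA criterion goes by the name "mcquitty". A sketch on a small hypothetical distance matrix (not the mole data); note that when the merged clusters happen to be of equal size, as here, WPGMA and UPGMA coincide, and they differ once clusters of unequal size are joined:

```r
# Hypothetical distance matrix (illustrative values only)
d <- as.dist(matrix(c( 0,  2,  6, 10,
                       2,  0,  5,  9,
                       6,  5,  0,  4,
                      10,  9,  4,  0),
                    nrow = 4,
                    dimnames = list(LETTERS[1:4], LETTERS[1:4])))

hc.wpgma <- hclust(d, method = "mcquitty")  # WPGMA

# Final merge: average of the two cluster-to-cluster averages, always dividing by 2:
# ((6 + 5)/2 + (10 + 9)/2) / 2 = (5.5 + 9.5) / 2 = 7.5
max(hc.wpgma$height)  # 7.5
```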

UPGMC & WPGMC

Unweighted & Weighted Pair-Group Methods using Centroids: clusters are joined based on distances between cluster centroids

Can result in ‘reversals’ (negative branch lengths): when the distance of a 3rd object to the centroid of the original pair is smaller than the distance between that pair

Examples

Note negative branch lengths in these examples
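A reversal can be reproduced with three points forming a near-equilateral triangle (hypothetical coordinates). `hclust`’s "centroid" method is conventionally run on squared Euclidean distances:

```r
# Three hypothetical points: the third point ends up closer to the centroid of
# the first merged pair than that pair's own merge height
xy <- rbind(A = c(0, 0), B = c(4, 0), C = c(2, 3.4))

hc.cent <- hclust(dist(xy)^2, method = "centroid")  # UPGMC on squared distances

hc.cent$height            # second merge height is LOWER than the first
diff(hc.cent$height) < 0  # TRUE: a reversal (drawn as a negative branch length)
```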

Ward’s Minimum Variance Method

Use cluster variance (TESS: total error sum of squares, i.e., the pooled within-group sums of squares)

Add an object to the cluster such that TESS is increased the least
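A sketch with `hclust`’s "ward.D2" method (Ward’s criterion applied to a distance matrix); the matrix below is a small hypothetical example, not the mole data:

```r
# Hypothetical distance matrix (illustrative values only)
d <- as.dist(matrix(c( 0,  2,  6, 10,
                       2,  0,  5,  9,
                       6,  5,  0,  4,
                      10,  9,  4,  0),
                    nrow = 4,
                    dimnames = list(LETTERS[1:4], LETTERS[1:4])))

hc.ward <- hclust(d, method = "ward.D2")  # each join minimizes the increase in TESS

# Join heights increase monotonically (no reversals under Ward's method)
!is.unsorted(hc.ward$height)  # TRUE
```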

Clustering: K-Means

Non-hierarchical method of identifying groups

Entire process can be repeated for \(K = 2,3,4\) etc. to find optimal number of groups

Does not yield dendrogram (not hierarchical); only yields # groups and group membership

Example

Attempt 2, 3 and 4 clusters (NOTE: results can change across multiple runs)

# K-means for K = 2, 3, and 4 (fix the seed: assignments can change across runs)
set.seed(42)
kclusters2 <- kmeans(PC.scores, 2)
kclusters3 <- kmeans(PC.scores, 3)
kclusters4 <- kmeans(PC.scores, 4)

Evaluate TESS by N-Groups

# Total within-group SS (TESS) for K = 1 through 6;
# nstart = 20 restarts each fit to avoid poor local optima
set.seed(42)
TESS <- array(NA, 6)
for (i in 1:6){
  TESS[i] <- kmeans(PC.scores, i, nstart = 20)$tot.withinss
}

TESS levels off at \(K=3\) (PC plot confirms this is reasonable)

However, in this case we have 4 groups (2 overlap broadly)…

Clustering: Summary

Clustering provides ‘connected’ view of object similarity

VERY useful when combined with ordination (provide complementary views of data space)

Change of metric/distance measure may alter results

One must make sure the clustering method is commensurate with the ordination method (e.g., K-means uses Euclidean distance to measure TESS, which corresponds to a PC plot)

Other methods exist: minimum spanning tree, flexible-link clustering, probabilistic clustering, evolutionary model-based phylogenetics (parsimony, ML, Bayesian, neighbor-joining, etc.)

Conclusions

Clustering is a useful visualization tool for multivariate data

Displays overall similarity among objects

A complement to ordination approaches we’ve used (e.g., PCA)