Dean Adams, Iowa State University
Clustering
SAHN methods: single-link, complete-link, UPGMA, etc.
K-means partitioning
Look for relationships of points in multivariate spaces
Clustering focuses on the similarities (or differences) among observations
Obtain groupings (clusters) of observations based on their similarity or difference
Provides complementary view to ordination
Clustering is algorithmic, not algebraic (i.e., a set of [repeated] rules for connecting observations)
Consider points in multivariate space
Connect objects in some way based on similarity
Results in clusters of points
Two main categories
Hierarchical: clusters are nested (higher-ranked clusters contain lower-ranked clusters)
Non-hierarchical: strategies for identifying K groups (clusters)
Other ways to think of clustering methods
Sequential vs. simultaneous
Agglomerative vs. divisive (start with one & add vs. start with all and break apart)
Monothetic vs. polythetic (single descriptor for partitions vs. multiple descriptors)
Probabilistic vs. non-probabilistic
Sequential, Agglomerative, Hierarchical, Nested
Using similarity (or distance) matrix, cluster two most similar objects
Recalculate some value if required
Find next two most similar objects, aggregate, and repeat until all objects are clustered
Equivalent to starting with most similar objects and ‘connecting’ objects with lower and lower similarity as you progress (i.e., more different objects added later)
SAHN generates nested plot (dendrogram) whose axis is similarity
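A minimal sketch of this procedure in R, assuming a hypothetical data matrix `X` (base R's `dist` builds the distance matrix and `hclust` performs the agglomeration):

```r
# Minimal SAHN sketch (hypothetical data for illustration)
set.seed(1)
X <- matrix(rnorm(20 * 4), nrow = 20)  # 20 objects, 4 variables
D <- dist(X)                           # pairwise Euclidean distances
cl <- hclust(D, method = "average")    # join closest pair, recalculate, repeat
plot(cl)                               # dendrogram; vertical axis = distance at joining
```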
Different criteria used for determining when to include objects to clusters
Single Linkage (nearest-neighbor clustering): add a new object to a cluster once its similarity to the closest (first) member is reached
Complete Linkage (farthest-neighbor clustering): add a new object to a cluster only once its similarity to the farthest (last) member is reached
Represent the extremes of SAHN clustering
Difference largely amounts to ‘sliding’ nodes of the dendrogram towards the tips (single linkage) or towards the root (complete linkage)
Some cluster assignments can also change
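A sketch comparing the two criteria on the same distances (reusing the hypothetical `D` from the sketch above):

```r
cl.single   <- hclust(D, method = "single")    # join at the first (smallest) distance
cl.complete <- hclust(D, method = "complete")  # join only at the last (largest) distance
par(mfrow = c(1, 2))
plot(cl.single,   main = "Single linkage")     # nodes slide toward the tips
plot(cl.complete, main = "Complete linkage")   # nodes slide toward the root
cutree(cl.single, k = 3)                       # group memberships can differ
cutree(cl.complete, k = 3)
```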
Skull shape similarity among populations of European moles
Distance matrix based on population averages
|  | T_rom | T_rom2 | M_touc | P_bre | T_occ | T_stan | T_cae | T_eur | T_eur2 | T_eur3 |
|---|---|---|---|---|---|---|---|---|---|---|
| T_rom | 0.00 | 4.66 | 8.53 | 8.96 | 6.40 | 5.35 | 9.27 | 8.42 | 8.60 | 8.82 |
| T_rom2 | 4.66 | 0.00 | 8.19 | 10.98 | 5.43 | 5.20 | 6.96 | 7.24 | 8.21 | 8.26 |
| M_touc | 8.53 | 8.19 | 0.00 | 9.88 | 5.67 | 10.28 | 9.10 | 8.75 | 5.63 | 7.58 |
| P_bre | 8.96 | 10.98 | 9.88 | 0.00 | 9.91 | 10.32 | 11.93 | 12.01 | 10.85 | 12.09 |
| T_occ | 6.40 | 5.43 | 5.67 | 9.91 | 0.00 | 6.45 | 5.72 | 7.25 | 6.20 | 6.57 |
| T_stan | 5.35 | 5.20 | 10.28 | 10.32 | 6.45 | 0.00 | 7.66 | 8.63 | 9.93 | 10.32 |
| T_cae | 9.27 | 6.96 | 9.10 | 11.93 | 5.72 | 7.66 | 0.00 | 7.17 | 8.13 | 8.21 |
| T_eur | 8.42 | 7.24 | 8.75 | 12.01 | 7.25 | 8.63 | 7.17 | 0.00 | 5.39 | 5.35 |
| T_eur2 | 8.60 | 8.21 | 5.63 | 10.85 | 6.20 | 9.93 | 8.13 | 5.39 | 0.00 | 4.64 |
| T_eur3 | 8.82 | 8.26 | 7.58 | 12.09 | 6.57 | 10.32 | 8.21 | 5.35 | 4.64 | 0.00 |
Nodes shifted toward the root in Complete Linkage (right), and some cluster assignments change (e.g., the position of ‘M_touc’)
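As a sketch, the upper-left corner of this matrix (first four populations) can be entered directly and clustered; extending to the full 10 × 10 matrix is mechanical:

```r
# Rebuild part of the distance matrix above (first four populations)
labs <- c("T_rom", "T_rom2", "M_touc", "P_bre")
M <- matrix(0, 4, 4, dimnames = list(labs, labs))
M["T_rom",  "T_rom2"] <- 4.66;  M["T_rom",  "M_touc"] <- 8.53
M["T_rom",  "P_bre"]  <- 8.96;  M["T_rom2", "M_touc"] <- 8.19
M["T_rom2", "P_bre"]  <- 10.98; M["M_touc", "P_bre"]  <- 9.88
D.moles <- as.dist(M + t(M))          # symmetrize and convert to a dist object
par(mfrow = c(1, 2))
plot(hclust(D.moles, method = "single"),   main = "Single linkage")
plot(hclust(D.moles, method = "complete"), main = "Complete linkage")
```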
Single linkage and complete linkage represent extreme criteria for clustering (include object at first instance vs. include object only when all instances reached)
Can be sensitive to noise in data (especially single linkage)
Possible improvement: Use a measure of central tendency for groups rather than raw values
Different measures of central tendency have been used
Unweighted Pair-Group Method using Arithmetic Averages (UPGMA)
Uses averages of clusters to join additional objects
Method is ‘unweighted’ because it gives the same weight to each original similarity score (e.g., when a 3rd object is added, the new average is found by dividing by 3, etc.)
Nodes more intermediate as compared with SL and CL
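In `hclust`, UPGMA is `method = "average"`; a sketch, again on the hypothetical `D` from earlier (the cophenetic correlation quantifies how well the dendrogram preserves the input distances):

```r
cl.upgma <- hclust(D, method = "average")  # UPGMA: unweighted cluster means
plot(cl.upgma)                             # nodes intermediate between SL and CL
cor(D, cophenetic(cl.upgma))               # fit of dendrogram to original distances
```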
Ordination: Large distances more accurately represented
Clustering: Small distances more accurately represented
Ordination & Clustering are complementary views of high-D relationships
Weighted Pair-Group Method using Arithmetic Averages (WPGMA)
Same as UPGMA, but weighted (new averages always found by dividing by 2 regardless of how many objects are in the cluster)
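In `hclust`, WPGMA is `method = "mcquitty"`; a one-line variation on the UPGMA sketch:

```r
cl.wpgma <- hclust(D, method = "mcquitty")  # WPGMA: average of the two joining clusters (divide by 2)
plot(cl.wpgma)
```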
Use centroids of clusters to join objects
Can result in ‘reversals’ (negative branch lengths): occurs when the distance of a 3rd object to the cluster centroid is smaller than the distance between the original pair
Note negative branch lengths in these examples
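A sketch of the centroid method in `hclust` (`method = "centroid"`, conventionally applied to squared Euclidean distances); merge heights that decrease indicate a reversal:

```r
cl.cent <- hclust(D^2, method = "centroid")  # centroid clustering on squared distances
plot(cl.cent)                                # reversals appear as nodes below their daughters
any(diff(cl.cent$height) < 0)                # TRUE if any merge height decreases (a reversal)
```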
Use cluster variance (TESS: total error sum of squares, the pooled within-group variance)
Add object to cluster such that TESS is increased the least
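This minimum-increase-in-TESS criterion is Ward's method; in `hclust` it is `method = "ward.D2"` (which expects unsquared Euclidean distances):

```r
cl.ward <- hclust(D, method = "ward.D2")  # Ward's minimum-variance: each merge increases TESS the least
plot(cl.ward)
```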
Non-hierarchical method of identifying groups
Partition data into groups to minimize TESS
Define number of groups (K)
Assign specimens to groups; calculate group centroids and TESS
Repeat many times and choose solution with minimal TESS
Entire process can be repeated for \(K = 2,3,4\) etc. to find optimal number of groups
Does not yield dendrogram (not hierarchical); only yields # groups and group membership
Attempt 2, 3 and 4 clusters (NOTE: results can change across multiple runs)
# K-means
kclusters2 <- kmeans(PC.scores, 2)
kclusters3 <- kmeans(PC.scores, 3)
kclusters4 <- kmeans(PC.scores, 4)

TESS levels off at \(K=3\) (PC plot confirms this is reasonable)
However, in this case we have 4 groups (2 overlap broadly)…
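A sketch of guarding against local optima (the `nstart` argument reruns `kmeans` from many random starts) while scanning K, assuming the `PC.scores` object from the chunk above; the range of K shown is arbitrary:

```r
# TESS (kmeans' tot.withinss) across K, with 50 random starts per K
tess <- sapply(2:6, function(k) kmeans(PC.scores, k, nstart = 50)$tot.withinss)
plot(2:6, tess, type = "b", xlab = "K", ylab = "TESS")  # look for where TESS levels off
kmeans(PC.scores, 3, nstart = 50)$cluster               # memberships at the chosen K
```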
Clustering provides ‘connected’ view of object similarity
VERY useful when combined with ordination (provide complementary views of data space)
Change of metric/distance measure may alter results
One must make sure the clustering method is commensurate with the ordination method (e.g., K-means measures TESS with Euclidean distances, which correspond to distances in a PC plot)
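A quick sanity check of that correspondence, with hypothetical data `X`: PCA is a rigid rotation, so Euclidean distances among scores on all PCs equal those among the original variables:

```r
set.seed(1)
X  <- matrix(rnorm(30 * 5), nrow = 30)
pc <- prcomp(X)                   # centering does not change pairwise distances
max(abs(dist(X) - dist(pc$x)))    # effectively zero: distances are preserved
```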
Methods do NOT assume a process!! Be careful in interpretation
Other methods exist: minimum spanning tree, flexible-link clustering, probabilistic clustering, evolutionary model-based phylogenetics (parsimony, ML, Bayesian, neighbor-joining, etc.)
Clustering is a useful visualization tool for multivariate data
Displays overall similarity among objects
A complement to ordination approaches we’ve used (e.g., PCA)