09: Clustering Methods

General Overview

Dean Adams, Iowa State University

Outline

Clustering

Look for relationships of points in multivariate spaces

Clustering focuses on the similarities (or differences) among observations

Obtain groupings (clusters) of observations based on their similarity or difference

Provides complementary view to ordination

Clustering is algorithmic, not algebraic (i.e., a set of [repeated] rules for connecting observations)

Clustering: General Logic

Consider points in multivariate space

Clustering: General Logic

Connect objects in some way based on similarity

Clustering: General Logic

Results in clusters of points

Clustering Approaches

Clustering: SAHN Methods

Sequential, Agglomerative, Hierarchical, Nested

Equivalent to starting with most similar objects and ‘connecting’ objects with lower and lower similarity as you progress (i.e., more different objects added later)

SAHN generates nested plot (dendrogram) whose axis is similarity

Different criteria are used for determining when to join objects to clusters

Single Linkage and Complete Linkage

Single Linkage (nearest neighbor clustering): Add a new object to a cluster when its similarity to the first object is reached

Complete Linkage (farthest neighbor clustering): Add a new object to a cluster when its similarity to the last object is reached

Represent the extremes of SAHN clustering

Difference largely amounts to ‘sliding’ nodes of the dendrogram towards the tips (single linkage) or towards the root (complete linkage)

Some cluster assignments can also change

Example Data

Skull shape similarity among populations of European moles

Distance matrix based on population averages

T_rom T_rom2 M_touc P_bre T_occ T_stan T_cae T_eur T_eur2 T_eur3
T_rom 0.00 4.66 8.53 8.96 6.40 5.35 9.27 8.42 8.60 8.82
T_rom2 4.66 0.00 8.19 10.98 5.43 5.20 6.96 7.24 8.21 8.26
M_touc 8.53 8.19 0.00 9.88 5.67 10.28 9.10 8.75 5.63 7.58
P_bre 8.96 10.98 9.88 0.00 9.91 10.32 11.93 12.01 10.85 12.09
T_occ 6.40 5.43 5.67 9.91 0.00 6.45 5.72 7.25 6.20 6.57
T_stan 5.35 5.20 10.28 10.32 6.45 0.00 7.66 8.63 9.93 10.32
T_cae 9.27 6.96 9.10 11.93 5.72 7.66 0.00 7.17 8.13 8.21
T_eur 8.42 7.24 8.75 12.01 7.25 8.63 7.17 0.00 5.39 5.35
T_eur2 8.60 8.21 5.63 10.85 6.20 9.93 8.13 5.39 0.00 4.64
T_eur3 8.82 8.26 7.58 12.09 6.57 10.32 8.21 5.35 4.64 0.00

Example: Single Vs. Complete Linkage

Nodes shifted back in Complete Linkage (right) and some cluster changes (e.g., position of ‘M_touc’)
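The comparison above can be sketched with R’s built-in `hclust`. The small 4-object distance matrix below is hypothetical (not the mole data), chosen so the effect of the linkage criterion is easy to see:

```r
# Hypothetical 4-object distance matrix (illustrative values only)
d <- as.dist(matrix(c( 0,  2,  6, 10,
                       2,  0,  5,  9,
                       6,  5,  0,  4,
                      10,  9,  4,  0),
                    nrow = 4,
                    dimnames = list(LETTERS[1:4], LETTERS[1:4])))

hc.single   <- hclust(d, method = "single")    # nearest-neighbor linkage
hc.complete <- hclust(d, method = "complete")  # farthest-neighbor linkage

# Complete linkage slides nodes toward the root (larger join heights)
max(hc.single$height)    # 5
max(hc.complete$height)  # 10

# Side-by-side dendrograms:
# par(mfrow = c(1, 2)); plot(hc.single); plot(hc.complete)
```

Here the root of the single-linkage tree sits at the smallest between-cluster distance (5), while complete linkage places it at the largest (10).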

Single and Complete Linkage: Thoughts

Single linkage and complete linkage represent extreme criteria for clustering (include object at first instance vs. include object only when all instances reached)

Can be sensitive to noise in data (especially single linkage)

Possible improvement: Use a measure of central tendency for groups rather than raw values

Different measures of central tendency have been used

UPGMA

Unweighted Pair-Group Method using Arithmetic Averages

The method is ‘unweighted’ because it gives equal weight to the original similarity scores (e.g., when a 3rd object is added, the new average is found by dividing by 3, etc.)

Example: UPGMA

Node heights are more intermediate as compared with single and complete linkage
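A minimal UPGMA sketch using `hclust`’s "average" method, on a small hypothetical distance matrix (not the mole data):

```r
# Hypothetical distance matrix (illustrative values only)
d <- as.dist(matrix(c( 0,  2,  6, 10,
                       2,  0,  5,  9,
                       6,  5,  0,  4,
                      10,  9,  4,  0),
                    nrow = 4,
                    dimnames = list(LETTERS[1:4], LETTERS[1:4])))

hc.upgma <- hclust(d, method = "average")  # UPGMA

# Root height is the unweighted mean of ALL between-cluster distances:
# (6 + 10 + 5 + 9) / 4 = 7.5, intermediate between single (5) and complete (10)
max(hc.upgma$height)  # 7.5
```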

Representing Object Relationships

Ordination: Large distances more accurately represented

Clustering: Small distances more accurately represented

Ordination & Clustering are complementary views of high-D relationships

WPGMA

Weighted pair-group method using arithmetic averages

Same as UPGMA, but weighted (new averages always found by dividing by 2 regardless of how many objects are in the cluster)
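In `hclust`, the WPGMA criterion goes by the name "mcquitty". A sketch on a small hypothetical distance matrix (not the mole data); note that when the merged clusters happen to be of equal size, as here, WPGMA and UPGMA coincide, and they differ once clusters of unequal size are joined:

```r
# Hypothetical distance matrix (illustrative values only)
d <- as.dist(matrix(c( 0,  2,  6, 10,
                       2,  0,  5,  9,
                       6,  5,  0,  4,
                      10,  9,  4,  0),
                    nrow = 4,
                    dimnames = list(LETTERS[1:4], LETTERS[1:4])))

hc.wpgma <- hclust(d, method = "mcquitty")  # WPGMA

# Final merge: average of the two cluster-to-cluster averages, always dividing by 2:
# ((6 + 5)/2 + (10 + 9)/2) / 2 = (5.5 + 9.5) / 2 = 7.5
max(hc.wpgma$height)  # 7.5
```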

UPGMC & WPGMC

Unweighted & Weighted Pair-Group Methods using Centroids: clusters are joined based on distances between cluster centroids

Can result in ‘reversals’ (negative branch lengths): when the distance of a 3rd object to the centroid of the original pair is smaller than the distance between that pair

Examples

Note negative branch lengths in these examples
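A reversal can be reproduced with three points forming a near-equilateral triangle (hypothetical coordinates). `hclust`’s "centroid" method is conventionally run on squared Euclidean distances:

```r
# Three hypothetical points: the third point ends up closer to the centroid of
# the first merged pair than that pair's own merge height
xy <- rbind(A = c(0, 0), B = c(4, 0), C = c(2, 3.4))

hc.cent <- hclust(dist(xy)^2, method = "centroid")  # UPGMC on squared distances

hc.cent$height            # second merge height is LOWER than the first
diff(hc.cent$height) < 0  # TRUE: a reversal (drawn as a negative branch length)
```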

Ward’s Minimum Variance Method

Use cluster variance (TESS: total error sum of squares, i.e., the pooled within-group sums of squares)

Add an object to the cluster such that TESS is increased the least
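A sketch with `hclust`’s "ward.D2" method (Ward’s criterion applied to a distance matrix); the matrix below is a small hypothetical example, not the mole data:

```r
# Hypothetical distance matrix (illustrative values only)
d <- as.dist(matrix(c( 0,  2,  6, 10,
                       2,  0,  5,  9,
                       6,  5,  0,  4,
                      10,  9,  4,  0),
                    nrow = 4,
                    dimnames = list(LETTERS[1:4], LETTERS[1:4])))

hc.ward <- hclust(d, method = "ward.D2")  # each join minimizes the increase in TESS

# Join heights increase monotonically (no reversals under Ward's method)
!is.unsorted(hc.ward$height)  # TRUE
```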

Clustering: K-Means

Non-hierarchical method of identifying groups

Entire process can be repeated for \(K = 2,3,4\) etc. to find optimal number of groups

Does not yield dendrogram (not hierarchical); only yields # groups and group membership

Example

Attempt 2, 3 and 4 clusters (NOTE: results can change across multiple runs)

# K-means for K = 2, 3, and 4 (fix the seed: assignments can change across runs)
set.seed(42)
kclusters2 <- kmeans(PC.scores, 2)
kclusters3 <- kmeans(PC.scores, 3)
kclusters4 <- kmeans(PC.scores, 4)

Evaluate TESS by N-Groups

# Total within-group SS (TESS) for K = 1 through 6;
# nstart = 20 restarts each fit to avoid poor local optima
set.seed(42)
TESS <- array(NA, 6)
for (i in 1:6){
  TESS[i] <- kmeans(PC.scores, i, nstart = 20)$tot.withinss
}

TESS levels off at \(K=3\) (PC plot confirms this is reasonable)

However, in this case we have 4 groups (2 overlap broadly)…

Clustering: Summary

Clustering provides ‘connected’ view of object similarity

VERY useful when combined with ordination (provide complementary views of data space)

Change of metric/distance measure may alter results

One must make sure the clustering method is commensurate with the ordination method (e.g., K-means uses Euclidean distance to measure TESS, which corresponds to a PC plot)

Other methods exist: minimum spanning tree, flexible-link clustering, probabilistic clustering, evolutionary model-based phylogenetics (parsimony, ML, Bayesian, neighbor-joining, etc.)

Conclusions

Clustering is a useful visualization tool for multivariate data

Displays overall similarity among objects

A complement to ordination approaches we’ve used (e.g., PCA)