Clustering
CMSC 173 - Module 11
Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu
What is Clustering?
Definition
Clustering is the task of grouping a set of objects such that objects in the same group (cluster) are more similar to each other than to those in other groups.
\begin{exampleblock}{Key Characteristics}
- Unsupervised learning
- No labeled data required
- Discover hidden patterns
- Data-driven groupings
\end{exampleblock}
Goal
Find natural groupings in data without prior knowledge of group labels.
Supervised vs Unsupervised Learning
Supervised Learning
- Training data has labels
- Learn mapping: $f: X \rightarrow Y$
- Goal: Predict labels for new data
- Examples: Classification, regression
Example:
- Input: Email text
- Label: Spam/Not Spam
- Task: Learn to classify
Unsupervised Learning
- Training data has no labels
- Discover structure in $X$
- Goal: Find patterns, groups
- Examples: Clustering, dimensionality reduction
Example:
- Input: Customer purchase data
- No labels
- Task: Find customer segments
Clustering = Unsupervised
We discover groups without knowing what they should be in advance.
Real-World Applications
Business \& Marketing
- Customer segmentation: Group customers by behavior
- Market research: Identify consumer groups
- Recommendation systems: Group similar items
Biology \& Medicine
- Gene expression: Find related genes
- Disease diagnosis: Identify patient subgroups
- Protein structure: Analyze protein families
Image \& Vision
- Image segmentation: Group pixels by similarity
- Object recognition: Cluster visual features
- Color quantization: Reduce color palette
Text \& Web
- Document clustering: Group similar documents
- Topic modeling: Discover themes
- Social network analysis: Find communities
Common Theme
All involve finding structure in unlabeled data!
Types of Clustering
Partitional Clustering
- Divide data into K non-overlapping groups
- Each point belongs to exactly one cluster
- Flat structure
- Examples: K-Means, K-Medoids, GMM
Characteristics:
- Need to specify K
- Fast and scalable
- Sensitive to initialization
Hierarchical Clustering
- Build a tree of clusters (dendrogram)
- Can extract K clusters at any level
- Nested structure
- Examples: Agglomerative, Divisive
Characteristics:
- No need to specify K upfront
- More interpretable hierarchy
- Computationally expensive
This Lecture
Focus on Partitional (K-Means, GMM) and Hierarchical methods.
Distance Metrics
Common Distance Metrics
For $\mathbf{x} = (x_1, …, x_d)$ and $\mathbf{y} = (y_1, …, y_d)$:
1. Euclidean Distance (L2)
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$
2. Manhattan Distance (L1)
$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{d} |x_i - y_i|$$
3. Chebyshev Distance (L$\infty$)
$$d(\mathbf{x}, \mathbf{y}) = \max_{i} |x_i - y_i|$$
4. Cosine Similarity (a similarity, not a distance; $1 - \text{sim}$ is commonly used as a distance)
$$\text{sim}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|}$$
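The four formulas above translate directly into NumPy. This is a minimal sketch (the function names are ours, not from any library):

```python
import numpy as np

def euclidean(x, y):
    # L2 distance: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # L1 distance: sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def chebyshev(x, y):
    # L-infinity distance: largest single coordinate difference
    return np.max(np.abs(x - y))

def cosine_similarity(x, y):
    # Cosine of the angle between x and y (a similarity, not a distance)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
print(euclidean(x, y))   # 5.0
print(manhattan(x, y))   # 7.0
print(chebyshev(x, y))   # 4.0
```

Note how the three distances disagree on the same pair of points; the choice of metric changes which points count as "close".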
Properties of Distance Metrics
A valid distance metric must satisfy:
Metric Axioms
For all points $\mathbf{x}, \mathbf{y}, \mathbf{z}$:
- Non-negativity: $d(\mathbf{x}, \mathbf{y}) \geq 0$
- Identity: $d(\mathbf{x}, \mathbf{y}) = 0 \iff \mathbf{x} = \mathbf{y}$
- Symmetry: $d(\mathbf{x}, \mathbf{y}) = d(\mathbf{y}, \mathbf{x})$
- Triangle inequality: $d(\mathbf{x}, \mathbf{z}) \leq d(\mathbf{x}, \mathbf{y}) + d(\mathbf{y}, \mathbf{z})$
\begin{exampleblock}{Choosing the Right Metric}
- Euclidean: Most common, assumes all features equally important
- Manhattan: Less sensitive to outliers, good for high dimensions
- Cosine: Good for text/document clustering (direction matters)
- Custom: Domain-specific distances (e.g., edit distance for strings)
\end{exampleblock}
K-Means Algorithm: Overview
Goal
Partition $n$ points into $K$ clusters
Key Idea
- Each cluster has centroid (mean)
- Assign points to nearest centroid
- Update centroids iteratively
- Minimize within-cluster variance
\begin{exampleblock}{Input \& Output}
Input:
- Dataset $X = \{\mathbf{x}_1, …, \mathbf{x}_n\}$
- Number of clusters $K$
Output:
- Assignments $\{C_1, …, C_K\}$
- Centroids $\{\boldsymbol{\mu}_1, …, \boldsymbol{\mu}_K\}$
\end{exampleblock}
K-Means Objective Function
Minimize within-cluster sum of squares (WCSS):
Objective
$$J = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$$
where:
- $C_k$ = set of points in cluster $k$
- $\boldsymbol{\mu}_k$ = centroid of cluster $k$
- $\|\cdot\|$ = Euclidean distance
\begin{exampleblock}{Interpretation}
- Minimize total squared distance from points to their centroids
- Encourages compact, spherical clusters
- Also called inertia or distortion
- NP-hard to minimize globally, but heuristics work well
\end{exampleblock}
Note
K-Means finds local minimum, not necessarily global!
K-Means Algorithm (Lloyd's Algorithm)
\begin{algorithm}[H]
\caption{K-Means Clustering}
\begin{algorithmic}[1]
\REQUIRE Dataset $X = \{\mathbf{x}_1, …, \mathbf{x}_n\}$, number of clusters $K$
\ENSURE Cluster assignments and centroids
\STATE Initialize $K$ centroids $\{\boldsymbol{\mu}_1, …, \boldsymbol{\mu}_K\}$ randomly
\REPEAT
\STATE Assignment Step:
\FOR{each data point $\mathbf{x}_i$}
\STATE Assign $\mathbf{x}_i$ to cluster $k^* = \arg\min_k \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$
\ENDFOR
\STATE Update Step:
\FOR{each cluster $k = 1, …, K$}
\STATE Update centroid: $\boldsymbol{\mu}_k = \frac{1}{|C_k|} \sum_{\mathbf{x}_i \in C_k} \mathbf{x}_i$
\ENDFOR
\UNTIL{centroids do not change (or max iterations reached)}
\end{algorithmic}
\end{algorithm}
Key Properties
Convergence: Guaranteed (objective always decreases).
Complexity: $O(nKdT)$ where $T$ = iterations
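Lloyd's algorithm translates almost line for line into NumPy. This is an illustrative sketch with random initialization, not a production implementation (the `kmeans` name and its arguments are ours):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal sketch of Lloyd's algorithm (random init; K-Means++ is preferred in practice)."""
    rng = np.random.default_rng(seed)
    # Initialize: pick K distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid (squared Euclidean)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: centroid = mean of its assigned points
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:           # keep the old centroid if a cluster empties
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):   # converged: centroids stopped moving
            return labels, centroids
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [2, 1], [4, 3], [5, 4]], dtype=float)
labels, centroids = kmeans(X, K=2)
```

On the four-point dataset used in the next slides, every initialization converges to centroids $(1.5, 1.5)$ and $(4.5, 3.5)$.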
K-Means Example: Dataset
Let's apply K-Means to a toy dataset with $K=2$
Dataset (4 points, 2D)
\begin{tabular}{ccc}
\toprule
Point & $x_1$ & $x_2$ \\
\midrule
A & 1 & 2 \\
B & 2 & 1 \\
C & 4 & 3 \\
D & 5 & 4 \\
\bottomrule
\end{tabular}
\begin{exampleblock}{Goal}
Cluster into $K = 2$ groups
\end{exampleblock}
\begin{tikzpicture}[scale=0.9]
\draw[-stealth] (0,0) -- (7,0) node[right] {$x_1$};
\draw[-stealth] (0,0) -- (0,6) node[above] {$x_2$};
% Grid
\foreach \x in {1,2,...,6}
\draw[gray!30] (\x,0) -- (\x,6);
\foreach \y in {1,2,...,5}
\draw[gray!30] (0,\y) -- (7,\y);
% Axis labels
\foreach \x in {1,2,...,6}
\node at (\x,-0.3) {\tiny \x};
\foreach \y in {1,2,...,5}
\node at (-0.3,\y) {\tiny \y};
% Data points
\fill[blue] (1,2) circle (3pt) node[above left] {A};
\fill[blue] (2,1) circle (3pt) node[below right] {B};
\fill[blue] (4,3) circle (3pt) node[above left] {C};
\fill[blue] (5,4) circle (3pt) node[above right] {D};
\end{tikzpicture}
K-Means Example: Step 1 - Initialization
Step 1: Randomly initialize 2 centroids
Random Initialization
Let's choose first two points as centroids:
$\boldsymbol{\mu}_1 = \text{Point A} = (1, 2)$
$\boldsymbol{\mu}_2 = \text{Point D} = (5, 4)$
Note
In practice, use K-Means++ initialization!
\begin{tikzpicture}[scale=0.9]
\draw[-stealth] (0,0) -- (7,0) node[right] {$x_1$};
\draw[-stealth] (0,0) -- (0,6) node[above] {$x_2$};
% Grid
\foreach \x in {1,2,...,6}
\draw[gray!30] (\x,0) -- (\x,6);
\foreach \y in {1,2,...,5}
\draw[gray!30] (0,\y) -- (7,\y);
% Data points
\fill[blue] (1,2) circle (3pt) node[above left] {A};
\fill[blue] (2,1) circle (3pt) node[below right] {B};
\fill[blue] (4,3) circle (3pt) node[above left] {C};
\fill[blue] (5,4) circle (3pt) node[above right] {D};
% Centroids
\draw[red, line width=2pt] (1,2) -- (1.3,2) -- (1.15,2.3) -- cycle;
\node[red] at (0.5,2.5) {$\boldsymbol{\mu}_1$};
\draw[green!60!black, line width=2pt] (5,4) -- (5.3,4) -- (5.15,4.3) -- cycle;
\node[green!60!black] at (5.5,4.5) {$\boldsymbol{\mu}_2$};
\end{tikzpicture}
K-Means Example: Step 2 - Assignment
Step 2: Assign each point to nearest centroid
Distance Calculations
\small
For each point, compute distance to both centroids:
Point A (1,2):
- To $\boldsymbol{\mu}_1$: $\sqrt{(1-1)^2 + (2-2)^2} = \mathbf{0}$
- To $\boldsymbol{\mu}_2$: $\sqrt{(1-5)^2 + (2-4)^2} = 4.47$
- $\Rightarrow$ Assign to Cluster 1
Point B (2,1):
- To $\boldsymbol{\mu}_1$: $\sqrt{(2-1)^2 + (1-2)^2} = \mathbf{1.41}$
- To $\boldsymbol{\mu}_2$: $\sqrt{(2-5)^2 + (1-4)^2} = 4.24$
- $\Rightarrow$ Assign to Cluster 1
Point C (4,3):
- To $\boldsymbol{\mu}_1$: $\sqrt{(4-1)^2 + (3-2)^2} = 3.16$
- To $\boldsymbol{\mu}_2$: $\sqrt{(4-5)^2 + (3-4)^2} = \mathbf{1.41}$
- $\Rightarrow$ Assign to Cluster 2
Point D (5,4):
- To $\boldsymbol{\mu}_1$: $\sqrt{(5-1)^2 + (4-2)^2} = 4.47$
- To $\boldsymbol{\mu}_2$: $\sqrt{(5-5)^2 + (4-4)^2} = \mathbf{0}$
- $\Rightarrow$ Assign to Cluster 2
\begin{tikzpicture}[scale=0.9]
\draw[-stealth] (0,0) -- (7,0) node[right] {$x_1$};
\draw[-stealth] (0,0) -- (0,6) node[above] {$x_2$};
% Grid
\foreach \x in {1,2,...,6}
\draw[gray!30] (\x,0) -- (\x,6);
\foreach \y in {1,2,...,5}
\draw[gray!30] (0,\y) -- (7,\y);
% Data points colored by cluster
\fill[red!70] (1,2) circle (3pt) node[above left] {A};
\fill[red!70] (2,1) circle (3pt) node[below right] {B};
\fill[green!60!black] (4,3) circle (3pt) node[above left] {C};
\fill[green!60!black] (5,4) circle (3pt) node[above right] {D};
% Centroids
\draw[red, line width=2pt] (1,2) -- (1.3,2) -- (1.15,2.3) -- cycle;
\node[red] at (0.5,2.5) {$\boldsymbol{\mu}_1$};
\draw[green!60!black, line width=2pt] (5,4) -- (5.3,4) -- (5.15,4.3) -- cycle;
\node[green!60!black] at (5.5,4.5) {$\boldsymbol{\mu}_2$};
\end{tikzpicture}
\begin{exampleblock}{Clusters}
C1: \{A, B\} \\
C2: \{C, D\}
\end{exampleblock}
K-Means Example: Step 3 - Update Centroids
Step 3: Recompute centroids as mean of assigned points
New Centroids
Cluster 1 contains: A(1,2), B(2,1)
$$\boldsymbol{\mu}_1 = \frac{1}{2}\left[(1,2) + (2,1)\right]$$
$$= \left(\frac{1+2}{2}, \frac{2+1}{2}\right) = \mathbf{(1.5, 1.5)}$$
Cluster 2 contains: C(4,3), D(5,4)
$$\boldsymbol{\mu}_2 = \frac{1}{2}\left[(4,3) + (5,4)\right]$$
$$= \left(\frac{4+5}{2}, \frac{3+4}{2}\right) = \mathbf{(4.5, 3.5)}$$
\begin{tikzpicture}[scale=0.9]
\draw[-stealth] (0,0) -- (7,0) node[right] {$x_1$};
\draw[-stealth] (0,0) -- (0,6) node[above] {$x_2$};
% Grid
\foreach \x in {1,2,...,6}
\draw[gray!30] (\x,0) -- (\x,6);
\foreach \y in {1,2,...,5}
\draw[gray!30] (0,\y) -- (7,\y);
% Data points
\fill[red!70] (1,2) circle (3pt) node[above left] {A};
\fill[red!70] (2,1) circle (3pt) node[below right] {B};
\fill[green!60!black] (4,3) circle (3pt) node[above left] {C};
\fill[green!60!black] (5,4) circle (3pt) node[above right] {D};
% OLD centroids (dashed)
\draw[red, line width=1pt, dashed] (1,2) -- (1.3,2) -- (1.15,2.3) -- cycle;
\draw[green!60!black, line width=1pt, dashed] (5,4) -- (5.3,4) -- (5.15,4.3) -- cycle;
% NEW centroids (solid)
\draw[red, line width=2pt] (1.5,1.5) -- (1.8,1.5) -- (1.65,1.8) -- cycle;
\node[red] at (1.0,1.0) {$\boldsymbol{\mu}_1$};
\draw[green!60!black, line width=2pt] (4.5,3.5) -- (4.8,3.5) -- (4.65,3.8) -- cycle;
\node[green!60!black] at (5.0,3.0) {$\boldsymbol{\mu}_2$};
\end{tikzpicture}
Centroids moved!
Old (dashed) $\rightarrow$ New (solid)
K-Means Example: Iteration 2 - Assignment
Iteration 2: Re-assign points to NEW centroids
Distance Calculations
\small
Using new centroids $\boldsymbol{\mu}_1 = (1.5, 1.5)$, $\boldsymbol{\mu}_2 = (4.5, 3.5)$:
Point A (1,2):
- To $\boldsymbol{\mu}_1$: $\sqrt{(1-1.5)^2 + (2-1.5)^2} = \mathbf{0.71}$
- To $\boldsymbol{\mu}_2$: $\sqrt{(1-4.5)^2 + (2-3.5)^2} = 3.81$
- $\Rightarrow$ Cluster 1 (no change)
Point B (2,1):
- To $\boldsymbol{\mu}_1$: $\sqrt{(2-1.5)^2 + (1-1.5)^2} = \mathbf{0.71}$
- To $\boldsymbol{\mu}_2$: $\sqrt{(2-4.5)^2 + (1-3.5)^2} = 3.54$
- $\Rightarrow$ Cluster 1 (no change)
Point C (4,3):
- To $\boldsymbol{\mu}_1$: $\sqrt{(4-1.5)^2 + (3-1.5)^2} = 2.92$
- To $\boldsymbol{\mu}_2$: $\sqrt{(4-4.5)^2 + (3-3.5)^2} = \mathbf{0.71}$
- $\Rightarrow$ Cluster 2 (no change)
Point D (5,4):
- To $\boldsymbol{\mu}_1$: $\sqrt{(5-1.5)^2 + (4-1.5)^2} = 4.30$
- To $\boldsymbol{\mu}_2$: $\sqrt{(5-4.5)^2 + (4-3.5)^2} = \mathbf{0.71}$
- $\Rightarrow$ Cluster 2 (no change)
\begin{tikzpicture}[scale=0.9]
\draw[-stealth] (0,0) -- (7,0) node[right] {$x_1$};
\draw[-stealth] (0,0) -- (0,6) node[above] {$x_2$};
% Grid
\foreach \x in {1,2,...,6}
\draw[gray!30] (\x,0) -- (\x,6);
\foreach \y in {1,2,...,5}
\draw[gray!30] (0,\y) -- (7,\y);
% Data points
\fill[red!70] (1,2) circle (3pt) node[above left] {A};
\fill[red!70] (2,1) circle (3pt) node[below right] {B};
\fill[green!60!black] (4,3) circle (3pt) node[above left] {C};
\fill[green!60!black] (5,4) circle (3pt) node[above right] {D};
% Centroids
\draw[red, line width=2pt] (1.5,1.5) -- (1.8,1.5) -- (1.65,1.8) -- cycle;
\node[red] at (1.0,1.0) {$\boldsymbol{\mu}_1$};
\draw[green!60!black, line width=2pt] (4.5,3.5) -- (4.8,3.5) -- (4.65,3.8) -- cycle;
\node[green!60!black] at (5.0,3.0) {$\boldsymbol{\mu}_2$};
\end{tikzpicture}
\begin{exampleblock}{Result}
No changes!\\
Assignments: Same as before
\end{exampleblock}
K-Means Example: Convergence
Step 4: Check convergence
Convergence Achieved!
Since no points changed clusters, the algorithm has converged.
Final Clusters:
- Cluster 1: \{A(1,2), B(2,1)\}
- Cluster 2: \{C(4,3), D(5,4)\}
Final Centroids:
- $\boldsymbol{\mu}_1 = (1.5, 1.5)$
- $\boldsymbol{\mu}_2 = (4.5, 3.5)$
\begin{tikzpicture}[scale=0.9]
\draw[-stealth] (0,0) -- (7,0) node[right] {$x_1$};
\draw[-stealth] (0,0) -- (0,6) node[above] {$x_2$};
% Grid
\foreach \x in {1,2,...,6}
\draw[gray!30] (\x,0) -- (\x,6);
\foreach \y in {1,2,...,5}
\draw[gray!30] (0,\y) -- (7,\y);
% Voronoi boundaries (perpendicular bisector)
\draw[blue, dashed, thick] (3,0) -- (3,6);
% Data points
\fill[red!70] (1,2) circle (3pt) node[above left] {A};
\fill[red!70] (2,1) circle (3pt) node[below right] {B};
\fill[green!60!black] (4,3) circle (3pt) node[above left] {C};
\fill[green!60!black] (5,4) circle (3pt) node[above right] {D};
% Centroids
\draw[red, line width=2pt] (1.5,1.5) -- (1.8,1.5) -- (1.65,1.8) -- cycle;
\node[red] at (1.0,1.0) {$\boldsymbol{\mu}_1$};
\draw[green!60!black, line width=2pt] (4.5,3.5) -- (4.8,3.5) -- (4.65,3.8) -- cycle;
\node[green!60!black] at (5.0,3.0) {$\boldsymbol{\mu}_2$};
% Cluster regions
\node[red!70] at (1.5,4.5) {C1};
\node[green!60!black] at (4.5,1.5) {C2};
\end{tikzpicture}
Key Insight
K-Means partitions space with linear decision boundaries (Voronoi cells)
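As a sanity check on the worked example, we can recompute the objective $J$ for the converged clustering. Each of the four points lies at squared distance $0.5$ from its centroid, so $J = 2$:

```python
import numpy as np

# Final clustering from the worked example
points = {'A': (1, 2), 'B': (2, 1), 'C': (4, 3), 'D': (5, 4)}
clusters = {1: ['A', 'B'], 2: ['C', 'D']}

J = 0.0
for members in clusters.values():
    pts = np.array([points[m] for m in members], dtype=float)
    mu = pts.mean(axis=0)           # (1.5, 1.5) and (4.5, 3.5), matching the slides
    J += ((pts - mu) ** 2).sum()    # squared Euclidean distances to the centroid

print(J)  # 2.0
```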
K-Means: Voronoi Tessellation
Geometric Interpretation
K-Means creates a Voronoi diagram:
- Space partitioned into regions
- Points closest to one centroid
- Decision boundaries are linear
- Forms convex, polygonal cells
\begin{exampleblock}{Implications}
- Works well for spherical clusters
- Struggles with elongated shapes
- Assumes equal variance
- Sensitive to outliers
\end{exampleblock}
K-Means Initialization: The Challenge
Problem
Sensitive to initial centroids
Random Init Issues
- Poor local minima
- High variance
- Multiple runs needed
\begin{exampleblock}{Practice}
Run 10-100 times, keep best WCSS
\end{exampleblock}
K-Means++ Initialization
Smarter initialization strategy (Arthur \& Vassilvitskii, 2007)
K-Means++ Algorithm
- Choose first centroid $\boldsymbol{\mu}_1$ uniformly at random from data points
- For $k = 2, …, K$:
- For each point $\mathbf{x}_i$, compute $D(\mathbf{x}_i)$ = distance to nearest centroid
- Choose next centroid $\boldsymbol{\mu}_k$ with probability $\propto D(\mathbf{x}_i)^2$
- Run standard K-Means with these initial centroids
\begin{exampleblock}{Advantages}
- Spreads out initial centroids
- Provably better: expected $O(\log K)$-approximation of the optimal WCSS
- Lower variance, more consistent results
- Standard in scikit-learn and most libraries
\end{exampleblock}
Recommendation
Always use K-Means++ unless you have domain knowledge for better initialization.
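The seeding procedure above fits in a few lines of NumPy. This is an illustrative sketch (the `kmeans_pp_init` name is ours; library implementations add refinements such as sampling several candidates per step):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """Sketch of K-Means++ seeding: first centroid uniform, later ones D(x)^2-weighted."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]       # first centroid: uniform at random
    for _ in range(1, K):
        # D(x)^2 = squared distance from each point to its nearest chosen centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
print(kmeans_pp_init(X, K=2))
```

Because an already-chosen centroid has $D(\mathbf{x})^2 = 0$, it can never be picked twice, and far-away points are strongly favored.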
Choosing K: The Elbow Method
Elbow Method
- Run K-Means for $K = 1, 2, …, K_{\max}$
- Plot WCSS vs $K$
- Look for the "elbow" point
- Choose $K$ with diminishing returns
\begin{exampleblock}{Interpretation}
- WCSS decreases as $K$ increases
- Elbow = fit vs complexity trade-off
- Not always clear/unique
\end{exampleblock}
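With scikit-learn (assuming it is available), the elbow curve is a short loop over `KMeans`, whose `inertia_` attribute is exactly the WCSS objective $J$:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs, so the elbow should appear at K = 2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

wcss = []
for K in range(1, 6):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)   # inertia_ = within-cluster sum of squares (WCSS)

# WCSS keeps decreasing with K; the sharp drop from K=1 to K=2 is the "elbow"
print([round(w, 1) for w in wcss])
```

Plotting `wcss` against `K` (e.g. with matplotlib) makes the elbow visible at a glance.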
Choosing K: Silhouette Analysis
Silhouette Coefficient
For each point $\mathbf{x}_i$:
1. $a_i$ = mean distance to the other points in its own cluster
2. $b_i$ = mean distance to the points in the nearest other cluster
3. Silhouette:
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$
Range: $s_i \in [-1, 1]$
- $s_i \approx 1$: Well clustered
- $s_i \approx 0$: On border
- $s_i < 0$: Wrong cluster
Usage
Choose $K$ maximizing avg score
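A sketch of silhouette-based selection using scikit-learn's `silhouette_score` (assuming scikit-learn is available; synthetic two-blob data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

scores = {}
for K in range(2, 6):   # silhouette is undefined for K = 1
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    scores[K] = silhouette_score(X, labels)   # average s_i over all points

best_K = max(scores, key=scores.get)
print(best_K)   # 2 for this two-blob dataset
```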
Limitations of K-Means
Key Issues
- Hard assignments: Binary membership
- Spherical clusters: Equal variance assumed
- No uncertainty: Can't express doubt
- Outliers: Forced into clusters
\begin{exampleblock}{When K-Means Struggles}
- Elongated/elliptical clusters
- Different sizes/densities
- Overlapping clusters
- Need probability of membership
\end{exampleblock}
\begin{tipblock}{Solution}
Gaussian Mixture Models (GMM) provide soft, probabilistic clustering
\end{tipblock}
Gaussian Mixture Models: Formulation
Model data as generated from mixture of $K$ Gaussian distributions
Generative Model
$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
where:
- $\pi_k$ = mixing coefficient (prior probability of cluster $k$), $\sum_k \pi_k = 1$
- $\boldsymbol{\mu}_k$ = mean of Gaussian $k$
- $\boldsymbol{\Sigma}_k$ = covariance matrix of Gaussian $k$
- $\mathcal{N}(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma})$ = multivariate Gaussian
\begin{exampleblock}{Soft Assignment}
Probability that point $\mathbf{x}_i$ belongs to cluster $k$:
$$\gamma_{ik} = p(z_i = k | \mathbf{x}_i) = \frac{\pi_k \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
\end{exampleblock}
GMM: Expectation-Maximization Algorithm
Learn parameters $\{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}$ using EM
EM Algorithm
Initialize: Random $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k = I$, $\pi_k = 1/K$
Repeat until convergence:
E-step: Compute responsibilities (soft assignments)
$$\gamma_{ik} = \frac{\pi_k \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
M-step: Update parameters
$$\pi_k = \frac{1}{n}\sum_{i=1}^{n} \gamma_{ik}, \quad
\boldsymbol{\mu}_k = \frac{\sum_i \gamma_{ik} \mathbf{x}_i}{\sum_i \gamma_{ik}}, \quad
\boldsymbol{\Sigma}_k = \frac{\sum_i \gamma_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^T}{\sum_i \gamma_{ik}}$$
Properties
Monotonically increases likelihood. Guaranteed to converge to local maximum.
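In practice GMMs are rarely implemented by hand; scikit-learn's `GaussianMixture` runs EM internally and exposes the responsibilities $\gamma_{ik}$ via `predict_proba`. A sketch on synthetic data (assuming scikit-learn is available):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic components with different spreads
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 0.8, (100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)

gamma = gmm.predict_proba(X)   # responsibilities gamma_ik; each row sums to 1
hard = gmm.predict(X)          # hard labels = argmax_k of the responsibilities
print(np.sort(gmm.means_[:, 0]))   # component means: near 0 and near 4
```

Points deep inside a component get responsibilities near 0 or 1; points between the components get genuinely fractional values, which is exactly the uncertainty K-Means cannot express.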
GMM vs K-Means Comparison
K-Means
Pros:
- Simple, fast, scalable
- Easy to implement
- Works well for spherical clusters
- Fewer parameters to tune
Cons:
- Hard assignments only
- Assumes spherical clusters
- Sensitive to initialization
- No measure of uncertainty
GMM
Pros:
- Soft probabilistic assignments
- Flexible cluster shapes (elliptical)
- Measures uncertainty
- Principled statistical model
Cons:
- Slower than K-Means
- More parameters ($\boldsymbol{\Sigma}_k$)
- Can overfit with full covariance
- Also sensitive to initialization
Note
K-Means is the limiting case of GMM with $\boldsymbol{\Sigma}_k = \sigma^2 I$ as $\sigma^2 \to 0$: the soft responsibilities collapse into hard assignments!
Hierarchical Clustering: Overview
Agglomerative (Bottom-up)
- Start: Each point = cluster
- Merge closest clusters
- End: One cluster
Divisive (Top-down)
- Start: All in one cluster
- Split clusters iteratively
- End: Each point separate
\begin{exampleblock}{Key Advantage}
No need to specify $K$ upfront! Cut dendrogram at any height to get desired clusters.
\end{exampleblock}
Agglomerative Clustering Algorithm
Most common hierarchical method
Algorithm
- Initialize: Each of $n$ points is own cluster
- Compute: Distance matrix between all clusters
- Repeat until one cluster remains:
- Find pair of closest clusters
- Merge them into single cluster
- Update distance matrix
- Output: Dendrogram showing merge history
\begin{exampleblock}{Complexity}
Time: $O(n^3)$ naively; $O(n^2 \log n)$ with efficient data structures (priority queues)
Space: $O(n^2)$ for distance matrix
\end{exampleblock}
Challenge
How do we measure distance between clusters (not just points)?
Hierarchical Clustering Example: Dataset
Let's apply Agglomerative Clustering to 5 points (Single Linkage)
Dataset (5 points, 1D)
\begin{tabular}{cc}
\toprule
Point & Position \\
\midrule
A & 2 \\
B & 4 \\
C & 5 \\
D & 10 \\
E & 12 \\
\bottomrule
\end{tabular}
\begin{exampleblock}{Goal}
Build dendrogram using
Single Linkage
\end{exampleblock}
\begin{tikzpicture}[scale=1.1]
% Number line
\draw[-stealth] (0,0) -- (14,0) node[right] {$x$};
% Tick marks
\draw (0,0.1) -- (0,-0.1) node[below] {\tiny 0};
\draw (2,0.1) -- (2,-0.1) node[below] {\tiny 2};
\draw (4,0.1) -- (4,-0.1) node[below] {\tiny 4};
\draw (6,0.1) -- (6,-0.1) node[below] {\tiny 6};
\draw (8,0.1) -- (8,-0.1) node[below] {\tiny 8};
\draw (10,0.1) -- (10,-0.1) node[below] {\tiny 10};
\draw (12,0.1) -- (12,-0.1) node[below] {\tiny 12};
% Data points
\fill[blue] (2,0) circle (2.5pt) node[above=3pt] {A};
\fill[blue] (4,0) circle (2.5pt) node[above=3pt] {B};
\fill[blue] (5,0) circle (2.5pt) node[above=3pt] {C};
\fill[blue] (10,0) circle (2.5pt) node[above=3pt] {D};
\fill[blue] (12,0) circle (2.5pt) node[above=3pt] {E};
\end{tikzpicture}
Initial State
Each point = own cluster\\
5 clusters: \{A\}, \{B\}, \{C\}, \{D\}, \{E\}
Hierarchical Example: Step 1 - Distance Matrix
Step 1: Compute pairwise distance matrix
Distance Matrix
\small
\begin{tabular}{c|ccccc}
& A & B & C & D & E \\
\hline
A & 0 & 2 & 3 & 8 & 10 \\
B & 2 & 0 & 1 & 6 & 8 \\
C & 3 & 1 & 0 & 5 & 7 \\
D & 8 & 6 & 5 & 0 & 2 \\
E & 10 & 8 & 7 & 2 & 0 \\
\end{tabular}
\begin{exampleblock}{Find Minimum}
Smallest distance = 1, between B and C
$\Rightarrow$ Merge \{B\} and \{C\}
\end{exampleblock}
Dendrogram (Step 1)
\begin{tikzpicture}[scale=0.9]
% Leaves
\node (A) at (0,0) {A};
\node (B) at (2,0) {B};
\node (C) at (3,0) {C};
\node (D) at (5,0) {D};
\node (E) at (6,0) {E};
% First merge: B-C at height 1
\draw (B) -- (2.5,1);
\draw (C) -- (2.5,1);
\draw[blue, line width=1.5pt] (2.5,1) -- (2.5,1.5) node[right, black] {\tiny h=1};
% Y-axis
\draw[-stealth] (-0.5,0) -- (-0.5,5) node[above] {\tiny height};
\draw (-0.6,0) -- (-0.4,0) node[left] {\tiny 0};
\draw (-0.6,1) -- (-0.4,1) node[left] {\tiny 1};
\draw (-0.6,2) -- (-0.4,2) node[left] {\tiny 2};
\draw (-0.6,3) -- (-0.4,3) node[left] {\tiny 3};
\draw (-0.6,4) -- (-0.4,4) node[left] {\tiny 4};
\draw (-0.6,5) -- (-0.4,5) node[left] {\tiny 5};
\end{tikzpicture}
Current Clusters
\{A\}, \{B, C\}, \{D\}, \{E\}\\
4 clusters remain
Hierarchical Example: Step 2 - Update Matrix
Step 2: Update distance matrix using Single Linkage
Single Linkage Rule
$$d(\{B,C\}, X) = \min(d(B,X), d(C,X))$$
New distances to cluster \{B,C\}:
- $d(\{B,C\}, A) = \min(2, 3) = \mathbf{2}$
- $d(\{B,C\}, D) = \min(6, 5) = \mathbf{5}$
- $d(\{B,C\}, E) = \min(8, 7) = \mathbf{7}$
Updated Matrix
\small
\begin{tabular}{c|cccc}
& A & \{B,C\} & D & E \\
\hline
A & 0 & 2 & 8 & 10 \\
\{B,C\} & 2 & 0 & 5 & 7 \\
D & 8 & 5 & 0 & 2 \\
E & 10 & 7 & 2 & 0 \\
\end{tabular}
Dendrogram (Step 2)
\begin{tikzpicture}[scale=0.9]
% Leaves
\node (A) at (0,0) {A};
\node (B) at (2,0) {B};
\node (C) at (3,0) {C};
\node (D) at (5,0) {D};
\node (E) at (6,0) {E};
% First merge: B-C at height 1
\draw (B) -- (2.5,1);
\draw (C) -- (2.5,1);
\draw[blue, line width=1.5pt] (2.5,1) -- (2.5,1.5);
% Second merge: D-E at height 2
\draw (D) -- (5.5,2);
\draw (E) -- (5.5,2);
\draw[green!60!black, line width=1.5pt] (5.5,2) -- (5.5,2.5) node[right, black] {\tiny h=2};
% Y-axis
\draw[-stealth] (-0.5,0) -- (-0.5,5) node[above] {\tiny height};
\draw (-0.6,0) -- (-0.4,0) node[left] {\tiny 0};
\draw (-0.6,1) -- (-0.4,1) node[left] {\tiny 1};
\draw (-0.6,2) -- (-0.4,2) node[left] {\tiny 2};
\draw (-0.6,3) -- (-0.4,3) node[left] {\tiny 3};
\draw (-0.6,4) -- (-0.4,4) node[left] {\tiny 4};
\draw (-0.6,5) -- (-0.4,5) node[left] {\tiny 5};
\end{tikzpicture}
\begin{exampleblock}{Next Merge}
Min distance = 2\\
Merge \{D\} and \{E\} at height 2
\end{exampleblock}
Hierarchical Example: Steps 3-4
Continue merging until one cluster remains
Step 3
Clusters: \{A\}, \{B,C\}, \{D,E\}
Updated distances:
- $d(A, \{B,C\}) = 2$
- $d(A, \{D,E\}) = 8$
- $d(\{B,C\}, \{D,E\}) = 5$
Min = 2: Merge A with \{B,C\}\\
New cluster: \{A, B, C\} at height 2
Step 4 (Final)
Clusters: \{A,B,C\}, \{D,E\}
Distance: $d(\{A,B,C\}, \{D,E\}) = 5$
Final merge at height 5\\
One cluster: \{A, B, C, D, E\}
Complete Dendrogram
\begin{tikzpicture}[scale=0.85]
% Leaves
\node (A) at (0,0) {A};
\node (B) at (2,0) {B};
\node (C) at (3,0) {C};
\node (D) at (5,0) {D};
\node (E) at (6,0) {E};
% Merge B-C at h=1
\draw (B) -- (2.5,1);
\draw (C) -- (2.5,1);
\draw[blue, line width=1.2pt] (2.5,1) -- (2.5,2);
% Merge D-E at h=2
\draw (D) -- (5.5,2);
\draw (E) -- (5.5,2);
\draw[green!60!black, line width=1.2pt] (5.5,2) -- (5.5,5);
% Merge A with {B,C} at h=2
\draw (A) -- (0,2) -- (1.25,2);
\draw (2.5,2) -- (1.25,2);
\draw[red, line width=1.2pt] (1.25,2) -- (1.25,5);
% Final merge at h=5
\draw[purple, line width=1.5pt] (1.25,5) -- (3.375,5) -- (5.5,5);
% Height annotations
\node[right] at (3.375,5) {\tiny h=5};
\node[left] at (1.25,2) {\tiny h=2};
\node[left] at (2.5,1) {\tiny h=1};
\node[right] at (5.5,2) {\tiny h=2};
% Y-axis
\draw[-stealth] (-0.8,0) -- (-0.8,5.5) node[above] {\tiny height};
\draw (-0.9,0) -- (-0.7,0) node[left] {\tiny 0};
\draw (-0.9,1) -- (-0.7,1) node[left] {\tiny 1};
\draw (-0.9,2) -- (-0.7,2) node[left] {\tiny 2};
\draw (-0.9,3) -- (-0.7,3) node[left] {\tiny 3};
\draw (-0.9,4) -- (-0.7,4) node[left] {\tiny 4};
\draw (-0.9,5) -- (-0.7,5) node[left] {\tiny 5};
\end{tikzpicture}
Interpretation
Cut at different heights to get different K clusters
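The whole worked example can be verified with SciPy's `linkage`, whose third column records the merge heights (1, 2, 2, 5 here, matching the dendrogram):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# The five 1D points A..E from the worked example
X = np.array([[2.0], [4.0], [5.0], [10.0], [12.0]])

# Each row of Z records one merge: (cluster i, cluster j, merge height, new cluster size)
Z = linkage(X, method='single')
print(Z[:, 2])   # merge heights: [1. 2. 2. 5.]
```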
Hierarchical Example: Cutting the Dendrogram
Extract different numbers of clusters by cutting at different heights
\begin{tikzpicture}[scale=1.0]
% Dendrogram
\node (A) at (0,0) {A};
\node (B) at (2,0) {B};
\node (C) at (3,0) {C};
\node (D) at (5,0) {D};
\node (E) at (6,0) {E};
\draw (B) -- (2.5,1);
\draw (C) -- (2.5,1);
\draw (2.5,1) -- (2.5,2);
\draw (D) -- (5.5,2);
\draw (E) -- (5.5,2);
\draw (5.5,2) -- (5.5,5);
\draw (A) -- (0,2) -- (1.25,2);
\draw (2.5,2) -- (1.25,2);
\draw (1.25,2) -- (1.25,5);
\draw (1.25,5) -- (3.375,5) -- (5.5,5);
% Cut lines
\draw[red, dashed, line width=2pt] (-0.5,4.5) -- (6.5,4.5) node[right] {K=2: \{A,B,C\}, \{D,E\}};
\draw[orange, dashed, line width=2pt] (-0.5,1.5) -- (6.5,1.5) node[right] {K=4: \{A\}, \{B,C\}, \{D\}, \{E\}};
% Note: the A-{B,C} and D-E merges tie at height 2, so no horizontal cut of this dendrogram yields exactly K=3
% Y-axis
\draw[-stealth] (-0.8,0) -- (-0.8,5.5) node[above] {height};
\draw (-0.9,0) -- (-0.7,0) node[left] {\tiny 0};
\draw (-0.9,1) -- (-0.7,1) node[left] {\tiny 1};
\draw (-0.9,2) -- (-0.7,2) node[left] {\tiny 2};
\draw (-0.9,3) -- (-0.7,3) node[left] {\tiny 3};
\draw (-0.9,4) -- (-0.7,4) node[left] {\tiny 4};
\draw (-0.9,5) -- (-0.7,5) node[left] {\tiny 5};
\end{tikzpicture}
Key Advantage of Hierarchical Clustering
No need to pre-specify K! The dendrogram shows the full hierarchy.
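Cutting is what SciPy's `fcluster` does: with `criterion='distance'`, every merge above the threshold height is undone. A sketch on the same five points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[2.0], [4.0], [5.0], [10.0], [12.0]])   # points A..E
Z = linkage(X, method='single')

# Cut at height 4: the h=5 merge is undone, leaving {A,B,C} and {D,E}
two = fcluster(Z, t=4, criterion='distance')
# Cut at height 1.5: only the h=1 merge survives, leaving {A}, {B,C}, {D}, {E}
four = fcluster(Z, t=1.5, criterion='distance')
print(two, four)
```

Both cuts reuse the same `Z`; the tree is built once and can be cut at any height afterwards.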
Linkage Criteria
Different ways to measure inter-cluster distance
Common Linkage Methods
For clusters $C_i$ and $C_j$:
1. Single Linkage (MIN):
$$d(C_i, C_j) = \min_{\mathbf{x} \in C_i, \mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})$$
2. Complete Linkage (MAX):
$$d(C_i, C_j) = \max_{\mathbf{x} \in C_i, \mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})$$
3. Average Linkage:
$$d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{\mathbf{x} \in C_i} \sum_{\mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})$$
4. Ward's Linkage:
$$d(C_i, C_j) = \text{increase in WCSS when merging } C_i \text{ and } C_j$$
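Running SciPy's `linkage` on the earlier five-point example with each criterion shows how the merge heights differ (Ward merges by minimum variance increase, so its heights are on a different scale than the distance-based criteria):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[2.0], [4.0], [5.0], [10.0], [12.0]])   # A..E from the earlier example

# Same data, different linkage criteria -> different merge heights
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    print(f"{method:>8}: heights = {Z[:, 2].round(2)}")
```

Single linkage merges everything at heights (1, 2, 2, 5); complete linkage ends at height 10, the distance between the farthest pair A and E.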
Linkage Methods: Comparison
Single
- Elongated
- Noise sensitive
Complete
- Compact
- Outlier sensitive
Ward's
- Min variance
- Most popular
Dendrograms: Interpretation
Reading Dendrograms
- Leaves: Individual points
- Height: Merge distance
- Branches: Relationships
- Cut: Extract $K$ clusters
\begin{exampleblock}{Extracting Clusters}
Cut at height $h$:
- Higher $\rightarrow$ fewer clusters
- Lower $\rightarrow$ more clusters
- Choose via validation
\end{exampleblock}
Advantage
Explore different $K$ without re-running!
Types of Cluster Validation
How do we assess clustering quality?
Internal Validation
Use only the data itself
Measures:
- Silhouette coefficient
- Davies-Bouldin index
- Calinski-Harabasz score
- Dunn index
Idea: Good clusters are compact and well-separated
External Validation
Compare to ground truth labels
Measures:
- Adjusted Rand Index (ARI)
- Normalized Mutual Information (NMI)
- V-measure
- Purity
Idea: Good clustering agrees with true labels
When to Use Each
Internal: Unsupervised setting (no labels)
External: When ground truth available (benchmarking)
Internal Metrics: Formulas
1. Silhouette Coefficient
$$s = \frac{1}{n} \sum_{i=1}^{n} \frac{b_i - a_i}{\max(a_i, b_i)}$$
Range: $[-1, 1]$, higher is better
2. Davies-Bouldin Index
$$DB = \frac{1}{K} \sum_{k=1}^{K} \max_{k' \neq k} \frac{\sigma_k + \sigma_{k'}}{d(\boldsymbol{\mu}_k, \boldsymbol{\mu}_{k'})}$$
Range: $[0, \infty)$, lower is better
3. Calinski-Harabasz Score (Variance Ratio)
$$CH = \frac{\text{Between-cluster variance}}{\text{Within-cluster variance}} \times \frac{n - K}{K - 1}$$
Range: $[0, \infty)$, higher is better
Usage
Use multiple metrics! Different metrics may favor different clusterings.
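All three internal metrics are available in `sklearn.metrics` (assuming scikit-learn is installed). A sketch on an easy two-blob dataset, where all of them should agree that the clustering is good:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))         # in [-1, 1], higher is better
print(davies_bouldin_score(X, labels))     # >= 0, lower is better
print(calinski_harabasz_score(X, labels))  # >= 0, higher is better
```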
Internal Validation: Visualization
Observation
Different metrics may suggest different optimal $K$. Use domain knowledge!
External Metrics: Formulas
Given true labels $Y$ and predicted labels $C$:
1. Adjusted Rand Index (ARI)
$$ARI = \frac{\text{RI} - E[\text{RI}]}{\max(\text{RI}) - E[\text{RI}]}$$
- Range: $[-1, 1]$, higher is better
- $ARI = 1$: Perfect agreement
- $ARI \approx 0$: Random labeling
- Adjusted for chance
2. Normalized Mutual Information (NMI)
$$NMI(Y, C) = \frac{2 \cdot I(Y; C)}{H(Y) + H(C)}$$
- Range: $[0, 1]$, higher is better
- $NMI = 1$: Perfect agreement
- Based on information theory
- Normalized for different $K$
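Both metrics are in `sklearn.metrics` (assuming scikit-learn is available). A small sketch showing that they are invariant to permuting cluster labels, which raw accuracy is not:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 1, 0, 0, 0]   # same partition, labels permuted

# Both metrics compare partitions, so label permutation does not matter
print(adjusted_rand_score(y_true, y_pred))            # 1.0
print(normalized_mutual_info_score(y_true, y_pred))   # 1.0

y_bad = [0, 1, 0, 1, 0, 1]    # disagrees with the true grouping
print(adjusted_rand_score(y_true, y_bad))             # below 0: worse than chance
```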
External Validation: Example
Note
External validation only for benchmarking. In real unsupervised tasks, no ground truth!
Application: Customer Segmentation
\begin{exampleblock}{Business Impact}
- Targeted marketing: Strategies per segment
- Product development: Tailor to groups
- Resource allocation: Focus on high-value
- Customer retention: Identify at-risk
\end{exampleblock}
Application: Image Color Quantization
\begin{exampleblock}{Use Cases}
- Image compression: Reduce file size
- Color palette: Identify dominant colors
- Segmentation: Group similar pixels
- Artistic effects: Posterization
\end{exampleblock}
Application: Biological Data (Iris Dataset)
\begin{exampleblock}{Bioinformatics Applications}
- Species classification: Taxonomic groups
- Gene expression: Co-expressed genes
- Protein structure: Protein families
- Disease subtyping: Patient subtypes
\end{exampleblock}
Application: Document Clustering
\begin{exampleblock}{Text Mining Applications}
- Topic discovery: Find themes in document collections
- News organization: Group similar articles
- Search results: Organize by topic clusters
- Recommendation: Find similar documents
\end{exampleblock}
Choosing a Clustering Algorithm
Decision Guide
Use K-Means when:
- You know $K$ (or can estimate it)
- Data has roughly spherical clusters
- Large dataset (scalability important)
- Want fast, simple method
Use GMM when:
- Need probabilistic assignments
- Clusters have different shapes/variances
- Want to measure uncertainty
- Have computational resources
Use Hierarchical when:
- Don't know $K$ in advance
- Want to explore multiple granularities
- Need interpretable hierarchy
- Small to medium dataset ($n < 10,000$)
Common Pitfalls \& Solutions
Pitfall 1: Not Scaling Features
Problem: Features with large ranges dominate distance
Solution: Standardize features: $z = \frac{x - \mu}{\sigma}$
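A minimal sketch with scikit-learn's `StandardScaler` (the age/income numbers are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age vs income: the income column dwarfs age in raw Euclidean distance
X = np.array([[25.0, 30000.0], [30.0, 90000.0], [60.0, 32000.0], [65.0, 88000.0]])

Xs = StandardScaler().fit_transform(X)   # z = (x - mean) / std, per feature
print(Xs.mean(axis=0))   # each feature now has mean ~0
print(Xs.std(axis=0))    # ... and unit standard deviation
```

After scaling, both features contribute comparably to any distance computation, so clusters reflect both age and income rather than income alone.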
Pitfall 2: Using Wrong Distance Metric
Problem: Euclidean not always appropriate
Solution: Match metric to data type (cosine for text, custom for categorical)
Pitfall 3: Ignoring Outliers
Problem: Outliers can distort clusters (especially K-Means)
Solution: Detect and remove outliers, or use robust methods (DBSCAN, K-Medoids)
Pitfall 4: Blindly Trusting One Metric
Problem: Single validation metric may be misleading
Solution: Use multiple metrics + visual inspection + domain knowledge
Best Practices: Practical Tips
\begin{exampleblock}{Data Preprocessing}
- Scale features: Use StandardScaler or MinMaxScaler
- Handle missing values: Impute or remove
- Remove duplicates: Can bias cluster centers
- Consider dimensionality reduction: PCA for high-dimensional data
\end{exampleblock}
\begin{exampleblock}{Model Selection}
- Try multiple K values: Use elbow + silhouette + domain knowledge
- Run multiple times: Different initializations for K-Means
- Validate results: Use both internal and visual validation
- Compare algorithms: K-Means, GMM, Hierarchical
\end{exampleblock}
\begin{exampleblock}{Interpretation}
- Visualize clusters: 2D projections (PCA, t-SNE)
- Inspect cluster centers: What characterizes each cluster?
- Verify with domain experts: Do clusters make sense?
- Iterate: Clustering is exploratory - refine based on insights
\end{exampleblock}
Key Takeaways
Clustering Fundamentals
- Unsupervised learning: Discover structure without labels
- Distance metrics: Foundation of clustering (Euclidean, cosine, etc.)
- Two main types: Partitional vs Hierarchical
Key Algorithms
- K-Means: Fast, simple, hard assignments, spherical clusters
- GMM: Soft assignments, flexible shapes, probabilistic
- Hierarchical: No need for K, produces dendrogram, $O(n^2)$
Validation
- Internal: Silhouette, Davies-Bouldin, Calinski-Harabasz
- External: ARI, NMI (when ground truth available)
- Selection: Elbow method, silhouette analysis, domain knowledge
What We Covered
- Introduction: Motivation, applications, clustering types
- Distance Metrics: Euclidean, Manhattan, cosine similarity
- K-Means: Algorithm, initialization (K-Means++), choosing K
- GMM: Soft clustering, EM algorithm, comparison with K-Means
- Hierarchical: Agglomerative, linkage methods, dendrograms
- Validation: Internal and external metrics
- Applications: Customer segmentation, image processing, bioinformatics
- Best Practices: Algorithm selection, preprocessing, pitfalls
Next Steps
- Practice: Try clustering on real datasets
- Experiment: Compare different algorithms and parameters
- Read: Advanced topics - DBSCAN, spectral clustering, etc.
Further Reading
Textbooks
- Bishop: Pattern Recognition \& Machine Learning (Ch. 9)
- Murphy: Probabilistic ML (Ch. 21)
- Hastie et al.: Elements of Statistical Learning (Ch. 14)
Key Papers
- Arthur \& Vassilvitskii (2007): K-Means++
- Dempster et al. (1977): EM Algorithm
- Rousseeuw (1987): Silhouette Coefficient
Implementations
- scikit-learn: KMeans, GaussianMixture, AgglomerativeClustering
- scipy: Hierarchical clustering (linkage, dendrogram)
- R: kmeans, hclust, cluster package
End of Module 11
Clustering
Questions?