Clustering
CMSC 173 - Module 11
Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu
What is Clustering?
Definition
Clustering is the task of grouping a set of objects such that objects in the same group (cluster) are more similar to each other than to those in other groups.
\begin{exampleblock}{Key Characteristics}
- Unsupervised learning
- No labeled data required
- Discover hidden patterns
- Data-driven groupings
\end{exampleblock}
Goal
Find natural groupings in data without prior knowledge of group labels.
Supervised vs Unsupervised Learning
Supervised Learning
- Training data has labels
- Learn mapping: $f: X \rightarrow Y$
- Goal: Predict labels for new data
- Examples: Classification, regression
Example:
- Input: Email text
- Label: Spam/Not Spam
- Task: Learn to classify
Unsupervised Learning
- Training data has no labels
- Discover structure in $X$
- Goal: Find patterns, groups
- Examples: Clustering, dimensionality reduction
Example:
- Input: Customer purchase data
- No labels
- Task: Find customer segments
Clustering = Unsupervised
We discover groups without knowing what they should be in advance.
Real-World Applications
Business \& Marketing
- Customer segmentation: Group customers by behavior
- Market research: Identify consumer groups
- Recommendation systems: Group similar items
Biology \& Medicine
- Gene expression: Find related genes
- Disease diagnosis: Identify patient subgroups
- Protein structure: Analyze protein families
Image \& Vision
- Image segmentation: Group pixels by similarity
- Object recognition: Cluster visual features
- Color quantization: Reduce color palette
Text \& Web
- Document clustering: Group similar documents
- Topic modeling: Discover themes
- Social network analysis: Find communities
Common Theme
All involve finding structure in unlabeled data!
Types of Clustering
Partitional Clustering
- Divide data into K non-overlapping groups
- Each point belongs to exactly one cluster
- Flat structure
- Examples: K-Means, K-Medoids, GMM
Characteristics:
- Need to specify K
- Fast and scalable
- Sensitive to initialization
Hierarchical Clustering
- Build a tree of clusters (dendrogram)
- Can extract K clusters at any level
- Nested structure
- Examples: Agglomerative, Divisive
Characteristics:
- No need to specify K upfront
- More interpretable hierarchy
- Computationally expensive
This Lecture
Focus on Partitional (K-Means, GMM) and Hierarchical methods.
Distance Metrics
Common Distance Metrics
For $\mathbf{x} = (x_1, …, x_d)$ and $\mathbf{y} = (y_1, …, y_d)$:
1. Euclidean Distance (L2)
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}$$
2. Manhattan Distance (L1)
$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{d} |x_i - y_i|$$
3. Chebyshev Distance (L$\infty$)
$$d(\mathbf{x}, \mathbf{y}) = \max_{i} |x_i - y_i|$$
4. Cosine Similarity (a similarity, not a distance; $1 - \text{sim}$ is commonly used as a distance)
$$\text{sim}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \|\mathbf{y}\|}$$
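The four formulas above translate directly into NumPy. This is a minimal sketch (the function names are ours, not from any library):

```python
import numpy as np

def euclidean(x, y):
    # L2 distance: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # L1 distance: sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

def chebyshev(x, y):
    # L-infinity distance: largest single coordinate difference
    return np.max(np.abs(x - y))

def cosine_similarity(x, y):
    # Cosine of the angle between x and y (a similarity, not a distance)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
print(euclidean(x, y))   # 5.0
print(manhattan(x, y))   # 7.0
print(chebyshev(x, y))   # 4.0
```

Note how the three distances disagree on the same pair of points; the choice of metric changes which points count as "close".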
Properties of Distance Metrics
A valid distance metric must satisfy:
Metric Axioms
For all points $\mathbf{x}, \mathbf{y}, \mathbf{z}$:
- Non-negativity: $d(\mathbf{x}, \mathbf{y}) \geq 0$
- Identity: $d(\mathbf{x}, \mathbf{y}) = 0 \iff \mathbf{x} = \mathbf{y}$
- Symmetry: $d(\mathbf{x}, \mathbf{y}) = d(\mathbf{y}, \mathbf{x})$
- Triangle inequality: $d(\mathbf{x}, \mathbf{z}) \leq d(\mathbf{x}, \mathbf{y}) + d(\mathbf{y}, \mathbf{z})$
\begin{exampleblock}{Choosing the Right Metric}
- Euclidean: Most common, assumes all features equally important
- Manhattan: Less sensitive to outliers, good for high dimensions
- Cosine: Good for text/document clustering (direction matters)
- Custom: Domain-specific distances (e.g., edit distance for strings)
\end{exampleblock}
K-Means Algorithm: Overview
Goal
Partition $n$ points into $K$ clusters
Key Idea
- Each cluster has centroid (mean)
- Assign points to nearest centroid
- Update centroids iteratively
- Minimize within-cluster variance
\begin{exampleblock}{Input \& Output}
Input:
- Dataset $X = \{\mathbf{x}_1, …, \mathbf{x}_n\}$
- Number of clusters $K$
Output:
- Assignments $\{C_1, …, C_K\}$
- Centroids $\{\boldsymbol{\mu}_1, …, \boldsymbol{\mu}_K\}$
\end{exampleblock}
K-Means Objective Function
Minimize within-cluster sum of squares (WCSS):
Objective
$$J = \sum_{k=1}^{K} \sum_{\mathbf{x}_i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$$
where:
- $C_k$ = set of points in cluster $k$
- $\boldsymbol{\mu}_k$ = centroid of cluster $k$
- $\|\cdot\|$ = Euclidean distance
\begin{exampleblock}{Interpretation}
- Minimize total squared distance from points to their centroids
- Encourages compact, spherical clusters
- Also called inertia or distortion
- NP-hard to minimize globally, but heuristics work well
\end{exampleblock}
Note
K-Means finds local minimum, not necessarily global!
K-Means Algorithm (Lloyd's Algorithm)
\begin{algorithm}[H]
\caption{K-Means Clustering}
\begin{algorithmic}[1]
\REQUIRE Dataset $X = \{\mathbf{x}_1, …, \mathbf{x}_n\}$, number of clusters $K$
\ENSURE Cluster assignments and centroids
\STATE Initialize $K$ centroids $\{\boldsymbol{\mu}_1, …, \boldsymbol{\mu}_K\}$ randomly
\REPEAT
\STATE Assignment Step:
\FOR{each data point $\mathbf{x}_i$}
\STATE Assign $\mathbf{x}_i$ to cluster $k^* = \arg\min_k \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2$
\ENDFOR
\STATE Update Step:
\FOR{each cluster $k = 1, …, K$}
\STATE Update centroid: $\boldsymbol{\mu}_k = \frac{1}{|C_k|} \sum_{\mathbf{x}_i \in C_k} \mathbf{x}_i$
\ENDFOR
\UNTIL{centroids do not change (or max iterations reached)}
\end{algorithmic}
\end{algorithm}
Key Properties
Convergence: Guaranteed (objective always decreases).
Complexity: $O(nKdT)$ where $T$ = iterations
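Lloyd's algorithm translates almost line for line into NumPy. This is an illustrative sketch with random initialization, not a production implementation (the `kmeans` name and its arguments are ours):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal sketch of Lloyd's algorithm (random init; K-Means++ is preferred in practice)."""
    rng = np.random.default_rng(seed)
    # Initialize: pick K distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid (squared Euclidean)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: centroid = mean of its assigned points
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:           # keep the old centroid if a cluster empties
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):   # converged: centroids stopped moving
            return labels, centroids
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [2, 1], [4, 3], [5, 4]], dtype=float)
labels, centroids = kmeans(X, K=2)
```

On the four-point dataset used in the next slides, every initialization converges to centroids $(1.5, 1.5)$ and $(4.5, 3.5)$.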
K-Means Example: Dataset
Let's apply K-Means to a toy dataset with $K=2$
Dataset (4 points, 2D)
\begin{tabular}{ccc}
\toprule
Point & $x_1$ & $x_2$ \\
\midrule
A & 1 & 2 \\
B & 2 & 1 \\
C & 4 & 3 \\
D & 5 & 4 \\
\bottomrule
\end{tabular}
\begin{exampleblock}{Goal}
Cluster into $K = 2$ groups
\end{exampleblock}
\begin{tikzpicture}[scale=0.9]
\draw[-stealth] (0,0) -- (7,0) node[right] {$x_1$};
\draw[-stealth] (0,0) -- (0,6) node[above] {$x_2$};
% Grid
\foreach \x in {1,2,...,6}
\draw[gray!30] (\x,0) -- (\x,6);
\foreach \y in {1,2,...,5}
\draw[gray!30] (0,\y) -- (7,\y);
% Axis labels
\foreach \x in {1,2,...,6}
\node at (\x,-0.3) {\tiny \x};
\foreach \y in {1,2,...,5}
\node at (-0.3,\y) {\tiny \y};
% Data points
\fill[blue] (1,2) circle (3pt) node[above left] {A};
\fill[blue] (2,1) circle (3pt) node[below right] {B};
\fill[blue] (4,3) circle (3pt) node[above left] {C};
\fill[blue] (5,4) circle (3pt) node[above right] {D};
\end{tikzpicture}
K-Means Example: Step 1 - Initialization
Step 1: Randomly initialize 2 centroids
Random Initialization
Let's choose first two points as centroids:
$\boldsymbol{\mu}_1 = \text{Point A} = (1, 2)$
$\boldsymbol{\mu}_2 = \text{Point D} = (5, 4)$
Note
In practice, use K-Means++ initialization!
\begin{tikzpicture}[scale=0.9]
\draw[-stealth] (0,0) -- (7,0) node[right] {$x_1$};
\draw[-stealth] (0,0) -- (0,6) node[above] {$x_2$};
% Grid
\foreach \x in {1,2,...,6}
\draw[gray!30] (\x,0) -- (\x,6);
\foreach \y in {1,2,...,5}
\draw[gray!30] (0,\y) -- (7,\y);
% Data points
\fill[blue] (1,2) circle (3pt) node[above left] {A};
\fill[blue] (2,1) circle (3pt) node[below right] {B};
\fill[blue] (4,3) circle (3pt) node[above left] {C};
\fill[blue] (5,4) circle (3pt) node[above right] {D};
% Centroids
\draw[red, line width=2pt] (1,2) -- (1.3,2) -- (1.15,2.3) -- cycle;
\node[red] at (0.5,2.5) {$\boldsymbol{\mu}_1$};
\draw[green!60!black, line width=2pt] (5,4) -- (5.3,4) -- (5.15,4.3) -- cycle;
\node[green!60!black] at (5.5,4.5) {$\boldsymbol{\mu}_2$};
\end{tikzpicture}
K-Means Example: Step 2 - Assignment
Step 2: Assign each point to nearest centroid
Distance Calculations
\small
For each point, compute distance to both centroids:
Point A (1,2):
- To $\boldsymbol{\mu}_1$: $\sqrt{(1-1)^2 + (2-2)^2} = \mathbf{0}$
- To $\boldsymbol{\mu}_2$: $\sqrt{(1-5)^2 + (2-4)^2} = 4.47$
- $\Rightarrow$ Assign to Cluster 1
Point B (2,1):
- To $\boldsymbol{\mu}_1$: $\sqrt{(2-1)^2 + (1-2)^2} = \mathbf{1.41}$
- To $\boldsymbol{\mu}_2$: $\sqrt{(2-5)^2 + (1-4)^2} = 4.24$
- $\Rightarrow$ Assign to Cluster 1
Point C (4,3):
- To $\boldsymbol{\mu}_1$: $\sqrt{(4-1)^2 + (3-2)^2} = 3.16$
- To $\boldsymbol{\mu}_2$: $\sqrt{(4-5)^2 + (3-4)^2} = \mathbf{1.41}$
- $\Rightarrow$ Assign to Cluster 2
Point D (5,4):
- To $\boldsymbol{\mu}_1$: $\sqrt{(5-1)^2 + (4-2)^2} = 4.47$
- To $\boldsymbol{\mu}_2$: $\sqrt{(5-5)^2 + (4-4)^2} = \mathbf{0}$
- $\Rightarrow$ Assign to Cluster 2
\begin{tikzpicture}[scale=0.9]
\draw[-stealth] (0,0) -- (7,0) node[right] {$x_1$};
\draw[-stealth] (0,0) -- (0,6) node[above] {$x_2$};
% Grid
\foreach \x in {1,2,...,6}
\draw[gray!30] (\x,0) -- (\x,6);
\foreach \y in {1,2,...,5}
\draw[gray!30] (0,\y) -- (7,\y);
% Data points colored by cluster
\fill[red!70] (1,2) circle (3pt) node[above left] {A};
\fill[red!70] (2,1) circle (3pt) node[below right] {B};
\fill[green!60!black] (4,3) circle (3pt) node[above left] {C};
\fill[green!60!black] (5,4) circle (3pt) node[above right] {D};
% Centroids
\draw[red, line width=2pt] (1,2) -- (1.3,2) -- (1.15,2.3) -- cycle;
\node[red] at (0.5,2.5) {$\boldsymbol{\mu}_1$};
\draw[green!60!black, line width=2pt] (5,4) -- (5.3,4) -- (5.15,4.3) -- cycle;
\node[green!60!black] at (5.5,4.5) {$\boldsymbol{\mu}_2$};
\end{tikzpicture}
\begin{exampleblock}{Clusters}
C1: \{A, B\} \\
C2: \{C, D\}
\end{exampleblock}
K-Means Example: Step 3 - Update Centroids
Step 3: Recompute centroids as mean of assigned points
New Centroids
Cluster 1 contains: A(1,2), B(2,1)
$$\boldsymbol{\mu}_1 = \frac{1}{2}\left[(1,2) + (2,1)\right]$$
$$= \left(\frac{1+2}{2}, \frac{2+1}{2}\right) = \mathbf{(1.5, 1.5)}$$
Cluster 2 contains: C(4,3), D(5,4)
$$\boldsymbol{\mu}_2 = \frac{1}{2}\left[(4,3) + (5,4)\right]$$
$$= \left(\frac{4+5}{2}, \frac{3+4}{2}\right) = \mathbf{(4.5, 3.5)}$$
\begin{tikzpicture}[scale=0.9]
\draw[-stealth] (0,0) -- (7,0) node[right] {$x_1$};
\draw[-stealth] (0,0) -- (0,6) node[above] {$x_2$};
% Grid
\foreach \x in {1,2,...,6}
\draw[gray!30] (\x,0) -- (\x,6);
\foreach \y in {1,2,...,5}
\draw[gray!30] (0,\y) -- (7,\y);
% Data points
\fill[red!70] (1,2) circle (3pt) node[above left] {A};
\fill[red!70] (2,1) circle (3pt) node[below right] {B};
\fill[green!60!black] (4,3) circle (3pt) node[above left] {C};
\fill[green!60!black] (5,4) circle (3pt) node[above right] {D};
% OLD centroids (dashed)
\draw[red, line width=1pt, dashed] (1,2) -- (1.3,2) -- (1.15,2.3) -- cycle;
\draw[green!60!black, line width=1pt, dashed] (5,4) -- (5.3,4) -- (5.15,4.3) -- cycle;
% NEW centroids (solid)
\draw[red, line width=2pt] (1.5,1.5) -- (1.8,1.5) -- (1.65,1.8) -- cycle;
\node[red] at (1.0,1.0) {$\boldsymbol{\mu}_1$};
\draw[green!60!black, line width=2pt] (4.5,3.5) -- (4.8,3.5) -- (4.65,3.8) -- cycle;
\node[green!60!black] at (5.0,3.0) {$\boldsymbol{\mu}_2$};
\end{tikzpicture}
Centroids moved!
Old (dashed) $\rightarrow$ New (solid)
K-Means Example: Iteration 2 - Assignment
Iteration 2: Re-assign points to NEW centroids
Distance Calculations
\small
Using new centroids $\boldsymbol{\mu}_1 = (1.5, 1.5)$, $\boldsymbol{\mu}_2 = (4.5, 3.5)$:
Point A (1,2):
- To $\boldsymbol{\mu}_1$: $\sqrt{(1-1.5)^2 + (2-1.5)^2} = \mathbf{0.71}$
- To $\boldsymbol{\mu}_2$: $\sqrt{(1-4.5)^2 + (2-3.5)^2} = 3.81$
- $\Rightarrow$ Cluster 1 (no change)
Point B (2,1):
- To $\boldsymbol{\mu}_1$: $\sqrt{(2-1.5)^2 + (1-1.5)^2} = \mathbf{0.71}$
- To $\boldsymbol{\mu}_2$: $\sqrt{(2-4.5)^2 + (1-3.5)^2} = 3.54$
- $\Rightarrow$ Cluster 1 (no change)
Point C (4,3):
- To $\boldsymbol{\mu}_1$: $\sqrt{(4-1.5)^2 + (3-1.5)^2} = 2.92$
- To $\boldsymbol{\mu}_2$: $\sqrt{(4-4.5)^2 + (3-3.5)^2} = \mathbf{0.71}$
- $\Rightarrow$ Cluster 2 (no change)
Point D (5,4):
- To $\boldsymbol{\mu}_1$: $\sqrt{(5-1.5)^2 + (4-1.5)^2} = 4.30$
- To $\boldsymbol{\mu}_2$: $\sqrt{(5-4.5)^2 + (4-3.5)^2} = \mathbf{0.71}$
- $\Rightarrow$ Cluster 2 (no change)
\begin{tikzpicture}[scale=0.9]
\draw[-stealth] (0,0) -- (7,0) node[right] {$x_1$};
\draw[-stealth] (0,0) -- (0,6) node[above] {$x_2$};
% Grid
\foreach \x in {1,2,...,6}
\draw[gray!30] (\x,0) -- (\x,6);
\foreach \y in {1,2,...,5}
\draw[gray!30] (0,\y) -- (7,\y);
% Data points
\fill[red!70] (1,2) circle (3pt) node[above left] {A};
\fill[red!70] (2,1) circle (3pt) node[below right] {B};
\fill[green!60!black] (4,3) circle (3pt) node[above left] {C};
\fill[green!60!black] (5,4) circle (3pt) node[above right] {D};
% Centroids
\draw[red, line width=2pt] (1.5,1.5) -- (1.8,1.5) -- (1.65,1.8) -- cycle;
\node[red] at (1.0,1.0) {$\boldsymbol{\mu}_1$};
\draw[green!60!black, line width=2pt] (4.5,3.5) -- (4.8,3.5) -- (4.65,3.8) -- cycle;
\node[green!60!black] at (5.0,3.0) {$\boldsymbol{\mu}_2$};
\end{tikzpicture}
\begin{exampleblock}{Result}
No changes!\\
Assignments: Same as before
\end{exampleblock}
K-Means Example: Convergence
Step 4: Check convergence
Convergence Achieved!
Since no points changed clusters, the algorithm has converged.
Final Clusters:
- Cluster 1: \{A(1,2), B(2,1)\}
- Cluster 2: \{C(4,3), D(5,4)\}
Final Centroids:
- $\boldsymbol{\mu}_1 = (1.5, 1.5)$
- $\boldsymbol{\mu}_2 = (4.5, 3.5)$
\begin{tikzpicture}[scale=0.9]
\draw[-stealth] (0,0) -- (7,0) node[right] {$x_1$};
\draw[-stealth] (0,0) -- (0,6) node[above] {$x_2$};
% Grid
\foreach \x in {1,2,...,6}
\draw[gray!30] (\x,0) -- (\x,6);
\foreach \y in {1,2,...,5}
\draw[gray!30] (0,\y) -- (7,\y);
% Voronoi boundaries (perpendicular bisector)
\draw[blue, dashed, thick] (3,0) -- (3,6);
% Data points
\fill[red!70] (1,2) circle (3pt) node[above left] {A};
\fill[red!70] (2,1) circle (3pt) node[below right] {B};
\fill[green!60!black] (4,3) circle (3pt) node[above left] {C};
\fill[green!60!black] (5,4) circle (3pt) node[above right] {D};
% Centroids
\draw[red, line width=2pt] (1.5,1.5) -- (1.8,1.5) -- (1.65,1.8) -- cycle;
\node[red] at (1.0,1.0) {$\boldsymbol{\mu}_1$};
\draw[green!60!black, line width=2pt] (4.5,3.5) -- (4.8,3.5) -- (4.65,3.8) -- cycle;
\node[green!60!black] at (5.0,3.0) {$\boldsymbol{\mu}_2$};
% Cluster regions
\node[red!70] at (1.5,4.5) {C1};
\node[green!60!black] at (4.5,1.5) {C2};
\end{tikzpicture}
Key Insight
K-Means partitions space with linear decision boundaries (Voronoi cells)
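As a sanity check on the worked example, we can recompute the objective $J$ for the converged clustering. Each of the four points lies at squared distance $0.5$ from its centroid, so $J = 2$:

```python
import numpy as np

# Final clustering from the worked example
points = {'A': (1, 2), 'B': (2, 1), 'C': (4, 3), 'D': (5, 4)}
clusters = {1: ['A', 'B'], 2: ['C', 'D']}

J = 0.0
for members in clusters.values():
    pts = np.array([points[m] for m in members], dtype=float)
    mu = pts.mean(axis=0)           # (1.5, 1.5) and (4.5, 3.5), matching the slides
    J += ((pts - mu) ** 2).sum()    # squared Euclidean distances to the centroid

print(J)  # 2.0
```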
K-Means: Voronoi Tessellation
Geometric Interpretation
K-Means creates a Voronoi diagram:
- Space partitioned into regions
- Points closest to one centroid
- Decision boundaries are linear
- Forms convex, polygonal cells
\begin{exampleblock}{Implications}
- Works well for spherical clusters
- Struggles with elongated shapes
- Assumes equal variance
- Sensitive to outliers
\end{exampleblock}
K-Means Initialization: The Challenge
Problem
Sensitive to initial centroids
Random Init Issues
- Poor local minima
- High variance
- Multiple runs needed
\begin{exampleblock}{Practice}
Run 10-100 times, keep best WCSS
\end{exampleblock}
K-Means++ Initialization
Smarter initialization strategy (Arthur \& Vassilvitskii, 2007)
K-Means++ Algorithm
- Choose first centroid $\boldsymbol{\mu}_1$ uniformly at random from data points
- For $k = 2, …, K$:
- For each point $\mathbf{x}_i$, compute $D(\mathbf{x}_i)$ = distance to nearest centroid
- Choose next centroid $\boldsymbol{\mu}_k$ with probability $\propto D(\mathbf{x}_i)^2$
- Run standard K-Means with these initial centroids
\begin{exampleblock}{Advantages}
- Spreads out initial centroids
- Provably better: expected $O(\log K)$-approximation of the optimal WCSS
- Lower variance, more consistent results
- Standard in scikit-learn and most libraries
\end{exampleblock}
Recommendation
Always use K-Means++ unless you have domain knowledge for better initialization.
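The seeding procedure above fits in a few lines of NumPy. This is an illustrative sketch (the `kmeans_pp_init` name is ours; library implementations add refinements such as sampling several candidates per step):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """Sketch of K-Means++ seeding: first centroid uniform, later ones D(x)^2-weighted."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]       # first centroid: uniform at random
    for _ in range(1, K):
        # D(x)^2 = squared distance from each point to its nearest chosen centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
print(kmeans_pp_init(X, K=2))
```

Because an already-chosen centroid has $D(\mathbf{x})^2 = 0$, it can never be picked twice, and far-away points are strongly favored.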
Choosing K: The Elbow Method
Elbow Method
- Run K-Means for $K = 1, 2, …, K_{\max}$
- Plot WCSS vs $K$
- Look for the "elbow" point
- Choose $K$ with diminishing returns
\begin{exampleblock}{Interpretation}
- WCSS decreases as $K$ increases
- Elbow = fit vs complexity trade-off
- Not always clear/unique
\end{exampleblock}
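With scikit-learn (assuming it is available), the elbow curve is a short loop over `KMeans`, whose `inertia_` attribute is exactly the WCSS objective $J$:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs, so the elbow should appear at K = 2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

wcss = []
for K in range(1, 6):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)   # inertia_ = within-cluster sum of squares (WCSS)

# WCSS keeps decreasing with K; the sharp drop from K=1 to K=2 is the "elbow"
print([round(w, 1) for w in wcss])
```

Plotting `wcss` against `K` (e.g. with matplotlib) makes the elbow visible at a glance.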
Choosing K: Silhouette Analysis
Silhouette Coefficient
For each point $\mathbf{x}_i$:
1. $a_i$ = mean distance to the other points in its own cluster
2. $b_i$ = mean distance to the points in the nearest other cluster
3. Silhouette:
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$
Range: $s_i \in [-1, 1]$
- $s_i \approx 1$: Well clustered
- $s_i \approx 0$: On border
- $s_i < 0$: Wrong cluster
Usage
Choose $K$ maximizing avg score
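A sketch of silhouette-based selection using scikit-learn's `silhouette_score` (assuming scikit-learn is available; synthetic two-blob data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

scores = {}
for K in range(2, 6):   # silhouette is undefined for K = 1
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    scores[K] = silhouette_score(X, labels)   # average s_i over all points

best_K = max(scores, key=scores.get)
print(best_K)   # 2 for this two-blob dataset
```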
Limitations of K-Means
Key Issues
- Hard assignments: Binary membership
- Spherical clusters: Equal variance assumed
- No uncertainty: Can't express doubt
- Outliers: Forced into clusters
\begin{exampleblock}{When K-Means Struggles}
- Elongated/elliptical clusters
- Different sizes/densities
- Overlapping clusters
- Need probability of membership
\end{exampleblock}
\begin{tipblock}{Solution}
Gaussian Mixture Models (GMM) provide soft, probabilistic clustering
\end{tipblock}
Gaussian Mixture Models: Formulation
Model data as generated from mixture of $K$ Gaussian distributions
Generative Model
$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
where:
- $\pi_k$ = mixing coefficient (prior probability of cluster $k$), $\sum_k \pi_k = 1$
- $\boldsymbol{\mu}_k$ = mean of Gaussian $k$
- $\boldsymbol{\Sigma}_k$ = covariance matrix of Gaussian $k$
- $\mathcal{N}(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma})$ = multivariate Gaussian
\begin{exampleblock}{Soft Assignment}
Probability that point $\mathbf{x}_i$ belongs to cluster $k$:
$$\gamma_{ik} = p(z_i = k | \mathbf{x}_i) = \frac{\pi_k \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
\end{exampleblock}
GMM: Expectation-Maximization Algorithm
Learn parameters $\{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}$ using EM
EM Algorithm
Initialize: Random $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k = I$, $\pi_k = 1/K$
Repeat until convergence:
E-step: Compute responsibilities (soft assignments)
$$\gamma_{ik} = \frac{\pi_k \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(\mathbf{x}_i | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
M-step: Update parameters
$$\pi_k = \frac{1}{n}\sum_{i=1}^{n} \gamma_{ik}, \quad
\boldsymbol{\mu}_k = \frac{\sum_i \gamma_{ik} \mathbf{x}_i}{\sum_i \gamma_{ik}}, \quad
\boldsymbol{\Sigma}_k = \frac{\sum_i \gamma_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^T}{\sum_i \gamma_{ik}}$$
Properties
Monotonically increases likelihood. Guaranteed to converge to local maximum.
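In practice GMMs are rarely implemented by hand; scikit-learn's `GaussianMixture` runs EM internally and exposes the responsibilities $\gamma_{ik}$ via `predict_proba`. A sketch on synthetic data (assuming scikit-learn is available):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic components with different spreads
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 0.8, (100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)

gamma = gmm.predict_proba(X)   # responsibilities gamma_ik; each row sums to 1
hard = gmm.predict(X)          # hard labels = argmax_k of the responsibilities
print(np.sort(gmm.means_[:, 0]))   # component means: near 0 and near 4
```

Points deep inside a component get responsibilities near 0 or 1; points between the components get genuinely fractional values, which is exactly the uncertainty K-Means cannot express.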
GMM vs K-Means Comparison
K-Means
Pros:
- Simple, fast, scalable
- Easy to implement
- Works well for spherical clusters
- Fewer parameters to tune
Cons:
- Hard assignments only
- Assumes spherical clusters
- Sensitive to initialization
- No measure of uncertainty
GMM
Pros:
- Soft probabilistic assignments
- Flexible cluster shapes (elliptical)
- Measures uncertainty
- Principled statistical model
Cons:
- Slower than K-Means
- More parameters ($\boldsymbol{\Sigma}_k$)
- Can overfit with full covariance
- Also sensitive to initialization
Note
K-Means is the limiting case of GMM with $\boldsymbol{\Sigma}_k = \sigma^2 I$ as $\sigma^2 \to 0$: the soft responsibilities collapse into hard assignments!
Hierarchical Clustering: Overview
Agglomerative (Bottom-up)
- Start: Each point = cluster
- Merge closest clusters
- End: One cluster
Divisive (Top-down)
- Start: All in one cluster
- Split clusters iteratively
- End: Each point separate
\begin{exampleblock}{Key Advantage}
No need to specify $K$ upfront! Cut dendrogram at any height to get desired clusters.
\end{exampleblock}
Agglomerative Clustering Algorithm
Most common hierarchical method
Algorithm
- Initialize: Each of $n$ points is own cluster
- Compute: Distance matrix between all clusters
- Repeat until one cluster remains:
- Find pair of closest clusters
- Merge them into single cluster
- Update distance matrix
- Output: Dendrogram showing merge history
\begin{exampleblock}{Complexity}
Time: $O(n^3)$ naively; $O(n^2 \log n)$ with efficient data structures (priority queues)
Space: $O(n^2)$ for distance matrix
\end{exampleblock}
Challenge
How do we measure distance between clusters (not just points)?
Hierarchical Clustering Example: Dataset
Let's apply Agglomerative Clustering to 5 points (Single Linkage)
Dataset (5 points, 1D)
\begin{tabular}{cc}
\toprule
Point & Position \\
\midrule
A & 2 \\
B & 4 \\
C & 5 \\
D & 10 \\
E & 12 \\
\bottomrule
\end{tabular}
\begin{exampleblock}{Goal}
Build dendrogram using
Single Linkage
\end{exampleblock}
\begin{tikzpicture}[scale=1.1]
% Number line
\draw[-stealth] (0,0) -- (14,0) node[right] {$x$};
% Tick marks
\draw (0,0.1) -- (0,-0.1) node[below] {\tiny 0};
\draw (2,0.1) -- (2,-0.1) node[below] {\tiny 2};
\draw (4,0.1) -- (4,-0.1) node[below] {\tiny 4};
\draw (6,0.1) -- (6,-0.1) node[below] {\tiny 6};
\draw (8,0.1) -- (8,-0.1) node[below] {\tiny 8};
\draw (10,0.1) -- (10,-0.1) node[below] {\tiny 10};
\draw (12,0.1) -- (12,-0.1) node[below] {\tiny 12};
% Data points
\fill[blue] (2,0) circle (2.5pt) node[above=3pt] {A};
\fill[blue] (4,0) circle (2.5pt) node[above=3pt] {B};
\fill[blue] (5,0) circle (2.5pt) node[above=3pt] {C};
\fill[blue] (10,0) circle (2.5pt) node[above=3pt] {D};
\fill[blue] (12,0) circle (2.5pt) node[above=3pt] {E};
\end{tikzpicture}
Initial State
Each point = own cluster\\
5 clusters: \{A\}, \{B\}, \{C\}, \{D\}, \{E\}
Hierarchical Example: Step 1 - Distance Matrix
Step 1: Compute pairwise distance matrix
Distance Matrix
\small
\begin{tabular}{c|ccccc}
& A & B & C & D & E \\
\hline
A & 0 & 2 & 3 & 8 & 10 \\
B & 2 & 0 & 1 & 6 & 8 \\
C & 3 & 1 & 0 & 5 & 7 \\
D & 8 & 6 & 5 & 0 & 2 \\
E & 10 & 8 & 7 & 2 & 0 \\
\end{tabular}
\begin{exampleblock}{Find Minimum}
Smallest distance = 1, between B and C
$\Rightarrow$ Merge \{B\} and \{C\}
\end{exampleblock}
Dendrogram (Step 1)
\begin{tikzpicture}[scale=0.9]
% Leaves
\node (A) at (0,0) {A};
\node (B) at (2,0) {B};
\node (C) at (3,0) {C};
\node (D) at (5,0) {D};
\node (E) at (6,0) {E};
% First merge: B-C at height 1
\draw (B) -- (2.5,1);
\draw (C) -- (2.5,1);
\draw[blue, line width=1.5pt] (2.5,1) -- (2.5,1.5) node[right, black] {\tiny h=1};
% Y-axis
\draw[-stealth] (-0.5,0) -- (-0.5,5) node[above] {\tiny height};
\draw (-0.6,0) -- (-0.4,0) node[left] {\tiny 0};
\draw (-0.6,1) -- (-0.4,1) node[left] {\tiny 1};
\draw (-0.6,2) -- (-0.4,2) node[left] {\tiny 2};
\draw (-0.6,3) -- (-0.4,3) node[left] {\tiny 3};
\draw (-0.6,4) -- (-0.4,4) node[left] {\tiny 4};
\draw (-0.6,5) -- (-0.4,5) node[left] {\tiny 5};
\end{tikzpicture}
Current Clusters
\{A\}, \{B, C\}, \{D\}, \{E\}\\
4 clusters remain
Hierarchical Example: Step 2 - Update Matrix
Step 2: Update distance matrix using Single Linkage
Single Linkage Rule
$$d(\{B,C\}, X) = \min(d(B,X), d(C,X))$$
New distances to cluster \{B,C\}:
- $d(\{B,C\}, A) = \min(2, 3) = \mathbf{2}$
- $d(\{B,C\}, D) = \min(6, 5) = \mathbf{5}$
- $d(\{B,C\}, E) = \min(8, 7) = \mathbf{7}$
Updated Matrix
\small
\begin{tabular}{c|cccc}
& A & \{B,C\} & D & E \\
\hline
A & 0 & 2 & 8 & 10 \\
\{B,C\} & 2 & 0 & 5 & 7 \\
D & 8 & 5 & 0 & 2 \\
E & 10 & 7 & 2 & 0 \\
\end{tabular}
Dendrogram (Step 2)
\begin{tikzpicture}[scale=0.9]
% Leaves
\node (A) at (0,0) {A};
\node (B) at (2,0) {B};
\node (C) at (3,0) {C};
\node (D) at (5,0) {D};
\node (E) at (6,0) {E};
% First merge: B-C at height 1
\draw (B) -- (2.5,1);
\draw (C) -- (2.5,1);
\draw[blue, line width=1.5pt] (2.5,1) -- (2.5,1.5);
% Second merge: D-E at height 2
\draw (D) -- (5.5,2);
\draw (E) -- (5.5,2);
\draw[green!60!black, line width=1.5pt] (5.5,2) -- (5.5,2.5) node[right, black] {\tiny h=2};
% Y-axis
\draw[-stealth] (-0.5,0) -- (-0.5,5) node[above] {\tiny height};
\draw (-0.6,0) -- (-0.4,0) node[left] {\tiny 0};
\draw (-0.6,1) -- (-0.4,1) node[left] {\tiny 1};
\draw (-0.6,2) -- (-0.4,2) node[left] {\tiny 2};
\draw (-0.6,3) -- (-0.4,3) node[left] {\tiny 3};
\draw (-0.6,4) -- (-0.4,4) node[left] {\tiny 4};
\draw (-0.6,5) -- (-0.4,5) node[left] {\tiny 5};
\end{tikzpicture}
\begin{exampleblock}{Next Merge}
Min distance = 2\\
Merge \{D\} and \{E\} at height 2
\end{exampleblock}
Hierarchical Example: Steps 3-4
Continue merging until one cluster remains
Step 3
Clusters: \{A\}, \{B,C\}, \{D,E\}
Updated distances:
- $d(A, \{B,C\}) = 2$
- $d(A, \{D,E\}) = 8$
- $d(\{B,C\}, \{D,E\}) = 5$
Min = 2: Merge A with \{B,C\}\\
New cluster: \{A, B, C\} at height 2
Step 4 (Final)
Clusters: \{A,B,C\}, \{D,E\}
Distance: $d(\{A,B,C\}, \{D,E\}) = 5$
Final merge at height 5\\
One cluster: \{A, B, C, D, E\}
Complete Dendrogram
\begin{tikzpicture}[scale=0.85]
% Leaves
\node (A) at (0,0) {A};
\node (B) at (2,0) {B};
\node (C) at (3,0) {C};
\node (D) at (5,0) {D};
\node (E) at (6,0) {E};
% Merge B-C at h=1
\draw (B) -- (2.5,1);
\draw (C) -- (2.5,1);
\draw[blue, line width=1.2pt] (2.5,1) -- (2.5,2);
% Merge D-E at h=2
\draw (D) -- (5.5,2);
\draw (E) -- (5.5,2);
\draw[green!60!black, line width=1.2pt] (5.5,2) -- (5.5,5);
% Merge A with {B,C} at h=2
\draw (A) -- (0,2) -- (1.25,2);
\draw (2.5,2) -- (1.25,2);
\draw[red, line width=1.2pt] (1.25,2) -- (1.25,5);
% Final merge at h=5
\draw[purple, line width=1.5pt] (1.25,5) -- (3.375,5) -- (5.5,5);
% Height annotations
\node[right] at (3.375,5) {\tiny h=5};
\node[left] at (1.25,2) {\tiny h=2};
\node[left] at (2.5,1) {\tiny h=1};
\node[right] at (5.5,2) {\tiny h=2};
% Y-axis
\draw[-stealth] (-0.8,0) -- (-0.8,5.5) node[above] {\tiny height};
\draw (-0.9,0) -- (-0.7,0) node[left] {\tiny 0};
\draw (-0.9,1) -- (-0.7,1) node[left] {\tiny 1};
\draw (-0.9,2) -- (-0.7,2) node[left] {\tiny 2};
\draw (-0.9,3) -- (-0.7,3) node[left] {\tiny 3};
\draw (-0.9,4) -- (-0.7,4) node[left] {\tiny 4};
\draw (-0.9,5) -- (-0.7,5) node[left] {\tiny 5};
\end{tikzpicture}
Interpretation
Cut at different heights to get different K clusters
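The whole worked example can be verified with SciPy's `linkage`, whose third column records the merge heights (1, 2, 2, 5 here, matching the dendrogram):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# The five 1D points A..E from the worked example
X = np.array([[2.0], [4.0], [5.0], [10.0], [12.0]])

# Each row of Z records one merge: (cluster i, cluster j, merge height, new cluster size)
Z = linkage(X, method='single')
print(Z[:, 2])   # merge heights: [1. 2. 2. 5.]
```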
Hierarchical Example: Cutting the Dendrogram
Extract different numbers of clusters by cutting at different heights
\begin{tikzpicture}[scale=1.0]
% Dendrogram
\node (A) at (0,0) {A};
\node (B) at (2,0) {B};
\node (C) at (3,0) {C};
\node (D) at (5,0) {D};
\node (E) at (6,0) {E};
\draw (B) -- (2.5,1);
\draw (C) -- (2.5,1);
\draw (2.5,1) -- (2.5,2);
\draw (D) -- (5.5,2);
\draw (E) -- (5.5,2);
\draw (5.5,2) -- (5.5,5);
\draw (A) -- (0,2) -- (1.25,2);
\draw (2.5,2) -- (1.25,2);
\draw (1.25,2) -- (1.25,5);
\draw (1.25,5) -- (3.375,5) -- (5.5,5);
% Cut lines
\draw[red, dashed, line width=2pt] (-0.5,4.5) -- (6.5,4.5) node[right] {K=2: \{A,B,C\}, \{D,E\}};
\draw[orange, dashed, line width=2pt] (-0.5,1.5) -- (6.5,1.5) node[right] {K=4: \{A\}, \{B,C\}, \{D\}, \{E\}};
% Note: the A-{B,C} and D-E merges tie at height 2, so no horizontal cut of this dendrogram yields exactly K=3
% Y-axis
\draw[-stealth] (-0.8,0) -- (-0.8,5.5) node[above] {height};
\draw (-0.9,0) -- (-0.7,0) node[left] {\tiny 0};
\draw (-0.9,1) -- (-0.7,1) node[left] {\tiny 1};
\draw (-0.9,2) -- (-0.7,2) node[left] {\tiny 2};
\draw (-0.9,3) -- (-0.7,3) node[left] {\tiny 3};
\draw (-0.9,4) -- (-0.7,4) node[left] {\tiny 4};
\draw (-0.9,5) -- (-0.7,5) node[left] {\tiny 5};
\end{tikzpicture}
Key Advantage of Hierarchical Clustering
No need to pre-specify K! The dendrogram shows the full hierarchy.
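Cutting is what SciPy's `fcluster` does: with `criterion='distance'`, every merge above the threshold height is undone. A sketch on the same five points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[2.0], [4.0], [5.0], [10.0], [12.0]])   # points A..E
Z = linkage(X, method='single')

# Cut at height 4: the h=5 merge is undone, leaving {A,B,C} and {D,E}
two = fcluster(Z, t=4, criterion='distance')
# Cut at height 1.5: only the h=1 merge survives, leaving {A}, {B,C}, {D}, {E}
four = fcluster(Z, t=1.5, criterion='distance')
print(two, four)
```

Both cuts reuse the same `Z`; the tree is built once and can be cut at any height afterwards.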
Linkage Criteria
Different ways to measure inter-cluster distance
Common Linkage Methods
For clusters $C_i$ and $C_j$:
1. Single Linkage (MIN):
$$d(C_i, C_j) = \min_{\mathbf{x} \in C_i, \mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})$$
2. Complete Linkage (MAX):
$$d(C_i, C_j) = \max_{\mathbf{x} \in C_i, \mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})$$
3. Average Linkage:
$$d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{\mathbf{x} \in C_i} \sum_{\mathbf{y} \in C_j} d(\mathbf{x}, \mathbf{y})$$
4. Ward's Linkage:
$$d(C_i, C_j) = \text{increase in WCSS when merging } C_i \text{ and } C_j$$
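Running SciPy's `linkage` on the earlier five-point example with each criterion shows how the merge heights differ (Ward merges by minimum variance increase, so its heights are on a different scale than the distance-based criteria):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[2.0], [4.0], [5.0], [10.0], [12.0]])   # A..E from the earlier example

# Same data, different linkage criteria -> different merge heights
for method in ['single', 'complete', 'average', 'ward']:
    Z = linkage(X, method=method)
    print(f"{method:>8}: heights = {Z[:, 2].round(2)}")
```

Single linkage merges everything at heights (1, 2, 2, 5); complete linkage ends at height 10, the distance between the farthest pair A and E.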
Linkage Methods: Comparison
Single
- Elongated
- Noise sensitive
Complete
- Compact
- Outlier sensitive
Ward's
- Min variance
- Most popular
Dendrograms: Interpretation
Reading Dendrograms
- Leaves: Individual points
- Height: Merge distance
- Branches: Relationships
- Cut: Extract $K$ clusters
\begin{exampleblock}{Extracting Clusters}
Cut at height $h$:
- Higher $\rightarrow$ fewer clusters
- Lower $\rightarrow$ more clusters
- Choose via validation
\end{exampleblock}
Advantage
Explore different $K$ without re-running!
Types of Cluster Validation
How do we assess clustering quality?
Internal Validation
Use only the data itself
Measures:
- Silhouette coefficient
- Davies-Bouldin index
- Calinski-Harabasz score
- Dunn index
Idea: Good clusters are compact and well-separated
External Validation
Compare to ground truth labels
Measures:
- Adjusted Rand Index (ARI)
- Normalized Mutual Information (NMI)
- V-measure
- Purity
Idea: Good clustering agrees with true labels
When to Use Each
Internal: Unsupervised setting (no labels)
External: When ground truth available (benchmarking)
Internal Metrics: Formulas
1. Silhouette Coefficient
$$s = \frac{1}{n} \sum_{i=1}^{n} \frac{b_i - a_i}{\max(a_i, b_i)}$$
Range: $[-1, 1]$, higher is better
2. Davies-Bouldin Index
$$DB = \frac{1}{K} \sum_{k=1}^{K} \max_{k' \neq k} \frac{\sigma_k + \sigma_{k'}}{d(\boldsymbol{\mu}_k, \boldsymbol{\mu}_{k'})}$$
Range: $[0, \infty)$, lower is better
3. Calinski-Harabasz Score (Variance Ratio)
$$CH = \frac{\text{Between-cluster variance}}{\text{Within-cluster variance}} \times \frac{n - K}{K - 1}$$
Range: $[0, \infty)$, higher is better
Usage
Use multiple metrics! Different metrics may favor different clusterings.
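All three internal metrics are available in `sklearn.metrics` (assuming scikit-learn is installed). A sketch on an easy two-blob dataset, where all of them should agree that the clustering is good:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))         # in [-1, 1], higher is better
print(davies_bouldin_score(X, labels))     # >= 0, lower is better
print(calinski_harabasz_score(X, labels))  # >= 0, higher is better
```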
Internal Validation: Visualization
Observation
Different metrics may suggest different optimal $K$. Use domain knowledge!
External Metrics: Formulas
Given true labels $Y$ and predicted labels $C$:
1. Adjusted Rand Index (ARI)
$$ARI = \frac{\text{RI} - E[\text{RI}]}{\max(\text{RI}) - E[\text{RI}]}$$
- Range: $[-1, 1]$, higher is better
- $ARI = 1$: Perfect agreement
- $ARI \approx 0$: Random labeling
- Adjusted for chance
2. Normalized Mutual Information (NMI)
$$NMI(Y, C) = \frac{2 \cdot I(Y; C)}{H(Y) + H(C)}$$
- Range: $[0, 1]$, higher is better
- $NMI = 1$: Perfect agreement
- Based on information theory
- Normalized for different $K$
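Both metrics are in `sklearn.metrics` (assuming scikit-learn is available). A small sketch showing that they are invariant to permuting cluster labels, which raw accuracy is not:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 1, 0, 0, 0]   # same partition, labels permuted

# Both metrics compare partitions, so label permutation does not matter
print(adjusted_rand_score(y_true, y_pred))            # 1.0
print(normalized_mutual_info_score(y_true, y_pred))   # 1.0

y_bad = [0, 1, 0, 1, 0, 1]    # disagrees with the true grouping
print(adjusted_rand_score(y_true, y_bad))             # below 0: worse than chance
```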
External Validation: Example
Note
External validation only for benchmarking. In real unsupervised tasks, no ground truth!
Application: Customer Segmentation
\begin{exampleblock}{Business Impact}
- Targeted marketing: Strategies per segment
- Product development: Tailor to groups
- Resource allocation: Focus on high-value
- Customer retention: Identify at-risk
\end{exampleblock}
Application: Image Color Quantization
\begin{exampleblock}{Use Cases}
- Image compression: Reduce file size
- Color palette: Identify dominant colors
- Segmentation: Group similar pixels
- Artistic effects: Posterization
\end{exampleblock}
Application: Biological Data (Iris Dataset)
\begin{exampleblock}{Bioinformatics Applications}
- Species classification: Taxonomic groups
- Gene expression: Co-expressed genes
- Protein structure: Protein families
- Disease subtyping: Patient subtypes
\end{exampleblock}
Application: Document Clustering
\begin{exampleblock}{Text Mining Applications}
- Topic discovery: Find themes in document collections
- News organization: Group similar articles
- Search results: Organize by topic clusters
- Recommendation: Find similar documents
\end{exampleblock}
Choosing a Clustering Algorithm
Decision Guide
Use K-Means when:
- You know $K$ (or can estimate it)
- Data has roughly spherical clusters
- Large dataset (scalability important)
- Want fast, simple method
Use GMM when:
- Need probabilistic assignments
- Clusters have different shapes/variances
- Want to measure uncertainty
- Have computational resources
Use Hierarchical when:
- Don't know $K$ in advance
- Want to explore multiple granularities
- Need interpretable hierarchy
- Small to medium dataset ($n < 10,000$)
Common Pitfalls \& Solutions
Pitfall 1: Not Scaling Features
Problem: Features with large ranges dominate distance
Solution: Standardize features: $z = \frac{x - \mu}{\sigma}$
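A minimal sketch with scikit-learn's `StandardScaler` (the age/income numbers are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age vs income: the income column dwarfs age in raw Euclidean distance
X = np.array([[25.0, 30000.0], [30.0, 90000.0], [60.0, 32000.0], [65.0, 88000.0]])

Xs = StandardScaler().fit_transform(X)   # z = (x - mean) / std, per feature
print(Xs.mean(axis=0))   # each feature now has mean ~0
print(Xs.std(axis=0))    # ... and unit standard deviation
```

After scaling, both features contribute comparably to any distance computation, so clusters reflect both age and income rather than income alone.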
Pitfall 2: Using Wrong Distance Metric
Problem: Euclidean not always appropriate
Solution: Match metric to data type (cosine for text, custom for categorical)
Pitfall 3: Ignoring Outliers
Problem: Outliers can distort clusters (especially K-Means)
Solution: Detect and remove outliers, or use robust methods (DBSCAN, K-Medoids)
Pitfall 4: Blindly Trusting One Metric
Problem: Single validation metric may be misleading
Solution: Use multiple metrics + visual inspection + domain knowledge
Best Practices: Practical Tips
\begin{exampleblock}{Data Preprocessing}
- Scale features: Use StandardScaler or MinMaxScaler
- Handle missing values: Impute or remove
- Remove duplicates: Can bias cluster centers
- Consider dimensionality reduction: PCA for high-dimensional data
\end{exampleblock}
\begin{exampleblock}{Model Selection}
- Try multiple K values: Use elbow + silhouette + domain knowledge
- Run multiple times: Different initializations for K-Means
- Validate results: Use both internal and visual validation
- Compare algorithms: K-Means, GMM, Hierarchical
\end{exampleblock}
\begin{exampleblock}{Interpretation}
- Visualize clusters: 2D projections (PCA, t-SNE)
- Inspect cluster centers: What characterizes each cluster?
- Verify with domain experts: Do clusters make sense?
- Iterate: Clustering is exploratory - refine based on insights
\end{exampleblock}
Key Takeaways
Clustering Fundamentals
- Unsupervised learning: Discover structure without labels
- Distance metrics: Foundation of clustering (Euclidean, cosine, etc.)
- Two main types: Partitional vs Hierarchical
Key Algorithms
- K-Means: Fast, simple, hard assignments, spherical clusters
- GMM: Soft assignments, flexible shapes, probabilistic
- Hierarchical: No need for K, produces dendrogram, $O(n^2)$
Validation
- Internal: Silhouette, Davies-Bouldin, Calinski-Harabasz
- External: ARI, NMI (when ground truth available)
- Selection: Elbow method, silhouette analysis, domain knowledge
What We Covered
- Introduction: Motivation, applications, clustering types
- Distance Metrics: Euclidean, Manhattan, cosine similarity
- K-Means: Algorithm, initialization (K-Means++), choosing K
- GMM: Soft clustering, EM algorithm, comparison with K-Means
- Hierarchical: Agglomerative, linkage methods, dendrograms
- Validation: Internal and external metrics
- Applications: Customer segmentation, image processing, bioinformatics
- Best Practices: Algorithm selection, preprocessing, pitfalls
Next Steps
- Practice: Try clustering on real datasets
- Experiment: Compare different algorithms and parameters
- Read: Advanced topics - DBSCAN, spectral clustering, etc.
Further Reading
Textbooks
- Bishop: Pattern Recognition \& Machine Learning (Ch. 9)
- Murphy: Probabilistic ML (Ch. 21)
- Hastie et al.: Elements of Statistical Learning (Ch. 14)
Key Papers
- Arthur \& Vassilvitskii (2007): K-Means++
- Dempster et al. (1977): EM Algorithm
- Rousseeuw (1987): Silhouette Coefficient
Implementations
- scikit-learn: KMeans, GaussianMixture, AgglomerativeClustering
- scipy: Hierarchical clustering (linkage, dendrogram)
- R: kmeans, hclust, cluster package
End of Module 11
Clustering
Questions?