Finding Hidden Groups in Unlabeled Data
Department of Computer Science
University of the Philippines Cebu
Lecture 17: K-Means & Hierarchical
"The goal is to turn data into information, and information into insight."
— Carly Fiorina
Distinguish labeled from unlabeled learning paradigms.
Apply the K-Means algorithm and choose optimal K.
Build dendrograms and compare linkage methods.
Profile clusters and translate them into business actions.
In Weeks 7-8 we built supervised models (regression, trees) that learn from labeled outcomes. This week we remove the labels entirely.
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Labels | Has target variable (y) | No labels at all |
| Goal | Predict an outcome | Discover hidden structure |
| Methods | Regression, Classification | Clustering, Dimensionality Reduction |
| Question | "What will happen?" | "What patterns exist?" |
| Evaluation | Accuracy, RMSE, AUC | Silhouette (cluster quality), Inertia (cluster tightness) |
Clustering is the task of partitioning data points into groups (clusters) so that points within the same cluster are more similar to each other than to points in other clusters.
Unlike classification, no one tells the algorithm what the groups should be — it discovers them from the data alone.
| Term | Definition |
|---|---|
| Cluster | A group of similar data points |
| Centroid | The "center" of a cluster (mean of all its points) |
| Distance metric | How we measure similarity (e.g., Euclidean distance) |
| K | The number of clusters (chosen by the analyst) |
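The two building blocks above (distance metric and centroid) take only a few lines to compute. A minimal sketch with NumPy; the points are illustrative:

```python
import numpy as np

# Two illustrative points in (income, spending) space
a = np.array([15.0, 82.0])
b = np.array([18.0, 78.0])

# Euclidean distance: square root of the summed squared differences
dist = np.sqrt(np.sum((a - b) ** 2))
print(round(dist, 1))  # → 5.0

# Centroid of a small cluster = mean of its points
cluster = np.array([[15, 82], [18, 78], [20, 88]], dtype=float)
centroid = cluster.mean(axis=0)
print(centroid.round(2))  # roughly (17.67, 82.67)
```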
Points naturally group by proximity.
The × marks each cluster's centroid.
Clustering is the most widely deployed unsupervised learning technique. Every major tech company uses it daily.
We'll use this tiny dataset throughout both lectures. Small enough to trace by hand, big enough to show real patterns.
Look at the table. Before any algorithm, can you spot which customers are similar? That's exactly what clustering will automate.
| ID | Name | Income (k₱) | Age | Spending |
|---|---|---|---|---|
| 1 | Ana | 15 | 22 | 82 |
| 2 | Ben | 18 | 25 | 78 |
| 3 | Cris | 20 | 28 | 88 |
| 4 | Dina | 50 | 35 | 55 |
| 5 | Erik | 55 | 32 | 60 |
| 6 | Faye | 70 | 45 | 28 |
| 7 | Gino | 75 | 48 | 22 |
| 8 | Hana | 90 | 55 | 15 |
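The table above can be loaded as a small pandas DataFrame so the later snippets have a concrete `df` to work with. A sketch; the `df` and `features` names match how they are used in the code that follows:

```python
import pandas as pd

# The 8-customer dataset used throughout both lectures
df = pd.DataFrame({
    'name':     ['Ana', 'Ben', 'Cris', 'Dina', 'Erik', 'Faye', 'Gino', 'Hana'],
    'income':   [15, 18, 20, 50, 55, 70, 75, 90],   # thousands of pesos
    'age':      [22, 25, 28, 35, 32, 45, 48, 55],
    'spending': [82, 78, 88, 55, 60, 28, 22, 15],   # spending score
})
features = ['income', 'age', 'spending']
print(df.shape)  # → (8, 4)
```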
Partition n observations into K clusters by minimizing within-cluster variance.
\begin{algorithm}
\caption{K-Means Clustering}
\begin{algorithmic}
\REQUIRE Dataset $X = \{x_1, \ldots, x_n\}$, number of clusters $K$
\STATE Initialize centroids $\mu_1, \ldots, \mu_K$ randomly from $X$
\REPEAT
\FOR{each $x_i \in X$}
\STATE $c_i \leftarrow \arg\min_j \|x_i - \mu_j\|^2$
\ENDFOR
\FOR{$j = 1$ \TO $K$}
\STATE $\mu_j \leftarrow \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$
\ENDFOR
\UNTIL{centroids do not change}
\end{algorithmic}
\end{algorithm}
Minimize Within-Cluster Sum of Squares (WCSS):
J = Σₖ Σ_{xᵢ ∈ Cₖ} ||xᵢ − μₖ||²
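The objective J can be evaluated directly from data, labels, and centroids. A minimal sketch on a toy two-cluster assignment (points illustrative):

```python
import numpy as np

# Toy data: two obvious groups; centroids are the cluster means
X = np.array([[1, 2], [2, 1], [2, 3], [5, 4], [6, 5], [5, 6]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
centroids = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])

# J = sum over clusters of squared distances to the cluster centroid
J = sum(np.sum((X[labels == k] - centroids[k]) ** 2) for k in (0, 1))
print(round(J, 2))  # → 5.33
```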
Let's trace one complete iteration with K=2 on 6 points. Grab a pen!
| Point | Coords | Dist to μ1 | Dist to μ2 | Assign |
|---|---|---|---|---|
| A | (1, 2) | 1.0 | 6.4 | C1 |
| B | (2, 1) | 1.0 | 6.4 | C1 |
| C | (2, 3) | 2.2 | 5.0 | C1 |
| D | (5, 4) | 5.0 | 2.2 | C2 |
| E | (6, 5) | 6.4 | 1.0 | C2 |
| F | (5, 6) | 6.4 | 1.0 | C2 |
Distances = √((x−cx)² + (y−cy)²), rounded to 1 decimal.
μ1 = (1, 1) μ2 = (6, 6)
Each point goes to its nearest centroid (lower distance wins).
μ1new = mean of A,B,C = (1.7, 2.0)
μ2new = mean of D,E,F = (5.3, 5.0)
Centroids shifted! Recalculate distances with new centroids and reassign. Continue until centroids stop moving.
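The hand trace above can be checked in code. A minimal sketch of one assignment step plus one update step with NumPy:

```python
import numpy as np

X = np.array([[1, 2], [2, 1], [2, 3], [5, 4], [6, 5], [5, 6]], dtype=float)
centroids = np.array([[1, 1], [6, 6]], dtype=float)  # initial mu1, mu2

# Assignment step: each point goes to its nearest centroid
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)
print(labels)  # → [0 0 0 1 1 1]

# Update step: each centroid moves to the mean of its assigned points
new_centroids = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
print(new_centroids.round(1))  # mu1=(1.7, 2.0), mu2=(5.3, 5.0)
```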
| Name | Inc | Spd | d(μ1) | d(μ2) | d(μ3) | Cluster |
|---|---|---|---|---|---|---|
K-Means uses Euclidean distance (straight-line distance: √Σ(ai − bi)²). Features with larger scales will dominate the distance calculation and distort the clusters.
Income (0-1M) vs Age (18-80): income's larger scale dominates the distance, so clusters form by income alone while age is effectively ignored.
Both features centered at 0 with unit variance. Equal contribution to distance.
Run KMeans(n_clusters=K) on scaled data. Key parameters:
- n_clusters — number of clusters (K)
- n_init=10 — runs with different random starts
- random_state — seed for reproducibility
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[features])
km = KMeans(n_clusters=3, random_state=42)
df['cluster'] = km.fit_predict(X_scaled)
Plot inertia (WCSS) against K. Look for the "elbow" where the rate of decrease sharply levels off.
Sum of squared distances from each point to its assigned centroid. Think of it as "total cluster tightness" — lower = tighter, but it always decreases with more K (at K=8 every customer is its own cluster, inertia=0).
import matplotlib.pyplot as plt
inertias = []
K_range = range(1, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method')
# Our 8 customers — inertia values:
# K=1: 24.0 K=2: 10.2 K=3: 3.8
# K=4: 2.9 K=5: 2.1 K=6: 1.4
#
# K=1→2: -57% K=2→3: -63% ← BIG drops
# K=3→4: -24% K=4→5: -28% ← smaller
# → Elbow at K=3 (3 natural groups!)
The "elbow" is where adding more clusters stops giving meaningful improvement. Here's how to spot it.
| K | Inertia | Drop | % Improved | Verdict |
|---|---|---|---|---|
| 1 | 850 | — | — | |
| 2 | 500 | −350 | 41% | Big gain |
| 3 | 280 | −220 | 44% | Big gain |
| 4 | 265 | −15 | 5% | Elbow! |
| 5 | 260 | −5 | 2% | Diminishing |
From K=3 to K=4, improvement drops from 44% to just 5%. That's the elbow → choose K=3.
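The "Drop" and "% Improved" columns are simple successive differences. A sketch computing them from the inertia values in the table above:

```python
inertias = [850, 500, 280, 265, 260]  # K = 1..5, from the table

drops = []
for k, (prev, curr) in enumerate(zip(inertias, inertias[1:]), start=2):
    pct = 100 * (prev - curr) / prev   # % improvement over previous K
    drops.append(round(pct))
    print(f"K={k}: improvement={pct:.0f}%")

print(drops)  # → [41, 44, 5, 2]  — the elbow is the 44% → 5% collapse
```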
Real data often has a smooth curve instead of a sharp bend. When this happens, combine Elbow + Silhouette + business sense: no single metric gives you the "correct" K — clustering is exploratory, not a test with one right answer.
| K | Inertia | Drop | % | Silhouette | Verdict |
|---|---|---|---|---|---|
Measures how similar a point is to its own cluster vs. the nearest other cluster. Higher = better separated clusters.
s(i) = (b(i) − a(i)) / max(a(i), b(i))
If a=2 (avg distance within cluster) and b=5 (avg distance to nearest other cluster):
s = (5−2) / max(2,5) = 3/5 = 0.6 → reasonable structure.
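The worked example above, written out in code:

```python
# a = mean intra-cluster distance for the point
# b = mean distance to the nearest other cluster
a, b = 2.0, 5.0
s = (b - a) / max(a, b)
print(s)  # → 0.6, i.e. reasonable structure
```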
from sklearn.metrics import silhouette_score
scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    scores.append(score)
    print(f"K={k}: Silhouette = {score:.3f}")
# Pick K with highest silhouette score
best_k = range(2, 11)[scores.index(max(scores))]
print(f"\nBest K = {best_k}")
# Example output:
# K=2: Silhouette = 0.42
# K=3: Silhouette = 0.58
# K=4: Silhouette = 0.61 ← BEST
# K=5: Silhouette = 0.55
# Best K = 4
# Interpretation guide:
# > 0.7 Strong structure
# 0.5-0.7 Reasonable structure
# 0.25-0.5 Weak, may be artificial
# < 0.25 No substantial structure
Silhouette Score is not the only option. Use multiple metrics together for a more robust evaluation of clustering quality.
Ratio of within-cluster scatter to between-cluster separation. Lower = better (0 is perfect).
Ratio of between-cluster variance to within-cluster variance. Higher = better (no upper bound).
from sklearn.metrics import (silhouette_score,
davies_bouldin_score, calinski_harabasz_score)
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X_scaled)
    print(f"K={k}: Silhouette={silhouette_score(X_scaled, labels):.3f}"
          f" DB={davies_bouldin_score(X_scaled, labels):.3f}"
          f" CH={calinski_harabasz_score(X_scaled, labels):.0f}")
Clusters are only useful when you can interpret and name them. Compute summary statistics per cluster.
If you can't name it, the clustering may not be meaningful.
# Cluster profiles (using our data)
profile = df.groupby('cluster').agg({
'income': 'mean',
'age': 'mean',
'spending': 'mean',
'name': 'count'
}).round(1)
profile.columns = ['Avg Income', 'Avg Age',
'Avg Spending', 'Count']
print(profile)
# Output:
# Avg Income Avg Age Avg Spending Count
# 0 17.7 25.0 82.7 3
# 1 52.5 33.5 57.5 2
# 2 78.3 49.3 21.7 3
# Name them based on patterns:
names = {
0: 'Young Big Spenders',
1: 'Mid-Range Balanced',
2: 'Wealthy & Frugal'
}
df['segment'] = df['cluster'].map(names)
print(df[['name', 'segment']])
# Ana → Young Big Spenders
# Dina → Mid-Range Balanced
# Hana → Wealthy & Frugal
When features exceed 2 dimensions, use PCA (Principal Component Analysis) to project clusters onto a 2D plane for plotting.
PCA finds the directions along which your data spreads out the most (has the most variance). Think of choosing the best camera angle to photograph a 3D object in 2D — you lose some depth, but keep the most informative view.
explained_variance_ratio_ tells how much info each component captures (e.g., 72% + 18% = 90% retained).
from sklearn.decomposition import PCA
# Our 8 customers have 3 features (3D)
# PCA squishes 3D → 2D for plotting
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# How much info did we keep?
print(pca.explained_variance_ratio_)
# [0.92, 0.07] → PC1=92%, PC2=7%
# Total: 99% of info kept in 2D!
# Plot clusters in 2D
plt.scatter(X_pca[:, 0], X_pca[:, 1],
c=df['cluster'], cmap='viridis', s=80)
# Add customer names as labels
for i, name in enumerate(df['name']):
    plt.annotate(name, (X_pca[i, 0], X_pca[i, 1]))
plt.xlabel('PC1 (92% of variance)')
plt.ylabel('PC2 (7% of variance)')
# → 3 clear groups visible in 2D!
K-Means is fast and intuitive but makes strong assumptions that don't always hold.
Assumes circular/spherical clusters. Fails on elongated or irregular shapes.
Sensitive to random initialization — different starts can produce different clusterings. Mitigated by k-means++ (smart init that spreads centroids apart — sklearn default) and n_init=10, but still possible.
You must choose K before fitting. No automatic detection.
Outliers pull centroids, distorting cluster boundaries.
Assumes clusters have similar size and density. If one group has 500 members and another 20, K-Means may split the big group instead.
K-Means draws straight-line boundaries between centroids. This works for round blobs, but fails spectacularly on other shapes.
K-Means: Correct!
K-Means: Wrong split!
K-Means: Total failure!
Use DBSCAN (Lecture 18) for non-spherical clusters — it follows density, not distance to centroids.
Builds a tree of clusters without needing to specify K upfront. Agglomerative (bottom-up) is the most common approach.
\begin{algorithm}
\caption{Agglomerative Clustering}
\begin{algorithmic}
\REQUIRE Dataset $X = \{x_1, \ldots, x_n\}$, linkage function $d$
\STATE Initialize each $x_i$ as its own cluster $C_i$
\WHILE{number of clusters $> 1$}
\STATE $(C_a, C_b) \leftarrow \arg\min_{i \neq j} \, d(C_i, C_j)$
\STATE Merge $C_a$ and $C_b$ into a single cluster
\STATE Record merge distance (for dendrogram)
\ENDWHILE
\RETURN Dendrogram; cut at desired height for $K$ clusters
\end{algorithmic}
\end{algorithm}
The dendrogram lets you visually choose K after fitting by cutting at a chosen height.
Dendrogram (Y-axis = merge distance; longer lines = bigger gaps)
Each leaf at the bottom is a data point (or sample). The order doesn't matter much — it's rearranged to avoid crossing lines.
The height where two branches merge = how dissimilar those clusters were. Higher merge = less similar. The tallest vertical line represents the biggest "jump" in dissimilarity.
Draw a horizontal cut line. The number of vertical lines it crosses = K. Tip: Cut where vertical lines are longest (biggest gap = most natural groupings).
Cut at height 4 → crosses 3 lines → K=3. Cut at height 2 → 5 clusters. Cut at height 7 → everything merges into 1.
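The cut can be done programmatically with scipy's `fcluster`. A sketch on the six toy points from the K-Means trace; merge heights depend on the data, so the threshold here is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 2], [2, 1], [2, 3], [5, 4], [6, 5], [5, 6]], dtype=float)
Z = linkage(X, method='ward')

# criterion='distance' cuts the dendrogram at the given height t
labels = fcluster(Z, t=4.0, criterion='distance')
print(len(set(labels)))  # → 2 clusters at this cut
```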
Reading the Dendrogram
The linkage criterion determines how the "distance" between two clusters is computed during merging.
| Method | Distance Computed As | Best For | Sensitivity |
|---|---|---|---|
| Single | Minimum distance between any two points | Chain-like / elongated clusters | Very sensitive to noise |
| Complete | Maximum distance between any two points | Compact, spherical clusters | Moderate |
| Average | Mean pairwise distance between all points | Balanced general-purpose | Moderate |
| Ward | Minimizes total within-cluster variance | Similar-size clusters (most popular) | Low (most robust) |
Two routes: scipy's linkage + dendrogram to visualize the merge tree; sklearn's AgglomerativeClustering with a chosen K to get cluster labels.
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
Z = linkage(X_scaled, method='ward')
dendrogram(Z, truncate_mode='level', p=5)
hc = AgglomerativeClustering(n_clusters=4, linkage='ward')
df['cluster'] = hc.fit_predict(X_scaled)
With 94+ million users, GCash uses clustering to segment customers by transaction behavior and tailor financial products.
# Build RFM features from transactions
from datetime import datetime
today = datetime.now()
rfm = transactions.groupby('user_id').agg({
'txn_date': lambda x: (today - x.max()).days,
'txn_id': 'count',
'amount': 'sum'
}).rename(columns={
'txn_date': 'recency',
'txn_id': 'frequency',
'amount': 'monetary'
})
# Scale and cluster
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm)
kmeans = KMeans(n_clusters=4, random_state=42)
rfm['segment'] = kmeans.fit_predict(rfm_scaled)
# Profile
rfm.groupby('segment').mean().round(1)
After clustering, profile each segment and assign actionable marketing strategies.
| Segment | Recency | Frequency | Monetary | Strategy |
|---|---|---|---|---|
| Champions | Low (<15 days) | High (>10/mo) | High (>₱50k) | Reward loyalty, early access to GCredit |
| Potential | Low | Medium | Medium | Upsell GInvest, push GCash Mastercard |
| At Risk | High (lapsed) | Low | Medium | Win-back campaign, cashback offers |
| Lost | Very High | Low | Low | Exit survey, final reactivation attempt |
1. Clustering finds natural groups in data without labels.
2. Always scale features before running K-Means.
3. Use elbow + silhouette together to choose K.
4. Hierarchical clustering shows the full merge tree via dendrograms.
DBSCAN, Market Basket Analysis, and RFM Segmentation
Department of Computer Science
University of the Philippines Cebu
Lecture 18: Beyond K-Means
Apply density-based clustering for arbitrary shapes and outlier detection.
Use clustering and Isolation Forest to flag unusual observations.
Find product associations with Apriori (support, confidence, lift).
Score and segment customers by Recency, Frequency, Monetary value.
Density-Based Spatial Clustering of Applications with Noise. Groups dense regions and labels sparse points as noise.
\begin{algorithm}
\caption{DBSCAN}
\begin{algorithmic}
\REQUIRE Dataset $X$, radius $\varepsilon$, minimum points $m$
\FOR{each unvisited point $x_i \in X$}
\STATE Mark $x_i$ as visited
\STATE $N_i \leftarrow$ neighbors of $x_i$ within $\varepsilon$
\IF{$|N_i| \geq m$}
\STATE Create new cluster $C$; add $x_i$
\STATE Expand $C$: recursively add all density-reachable points
\ELSE
\STATE Label $x_i$ as \textbf{noise}
\ENDIF
\ENDFOR
\end{algorithmic}
\end{algorithm}
Point Classification in DBSCAN
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=1.2, min_samples=2)
labels = db.fit_predict(X_scaled)
n_clusters = len(set(labels) - {-1})
n_noise = list(labels).count(-1)
| Name | Inc | Spd | Nbrs | Type | Cluster |
|---|---|---|---|---|---|
| Aspect | K-Means | DBSCAN |
|---|---|---|
| Cluster shape | Spherical / convex only | Any arbitrary shape |
| Number of clusters | Must specify K | Discovered automatically |
| Outlier handling | Forces into nearest cluster | Labels as noise (−1) |
| Cluster density | Assumes uniform | Handles varying density |
| Scalability | Fast (O(nKt)) | Slower (O(n log n) with index) |
| Reproducibility | Varies by initialization | Nearly deterministic* |
| Best for | Customer segmentation, compression | Anomaly detection, spatial data |
| *Border points reachable from multiple clusters may vary by processing order. | ||
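The shape difference in the table is easy to demonstrate on scikit-learn's two-moons toy data. A sketch; eps and min_samples are tuned to this particular dataset:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans

# Two interleaving crescents — non-spherical by construction
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# DBSCAN follows density, so it traces each crescent as one cluster
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(len(set(db_labels) - {-1}))  # → 2

# K-Means draws a straight boundary and splits the crescents instead
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
```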
Use clustering to identify observations that don't belong to any group — potential fraud, errors, or rare events.
Set contamination to the expected anomaly proportion (e.g., 0.05 = 5%).
# Method 1: DBSCAN noise detection
dbscan = DBSCAN(eps=0.5, min_samples=10)
labels = dbscan.fit_predict(X_scaled)
anomalies_db = df[labels == -1]
print(f"DBSCAN anomalies: {len(anomalies_db)}")
# Method 2: Centroid distance threshold
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(X_scaled)
distances = kmeans.transform(X_scaled).min(axis=1)
# Flag top 5% as anomalies
threshold = np.percentile(distances, 95)
anomalies_km = df[distances > threshold]
print(f"K-Means anomalies: {len(anomalies_km)}")
# Method 3: Isolation Forest
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.05)
pred = iso.fit_predict(X_scaled)
anomalies_if = df[pred == -1] # -1 = anomaly
Goal: Discover products that are frequently purchased together. Powers product recommendations, store layouts, and promotional bundles.
Example Transaction Data
| TXN | Items Purchased |
|---|---|
| 1 | Rice, Eggs, Milk |
| 2 | Rice, Eggs |
| 3 | Bread, Butter, Coffee |
| 4 | Rice, Milk, Coffee |
| 5 | Eggs, Milk, Bread |
Which items are often bought together?
How often items appear together in all transactions.
Support(A→B) = P(A ∩ B)
Rice & Eggs both in TXN 1, 2 → 2 out of 5 = 0.4
How often the rule is correct (conditional probability).
Conf(A→B) = P(B|A)
Rice in TXN 1,2,4 (3 total). Rice+Eggs in TXN 1,2. → 2/3 = 0.67
Strength of association (vs random chance).
Lift = Conf(A→B) / P(B)
Lift = 0.67 / P(Eggs). P(Eggs)=3/5=0.6. Lift = 0.67/0.6 = 1.11 (slight positive). Lift=1 means independent; Lift=2 means 2× more likely together.
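All three metrics can be checked directly against the five transactions above. A plain-Python sketch for the rule Rice → Eggs:

```python
txns = [
    {'Rice', 'Eggs', 'Milk'},
    {'Rice', 'Eggs'},
    {'Bread', 'Butter', 'Coffee'},
    {'Rice', 'Milk', 'Coffee'},
    {'Eggs', 'Milk', 'Bread'},
]
n = len(txns)

support_ab = sum({'Rice', 'Eggs'} <= t for t in txns) / n  # P(Rice and Eggs)
conf = support_ab / (sum('Rice' in t for t in txns) / n)   # P(Eggs | Rice)
lift = conf / (sum('Eggs' in t for t in txns) / n)         # vs random chance

print(round(support_ab, 2), round(conf, 2), round(lift, 2))
# → 0.4 0.67 1.11
```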
pip install mlxtend
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
te = TransactionEncoder()
basket = pd.DataFrame(
te.fit_transform(transactions), columns=te.columns_)
freq = apriori(basket, min_support=0.3, use_colnames=True)
rules = association_rules(
freq, num_itemsets=len(basket),
metric='lift', min_threshold=1.0)
The Philippines has over 1 million sari-sari stores. Basket analysis helps small retailers optimize their limited shelf space and bundling strategies.
Place eggs next to instant noodles, and bundle coffee + bread for "breakfast packs."
# Load sari-sari store transactions
baskets = pd.read_csv('ph_sarisari_txns.csv')
# One-hot encode
encoded = baskets.groupby(
['txn_id', 'product']
)['qty'].sum().unstack().fillna(0)
encoded = (encoded > 0).astype(int)
# Find associations
freq = apriori(encoded, min_support=0.02,
use_colnames=True)
rules = association_rules(
freq, num_itemsets=len(encoded),
metric='lift', min_threshold=1.5
)
# Top actionable rules
top = rules[
(rules['confidence'] > 0.5) &
(rules['lift'] > 2)
].sort_values('lift', ascending=False)
print(top.head(10))
RFM segments customers by three behavioral dimensions, each answering a critical business question.
"How recently did they buy?"
Days since last purchase. Lower = better.
"How often do they buy?"
Total number of purchases. Higher = better.
"How much do they spend?"
Total revenue from customer. Higher = better.
Invert R: fewer days = higher score (labels=[5,4,3,2,1]).
from datetime import timedelta
snapshot = df['order_date'].max() + timedelta(1)
rfm = df.groupby('customer_id').agg({
'order_date': lambda x: (snapshot - x.max()).days,
'order_id': 'nunique',
'amount': 'sum'
}).rename(columns={
'order_date': 'recency',
'order_id': 'frequency',
'amount': 'monetary'
})
rfm['R'] = pd.qcut(rfm['recency'], 5, labels=[5,4,3,2,1])
rfm['F'] = pd.qcut(rfm['frequency'].rank(method='first'), 5, labels=[1,2,3,4,5])
rfm['M'] = pd.qcut(rfm['monetary'], 5, labels=[1,2,3,4,5])
rfm['RFM'] = rfm['R'].astype(str) + rfm['F'].astype(str) + rfm['M'].astype(str)
| Segment | R | F | M | Description | Marketing Action |
|---|---|---|---|---|---|
| Champions | 5 | 5 | 5 | Best customers, recent & frequent | Loyalty rewards, referral program |
| Loyal | 3-5 | 4-5 | 3-5 | Consistent, reliable buyers | Upsell, premium access |
| Potential | 4-5 | 1-3 | 1-3 | Recent but low engagement | Onboarding, education |
| At Risk | 1-2 | 3-5 | 3-5 | Were loyal, now lapsing | Win-back campaign, discounts |
| Hibernating | 1-2 | 1-2 | 2-3 | Long inactive, moderate value | Reactivation offer |
| Lost | 1 | 1 | 1 | Inactive, low value | Exit survey, deprioritize |
Instead of manually defining segments with quintile thresholds, let K-Means discover natural groupings in the RFM space.
Manual RFM uses arbitrary thresholds. Clustering finds the actual boundaries in your data — often more meaningful and data-driven.
Monetary values are often heavily right-skewed. Apply np.log1p() before scaling to prevent a few
high-spenders from dominating the clusters. Without log: [100, 500, 50000] has huge range. After log1p: [4.6, 6.2, 10.8] — much more evenly spaced.
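The compression is easy to see directly. A one-line sketch reproducing the numbers above:

```python
import numpy as np

# Heavily right-skewed monetary values, compressed by log1p = log(1 + x)
monetary = np.array([100, 500, 50000], dtype=float)
print(np.log1p(monetary).round(1))  # → 4.6, 6.2, 10.8
```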
# Use K-Means on RFM features
rfm_features = rfm[['recency', 'frequency',
'monetary']].copy()
# Log-transform skewed monetary column
rfm_features['monetary'] = np.log1p(
rfm_features['monetary'])
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_features)
# Find optimal K
from sklearn.metrics import silhouette_score
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(rfm_scaled)
    sil = silhouette_score(rfm_scaled, labels)
    print(f"K={k}: Silhouette={sil:.3f}")
# Fit best K (from silhouette loop above)
best_k = max(range(2, 8), key=lambda k: silhouette_score(
    rfm_scaled,
    KMeans(n_clusters=k, random_state=42).fit_predict(rfm_scaled)))
best_km = KMeans(n_clusters=best_k, random_state=42)
rfm['cluster'] = best_km.fit_predict(rfm_scaled)
# Profile and name clusters
profile = rfm.groupby('cluster').agg({
'recency': 'mean',
'frequency': 'mean',
'monetary': ['mean', 'count']
}).round(1)
print(profile)
Every algorithm in sklearn.cluster has trade-offs. Use this table to pick the right one.
| Algorithm | Parameters | Scalability | Use Case | Geometry |
|---|---|---|---|---|
| K-Means | n_clusters | Very large n, medium K | General-purpose, even clusters | Euclidean (spherical) |
| Mini Batch K-Means | n_clusters, batch_size | Huge datasets (streaming) | Same as K-Means, faster | Euclidean (spherical) |
| Agglomerative | n_clusters, linkage | Large n with connectivity | Hierarchies, dendrograms | Any pairwise distance |
| DBSCAN | eps, min_samples | Very large n, medium K | Irregular shapes, outliers | Nearest-point distance |
| HDBSCAN | min_cluster_size | Large n, medium K | Variable density, no eps tuning | Nearest-point distance |
| OPTICS | min_samples, xi | Very large n, large K | Variable density, reachability plots | Point distances |
| Gaussian Mixture | n_components, cov_type | Not scalable | Soft clustering, density estimation | Mahalanobis |
| Spectral | n_clusters, affinity | Medium n, small K | Non-flat manifolds, graphs | Graph distance |
Start with K-Means. If clusters aren't spherical, try DBSCAN. If density varies, use HDBSCAN. Need probabilities? Use GMM.
K-Means, Hierarchical, and DBSCAN cover most use cases. These four solve specific pain points.
Gaussian Mixture (GMM): Soft clustering — each point gets a probability of belonging to each cluster.
HDBSCAN: Hierarchical DBSCAN — no eps to tune; the key parameter is min_cluster_size.
MiniBatchKMeans (sklearn.cluster.MiniBatchKMeans): Streaming K-Means — uses random mini-batches instead of the full dataset per iteration.
OPTICS: Generalized DBSCAN — one run replaces many DBSCAN runs at different eps values.
Some clustering algorithms can predict labels for unseen data. Others can only label the data they were trained on.
Has a .predict() method. Learn once, apply to new data forever.
K-Means, Mini Batch K-Means, GMM, BIRCH
Only has .fit_predict(). Must re-fit if new data arrives.
DBSCAN, HDBSCAN, OPTICS, Hierarchical, Spectral
| Scenario | Need |
|---|---|
| New customer signs up → assign to segment | Inductive |
| One-time analysis of existing data | Either works |
| Real-time fraud detection on streaming data | Inductive |
| Exploring spatial patterns in a fixed dataset | Transductive OK |
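For an inductive model, the "new customer signs up" scenario is a one-liner. A sketch on an illustrative two-feature toy set; the key detail is reusing the fitted scaler:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Fit once on historical (income, spending) data
X = np.array([[15, 82], [18, 78], [20, 88],
              [70, 28], [75, 22], [90, 15]], dtype=float)
scaler = StandardScaler().fit(X)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(scaler.transform(X))

# New customer arrives: transform with the SAME scaler, then .predict()
new_customer = np.array([[17, 80]])
segment = km.predict(scaler.transform(new_customer))
print(segment[0])  # lands in the young-big-spender cluster
```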
If you need .predict() but want DBSCAN-like behavior, train HDBSCAN with prediction_data=True and use the hdbscan library's approximate_predict() function.
1. DBSCAN finds arbitrary-shaped clusters and naturally identifies outliers as noise.
2. Anomaly detection uses distance from normal patterns (DBSCAN noise, centroid distance, Isolation Forest).
3. Market basket analysis reveals product co-occurrence patterns via support, confidence, and lift.
4. RFM + Clustering creates data-driven customer segments with clear marketing actions.
A comprehensive end-to-end customer analytics exercise combining everything from this week.
Next Week: Time Series Analytics — Forecasting trends, seasonality, and building ARIMA models.
Interpretation: The elbow occurs at K=3 (inertia drops sharply before, flattens after). Silhouette score peaks at K=3 (0.45), confirming 3 well-separated clusters in the PSA regional data.
NCR, CALABARZON, Region III — High GDP, low poverty, high tourism
Region VI, VII, XI, X — Moderate GDP, mixed poverty rates
BARMM, Region V, VIII, IX, XII — High poverty, low GDP
Reading the dendrogram: Cutting at distance ~4 (red dashed line) yields 3 clusters that match K-Means results. BARMM merges last — it's the most distinct region, driven by 63% poverty rate and lowest literacy.