Week 09 · Session 1

Clustering & Segmentation

Finding Hidden Groups in Unlabeled Data

Department of Computer Science

University of the Philippines Cebu

Lecture 17: K-Means & Hierarchical

The Power of Grouping

"The goal is to turn data into information, and information into insight."

— Carly Fiorina

Agenda

Lecture 17 Objectives

Supervised vs Unsupervised

Distinguish labeled from unlabeled learning paradigms.

K-Means

Apply the K-Means algorithm and choose optimal K.

Hierarchical

Build dendrograms and compare linkage methods.

Interpretation

Profile clusters and translate them into business actions.

Foundations

Two Paradigms of Machine Learning

In Weeks 7-8 we built supervised models (regression, trees) that learn from labeled outcomes. This week we remove the labels entirely.

| | Supervised | Unsupervised |
|---|---|---|
| Labels | Has a target variable (y) | No labels at all |
| Goal | Predict an outcome | Discover hidden structure |
| Methods | Regression, Classification | Clustering, Dimensionality Reduction |
| Question | "What will happen?" | "What patterns exist?" |
| Evaluation | Accuracy, RMSE, AUC | Silhouette (cluster quality), Inertia (cluster tightness) |
Definition

What is Clustering?

Clustering is the task of partitioning data points into groups (clusters) so that points within the same cluster are more similar to each other than to points in other clusters.

Unlike classification, no one tells the algorithm what the groups should be — it discovers them from the data alone.

Key Terms

Cluster A group of similar data points
Centroid The "center" of a cluster (mean of all its points)
Distance metric How we measure similarity (e.g., Euclidean distance)
K The number of clusters (chosen by the analyst)

Points naturally group by proximity.
The × marks each cluster's centroid.
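Everything that follows depends on the distance metric, so here is a minimal sketch of the Euclidean distance computation (the points are illustrative):

```python
import math

# Euclidean distance: straight-line distance between two points
def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

a = (1, 2)
b = (5, 5)
print(euclidean(a, b))  # → 5.0, since √(4² + 3²) = 5
```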

Motivation

Why Clustering Matters in Industry

Clustering is the most widely deployed unsupervised learning technique. Every major tech company uses it daily.

Running Example

Meet Our 8 Customers

We'll use this tiny dataset throughout both lectures. Small enough to trace by hand, big enough to show real patterns.

Three Features

  • Income — annual income in thousands (₱)
  • Age — years old
  • Spending — monthly spending score (0-100)
Can you see the groups?

Look at the table. Before any algorithm, can you spot which customers are similar? That's exactly what clustering will automate.

| ID | Name | Income (k₱) | Age | Spending |
|---|---|---|---|---|
| 1 | Ana | 15 | 22 | 82 |
| 2 | Ben | 18 | 25 | 78 |
| 3 | Cris | 20 | 28 | 88 |
| 4 | Dina | 50 | 35 | 55 |
| 5 | Erik | 55 | 32 | 60 |
| 6 | Faye | 70 | 45 | 28 |
| 7 | Gino | 75 | 48 | 22 |
| 8 | Hana | 90 | 55 | 15 |

Color legend from the slide: Young & High-Spending (IDs 1–3) · Mid-Range (IDs 4–5) · Wealthy & Low-Spending (IDs 6–8)
Core Algorithm

K-Means Clustering

Partition n observations into K clusters by minimizing within-cluster variance.

\begin{algorithm}
\caption{K-Means Clustering}
\begin{algorithmic}
\REQUIRE Dataset $X = \{x_1, \ldots, x_n\}$, number of clusters $K$
\STATE Initialize centroids $\mu_1, \ldots, \mu_K$ randomly from $X$
\REPEAT
    \FOR{each $x_i \in X$}
        \STATE $c_i \leftarrow \arg\min_j \|x_i - \mu_j\|^2$
    \ENDFOR
    \FOR{$j = 1$ \TO $K$}
        \STATE $\mu_j \leftarrow \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$
    \ENDFOR
\UNTIL{centroids do not change}
\end{algorithmic}
\end{algorithm}
                        

Objective Function

Minimize Within-Cluster Sum of Squares (WCSS):
J = Σ_k Σ_{i ∈ C_k} ||x_i − μ_k||²

  • x_i — a data point
  • μ_k — centroid of cluster k
  • C_k — set of points in cluster k
  • ||·||² — squared Euclidean distance
[Figure: K-Means with K=3 — three clusters; + marks each centroid.]
Worked Example

K-Means Iteration — By Hand

Let's trace one complete iteration with K=2 on 6 points. Grab a pen!

| Point | Coords | Dist to μ1 | Dist to μ2 | Assign |
|---|---|---|---|---|
| A | (1, 2) | 1.0 | 6.4 | C1 |
| B | (2, 1) | 1.0 | 6.4 | C1 |
| C | (2, 3) | 2.2 | 5.0 | C1 |
| D | (5, 4) | 5.0 | 2.2 | C2 |
| E | (6, 5) | 6.4 | 1.0 | C2 |
| F | (5, 6) | 6.4 | 1.0 | C2 |

Distances = √((x − cx)² + (y − cy)²), computed against μ1 = (1, 1) and μ2 = (6, 6), rounded to 1 decimal.

Step 1: Pick initial centroids

μ1 = (1, 1)   μ2 = (6, 6)

Step 2: Assign → See table

Each point goes to its nearest centroid (lower distance wins).

Step 3: Update centroids

μ1′ = mean of A, B, C = (1.67, 2.0)
μ2′ = mean of D, E, F = (5.33, 5.0)

Step 4: Repeat

Centroids shifted! Recalculate distances with new centroids and reassign. Continue until centroids stop moving.
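The hand-traced iteration above can be checked in NumPy — a short sketch using the same six points and initial centroids:

```python
import numpy as np

# The six points and initial centroids from the worked example
X = np.array([[1, 2], [2, 1], [2, 3], [5, 4], [6, 5], [5, 6]], dtype=float)
mu = np.array([[1, 1], [6, 6]], dtype=float)

# Step 2: assign each point to its nearest centroid
dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # shape (6, 2)
assign = dists.argmin(axis=1)
print(assign)  # → [0 0 0 1 1 1]: A,B,C join μ1; D,E,F join μ2

# Step 3: update each centroid to the mean of its assigned points
mu_new = np.array([X[assign == k].mean(axis=0) for k in range(2)])
print(mu_new.round(2))  # → [[1.67 2.  ] [5.33 5.  ]]
```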

Live Demo

K-Means Simulator

[Interactive widget: step through Init → Assign → Update on the 8-customer data; the table shows each customer's distance to μ1, μ2, μ3 and its assigned cluster, plotted on Income vs Spending.]

Algorithm: K-Means
1. Choose K centroids μ1, μ2, …, μK
2. For each point x_i: compute d(x_i, μ_k) = √(Σ_j (x_ij − μ_kj)²) and assign x_i to the cluster with minimum distance
3. For each cluster C_k: μ_k = (1/|C_k|) Σ_{x_i ∈ C_k} x_i
4. If any μ changed → go to step 2
5. Else → converged ✓
Critical Step

Always Scale Your Features First

K-Means uses Euclidean distance (straight-line distance: √Σ(ai − bi)²). Features with larger scales will dominate the distance calculation and distort the clusters.
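A quick illustration of why scaling matters, using made-up numbers (income in raw pesos vs age in years): without scaling, the nearest neighbour is decided almost entirely by income.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: income in raw pesos dwarfs age in years
X = np.array([[150_000, 25],
              [151_000, 60],   # income close to row 0, age very different
              [165_000, 26],   # age close to row 0, income different
              [170_000, 24]], dtype=float)

# Unscaled: row 0's nearest neighbour is row 1 — the 35-year age gap
# contributes almost nothing next to a 1,000-peso income gap
d = np.linalg.norm(X[0] - X[1:], axis=1)
print(d.argmin())  # → 0 (i.e. row 1 is "nearest")

# Scaled: the age gap now counts, and row 2 becomes the nearest
Xs = StandardScaler().fit_transform(X)
ds = np.linalg.norm(Xs[0] - Xs[1:], axis=1)
print(ds.argmin())  # → 1 (i.e. row 2 is nearest)
```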

Implementation

K-Means in Python

Steps

  1. Scale all features to mean=0, std=1
  2. Run KMeans(n_clusters=K) on scaled data
  3. Each point gets a cluster label

Key Parameters

  • n_clusters — number of clusters (K)
  • n_init=10 — runs with different random starts
  • random_state — seed for reproducibility
python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[features])

km = KMeans(n_clusters=3, n_init=10, random_state=42)
df['cluster'] = km.fit_predict(X_scaled)
Model Selection

Choosing K: Elbow Method

Plot inertia (WCSS) against K. Look for the "elbow" where the rate of decrease sharply levels off.

What is Inertia?

Sum of squared distances from each point to its assigned centroid. Think of it as "total cluster tightness" — lower = tighter, but it always decreases with more K (at K=8 every customer is its own cluster, inertia=0).

Pseudocode

  1. For K = 1, 2, 3, …, 10:
  2.    Run K-Means, record its inertia
  3. Plot K vs inertia → find the "bend"
python
import matplotlib.pyplot as plt

inertias = []
K_range = range(1, 11)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method')

# Our 8 customers — inertia values:
# K=1: 24.0  K=2: 10.2  K=3: 3.8
# K=4:  2.9  K=5:  2.1  K=6: 1.4
#
# K=1→2: -57%  K=2→3: -63%  ← BIG drops
# K=3→4: -24%  K=4→5: -28%  ← smaller
# → Elbow at K=3 (3 natural groups!)
Model Selection

Reading the Elbow: A Worked Example

The "elbow" is where adding more clusters stops giving meaningful improvement. Here's how to spot it.

Step-by-Step: Calculate % Improvement

| K | Inertia | Drop | % Improved | Verdict |
|---|---|---|---|---|
| 1 | 850 | — | — | — |
| 2 | 500 | −350 | 41% | Big gain |
| 3 | 280 | −220 | 44% | Big gain |
| 4 | 265 | −15 | 5% | Elbow! |
| 5 | 260 | −5 | 2% | Diminishing |

From K=3 to K=4, improvement drops from 44% to just 5%. That's the elbow → choose K=3.
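The Drop and % Improved columns above can be reproduced in a few lines:

```python
# Inertia values for K = 1..5, taken from the worked table above
inertias = [850, 500, 280, 265, 260]

pcts = []
for prev, cur in zip(inertias, inertias[1:]):
    pcts.append(round(100 * (prev - cur) / prev))

print(pcts)  # → [41, 44, 5, 2]: the 44% → 5% collapse marks the elbow at K=3
```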

What If There's No Clear Elbow?

Real data often has a smooth curve instead of a sharp bend. When this happens:

  1. Don't rely on the elbow alone — use Silhouette Score (next slide) as a second opinion
  2. Try the % improvement rule: stop when the drop is <10% of the previous drop
  3. Consider domain knowledge — does K=4 make sense for your business? (e.g., 4 customer tiers)

Common Mistakes

  • Picking K=1: Inertia is always highest at K=1 — that's just "no clustering"
  • Picking K=N: Inertia is 0 when K = number of points (each point is its own cluster) — useless
  • Ignoring the context: K=15 may have slightly better inertia than K=4, but can you actually explain 15 customer segments to a stakeholder?
Rule of Thumb

Always combine Elbow + Silhouette + Business sense. No single metric gives you the "correct" K — clustering is exploratory, not a test with one right answer.

Live Demo

Find the Best K

[Interactive widget: add K = 1…8 one at a time; the table shows inertia, drop %, and silhouette per K, alongside an elbow chart and a silhouette chart.]

Algorithm: Model Selection
1. For K = 1, 2, 3, …, 8: run KMeans(K) and record its inertia J = Σ ||x_i − μ_k||² (WCSS); if K ≥ 2, also compute the silhouette s = avg[(b − a) / max(a, b)]
2. Pick K at the elbow (inertia) or at the maximum (silhouette)
Model Selection

Choosing K: Silhouette Score

Measures how similar a point is to its own cluster vs. the nearest other cluster. Higher = better separated clusters.

Pseudocode

  1. For each K from 2 to 10: run K-Means
  2. For each point: compute a (avg dist to own cluster) and b (avg dist to nearest other cluster)
  3. Silhouette = average of all points' scores
  4. Pick K with the highest silhouette

s(i) = (b(i) − a(i)) / max(a(i), b(i))

  • a(i) — mean distance to points in your own cluster
  • b(i) — mean distance to the nearest other cluster
  • Range: −1 to +1 (higher = better)
Worked Example

If a=2 (avg distance within cluster) and b=5 (avg distance to nearest other cluster):
s = (5−2) / max(2,5) = 3/5 = 0.6 → reasonable structure.

python
from sklearn.metrics import silhouette_score

scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    scores.append(score)
    print(f"K={k}: Silhouette = {score:.3f}")

# Pick K with highest silhouette score
best_k = range(2, 11)[scores.index(max(scores))]
print(f"\nBest K = {best_k}")

# Example output:
# K=2: Silhouette = 0.42
# K=3: Silhouette = 0.58
# K=4: Silhouette = 0.61  ← BEST
# K=5: Silhouette = 0.55
# Best K = 4

# Interpretation guide:
# > 0.7  Strong structure
# 0.5-0.7  Reasonable structure
# 0.25-0.5 Weak, may be artificial
# < 0.25  No substantial structure
Validation

Beyond Silhouette: More Cluster Validation Metrics

Silhouette Score is not the only option. Use multiple metrics together for a more robust evaluation of clustering quality.

Davies-Bouldin Index

Ratio of within-cluster scatter to between-cluster separation. Lower = better (0 is perfect).

  • Does not require distance matrix (fast)
  • Penalizes clusters with similar centroids
  • Sensitive to outliers

Calinski-Harabasz Index

Ratio of between-cluster variance to within-cluster variance. Higher = better (no upper bound).

  • Also called Variance Ratio Criterion
  • Very fast to compute
  • Favors convex, dense clusters
python — compare all three metrics
from sklearn.metrics import (silhouette_score,
    davies_bouldin_score, calinski_harabasz_score)

for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X_scaled)
    print(f"K={k}:  Silhouette={silhouette_score(X_scaled, labels):.3f}"
          f"  DB={davies_bouldin_score(X_scaled, labels):.3f}"
          f"  CH={calinski_harabasz_score(X_scaled, labels):.0f}")
Analysis

Profiling Your Clusters

Clusters are only useful when you can interpret and name them. Compute summary statistics per cluster.

Pseudocode

  1. Group the DataFrame by cluster label
  2. Compute mean of each feature per group
  3. Look at the averages → name each group
  4. Ask: "Can I explain this to a non-technical person?"
Key Insight

If you can't name it, the clustering may not be meaningful.

python — our 8 customers
# Cluster profiles (using our data)
profile = df.groupby('cluster').agg({
    'income': 'mean',
    'age': 'mean',
    'spending': 'mean',
    'name': 'count'
}).round(1)
profile.columns = ['Avg Income', 'Avg Age',
                    'Avg Spending', 'Count']
print(profile)

# Output:
#   Avg Income  Avg Age  Avg Spending  Count
# 0      17.7     25.0          82.7      3
# 1      52.5     33.5          57.5      2
# 2      78.3     49.3          21.7      3

# Name them based on patterns:
names = {
    0: 'Young Big Spenders',
    1: 'Mid-Range Balanced',
    2: 'Wealthy & Frugal'
}
df['segment'] = df['cluster'].map(names)
print(df[['name', 'segment']])
# Ana   → Young Big Spenders
# Dina  → Mid-Range Balanced
# Hana  → Wealthy & Frugal
Visualization

2D Cluster Visualization with PCA

When features exceed 2 dimensions, use PCA (Principal Component Analysis) to project clusters onto a 2D plane for plotting.

Why PCA?

PCA finds the directions along which your data spreads out the most (has the most variance). Think of choosing the best camera angle to photograph a 3D object in 2D — you lose some depth, but keep the most informative view.

  • Variance = how much the data spreads along a direction
  • PC1 captures the biggest spread, PC2 the second-biggest
  • explained_variance_ratio_ tells how much info each component captures (e.g., 72% + 18% = 90% retained)
python
from sklearn.decomposition import PCA

# Our 8 customers have 3 features (3D)
# PCA squishes 3D → 2D for plotting
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# How much info did we keep?
print(pca.explained_variance_ratio_)
# [0.92, 0.07] → PC1=92%, PC2=7%
# Total: 99% of info kept in 2D!

# Plot clusters in 2D
plt.scatter(X_pca[:, 0], X_pca[:, 1],
    c=df['cluster'], cmap='viridis', s=80)

# Add customer names as labels
for i, name in enumerate(df['name']):
    plt.annotate(name,
        (X_pca[i,0], X_pca[i,1]))

plt.xlabel('PC1 (92% of variance)')
plt.ylabel('PC2 (7% of variance)')
# → 3 clear groups visible in 2D!
Limitations

When K-Means Fails

K-Means is fast and intuitive but makes strong assumptions that don't always hold.

Spherical Only

Assumes circular/spherical clusters. Fails on elongated or irregular shapes.

Random Init

Mitigated by k-means++ (smart init that spreads centroids apart — sklearn default) and n_init=10, but still possible.

Must Specify K

You must choose K before fitting. No automatic detection.

Outlier Sensitive

Outliers pull centroids, distorting cluster boundaries.

Equal Variance

Assumes clusters have similar size and density. If one group has 500 members and another 20, K-Means may split the big group instead.

Visual Intuition

What "Spherical Only" Actually Looks Like

K-Means draws straight-line boundaries between centroids. This works for round blobs, but fails spectacularly on other shapes.

Solution

Use DBSCAN (Lecture 18) for non-spherical clusters — it follows density, not distance to centroids.

Alternative Algorithm

Hierarchical Clustering

Builds a tree of clusters without needing to specify K upfront. Agglomerative (bottom-up) is the most common approach.

\begin{algorithm}
\caption{Agglomerative Clustering}
\begin{algorithmic}
\REQUIRE Dataset $X = \{x_1, \ldots, x_n\}$, linkage function $d$
\STATE Initialize each $x_i$ as its own cluster $C_i$
\WHILE{number of clusters $> 1$}
    \STATE $(C_a, C_b) \leftarrow \arg\min_{i \neq j} \, d(C_i, C_j)$
    \STATE Merge $C_a$ and $C_b$ into a single cluster
    \STATE Record merge distance (for dendrogram)
\ENDWHILE
\RETURN Dendrogram; cut at desired height for $K$ clusters
\end{algorithmic}
\end{algorithm}
                        
Advantage

The dendrogram lets you visually choose K after fitting by cutting at a chosen height.

Dendrogram (Y-axis = merge distance; longer lines = bigger gaps)

[Dendrogram: leaves A–F on the x-axis; height axis runs 0–7. Cutting at the dashed line yields Cluster 1, Cluster 2, Cluster 3.]
Beginner Guide

How to Read a Dendrogram

X-axis

Each leaf at the bottom is a data point (or sample). The order doesn't matter much — it's rearranged to avoid crossing lines.

Y-axis (Height)

The height where two branches merge = how dissimilar those clusters were. Higher merge = less similar. The tallest vertical line represents the biggest "jump" in dissimilarity.

Choosing K

Draw a horizontal cut line. The number of vertical lines it crosses = K. Tip: Cut where vertical lines are longest (biggest gap = most natural groupings).

Example from Previous Slide

Cut at height 4 → crosses 3 lines → K=3. Cut at height 2 → 5 clusters. Cut at height 7 → everything merges into 1.
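The "cut" can be performed programmatically with scipy's `fcluster`. A sketch on the six points from the earlier K-Means worked example (which form two natural groups):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six points with two natural groups (same data as the K-Means worked example)
X = np.array([[1, 2], [2, 1], [2, 3], [5, 4], [6, 5], [5, 6]], dtype=float)
Z = linkage(X, method='ward')

# criterion='distance' is the horizontal cut line at height t
labels_cut = fcluster(Z, t=4, criterion='distance')
# criterion='maxclust' asks for (at most) K clusters directly
labels_k = fcluster(Z, t=2, criterion='maxclust')

# On this data, both cuts recover the same two groups:
print(labels_cut)  # A, B, C share one label; D, E, F share the other
```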

Reading the Dendrogram

[Dendrogram: leaves A–F; the cut line crosses 3 branches → K=3. The tallest vertical line marks the biggest dissimilarity gap.]
Hierarchical Clustering

Linkage Methods: How to Measure Cluster Distance

The linkage criterion determines how the "distance" between two clusters is computed during merging.

| Method | Distance Computed As | Best For | Sensitivity |
|---|---|---|---|
| Single | Minimum distance between any two points | Chain-like / elongated clusters | Very sensitive to noise |
| Complete | Maximum distance between any two points | Compact, spherical clusters | Moderate |
| Average | Mean pairwise distance between all points | Balanced general-purpose | Moderate |
| Ward | Minimizes total within-cluster variance | Similar-size clusters (most popular) | Low (most robust) |
Implementation

Hierarchical Clustering in Python

Steps

  1. Compute linkage matrix (merge history)
  2. Plot dendrogram → visually pick K
  3. Fit AgglomerativeClustering with chosen K

Two Libraries

  • scipy — linkage + dendrogram
  • sklearn — AgglomerativeClustering
python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

Z = linkage(X_scaled, method='ward')
dendrogram(Z, truncate_mode='level', p=5)

hc = AgglomerativeClustering(n_clusters=4, linkage='ward')
df['cluster'] = hc.fit_predict(X_scaled)
Philippine Case Study

GCash Customer Segmentation

With 94+ million users, GCash uses clustering to segment customers by transaction behavior and tailor financial products.

RFM Features

  • Recency — days since last transaction
  • Frequency — number of transactions
  • Monetary — total amount transacted
python — GCash-style segmentation
# Build RFM features from transactions
from datetime import datetime
today = datetime.now()

rfm = transactions.groupby('user_id').agg({
    'txn_date': lambda x: (today - x.max()).days,
    'txn_id': 'count',
    'amount': 'sum'
}).rename(columns={
    'txn_date': 'recency',
    'txn_id': 'frequency',
    'amount': 'monetary'
})

# Scale and cluster
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm)

kmeans = KMeans(n_clusters=4, random_state=42)
rfm['segment'] = kmeans.fit_predict(rfm_scaled)

# Profile
rfm.groupby('segment').mean().round(1)
Philippine Case Study

Interpreting Customer Segments

After clustering, profile each segment and assign actionable marketing strategies.

| Segment | Recency | Frequency | Monetary | Strategy |
|---|---|---|---|---|
| Champions | Low (<15 days) | High (>10/mo) | High (>₱50k) | Reward loyalty, early access to GCredit |
| Potential | Low | Medium | Medium | Upsell GInvest, push GCash Mastercard |
| At Risk | High (lapsed) | Low | Medium | Win-back campaign, cashback offers |
| Lost | Very High | Low | Low | Exit survey, final reactivation attempt |
Activity

In-Class Exercise: Retail Segmentation

Tasks

  1. Load the provided retail transaction dataset
  2. Create RFM features from raw transactions
  3. Scale features with StandardScaler
  4. Run elbow + silhouette analysis to find optimal K
  5. Fit K-Means and profile each cluster
  6. Name your segments and propose strategies

Deliverables

  • Elbow plot and silhouette plot
  • PCA scatter plot colored by cluster
  • Summary table of cluster profiles
  • 1-paragraph marketing recommendation per segment
Time: 30 minutes · Mode: Pairs
Animation

K-Means on 60 Points — Watch It Converge

Lecture 17 Summary

Key Takeaways

1. Clustering finds natural groups in data without labels.

2. Always scale features before running K-Means.

3. Use elbow + silhouette together to choose K.

4. Hierarchical clustering shows the full merge tree via dendrograms.

Week 09 · Session 2

Advanced Clustering & Customer Analytics

DBSCAN, Market Basket Analysis, and RFM Segmentation

Department of Computer Science

University of the Philippines Cebu

Lecture 18: Beyond K-Means

Agenda

Lecture 18 Objectives

DBSCAN

Apply density-based clustering for arbitrary shapes and outlier detection.

Anomaly Detection

Use clustering and Isolation Forest to flag unusual observations.

Market Basket

Find product associations with Apriori (support, confidence, lift).

RFM Analysis

Score and segment customers by Recency, Frequency, Monetary value.

Density-Based Clustering

DBSCAN

Density-Based Spatial Clustering of Applications with Noise. Groups dense regions and labels sparse points as noise.

Two Parameters

  • eps (ε) — radius of neighborhood
  • min_samples — minimum points to form a dense region
\begin{algorithm}
\caption{DBSCAN}
\begin{algorithmic}
\REQUIRE Dataset $X$, radius $\varepsilon$, minimum points $m$
\FOR{each unvisited point $x_i \in X$}
    \STATE Mark $x_i$ as visited
    \STATE $N_i \leftarrow$ neighbors of $x_i$ within $\varepsilon$
    \IF{$|N_i| \geq m$}
        \STATE Create new cluster $C$; add $x_i$
        \STATE Expand $C$: recursively add all density-reachable points
    \ELSE
        \STATE Label $x_i$ as \textbf{noise}
    \ENDIF
\ENDFOR
\end{algorithmic}
\end{algorithm}
                        

Point Classification in DBSCAN

  • Core point — has ≥ min_samples neighbors within radius ε
  • Border point — within ε of a core point, but not itself core
  • Noise point — neither core nor border (outlier)
Implementation

DBSCAN in Python

Steps

  1. Pick eps (radius) and min_samples
  2. Points with ≥ min_samples neighbors → core
  3. Expand clusters from core points; leftovers = noise (−1)

Choosing Parameters

  • eps too small → everything is noise
  • eps too large → one giant cluster
  • min_samples ≈ 2 × n_features
python
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=1.2, min_samples=2)
labels = db.fit_predict(X_scaled)

n_clusters = len(set(labels) - {-1})
n_noise = list(labels).count(-1)
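Beyond the rules of thumb above, a common heuristic for picking eps (not part of sklearn itself) is the k-distance plot: sort every point's distance to its k-th nearest neighbor and look for the "knee" where the curve shoots up. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

k = 4  # roughly min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
dists, _ = nn.kneighbors(X)
# Querying the training set itself, column 0 is the point (distance 0),
# so the last column is the distance to the (k-1)-th other neighbor
k_dist = np.sort(dists[:, -1])

# Plot k_dist and set eps near the knee; as a crude stand-in for the
# visual inspection, look at a high percentile of the sorted curve
print(round(float(np.percentile(k_dist, 95)), 2))
```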
Live Demo

DBSCAN Explorer

[Interactive widget: classify each of the 8 customers as Core / Border / Noise and watch clusters grow on Income vs Spending.]

Algorithm: DBSCAN
1. For each point p: N_ε(p) = {q : dist(p, q) ≤ ε}
2. If |N_ε(p)| ≥ min_samples → Core; expand the cluster to all density-reachable points
3. Else if p ∈ N_ε(some core point) → Border
4. Else → Noise (label = −1)
Comparison

DBSCAN vs K-Means: When to Use Which?

| Aspect | K-Means | DBSCAN |
|---|---|---|
| Cluster shape | Spherical / convex only | Any arbitrary shape |
| Number of clusters | Must specify K | Discovered automatically |
| Outlier handling | Forces into nearest cluster | Labels as noise (−1) |
| Cluster density | Assumes uniform | Handles varying density |
| Scalability | Fast (O(nKt)) | Slower (O(n log n) with index) |
| Reproducibility | Varies by initialization | Nearly deterministic* |
| Best for | Customer segmentation, compression | Anomaly detection, spatial data |
*Border points reachable from multiple clusters may vary by processing order.
Application

Anomaly Detection with Clustering

Use clustering to identify observations that don't belong to any group — potential fraud, errors, or rare events.

Pseudocode

  1. Run a clustering algorithm on your data
  2. Points that are far from everyone = anomalies
  3. Flag the top 5% most isolated points
  4. Investigate manually — fraud? data error? rare event?

Three Approaches

  1. DBSCAN noise points: Points labeled −1 are anomalies
  2. Distance from centroid: Points far from their K-Means centroid are suspicious
  3. Isolation Forest: Isolates points via random partitions — anomalies need fewer splits because they have unusual values. Set contamination to the expected anomaly proportion (e.g., 0.05 = 5%)
python
# Method 1: DBSCAN noise detection
dbscan = DBSCAN(eps=0.5, min_samples=10)
labels = dbscan.fit_predict(X_scaled)
anomalies_db = df[labels == -1]
print(f"DBSCAN anomalies: {len(anomalies_db)}")

# Method 2: Centroid distance threshold
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(X_scaled)
distances = kmeans.transform(X_scaled).min(axis=1)

# Flag top 5% as anomalies
threshold = np.percentile(distances, 95)
anomalies_km = df[distances > threshold]
print(f"K-Means anomalies: {len(anomalies_km)}")

# Method 3: Isolation Forest
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.05)
pred = iso.fit_predict(X_scaled)
anomalies_if = df[pred == -1]  # -1 = anomaly
Association Mining

Market Basket Analysis

Goal: Discover products that are frequently purchased together. Powers product recommendations, store layouts, and promotional bundles.

Real-World Applications

  • "Customers who bought X also bought Y"
  • Optimize shelf placement in grocery stores
  • Design cross-selling promotions
  • Bundle pricing strategies

Example Transaction Data

TXN Items Purchased
1 Rice, Eggs, Milk
2 Rice, Eggs
3 Bread, Butter, Coffee
4 Rice, Milk, Coffee
5 Eggs, Milk, Bread

Which items are often bought together?

Association Rules

Three Key Metrics: Support, Confidence, Lift

  • Support(A → B) — fraction of transactions containing both A and B
  • Confidence(A → B) — Support(A, B) / Support(A): how often B is bought when A is
  • Lift(A → B) — Confidence(A → B) / Support(B): lift > 1 means A and B co-occur more than chance would predict
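These metrics can be computed by hand on the five transactions above, here for the candidate rule Rice → Eggs:

```python
# The five transactions from the example table
txns = [{'Rice', 'Eggs', 'Milk'},
        {'Rice', 'Eggs'},
        {'Bread', 'Butter', 'Coffee'},
        {'Rice', 'Milk', 'Coffee'},
        {'Eggs', 'Milk', 'Bread'}]
n = len(txns)

support_rice = sum('Rice' in t for t in txns) / n            # 3/5 = 0.6
support_eggs = sum('Eggs' in t for t in txns) / n            # 3/5 = 0.6
support_both = sum({'Rice', 'Eggs'} <= t for t in txns) / n  # 2/5 = 0.4

confidence = support_both / support_rice  # 0.4 / 0.6 ≈ 0.67
lift = confidence / support_eggs          # ≈ 1.11 → weak positive association
print(round(confidence, 2), round(lift, 2))  # → 0.67 1.11
```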

Implementation

Apriori Algorithm in Python

Steps

  1. Convert transactions to a boolean table
  2. Find itemsets with support ≥ min_support
  3. Generate rules (A → B) with lift > 1
Install

pip install mlxtend

python
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
basket = pd.DataFrame(
    te.fit_transform(transactions), columns=te.columns_)

freq = apriori(basket, min_support=0.3, use_colnames=True)
rules = association_rules(
    freq, num_itemsets=len(basket),
    metric='lift', min_threshold=1.0)
Philippine Case Study

Sari-Sari Store Basket Analysis

The Philippines has over 1 million sari-sari stores. Basket analysis helps small retailers optimize their limited shelf space and bundling strategies.

Illustrative Rules (Hypothetical Data)

  • Instant noodles → Eggs (Lift: 2.3)
  • Rice → Cooking oil (Lift: 1.9)
  • Coffee sachets → Bread (Lift: 2.1)
  • Shampoo sachets → Soap (Lift: 1.7)
Business Insight

Place eggs next to instant noodles, and bundle coffee + bread for "breakfast packs."

python — sari-sari analysis
# Load sari-sari store transactions
baskets = pd.read_csv('ph_sarisari_txns.csv')

# One-hot encode
encoded = baskets.groupby(
    ['txn_id', 'product']
)['qty'].sum().unstack().fillna(0)
encoded = (encoded > 0).astype(int)

# Find associations
freq = apriori(encoded, min_support=0.02,
               use_colnames=True)
rules = association_rules(
    freq, num_itemsets=len(encoded),
    metric='lift', min_threshold=1.5
)

# Top actionable rules
top = rules[
    (rules['confidence'] > 0.5) &
    (rules['lift'] > 2)
].sort_values('lift', ascending=False)
print(top.head(10))
Customer Analytics

RFM Analysis: The Gold Standard for Customer Segmentation

RFM segments customers by three behavioral dimensions, each answering a critical business question.

Implementation

RFM Scoring in Python

Steps

  1. Compute R (days since last buy), F (total orders), M (total spent)
  2. Divide each into 5 quintiles scored 1–5
  3. Concatenate → "555" = Champion, "111" = Lost
Note

Invert R: fewer days = higher score (labels=[5,4,3,2,1]).

python
snapshot = df['order_date'].max() + timedelta(1)
rfm = df.groupby('customer_id').agg({
    'order_date': lambda x: (snapshot - x.max()).days,
    'order_id': 'nunique',
    'amount': 'sum'
}).rename(columns={
    'order_date': 'recency',
    'order_id': 'frequency',
    'amount': 'monetary'
})
rfm['R'] = pd.qcut(rfm['recency'], 5, labels=[5,4,3,2,1])
rfm['F'] = pd.qcut(rfm['frequency'].rank(method='first'), 5, labels=[1,2,3,4,5])
rfm['M'] = pd.qcut(rfm['monetary'], 5, labels=[1,2,3,4,5])
rfm['RFM'] = rfm['R'].astype(str) + rfm['F'].astype(str) + rfm['M'].astype(str)
Customer Analytics

RFM Segment Definitions & Actions

| Segment | R | F | M | Description | Marketing Action |
|---|---|---|---|---|---|
| Champions | 5 | 5 | 5 | Best customers, recent & frequent | Loyalty rewards, referral program |
| Loyal | 3-5 | 4-5 | 3-5 | Consistent, reliable buyers | Upsell, premium access |
| Potential | 4-5 | 1-3 | 1-3 | Recent but low engagement | Onboarding, education |
| At Risk | 1-2 | 3-5 | 3-5 | Were loyal, now lapsing | Win-back campaign, discounts |
| Hibernating | 1-2 | 1-2 | 2-3 | Long inactive, moderate value | Reactivation offer |
| Lost | 1 | 1 | 1 | Inactive, low value | Exit survey, deprioritize |
Advanced Technique

Clustering + RFM: Best of Both

Instead of manually defining segments with quintile thresholds, let K-Means discover natural groupings in the RFM space.

Why Combine?

Manual RFM uses arbitrary thresholds. Clustering finds the actual boundaries in your data — often more meaningful and data-driven.

Pro Tip

Monetary values are often heavily right-skewed. Apply np.log1p() before scaling to prevent a few high-spenders from dominating the clusters. Without log: [100, 500, 50000] has huge range. After log1p: [4.6, 6.2, 10.8] — much more evenly spaced.

python
# Use K-Means on RFM features
rfm_features = rfm[['recency', 'frequency',
                     'monetary']].copy()

# Log-transform skewed monetary column
rfm_features['monetary'] = np.log1p(
    rfm_features['monetary'])

scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_features)

# Find optimal K
from sklearn.metrics import silhouette_score
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(rfm_scaled)
    sil = silhouette_score(rfm_scaled, labels)
    print(f"K={k}: Silhouette={sil:.3f}")

# Fit best K (from silhouette loop above)
best_k = max(range(2, 8), key=lambda k:
    silhouette_score(rfm_scaled,
    KMeans(n_clusters=k, random_state=42)
    .fit_predict(rfm_scaled)))
best_km = KMeans(n_clusters=best_k, random_state=42)
rfm['cluster'] = best_km.fit_predict(rfm_scaled)

# Profile and name clusters
profile = rfm.groupby('cluster').agg({
    'recency': 'mean',
    'frequency': 'mean',
    'monetary': ['mean', 'count']
}).round(1)
print(profile)
Decision Guide

Which Clustering Algorithm Should I Use?

K-Means

  • You know (or can estimate) K
  • Clusters are roughly spherical
  • Large datasets (fast)
  • Customer segmentation
  • Image compression

Hierarchical

  • Want to explore different K values
  • Need a visual hierarchy (dendrogram)
  • Small-medium datasets (<10K)
  • Taxonomy / phylogenetics
  • Organizational structure

DBSCAN

  • Don't know K at all
  • Non-spherical / irregular clusters
  • Need outlier detection
  • Spatial / geographic data
  • Fraud / anomaly detection
Reference

The sklearn Clustering Landscape

Every algorithm in sklearn.cluster has trade-offs. Use this table to pick the right one.

| Algorithm | Parameters | Scalability | Use Case | Geometry |
|---|---|---|---|---|
| K-Means | n_clusters | Very large n, medium K | General-purpose, even clusters | Euclidean (spherical) |
| Mini Batch K-Means | n_clusters, batch_size | Huge datasets (streaming) | Same as K-Means, faster | Euclidean (spherical) |
| Agglomerative | n_clusters, linkage | Large n with connectivity | Hierarchies, dendrograms | Any pairwise distance |
| DBSCAN | eps, min_samples | Very large n, medium K | Irregular shapes, outliers | Nearest-point distance |
| HDBSCAN | min_cluster_size | Large n, medium K | Variable density, no eps tuning | Nearest-point distance |
| OPTICS | min_samples, xi | Very large n, large K | Variable density, reachability plots | Point distances |
| Gaussian Mixture | n_components, cov_type | Not scalable | Soft clustering, density estimation | Mahalanobis |
| Spectral | n_clusters, affinity | Medium n, small K | Non-flat manifolds, graphs | Graph distance |
Rule of Thumb

Start with K-Means. If clusters aren't spherical, try DBSCAN. If density varies, use HDBSCAN. Need probabilities? Use GMM.

Going Further

Beyond the Basics: Four More Algorithms

K-Means, Hierarchical, and DBSCAN cover most use cases. These four solve specific pain points.

Gaussian Mixture Models (GMM)

Soft clustering — each point gets a probability of belonging to each cluster.

  • Models clusters as Gaussian distributions
  • Handles elliptical (non-spherical) clusters
  • BIC/AIC to select number of components
Best for: Overlapping clusters, uncertainty estimation, anomaly scoring.
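A minimal sketch of GMM's soft assignments on synthetic blobs — `predict_proba` returns a probability per component for each point, and BIC can compare candidate component counts:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=42)

gmm = GaussianMixture(n_components=2, random_state=42).fit(X)

# Soft clustering: each row is a probability distribution over components
proba = gmm.predict_proba(X[:3])
print(proba.round(3))  # each row sums to 1

# BIC for choosing the number of components (lower is better)
bics = {k: GaussianMixture(n_components=k, random_state=42).fit(X).bic(X)
        for k in range(1, 5)}
print(min(bics, key=bics.get))
```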

HDBSCAN

Hierarchical DBSCAN — no eps to tune.

  • Finds clusters at varying densities automatically
  • Only needs min_cluster_size
  • Provides cluster stability scores
Best for: Mixed-density clusters, when DBSCAN's eps is hard to set.

Mini Batch K-Means

Streaming K-Means — uses random mini-batches instead of the full dataset per iteration.

  • Much faster convergence on large data
  • Slightly lower quality than standard K-Means
  • sklearn.cluster.MiniBatchKMeans
Best for: Datasets too large for standard K-Means (>100K rows).

OPTICS

Generalized DBSCAN — one run replaces many DBSCAN runs at different eps values.

  • Produces a reachability plot showing density hierarchy
  • Extract clusters at any density threshold
  • O(n) memory (better than HDBSCAN for huge data)
Best for: Exploring cluster structure at multiple scales, very large datasets.
Key Concept

Can Your Model Label New Data?

Some clustering algorithms can predict labels for unseen data. Others can only label the data they were trained on.

Inductive

Has a .predict() method. Learn once, apply to new data forever.

K-Means, Mini Batch K-Means, GMM, BIRCH

Transductive

Only has .fit_predict(). Must re-fit if new data arrives.

DBSCAN, HDBSCAN, OPTICS, Hierarchical, Spectral

When does it matter?

Scenario Need
New customer signs up → assign to segment Inductive
One-time analysis of existing data Either works
Real-time fraud detection on streaming data Inductive
Exploring spatial patterns in a fixed dataset Transductive OK
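The distinction is visible directly in sklearn's API — a sketch:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

# Inductive: K-Means can label a point it has never seen
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
new_point = np.array([[0.0, 0.0]])
print(km.predict(new_point))   # assigns the new point to a learned centroid

# Transductive: DBSCAN has no predict method at all
db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print(hasattr(db, 'predict'))  # → False: new data requires a full re-fit
```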
Tip: If you need .predict() but want DBSCAN-like behavior, train HDBSCAN then use its approximate_predict() method.
Activity

In-Class Exercise: Complete Customer Analytics

Tasks

  1. Load the Philippine e-commerce dataset
  2. Calculate RFM metrics from raw transactions
  3. Cluster customers with K-Means (find best K)
  4. Compare with DBSCAN (identify anomalous customers)
  5. Run basket analysis on product purchases
  6. Create a 1-page segment recommendation report

Expected Outputs

  • Elbow + silhouette plots
  • PCA scatter plot by cluster
  • RFM cluster profile table
  • Top 5 association rules with interpretation
  • DBSCAN anomaly count and examples
  • Marketing recommendations per segment
Time: 35 minutes · Mode: Pairs
Animation

DBSCAN on Crescents — Where K-Means Fails

Lecture 18 Summary

Key Takeaways

1. DBSCAN finds arbitrary-shaped clusters and naturally identifies outliers as noise.

2. Anomaly detection uses distance from normal patterns (DBSCAN noise, centroid distance, Isolation Forest).

3. Market basket analysis reveals product co-occurrence patterns via support, confidence, and lift.

4. RFM + Clustering creates data-driven customer segments with clear marketing actions.

Coming Up

Lab 9: Customer Segmentation Project

A comprehensive end-to-end customer analytics exercise combining everything from this week.

Next Week: Time Series Analytics — Forecasting trends, seasonality, and building ARIMA models.

Lab 9 Results

Expected Output: Elbow & Silhouette Analysis

Elbow Method (Inertia vs K)

Silhouette Score vs K

Interpretation: The elbow occurs at K=3 (inertia drops sharply before, flattens after). Silhouette score peaks at K=3 (0.45), confirming 3 well-separated clusters in the PSA regional data.

Lab 9 Results

Expected Output: PCA Cluster Visualization

Cluster 0: Metro Powerhouses

NCR, CALABARZON, Region III — High GDP, low poverty, high tourism

Cluster 1: Mid-Tier Regions

Region VI, VII, XI, X — Moderate GDP, mixed poverty rates

Cluster 2: Rural/Underserved

BARMM, Region V, VIII, IX, XII — High poverty, low GDP

Lab 9 Results

Expected Output: Hierarchical Dendrogram

[Dendrogram (Ward distance axis 0–8): red dashed cut at ~4 → K=3. Leaves include NCR, CALABARZON, Region III, VII, VI, XI, X, CAR, I, II, V, VIII, IX, XII, and BARMM, which merges last.]

Reading the dendrogram: Cutting at distance ~4 (red dashed line) yields 3 clusters that match K-Means results. BARMM merges last — it's the most distinct region, driven by 63% poverty rate and lowest literacy.