Week 09 · Session 1

Clustering & Segmentation

Finding Hidden Groups in Unlabeled Data

Department of Computer Science

University of the Philippines Cebu

Lecture 17: K-Means & Hierarchical

The Power of Grouping

"The goal is to turn data into information, and information into insight."

— Carly Fiorina

Agenda

Lecture 17 Objectives

Supervised vs Unsupervised

Distinguish labeled from unlabeled learning paradigms.

K-Means

Apply the K-Means algorithm and choose optimal K.

Hierarchical

Build dendrograms and compare linkage methods.

Interpretation

Profile clusters and translate them into business actions.

Foundations

Two Paradigms of Machine Learning

In Weeks 7-8 we built supervised models (regression, trees) that learn from labeled outcomes. This week we remove the labels entirely.

| | Supervised | Unsupervised |
|---|---|---|
| Labels | Has a target variable (y) | No labels at all |
| Goal | Predict an outcome | Discover hidden structure |
| Methods | Regression, Classification | Clustering, Dimensionality Reduction |
| Question | "What will happen?" | "What patterns exist?" |
| Evaluation | Accuracy, RMSE, AUC | Silhouette (cluster quality), Inertia (cluster tightness) |
Definition

What is Clustering?

Clustering is the task of partitioning data points into groups (clusters) so that points within the same cluster are more similar to each other than to points in other clusters.

Unlike classification, no one tells the algorithm what the groups should be — it discovers them from the data alone.

Key Terms

Cluster A group of similar data points
Centroid The "center" of a cluster (mean of all its points)
Distance metric How we measure similarity (e.g., Euclidean distance)
K The number of clusters (chosen by the analyst)

Points naturally group by proximity.
The × marks each cluster's centroid.
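Everything that follows depends on the distance metric, so here is a minimal sketch of the Euclidean distance computation (the points are illustrative):

```python
import math

# Euclidean distance: straight-line distance between two points
def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

a = (1, 2)
b = (5, 5)
print(euclidean(a, b))  # → 5.0, since √(4² + 3²) = 5
```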

Motivation

Why Clustering Matters in Industry

Clustering is the most widely deployed unsupervised learning technique. Every major tech company uses it daily.

Running Example

Meet Our 8 Customers

We'll use this tiny dataset throughout both lectures. Small enough to trace by hand, big enough to show real patterns.

Three Features

  • Income — annual income in thousands (₱)
  • Age — years old
  • Spending — monthly spending score (0-100)
Can you see the groups?

Look at the table. Before any algorithm, can you spot which customers are similar? That's exactly what clustering will automate.

| ID | Name | Income (k₱) | Age | Spending |
|---|---|---|---|---|
| 1 | Ana | 15 | 22 | 82 |
| 2 | Ben | 18 | 25 | 78 |
| 3 | Cris | 20 | 28 | 88 |
| 4 | Dina | 50 | 35 | 55 |
| 5 | Erik | 55 | 32 | 60 |
| 6 | Faye | 70 | 45 | 28 |
| 7 | Gino | 75 | 48 | 22 |
| 8 | Hana | 90 | 55 | 15 |

Color legend from the slide: Young & High-Spending (IDs 1–3) · Mid-Range (IDs 4–5) · Wealthy & Low-Spending (IDs 6–8)
Core Algorithm

K-Means Clustering

Partition n observations into K clusters by minimizing within-cluster variance.

\begin{algorithm}
\caption{K-Means Clustering}
\begin{algorithmic}
\REQUIRE Dataset $X = \{x_1, \ldots, x_n\}$, number of clusters $K$
\STATE Initialize centroids $\mu_1, \ldots, \mu_K$ randomly from $X$
\REPEAT
    \FOR{each $x_i \in X$}
        \STATE $c_i \leftarrow \arg\min_j \|x_i - \mu_j\|^2$
    \ENDFOR
    \FOR{$j = 1$ \TO $K$}
        \STATE $\mu_j \leftarrow \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$
    \ENDFOR
\UNTIL{centroids do not change}
\end{algorithmic}
\end{algorithm}
                        

Objective Function

Minimize Within-Cluster Sum of Squares (WCSS):
J = Σ_k Σ_{i ∈ C_k} ||x_i − μ_k||²

  • x_i — a data point
  • μ_k — centroid of cluster k
  • C_k — set of points in cluster k
  • ||·||² — squared Euclidean distance
[Figure: K-Means with K=3 — three clusters; + marks each centroid.]
Worked Example

K-Means Iteration — By Hand

Let's trace one complete iteration with K=2 on 6 points. Grab a pen!

| Point | Coords | Dist to μ1 | Dist to μ2 | Assign |
|---|---|---|---|---|
| A | (1, 2) | 1.0 | 6.4 | C1 |
| B | (2, 1) | 1.0 | 6.4 | C1 |
| C | (2, 3) | 2.2 | 5.0 | C1 |
| D | (5, 4) | 5.0 | 2.2 | C2 |
| E | (6, 5) | 6.4 | 1.0 | C2 |
| F | (5, 6) | 6.4 | 1.0 | C2 |

Distances = √((x − cx)² + (y − cy)²), computed against μ1 = (1, 1) and μ2 = (6, 6), rounded to 1 decimal.

Step 1: Pick initial centroids

μ1 = (1, 1)   μ2 = (6, 6)

Step 2: Assign → See table

Each point goes to its nearest centroid (lower distance wins).

Step 3: Update centroids

μ1′ = mean of A, B, C = (1.67, 2.0)
μ2′ = mean of D, E, F = (5.33, 5.0)

Step 4: Repeat

Centroids shifted! Recalculate distances with new centroids and reassign. Continue until centroids stop moving.
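The hand-traced iteration above can be checked in NumPy — a short sketch using the same six points and initial centroids:

```python
import numpy as np

# The six points and initial centroids from the worked example
X = np.array([[1, 2], [2, 1], [2, 3], [5, 4], [6, 5], [5, 6]], dtype=float)
mu = np.array([[1, 1], [6, 6]], dtype=float)

# Step 2: assign each point to its nearest centroid
dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # shape (6, 2)
assign = dists.argmin(axis=1)
print(assign)  # → [0 0 0 1 1 1]: A,B,C join μ1; D,E,F join μ2

# Step 3: update each centroid to the mean of its assigned points
mu_new = np.array([X[assign == k].mean(axis=0) for k in range(2)])
print(mu_new.round(2))  # → [[1.67 2.  ] [5.33 5.  ]]
```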

Live Demo

K-Means Simulator

[Interactive widget: step through Init → Assign → Update on the 8-customer data; the table shows each customer's distance to μ1, μ2, μ3 and its assigned cluster, plotted on Income vs Spending.]

Algorithm: K-Means
1. Choose K centroids μ1, μ2, …, μK
2. For each point x_i: compute d(x_i, μ_k) = √(Σ_j (x_ij − μ_kj)²) and assign x_i to the cluster with minimum distance
3. For each cluster C_k: μ_k = (1/|C_k|) Σ_{x_i ∈ C_k} x_i
4. If any μ changed → go to step 2
5. Else → converged ✓
Critical Step

Always Scale Your Features First

K-Means uses Euclidean distance (straight-line distance: √Σ(ai − bi)²). Features with larger scales will dominate the distance calculation and distort the clusters.
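A quick illustration of why scaling matters, using made-up numbers (income in raw pesos vs age in years): without scaling, the nearest neighbour is decided almost entirely by income.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative data: income in raw pesos dwarfs age in years
X = np.array([[150_000, 25],
              [151_000, 60],   # income close to row 0, age very different
              [165_000, 26],   # age close to row 0, income different
              [170_000, 24]], dtype=float)

# Unscaled: row 0's nearest neighbour is row 1 — the 35-year age gap
# contributes almost nothing next to a 1,000-peso income gap
d = np.linalg.norm(X[0] - X[1:], axis=1)
print(d.argmin())  # → 0 (i.e. row 1 is "nearest")

# Scaled: the age gap now counts, and row 2 becomes the nearest
Xs = StandardScaler().fit_transform(X)
ds = np.linalg.norm(Xs[0] - Xs[1:], axis=1)
print(ds.argmin())  # → 1 (i.e. row 2 is nearest)
```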

Implementation

K-Means in Python

Steps

  1. Scale all features to mean=0, std=1
  2. Run KMeans(n_clusters=K) on scaled data
  3. Each point gets a cluster label

Key Parameters

  • n_clusters — number of clusters (K)
  • n_init=10 — runs with different random starts
  • random_state — seed for reproducibility
python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(df[features])

km = KMeans(n_clusters=3, n_init=10, random_state=42)
df['cluster'] = km.fit_predict(X_scaled)
Model Selection

Choosing K: Elbow Method

Plot inertia (WCSS) against K. Look for the "elbow" where the rate of decrease sharply levels off.

What is Inertia?

Sum of squared distances from each point to its assigned centroid. Think of it as "total cluster tightness" — lower = tighter, but it always decreases with more K (at K=8 every customer is its own cluster, inertia=0).

Pseudocode

  1. For K = 1, 2, 3, …, 10:
  2.    Run K-Means, record its inertia
  3. Plot K vs inertia → find the "bend"
python
import matplotlib.pyplot as plt

inertias = []
K_range = range(1, 11)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method')

# Our 8 customers — inertia values:
# K=1: 24.0  K=2: 10.2  K=3: 3.8
# K=4:  2.9  K=5:  2.1  K=6: 1.4
#
# K=1→2: -57%  K=2→3: -63%  ← BIG drops
# K=3→4: -24%  K=4→5: -28%  ← smaller
# → Elbow at K=3 (3 natural groups!)
Model Selection

Reading the Elbow: A Worked Example

The "elbow" is where adding more clusters stops giving meaningful improvement. Here's how to spot it.

Step-by-Step: Calculate % Improvement

| K | Inertia | Drop | % Improved | Verdict |
|---|---|---|---|---|
| 1 | 850 | — | — | — |
| 2 | 500 | −350 | 41% | Big gain |
| 3 | 280 | −220 | 44% | Big gain |
| 4 | 265 | −15 | 5% | Elbow! |
| 5 | 260 | −5 | 2% | Diminishing |

From K=3 to K=4, improvement drops from 44% to just 5%. That's the elbow → choose K=3.
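The Drop and % Improved columns above can be reproduced in a few lines:

```python
# Inertia values for K = 1..5, taken from the worked table above
inertias = [850, 500, 280, 265, 260]

pcts = []
for prev, cur in zip(inertias, inertias[1:]):
    pcts.append(round(100 * (prev - cur) / prev))

print(pcts)  # → [41, 44, 5, 2]: the 44% → 5% collapse marks the elbow at K=3
```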

What If There's No Clear Elbow?

Real data often has a smooth curve instead of a sharp bend. When this happens:

  1. Don't rely on the elbow alone — use Silhouette Score (next slide) as a second opinion
  2. Try the % improvement rule: stop when the drop is <10% of the previous drop
  3. Consider domain knowledge — does K=4 make sense for your business? (e.g., 4 customer tiers)

Common Mistakes

  • Picking K=1: Inertia is always highest at K=1 — that's just "no clustering"
  • Picking K=N: Inertia is 0 when K = number of points (each point is its own cluster) — useless
  • Ignoring the context: K=15 may have slightly better inertia than K=4, but can you actually explain 15 customer segments to a stakeholder?
Rule of Thumb

Always combine Elbow + Silhouette + Business sense. No single metric gives you the "correct" K — clustering is exploratory, not a test with one right answer.

Live Demo

Find the Best K

[Interactive widget: add K = 1…8 one at a time; the table shows inertia, drop %, and silhouette per K, alongside an elbow chart and a silhouette chart.]

Algorithm: Model Selection
1. For K = 1, 2, 3, …, 8: run KMeans(K) and record its inertia J = Σ ||x_i − μ_k||² (WCSS); if K ≥ 2, also compute the silhouette s = avg[(b − a) / max(a, b)]
2. Pick K at the elbow (inertia) or at the maximum (silhouette)
Model Selection

Choosing K: Silhouette Score

Measures how similar a point is to its own cluster vs. the nearest other cluster. Higher = better separated clusters.

Pseudocode

  1. For each K from 2 to 10: run K-Means
  2. For each point: compute a (avg dist to own cluster) and b (avg dist to nearest other cluster)
  3. Silhouette = average of all points' scores
  4. Pick K with the highest silhouette

s(i) = (b(i) − a(i)) / max(a(i), b(i))

  • a(i) — mean distance to points in your own cluster
  • b(i) — mean distance to the nearest other cluster
  • Range: −1 to +1 (higher = better)
Worked Example

If a=2 (avg distance within cluster) and b=5 (avg distance to nearest other cluster):
s = (5−2) / max(2,5) = 3/5 = 0.6 → reasonable structure.

python
from sklearn.metrics import silhouette_score

scores = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    scores.append(score)
    print(f"K={k}: Silhouette = {score:.3f}")

# Pick K with highest silhouette score
best_k = range(2, 11)[scores.index(max(scores))]
print(f"\nBest K = {best_k}")

# Example output:
# K=2: Silhouette = 0.42
# K=3: Silhouette = 0.58
# K=4: Silhouette = 0.61  ← BEST
# K=5: Silhouette = 0.55
# Best K = 4

# Interpretation guide:
# > 0.7  Strong structure
# 0.5-0.7  Reasonable structure
# 0.25-0.5 Weak, may be artificial
# < 0.25  No substantial structure
Validation

Beyond Silhouette: More Cluster Validation Metrics

Silhouette Score is not the only option. Use multiple metrics together for a more robust evaluation of clustering quality.

Davies-Bouldin Index

Ratio of within-cluster scatter to between-cluster separation. Lower = better (0 is perfect).

  • Does not require distance matrix (fast)
  • Penalizes clusters with similar centroids
  • Sensitive to outliers

Calinski-Harabasz Index

Ratio of between-cluster variance to within-cluster variance. Higher = better (no upper bound).

  • Also called Variance Ratio Criterion
  • Very fast to compute
  • Favors convex, dense clusters
python — compare all three metrics
from sklearn.metrics import (silhouette_score,
    davies_bouldin_score, calinski_harabasz_score)

for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X_scaled)
    print(f"K={k}:  Silhouette={silhouette_score(X_scaled, labels):.3f}"
          f"  DB={davies_bouldin_score(X_scaled, labels):.3f}"
          f"  CH={calinski_harabasz_score(X_scaled, labels):.0f}")
Analysis

Profiling Your Clusters

Clusters are only useful when you can interpret and name them. Compute summary statistics per cluster.

Pseudocode

  1. Group the DataFrame by cluster label
  2. Compute mean of each feature per group
  3. Look at the averages → name each group
  4. Ask: "Can I explain this to a non-technical person?"
Key Insight

If you can't name it, the clustering may not be meaningful.

python — our 8 customers
# Cluster profiles (using our data)
profile = df.groupby('cluster').agg({
    'income': 'mean',
    'age': 'mean',
    'spending': 'mean',
    'name': 'count'
}).round(1)
profile.columns = ['Avg Income', 'Avg Age',
                    'Avg Spending', 'Count']
print(profile)

# Output:
#   Avg Income  Avg Age  Avg Spending  Count
# 0      17.7     25.0          82.7      3
# 1      52.5     33.5          57.5      2
# 2      78.3     49.3          21.7      3

# Name them based on patterns:
names = {
    0: 'Young Big Spenders',
    1: 'Mid-Range Balanced',
    2: 'Wealthy & Frugal'
}
df['segment'] = df['cluster'].map(names)
print(df[['name', 'segment']])
# Ana   → Young Big Spenders
# Dina  → Mid-Range Balanced
# Hana  → Wealthy & Frugal
Visualization

2D Cluster Visualization with PCA

When features exceed 2 dimensions, use PCA (Principal Component Analysis) to project clusters onto a 2D plane for plotting.

Why PCA?

PCA finds the directions along which your data spreads out the most (has the most variance). Think of choosing the best camera angle to photograph a 3D object in 2D — you lose some depth, but keep the most informative view.

  • Variance = how much the data spreads along a direction
  • PC1 captures the biggest spread, PC2 the second-biggest
  • explained_variance_ratio_ tells how much info each component captures (e.g., 72% + 18% = 90% retained)
python
from sklearn.decomposition import PCA

# Our 8 customers have 3 features (3D)
# PCA squishes 3D → 2D for plotting
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# How much info did we keep?
print(pca.explained_variance_ratio_)
# [0.92, 0.07] → PC1=92%, PC2=7%
# Total: 99% of info kept in 2D!

# Plot clusters in 2D
plt.scatter(X_pca[:, 0], X_pca[:, 1],
    c=df['cluster'], cmap='viridis', s=80)

# Add customer names as labels
for i, name in enumerate(df['name']):
    plt.annotate(name,
        (X_pca[i,0], X_pca[i,1]))

plt.xlabel('PC1 (92% of variance)')
plt.ylabel('PC2 (7% of variance)')
# → 3 clear groups visible in 2D!
Limitations

When K-Means Fails

K-Means is fast and intuitive but makes strong assumptions that don't always hold.

Spherical Only

Assumes circular/spherical clusters. Fails on elongated or irregular shapes.

Random Init

Mitigated by k-means++ (smart init that spreads centroids apart — sklearn default) and n_init=10, but still possible.

Must Specify K

You must choose K before fitting. No automatic detection.

Outlier Sensitive

Outliers pull centroids, distorting cluster boundaries.

Equal Variance

Assumes clusters have similar size and density. If one group has 500 members and another 20, K-Means may split the big group instead.

Visual Intuition

What "Spherical Only" Actually Looks Like

K-Means draws straight-line boundaries between centroids. This works for round blobs, but fails spectacularly on other shapes.

Solution

Use DBSCAN (Lecture 18) for non-spherical clusters — it follows density, not distance to centroids.

Alternative Algorithm

Hierarchical Clustering

Builds a tree of clusters without needing to specify K upfront. Agglomerative (bottom-up) is the most common approach.

\begin{algorithm}
\caption{Agglomerative Clustering}
\begin{algorithmic}
\REQUIRE Dataset $X = \{x_1, \ldots, x_n\}$, linkage function $d$
\STATE Initialize each $x_i$ as its own cluster $C_i$
\WHILE{number of clusters $> 1$}
    \STATE $(C_a, C_b) \leftarrow \arg\min_{i \neq j} \, d(C_i, C_j)$
    \STATE Merge $C_a$ and $C_b$ into a single cluster
    \STATE Record merge distance (for dendrogram)
\ENDWHILE
\RETURN Dendrogram; cut at desired height for $K$ clusters
\end{algorithmic}
\end{algorithm}
                        
Advantage

The dendrogram lets you visually choose K after fitting by cutting at a chosen height.

Dendrogram (Y-axis = merge distance; longer lines = bigger gaps)

[Dendrogram: leaves A–F on the x-axis; height axis runs 0–7. Cutting at the dashed line yields Cluster 1, Cluster 2, Cluster 3.]
Beginner Guide

How to Read a Dendrogram

X-axis

Each leaf at the bottom is a data point (or sample). The order doesn't matter much — it's rearranged to avoid crossing lines.

Y-axis (Height)

The height where two branches merge = how dissimilar those clusters were. Higher merge = less similar. The tallest vertical line represents the biggest "jump" in dissimilarity.

Choosing K

Draw a horizontal cut line. The number of vertical lines it crosses = K. Tip: Cut where vertical lines are longest (biggest gap = most natural groupings).

Example from Previous Slide

Cut at height 4 → crosses 3 lines → K=3. Cut at height 2 → 5 clusters. Cut at height 7 → everything merges into 1.
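The "cut" can be performed programmatically with scipy's `fcluster`. A sketch on the six points from the earlier K-Means worked example (which form two natural groups):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six points with two natural groups (same data as the K-Means worked example)
X = np.array([[1, 2], [2, 1], [2, 3], [5, 4], [6, 5], [5, 6]], dtype=float)
Z = linkage(X, method='ward')

# criterion='distance' is the horizontal cut line at height t
labels_cut = fcluster(Z, t=4, criterion='distance')
# criterion='maxclust' asks for (at most) K clusters directly
labels_k = fcluster(Z, t=2, criterion='maxclust')

# On this data, both cuts recover the same two groups:
print(labels_cut)  # A, B, C share one label; D, E, F share the other
```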

Reading the Dendrogram

[Dendrogram: leaves A–F; the cut line crosses 3 branches → K=3. The tallest vertical line marks the biggest dissimilarity gap.]
Hierarchical Clustering

Linkage Methods: How to Measure Cluster Distance

The linkage criterion determines how the "distance" between two clusters is computed during merging.

| Method | Distance Computed As | Best For | Sensitivity |
|---|---|---|---|
| Single | Minimum distance between any two points | Chain-like / elongated clusters | Very sensitive to noise |
| Complete | Maximum distance between any two points | Compact, spherical clusters | Moderate |
| Average | Mean pairwise distance between all points | Balanced general-purpose | Moderate |
| Ward | Minimizes total within-cluster variance | Similar-size clusters (most popular) | Low (most robust) |
Implementation

Hierarchical Clustering in Python

Steps

  1. Compute linkage matrix (merge history)
  2. Plot dendrogram → visually pick K
  3. Fit AgglomerativeClustering with chosen K

Two Libraries

  • scipy — linkage + dendrogram
  • sklearn — AgglomerativeClustering
python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

Z = linkage(X_scaled, method='ward')
dendrogram(Z, truncate_mode='level', p=5)

hc = AgglomerativeClustering(n_clusters=4, linkage='ward')
df['cluster'] = hc.fit_predict(X_scaled)
Philippine Case Study

GCash Customer Segmentation

With 94+ million users, GCash uses clustering to segment customers by transaction behavior and tailor financial products.

RFM Features

  • Recency — days since last transaction
  • Frequency — number of transactions
  • Monetary — total amount transacted
python — GCash-style segmentation
# Build RFM features from transactions
from datetime import datetime
today = datetime.now()

rfm = transactions.groupby('user_id').agg({
    'txn_date': lambda x: (today - x.max()).days,
    'txn_id': 'count',
    'amount': 'sum'
}).rename(columns={
    'txn_date': 'recency',
    'txn_id': 'frequency',
    'amount': 'monetary'
})

# Scale and cluster
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm)

kmeans = KMeans(n_clusters=4, random_state=42)
rfm['segment'] = kmeans.fit_predict(rfm_scaled)

# Profile
rfm.groupby('segment').mean().round(1)
Philippine Case Study

Interpreting Customer Segments

After clustering, profile each segment and assign actionable marketing strategies.

| Segment | Recency | Frequency | Monetary | Strategy |
|---|---|---|---|---|
| Champions | Low (<15 days) | High (>10/mo) | High (>₱50k) | Reward loyalty, early access to GCredit |
| Potential | Low | Medium | Medium | Upsell GInvest, push GCash Mastercard |
| At Risk | High (lapsed) | Low | Medium | Win-back campaign, cashback offers |
| Lost | Very High | Low | Low | Exit survey, final reactivation attempt |
Activity

In-Class Exercise: Retail Segmentation

Tasks

  1. Load the provided retail transaction dataset
  2. Create RFM features from raw transactions
  3. Scale features with StandardScaler
  4. Run elbow + silhouette analysis to find optimal K
  5. Fit K-Means and profile each cluster
  6. Name your segments and propose strategies

Deliverables

  • Elbow plot and silhouette plot
  • PCA scatter plot colored by cluster
  • Summary table of cluster profiles
  • 1-paragraph marketing recommendation per segment
Time: 30 minutes · Mode: Pairs
Animation

K-Means on 60 Points — Watch It Converge

Lecture 17 Summary

Key Takeaways

1. Clustering finds natural groups in data without labels.

2. Always scale features before running K-Means.

3. Use elbow + silhouette together to choose K.

4. Hierarchical clustering shows the full merge tree via dendrograms.

Week 09 · Session 2

Advanced Clustering & Customer Analytics

DBSCAN, Market Basket Analysis, and RFM Segmentation

Department of Computer Science

University of the Philippines Cebu

Lecture 18: Beyond K-Means

Agenda

Lecture 18 Objectives

DBSCAN

Apply density-based clustering for arbitrary shapes and outlier detection.

Anomaly Detection

Use clustering and Isolation Forest to flag unusual observations.

Market Basket

Find product associations with Apriori (support, confidence, lift).

RFM Analysis

Score and segment customers by Recency, Frequency, Monetary value.

Density-Based Clustering

DBSCAN

Density-Based Spatial Clustering of Applications with Noise. Groups dense regions and labels sparse points as noise.

Two Parameters

  • eps (ε) — radius of neighborhood
  • min_samples — minimum points to form a dense region
\begin{algorithm}
\caption{DBSCAN}
\begin{algorithmic}
\REQUIRE Dataset $X$, radius $\varepsilon$, minimum points $m$
\FOR{each unvisited point $x_i \in X$}
    \STATE Mark $x_i$ as visited
    \STATE $N_i \leftarrow$ neighbors of $x_i$ within $\varepsilon$
    \IF{$|N_i| \geq m$}
        \STATE Create new cluster $C$; add $x_i$
        \STATE Expand $C$: recursively add all density-reachable points
    \ELSE
        \STATE Label $x_i$ as \textbf{noise}
    \ENDIF
\ENDFOR
\end{algorithmic}
\end{algorithm}
                        

Point Classification in DBSCAN

  • Core point — has ≥ min_samples neighbors within radius ε
  • Border point — within ε of a core point, but not itself core
  • Noise point — neither core nor border (outlier)
Implementation

DBSCAN in Python

Steps

  1. Pick eps (radius) and min_samples
  2. Points with ≥ min_samples neighbors → core
  3. Expand clusters from core points; leftovers = noise (−1)

Choosing Parameters

  • eps too small → everything is noise
  • eps too large → one giant cluster
  • min_samples ≈ 2 × n_features
python
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=1.2, min_samples=2)
labels = db.fit_predict(X_scaled)

n_clusters = len(set(labels) - {-1})
n_noise = list(labels).count(-1)
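Beyond the rules of thumb above, a common heuristic for picking eps (not part of sklearn itself) is the k-distance plot: sort every point's distance to its k-th nearest neighbor and look for the "knee" where the curve shoots up. A sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

k = 4  # roughly min_samples
nn = NearestNeighbors(n_neighbors=k).fit(X)
dists, _ = nn.kneighbors(X)
# Querying the training set itself, column 0 is the point (distance 0),
# so the last column is the distance to the (k-1)-th other neighbor
k_dist = np.sort(dists[:, -1])

# Plot k_dist and set eps near the knee; as a crude stand-in for the
# visual inspection, look at a high percentile of the sorted curve
print(round(float(np.percentile(k_dist, 95)), 2))
```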
Live Demo

DBSCAN Explorer

[Interactive widget: classify each of the 8 customers as Core / Border / Noise and watch clusters grow on Income vs Spending.]

Algorithm: DBSCAN
1. For each point p: N_ε(p) = {q : dist(p, q) ≤ ε}
2. If |N_ε(p)| ≥ min_samples → Core; expand the cluster to all density-reachable points
3. Else if p ∈ N_ε(some core point) → Border
4. Else → Noise (label = −1)
Comparison

DBSCAN vs K-Means: When to Use Which?

| Aspect | K-Means | DBSCAN |
|---|---|---|
| Cluster shape | Spherical / convex only | Any arbitrary shape |
| Number of clusters | Must specify K | Discovered automatically |
| Outlier handling | Forces into nearest cluster | Labels as noise (−1) |
| Cluster density | Assumes uniform | Handles varying density |
| Scalability | Fast (O(nKt)) | Slower (O(n log n) with index) |
| Reproducibility | Varies by initialization | Nearly deterministic* |
| Best for | Customer segmentation, compression | Anomaly detection, spatial data |
*Border points reachable from multiple clusters may vary by processing order.
Application

Anomaly Detection with Clustering

Use clustering to identify observations that don't belong to any group — potential fraud, errors, or rare events.

Pseudocode

  1. Run a clustering algorithm on your data
  2. Points that are far from everyone = anomalies
  3. Flag the top 5% most isolated points
  4. Investigate manually — fraud? data error? rare event?

Three Approaches

  1. DBSCAN noise points: Points labeled −1 are anomalies
  2. Distance from centroid: Points far from their K-Means centroid are suspicious
  3. Isolation Forest: Isolates points via random partitions — anomalies need fewer splits because they have unusual values. Set contamination to the expected anomaly proportion (e.g., 0.05 = 5%)
python
# Method 1: DBSCAN noise detection
dbscan = DBSCAN(eps=0.5, min_samples=10)
labels = dbscan.fit_predict(X_scaled)
anomalies_db = df[labels == -1]
print(f"DBSCAN anomalies: {len(anomalies_db)}")

# Method 2: Centroid distance threshold
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans.fit(X_scaled)
distances = kmeans.transform(X_scaled).min(axis=1)

# Flag top 5% as anomalies
threshold = np.percentile(distances, 95)
anomalies_km = df[distances > threshold]
print(f"K-Means anomalies: {len(anomalies_km)}")

# Method 3: Isolation Forest
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.05)
pred = iso.fit_predict(X_scaled)
anomalies_if = df[pred == -1]  # -1 = anomaly
Association Mining

Market Basket Analysis

Goal: Discover products that are frequently purchased together. Powers product recommendations, store layouts, and promotional bundles.

Real-World Applications

  • "Customers who bought X also bought Y"
  • Optimize shelf placement in grocery stores
  • Design cross-selling promotions
  • Bundle pricing strategies

Example Transaction Data

TXN Items Purchased
1 Rice, Eggs, Milk
2 Rice, Eggs
3 Bread, Butter, Coffee
4 Rice, Milk, Coffee
5 Eggs, Milk, Bread

Which items are often bought together?

Association Rules

Three Key Metrics: Support, Confidence, Lift

  • Support(A → B) — fraction of transactions containing both A and B
  • Confidence(A → B) — Support(A, B) / Support(A): how often B is bought when A is
  • Lift(A → B) — Confidence(A → B) / Support(B): lift > 1 means A and B co-occur more than chance would predict
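These metrics can be computed by hand on the five transactions above, here for the candidate rule Rice → Eggs:

```python
# The five transactions from the example table
txns = [{'Rice', 'Eggs', 'Milk'},
        {'Rice', 'Eggs'},
        {'Bread', 'Butter', 'Coffee'},
        {'Rice', 'Milk', 'Coffee'},
        {'Eggs', 'Milk', 'Bread'}]
n = len(txns)

support_rice = sum('Rice' in t for t in txns) / n            # 3/5 = 0.6
support_eggs = sum('Eggs' in t for t in txns) / n            # 3/5 = 0.6
support_both = sum({'Rice', 'Eggs'} <= t for t in txns) / n  # 2/5 = 0.4

confidence = support_both / support_rice  # 0.4 / 0.6 ≈ 0.67
lift = confidence / support_eggs          # ≈ 1.11 → weak positive association
print(round(confidence, 2), round(lift, 2))  # → 0.67 1.11
```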

Implementation

Apriori Algorithm in Python

Steps

  1. Convert transactions to a boolean table
  2. Find itemsets with support ≥ min_support
  3. Generate rules (A → B) with lift > 1
Install

pip install mlxtend

python
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

te = TransactionEncoder()
basket = pd.DataFrame(
    te.fit_transform(transactions), columns=te.columns_)

freq = apriori(basket, min_support=0.3, use_colnames=True)
rules = association_rules(
    freq, num_itemsets=len(basket),
    metric='lift', min_threshold=1.0)
Philippine Case Study

Sari-Sari Store Basket Analysis

The Philippines has over 1 million sari-sari stores. Basket analysis helps small retailers optimize their limited shelf space and bundling strategies.

Illustrative Rules (Hypothetical Data)

  • Instant noodles → Eggs (Lift: 2.3)
  • Rice → Cooking oil (Lift: 1.9)
  • Coffee sachets → Bread (Lift: 2.1)
  • Shampoo sachets → Soap (Lift: 1.7)
Business Insight

Place eggs next to instant noodles, and bundle coffee + bread for "breakfast packs."

python — sari-sari analysis
# Load sari-sari store transactions
baskets = pd.read_csv('ph_sarisari_txns.csv')

# One-hot encode
encoded = baskets.groupby(
    ['txn_id', 'product']
)['qty'].sum().unstack().fillna(0)
encoded = (encoded > 0).astype(int)

# Find associations
freq = apriori(encoded, min_support=0.02,
               use_colnames=True)
rules = association_rules(
    freq, num_itemsets=len(encoded),
    metric='lift', min_threshold=1.5
)

# Top actionable rules
top = rules[
    (rules['confidence'] > 0.5) &
    (rules['lift'] > 2)
].sort_values('lift', ascending=False)
print(top.head(10))
Customer Analytics

RFM Analysis: The Gold Standard for Customer Segmentation

RFM segments customers by three behavioral dimensions, each answering a critical business question.

Implementation

RFM Scoring in Python

Steps

  1. Compute R (days since last buy), F (total orders), M (total spent)
  2. Divide each into 5 quintiles scored 1–5
  3. Concatenate → "555" = Champion, "111" = Lost
Note

Invert R: fewer days = higher score (labels=[5,4,3,2,1]).

python
snapshot = df['order_date'].max() + timedelta(1)
rfm = df.groupby('customer_id').agg({
    'order_date': lambda x: (snapshot - x.max()).days,
    'order_id': 'nunique',
    'amount': 'sum'
}).rename(columns={
    'order_date': 'recency',
    'order_id': 'frequency',
    'amount': 'monetary'
})
rfm['R'] = pd.qcut(rfm['recency'], 5, labels=[5,4,3,2,1])
rfm['F'] = pd.qcut(rfm['frequency'].rank(method='first'), 5, labels=[1,2,3,4,5])
rfm['M'] = pd.qcut(rfm['monetary'], 5, labels=[1,2,3,4,5])
rfm['RFM'] = rfm['R'].astype(str) + rfm['F'].astype(str) + rfm['M'].astype(str)
Customer Analytics

RFM Segment Definitions & Actions

| Segment | R | F | M | Description | Marketing Action |
|---|---|---|---|---|---|
| Champions | 5 | 5 | 5 | Best customers, recent & frequent | Loyalty rewards, referral program |
| Loyal | 3-5 | 4-5 | 3-5 | Consistent, reliable buyers | Upsell, premium access |
| Potential | 4-5 | 1-3 | 1-3 | Recent but low engagement | Onboarding, education |
| At Risk | 1-2 | 3-5 | 3-5 | Were loyal, now lapsing | Win-back campaign, discounts |
| Hibernating | 1-2 | 1-2 | 2-3 | Long inactive, moderate value | Reactivation offer |
| Lost | 1 | 1 | 1 | Inactive, low value | Exit survey, deprioritize |
Advanced Technique

Clustering + RFM: Best of Both

Instead of manually defining segments with quintile thresholds, let K-Means discover natural groupings in the RFM space.

Why Combine?

Manual RFM uses arbitrary thresholds. Clustering finds the actual boundaries in your data — often more meaningful and data-driven.

Pro Tip

Monetary values are often heavily right-skewed. Apply np.log1p() before scaling to prevent a few high-spenders from dominating the clusters. Without log: [100, 500, 50000] has huge range. After log1p: [4.6, 6.2, 10.8] — much more evenly spaced.

python
# Use K-Means on RFM features
rfm_features = rfm[['recency', 'frequency',
                     'monetary']].copy()

# Log-transform skewed monetary column
rfm_features['monetary'] = np.log1p(
    rfm_features['monetary'])

scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_features)

# Find optimal K
from sklearn.metrics import silhouette_score
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(rfm_scaled)
    sil = silhouette_score(rfm_scaled, labels)
    print(f"K={k}: Silhouette={sil:.3f}")

# Fit best K (from silhouette loop above)
best_k = max(range(2, 8), key=lambda k:
    silhouette_score(rfm_scaled,
    KMeans(n_clusters=k, random_state=42)
    .fit_predict(rfm_scaled)))
best_km = KMeans(n_clusters=best_k, random_state=42)
rfm['cluster'] = best_km.fit_predict(rfm_scaled)

# Profile and name clusters
profile = rfm.groupby('cluster').agg({
    'recency': 'mean',
    'frequency': 'mean',
    'monetary': ['mean', 'count']
}).round(1)
print(profile)
Decision Guide

Which Clustering Algorithm Should I Use?

K-Means

  • You know (or can estimate) K
  • Clusters are roughly spherical
  • Large datasets (fast)
  • Customer segmentation
  • Image compression

Hierarchical

  • Want to explore different K values
  • Need a visual hierarchy (dendrogram)
  • Small-medium datasets (<10K)
  • Taxonomy / phylogenetics
  • Organizational structure

DBSCAN

  • Don't know K at all
  • Non-spherical / irregular clusters
  • Need outlier detection
  • Spatial / geographic data
  • Fraud / anomaly detection
Reference

The sklearn Clustering Landscape

Every algorithm in sklearn.cluster has trade-offs. Use this table to pick the right one.

| Algorithm | Parameters | Scalability | Use Case | Geometry |
|---|---|---|---|---|
| K-Means | n_clusters | Very large n, medium K | General-purpose, even clusters | Euclidean (spherical) |
| Mini Batch K-Means | n_clusters, batch_size | Huge datasets (streaming) | Same as K-Means, faster | Euclidean (spherical) |
| Agglomerative | n_clusters, linkage | Large n with connectivity | Hierarchies, dendrograms | Any pairwise distance |
| DBSCAN | eps, min_samples | Very large n, medium K | Irregular shapes, outliers | Nearest-point distance |
| HDBSCAN | min_cluster_size | Large n, medium K | Variable density, no eps tuning | Nearest-point distance |
| OPTICS | min_samples, xi | Very large n, large K | Variable density, reachability plots | Point distances |
| Gaussian Mixture | n_components, cov_type | Not scalable | Soft clustering, density estimation | Mahalanobis |
| Spectral | n_clusters, affinity | Medium n, small K | Non-flat manifolds, graphs | Graph distance |
Rule of Thumb

Start with K-Means. If clusters aren't spherical, try DBSCAN. If density varies, use HDBSCAN. Need probabilities? Use GMM.

Going Further

Beyond the Basics: Four More Algorithms

K-Means, Hierarchical, and DBSCAN cover most use cases. These four solve specific pain points.

Gaussian Mixture Models (GMM)

Soft clustering — each point gets a probability of belonging to each cluster.

  • Models clusters as Gaussian distributions
  • Handles elliptical (non-spherical) clusters
  • BIC/AIC to select number of components
Best for: Overlapping clusters, uncertainty estimation, anomaly scoring.
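A minimal sketch of GMM's soft assignments on synthetic blobs — `predict_proba` returns a probability per component for each point, and BIC can compare candidate component counts:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=42)

gmm = GaussianMixture(n_components=2, random_state=42).fit(X)

# Soft clustering: each row is a probability distribution over components
proba = gmm.predict_proba(X[:3])
print(proba.round(3))  # each row sums to 1

# BIC for choosing the number of components (lower is better)
bics = {k: GaussianMixture(n_components=k, random_state=42).fit(X).bic(X)
        for k in range(1, 5)}
print(min(bics, key=bics.get))
```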

HDBSCAN

Hierarchical DBSCAN — no eps to tune.

  • Finds clusters at varying densities automatically
  • Only needs min_cluster_size
  • Provides cluster stability scores
Best for: Mixed-density clusters, when DBSCAN's eps is hard to set.

Mini Batch K-Means

Streaming K-Means — uses random mini-batches instead of the full dataset per iteration.

  • Much faster convergence on large data
  • Slightly lower quality than standard K-Means
  • sklearn.cluster.MiniBatchKMeans
Best for: Datasets too large for standard K-Means (>100K rows).

OPTICS

Generalized DBSCAN — one run replaces many DBSCAN runs at different eps values.

  • Produces a reachability plot showing density hierarchy
  • Extract clusters at any density threshold
  • O(n) memory (better than HDBSCAN for huge data)
Best for: Exploring cluster structure at multiple scales, very large datasets.
Key Concept

Can Your Model Label New Data?

Some clustering algorithms can predict labels for unseen data. Others can only label the data they were trained on.

Inductive

Has a .predict() method. Learn once, apply to new data forever.

K-Means, Mini Batch K-Means, GMM, BIRCH

Transductive

Only has .fit_predict(). Must re-fit if new data arrives.

DBSCAN, HDBSCAN, OPTICS, Hierarchical, Spectral

When does it matter?

Scenario Need
New customer signs up → assign to segment Inductive
One-time analysis of existing data Either works
Real-time fraud detection on streaming data Inductive
Exploring spatial patterns in a fixed dataset Transductive OK
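The distinction is visible directly in sklearn's API — a sketch:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

# Inductive: K-Means can label a point it has never seen
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
new_point = np.array([[0.0, 0.0]])
print(km.predict(new_point))   # assigns the new point to a learned centroid

# Transductive: DBSCAN has no predict method at all
db = DBSCAN(eps=1.0, min_samples=5).fit(X)
print(hasattr(db, 'predict'))  # → False: new data requires a full re-fit
```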
Tip: If you need .predict() but want DBSCAN-like behavior, train HDBSCAN then use its approximate_predict() method.
Activity

In-Class Exercise: Complete Customer Analytics

Tasks

  1. Load the Philippine e-commerce dataset
  2. Calculate RFM metrics from raw transactions
  3. Cluster customers with K-Means (find best K)
  4. Compare with DBSCAN (identify anomalous customers)
  5. Run basket analysis on product purchases
  6. Create a 1-page segment recommendation report

Expected Outputs

  • Elbow + silhouette plots
  • PCA scatter plot by cluster
  • RFM cluster profile table
  • Top 5 association rules with interpretation
  • DBSCAN anomaly count and examples
  • Marketing recommendations per segment
Time: 35 minutes · Mode: Pairs
Animation

DBSCAN on Crescents — Where K-Means Fails

Lecture 18 Summary

Key Takeaways

1. DBSCAN finds arbitrary-shaped clusters and naturally identifies outliers as noise.

2. Anomaly detection uses distance from normal patterns (DBSCAN noise, centroid distance, Isolation Forest).

3. Market basket analysis reveals product co-occurrence patterns via support, confidence, and lift.

4. RFM + Clustering creates data-driven customer segments with clear marketing actions.

Coming Up

Lab 9: Customer Segmentation Project

A comprehensive end-to-end customer analytics exercise combining everything from this week.

Next Week: Time Series Analytics — Forecasting trends, seasonality, and building ARIMA models.

Lab 9 Results

Expected Output: Elbow & Silhouette Analysis

Elbow Method (Inertia vs K)

Silhouette Score vs K

Interpretation: The elbow occurs at K=3 (inertia drops sharply before, flattens after). Silhouette score peaks at K=3 (0.45), confirming 3 well-separated clusters in the PSA regional data.

Lab 9 Results

Expected Output: PCA Cluster Visualization

Cluster 0: Metro Powerhouses

NCR, CALABARZON, Region III — High GDP, low poverty, high tourism

Cluster 1: Mid-Tier Regions

Region VI, VII, XI, X — Moderate GDP, mixed poverty rates

Cluster 2: Rural/Underserved

BARMM, Region V, VIII, IX, XII — High poverty, low GDP

Lab 9 Results

Expected Output: Hierarchical Dendrogram

[Dendrogram (Ward distance axis 0–8): red dashed cut at ~4 → K=3. Leaves include NCR, CALABARZON, Region III, VII, VI, XI, X, CAR, I, II, V, VIII, IX, XII, and BARMM, which merges last.]

Reading the dendrogram: Cutting at distance ~4 (red dashed line) yields 3 clusters that match K-Means results. BARMM merges last — it's the most distinct region, driven by 63% poverty rate and lowest literacy.