CMSC 178DA | Week 04

The Art of Discovery

Exploratory Data Analysis

Department of Computer Science

University of the Philippines Cebu

Discover Patterns & Insights

Opening Story

The WWII Bomber Problem

World War II. American bombers are getting shot down at alarming rates.

The military examines bombers that return from missions and maps where the bullet holes are:

Bomber aircraft with bullet holes concentrated on fuselage and wings, while engines remain undamaged
Opening Story

The Obvious Answer... Is Wrong

Most engineers said: "Reinforce the fuselage and wings — that's where all the holes are!"

Abraham Wald (Statistical Research Group, Columbia)

"No. Reinforce where the holes AREN'T."

The military was only studying planes that survived. Planes hit in the engines and fuel tanks never came back to be studied — they crashed.

The bullet holes on returning planes showed where planes could take damage and still fly.

Opening Story

Survivorship Bias

Survivorship Bias: Drawing conclusions from data that survived a selection process, while ignoring the data that didn't make it through.

Other examples:

  • "Most successful CEOs dropped out of college" — ignores millions of dropouts who didn't succeed
  • "Old buildings are better built" — poorly built ones already collapsed

EDA isn't just about what the data shows. It's about what's missing from your data.

Foundations

Quick Review: Key Statistics

Before we go further, let's make sure we understand three terms:

Mean (Average): Add up all values, divide by count. Example: [2, 4, 6] → mean = (2+4+6)/3 = 4

Correlation (r): How closely two variables move together. Ranges from -1 (perfect opposite) to +1 (perfect together). 0 = no linear pattern.

Regression Line: The best-fit straight line through data. Written as y = a + bx — lets you predict y from x.

Scatter plot showing a simple regression line y = a + bx
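All three quantities from this review can be computed in a few lines. A minimal NumPy sketch (the x/y values are made-up illustration data):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

mean_y = y.mean()                       # central tendency
r = np.corrcoef(x, y)[0, 1]             # correlation coefficient
slope, intercept = np.polyfit(x, y, 1)  # best-fit line y = a + bx
```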
Anscombe's Quartet

The Four Identical Datasets

In 1973, statistician Francis Anscombe created four datasets. All four share exactly the same summary statistics:

  • Mean X / Mean Y: 9.0 / 7.50 (same average position)
  • Correlation (r): 0.816 (strong positive relationship)
  • Regression line: y = 3 + 0.5x (same best-fit line)
Agenda

Learning Objectives

By the end of this lecture, you should be able to:

  1. Perform Univariate Analysis to understand distributions and outliers
  2. Conduct Bivariate Analysis to uncover relationships between variables
  3. Identify confounding variables and Simpson's Paradox in data
  4. Apply Feature Engineering techniques (encoding, binning) for modeling
  5. Explain why visualization is essential — not optional — in EDA
Motivation

Why EDA?

"You can't model what you don't understand."

EDA is the detective work before the trial:

  • Spotting anomalies (fraud, errors)
  • Testing assumptions (normality, linearity)
  • Generating hypotheses

Philippine FIES 2021

Family Income and Expenditure Survey

  • Average family income: PHP 307,190
  • NCR: PHP 417,850 vs BARMM: PHP 184,940 — a 2.26x gap
  • Gini coefficient: 0.4119 (one of highest in East Asia)
  • Poverty incidence: 18.1% of population

Source: PSA FIES 2021

Part 1

Univariate Analysis

Understanding one variable at a time

Part 1 · Univariate

What Are Descriptive Statistics?

Descriptive statistics summarize the main features of a dataset quantitatively — central tendency (where?), dispersion (how spread?), and shape (what pattern?).

Descriptive = summarize what IS

Mean, median, std dev, histograms

Inferential = predict what COULD BE

Hypothesis tests, confidence intervals

Part 1 · Univariate

Central Tendency: Where Is the Data?

Given 5 UP Cebu student GWAs: [1.5, 2.0, 2.0, 2.5, 4.0]

  • Mean (Average): 2.4 = (1.5 + 2.0 + 2.0 + 2.5 + 4.0) / 5
  • Median (Middle): 2.0 (sort the values, pick the middle one)
  • Mode (Most Frequent): 2.0 (appears twice)
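These three measures are one-liners in Python's standard library, using the same five GWAs:

```python
from statistics import mean, median, mode

gwas = [1.5, 2.0, 2.0, 2.5, 4.0]
m, md, mo = mean(gwas), median(gwas), mode(gwas)
# m = 2.4, md = 2.0, mo = 2.0
```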
Part 1 · Univariate

Spread: How Different Are the Values?

Same 5 GWAs: [1.5, 2.0, 2.0, 2.5, 4.0] — Mean = 2.4

Range = Max - Min = 4.0 - 1.5 = 2.5

Variance = average of squared deviations from mean:

= [(1.5-2.4)² + (2.0-2.4)² + (2.0-2.4)² + (2.5-2.4)² + (4.0-2.4)²] / 5
= [0.81 + 0.16 + 0.16 + 0.01 + 2.56] / 5 = 0.74

Standard Deviation = √Variance = √0.74 = 0.86

Std dev is in the same units as the data (GWA points), making it more interpretable than variance (GWA points²).
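The hand computation above (dividing by n, i.e. the population variance) matches the standard library's `pvariance`/`pstdev`:

```python
from statistics import pvariance, pstdev

gwas = [1.5, 2.0, 2.0, 2.5, 4.0]
var = pvariance(gwas)   # population variance: divide by n, as above
sd = pstdev(gwas)       # square root of variance, same units as the data
# var = 0.74, sd ≈ 0.86
```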

Part 1 · Univariate

Shape: What Does the Distribution Look Like?

The shape tells you whether data is balanced or has a long tail in one direction.

Three distribution shapes: left-skewed, symmetric, and right-skewed
Part 1 · Univariate

Worked Example: Philippine Family Income

Region              Avg Family Income (PHP)
NCR                 417,850
CALABARZON          361,030
CAR                 350,430
Region III          328,540
Region VII (Cebu)   ~270,000
BARMM               184,940

Range = NCR - BARMM = 417,850 - 184,940 = PHP 232,910

National Mean = PHP 307,190 — but this represents almost nobody (NCR is far above, most regions are below)

Part 1 · Univariate

What Does the Average Actually Mean?

Bar chart of Philippine regional incomes showing the national mean line
Part 1 · Univariate

What Is a Histogram?

A histogram divides data into bins (ranges) and counts how many values fall in each bin. Unlike a bar chart, the x-axis is continuous.

How histograms work:

Raw data: [150, 180, 200, 220, 250, 270, 280, 300, 350, 400]

Group into bins:
  150-250: 5 values
  250-350: 4 values
  350-450: 1 value

Histograms show the shape of your data — something a mean or median alone cannot reveal.
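The bin counts above can be reproduced with pandas, whose `pd.cut` uses right-inclusive intervals by default (so 250 lands in the first bin):

```python
import pandas as pd

values = pd.Series([150, 180, 200, 220, 250, 270, 280, 300, 350, 400])
counts = pd.cut(values, bins=[150, 250, 350, 450],
                include_lowest=True).value_counts(sort=False)
# counts per bin: 5, 4, 1
```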

Part 1 · Univariate

Histograms: Income Distribution

Right-skewed income distribution histogram showing mean greater than median
  • Right Skewed: Tail extends right (e.g., Income). Mean > Median.
  • Left Skewed: Tail extends left (e.g., Age at death). Mean < Median.
import seaborn as sns
import matplotlib.pyplot as plt

# Histogram with KDE
sns.histplot(data=df, x='income', kde=True)
plt.title('Income Distribution')
plt.show()
Part 1 · Univariate

What Is a Box Plot?

A box plot shows the five-number summary: minimum, Q1 (25th percentile), median, Q3 (75th percentile), and maximum. The box spans the middle 50% of data, called the IQR (Interquartile Range).

Annotated box plot showing quartiles, IQR, whiskers, and outlier
Part 1 · Univariate

How to Read a Box Plot: Step by Step

Using FIES regional income data (PHP thousands): [185, 250, 270, 307, 350, 361, 418]

Step 1: Q1 (25th percentile) = 250

Step 2: Median (50th percentile) = 307

Step 3: Q3 (75th percentile) = 361

Step 4: IQR = Q3 - Q1 = 361 - 250 = 111

Step 5: Lower fence = Q1 - 1.5 × IQR = 250 - 166.5 = 83.5

Step 6: Upper fence = Q3 + 1.5 × IQR = 361 + 166.5 = 527.5

NCR (418K) is within fences — not an outlier! Any value above 527.5K would be flagged.
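The six steps use the "median of each half" quartile convention common in intro texts (NumPy's default percentile interpolation would give slightly different quartiles). A sketch verifying them:

```python
from statistics import median

incomes = [185, 250, 270, 307, 350, 361, 418]  # PHP thousands, sorted
mid = len(incomes) // 2
q1 = median(incomes[:mid])        # median of lower half (excluding overall median)
q3 = median(incomes[mid + 1:])    # median of upper half
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
# q1 = 250, q3 = 361, iqr = 111, fences = 83.5 and 527.5
```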

Part 1 · Univariate

Finding Outliers with the IQR Method

A data point is an outlier if:

value < Q1 - 1.5 × IQR

or

value > Q3 + 1.5 × IQR

The 1.5 × IQR rule catches values that are unusually far from the middle 50%.

# Box plot
sns.boxplot(x=df['annual_income'])

# Finding outliers programmatically
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[
    (df['income'] < (Q1 - 1.5 * IQR)) |
    (df['income'] > (Q3 + 1.5 * IQR))
]
print(f"Found {len(outliers)} outliers")
Anscombe's Quartet

Here's the Actual Data

Remember: all four have identical statistics (mean, correlation, regression line).

Dataset I
x     y
10    8.04
8     6.95
13    7.58
9     8.81
11    8.33
14    9.96
6     7.24
4     4.26
12    10.84
7     4.82
5     5.68

Dataset II
x     y
10    9.14
8     8.14
13    8.74
9     8.77
11    9.26
14    8.10
6     6.13
4     3.10
12    9.13
7     7.26
5     4.74

Dataset III
x     y
10    7.46
8     6.77
13    12.74
9     7.11
11    7.81
14    8.84
6     6.08
4     5.39
12    8.15
7     6.42
5     5.73

Dataset IV
x     y
8     6.58
8     5.76
8     7.71
8     8.84
8     8.47
8     7.04
8     5.25
19    12.50
8     5.56
8     7.91
8     6.89
Anscombe's Quartet

What Do You Think They Look Like?

Prediction Challenge

If four datasets have the same mean, same spread, same correlation, and same best-fit line...

  • Do they all look the same when plotted?
  • Could they look different? How?

Sketch your prediction: Draw what you think one of these scatter plots looks like. Just a rough sketch on paper or your tablet.

2 minutes

Anscombe's Quartet

The Reveal

Same statistics. Completely different patterns.

Scatter plots of the four datasets (same numbers as the previous slide):

  • Dataset I: Linear
  • Dataset II: Curved
  • Dataset III: Outlier
  • Dataset IV: Leverage point
Anscombe's Quartet

Why Does This Happen?

Each dataset fools the summary statistics in a different way:

Dataset I: A genuine linear relationship. The statistics are telling the truth.

Dataset II: A curved (quadratic) relationship. The linear regression line completely misses the pattern. r=0.816 is misleading.

Dataset III: A perfect linear trend with one outlier. That single point pulls the regression line and inflates r.

Dataset IV: All points at one x-value except one leverage point that single-handedly creates the illusion of correlation.
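You can verify the "identical statistics" claim directly. Here for Datasets I and II (values from the tables above, which share the same x):

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

for y in (y1, y2):
    r = np.corrcoef(x, y)[0, 1]
    slope, intercept = np.polyfit(x, y, 1)
    print(x.mean(), round(y.mean(), 2), round(r, 3),
          round(slope, 2), round(intercept, 2))
# both lines print: 9.0 7.5 0.816 0.5 3.0
```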

Anscombe's Quartet

It Gets Wilder: The Datasaurus Dozen

In 2017, Autodesk researchers extended Anscombe's idea to 13 datasets — including one shaped like a dinosaur.

All 13 share the same:

  • Mean of X and Y
  • Standard deviation of X and Y
  • Pearson correlation

Same mean. Same std dev. Same correlation. Completely different shapes.

Scatter plots of the Datasaurus Dozen: dinosaur, star, circle, and X shapes, all with identical summary statistics
Anscombe's Quartet

The Lesson: Always Visualize

If you had only looked at summary statistics, you would have treated all four of Anscombe's datasets identically.

EDA — specifically visualization — would have caught the difference instantly.

Practical Rule:

Before fitting any model, always:

  1. Plot your data (scatter plots, histograms, box plots)
  2. Look for patterns the numbers can't capture (curves, clusters, outliers)
  3. Then — and only then — choose your modeling approach
Activity

Activity: EDA Detective

Scenario: You receive 500 UP Cebu student GWAs. Your boss says: "The average GWA is 2.5 — students are doing fine."

  1. Task 1: What does mean < median tell you about the distribution shape? Sketch it.
  2. Task 2: Is the boss's conclusion justified? What additional EDA would you do?
Statistic   Value
Mean        2.50
Median      2.80
Std Dev     0.90
Min         1.00
Max         5.00

3 minutes

Part 2

Bivariate Analysis

Discovering relationships between variables

Part 2 · Bivariate

What Is Correlation?

Correlation measures how strongly two variables move together:

  • Positive: Both go up together (e.g., height and weight)
  • Negative: One goes up, the other goes down (e.g., price and demand)
  • Zero: No linear pattern
Visual guide showing correlation values from r=-1 to r=+1
Part 2 · Bivariate

Pearson's Correlation Coefficient

For n paired observations:

r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √[ Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)² ]

Intuition:

  • Numerator: Do x and y deviate from their means in the same direction? If yes, the product is positive.
  • Denominator: Normalize by their individual spreads so r is always between -1 and +1.

Critical limitation: r only measures linear relationships.

A high r can hide a curve (Anscombe's Dataset II scored r = 0.816 despite being curved), and r near 0 doesn't mean "no relationship"; it only rules out a linear one.
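In fact r can be exactly 0 for a perfectly deterministic nonlinear relationship. A quick check with symmetric illustrative data:

```python
import numpy as np

x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2                     # y is completely determined by x
r = np.corrcoef(x, y)[0, 1]    # yet r is (numerically) zero
```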

Part 2 · Bivariate

Correlation ≠ Causation

The Ice Cream Paradox

Ice cream sales and drowning deaths are positively correlated. Does ice cream cause drowning?

Of course not — both are caused by hot weather (a confounding variable).

Correlation tells you: "These variables move together."

Correlation does NOT tell you: "One CAUSES the other."

To establish causation, you need controlled experiments or causal inference methods — correlation alone is never enough.
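A quick simulation of the confounder (all numbers invented for illustration): temperature drives both series, and they come out strongly correlated even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(42)
temp = rng.uniform(20, 40, size=365)                  # daily temperature (confounder)
ice_cream = 50 + 10 * temp + rng.normal(0, 20, 365)   # sales driven by temperature
drownings = 1 + 0.3 * temp + rng.normal(0, 1, 365)    # drownings driven by temperature

r = np.corrcoef(ice_cream, drownings)[0, 1]           # strongly positive
```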

Part 2 · Bivariate

Scatter Plots: What to Look For

When examining a scatter plot, check three things:

  1. Direction: Positive (up-right) or Negative (down-right)?
  2. Strength: How tightly do points cluster around a line?
  3. Shape: Linear, curved, or no pattern?
Three scatter plot patterns: strong linear, weak linear, and non-linear
Part 2 · Bivariate

Scatter Plots & Correlation in Python

# Scatter plot with regression line
sns.regplot(x='years_experience', y='salary', data=df)

# Correlation matrix
corr = df[['income', 'education', 'household_size']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
            income   education   hh_size
income        1.00        0.72     -0.31
education     0.72        1.00      0.05
hh_size      -0.31        0.05      1.00

Income & education: strong positive. Income & household size: weak negative.

Part 2 · Bivariate

Categorical vs Numerical

How does a numerical variable change across categories?

Grouped Box Plots

Compare distributions side-by-side.

Example: Income distribution by Philippine region.

This is where FIES data reveals the NCR vs BARMM gap visually — numbers alone don't show the full picture.

plt.figure(figsize=(10, 6))
sns.boxplot(x='region', y='family_income', data=fies_df)
plt.xticks(rotation=45)
plt.title('Income by Region (FIES)')
plt.tight_layout()
plt.show()
Part 2 · Bivariate

Categorical vs Categorical

Cross-Tabulation (Crosstab)

Frequency counts for combinations of categories.

Example: Education Level vs. Employment Status.

Visualized via Heatmaps or Stacked Bar Charts.

# Contingency table
ct = pd.crosstab(df['education'], df['employed'])

# Heatmap visualization
sns.heatmap(ct, annot=True, cmap='Blues', fmt='d')
plt.title('Education vs Employment')
Simpson's Paradox

A Lawsuit at Berkeley

UC Berkeley, Fall 1973. The university was accused of sex discrimination in graduate admissions.

Men admitted: 44.5% (1,198 of 2,691)

Women admitted: 30.4% (557 of 1,835)

A gap of 14 percentage points.
Simpson's Paradox

What Would You Conclude?

Based on these numbers alone — 44.5% vs 30.4% — what would you conclude?

Most people — including the legal team — concluded discrimination.

But statistician Peter Bickel was asked to look deeper...

He broke the data down by department.

Simpson's Paradox

The Department-Level Truth

Dept   Men Applied   Men Admitted   Women Applied   Women Admitted
A      825           62%            108             82%
B      560           63%            25              68%
C      325           37%            593             34%
D      417           33%            375             35%
E      191           28%            393             24%
F      373           6%             341             7%

In 4 of 6 departments, women had equal or HIGHER admission rates!
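Reconstructing the aggregate from the department table (admitted counts approximated from the rounded rates) reproduces the paradox:

```python
import pandas as pd

depts = pd.DataFrame({
    'men_app':    [825, 560, 325, 417, 191, 373],
    'men_rate':   [0.62, 0.63, 0.37, 0.33, 0.28, 0.06],
    'women_app':  [108, 25, 593, 375, 393, 341],
    'women_rate': [0.82, 0.68, 0.34, 0.35, 0.24, 0.07],
}, index=list('ABCDEF'))

men_overall = (depts.men_app * depts.men_rate).sum() / depts.men_app.sum()
women_overall = (depts.women_app * depts.women_rate).sum() / depts.women_app.sum()
# roughly 44.5% vs 30.3%: the aggregate gap appears even though
# women match or beat men in 4 of 6 departments
```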

Simpson's Paradox

Simpson's Paradox

Simpson's Paradox: A trend that appears in aggregated data reverses when the data is separated into groups. Caused by a lurking confounding variable.

Aggregated data: Men 44.5% vs Women 30.4% ("bias against women"). Split by department: in 4 of 6 departments, women had equal or higher admission rates ("no bias at the department level"). The confounding variable: women applied to more competitive departments, where admission rates were lower for everyone.
Simpson's Paradox

Why Did This Happen?

The Mechanism

Women applied to competitive departments (English, Humanities) with low admission rates for everyone.

Men applied to less competitive departments (Engineering) with high admission rates for everyone.

When you aggregate across departments, it looks like bias — but it's actually about where people applied.

The lesson for EDA:

Always ask: "Is there a third variable I'm not seeing?"

Bivariate analysis must consider confounding variables. Aggregated data can create illusions.

Discussion

Discussion: Data Ethics in EDA

Your company asks you to analyze employee performance data. You find that women have lower average performance scores. Your boss says to include this finding in the report.

Consider:

  • What confounders might exist? (Department? Manager bias?)
  • Could this be Simpson's Paradox?

Discuss:

  1. Should you present the aggregate finding?
  2. What additional EDA would you do first?
  3. How would you push back ethically?

4 minutes

Part 3

Feature Engineering

Transforming raw data into predictive power

Part 3 · Feature Engineering

What Is Feature Engineering?

Feature engineering is the process of using domain knowledge to create, transform, or select variables (features) that make machine learning algorithms work better.

"Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering."

— Andrew Ng

The best model in the world can't compensate for bad features. EDA tells you which features to engineer.

Part 3 · Feature Engineering

Nominal vs Ordinal: Why It Matters

Nominal = categories with NO natural order (Region, Color, Course Program)

Ordinal = categories with a meaningful order (Low/Medium/High, Year Level, Star Ratings)

Why this matters for encoding:

If you encode nominal data with numbers (NCR=1, Cebu=2, Davao=3), your model thinks Davao > Cebu > NCR. That's a fake ordering — and it will learn wrong patterns!

The encoding method you choose depends on which type of categorical data you have.

Part 3 · Feature Engineering

Encoding Categorical Data

One-Hot Encoding (for Nominal)

Region   is_NCR   is_Cebu   is_Davao
NCR      1        0         0
Cebu     0        1         0
Davao    0        0         1

Label Encoding (for Ordinal)

Income Level   Encoded
Low            0
Medium         1
High           2
# One-Hot Encoding in pandas
pd.get_dummies(df, columns=['region'], drop_first=True)
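For the ordinal case, an explicit mapping keeps the order under your control (a sketch; the `income_level` column and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'income_level': ['Low', 'High', 'Medium', 'Low']})
order = {'Low': 0, 'Medium': 1, 'High': 2}
df['income_level_enc'] = df['income_level'].map(order)
# encoded values: 0, 2, 1, 0
```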
Part 3 · Feature Engineering

Why Bin Continuous Data?

Binning (Discretization): Converting a continuous numerical variable into discrete categories.

When is binning useful?

  • When exact values have too much noise
  • When you care about categories more than exact numbers
  • When the relationship is non-linear and steps would help

Example

Instead of predicting with exact income (PHP 307,190), group into Low / Medium / High.

Reduces noise, captures broader patterns.

Part 3 · Feature Engineering

Equal-Width vs Quantile Binning

Equal-Width Bins

Data: [185K, 220K, 270K, 307K, 350K, 418K, 1.2M]

Bins: [185K-524K] [524K-862K] [862K-1.2M]

6 of 7 values in first bin!

Quantile Bins (Equal Count)

Data: [185K, 220K, 270K, 307K, 350K, 418K, 1.2M]

Bins: [185K-260K] [260K-350K] [350K-1.2M]

~Equal items per bin

Equal-width binning fails with skewed data. Quantile binning preserves distribution information.

# Equal-width vs quantile binning
df['income_ew'] = pd.cut(df['income'], bins=3)   # equal-width
df['income_qt'] = pd.qcut(df['income'], q=3)     # quantile
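With the seven values from the comparison above (including the hypothetical 1.2M outlier) you can see the difference directly:

```python
import pandas as pd

incomes = pd.Series([185, 220, 270, 307, 350, 418, 1200])  # PHP thousands

ew = pd.cut(incomes, bins=3).value_counts(sort=False)   # equal-width bins
qt = pd.qcut(incomes, q=3).value_counts(sort=False)     # quantile bins
# equal-width piles 6 of 7 values into the first bin (6 / 0 / 1);
# quantile spreads them 3 / 2 / 2
```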
Part 3 · Feature Engineering

Interaction Features

Combining two features to capture joint effects.

Philippine Examples

  • Income Per Capita = Household Income / Members
  • Savings Rate = (Income - Expenditure) / Income
  • Education Efficiency = Income / Years of Education
# Creating interaction features
df['income_per_capita'] = (
    df['household_income'] / df['household_size']
)
df['savings_rate'] = (
    (df['income'] - df['expenditure']) / df['income']
)
# Often a better predictor than income alone!
Activity

Activity: Feature Engineering Challenge

Scenario: You're building a model to predict which UP Cebu students need academic support.

  1. Task 1: Classify each "?" feature as nominal, ordinal, or numerical. How would you encode each?
  2. Task 2: Create 2 interaction features that might be more predictive than the originals. Explain why.
Feature        Type        Example
GWA            Numerical   2.5
Attendance %   Numerical   85%
Year Level     ?           3rd Year
Course Code    ?           BSCS
Province       ?           Cebu
Scholarship    ?           Yes

5 minutes

Wrap-Up

The Complete EDA Pipeline

The Complete EDA Pipeline:

  1. Load Data (CSV, SQL, API): df.head(), df.info(), df.describe()
  2. Univariate (distributions, outliers via IQR, skewness): "What does each variable look like?"
  3. Bivariate (correlations, confounders, Simpson's Paradox): "How do variables relate?"
  4. Feature Engineering (encoding, binning, interactions): "Transform for ML"

Then you're ready for modeling. Skipping EDA = building on a foundation you haven't inspected. (Remember Anscombe's Quartet and Berkeley's admissions!)
Wrap-Up

Key Takeaways

  1. Descriptive statistics (central tendency, spread, shape) summarize your data — but can be misleading without visualization
  2. Histograms and box plots reveal distributions, skewness, and outliers that averages hide
  3. Scatter plots and correlation show relationships between variables — but correlation ≠ causation
  4. Confounding variables and Simpson's Paradox can reverse conclusions when you disaggregate data
  5. Feature engineering (encoding, binning, interactions) transforms raw data for machine learning
Wrap-Up

The Big Picture

  • Abraham Wald proved that what's MISSING from data matters as much as what's there
  • Anscombe showed that identical statistics can hide completely different realities
  • Berkeley's admissions proved that aggregation can create the illusion of bias
  • The FIES data shows PHP 307K average income — but that number represents almost nobody

EDA is not optional. It's the difference between discovering truth and confirming assumptions.

Coming Up

Next Lecture

Week 5: Data Visualization Principles

  • Visual Perception and Cognition
  • The "Grammar of Graphics"
  • Choosing the Right Chart
  • Lab 5: Advanced Visualization with Seaborn & Plotly