CMSC 178DA

Week 3: Data Wrangling


Data Wrangling

CMSC 178DA - Week 03

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

When Bad Data Costs $327 Million

NASA Mars Climate Orbiter (1999)

  • 1998: NASA launches a $125 million probe to study Mars climate
  • Lockheed Martin calculates thruster force in pound-force seconds
  • NASA's navigation team expects newton-seconds (off by factor of 4.45x)
  • Sep 23, 1999: Orbiter enters Mars atmosphere 170 km too low. Destroyed.
  • Total mission loss: $327 million

Navigators noticed the discrepancy months earlier but "did not follow the rules about filling out the form" to flag it.

A single unit conversion error — the most basic data cleaning task — destroyed a spacecraft.
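A minimal sketch of the conversion that was skipped (this is not NASA's actual code; `thruster_impulse_si` is an illustrative name):

```python
# 1 pound-force second = 4.448222 newton-seconds: the factor the software omitted.
LBF_S_TO_N_S = 4.448222

def thruster_impulse_si(impulse_lbf_s: float) -> float:
    """Convert thruster impulse from pound-force seconds to newton-seconds."""
    return impulse_lbf_s * LBF_S_TO_N_S

# A value read without conversion is off by ~4.45x.
print(thruster_impulse_si(10.0))  # 44.48222
```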

Source: Wikipedia · IEEE Spectrum (Oberg, 1999)

Learning Objectives

Lecture 5: Data Collection & SQL

  1. Understand data collection methodologies
  2. Master SQL fundamentals for data extraction
  3. Work with Philippine government data sources
  4. Apply best practices for data acquisition

Why Data Wrangling?

Data Wrangling: The process of transforming raw data into a format suitable for analysis

  • 80% of analytics time is spent on data prep
  • 60% of data scientists say cleaning is the most tedious task
  • GIGO: "Garbage in, garbage out"

Samsung Securities (April 2018)

An employee typed "shares" instead of "won" when processing dividends → issued 2.8 billion ghost shares worth ~$105 billion → stock dropped 11% in one day. Took 37 minutes to catch. 16 employees sold 5 million shares before the fix.

Source: Wikipedia

Data wrangling isn't boring busywork. It's the difference between insight and disaster.

The Data Wrangling Pipeline

Every analytics project follows this sequence

Collect (APIs, SQL, files, surveys, scraping) → Clean (missing values, duplicates, outliers, type errors) → Transform (reshape, encode, merge, derive features) → Validate (assertions, tests, quality checks). Today: we cover all four stages.

Data Collection Methods

Method | Description | Example
APIs | Programmatic access | Twitter API, PSA OpenSTAT
Web Scraping | Extract from websites | News articles, prices
Databases | SQL queries | Company data warehouse
Files | CSV, Excel, JSON | Government datasets
Surveys | Primary collection | Google Forms, interviews

Philippine Data Sources 🇵🇭

Source | Data Types
PSA OpenSTAT | Demographics, labor, prices
BSP Statistics | Financial, monetary, banking
PAGASA | Weather, climate, disasters
DOH | Health statistics, epidemiology
PSE EDGE | Stock market, company filings
PhilGEPS | Government procurement, tenders, contracts

The Philippine Data Challenge

PIDS Census Data Harmonization Project

  • PIDS had to harmonize 5 decades (1970–2020) of Census of Population and Housing data
  • Found: inconsistent variable definitions across census rounds
  • Different classification schemes and measurement scales used over time
  • K-12 reform (2013) changed ALL education categories overnight — broke historical comparisons
  • New administrative regions over the decades required boundary adjustments in historical data

This is YOUR data. As Filipino data scientists, you'll be the ones cleaning Philippine government datasets.

Source: PIDS Press Release on Census Harmonization

SQL Fundamentals

SQL (Structured Query Language): The standard language for interacting with relational databases

-- Basic SELECT statement
SELECT column1, column2
FROM table_name
WHERE condition
ORDER BY column1;

SQL is essential for extracting data from databases in organizations

SELECT: Choosing Columns

-- Select all columns
SELECT * FROM employees;

-- Select specific columns
SELECT first_name, last_name, salary
FROM employees;

-- Select with alias
SELECT first_name AS name, 
       salary * 12 AS annual_salary
FROM employees;

WHERE: Filtering Rows

-- Comparison operators
SELECT * FROM products WHERE price > 1000;

-- Multiple conditions
SELECT * FROM employees 
WHERE department = 'IT' 
  AND salary >= 50000;

-- IN operator
SELECT * FROM regions 
WHERE province IN ('Cebu', 'Bohol', 'Leyte');

-- LIKE for pattern matching
SELECT * FROM customers 
WHERE name LIKE 'Maria%';

ORDER BY & LIMIT

-- Sort ascending (default)
SELECT * FROM products 
ORDER BY price;

-- Sort descending
SELECT * FROM employees 
ORDER BY salary DESC;

-- Multiple columns
SELECT * FROM sales 
ORDER BY region, date DESC;

-- Limit results (top N)
SELECT * FROM products 
ORDER BY price DESC 
LIMIT 10;

Aggregate Functions

-- Common aggregates
SELECT 
    COUNT(*) AS total_employees,
    AVG(salary) AS avg_salary,
    SUM(salary) AS total_payroll,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary
FROM employees;

Key Aggregates: COUNT, SUM, AVG, MIN, MAX, STDDEV

GROUP BY: Aggregating Groups

-- Group by single column
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

-- Group by multiple columns
SELECT region, product_category, 
       SUM(sales) AS total_sales
FROM transactions
GROUP BY region, product_category;

-- Filter groups with HAVING
SELECT department, COUNT(*) AS emp_count
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;

JOINs: Combining Tables

The four JOIN types (Venn-diagram view of tables A and B):

  • INNER: matches only
  • LEFT: all of A + matches
  • RIGHT: matches + all of B
  • FULL OUTER: everything from both

SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d
  ON e.dept_id = d.id;

JOIN Example: Philippine Regions

-- Combine province data with regional info
SELECT 
    p.province_name,
    r.region_name,
    p.population,
    p.land_area
FROM provinces p
LEFT JOIN regions r 
    ON p.region_code = r.region_code
WHERE r.island_group = 'Visayas'
ORDER BY p.population DESC;

What is a Subquery?

Subquery (also called inner query or nested query): A SQL query embedded inside another query. The inner query runs first, and its result is used by the outer query.

Nesting structure: the outer query (SELECT * FROM employees WHERE salary > ( ? )) depends on the inner query (SELECT AVG(salary) FROM employees), which runs first and supplies the value for ( ? ).

Two common uses: (1) Filter using a computed value, (2) Create a derived table as a data source

Subqueries

-- Subquery in WHERE
SELECT * FROM employees
WHERE salary > (
    SELECT AVG(salary) FROM employees
);

-- Subquery in FROM
SELECT dept, avg_sal
FROM (
    SELECT department AS dept, 
           AVG(salary) AS avg_sal
    FROM employees
    GROUP BY department
) AS dept_averages
WHERE avg_sal > 50000;

Case Study: Querying Regional Data

Find Visayas Regions with Above-Average Salary

SELECT r.region_name, s.median_salary
FROM salary_data s
JOIN regions r ON s.region_code = r.region_code
WHERE s.median_salary > (
    SELECT AVG(median_salary) FROM salary_data
)
AND r.island_group = 'Visayas'
ORDER BY s.median_salary DESC;
region_name | median_salary
Central Visayas (VII) | 14,200
Western Visayas (VI) | 13,500

Illustrative data based on PSA Labor Force Survey patterns

Think About It

Group Discussion

You have SQL access to a database with student grades, attendance, and personal information at UP Cebu. Your boss asks you to run a query joining grades with social media activity data to "predict student success."

Consider:

  • Data Privacy Act (RA 10173)
  • Consent & purpose limitation
  • Data minimization principle
  • Who controls the data?

Discuss:

  1. Is this query ethical?
  2. What data should you refuse to join?
  3. How would you push back on this request?

4 minutes

Python + SQL Integration

import pandas as pd
import sqlite3

# Connect to database
conn = sqlite3.connect('ph_data.db')

# Execute query and load to DataFrame
query = """
SELECT region, province, population
FROM census_2020
WHERE population > 1000000
ORDER BY population DESC
"""
df = pd.read_sql(query, conn)

# Close connection
conn.close()
print(df.head())

API Data Collection

import requests
import pandas as pd

# PSA OpenSTAT API example
url = "https://openstat.psa.gov.ph/api/data"
params = {
    'indicator': 'population',
    'year': 2020,
    'format': 'json'
}

response = requests.get(url, params=params)
data = response.json()

# Convert to DataFrame
df = pd.DataFrame(data['results'])
print(df.head())

Reading Various File Formats

import pandas as pd

# CSV files
df_csv = pd.read_csv('ph_population.csv')

# Excel files
df_excel = pd.read_excel('census_data.xlsx', 
                          sheet_name='2020')

# JSON files
df_json = pd.read_json('api_response.json')

# Multiple files
import glob
files = glob.glob('data/*.csv')
df = pd.concat([pd.read_csv(f) for f in files])

Lecture 5 Summary

Key Takeaways

  1. Multiple data collection methods exist
  2. The Philippines has rich public data sources
  3. SQL is essential for database extraction
  4. Python integrates seamlessly with SQL
  5. Always validate data after collection

Part 2

Data Cleaning & Transformation

Turning messy reality into analysis-ready data

Learning Objectives

Lecture 6: Data Cleaning & Transformation

  1. Identify and handle missing values
  2. Detect and treat outliers
  3. Perform data type conversions
  4. Apply data transformation techniques

Common Data Quality Issues

Issue Example Impact
Missing Values Empty cells, NaN Biased analysis
Duplicates Repeated rows Inflated counts
Inconsistent "Cebu" vs "CEBU" Wrong grouping
Outliers Age = 999 Skewed statistics
Wrong Types Dates as strings Failed operations

Spot the Data Quality Issues

Spot the Issues

Scenario: You received this student enrollment dataset. How many problems can you find?

university | program | year | gwa
UP Cebu | BSCS | 2024 | 1.75
up cebu | BSCS | 2024 | NaN
UP Cebu | BS CS | 2024 | 1.25
UP Cebu | BSCS | 2024 | 1.75
UP-Cebu | BSCS | 2025 | -0.5
UP Cebu | BSCS | twenty twenty-four | 2.00

Task: List each problem AND its data quality category (missing, inconsistent, duplicate, outlier, wrong type).

3 minutes

Exploring Data Quality

import pandas as pd

# Load Philippine economic data
df = pd.read_csv('ph_economic_indicators.csv')

# Basic info
print(df.info())

# Missing values
print(df.isnull().sum())

# Unique values per column
print(df.nunique())

# Statistical summary
print(df.describe())

Handling Missing Values

# Check missing values
print(df.isnull().sum())

# Drop rows with any missing
df_clean = df.dropna()

# Drop rows where specific column is missing
df_clean = df.dropna(subset=['income'])

# Fill with constant
df['income'] = df['income'].fillna(0)

# Fill with mean/median
df['income'] = df['income'].fillna(df['income'].median())

# Forward fill (time series); fillna(method='ffill') is deprecated
df['price'] = df['price'].ffill()

Worked Example: Missing Values

Philippine Provincial Income Data

Before (raw data)

province | monthly_income
Cebu | 15,000
Bohol | NaN
Leyte | 11,000
Siquijor | NaN
Negros Or. | 13,000

After (median fill)

province | monthly_income
Cebu | 15,000
Bohol | 13,000
Leyte | 11,000
Siquijor | 13,000
Negros Or. | 13,000

Available values: 15,000 + 11,000 + 13,000 = 39,000   |   Mean = 39,000 / 3 = 13,000   |   Median (sorted: 11K, 13K, 15K) = 13,000
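The fill above can be reproduced in pandas (column names follow the table):

```python
import pandas as pd

df = pd.DataFrame({
    "province": ["Cebu", "Bohol", "Leyte", "Siquijor", "Negros Or."],
    "monthly_income": [15000, None, 11000, None, 13000],
})

# Median of the available values (11K, 13K, 15K) is 13,000
median = df["monthly_income"].median()
df["monthly_income"] = df["monthly_income"].fillna(median)
print(df["monthly_income"].tolist())  # [15000.0, 13000.0, 11000.0, 13000.0, 13000.0]
```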

Why Median Beats Mean (Usually)

Philippine Monthly Salary Data (PHP): [8K, 10K, 12K, 13K, 14K, 15K, 18K, 20K, 50K, 500K]

Mean = (8 + 10 + 12 + 13 + 14 + 15 + 18 + 20 + 50 + 500) ÷ 10 = PHP 66,000

Median = (14 + 15) ÷ 2 = PHP 14,500

The mean says the "average" Filipino in this sample earns PHP 66,000/month.

But 9 out of 10 people earn less than PHP 66,000. The CEO's salary pulls the mean to a number that represents nobody.

This is why news reports about "average income" can be deeply misleading. Always check the median.
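A quick check of both statistics on the same sample in pandas:

```python
import pandas as pd

salaries = pd.Series([8_000, 10_000, 12_000, 13_000, 14_000,
                      15_000, 18_000, 20_000, 50_000, 500_000])

print(salaries.mean())    # 66000.0 -- pulled up by the single PHP 500K value
print(salaries.median())  # 14500.0 -- unaffected by the extreme value
```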

What is Imputation?

Imputation: The process of replacing missing values with substituted values estimated from the available data. Unlike dropping rows (which loses data), imputation preserves the dataset size.

Three common approaches:

  • Simple imputation — Mean, median, or mode replacement
  • Forward/backward fill — Use adjacent values (for time series)
  • Model-based — Use regression or k-NN to predict missing values

The right method depends on why the data is missing and the data type.
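A sketch of the first idea, plus a group-aware variant, in pandas. The `region`/`income` columns are made up for illustration, and the group-wise median fill is only a lightweight stand-in for full model-based imputation:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["Visayas", "Visayas", "Luzon", "Luzon"],
    "income": [15000.0, None, 30000.0, None],
})

# Simple imputation: one global statistic for every gap
global_fill = df["income"].fillna(df["income"].median())
print(global_fill.tolist())  # [15000.0, 22500.0, 30000.0, 22500.0]

# Group-aware imputation: fill from each region's own median,
# preserving the regional relationship in the data
group_fill = df.groupby("region")["income"].transform(lambda s: s.fillna(s.median()))
print(group_fill.tolist())   # [15000.0, 15000.0, 30000.0, 30000.0]
```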

Forward Fill: Visual Example

Forward Fill (ffill): Propagates the last observed non-null value forward. Best suited for time series where values persist until changed (e.g., stock prices, temperature).

Before

date | price
2024-01-01 | 150.0
2024-01-02 | NaN
2024-01-03 | NaN
2024-01-04 | 155.0
2024-01-05 | NaN

After (ffill)

date | price
2024-01-01 | 150.0
2024-01-02 | 150.0
2024-01-03 | 150.0
2024-01-04 | 155.0
2024-01-05 | 155.0
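The before/after above in pandas (`.ffill()` is the modern spelling of the deprecated `fillna(method='ffill')`):

```python
import pandas as pd

# Price series with gaps, indexed by date
prices = pd.Series(
    [150.0, None, None, 155.0, None],
    index=pd.date_range("2024-01-01", periods=5),
)

# Each NaN takes the last observed value
filled = prices.ffill()
print(filled.tolist())  # [150.0, 150.0, 150.0, 155.0, 155.0]
```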

When to Use Each Strategy

Strategy When to Use
Drop rows Few missing, random pattern
Mean/Median Numeric, roughly normal
Mode Categorical variables
Forward fill Time series data
Imputation Preserve relationships

Handling Duplicates

# Check for duplicates
print(f"Duplicates: {df.duplicated().sum()}")

# View duplicate rows
print(df[df.duplicated(keep=False)])

# Remove duplicates
df_clean = df.drop_duplicates()

# Remove based on specific columns
df_clean = df.drop_duplicates(
    subset=['id', 'date'], 
    keep='last'
)

Data Type Conversion

# Check current types
print(df.dtypes)

# Convert to numeric
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Convert to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Convert to category (memory efficient)
df['region'] = df['region'].astype('category')

# Convert to string
df['code'] = df['code'].astype(str)

String Cleaning

# Remove whitespace
df['name'] = df['name'].str.strip()

# Standardize case
df['city'] = df['city'].str.title()  # "cebu" → "Cebu"
df['code'] = df['code'].str.upper()  # "ph" → "PH"

# Replace values
df['province'] = df['province'].str.replace(
    'Cebu City', 'Cebu'
)

# Extract patterns
df['area_code'] = df['phone'].str.extract(r'\((\d+)\)')

Understanding Outliers

Outlier: A data point that is significantly different from other observations in the dataset. Outliers can be legitimate extreme values or errors, and they can heavily skew statistical measures like the mean.

Philippine Context

In Philippine salary data, a CEO earning PHP 5,000,000/month is a legitimate outlier when the national median salary is PHP 13,000/month. An "age" value of 999 is an error outlier.

Both affect your analysis, but require different handling strategies.

Quantiles, Percentiles, and IQR

Quantile: A value that divides your sorted data into equal-sized groups.

  • Percentile — Divides into 100 groups. "The 75th percentile means 75% of values are below this point."
  • Quartile — Divides into 4 groups: Q1 (25th %ile), Q2/Median (50th), Q3 (75th)

Interquartile Range (IQR):   $$\text{IQR} = Q_3 - Q_1$$

The IQR captures the middle 50% of the data. Values beyond 1.5 × IQR from Q1 or Q3 are considered outliers.

(Box plot: the box spans Q1 to Q3 with the median inside; whiskers extend to Q1 - 1.5 × IQR and Q3 + 1.5 × IQR; points beyond the whiskers are outliers.)

Understanding Z-Scores

Z-Score: Measures how many standard deviations a data point is from the mean. A Z-score of 0 means the value equals the mean; a Z-score of 3 means it is 3 standard deviations above.

$$z = \frac{x - \mu}{\sigma}$$

Where: \(x\) = observed value, \(\mu\) = mean, \(\sigma\) = standard deviation

Convention: Values with |z| > 3 are typically considered outliers.

Under a normal distribution, 99.7% of data falls within 3 standard deviations of the mean (the 68-95-99.7 rule).

Outlier Detection

IQR Method

Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = df[
    (df['salary'] < lower) | 
    (df['salary'] > upper)
]

Z-Score Method

from scipy import stats

z_scores = stats.zscore(df['salary'])
outliers = df[abs(z_scores) > 3]

Worked Example: IQR Method

Philippine Monthly Salary Data (PHP)

[8K, 10K, 12K, 13K, 14K, 15K, 18K, 20K, 50K, 500K]

Step 1: Q1 = 25th percentile = PHP 12,250 (pandas' default linear interpolation)

Step 2: Q3 = 75th percentile = PHP 19,500

Step 3: IQR = Q3 - Q1 = 19,500 - 12,250 = PHP 7,250

Step 4: Lower bound = Q1 - 1.5 × IQR = 12,250 - 10,875 = PHP 1,375

Step 5: Upper bound = Q3 + 1.5 × IQR = 19,500 + 10,875 = PHP 30,375

Outliers detected: PHP 50,000 and PHP 500,000 (above the upper bound of PHP 30,375)
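The steps above can be checked with pandas; note that `quantile` uses linear interpolation by default, and other percentile conventions can give slightly different quartile values:

```python
import pandas as pd

salaries = pd.Series([8_000, 10_000, 12_000, 13_000, 14_000,
                      15_000, 18_000, 20_000, 50_000, 500_000])

q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = salaries[(salaries < lower) | (salaries > upper)]
print(q1, q3)             # 12250.0 19500.0
print(outliers.tolist())  # [50000, 500000]
```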

Worked Example: Z-Score Method

Same Salary Dataset

Mean (μ) = PHP 66,000   |   Sample Std Dev (σ, ddof = 1) = PHP 152,957

For PHP 500,000:   z = (500,000 - 66,000) / 152,957 = 2.84

For PHP 50,000:    z = (50,000 - 66,000) / 152,957 = -0.10

Key insight: Z-scores are sensitive to extreme values! The mean and standard deviation are pulled by the very outlier we are trying to detect.

Value | IQR Method | Z-Score Method
PHP 50,000 | Outlier | Not outlier (z = -0.10)
PHP 500,000 | Outlier | Borderline (z = 2.84)

IQR is often more robust than Z-score for skewed data.
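A quick verification with pandas, which uses the sample standard deviation (`ddof=1`); `scipy.stats.zscore` defaults to `ddof=0` and yields slightly larger magnitudes:

```python
import pandas as pd

salaries = pd.Series([8_000, 10_000, 12_000, 13_000, 14_000,
                      15_000, 18_000, 20_000, 50_000, 500_000])

# z = (x - mean) / sample std dev for every value at once
z = (salaries - salaries.mean()) / salaries.std()

print(round(z.iloc[-1], 2))  # z for 500,000 -> 2.84
print(round(z.iloc[-2], 2))  # z for 50,000  -> -0.1
```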

Handling Outliers

Winsorization (Option 2 below): Replacing extreme values with the nearest non-outlier boundary value. Unlike removal, it preserves the data point but limits its influence.

# Option 1: Remove outliers
df_clean = df[
    (df['salary'] >= lower) & 
    (df['salary'] <= upper)
]

# Option 2: Cap outliers (winsorization)
df['salary'] = df['salary'].clip(lower=lower, upper=upper)

# Option 3: Transform (log)
import numpy as np
df['log_salary'] = np.log1p(df['salary'])

# Option 4: Keep but flag
df['is_outlier'] = (df['salary'] < lower) | (df['salary'] > upper)

Binning: Continuous to Categorical

Binning (also called discretization): Converting continuous numerical values into discrete categories. Useful for simplifying analysis, creating ML features, or visualizing distributions.

Before (continuous)

name | age
Juan | 22
Maria | 35
Pedro | 67
Ana | 8
Carlos | 45

After (binned)

name | age | age_group
Juan | 22 | Young Adult
Maria | 35 | Adult
Pedro | 67 | Senior
Ana | 8 | Child
Carlos | 45 | Middle Age

One-Hot & Label Encoding

One-Hot Encoding: Converts each category into a new binary (0/1) column. Each row gets exactly one "1".

Original

id | region
1 | Visayas
2 | Luzon
3 | Mindanao
4 | Visayas

One-Hot (pd.get_dummies)

id | Luzon | Mindanao | Visayas
1 | 0 | 0 | 1
2 | 1 | 0 | 0
3 | 0 | 1 | 0
4 | 0 | 0 | 1

Label Encoding assigns integers: Luzon=0, Mindanao=1, Visayas=2. Use for ordinal data (Low/Med/High). Use one-hot for nominal data (region names).

Data Transformation

# Create new columns
df['gdp_per_capita'] = df['gdp'] / df['population']

# Binning continuous variables
df['age_group'] = pd.cut(
    df['age'], 
    bins=[0, 18, 35, 50, 65, 100],
    labels=['Child', 'Young Adult', 'Adult', 'Middle Age', 'Senior']
)

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['region'])

# Label encoding
df['region_code'] = df['region'].astype('category').cat.codes
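One caveat with `.cat.codes`: by default categories are ordered alphabetically, which can scramble an ordinal scale (e.g. High=0, Low=1, Medium=2). Declaring the order explicitly fixes this; the `income_level` column here is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({"income_level": ["High", "Low", "Medium", "Low"]})

# Declare the order so codes respect the Low < Medium < High scale
df["income_level"] = pd.Categorical(
    df["income_level"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)
print(df["income_level"].cat.codes.tolist())  # [2, 0, 1, 0]
```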

Pivot vs Melt: Visual

Pivot (Long → Wide): Spreads values from one column into multiple new columns. Melt (Wide → Long): Gathers multiple columns into variable + value pairs.

Wide format (region | 2020 | 2021 | 2022): fewer rows, more columns. Long format (region | year | sales): more rows, fewer columns. melt() converts wide → long; pivot() converts long → wide.

Wide

region | 2020 | 2021 | 2022
Visayas | 500K | 550K | 600K
Luzon | 800K | 850K | 900K

Long (after melt)

region | year | sales
Visayas | 2020 | 500K
Visayas | 2021 | 550K
Luzon | 2020 | 800K
Luzon | 2021 | 850K

Reshaping: Pivot & Melt

# Wide to long (melt)
df_long = pd.melt(
    df, 
    id_vars=['region', 'year'],
    value_vars=['q1_sales', 'q2_sales', 'q3_sales', 'q4_sales'],
    var_name='quarter',
    value_name='sales'
)

# Long to wide (pivot)
df_wide = df_long.pivot_table(
    index='region',
    columns='year',
    values='sales',
    aggfunc='sum'
)

Merging DataFrames

# Inner merge (SQL INNER JOIN)
df_merged = pd.merge(df1, df2, on='id', how='inner')

# Left merge (SQL LEFT JOIN)
df_merged = pd.merge(
    population_df, 
    income_df,
    left_on='province_code',
    right_on='prov_id',
    how='left'
)

# Concatenate (stack)
df_combined = pd.concat([df_2020, df_2021, df_2022])

Complete Cleaning Pipeline

def clean_philippine_data(df):
    """Standard cleaning pipeline for PH datasets"""
    # 1. Remove duplicates
    df = df.drop_duplicates()
    
    # 2. Standardize column names
    df.columns = df.columns.str.lower().str.replace(' ', '_')
    
    # 3. Handle missing values
    df['income'] = df['income'].fillna(df['income'].median())
    
    # 4. Standardize text
    df['province'] = df['province'].str.title().str.strip()
    
    # 5. Convert types
    df['date'] = pd.to_datetime(df['date'])
    
    # 6. Remove outliers
    df = df[df['age'].between(0, 120)]
    
    return df

Your Turn: Write the Pipeline

Clean This Dataset

Scenario: You received a CSV of Philippine typhoon damage reports from NDRRMC with these issues:

  • Regions are mixed: "NCR", "National Capital Region", "Metro Manila"
  • Damage amounts have prefix and commas: "PHP 1,500,000"
  • Dates are mixed: "2024-01-15", "January 15, 2024", "15/01/2024"
  • 12% of municipality names are NaN
  • One province has damage = -500,000 (negative)

Task 1: Write Python pseudocode for each cleaning step

Task 2: What would you do about the missing municipality names? Drop? Fill? How?

5 minutes

Timeline of Philippine Data Challenges

Why Philippine datasets are harder to clean than most

  • 2000: New provinces created, causing regional boundary changes
  • 2013: K-12 reform changed all education categories overnight
  • 2020: COVID-19 delayed the census and forced new collection methods
  • 2022: ePhilID introduced a new ID system and a new data schema

Each event broke backward compatibility in government datasets.

Every one of these events means you can't naively merge datasets across time periods without first harmonizing definitions, codes, and boundaries.

Case Study: Cleaning PSA Data

Common Issues in Philippine Government Data

# 1. Region names inconsistent (abbreviations vs full names)
region_mapping = {
    'NCR': 'National Capital Region',
    'CAR': 'Cordillera Administrative Region',
    'BARMM': 'Bangsamoro Autonomous Region'
}
df['region'] = df['region'].replace(region_mapping)

# 2. Numbers stored with commas as strings
df['population'] = df['population'].str.replace(',', '').astype(int)

# 3. Date formats vary (day-first is common in PH)
df['census_date'] = pd.to_datetime(df['date'], dayfirst=True)
region (before) | region (after)
NCR | National Capital Region
CAR | Cordillera Administrative Region
BARMM | Bangsamoro Autonomous Region

Data Validation

def validate_data(df):
    """Validate cleaned data"""
    assert df.duplicated().sum() == 0, "Duplicates found!"
    assert df.isnull().sum().sum() == 0, "Missing values found!"
    assert df['age'].between(0, 120).all(), "Invalid ages!"
    assert df['population'].ge(0).all(), "Negative population!"
    
    print("✅ All validations passed!")
    return True

# Run validation
validate_data(df_clean)

Lecture 6 Summary

Key Takeaways

  1. Always explore data quality first
  2. Choose appropriate missing value strategies
  3. Detect outliers with IQR or Z-score
  4. Transform data to suit analysis needs
  5. Validate data after cleaning

Week 3 Complete!

Remember: NASA lost $327M because someone didn't convert units. Samsung lost $105B because someone picked the wrong dropdown. The skills you learned this week prevent those disasters.

You can now:

  • Collect data from any source — APIs, SQL, files, scraping
  • Clean the messiest datasets — missing values, duplicates, inconsistencies
  • Transform data into any shape — pivot, melt, encode, bin
  • Validate your results — trust but verify

Next Week: Exploratory Data Analysis

Lab 3: Data Wrangling Practice

This Week's Lab

  • Clean a messy Philippine Census dataset
  • Write SQL queries to extract regional data
  • Handle missing values and outliers
  • Transform data for analysis

Due: Before next lecture