CMSC 178DA

Week 3: Data Wrangling


Data Wrangling

CMSC 178DA - Week 03

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

When Bad Data Costs $327 Million

NASA Mars Climate Orbiter (1999)

  • 1998: NASA launches a $125 million probe to study Mars climate
  • Lockheed Martin calculates thruster force in pound-force seconds
  • NASA's navigation team expects newton-seconds (off by factor of 4.45x)
  • Sep 23, 1999: Orbiter enters Mars atmosphere 170 km too low. Destroyed.
  • Total mission loss: $327 million

Navigators noticed the discrepancy months earlier but "did not follow the rules about filling out the form" to flag it.

A single unit conversion error — the most basic data cleaning task — destroyed a spacecraft.
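A minimal sketch of the conversion that was skipped (this is not NASA's actual code; `thruster_impulse_si` is an illustrative name):

```python
# 1 pound-force second = 4.448222 newton-seconds: the factor the software omitted.
LBF_S_TO_N_S = 4.448222

def thruster_impulse_si(impulse_lbf_s: float) -> float:
    """Convert thruster impulse from pound-force seconds to newton-seconds."""
    return impulse_lbf_s * LBF_S_TO_N_S

# A value read without conversion is off by ~4.45x.
print(thruster_impulse_si(10.0))  # 44.48222
```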

Source: Wikipedia · IEEE Spectrum (Oberg, 1999)

Learning Objectives

Lecture 5: Data Collection & SQL

  1. Understand data collection methodologies
  2. Master SQL fundamentals for data extraction
  3. Work with Philippine government data sources
  4. Apply best practices for data acquisition

Why Data Wrangling?

Data Wrangling: The process of transforming raw data into a format suitable for analysis

  • 80% of analytics time is spent on data prep
  • 60% of data scientists say cleaning is the most tedious task
  • GIGO: "Garbage in, garbage out"

Samsung Securities (April 2018)

An employee typed "shares" instead of "won" when processing dividends → issued 2.8 billion ghost shares worth ~$105 billion → stock dropped 11% in one day. Took 37 minutes to catch. 16 employees sold 5 million shares before the fix.

Source: Wikipedia

Data wrangling isn't boring busywork. It's the difference between insight and disaster.

The Data Wrangling Pipeline

Every analytics project follows this sequence

Collect (APIs, SQL, files, surveys, scraping) → Clean (missing values, duplicates, outliers, type errors) → Transform (reshape, encode, merge, derive features) → Validate (assertions, tests, quality checks). Today: we cover all four stages.

Data Collection Methods

Method | Description | Example
APIs | Programmatic access | Twitter API, PSA OpenSTAT
Web Scraping | Extract from websites | News articles, prices
Databases | SQL queries | Company data warehouse
Files | CSV, Excel, JSON | Government datasets
Surveys | Primary collection | Google Forms, interviews

Philippine Data Sources 🇵🇭

Source | Data Types
PSA OpenSTAT | Demographics, labor, prices
BSP Statistics | Financial, monetary, banking
PAGASA | Weather, climate, disasters
DOH | Health statistics, epidemiology
PSE EDGE | Stock market, company filings
PhilGEPS | Government procurement, tenders, contracts

The Philippine Data Challenge

PIDS Census Data Harmonization Project

  • PIDS had to harmonize 5 decades (1970–2020) of Census of Population and Housing data
  • Found: inconsistent variable definitions across census rounds
  • Different classification schemes and measurement scales used over time
  • K-12 reform (2013) changed ALL education categories overnight — broke historical comparisons
  • New administrative regions over the decades required boundary adjustments in historical data

This is YOUR data. As Filipino data scientists, you'll be the ones cleaning Philippine government datasets.

Source: PIDS Press Release on Census Harmonization

SQL Fundamentals

SQL (Structured Query Language): The standard language for interacting with relational databases

-- Basic SELECT statement
SELECT column1, column2
FROM table_name
WHERE condition
ORDER BY column1;

SQL is essential for extracting data from databases in organizations

SELECT: Choosing Columns

-- Select all columns
SELECT * FROM employees;

-- Select specific columns
SELECT first_name, last_name, salary
FROM employees;

-- Select with alias
SELECT first_name AS name, 
       salary * 12 AS annual_salary
FROM employees;

WHERE: Filtering Rows

-- Comparison operators
SELECT * FROM products WHERE price > 1000;

-- Multiple conditions
SELECT * FROM employees 
WHERE department = 'IT' 
  AND salary >= 50000;

-- IN operator
SELECT * FROM regions 
WHERE province IN ('Cebu', 'Bohol', 'Leyte');

-- LIKE for pattern matching
SELECT * FROM customers 
WHERE name LIKE 'Maria%';

ORDER BY & LIMIT

-- Sort ascending (default)
SELECT * FROM products 
ORDER BY price;

-- Sort descending
SELECT * FROM employees 
ORDER BY salary DESC;

-- Multiple columns
SELECT * FROM sales 
ORDER BY region, date DESC;

-- Limit results (top N)
SELECT * FROM products 
ORDER BY price DESC 
LIMIT 10;

Aggregate Functions

-- Common aggregates
SELECT 
    COUNT(*) AS total_employees,
    AVG(salary) AS avg_salary,
    SUM(salary) AS total_payroll,
    MIN(salary) AS min_salary,
    MAX(salary) AS max_salary
FROM employees;

Key Aggregates: COUNT, SUM, AVG, MIN, MAX, STDDEV

GROUP BY: Aggregating Groups

-- Group by single column
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

-- Group by multiple columns
SELECT region, product_category, 
       SUM(sales) AS total_sales
FROM transactions
GROUP BY region, product_category;

-- Filter groups with HAVING
SELECT department, COUNT(*) AS emp_count
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;

JOINs: Combining Tables

The four JOIN types (Venn-diagram view of tables A and B):

  • INNER: matches only
  • LEFT: all of A + matches
  • RIGHT: matches + all of B
  • FULL OUTER: everything from both

SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d
  ON e.dept_id = d.id;

JOIN Example: Philippine Regions

-- Combine province data with regional info
SELECT 
    p.province_name,
    r.region_name,
    p.population,
    p.land_area
FROM provinces p
LEFT JOIN regions r 
    ON p.region_code = r.region_code
WHERE r.island_group = 'Visayas'
ORDER BY p.population DESC;

What is a Subquery?

Subquery (also called inner query or nested query): A SQL query embedded inside another query. The inner query runs first, and its result is used by the outer query.

Nesting structure: the outer query (SELECT * FROM employees WHERE salary > ( ? )) depends on the inner query (SELECT AVG(salary) FROM employees), which runs first and supplies the value for ( ? ).

Two common uses: (1) Filter using a computed value, (2) Create a derived table as a data source

Subqueries

-- Subquery in WHERE
SELECT * FROM employees
WHERE salary > (
    SELECT AVG(salary) FROM employees
);

-- Subquery in FROM
SELECT dept, avg_sal
FROM (
    SELECT department AS dept, 
           AVG(salary) AS avg_sal
    FROM employees
    GROUP BY department
) AS dept_averages
WHERE avg_sal > 50000;

Case Study: Querying Regional Data

Find Visayas Regions with Above-Average Salary

SELECT r.region_name, s.median_salary
FROM salary_data s
JOIN regions r ON s.region_code = r.region_code
WHERE s.median_salary > (
    SELECT AVG(median_salary) FROM salary_data
)
AND r.island_group = 'Visayas'
ORDER BY s.median_salary DESC;
region_name | median_salary
Central Visayas (VII) | 14,200
Western Visayas (VI) | 13,500

Illustrative data based on PSA Labor Force Survey patterns

Think About It

Group Discussion

You have SQL access to a database with student grades, attendance, and personal information at UP Cebu. Your boss asks you to run a query joining grades with social media activity data to "predict student success."

Consider:

  • Data Privacy Act (RA 10173)
  • Consent & purpose limitation
  • Data minimization principle
  • Who controls the data?

Discuss:

  1. Is this query ethical?
  2. What data should you refuse to join?
  3. How would you push back on this request?

4 minutes

Python + SQL Integration

import pandas as pd
import sqlite3

# Connect to database
conn = sqlite3.connect('ph_data.db')

# Execute query and load to DataFrame
query = """
SELECT region, province, population
FROM census_2020
WHERE population > 1000000
ORDER BY population DESC
"""
df = pd.read_sql(query, conn)

# Close connection
conn.close()
print(df.head())

API Data Collection

import requests
import pandas as pd

# PSA OpenSTAT API example
url = "https://openstat.psa.gov.ph/api/data"
params = {
    'indicator': 'population',
    'year': 2020,
    'format': 'json'
}

response = requests.get(url, params=params)
data = response.json()

# Convert to DataFrame
df = pd.DataFrame(data['results'])
print(df.head())

Reading Various File Formats

import pandas as pd

# CSV files
df_csv = pd.read_csv('ph_population.csv')

# Excel files
df_excel = pd.read_excel('census_data.xlsx', 
                          sheet_name='2020')

# JSON files
df_json = pd.read_json('api_response.json')

# Multiple files
import glob
files = glob.glob('data/*.csv')
df = pd.concat([pd.read_csv(f) for f in files])

Lecture 5 Summary

Key Takeaways

  1. Multiple data collection methods exist
  2. The Philippines has rich public data sources
  3. SQL is essential for database extraction
  4. Python integrates seamlessly with SQL
  5. Always validate data after collection

Part 2

Data Cleaning & Transformation

Turning messy reality into analysis-ready data

Learning Objectives

Lecture 6: Data Cleaning & Transformation

  1. Identify and handle missing values
  2. Detect and treat outliers
  3. Perform data type conversions
  4. Apply data transformation techniques

Common Data Quality Issues

Issue Example Impact
Missing Values Empty cells, NaN Biased analysis
Duplicates Repeated rows Inflated counts
Inconsistent "Cebu" vs "CEBU" Wrong grouping
Outliers Age = 999 Skewed statistics
Wrong Types Dates as strings Failed operations

Spot the Data Quality Issues

Spot the Issues

Scenario: You received this student enrollment dataset. How many problems can you find?

university | program | year | gwa
UP Cebu | BSCS | 2024 | 1.75
up cebu | BSCS | 2024 | NaN
UP Cebu | BS CS | 2024 | 1.25
UP Cebu | BSCS | 2024 | 1.75
UP-Cebu | BSCS | 2025 | -0.5
UP Cebu | BSCS | twenty twenty-four | 2.00

Task: List each problem AND its data quality category (missing, inconsistent, duplicate, outlier, wrong type).

3 minutes

Exploring Data Quality

import pandas as pd

# Load Philippine economic data
df = pd.read_csv('ph_economic_indicators.csv')

# Basic info
print(df.info())

# Missing values
print(df.isnull().sum())

# Unique values per column
print(df.nunique())

# Statistical summary
print(df.describe())

Handling Missing Values

# Check missing values
print(df.isnull().sum())

# Drop rows with any missing
df_clean = df.dropna()

# Drop rows where specific column is missing
df_clean = df.dropna(subset=['income'])

# Fill with constant
df['income'] = df['income'].fillna(0)

# Fill with mean/median
df['income'] = df['income'].fillna(df['income'].median())

# Forward fill (time series); fillna(method='ffill') is deprecated
df['price'] = df['price'].ffill()

Worked Example: Missing Values

Philippine Provincial Income Data

Before (raw data)

province | monthly_income
Cebu | 15,000
Bohol | NaN
Leyte | 11,000
Siquijor | NaN
Negros Or. | 13,000

After (median fill)

province | monthly_income
Cebu | 15,000
Bohol | 13,000
Leyte | 11,000
Siquijor | 13,000
Negros Or. | 13,000

Available values: 15,000 + 11,000 + 13,000 = 39,000   |   Mean = 39,000 / 3 = 13,000   |   Median (sorted: 11K, 13K, 15K) = 13,000
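The fill above can be reproduced in pandas (column names follow the table):

```python
import pandas as pd

df = pd.DataFrame({
    "province": ["Cebu", "Bohol", "Leyte", "Siquijor", "Negros Or."],
    "monthly_income": [15000, None, 11000, None, 13000],
})

# Median of the available values (11K, 13K, 15K) is 13,000
median = df["monthly_income"].median()
df["monthly_income"] = df["monthly_income"].fillna(median)
print(df["monthly_income"].tolist())  # [15000.0, 13000.0, 11000.0, 13000.0, 13000.0]
```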

Why Median Beats Mean (Usually)

Philippine Monthly Salary Data (PHP): [8K, 10K, 12K, 13K, 14K, 15K, 18K, 20K, 50K, 500K]

Mean = (8 + 10 + 12 + 13 + 14 + 15 + 18 + 20 + 50 + 500) ÷ 10 = PHP 66,000

Median = (14 + 15) ÷ 2 = PHP 14,500

The mean says the "average" Filipino in this sample earns PHP 66,000/month.

But 9 out of 10 people earn less than PHP 66,000. The CEO's salary pulls the mean to a number that represents nobody.

This is why news reports about "average income" can be deeply misleading. Always check the median.
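A quick check of both statistics on the same sample in pandas:

```python
import pandas as pd

salaries = pd.Series([8_000, 10_000, 12_000, 13_000, 14_000,
                      15_000, 18_000, 20_000, 50_000, 500_000])

print(salaries.mean())    # 66000.0 -- pulled up by the single PHP 500K value
print(salaries.median())  # 14500.0 -- unaffected by the extreme value
```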

What is Imputation?

Imputation: The process of replacing missing values with substituted values estimated from the available data. Unlike dropping rows (which loses data), imputation preserves the dataset size.

Three common approaches:

  • Simple imputation — Mean, median, or mode replacement
  • Forward/backward fill — Use adjacent values (for time series)
  • Model-based — Use regression or k-NN to predict missing values

The right method depends on why the data is missing and the data type.
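A sketch of the first idea, plus a group-aware variant, in pandas. The `region`/`income` columns are made up for illustration, and the group-wise median fill is only a lightweight stand-in for full model-based imputation:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["Visayas", "Visayas", "Luzon", "Luzon"],
    "income": [15000.0, None, 30000.0, None],
})

# Simple imputation: one global statistic for every gap
global_fill = df["income"].fillna(df["income"].median())
print(global_fill.tolist())  # [15000.0, 22500.0, 30000.0, 22500.0]

# Group-aware imputation: fill from each region's own median,
# preserving the regional relationship in the data
group_fill = df.groupby("region")["income"].transform(lambda s: s.fillna(s.median()))
print(group_fill.tolist())   # [15000.0, 15000.0, 30000.0, 30000.0]
```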

Forward Fill: Visual Example

Forward Fill (ffill): Propagates the last observed non-null value forward. Best suited for time series where values persist until changed (e.g., stock prices, temperature).

Before

date | price
2024-01-01 | 150.0
2024-01-02 | NaN
2024-01-03 | NaN
2024-01-04 | 155.0
2024-01-05 | NaN

After (ffill)

date | price
2024-01-01 | 150.0
2024-01-02 | 150.0
2024-01-03 | 150.0
2024-01-04 | 155.0
2024-01-05 | 155.0
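The before/after above in pandas (`.ffill()` is the modern spelling of the deprecated `fillna(method='ffill')`):

```python
import pandas as pd

# Price series with gaps, indexed by date
prices = pd.Series(
    [150.0, None, None, 155.0, None],
    index=pd.date_range("2024-01-01", periods=5),
)

# Each NaN takes the last observed value
filled = prices.ffill()
print(filled.tolist())  # [150.0, 150.0, 150.0, 155.0, 155.0]
```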

When to Use Each Strategy

Strategy When to Use
Drop rows Few missing, random pattern
Mean/Median Numeric, roughly normal
Mode Categorical variables
Forward fill Time series data
Imputation Preserve relationships

Handling Duplicates

# Check for duplicates
print(f"Duplicates: {df.duplicated().sum()}")

# View duplicate rows
print(df[df.duplicated(keep=False)])

# Remove duplicates
df_clean = df.drop_duplicates()

# Remove based on specific columns
df_clean = df.drop_duplicates(
    subset=['id', 'date'], 
    keep='last'
)

Data Type Conversion

# Check current types
print(df.dtypes)

# Convert to numeric
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Convert to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Convert to category (memory efficient)
df['region'] = df['region'].astype('category')

# Convert to string
df['code'] = df['code'].astype(str)

String Cleaning

# Remove whitespace
df['name'] = df['name'].str.strip()

# Standardize case
df['city'] = df['city'].str.title()  # "cebu" → "Cebu"
df['code'] = df['code'].str.upper()  # "ph" → "PH"

# Replace values
df['province'] = df['province'].str.replace(
    'Cebu City', 'Cebu'
)

# Extract patterns
df['area_code'] = df['phone'].str.extract(r'\((\d+)\)')

Understanding Outliers

Outlier: A data point that is significantly different from other observations in the dataset. Outliers can be legitimate extreme values or errors, and they can heavily skew statistical measures like the mean.

Philippine Context

In Philippine salary data, a CEO earning PHP 5,000,000/month is a legitimate outlier when the national median salary is PHP 13,000/month. An "age" value of 999 is an error outlier.

Both affect your analysis, but require different handling strategies.

Quantiles, Percentiles, and IQR

Quantile: A value that divides your sorted data into equal-sized groups.

  • Percentile — Divides into 100 groups. "The 75th percentile means 75% of values are below this point."
  • Quartile — Divides into 4 groups: Q1 (25th %ile), Q2/Median (50th), Q3 (75th)

Interquartile Range (IQR):   $$\text{IQR} = Q_3 - Q_1$$

The IQR captures the middle 50% of the data. Values beyond 1.5 × IQR from Q1 or Q3 are considered outliers.

(Box plot: the box spans Q1 to Q3 with the median inside; whiskers extend to Q1 - 1.5 × IQR and Q3 + 1.5 × IQR; points beyond the whiskers are outliers.)

Understanding Z-Scores

Z-Score: Measures how many standard deviations a data point is from the mean. A Z-score of 0 means the value equals the mean; a Z-score of 3 means it is 3 standard deviations above.

$$z = \frac{x - \mu}{\sigma}$$

Where: \(x\) = observed value, \(\mu\) = mean, \(\sigma\) = standard deviation

Convention: Values with |z| > 3 are typically considered outliers.

Under a normal distribution, 99.7% of data falls within 3 standard deviations of the mean (the 68-95-99.7 rule).

Outlier Detection

IQR Method

Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = df[
    (df['salary'] < lower) | 
    (df['salary'] > upper)
]

Z-Score Method

from scipy import stats

z_scores = stats.zscore(df['salary'])
outliers = df[abs(z_scores) > 3]

Worked Example: IQR Method

Philippine Monthly Salary Data (PHP)

[8K, 10K, 12K, 13K, 14K, 15K, 18K, 20K, 50K, 500K]

Step 1: Q1 = 25th percentile = PHP 12,250 (pandas' default linear interpolation)

Step 2: Q3 = 75th percentile = PHP 19,500

Step 3: IQR = Q3 - Q1 = 19,500 - 12,250 = PHP 7,250

Step 4: Lower bound = Q1 - 1.5 × IQR = 12,250 - 10,875 = PHP 1,375

Step 5: Upper bound = Q3 + 1.5 × IQR = 19,500 + 10,875 = PHP 30,375

Outliers detected: PHP 50,000 and PHP 500,000 (above the upper bound of PHP 30,375)
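The steps above can be checked with pandas; note that `quantile` uses linear interpolation by default, and other percentile conventions can give slightly different quartile values:

```python
import pandas as pd

salaries = pd.Series([8_000, 10_000, 12_000, 13_000, 14_000,
                      15_000, 18_000, 20_000, 50_000, 500_000])

q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = salaries[(salaries < lower) | (salaries > upper)]
print(q1, q3)             # 12250.0 19500.0
print(outliers.tolist())  # [50000, 500000]
```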

Worked Example: Z-Score Method

Same Salary Dataset

Mean (μ) = PHP 66,000   |   Sample Std Dev (σ, ddof = 1) = PHP 152,957

For PHP 500,000:   z = (500,000 - 66,000) / 152,957 = 2.84

For PHP 50,000:    z = (50,000 - 66,000) / 152,957 = -0.10

Key insight: Z-scores are sensitive to extreme values! The mean and standard deviation are pulled by the very outlier we are trying to detect.

Value | IQR Method | Z-Score Method
PHP 50,000 | Outlier | Not outlier (z = -0.10)
PHP 500,000 | Outlier | Borderline (z = 2.84)

IQR is often more robust than Z-score for skewed data.
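A quick verification with pandas, which uses the sample standard deviation (`ddof=1`); `scipy.stats.zscore` defaults to `ddof=0` and yields slightly larger magnitudes:

```python
import pandas as pd

salaries = pd.Series([8_000, 10_000, 12_000, 13_000, 14_000,
                      15_000, 18_000, 20_000, 50_000, 500_000])

# z = (x - mean) / sample std dev for every value at once
z = (salaries - salaries.mean()) / salaries.std()

print(round(z.iloc[-1], 2))  # z for 500,000 -> 2.84
print(round(z.iloc[-2], 2))  # z for 50,000  -> -0.1
```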

Handling Outliers

Winsorization (Option 2 below): Replacing extreme values with the nearest non-outlier boundary value. Unlike removal, it preserves the data point but limits its influence.

# Option 1: Remove outliers
df_clean = df[
    (df['salary'] >= lower) & 
    (df['salary'] <= upper)
]

# Option 2: Cap outliers (winsorization)
df['salary'] = df['salary'].clip(lower=lower, upper=upper)

# Option 3: Transform (log)
import numpy as np
df['log_salary'] = np.log1p(df['salary'])

# Option 4: Keep but flag
df['is_outlier'] = (df['salary'] < lower) | (df['salary'] > upper)

Binning: Continuous to Categorical

Binning (also called discretization): Converting continuous numerical values into discrete categories. Useful for simplifying analysis, creating ML features, or visualizing distributions.

Before (continuous)

name | age
Juan | 22
Maria | 35
Pedro | 67
Ana | 8
Carlos | 45

After (binned)

name | age | age_group
Juan | 22 | Young Adult
Maria | 35 | Adult
Pedro | 67 | Senior
Ana | 8 | Child
Carlos | 45 | Middle Age

One-Hot & Label Encoding

One-Hot Encoding: Converts each category into a new binary (0/1) column. Each row gets exactly one "1".

Original

id | region
1 | Visayas
2 | Luzon
3 | Mindanao
4 | Visayas

One-Hot (pd.get_dummies)

id | Luzon | Mindanao | Visayas
1 | 0 | 0 | 1
2 | 1 | 0 | 0
3 | 0 | 1 | 0
4 | 0 | 0 | 1

Label Encoding assigns integers: Luzon=0, Mindanao=1, Visayas=2. Use for ordinal data (Low/Med/High). Use one-hot for nominal data (region names).

Data Transformation

# Create new columns
df['gdp_per_capita'] = df['gdp'] / df['population']

# Binning continuous variables
df['age_group'] = pd.cut(
    df['age'], 
    bins=[0, 18, 35, 50, 65, 100],
    labels=['Child', 'Young Adult', 'Adult', 'Middle Age', 'Senior']
)

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['region'])

# Label encoding
df['region_code'] = df['region'].astype('category').cat.codes
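One caveat with `.cat.codes`: by default categories are ordered alphabetically, which can scramble an ordinal scale (e.g. High=0, Low=1, Medium=2). Declaring the order explicitly fixes this; the `income_level` column here is a made-up example:

```python
import pandas as pd

df = pd.DataFrame({"income_level": ["High", "Low", "Medium", "Low"]})

# Declare the order so codes respect the Low < Medium < High scale
df["income_level"] = pd.Categorical(
    df["income_level"],
    categories=["Low", "Medium", "High"],
    ordered=True,
)
print(df["income_level"].cat.codes.tolist())  # [2, 0, 1, 0]
```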

Pivot vs Melt: Visual

Pivot (Long → Wide): Spreads values from one column into multiple new columns. Melt (Wide → Long): Gathers multiple columns into variable + value pairs.

Wide format (region | 2020 | 2021 | 2022): fewer rows, more columns. Long format (region | year | sales): more rows, fewer columns. melt() converts wide → long; pivot() converts long → wide.

Wide

region | 2020 | 2021 | 2022
Visayas | 500K | 550K | 600K
Luzon | 800K | 850K | 900K

Long (after melt)

region | year | sales
Visayas | 2020 | 500K
Visayas | 2021 | 550K
Luzon | 2020 | 800K
Luzon | 2021 | 850K

Reshaping: Pivot & Melt

# Wide to long (melt)
df_long = pd.melt(
    df, 
    id_vars=['region', 'year'],
    value_vars=['q1_sales', 'q2_sales', 'q3_sales', 'q4_sales'],
    var_name='quarter',
    value_name='sales'
)

# Long to wide (pivot)
df_wide = df_long.pivot_table(
    index='region',
    columns='year',
    values='sales',
    aggfunc='sum'
)

Merging DataFrames

# Inner merge (SQL INNER JOIN)
df_merged = pd.merge(df1, df2, on='id', how='inner')

# Left merge (SQL LEFT JOIN)
df_merged = pd.merge(
    population_df, 
    income_df,
    left_on='province_code',
    right_on='prov_id',
    how='left'
)

# Concatenate (stack)
df_combined = pd.concat([df_2020, df_2021, df_2022])

Complete Cleaning Pipeline

def clean_philippine_data(df):
    """Standard cleaning pipeline for PH datasets"""
    # 1. Remove duplicates
    df = df.drop_duplicates()
    
    # 2. Standardize column names
    df.columns = df.columns.str.lower().str.replace(' ', '_')
    
    # 3. Handle missing values
    df['income'] = df['income'].fillna(df['income'].median())
    
    # 4. Standardize text
    df['province'] = df['province'].str.title().str.strip()
    
    # 5. Convert types
    df['date'] = pd.to_datetime(df['date'])
    
    # 6. Remove outliers
    df = df[df['age'].between(0, 120)]
    
    return df

Your Turn: Write the Pipeline

Clean This Dataset

Scenario: You received a CSV of Philippine typhoon damage reports from NDRRMC with these issues:

  • Regions are mixed: "NCR", "National Capital Region", "Metro Manila"
  • Damage amounts have prefix and commas: "PHP 1,500,000"
  • Dates are mixed: "2024-01-15", "January 15, 2024", "15/01/2024"
  • 12% of municipality names are NaN
  • One province has damage = -500,000 (negative)

Task 1: Write Python pseudocode for each cleaning step

Task 2: What would you do about the missing municipality names? Drop? Fill? How?

5 minutes

Timeline of Philippine Data Challenges

Why Philippine datasets are harder to clean than most

  • 2000: New provinces created, causing regional boundary changes
  • 2013: K-12 reform changed all education categories overnight
  • 2020: COVID-19 delayed the census and forced new collection methods
  • 2022: ePhilID introduced a new ID system and a new data schema

Each event broke backward compatibility in government datasets.

Every one of these events means you can't naively merge datasets across time periods without first harmonizing definitions, codes, and boundaries.

Case Study: Cleaning PSA Data

Common Issues in Philippine Government Data

# 1. Region names inconsistent (abbreviations vs full names)
region_mapping = {
    'NCR': 'National Capital Region',
    'CAR': 'Cordillera Administrative Region',
    'BARMM': 'Bangsamoro Autonomous Region'
}
df['region'] = df['region'].replace(region_mapping)

# 2. Numbers stored with commas as strings
df['population'] = df['population'].str.replace(',', '').astype(int)

# 3. Date formats vary (day-first is common in PH)
df['census_date'] = pd.to_datetime(df['date'], dayfirst=True)
region (before) | region (after)
NCR | National Capital Region
CAR | Cordillera Administrative Region
BARMM | Bangsamoro Autonomous Region

Data Validation

def validate_data(df):
    """Validate cleaned data"""
    assert df.duplicated().sum() == 0, "Duplicates found!"
    assert df.isnull().sum().sum() == 0, "Missing values found!"
    assert df['age'].between(0, 120).all(), "Invalid ages!"
    assert df['population'].ge(0).all(), "Negative population!"
    
    print("✅ All validations passed!")
    return True

# Run validation
validate_data(df_clean)

Lecture 6 Summary

Key Takeaways

  1. Always explore data quality first
  2. Choose appropriate missing value strategies
  3. Detect outliers with IQR or Z-score
  4. Transform data to suit analysis needs
  5. Validate data after cleaning

Week 3 Complete!

Remember: NASA lost $327M because someone didn't convert units. Samsung lost $105B because someone picked the wrong dropdown. The skills you learned this week prevent those disasters.

You can now:

  • Collect data from any source — APIs, SQL, files, scraping
  • Clean the messiest datasets — missing values, duplicates, inconsistencies
  • Transform data into any shape — pivot, melt, encode, bin
  • Validate your results — trust but verify

Next Week: Exploratory Data Analysis

Lab 3: Data Wrangling Practice

This Week's Lab

  • Clean a messy Philippine Census dataset
  • Write SQL queries to extract regional data
  • Handle missing values and outliers
  • Transform data for analysis

Due: Before next lecture