Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu
NASA's Mars Climate Orbiter (1999): navigators noticed the discrepancy months earlier but "did not follow the rules about filling out the form" to flag it.
A single unit conversion error (pound-force seconds vs. newton seconds), the most basic data cleaning task, destroyed a $327M spacecraft.
Source: Wikipedia · IEEE Spectrum (Oberg, 1999)
Lecture 5: Data Collection & SQL
Data Wrangling: The process of transforming raw data into a format suitable for analysis
Samsung Securities (2018): an employee typed "shares" instead of "won" when processing dividends, issuing 2.8 billion ghost shares worth ~$105 billion. The stock dropped 11% in one day. The error took 37 minutes to catch, and 16 employees sold 5 million shares before the fix.
Source: Wikipedia
Data wrangling isn't boring busywork. It's the difference between insight and disaster.
Every analytics project follows this sequence
| Method | Description | Example |
|---|---|---|
| APIs | Programmatic access | Twitter API, PSA OpenSTAT |
| Web Scraping | Extract from websites | News articles, prices |
| Databases | SQL queries | Company data warehouse |
| Files | CSV, Excel, JSON | Government datasets |
| Surveys | Primary collection | Google Forms, interviews |
| Source | Data Types |
|---|---|
| PSA OpenSTAT | Demographics, labor, prices |
| BSP Statistics | Financial, monetary, banking |
| PAGASA | Weather, climate, disasters |
| DOH | Health statistics, epidemiology |
| PSE EDGE | Stock market, company filings |
| PhilGEPS | Government procurement, tenders, contracts |
This is YOUR data. As Filipino data scientists, you'll be the ones cleaning Philippine government datasets.
SQL (Structured Query Language): The standard language for interacting with relational databases
-- Basic SELECT statement
SELECT column1, column2
FROM table_name
WHERE condition
ORDER BY column1;
SQL is essential for extracting data from databases in organizations
-- Select all columns
SELECT * FROM employees;
-- Select specific columns
SELECT first_name, last_name, salary
FROM employees;
-- Select with alias
SELECT first_name AS name,
salary * 12 AS annual_salary
FROM employees;
-- Comparison operators
SELECT * FROM products WHERE price > 1000;
-- Multiple conditions
SELECT * FROM employees
WHERE department = 'IT'
AND salary >= 50000;
-- IN operator
SELECT * FROM regions
WHERE province IN ('Cebu', 'Bohol', 'Leyte');
-- LIKE for pattern matching
SELECT * FROM customers
WHERE name LIKE 'Maria%';
-- Sort ascending (default)
SELECT * FROM products
ORDER BY price;
-- Sort descending
SELECT * FROM employees
ORDER BY salary DESC;
-- Multiple columns
SELECT * FROM sales
ORDER BY region, date DESC;
-- Limit results (top N)
SELECT * FROM products
ORDER BY price DESC
LIMIT 10;
-- Common aggregates
SELECT
COUNT(*) AS total_employees,
AVG(salary) AS avg_salary,
SUM(salary) AS total_payroll,
MIN(salary) AS min_salary,
MAX(salary) AS max_salary
FROM employees;
Key Aggregates: COUNT, SUM, AVG, MIN, MAX, STDDEV
-- Group by single column
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
-- Group by multiple columns
SELECT region, product_category,
SUM(sales) AS total_sales
FROM transactions
GROUP BY region, product_category;
-- Filter groups with HAVING
SELECT department, COUNT(*) AS emp_count
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;
| Join Type | Rows Returned |
|---|---|
| INNER | Matches only |
| LEFT | All left + matches |
| RIGHT | Matches + all right |
| FULL OUTER | Everything from both |
SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d
ON e.dept_id = d.id;
-- Combine province data with regional info
SELECT
p.province_name,
r.region_name,
p.population,
p.land_area
FROM provinces p
LEFT JOIN regions r
ON p.region_code = r.region_code
WHERE r.island_group = 'Visayas'
ORDER BY p.population DESC;
Subquery (also called inner query or nested query): A SQL query embedded inside another query. The inner query runs first, and its result is used by the outer query.
Two common uses: (1) Filter using a computed value, (2) Create a derived table as a data source
-- Subquery in WHERE
SELECT * FROM employees
WHERE salary > (
SELECT AVG(salary) FROM employees
);
-- Subquery in FROM
SELECT dept, avg_sal
FROM (
SELECT department AS dept,
AVG(salary) AS avg_sal
FROM employees
GROUP BY department
) AS dept_averages
WHERE avg_sal > 50000;
SELECT r.region_name, s.median_salary
FROM salary_data s
JOIN regions r ON s.region_code = r.region_code
WHERE s.median_salary > (
SELECT AVG(median_salary) FROM salary_data
)
AND r.island_group = 'Visayas'
ORDER BY s.median_salary DESC;
| | region_name | median_salary |
|---|---|---|
| 0 | Central Visayas (VII) | 14,200 |
| 1 | Western Visayas (VI) | 13,500 |
Illustrative data based on PSA Labor Force Survey patterns
You have SQL access to a database with student grades, attendance, and personal information at UP Cebu. Your boss asks you to run a query joining grades with social media activity data to "predict student success."
Consider:
Discuss:
4 minutes
import pandas as pd
import sqlite3
# Connect to database
conn = sqlite3.connect('ph_data.db')
# Execute query and load to DataFrame
query = """
SELECT region, province, population
FROM census_2020
WHERE population > 1000000
ORDER BY population DESC
"""
df = pd.read_sql(query, conn)
# Close connection
conn.close()
print(df.head())
import requests
import pandas as pd
# PSA OpenSTAT API example
url = "https://openstat.psa.gov.ph/api/data"
params = {
'indicator': 'population',
'year': 2020,
'format': 'json'
}
response = requests.get(url, params=params)
data = response.json()
# Convert to DataFrame
df = pd.DataFrame(data['results'])
print(df.head())
import pandas as pd
# CSV files
df_csv = pd.read_csv('ph_population.csv')
# Excel files
df_excel = pd.read_excel('census_data.xlsx',
sheet_name='2020')
# JSON files
df_json = pd.read_json('api_response.json')
# Multiple files
import glob
files = glob.glob('data/*.csv')
df = pd.concat([pd.read_csv(f) for f in files])
Turning messy reality into analysis-ready data
Lecture 6: Data Cleaning & Transformation
| Issue | Example | Impact |
|---|---|---|
| Missing Values | Empty cells, NaN | Biased analysis |
| Duplicates | Repeated rows | Inflated counts |
| Inconsistent | "Cebu" vs "CEBU" | Wrong grouping |
| Outliers | Age = 999 | Skewed statistics |
| Wrong Types | Dates as strings | Failed operations |
Scenario: You received this student enrollment dataset. How many problems can you find?
| | university | program | year | gwa |
|---|---|---|---|---|
| 0 | UP Cebu | BSCS | 2024 | 1.75 |
| 1 | up cebu | BSCS | 2024 | NaN |
| 2 | UP Cebu | BS CS | 2024 | 1.25 |
| 3 | UP Cebu | BSCS | 2024 | 1.75 |
| 4 | UP-Cebu | BSCS | 2025 | -0.5 |
| 5 | UP Cebu | BSCS | twenty twenty-four | 2.00 |
Task: List each problem AND its data quality category (missing, inconsistent, duplicate, outlier, wrong type).
3 minutes
import pandas as pd
# Load Philippine economic data
df = pd.read_csv('ph_economic_indicators.csv')
# Basic info
print(df.info())
# Missing values
print(df.isnull().sum())
# Unique values per column
print(df.nunique())
# Statistical summary
print(df.describe())
# Check missing values
print(df.isnull().sum())
# Drop rows with any missing
df_clean = df.dropna()
# Drop rows where specific column is missing
df_clean = df.dropna(subset=['income'])
# Fill with constant
df['income'] = df['income'].fillna(0)
# Fill with mean/median
df['income'] = df['income'].fillna(df['income'].median())
# Forward fill (time series); use .bfill() for backward fill
df['price'] = df['price'].ffill()
Before (raw data)
| province | monthly_income |
|---|---|
| Cebu | 15,000 |
| Bohol | NaN |
| Leyte | 11,000 |
| Siquijor | NaN |
| Negros Or. | 13,000 |
After (median fill)
| province | monthly_income |
|---|---|
| Cebu | 15,000 |
| Bohol | 13,000 |
| Leyte | 11,000 |
| Siquijor | 13,000 |
| Negros Or. | 13,000 |
Available values: 15,000 + 11,000 + 13,000 = 39,000 | Mean = 39,000 / 3 = 13,000 | Median (sorted: 11K, 13K, 15K) = 13,000
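The median fill above can be reproduced in pandas (a minimal sketch; column names follow the table):

```python
import pandas as pd

# Recreate the raw table with missing incomes
df = pd.DataFrame({
    'province': ['Cebu', 'Bohol', 'Leyte', 'Siquijor', 'Negros Or.'],
    'monthly_income': [15000, None, 11000, None, 13000],
})

# Median of the available values (11K, 13K, 15K) is 13,000
median = df['monthly_income'].median()
df['monthly_income'] = df['monthly_income'].fillna(median)
print(df)
```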
Philippine Monthly Salary Data (PHP): [8K, 10K, 12K, 13K, 14K, 15K, 18K, 20K, 50K, 500K]
Mean = (8 + 10 + 12 + 13 + 14 + 15 + 18 + 20 + 50 + 500) ÷ 10 = PHP 66,000
Median = (14 + 15) ÷ 2 = PHP 14,500
The mean says the "average" Filipino in this sample earns PHP 66,000/month.
But 9 out of 10 people earn less than PHP 66,000. The CEO's salary pulls the mean to a number that represents nobody.
This is why news reports about "average income" can be deeply misleading. Always check the median.
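You can verify these numbers with Python's standard library (no pandas needed):

```python
import statistics

# Monthly salaries in PHP, including one executive outlier
salaries = [8_000, 10_000, 12_000, 13_000, 14_000,
            15_000, 18_000, 20_000, 50_000, 500_000]

print(statistics.mean(salaries))    # 66000, pulled up by the outlier
print(statistics.median(salaries))  # 14500, what a typical earner makes
```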
Imputation: The process of replacing missing values with substituted values estimated from the available data. Unlike dropping rows (which loses data), imputation preserves the dataset size.
Three common approaches: constant fill (e.g., 0 or "Unknown"), statistical fill (mean, median, or mode), and forward/backward fill for ordered data.
The right method depends on why the data is missing and the data type.
Forward Fill (ffill): Propagates the last observed non-null value forward. Best suited for time series where values persist until changed (e.g., stock prices, temperature).
Before
| date | price |
|---|---|
| 2024-01-01 | 150.0 |
| 2024-01-02 | NaN |
| 2024-01-03 | NaN |
| 2024-01-04 | 155.0 |
| 2024-01-05 | NaN |
After (ffill)
| date | price |
|---|---|
| 2024-01-01 | 150.0 |
| 2024-01-02 | 150.0 |
| 2024-01-03 | 150.0 |
| 2024-01-04 | 155.0 |
| 2024-01-05 | 155.0 |
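The same fill in pandas (a sketch using the dates from the table):

```python
import pandas as pd

prices = pd.Series(
    [150.0, None, None, 155.0, None],
    index=pd.date_range('2024-01-01', periods=5),
)

# ffill() carries the last observed price forward
filled = prices.ffill()
print(filled.tolist())  # [150.0, 150.0, 150.0, 155.0, 155.0]
```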
| Strategy | When to Use |
|---|---|
| Drop rows | Few missing, random pattern |
| Mean/Median | Numeric, roughly normal |
| Mode | Categorical variables |
| Forward fill | Time series data |
| Imputation | Preserve relationships |
# Check for duplicates
print(f"Duplicates: {df.duplicated().sum()}")
# View duplicate rows
print(df[df.duplicated(keep=False)])
# Remove duplicates
df_clean = df.drop_duplicates()
# Remove based on specific columns
df_clean = df.drop_duplicates(
subset=['id', 'date'],
keep='last'
)
# Check current types
print(df.dtypes)
# Convert to numeric
df['price'] = pd.to_numeric(df['price'], errors='coerce')
# Convert to datetime
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
# Convert to category (memory efficient)
df['region'] = df['region'].astype('category')
# Convert to string
df['code'] = df['code'].astype(str)
# Remove whitespace
df['name'] = df['name'].str.strip()
# Standardize case
df['city'] = df['city'].str.title() # "cebu" → "Cebu"
df['code'] = df['code'].str.upper() # "ph" → "PH"
# Replace values
df['province'] = df['province'].str.replace(
'Cebu City', 'Cebu'
)
# Extract patterns
df['area_code'] = df['phone'].str.extract(r'\((\d+)\)')
Outlier: A data point that is significantly different from other observations in the dataset. Outliers can be legitimate extreme values or errors, and they can heavily skew statistical measures like the mean.
In Philippine salary data, a CEO earning PHP 5,000,000/month is a legitimate outlier when the national median salary is PHP 13,000/month. An "age" value of 999 is an error outlier.
Both affect your analysis, but require different handling strategies.
Quantile: A value that divides your sorted data into equal-sized groups.
Interquartile Range (IQR): $$IQR = Q3 - Q1$$
The IQR captures the middle 50% of the data. Values beyond 1.5 × IQR from Q1 or Q3 are considered outliers.
Z-Score: Measures how many standard deviations a data point is from the mean. A Z-score of 0 means the value equals the mean; a Z-score of 3 means it is 3 standard deviations above.
$$z = \frac{x - \mu}{\sigma}$$
Where: \(x\) = observed value, \(\mu\) = mean, \(\sigma\) = standard deviation
Convention: Values with |z| > 3 are typically considered outliers.
Under a normal distribution, 99.7% of data falls within 3 standard deviations of the mean (the 68-95-99.7 rule).
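The 68-95-99.7 figures can be sanity-checked with the standard library's `statistics.NormalDist` (a quick verification, not part of the slides):

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, std dev 1

# Probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    coverage = std_normal.cdf(k) - std_normal.cdf(-k)
    print(f"within {k} sigma: {coverage:.1%}")
```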
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[
(df['salary'] < lower) |
(df['salary'] > upper)
]
from scipy import stats
z_scores = stats.zscore(df['salary'])
outliers = df[abs(z_scores) > 3]
[8K, 10K, 12K, 13K, 14K, 15K, 18K, 20K, 50K, 500K]
Step 1: Q1 = 25th percentile = PHP 12,250 (pandas' default linear interpolation)
Step 2: Q3 = 75th percentile = PHP 19,500
Step 3: IQR = Q3 - Q1 = 19,500 - 12,250 = PHP 7,250
Step 4: Lower bound = Q1 - 1.5 × IQR = 12,250 - 10,875 = PHP 1,375
Step 5: Upper bound = Q3 + 1.5 × IQR = 19,500 + 10,875 = PHP 30,375
Outliers detected: PHP 50,000 and PHP 500,000 (above upper bound of PHP 30,375)
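The same computation in pandas (exact quartile values depend on the interpolation method; pandas defaults to linear, but the outliers flagged are the same either way):

```python
import pandas as pd

salaries = pd.Series([8_000, 10_000, 12_000, 13_000, 14_000,
                      15_000, 18_000, 20_000, 50_000, 500_000])

q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = salaries[(salaries < lower) | (salaries > upper)]
print(outliers.tolist())  # [50000, 500000]
```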
Mean (μ) = PHP 66,000 | Population Std Dev (σ) = PHP 145,108 (scipy's zscore uses ddof=0)
For PHP 500,000: z = (500,000 - 66,000) / 145,108 = 2.99
For PHP 50,000: z = (50,000 - 66,000) / 145,108 = -0.11
Key insight: Z-scores are sensitive to extreme values! The mean and standard deviation are pulled by the very outlier we are trying to detect.
| Value | IQR Method | Z-Score Method |
|---|---|---|
| PHP 50,000 | Outlier | Not outlier (z = -0.11) |
| PHP 500,000 | Outlier | Borderline (z = 2.99) |
IQR is often more robust than Z-score for skewed data.
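To see why, compute the z-scores by hand with the standard library. Note that `scipy.stats.zscore` uses the population standard deviation (ddof=0) by default, which `statistics.pstdev` matches; a sample standard deviation (ddof=1) gives slightly different values:

```python
import statistics

salaries = [8_000, 10_000, 12_000, 13_000, 14_000,
            15_000, 18_000, 20_000, 50_000, 500_000]

mu = statistics.mean(salaries)       # 66000
sigma = statistics.pstdev(salaries)  # population std dev, roughly 145,108

z_ceo = (500_000 - mu) / sigma  # about 2.99, just under the |z| > 3 cutoff
z_mid = (50_000 - mu) / sigma   # about -0.11, nowhere near the cutoff
print(round(z_ceo, 2), round(z_mid, 2))
```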
Winsorization (Option 2 below): Replacing extreme values with the nearest non-outlier boundary value. Unlike removal, it preserves the data point but limits its influence.
# Option 1: Remove outliers
df_clean = df[
(df['salary'] >= lower) &
(df['salary'] <= upper)
]
# Option 2: Cap outliers (winsorization)
df['salary'] = df['salary'].clip(lower=lower, upper=upper)
# Option 3: Transform (log)
import numpy as np
df['log_salary'] = np.log1p(df['salary'])
# Option 4: Keep but flag
df['is_outlier'] = (df['salary'] < lower) | (df['salary'] > upper)
Binning (also called discretization): Converting continuous numerical values into discrete categories. Useful for simplifying analysis, creating ML features, or visualizing distributions.
Before (continuous)
| name | age |
|---|---|
| Juan | 22 |
| Maria | 35 |
| Pedro | 67 |
| Ana | 8 |
| Carlos | 45 |
After (binned)
| name | age | age_group |
|---|---|---|
| Juan | 22 | Young Adult |
| Maria | 35 | Adult |
| Pedro | 67 | Senior |
| Ana | 8 | Child |
| Carlos | 45 | Middle Age |
One-Hot Encoding: Converts each category into a new binary (0/1) column. Each row gets exactly one "1".
Original
| id | region |
|---|---|
| 1 | Visayas |
| 2 | Luzon |
| 3 | Mindanao |
| 4 | Visayas |
One-Hot (pd.get_dummies)
| id | Luzon | Mindanao | Visayas |
|---|---|---|---|
| 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 0 | 1 |
Label Encoding assigns integers: Luzon=0, Mindanao=1, Visayas=2. Use for ordinal data (Low/Med/High). Use one-hot for nominal data (region names).
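The one-hot table above can be reproduced directly; note that `pd.get_dummies` prefixes new columns with the original column name by default:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'region': ['Visayas', 'Luzon', 'Mindanao', 'Visayas']})

# Each region becomes its own 0/1 column
encoded = pd.get_dummies(df, columns=['region'], dtype=int)
print(encoded)
```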
# Create new columns
df['gdp_per_capita'] = df['gdp'] / df['population']
# Binning continuous variables
df['age_group'] = pd.cut(
df['age'],
bins=[0, 18, 35, 50, 65, 100],
labels=['Child', 'Young Adult', 'Adult', 'Middle Age', 'Senior']
)
# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['region'])
# Label encoding
df['region_code'] = df['region'].astype('category').cat.codes
Pivot (Long → Wide): Spreads values from one column into multiple new columns. Melt (Wide → Long): Gathers multiple columns into variable + value pairs.
Wide
| region | 2020 | 2021 | 2022 |
|---|---|---|---|
| Visayas | 500K | 550K | 600K |
| Luzon | 800K | 850K | 900K |
Long (after melt)
| region | year | sales |
|---|---|---|
| Visayas | 2020 | 500K |
| Visayas | 2021 | 550K |
| Luzon | 2020 | 800K |
| Luzon | 2021 | 850K |
# Wide to long (melt)
df_long = pd.melt(
df,
id_vars=['region', 'year'],
value_vars=['q1_sales', 'q2_sales', 'q3_sales', 'q4_sales'],
var_name='quarter',
value_name='sales'
)
# Long to wide (pivot)
df_wide = df_long.pivot_table(
index='region',
columns='year',
values='sales',
aggfunc='sum'
)
# Inner merge (SQL INNER JOIN)
df_merged = pd.merge(df1, df2, on='id', how='inner')
# Left merge (SQL LEFT JOIN)
df_merged = pd.merge(
population_df,
income_df,
left_on='province_code',
right_on='prov_id',
how='left'
)
# Concatenate (stack)
df_combined = pd.concat([df_2020, df_2021, df_2022])
def clean_philippine_data(df):
"""Standard cleaning pipeline for PH datasets"""
# 1. Remove duplicates
df = df.drop_duplicates()
# 2. Standardize column names
df.columns = df.columns.str.lower().str.replace(' ', '_')
# 3. Handle missing values
df['income'] = df['income'].fillna(df['income'].median())
# 4. Standardize text
df['province'] = df['province'].str.title().str.strip()
# 5. Convert types
df['date'] = pd.to_datetime(df['date'])
# 6. Remove outliers
df = df[df['age'].between(0, 120)]
return df
Scenario: You received a CSV of Philippine typhoon damage reports from NDRRMC with these issues:
"NCR", "National Capital Region", "Metro Manila""PHP 1,500,000""2024-01-15", "January 15, 2024", "15/01/2024"NaN-500,000 (negative)Task 1: Write Python pseudocode for each cleaning step
Task 2: What would you do about the missing municipality names? Drop? Fill? How?
5 minutes
Why Philippine datasets are harder to clean than most
Every one of these events means you can't naively merge datasets across time periods without first harmonizing definitions, codes, and boundaries.
# 1. Region names inconsistent (abbreviations vs full names)
region_mapping = {
'NCR': 'National Capital Region',
'CAR': 'Cordillera Administrative Region',
'BARMM': 'Bangsamoro Autonomous Region'
}
df['region'] = df['region'].replace(region_mapping)
# 2. Numbers stored with commas as strings
df['population'] = df['population'].str.replace(',', '').astype(int)
# 3. Date formats vary (day-first is common in PH)
df['census_date'] = pd.to_datetime(df['date'], dayfirst=True)
| | region (before) | region (after) |
|---|---|---|
| 0 | NCR | National Capital Region |
| 1 | CAR | Cordillera Administrative Region |
| 2 | BARMM | Bangsamoro Autonomous Region |
def validate_data(df):
"""Validate cleaned data"""
assert df.duplicated().sum() == 0, "Duplicates found!"
assert df.isnull().sum().sum() == 0, "Missing values found!"
assert df['age'].between(0, 120).all(), "Invalid ages!"
assert df['population'].ge(0).all(), "Negative population!"
print("✅ All validations passed!")
return True
# Run validation
validate_data(df_clean)
Remember: NASA lost a $327M spacecraft because someone didn't convert units. Samsung Securities issued $105B in ghost shares because someone picked the wrong dropdown. The skills you learned this week prevent those disasters.
You can now:
- Query relational databases with SELECT, WHERE, GROUP BY, JOINs, and subqueries
- Load data from databases, APIs, and files into pandas
- Diagnose and fix missing values, duplicates, inconsistencies, outliers, and wrong types
- Reshape (pivot/melt), merge, and validate datasets
Next Week: Exploratory Data Analysis
Due: Before next lecture