The complete journey from raw data to actionable insights:
1. Collection
Wrangling, cleaning, sampling
2. Management
Storage, access, reliability
3. Exploration
EDA, hypotheses, intuition
4. Prediction
Models, algorithms, inference
5. Communication
Visualization, storytelling
1. Data Collection
Sources:
Databases (SQL, NoSQL)
APIs (REST, GraphQL)
Web scraping
Surveys and forms
IoT sensors
Philippine Example:
PSA Census Data
Every 5 years, PSA collects data on:
Population demographics
Housing conditions
Education levels
Employment status
2. Data Management
Key Considerations:
Storage: CSV, databases, cloud
Quality: Accuracy, completeness
Governance: Access control, policies
Security: Encryption, backups
Philippine Government Data:
Portal | Data Types
PSA OpenSTAT | Demographics, labor
BSP Statistics | Financial, banking
PAGASA | Weather, climate
Data.gov.ph | Open government
3. Exploratory Data Analysis
EDA: "Detective work" on data - understanding patterns before modeling.
Key Activities:
Summary statistics
Distribution analysis
Correlation exploration
Outlier detection
# Quick EDA in Python
import pandas as pd
df = pd.read_csv('ph_data.csv')
df.describe() # Summary stats
df.info() # Data types
df.isnull().sum() # Missing values
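Outlier detection, the last of the key activities listed above, is often done with the interquartile-range (IQR) rule. The sketch below is one possible version, assuming a handful of province populations taken from the tables in this deck; the helper name `iqr_outliers` and the multiplier `k=1.5` are illustrative choices, not part of the course materials.

```python
import statistics

# Province populations quoted in this deck (2020 census figures).
populations = {
    "Metro Manila": 13_484_462,
    "Cavite": 4_344_829,
    "Rizal": 3_330_143,
    "Lanao del Sur": 1_195_518,
    "Sulu": 1_000_108,
    "Benguet": 460_683,
    "Tawi-Tawi": 441_045,
    "Ifugao": 207_498,
}

def iqr_outliers(values_by_name, k=1.5):
    """Return names whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(list(values_by_name.values()), n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [name for name, v in values_by_name.items() if v < low or v > high]

outliers = iqr_outliers(populations)
print(outliers)
```

With only eight values the quartiles are rough, so treat the result as a prompt for closer inspection rather than a verdict.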
Code Demo: Loading Philippine Data
# Load Philippine Population Data (PSA 2020 Census)
import pandas as pd
# Load the dataset
df = pd.read_csv('ph_population_2020.csv')
# Quick overview
print(f"Dataset: {len(df)} provinces across {df['region'].nunique()} regions")
print(f"Columns: {list(df.columns)}")
# Preview first 5 rows
df.head()
   region  region_name              province      population_2020  growth_rate
0  NCR     National Capital Region  Metro Manila  13,484,462       0.93
1  CAR     Cordillera Admin Region  Benguet       460,683          1.55
2  CAR     Cordillera Admin Region  Ifugao        207,498          0.95
Code Demo: Descriptive Statistics
# Get summary statistics
df['population_2020'].describe()
# Output:
# count 81.000000
# mean 1,364,578.54
# std 1,892,432.12
# min 17,783.00 (Batanes - smallest province)
# 25% 460,683.00
# 50% 786,653.00
# max 13,484,462.00 (Metro Manila - largest)
# Which region has the highest average population?
df.groupby('region_name')['population_2020'].mean().sort_values(ascending=False).head(3)
Key Insight: Metro Manila has 13.5M people in just 619 km² - that's a density of roughly 21,800 people per km²!
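The density figure is simple arithmetic, and it can be checked directly from the two numbers quoted above. This is a minimal sketch; the variable names are illustrative, and the result may differ slightly from published densities depending on the land-area figure used.

```python
population = 13_484_462   # Metro Manila, 2020 census figure used in this deck
area_km2 = 619            # land area quoted in the Key Insight

density = population / area_km2
print(f"{density:,.0f} people per km²")
```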
Code Demo: Quick Visualization
import matplotlib.pyplot as plt
# Population by region
region_pop = df.groupby('region')['population_2020'].sum()
region_pop = region_pop.sort_values(ascending=False)
# Create bar chart
plt.figure(figsize=(10, 6))
plt.bar(region_pop.index[:5],
region_pop.values[:5] / 1e6)
plt.ylabel('Population (Millions)')
plt.title('Top 5 Regions by Population')
plt.show()
Code Demo: Filtering & Analysis
# Find provinces with high growth rates (> 2% annual)
high_growth = df[df['growth_rate'] > 2.0].sort_values('growth_rate', ascending=False)
print(f"Found {len(high_growth)} high-growth provinces")
# Top 5 fastest growing
high_growth[['province', 'region_name', 'growth_rate', 'population_2020']].head()
   province       region_name  growth_rate  population_2020
1  Sulu           Bangsamoro   3.92%        1,000,108
2  Cavite         CALABARZON   3.38%        4,344,829
3  Rizal          CALABARZON   2.91%        3,330,143
4  Lanao del Sur  Bangsamoro   2.72%        1,195,518
5  Tawi-Tawi      Bangsamoro   2.45%        441,045
Insight: CALABARZON (Metro Manila suburbs) and BARMM show the highest growth rates.
4. Prediction & Inference
Statistical Inference:
Drawing conclusions about populations
Hypothesis testing
Confidence intervals
Machine Learning (Refresher):
Supervised: Regression, Classification
Unsupervised: Clustering
Key Difference:
Inference: Explains relationships (why?)
Prediction: Forecasts outcomes (what?)
Analytics often prioritizes inference for decision-making.
5. Communication
The most critical (and often neglected) facet!
"The goal is not to build models - it's to drive decisions."
Key insight: Not all 2-point shots are equal, and most mid-range shots are bad decisions.
Moreyball: The Expected Value Math
Expected Points = Shot Value × Success Rate
Shot Type | Value | Typical % | Expected Points | Verdict
Layup/Dunk | 2 | 65% | 1.30 | Best
3-Pointer | 3 | 36% | 1.08 | Good
Mid-Range (2pt) | 2 | 40% | 0.80 | Avoid!
Conclusion: A 36% three-pointer is worth more than a 40% mid-range shot!
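The expected-points formula can be sketched in a few lines of Python. The function name `expected_points` and the dictionary layout are illustrative; the values and percentages are the "typical" ones from the Moreyball table.

```python
def expected_points(shot_value, success_rate):
    """Expected points per attempt = shot value x probability of making it."""
    return shot_value * success_rate

# (point value, typical success rate) for each shot type
shots = {
    "Layup/Dunk": (2, 0.65),
    "3-Pointer": (3, 0.36),
    "Mid-Range (2pt)": (2, 0.40),
}

# Rank shot types from most to least valuable per attempt.
ranked = sorted(
    ((name, expected_points(v, p)) for name, (v, p) in shots.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, ev in ranked:
    print(f"{name:<16} {ev:.2f}")
```

The same function works for any shot profile: pass a different value and success rate to compare options.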
Moreyball: Visualizing Expected Value
The visual makes it obvious: Mid-range shots are inefficient. Even a mediocre 3-pointer beats a good mid-range shot!
Activity: Calculate Expected Value
Think-Pair-Share
Instructions: With a partner, calculate the expected points for each shot type. Which shot should you take?
Shot Type | Point Value | Your Shooting % | Expected Points
Free Throw | 1 | 80% | ____
Corner 3-Pointer | 3 | 42% | ____
Post-up (close 2pt) | 2 | 55% | ____
Floater (mid-range) | 2 | 38% | ____
3 minutes
Discussion: Would your strategy change if you were down by 2 with 10 seconds left?
Moreyball: The Three-Point Revolution
The Rockets' Strategy:
Shoot 3-pointers (high expected value)
Drive to the basket (layups + fouls)
Get to the free throw line
Eliminate mid-range shots
45+
3PT Attempts per Game (2018-19)
League-leading, 2x more than 2012
James Harden's step-back 3 became the signature move of the analytics era.
Moreyball: How It Changed the NBA
18 → 34
League Avg 3PA (2010 → 2023)
↓ 50%
Mid-Range Shot Frequency
Steph Curry
The Ultimate Moreyball Player
Before and After Analytics
Before: Mid-range specialists like Michael Jordan and Kobe Bryant were the most valued players
After: Teams hunt 3-pointers; players like DeMar DeRozan are seen as "inefficient"
Warriors dynasty: Built on Curry/Thompson shooting + analytics
Basketball Analytics: Beyond Shooting
Analytics Application | Traditional View | Analytics View
Player Value | Points per game | Win Shares, PER, VORP, RPM
Draft Picks | Eye test, athleticism | College stats, physical measurements, motor
Defense | Blocks, steals | Defensive Rating, opponent FG% at rim
Lineup Decisions | Coach intuition | Plus/minus data, matchup analytics
Rest/Load Management | Play through injury | Injury prediction models, rest optimization
Analytics Finds Hidden Gems
Draft Steals Found by Data
Nikola Jokic (41st pick): Advanced stats loved his passing
Draymond Green (35th): Analytics showed defensive versatility
Malcolm Brogdon (36th): Efficient college stats overlooked
PBA Application?
Which local players are undervalued?
Is the PBA still mid-range heavy?
Import decisions: stats vs. "name"
UAAP/NCAA data for draft picks
Question: Could a PBA team use Moreyball principles to win a championship on a budget?
Moreyball: Lessons for Data Analytics
Challenge the "eye test": What looks good isn't always effective
Expected value matters: Think probabilistically, not emotionally
Market inefficiencies exist: Find what others undervalue
Data changes culture: Analytics is now standard practice across major professional sports
But there are limits: The Moreyball-era Rockets never won a championship; the Warriors lost the 2019 Finals to the Raptors
Key takeaway: Analytics gives an edge, but doesn't guarantee victory. Human execution still matters.
Analytics in the Philippines
Industry | Application | Companies
Fintech | Credit scoring, fraud detection | GCash, Maya, Tonik
Banking | Risk modeling, churn prediction | BDO, BPI, UnionBank
Retail | Demand forecasting, basket analysis | SM, Puregold, Mercury
Telecom | Network optimization, churn | Globe, Smart, DITO
Transport | Route optimization, pricing | Grab, Angkas, Lalamove
Course Overview: 12 Weeks
Foundations (Weeks 1-3)
Week 1: Introduction & Lifecycle
Week 2: Probability & Statistics
Week 3: Data Wrangling
EDA & Visualization (Weeks 4-6)
Week 4: Exploratory Data Analysis
Week 5: Visualization Principles
Week 6: Storytelling & Dashboards
Modeling (Weeks 7-9)
Week 7: Regression Analytics
Week 8: Tree-Based Methods
Week 9: Clustering & Segmentation
Advanced (Weeks 10-12)
Week 10: Time Series Analytics
Week 11: Text Analytics & Ethics
Week 12: Capstone Presentations
Assessment Structure
Component | Weight
Weekly Labs | 25%
Midterm Exam | 20%
Quizzes (3) | 15%
Capstone Project | 30%
Participation | 10%
Capstone Project
End-to-end analytics project:
Team of 2-3 students
Philippine context preferred
EDA + Model + Dashboard
15-minute presentation
Tools We'll Use
Python
pandas, numpy, scikit-learn, matplotlib, seaborn
SQL
Data extraction, joins, aggregations
Tableau/Streamlit
Interactive dashboards
# Your standard analytics stack
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
The Analytics Edge in Industry
Lecture 2: Real-World Case Studies
Case Study: Netflix
The $1 Billion Recommendation Engine
Netflix: The Scale of the Problem
238M
Subscribers Worldwide
17,000+
Titles Available
90 sec
Average Decision Time
The Challenge: If a user can't find something to watch in about 90 seconds, they are likely to leave. Every lost session adds churn risk. Netflix has estimated that its recommendation and personalization system saves more than $1 billion per year by keeping subscribers from cancelling.
Netflix: Subscriber Growth Journey
Key Inflection Points:
2013: House of Cards launches (data-driven)
2016: Global expansion complete
2020: COVID-19 boosts streaming
12x growth in 13 years - supported by recommendation algorithms that, by Netflix's account, drive about 80% of what members watch
Netflix: The Netflix Prize (2006-2009)
The Competition
Prize: $1,000,000
Goal: Beat Netflix's algorithm by 10%
Dataset: 100M ratings from 480K users
Duration: 3 years
Teams: 40,000+ from 186 countries
What They Learned
Ensemble methods beat single algorithms
Combining diverse approaches works best
Top solutions were large blends: the 2007 Progress Prize winner combined 107 models, and the final winning blend was larger still!
Spawned entire Kaggle-style competition industry
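The claim that ensembles beat single algorithms can be illustrated with a small simulation. This is a sketch under strong assumptions, not a reconstruction of any Netflix Prize entry: the "models" are hypothetical, each returning the true rating plus independent random noise, and the ensemble is a plain average.

```python
import math
import random

random.seed(42)

# Hypothetical setup: 2,000 true ratings on a 1-5 scale, and three
# "models" whose predictions are the truth plus independent noise.
truth = [random.uniform(1, 5) for _ in range(2000)]

def noisy_model(noise_sd):
    """Simulate a model by adding Gaussian noise to the true ratings."""
    return [t + random.gauss(0, noise_sd) for t in truth]

models = [noisy_model(1.0), noisy_model(1.0), noisy_model(1.0)]

def rmse(pred):
    """Root mean squared error against the true ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

# Simple ensemble: average the three predictions for each rating.
ensemble = [sum(preds) / len(preds) for preds in zip(*models)]

individual_rmse = [rmse(m) for m in models]
ensemble_rmse = rmse(ensemble)
print([round(r, 3) for r in individual_rmse], round(ensemble_rmse, 3))
```

Averaging helps here because the errors are independent and partly cancel. Real models make correlated mistakes, so the gain is usually smaller, which is why diversity among the combined approaches matters.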
Netflix: The House of Cards Decision
$100M bet on data: Netflix's first original series investment
Data Points Analyzed:
British "House of Cards" was popular on Netflix
Kevin Spacey films performed well
David Fincher films had high completion rates
Political dramas had dedicated audiences
No Pilot
Ordered 2 Seasons Directly
Result: Massive success, proved data-driven content creation works. Now standard practice for streaming.
Netflix: A/B Testing Everything
Netflix runs 250+ A/B tests simultaneously - every change is tested on real users.
What They Test:
Thumbnail images (different ones for different users!)
Row ordering on homepage
Synopsis text and trailers
Autoplay timing
Skip intro button placement
Thumbnail Optimization:
Same movie, different thumbnails based on your viewing history:
Romance viewer → sees couple
Action viewer → sees explosion
Comedy viewer → sees funny scene
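One common way to judge an A/B test such as a thumbnail comparison is a two-proportion z-test on click-through rates. The sketch below assumes made-up click counts and a hypothetical helper `two_proportion_z_test`; it is not a description of Netflix's actual testing platform.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants share one rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution (via erfc).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical click counts for two thumbnails shown to 10,000 users each.
z, p_value = two_proportion_z_test(success_a=1000, n_a=10_000,
                                   success_b=1100, n_b=10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference in click rates is unlikely to be chance alone. When many tests run at once, some will look significant by luck, so thresholds are usually tightened accordingly.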
Netflix: Data They Collect
Viewing Behavior
What you watch and when
Where you pause, rewind, skip
Time of day patterns
Device and location
Binge vs. casual viewing
Content Analysis
Genre micro-tags (76,897 categories!)
Scene-by-scene metadata
Color palettes and pacing
Actor/director connections
Audio characteristics
Engagement Signals
Completion rates
Re-watch behavior
Search queries
Browse patterns
Sharing activity
Fun fact: Netflix knows that 70% of viewers who watch 3 episodes will finish the season.
Netflix: Lessons for Data Analytics
Personalization at scale: One-size-fits-all doesn't work
Test everything: A/B testing removes guesswork
Data for creativity: Analytics can inform content decisions
Micro-segmentation: 76,897 genres > 10 genres
Retention is key: Keeping users > acquiring new ones
Case Study: Spotify
Turning Data Into a Product Feature
Spotify Wrapped: Analytics as Marketing
156M
Users Shared Wrapped (2023)
$0
Ad Spend Required
Why It's Brilliant:
Free viral marketing every December
Makes users feel "seen" by the algorithm
Creates social proof and FOMO
Competitors (Apple, YouTube) now copy it
Spotify: Analytics Behind Wrapped
Listening Analysis
Total minutes played
Top artists, songs, genres
Discovery vs. familiar ratio
Listening time patterns
Behavioral Insights
"Listening personality" classification
Audio feature preferences
Playlist creation habits
Skip rate patterns
Social Context
How you compare to others
Top % of artist fans
Genre popularity trends
Regional differences
Lesson: Your analytics insights can become a product feature that users love!
Case Study: Framingham
The Study That Changed Medicine
Framingham Heart Study: 75+ Years of Data
Study Design (1948)
Location: Framingham, Massachusetts
Original cohort: 5,209 adults
Now: 3rd generation enrolled
Visits: Every 2-4 years for life
Variables: 1,000+ health measures
75+
Years of Continuous Data
3,000+
Published Papers
Framingham: Revolutionary Discoveries
Year | Discovery | Impact
1961 | Cholesterol linked to heart disease | Statin drugs, dietary guidelines
1967 | Physical activity reduces risk | Exercise recommendations
1970 | High blood pressure → stroke | Blood pressure medications
1978 | HDL ("good") cholesterol protective | Refined cholesterol guidelines
1988 | Obesity as independent risk factor | Public health campaigns
The term "risk factor" was invented by Framingham researchers!
Framingham: Why This Study Matters
This is the power of longitudinal data analysis:
Correlation → Causation: Decades of data help establish causal relationships
Risk prediction: Framingham Risk Score used by doctors worldwide
Policy impact: Directly influenced FDA, WHO, AHA guidelines
Saved millions of lives: Heart disease deaths have dropped about 60% since the 1950s
In Week 7: We'll build predictive models using actual Framingham data!
Philippine Case Study
GCash: Analytics for Financial Inclusion
GCash: The Philippine Problem
Financial Inclusion Challenge
51% of Filipinos are unbanked (2019)
7,641 islands - physical banks impractical
No credit history = no loans
Cash-based economy, remittances vital
51%
Filipinos Without Bank Accounts (2019)
$36B
Annual OFW Remittances
Opportunity: 70M+ smartphone users but only 30M banked citizens
GCash: Explosive Growth Through Analytics
81M+
Registered Users (2023)
↑ from 20M in 2019
$17B+
Monthly Transactions
3.5M
Partner Merchants
COVID-19 accelerated adoption: 4x growth in 2020-2021 as cash became risky and ayuda needed digital distribution.
GCash: The 4x Growth Story
What Drove the Growth?
Ayuda distribution - 18M+ beneficiaries
QR payments - 3.5M merchants
GCash Forest - gamification
GLoan/GCredit - financial inclusion
GCash: GScore - Analytics-Powered Credit
Problem: No credit history = No traditional credit score
Solution: Build credit scores from app behavior
Alternative Data Used:
Transaction frequency and amounts
Bill payment consistency
App usage patterns
Network of contacts
Top-up behavior
Impact
GLoan: Instant loans up to ₱25,000
GCredit: Buy now, pay later
Approval in seconds using ML models
First-time borrowers who never had credit access
This is exactly what we'll build in Weeks 7-8!
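To make "credit scores from app behavior" concrete, here is a toy logistic scorecard. Everything in it is an assumption for illustration: the feature names, weights, bias, and approval threshold are invented, and nothing here reflects how GScore is actually computed.

```python
import math

# Hypothetical feature weights for illustration only; a real model would
# learn these from labelled repayment data.
WEIGHTS = {
    "monthly_transactions": 0.04,   # more activity -> higher score
    "on_time_bill_ratio": 2.5,      # share of bills paid on time (0-1)
    "months_active": 0.05,          # longer history -> higher score
}
BIAS = -3.0

def repayment_probability(user):
    """Logistic score: sigmoid of a weighted sum of behavioural features."""
    z = BIAS + sum(WEIGHTS[k] * user[k] for k in WEIGHTS)
    return 1 / (1 + math.exp(-z))

def decide(user, threshold=0.6):
    """Approve automatically above the threshold, otherwise send to review."""
    return "approve" if repayment_probability(user) >= threshold else "review"

steady = {"monthly_transactions": 30, "on_time_bill_ratio": 0.95, "months_active": 24}
new = {"monthly_transactions": 3, "on_time_bill_ratio": 0.50, "months_active": 2}
print(decide(steady), decide(new))
```

In practice the weights would be fitted with a method such as logistic regression, and the threshold chosen by weighing default losses against missed lending opportunities.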
GCash: Real-Time Fraud Detection
Challenge: Process millions of transactions per day while catching fraud in real-time
Fraud Signals Analyzed:
Transaction velocity (too many, too fast)
Geographic anomalies (login from new location)
Device fingerprinting
Behavioral biometrics (typing patterns)
Network analysis (connected to known fraudsters)
ML Models Used:
Real-time scoring (< 100ms decision)
Anomaly detection for unusual patterns
Graph analytics for fraud rings
Adaptive models that learn from new fraud patterns
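The transaction-velocity signal ("too many, too fast") can be sketched as a sliding-window count. The function name `velocity_alerts`, the limit of 5 transactions, and the 60-second window are hypothetical choices; production systems would combine many such signals in a learned model.

```python
from collections import deque

def velocity_alerts(timestamps, max_txns=5, window_seconds=60):
    """Flag each transaction that pushes the count within the trailing
    `window_seconds` above `max_txns`. `timestamps` must be sorted ascending."""
    window = deque()
    alerts = []
    for ts in timestamps:
        window.append(ts)
        # Drop transactions that have aged out of the window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_txns:
            alerts.append(ts)
    return alerts

# Hypothetical account: normal activity, then a burst of 7 sends in 12 seconds.
normal = [0, 400, 900]
burst = [2000, 2002, 2004, 2006, 2008, 2010, 2012]
print(velocity_alerts(normal + burst))
```

Each transaction is appended and removed at most once, so the check stays cheap per event, which matters when decisions must be made within milliseconds.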
Activity: Design Fraud Detection Signals
Brainstorm Session
Scenario: You're a data analyst at GCash. A new fraud pattern has emerged: scammers are tricking users into sending money via "wrong send" schemes.
The "wrong send" scam:
Scammer "accidentally" sends money
Asks victim to return it
Original transaction was fraudulent (stolen funds)
Victim becomes money mule
Your task:
What data signals would you look for to detect this pattern? Think about:
Transaction patterns
Account characteristics
User behavior
4 minutes
Philippine Case Study
Grab: Real-Time Analytics at Scale
Grab: The Optimization Challenge
Real-Time Decisions Required
Matching: Which driver for which rider?
Pricing: What fare is fair right now?
ETA: How long will it really take?
Routing: Best path considering live traffic?
Millions
Rides per Day Across SEA
< 2 sec
Driver Match Decision Time
Grab: Analytics in Action
Function | Analytics Method | Business Impact
Dynamic Pricing | Demand prediction, elasticity modeling | Balance supply/demand, reduce wait times
Driver Matching | Optimization algorithms, ML ranking | Faster pickups, higher driver earnings
ETA Prediction | Time series, traffic pattern ML | Accurate expectations, fewer cancellations
GrabFood | Demand forecasting, restaurant recommendations | Reduced food waste, faster delivery
Safety | Anomaly detection, route monitoring | Protect riders and drivers
Grab: The Surge Pricing Algorithm
Factors in the Model:
Real-time demand (ride requests)
Available drivers in area
Weather conditions (rain = high demand)
Events (concerts, games)
Time of day patterns
Historical data for the area
Goal:
Price high enough to attract drivers from other areas, but low enough that riders still book.
This is optimization under uncertainty!
Ethical consideration: Is surge pricing during typhoons fair? We'll discuss in Week 11.
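A toy version of the surge idea scales price with the ratio of ride requests to available drivers and caps the result. This is a sketch only: the function `surge_multiplier`, the 1.2 rain uplift, and the 2.0 cap are invented for illustration and are not Grab's parameters.

```python
def surge_multiplier(ride_requests, available_drivers,
                     raining=False, cap=2.0):
    """Toy surge rule: scale price with the demand/supply ratio.

    All constants here are hypothetical; they are not Grab's parameters.
    """
    if available_drivers <= 0:
        return cap
    ratio = ride_requests / available_drivers
    multiplier = max(1.0, ratio)           # never go below the base fare
    if raining:
        multiplier *= 1.2                  # assumed weather uplift
    return round(min(multiplier, cap), 2)  # cap limits extreme price spikes

print(surge_multiplier(80, 100))                 # supply exceeds demand
print(surge_multiplier(150, 100))                # moderate shortage
print(surge_multiplier(300, 100, raining=True))  # severe shortage, hits the cap
```

The cap is where the ethical question enters: lowering it during a typhoon protects riders but may attract fewer drivers, which is the trade-off the goal statement describes.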
Ethics in Data Analytics
When Algorithms Affect Lives
Why Ethics Matters in Analytics
Algorithms now make decisions about:
High Stakes
Who gets a loan
Who gets hired
Prison sentences
Medical diagnoses
Medium Stakes
Insurance pricing
College admissions
Housing applications
Content recommendations
Key Questions
Who is harmed if wrong?
Is there recourse?
Can decisions be explained?
Is training data biased?
Case Study: COMPAS
When Algorithms Decide Who Goes to Prison
COMPAS: The Algorithm in the Courtroom
What is COMPAS?
Name: Correctional Offender Management Profiling for Alternative Sanctions
Purpose: Predict likelihood of re-offending
Used by: Judges in bail and sentencing decisions
States: Used in Florida, New York, Wisconsin, others
Risk Score: 1-10
Based on 137 questions about:
Criminal history
Family background
Education, employment
Social relationships
COMPAS: The ProPublica Investigation
2016: ProPublica analyzed 7,000 defendants in Broward County, Florida
Key Findings:
Metric | Black | White
False Positive Rate | 44.9% | 23.5%
False Negative Rate | 28.0% | 47.7%
Translation: Black defendants who didn't re-offend were almost twice as likely to be labeled high-risk.
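The two metrics in the table come straight from a confusion matrix. The sketch below shows the calculation; the counts are chosen so that the computed rates match the table to within rounding, and should be treated as illustrative rather than as an authoritative copy of the dataset.

```python
def error_rates(tp, fp, tn, fn):
    """False positive and false negative rates from a confusion matrix."""
    fpr = fp / (fp + tn)   # labelled high-risk among those who did NOT re-offend
    fnr = fn / (fn + tp)   # labelled low-risk among those who DID re-offend
    return fpr, fnr

# Illustrative counts per group (see the note above on their status).
groups = {
    "Black": dict(tp=1369, fp=805, tn=990, fn=532),
    "White": dict(tp=505, fp=349, tn=1139, fn=461),
}

for name, counts in groups.items():
    fpr, fnr = error_rates(**counts)
    print(f"{name}: FPR = {fpr:.2%}, FNR = {fnr:.2%}")
```

Note that the denominators differ: the false positive rate only looks at people who did not re-offend, which is why it captures the harm described in the translation above.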
Real Impact
Higher risk scores mean:
Higher bail amounts
Longer sentences
Less likely parole
Real people, real consequences
Activity: Identify the Bias
Group Discussion
If the algorithm doesn't use race as an input, how can it still produce racially biased outcomes?
Consider these factors:
Prior arrests (not convictions)
Neighborhood crime rates
Family members with criminal history
Employment history
Education level
Discussion questions:
Which factors might correlate with race?
What's the difference between "fair" and "accurate"?
Should algorithms be used for sentencing at all?
5 minutes
COMPAS: The Fairness Paradox
Northpointe's Defense
"The algorithm is calibrated fairly - if we say 70% risk, 70% actually re-offend, regardless of race."
This is true! Calibration is equal.
The Mathematical Reality
When base rates differ between groups, you cannot have:
Equal false positive rates AND
Equal false negative rates AND
Equal calibration
Pick 2, sacrifice 1.
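The trade-off follows from the confusion-matrix definitions. If p is a group's base rate of re-offending and PPV is the share of high-risk labels that turn out correct (used here as a simple stand-in for calibration), then FPR = p / (1 - p) × (1 - PPV) / PPV × (1 - FNR). The sketch below plugs in two hypothetical groups that share PPV and FNR but differ in base rate; all numbers are invented.

```python
def implied_fpr(base_rate, ppv, fnr):
    """FPR implied by a group's base rate, PPV and FNR.

    Follows from the confusion-matrix definitions:
        FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR)
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

# Two hypothetical groups with identical PPV and FNR but different base rates.
fpr_low_base = implied_fpr(base_rate=0.40, ppv=0.60, fnr=0.35)
fpr_high_base = implied_fpr(base_rate=0.50, ppv=0.60, fnr=0.35)
print(f"{fpr_low_base:.3f} vs {fpr_high_base:.3f}")
```

Holding PPV and FNR equal forces the false positive rates apart whenever base rates differ, so equalizing all three is only possible in special cases such as equal base rates or a perfect predictor.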
COMPAS: Lessons for Data Analysts
Historical bias → Model bias: If past data reflects discrimination, models will too
"Fairness" has many definitions: Stakeholders may disagree on which matters
Black-box algorithms are dangerous: If you can't explain it, should it decide freedom?
Context matters: Different applications need different fairness criteria
Human accountability: Someone must be responsible for algorithmic decisions
We'll study fairness in ML formally in Week 11
Ethics in the Philippine Context
Potential Issues:
GScore: Alternative credit can perpetuate inequality
Facial recognition: Used by malls, may misidentify darker skin
Surge pricing: Fair during typhoons?
Content moderation: Filipino-language hate speech harder to detect
Questions for Your Capstone
Who benefits from your analysis?
Who might be harmed?
Is your data representative?
Can your results be misused?
Philippine Data Privacy Act (RA 10173)
Key Provisions (enacted 2012):
Consent: Must be freely given, specific, informed
Purpose limitation: Use only for stated purpose
Data minimization: Collect only what's needed
Accuracy: Keep data up to date
Storage limitation: Delete when no longer needed
Security: Protect against unauthorized access
Penalty: Up to ₱5M fine and imprisonment for violations. NPC is the enforcing agency.
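Data minimization and security can be partly supported in code by dropping fields the analysis does not need and replacing direct identifiers with keyed hashes. The sketch below is one possible approach under stated assumptions: the field names, the `NEEDED_FIELDS` set, and the key are hypothetical, and pseudonymized records can still count as personal data.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-securely-stored-key"  # hypothetical placeholder

NEEDED_FIELDS = {"age_group", "province", "monthly_spend"}  # assumed analysis needs

def pseudonymize(user_id):
    """Replace a direct identifier with a truncated keyed hash (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()[:16]

def minimize(record):
    """Keep only fields the stated purpose requires, plus a pseudonym."""
    slim = {k: v for k, v in record.items() if k in NEEDED_FIELDS}
    slim["pid"] = pseudonymize(record["mobile_number"])
    return slim

raw = {
    "mobile_number": "09170000000",   # made-up number
    "full_name": "Juan Dela Cruz",    # made-up name
    "age_group": "25-34",
    "province": "Cavite",
    "monthly_spend": 4200,
}
print(minimize(raw))
```

A keyed hash is used instead of a plain hash because mobile numbers are easy to enumerate; without the secret key, an attacker could hash every possible number and reverse the mapping.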
Your Ethical Responsibilities
As a data analyst, you should:
Question your data sources and potential biases
Consider who is affected by your analysis
Be transparent about limitations and uncertainties
Protect privacy and follow data protection laws
Speak up when you see analytics being misused
Remember: "Just because we can doesn't mean we should."
The Analytics Job Market
35%
Projected Job Growth for Data Scientists (U.S. BLS, 2022-2032)
Much faster than average
PHP 40-80K
Entry-Level Salary (PH)
Skills in Demand:
Python/R programming
SQL and databases
Data visualization
Statistical analysis
Communication skills
Business acumen
Week 1 Key Takeaways
Data analytics transforms raw data into actionable insights
The lifecycle has 5 facets: Collect → Manage → Explore → Predict → Communicate
Communication is often the most critical (and neglected) facet
Philippine companies (GCash, Grab, SM) are actively using analytics
Ethics must be central to analytics practice
Growing job market with skills shortage = opportunity!
Lab 1 Preview
This Week's Lab
Introduction to Python for Analytics
Setting up environment
Loading Philippine economic data
Basic exploration with pandas
Your first visualization
# Lab 1 Preview
import pandas as pd
import matplotlib.pyplot as plt
# Load Philippine GDP data
df = pd.read_csv('ph_gdp.csv')
# Quick exploration
print(df.shape)
print(df.describe())
# First visualization
df.plot(x='year', y='gdp')
plt.title('Philippine GDP Growth')
plt.show()