Introduction to
Data Analytics

CMSC 178DA - Week 01

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

Learning Objectives

By the end of this week, you will be able to:

Define data analytics and distinguish it from related fields
Explain the data science lifecycle and its five key facets
Identify real-world applications of data analytics
Understand ethical considerations in analytics
Recognize analytics opportunities in the Philippine context

What is Data Analytics?

Definition:

The science of analyzing raw data to draw conclusions, identify patterns, and support decision-making.

Key Components:

Input: Raw data from various sources
Process: Statistical & computational methods
Output: Actionable insights

Why It Matters

Companies using analytics are 5x more likely to make faster decisions
Data-driven organizations are 23x more likely to acquire customers
35% projected growth in analytics jobs (2022-2032)

Analytics vs Related Fields

Field	Primary Question	Focus
Data Analytics	What happened? Why?	Insights & decisions
Data Science	What can we learn?	Broader exploration, ML/AI
Machine Learning	What will happen?	Prediction & automation
Business Intelligence	What are the KPIs?	Reporting & dashboards

This course focuses on Data Analytics, with an ML refresher (you've already taken ML!)

The Analytics Spectrum

Descriptive
What happened?

Diagnostic
Why did it happen?

Predictive
What will happen?

Prescriptive
What should we do?

Example - E-commerce:

Descriptive: Sales dropped 15% last month
Diagnostic: Checkout abandonment increased

Predictive: Sales will drop 20% if unchanged
Prescriptive: Simplify checkout, add payment options

The Data Science Lifecycle

Harvard CS109 Framework

Five Key Facets

The complete journey from raw data to actionable insights:

1. Collection

Wrangling, cleaning, sampling

2. Management

Storage, access, reliability

3. Exploration

EDA, hypotheses, intuition

4. Prediction

Models, algorithms, inference

5. Communication

Visualization, storytelling

1. Data Collection

Sources:

Databases (SQL, NoSQL)
APIs (REST, GraphQL)
Web scraping
Surveys and forms
IoT sensors
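As a rough illustration of the first source in the list, the sketch below pulls aggregated rows out of a SQL database using Python's built-in `sqlite3` module. The table name `census`, its columns, and the in-memory database are assumptions made for this example; the three population figures are the 2020 values that appear in this deck's tables.

```python
import sqlite3

# In-memory database so the sketch runs without any external file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE census (province TEXT, region TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO census VALUES (?, ?, ?)",
    [
        ("Benguet", "CAR", 460683),
        ("Ifugao", "CAR", 207498),
        ("Metro Manila", "NCR", 13484462),
    ],
)

# Collection step: extract only the rows and columns the analysis needs.
rows = conn.execute(
    "SELECT region, SUM(population) FROM census GROUP BY region ORDER BY region"
).fetchall()
conn.close()

for region, total in rows:
    print(f"{region}: {total:,}")
```

The same pattern applies to a production database: only the connection line changes, while the query and the downstream analysis stay the same.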

Philippine Example:

PSA Census Data

Every 5 years, PSA collects data on:

Population demographics
Housing conditions
Education levels
Employment status

2. Data Management

Key Considerations:

Storage: CSV, databases, cloud
Quality: Accuracy, completeness
Governance: Access control, policies
Security: Encryption, backups
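The quality consideration above (accuracy, completeness) can be checked in a few lines of pandas. This is a minimal sketch on hypothetical records: the DataFrame contents, including the missing value and the repeated row, are made up for illustration.

```python
import pandas as pd

# Hypothetical records with one missing value and one duplicated row.
records = pd.DataFrame(
    {
        "province": ["Benguet", "Ifugao", "Sulu", "Sulu"],
        "population_2020": [460683, None, 1000108, 1000108],
    }
)

# Completeness: share of non-missing values in each column.
completeness = records.notna().mean()

# Duplicates: rows that repeat an earlier row exactly.
n_duplicates = int(records.duplicated().sum())

print(completeness)
print(f"Duplicate rows: {n_duplicates}")
```

Running checks like these before any modeling makes data-quality problems visible early, when they are cheapest to fix.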

Philippine Government Data:

Portal	Data Types
PSA OpenSTAT	Demographics, labor
BSP Statistics	Financial, banking
PAGASA	Weather, climate
Data.gov.ph	Open government

3. Exploratory Data Analysis

EDA: "Detective work" on data - understanding patterns before modeling.

Key Activities:

Summary statistics
Distribution analysis
Correlation exploration
Outlier detection

# Quick EDA in Python
import pandas as pd

df = pd.read_csv('ph_data.csv')
print(df.describe())      # Summary statistics for numeric columns
df.info()                 # Column dtypes and non-null counts (prints directly)
print(df.isnull().sum())  # Missing values per column
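Outlier detection, one of the key activities listed above, can be sketched with the interquartile-range (IQR) rule. The eight population figures are the 2020 values that appear in this deck's tables; the 1.5 × IQR fence is a common convention chosen for this example, not something the slides prescribe.

```python
import pandas as pd

# 2020 population figures as they appear in this deck's tables.
pop = pd.Series(
    {
        "Metro Manila": 13_484_462,
        "Benguet": 460_683,
        "Ifugao": 207_498,
        "Sulu": 1_000_108,
        "Cavite": 4_344_829,
        "Rizal": 3_330_143,
        "Lanao del Sur": 1_195_518,
        "Tawi-Tawi": 441_045,
    }
)

# IQR rule: flag values far outside the middle 50% of the data.
q1, q3 = pop.quantile(0.25), pop.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = pop[(pop < lower) | (pop > upper)]
print(outliers)
```

On this small sample only Metro Manila falls outside the fences. Whether such a point is an error or a genuine extreme is a judgment the analyst still has to make.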

Code Demo: Loading Philippine Data

# Load Philippine Population Data (PSA 2020 Census)
import pandas as pd

# Load the dataset
df = pd.read_csv('ph_population_2020.csv')

# Quick overview
print(f"Dataset: {len(df)} provinces across {df['region'].nunique()} regions")
print(f"Columns: {list(df.columns)}")

# Preview first 5 rows
df.head()

	region	region_name	province	population_2020	growth_rate
0	NCR	National Capital Region	Metro Manila	13,484,462	0.93
1	CAR	Cordillera Admin Region	Benguet	460,683	1.55
2	CAR	Cordillera Admin Region	Ifugao	207,498	0.95

Code Demo: Descriptive Statistics

# Get summary statistics
df['population_2020'].describe()

# Output (abridged; numbers shown with thousands separators for readability):
# count      81.000000
# mean    1,364,578.54
# std     1,892,432.12
# min        17,783.00  (Batanes - smallest province)
# 25%       460,683.00
# 50%       786,653.00
# max    13,484,462.00  (Metro Manila - largest)

# Which region has the highest average population?
df.groupby('region_name')['population_2020'].mean().sort_values(ascending=False).head(3)

Key Insight: Metro Manila has 13.5M people in just 619 km² - that's a density of roughly 21,800 people per km²!
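The density figure is a single division, shown here as a quick check. Both inputs come from the slides; the variable names are just for illustration, and the result depends on which land-area figure is used.

```python
population = 13_484_462   # Metro Manila, 2020 figure from the table above
area_km2 = 619            # land area quoted in the key insight

density = population / area_km2
print(f"{density:,.0f} people per km²")
```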

Code Demo: Quick Visualization

import matplotlib.pyplot as plt

# Population by region
region_pop = df.groupby('region')['population_2020'].sum()
region_pop = region_pop.sort_values(ascending=False)

# Create bar chart
plt.figure(figsize=(10, 6))
plt.bar(region_pop.index[:5],
        region_pop.values[:5] / 1e6)
plt.ylabel('Population (Millions)')
plt.title('Top 5 Regions by Population')
plt.show()

Code Demo: Filtering & Analysis

# Find provinces with high growth rates (> 2% annual)
high_growth = df[df['growth_rate'] > 2.0].sort_values('growth_rate', ascending=False)
print(f"Found {len(high_growth)} high-growth provinces")

# Top 5 fastest growing
high_growth[['province', 'region_name', 'growth_rate', 'population_2020']].head()

	province	region_name	growth_rate	population_2020
1	Sulu	Bangsamoro	3.92%	1,000,108
2	Cavite	CALABARZON	3.38%	4,344,829
3	Rizal	CALABARZON	2.91%	3,330,143
4	Lanao del Sur	Bangsamoro	2.72%	1,195,518
5	Tawi-Tawi	Bangsamoro	2.45%	441,045

Insight: CALABARZON (Metro Manila suburbs) and BARMM show the highest growth rates.
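One way to back up this insight is to group the high-growth provinces by region. The sketch below rebuilds the five rows shown in the table above as a small DataFrame (so it runs without the CSV) and counts provinces per region; the variable name `summary` is an assumption of this example.

```python
import pandas as pd

# The five rows shown in the table above.
high_growth = pd.DataFrame(
    {
        "province": ["Sulu", "Cavite", "Rizal", "Lanao del Sur", "Tawi-Tawi"],
        "region_name": ["Bangsamoro", "CALABARZON", "CALABARZON",
                        "Bangsamoro", "Bangsamoro"],
        "growth_rate": [3.92, 3.38, 2.91, 2.72, 2.45],
    }
)

# Count and average growth of high-growth provinces per region.
summary = (
    high_growth.groupby("region_name")["growth_rate"]
    .agg(["count", "mean"])
    .sort_values("count", ascending=False)
)
print(summary)
```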

4. Prediction & Inference

Statistical Inference:

Drawing conclusions about populations
Hypothesis testing
Confidence intervals

Machine Learning (Refresher):

Supervised: Regression, Classification
Unsupervised: Clustering

Key Difference:

Inference: Explains relationships (why?)
Prediction: Forecasts outcomes (what?)

Analytics often prioritizes inference for decision-making.
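The difference between the two goals shows up even in a one-variable linear fit. In this minimal sketch the data are hypothetical (think of `x` as weekly promo spend and `y` as weekly sales): inference reads the fitted slope, while prediction plugs a new input into the fitted line.

```python
# Hypothetical data: x = weekly promo spend, y = weekly sales (arbitrary units).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares for a single predictor.
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x

# Inference: interpret the fitted relationship.
print(f"Each extra unit of x is associated with {slope:.2f} more units of y")

# Prediction: forecast the outcome for an unseen input.
new_x = 6
forecast = intercept + slope * new_x
print(f"Forecast for x = {new_x}: {forecast:.2f}")
```

A full inferential analysis would also attach a confidence interval to the slope, which is covered under statistical inference in Week 2.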

5. Communication

The most critical (and often neglected) facet!

"The goal is not to build models - it's to drive decisions."

Visualization

Charts & graphs
Interactive dashboards
Infographics

Storytelling

Narrative structure
Context & meaning
Call to action

Delivery

Reports
Presentations
Executive summaries

Philippine Data Sources

Source	Data Types	URL
PSA OpenSTAT	GDP, population, labor, poverty	openstat.psa.gov.ph
BSP Statistics	Exchange rates, remittances, banking	bsp.gov.ph/statistics
PAGASA	Weather, typhoons, climate	bagong.pagasa.dost.gov.ph
DOH	Health, COVID-19, diseases	doh.gov.ph
PSE	Stock prices, company data	edge.pse.com.ph

Full dataset guide: Philippine Datasets Reference

Case Study: Moreyball

How Analytics Changed Basketball Forever

Moreyball: Analytics Comes to Basketball

Daryl Morey (Houston Rockets, 2007-2020)

MIT Sloan graduate, not a basketball player
Applied statistical analysis to the NBA
Asked: "Which shots actually win games?"
The answer changed basketball forever

2 vs 3

The Math That Changed the Game

Key insight: Not all 2-point shots are equal, and most mid-range shots are bad decisions.

Moreyball: The Expected Value Math

Expected Points = Shot Value × Success Rate

Shot Type	Value	Typical %	Expected Points	Verdict
Layup/Dunk	2	65%	1.30	Best
3-Pointer	3	36%	1.08	Good
Mid-Range (2pt)	2	40%	0.80	Avoid!

Conclusion: A 36% three-pointer is worth more than a 40% mid-range shot!
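The table can be reproduced directly from the formula Expected Points = Shot Value × Success Rate. The sketch below uses the values and typical percentages from the table; the break-even calculation at the end is an extra step added for illustration.

```python
# Values and typical percentages copied from the table above.
shots = {
    "Layup/Dunk": (2, 0.65),
    "3-Pointer": (3, 0.36),
    "Mid-Range (2pt)": (2, 0.40),
}

# Expected points = shot value x success rate.
expected = {name: value * pct for name, (value, pct) in shots.items()}
for name, points in sorted(expected.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16} {points:.2f} expected points")

# Break-even: the 3-point percentage that matches a 40% mid-range shot.
break_even_3pt = expected["Mid-Range (2pt)"] / 3
print(f"Break-even 3PT%: {break_even_3pt:.1%}")
```

The break-even figure shows why the conclusion holds: a three-point shooter only needs to make a bit more than a quarter of attempts to match a 40% mid-range shot.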

Moreyball: Visualizing Expected Value

The visual makes it obvious: Mid-range shots are inefficient. Even a mediocre 3-pointer beats a good mid-range shot!

Activity: Calculate Expected Value

Think-Pair-Share

Instructions: With a partner, calculate the expected points for each shot type. Which shot should you take?

Shot Type	Point Value	Your Shooting %	Expected Points
Free Throw	1	80%	____
Corner 3-Pointer	3	42%	____
Post-up (close 2pt)	2	55%	____
Floater (mid-range)	2	38%	____

3 minutes

Discussion: Would your strategy change if you were down by 2 with 10 seconds left?

Moreyball: The Three-Point Revolution

The Rockets' Strategy:

Shoot 3-pointers (high expected value)
Drive to the basket (layups + fouls)
Get to the free throw line
Eliminate mid-range shots

45+

3PT Attempts per Game (2018-19)

League-leading, 2x more than 2012

James Harden's step-back 3 became the signature move of the analytics era.

Moreyball: How It Changed the NBA

22 → 35

League Avg 3PA (2010 → 2023)

↓ 50%

Mid-Range Shot Frequency

Steph Curry

The Ultimate Moreyball Player

Before and After Analytics

Before: Mid-range specialists like Michael Jordan and Kobe Bryant were highly valued
After: Teams hunt 3-pointers; players like DeMar DeRozan seen as "inefficient"
Warriors dynasty: Built on Curry/Thompson shooting + analytics

Basketball Analytics: Beyond Shooting

Analytics Application	Traditional View	Analytics View
Player Value	Points per game	Win Shares, PER, VORP, RPM
Draft Picks	Eye test, athleticism	College stats, physical measurements, motor
Defense	Blocks, steals	Defensive Rating, opponent FG% at rim
Lineup Decisions	Coach intuition	Plus/minus data, matchup analytics
Rest/Load Management	Play through injury	Injury prediction models, rest optimization

Analytics Finds Hidden Gems

Draft Steals Found by Data

Nikola Jokic (41st pick): Advanced stats loved his passing
Draymond Green (35th): Analytics showed defensive versatility
Malcolm Brogdon (36th): Efficient college stats overlooked

PBA Application?

Which local players are undervalued?
Is the PBA still mid-range heavy?
Import decisions: stats vs. "name"
UAAP/NCAA data for draft picks

Question: Could a PBA team use Moreyball principles to win a championship on a budget?

Moreyball: Lessons for Data Analytics

Challenge the "eye test": What looks good isn't always effective
Expected value matters: Think probabilistically, not emotionally
Market inefficiencies exist: Find what others undervalue
Data changes culture: Analytics is now standard practice across professional sports
But there are limits: The Rockets never won a championship under Morey; the Warriors lost the 2019 Finals to the Raptors

Key takeaway: Analytics gives an edge, but doesn't guarantee victory. Human execution still matters.

Analytics in the Philippines

Industry	Application	Companies
Fintech	Credit scoring, fraud detection	GCash, Maya, Tonik
Banking	Risk modeling, churn prediction	BDO, BPI, UnionBank
Retail	Demand forecasting, basket analysis	SM, Puregold, Mercury
Telecom	Network optimization, churn	Globe, Smart, DITO
Transport	Route optimization, pricing	Grab, Angkas, Lalamove

Course Overview: 12 Weeks

Foundations (Weeks 1-3)

Week 1: Introduction & Lifecycle
Week 2: Probability & Statistics
Week 3: Data Wrangling

EDA & Visualization (Weeks 4-6)

Week 4: Exploratory Data Analysis
Week 5: Visualization Principles
Week 6: Storytelling & Dashboards

Modeling (Weeks 7-9)

Week 7: Regression Analytics
Week 8: Tree-Based Methods
Week 9: Clustering & Segmentation

Advanced (Weeks 10-12)

Week 10: Time Series Analytics
Week 11: Text Analytics & Ethics
Week 12: Capstone Presentations

Assessment Structure

Component	Weight
Weekly Labs	25%
Midterm Exam	20%
Quizzes (3)	15%
Capstone Project	30%
Participation	10%

Capstone Project

End-to-end analytics project:

Team of 2-3 students
Philippine context preferred
EDA + Model + Dashboard
15-minute presentation

Tools We'll Use

Python

pandas, numpy, scikit-learn, matplotlib, seaborn

SQL

Data extraction, joins, aggregations

Tableau/Streamlit

Interactive dashboards

# Your standard analytics stack
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

The Analytics Edge
in Industry

Lecture 2: Real-World Case Studies

Case Study: Netflix

The $1 Billion Recommendation Engine

Netflix: The Scale of the Problem

238M

Subscribers Worldwide

17,000+

Titles Available

90 sec

Average Decision Time

The Challenge: If a user can't find something to watch in 90 seconds, they leave. Every lost session = churn risk. Netflix estimates poor recommendations cost $1 billion/year in lost subscribers.

Netflix: Subscriber Growth Journey

Key Inflection Points:

2013: House of Cards launches (data-driven)
2016: Global expansion complete
2020: COVID-19 boosts streaming

12x growth in 13 years - powered by recommendation algorithms that improve engagement by 80%

Netflix: The Netflix Prize (2006-2009)

The Competition

Prize: $1,000,000
Goal: Beat Netflix's algorithm by 10%
Dataset: 100M ratings from 480K users
Duration: 3 years
Teams: 40,000+ from 186 countries

What They Learned

Ensemble methods beat single algorithms
Combining diverse approaches works best
The winning solution: 107 algorithms combined!
Spawned an entire Kaggle-style competition industry
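Why combining models helps can be seen in a toy example. The Netflix Prize was scored on root mean squared error (RMSE); the ratings and the two "models" below are hypothetical, constructed so that their errors point in different directions on most items. Averaging them is the simplest possible ensemble, far simpler than the blends used by the winning teams.

```python
from math import sqrt

def rmse(predicted, actual):
    """Root mean squared error between predictions and true values."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical star ratings and two imperfect models whose errors differ.
actual  = [4.0, 3.0, 5.0, 2.0, 1.0]
model_a = [4.5, 2.5, 5.5, 2.5, 0.5]
model_b = [3.5, 3.5, 5.5, 1.5, 1.5]

# Simplest possible ensemble: average the two predictions.
blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]

print(f"Model A RMSE: {rmse(model_a, actual):.3f}")
print(f"Model B RMSE: {rmse(model_b, actual):.3f}")
print(f"Blend RMSE:   {rmse(blend, actual):.3f}")
```

The blend wins because errors in opposite directions cancel, which is why diverse models help more than near-copies of the same model.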

Netflix: The House of Cards Decision

$100M bet on data: Netflix's first original series investment

Data Points Analyzed:

British "House of Cards" was popular on Netflix
Kevin Spacey films performed well
David Fincher films had high completion rates
Political dramas had dedicated audiences

No Pilot

Ordered 2 Seasons Directly

Result: Massive success, proved data-driven content creation works. Now standard practice for streaming.

Netflix: A/B Testing Everything

Netflix runs 250+ A/B tests simultaneously - every change is tested on real users.

What They Test:

Thumbnail images (different ones for different users!)
Row ordering on homepage
Synopsis text and trailers
Autoplay timing
Skip intro button placement

Thumbnail Optimization:

Same movie, different thumbnails based on your viewing history:

Romance viewer → sees couple
Action viewer → sees explosion
Comedy viewer → sees funny scene
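A thumbnail test like the ones described above usually ends with a statistical comparison of two conversion rates. The sketch below is a standard two-proportion z-test written with the standard library only; the click counts are hypothetical, and nothing here is claimed to be Netflix's actual methodology.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)

# Hypothetical counts: clicks out of 1,000 impressions per thumbnail.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A p-value below the chosen threshold (commonly 0.05) suggests the difference between thumbnails is unlikely to be chance alone. The A/B testing framework is covered formally in Week 2.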

Netflix: Data They Collect

Viewing Behavior

What you watch and when
Where you pause, rewind, skip
Time of day patterns
Device and location
Binge vs. casual viewing

Content Analysis

Genre micro-tags (76,897 categories!)
Scene-by-scene metadata
Color palettes and pacing
Actor/director connections
Audio characteristics

Engagement Signals

Completion rates
Re-watch behavior
Search queries
Browse patterns
Sharing activity

Fun fact: Netflix knows that 70% of viewers who watch 3 episodes will finish the season.

Netflix: Lessons for Data Analytics

Personalization at scale: One-size-fits-all doesn't work
Test everything: A/B testing removes guesswork
Data for creativity: Analytics can inform content decisions
Micro-segmentation: 76,897 genres > 10 genres
Retention is key: Keeping users > acquiring new ones

Case Study: Spotify

Turning Data Into a Product Feature

Spotify Wrapped: Analytics as Marketing

156M

Users Shared Wrapped (2023)

Ad Spend Required

Why It's Brilliant:

Free viral marketing every December
Makes users feel "seen" by the algorithm
Creates social proof and FOMO
Competitors (Apple, YouTube) now copy it

Spotify: Analytics Behind Wrapped

Listening Analysis

Total minutes played
Top artists, songs, genres
Discovery vs. familiar ratio
Listening time patterns

Behavioral Insights

"Listening personality" classification
Audio feature preferences
Playlist creation habits
Skip rate patterns

Social Context

How you compare to others
Top % of artist fans
Genre popularity trends
Regional differences

Lesson: Your analytics insights can become a product feature that users love!

Case Study: Framingham

The Study That Changed Medicine

Framingham Heart Study: 75+ Years of Data

Study Design (1948)

Location: Framingham, Massachusetts
Original cohort: 5,209 adults
Now: 3rd generation enrolled
Visits: Every 2-4 years for life
Variables: 1,000+ health measures

75+

Years of Continuous Data

3,000+

Published Papers

Framingham: Revolutionary Discoveries

Year	Discovery	Impact
1961	Cholesterol linked to heart disease	Statin drugs, dietary guidelines
1967	Physical activity reduces risk	Exercise recommendations
1970	High blood pressure → stroke	Blood pressure medications
1978	HDL ("good") cholesterol protective	Refined cholesterol guidelines
1988	Obesity as independent risk factor	Public health campaigns

The term "risk factor" was invented by Framingham researchers!

Framingham: Why This Study Matters

This is the power of longitudinal data analysis:

Correlation → Causation: Decades of data help establish causal relationships
Risk prediction: Framingham Risk Score used by doctors worldwide
Policy impact: Directly influenced FDA, WHO, AHA guidelines
Saved millions of lives: Heart disease deaths have dropped 60% since the 1950s

In Week 7: We'll build predictive models using actual Framingham data!

Philippine Case Study

GCash: Analytics for Financial Inclusion

GCash: The Philippine Problem

Financial Inclusion Challenge

51% of Filipinos are unbanked (2019)
7,641 islands - physical banks impractical
No credit history = no loans
Cash-based economy, remittances vital

51%

Filipinos Without Bank Accounts (2019)

$36B

Annual OFW Remittances

Opportunity: 70M+ smartphone users but only 30M banked citizens

GCash: Explosive Growth Through Analytics

81M+

Registered Users (2023)

↑ from 20M in 2019

$17B+

Monthly Transactions

3.5M

Partner Merchants

COVID-19 accelerated adoption: 4x growth in 2020-2021 as cash became risky and ayuda needed digital distribution.

GCash: The 4x Growth Story

What Drove the Growth?

Ayuda distribution - 18M+ beneficiaries
QR payments - 3.5M merchants
GCash Forest - gamification
GLoan/GCredit - financial inclusion

GCash: GScore - Analytics-Powered Credit

Problem: No credit history = No traditional credit score

Solution: Build credit scores from app behavior

Alternative Data Used:

Transaction frequency and amounts
Bill payment consistency
App usage patterns
Network of contacts
Top-up behavior

Impact

GLoan: Instant loans up to ₱25,000
GCredit: Buy now, pay later
Approval in seconds using ML models
First-time borrowers who never had credit access

This is exactly what we'll build in Weeks 7-8!

GCash: Real-Time Fraud Detection

Challenge: Process millions of transactions per day while catching fraud in real-time

Fraud Signals Analyzed:

Transaction velocity (too many, too fast)
Geographic anomalies (login from new location)
Device fingerprinting
Behavioral biometrics (typing patterns)
Network analysis (connected to known fraudsters)

ML Models Used:

Real-time scoring (< 100ms decision)
Anomaly detection for unusual patterns
Graph analytics for fraud rings
Adaptive models that learn from new fraud patterns
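The first signal in the list, transaction velocity, can be sketched as a sliding-window rule. This is an illustrative toy, not GCash's actual system: the function name `flag_velocity`, the 60-second window, and the limit of 3 transactions are all assumptions.

```python
from collections import deque

def flag_velocity(timestamps, window_seconds=60, max_tx=3):
    """Return the timestamps at which an account exceeds max_tx transactions
    within the trailing window (a simple 'too many, too fast' rule)."""
    recent = deque()
    flagged = []
    for t in sorted(timestamps):
        recent.append(t)
        # Drop transactions that fell out of the trailing window.
        while t - recent[0] > window_seconds:
            recent.popleft()
        if len(recent) > max_tx:
            flagged.append(t)
    return flagged

# Hypothetical transaction times for one account, in seconds.
print(flag_velocity([0, 10, 20, 25, 30, 400]))
```

A fixed rule like this is easy to explain but easy to evade, which is one reason production systems layer it with the anomaly-detection and graph-based models listed above.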

Activity: Design Fraud Detection Signals

Brainstorm Session

Scenario: You're a data analyst at GCash. A new fraud pattern has emerged: scammers are tricking users into sending money via "wrong send" schemes.

The "wrong send" scam:

Scammer "accidentally" sends money
Asks victim to return it
Original transaction was fraudulent (stolen funds)
Victim becomes money mule

Your task:

What data signals would you look for to detect this pattern? Think about:

Transaction patterns
Account characteristics
User behavior

4 minutes

Philippine Case Study

Grab: Real-Time Analytics at Scale

Grab: The Optimization Challenge

Real-Time Decisions Required

Matching: Which driver for which rider?
Pricing: What fare is fair right now?
ETA: How long will it really take?
Routing: Best path considering live traffic?

Millions

Rides per Day Across SEA

< 2 sec

Driver Match Decision Time

Grab: Analytics in Action

Function	Analytics Method	Business Impact
Dynamic Pricing	Demand prediction, elasticity modeling	Balance supply/demand, reduce wait times
Driver Matching	Optimization algorithms, ML ranking	Faster pickups, higher driver earnings
ETA Prediction	Time series, traffic pattern ML	Accurate expectations, fewer cancellations
GrabFood	Demand forecasting, restaurant recommendations	Reduced food waste, faster delivery
Safety	Anomaly detection, route monitoring	Protect riders and drivers

Grab: The Surge Pricing Algorithm

Factors in the Model:

Real-time demand (ride requests)
Available drivers in area
Weather conditions (rain = high demand)
Events (concerts, games)
Time of day patterns
Historical data for the area

Goal:

Price high enough to attract drivers from other areas, but low enough that riders still book.

This is optimization under uncertainty!
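The goal stated above can be made concrete with a toy pricing rule. This is a sketch under strong assumptions, not Grab's algorithm: the function `surge_multiplier`, the linear response to the demand/supply ratio, the `sensitivity` value, and the `cap` are all hypothetical, and it uses only the first two factors from the list.

```python
def surge_multiplier(requests, drivers, sensitivity=0.5, cap=3.0):
    """Toy surge rule: price rises with the demand/supply ratio, within limits."""
    if drivers == 0:
        return cap  # no supply at all: charge the maximum allowed
    ratio = requests / drivers
    raw = 1.0 + sensitivity * (ratio - 1.0)
    # Never discount below the base fare, never exceed the cap.
    return max(1.0, min(cap, raw))

# Hypothetical areas: balanced, undersupplied, and heavily undersupplied.
for requests, drivers in [(50, 100), (120, 60), (500, 50)]:
    print(requests, drivers, surge_multiplier(requests, drivers))
```

The cap is where policy enters the model: choosing its value is a business and ethical decision, not a statistical one.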

Ethical consideration: Is surge pricing during typhoons fair? We'll discuss in Week 11.

Ethics in Data Analytics

When Algorithms Affect Lives

Why Ethics Matters in Analytics

Algorithms now make decisions about:

High Stakes

Who gets a loan
Who gets hired
Prison sentences
Medical diagnoses

Medium Stakes

Insurance pricing
College admissions
Housing applications
Content recommendations

Key Questions

Who is harmed if wrong?
Is there recourse?
Can decisions be explained?
Is training data biased?

Case Study: COMPAS

When Algorithms Decide Who Goes to Prison

COMPAS: The Algorithm in the Courtroom

What is COMPAS?

Name: Correctional Offender Management Profiling for Alternative Sanctions
Purpose: Predict likelihood of re-offending
Used by: Judges in bail and sentencing decisions
States: Used in Florida, New York, Wisconsin, others

Risk Score: 1-10

Based on 137 questions about:

Criminal history
Family background
Education, employment
Social relationships

COMPAS: The ProPublica Investigation

2016: ProPublica analyzed 7,000 defendants in Broward County, Florida

Key Findings:

Metric	Black	White
False Positive Rate	44.9%	23.5%
False Negative Rate	28.0%	47.7%

Translation: Black defendants who didn't re-offend were almost twice as likely to be labeled high-risk.
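Both metrics in the table come straight from a confusion matrix. The sketch below shows the definitions; the counts are hypothetical and are NOT the Broward County data, they were chosen only so the rates are easy to read.

```python
def error_rates(tp, fp, tn, fn):
    """False positive and false negative rates from confusion-matrix counts."""
    fpr = fp / (fp + tn)  # did not re-offend, but labeled high-risk
    fnr = fn / (fn + tp)  # re-offended, but labeled low-risk
    return fpr, fnr

# Hypothetical counts for two groups (not the ProPublica dataset).
group_a = error_rates(tp=72, fp=45, tn=55, fn=28)
group_b = error_rates(tp=52, fp=23, tn=77, fn=48)

print(f"Group A: FPR = {group_a[0]:.0%}, FNR = {group_a[1]:.0%}")
print(f"Group B: FPR = {group_b[0]:.0%}, FNR = {group_b[1]:.0%}")
```

Note that the two error types harm different people: a false positive penalizes someone who would not have re-offended, while a false negative releases someone who does.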

Real Impact

Higher risk scores mean:

Higher bail amounts
Longer sentences
Lower likelihood of parole
Real people, real consequences

Activity: Identify the Bias

Group Discussion

If the algorithm doesn't use race as an input, how can it still produce racially biased outcomes?

Consider these factors:

Prior arrests (not convictions)
Neighborhood crime rates
Family members with criminal history
Employment history
Education level

Discussion questions:

Which factors might correlate with race?
What's the difference between "fair" and "accurate"?
Should algorithms be used for sentencing at all?

5 minutes

COMPAS: The Fairness Paradox

Northpointe's Defense

"The algorithm is calibrated fairly - if we say 70% risk, 70% actually re-offend, regardless of race."

This is true! Calibration is equal.

The Mathematical Reality

When base rates differ between groups, no imperfect predictor can satisfy all three at once:

Equal false positive rates AND
Equal false negative rates AND
Equal calibration

Pick 2, sacrifice 1.
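The trade-off can be checked numerically. The sketch below uses an identity that follows from the confusion-matrix definitions: FPR = p/(1-p) × (1-PPV)/PPV × (1-FNR), where p is the base rate. As a simplification, positive predictive value (PPV) stands in for calibration here, and the base rates and other numbers are hypothetical.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate forced by a given base rate, PPV, and FNR."""
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * (1 - fnr)

# Hold PPV and FNR equal across groups; only the base rate differs.
ppv, fnr = 0.7, 0.3
fpr_high_base = implied_fpr(base_rate=0.5, ppv=ppv, fnr=fnr)
fpr_low_base = implied_fpr(base_rate=0.3, ppv=ppv, fnr=fnr)

print(f"Base rate 50%: FPR = {fpr_high_base:.3f}")
print(f"Base rate 30%: FPR = {fpr_low_base:.3f}")
```

With two of the three quantities held equal, the third is forced apart as soon as the base rates differ, so no amount of tuning removes the trade-off.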

COMPAS: Lessons for Data Analysts

Historical bias → Model bias: If past data reflects discrimination, models will too
"Fairness" has many definitions: Stakeholders may disagree on which matters
Black-box algorithms are dangerous: If you can't explain it, should it decide freedom?
Context matters: Different applications need different fairness criteria
Human accountability: Someone must be responsible for algorithmic decisions

We'll study fairness in ML formally in Week 11

Ethics in the Philippine Context

Potential Issues:

GScore: Alternative credit can perpetuate inequality
Facial recognition: Used by malls, may misidentify people with darker skin
Surge pricing: Fair during typhoons?
Content moderation: Filipino-language hate speech harder to detect

Questions for Your Capstone

Who benefits from your analysis?
Who might be harmed?
Is your data representative?
Can your results be misused?

Philippine Data Privacy Act (RA 10173)

Key Provisions (enacted 2012):

Consent: Must be freely given, specific, informed
Purpose limitation: Use only for stated purpose
Data minimization: Collect only what's needed
Accuracy: Keep data up to date
Storage limitation: Delete when no longer needed
Security: Protect against unauthorized access

Penalty: Up to ₱5M fine and imprisonment for violations. The National Privacy Commission (NPC) is the enforcing agency.

Your Ethical Responsibilities

As a data analyst, you should:

Question your data sources and potential biases
Consider who is affected by your analysis
Be transparent about limitations and uncertainties
Protect privacy and follow data protection laws
Speak up when you see analytics being misused

Remember: "Just because we can doesn't mean we should."

The Analytics Job Market

35%

Projected Growth (2022-2032)

Much faster than average

PHP 40-80K

Entry-Level Salary (PH)

Skills in Demand:

Python/R programming
SQL and databases
Data visualization
Statistical analysis
Communication skills
Business acumen

Week 1 Key Takeaways

Data analytics transforms raw data into actionable insights
The lifecycle has 5 facets: Collect → Manage → Explore → Predict → Communicate
Communication is often the most critical (and neglected) facet
Philippine companies (GCash, Grab, SM) are actively using analytics
Ethics must be central to analytics practice
Growing job market with skills shortage = opportunity!

Lab 1 Preview

This Week's Lab

Introduction to Python for Analytics

Setting up environment
Loading Philippine economic data
Basic exploration with pandas
Your first visualization

# Lab 1 Preview
import pandas as pd
import matplotlib.pyplot as plt

# Load Philippine GDP data
df = pd.read_csv('ph_gdp.csv')

# Quick exploration
print(df.shape)
print(df.describe())

# First visualization
df.plot(x='year', y='gdp')
plt.title('Philippine GDP Growth')
plt.show()

Next Week Preview

Week 2: Probability & Statistical Foundations

Lecture 3: Probability Review

Random variables & distributions
Bayes' theorem
Central Limit Theorem

Lecture 4: Statistical Inference

Hypothesis testing
Confidence intervals
A/B testing framework

References

MIT OpenCourseWare: The Analytics Edge (15.071)
Harvard CS109: Data Science Course Materials
UC Berkeley: Data 100 - Principles of Data Science
Davenport, T. (2006). Competing on Analytics
Lewis, M. (2003). Moneyball: The Art of Winning an Unfair Game
ProPublica: Machine Bias (COMPAS analysis)
Philippine Data Privacy Act (RA 10173)
GCash Annual Reports (2022-2023)

Questions?

CMSC 178DA - Data Analytics

University of the Philippines Cebu
Department of Computer Science
