CMSC 178DA

Week 1: Introduction to Data Analytics


Introduction to Data Analytics

CMSC 178DA - Week 01

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

Learning Objectives

By the end of this week, you will be able to:

  1. Define data analytics and distinguish it from related fields
  2. Explain the data science lifecycle and its five key facets
  3. Identify real-world applications of data analytics
  4. Understand ethical considerations in analytics
  5. Recognize analytics opportunities in the Philippine context

What is Data Analytics?

Definition:

The science of analyzing raw data to draw conclusions, identify patterns, and support decision-making.

Key Components:

  • Input: Raw data from various sources
  • Process: Statistical & computational methods
  • Output: Actionable insights

Why It Matters

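  • Companies using analytics are 5x more likely to make decisions faster than their peers (Bain & Company survey)
  • Data-driven organizations are 23x more likely to acquire customers (McKinsey Global Institute)
  • 35% projected growth in U.S. data scientist jobs, 2022-2032 (Bureau of Labor Statistics)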

Analytics vs Related Fields

Field | Primary Question | Focus
Data Analytics | What happened? Why? | Insights & decisions
Data Science | What can we learn? | Broader exploration, ML/AI
Machine Learning | What will happen? | Prediction & automation
Business Intelligence | What are the KPIs? | Reporting & dashboards

This course focuses on Data Analytics, with an ML refresher (you've already taken ML!)

The Analytics Spectrum

  1. Descriptive: What happened?
  2. Diagnostic: Why did it happen?
  3. Predictive: What will happen?
  4. Prescriptive: What should we do?

Example - E-commerce:

  • Descriptive: Sales dropped 15% last month
  • Diagnostic: Checkout abandonment increased
  • Predictive: Sales will drop 20% if unchanged
  • Prescriptive: Simplify checkout, add payment options

The Data Science Lifecycle

Harvard CS109 Framework

Five Key Facets

The complete journey from raw data to actionable insights:

  1. Collection: Wrangling, cleaning, sampling
  2. Management: Storage, access, reliability
  3. Exploration: EDA, hypotheses, intuition
  4. Prediction: Models, algorithms, inference
  5. Communication: Visualization, storytelling

1. Data Collection

Sources:

  • Databases (SQL, NoSQL)
  • APIs (REST, GraphQL)
  • Web scraping
  • Surveys and forms
  • IoT sensors

Philippine Example:

PSA Census Data

Every 5 years, PSA collects data on:

  • Population demographics
  • Housing conditions
  • Education levels
  • Employment status

2. Data Management

Key Considerations:

  • Storage: CSV, databases, cloud
  • Quality: Accuracy, completeness
  • Governance: Access control, policies
  • Security: Encryption, backups

Philippine Government Data:

Portal | Data Types
PSA OpenSTAT | Demographics, labor
BSP Statistics | Financial, banking
PAGASA | Weather, climate
Data.gov.ph | Open government

3. Exploratory Data Analysis

EDA: "Detective work" on data - understanding patterns before modeling.

Key Activities:

  • Summary statistics
  • Distribution analysis
  • Correlation exploration
  • Outlier detection
# Quick EDA in Python
import pandas as pd

df = pd.read_csv('ph_data.csv')
df.describe()  # Summary stats
df.info()      # Data types
df.isnull().sum()  # Missing values

Code Demo: Loading Philippine Data

# Load Philippine Population Data (PSA 2020 Census)
import pandas as pd

# Load the dataset
df = pd.read_csv('ph_population_2020.csv')

# Quick overview
print(f"Dataset: {len(df)} provinces across {df['region'].nunique()} regions")
print(f"Columns: {list(df.columns)}")

# Preview first 5 rows
df.head()
   region              region_name      province  population_2020  growth_rate
0     NCR  National Capital Region  Metro Manila       13,484,462         0.93
1     CAR  Cordillera Admin Region       Benguet          460,683         1.55
2     CAR  Cordillera Admin Region        Ifugao          207,498         0.95

Code Demo: Descriptive Statistics

# Get summary statistics
df['population_2020'].describe()

# Output:
# count      81.000000
# mean    1,364,578.54
# std     1,892,432.12
# min        17,783.00  (Batanes - smallest province)
# 25%       460,683.00
# 50%       786,653.00
# max    13,484,462.00  (Metro Manila - largest)

# Which region has the highest average population?
df.groupby('region_name')['population_2020'].mean().sort_values(ascending=False).head(3)

Key Insight: Metro Manila has 13.5M people in just 619 km², a density of roughly 21,800 people per km²!
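The density figure is a single division, so it is easy to check. A minimal sketch using the population from the preview table and the rounded land area quoted on the slide (the exact result shifts slightly depending on which official area figure is used):

```python
# Inputs quoted in the slides; the land area is a rounded value.
population = 13_484_462   # Metro Manila, 2020 census count
area_km2 = 619            # approximate land area in km²

# Population density = people / land area
density = population / area_km2
print(f"{density:,.0f} people per km²")
```

The same pattern (one column divided by another) is how a density column would be derived for every province in a DataFrame.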

Code Demo: Quick Visualization

import matplotlib.pyplot as plt

# Population by region
region_pop = df.groupby('region')['population_2020'].sum()
region_pop = region_pop.sort_values(ascending=False)

# Create bar chart
plt.figure(figsize=(10, 6))
plt.bar(region_pop.index[:5],
        region_pop.values[:5] / 1e6)
plt.ylabel('Population (Millions)')
plt.title('Top 5 Regions by Population')
plt.show()
[Bar chart: Top 5 Regions by Population. CALABARZON 16.9M, NCR 13.5M, Central Luzon 12.4M, Central Visayas 8.7M, Western Visayas 7.5M]

Code Demo: Filtering & Analysis

# Find provinces with high growth rates (> 2% annual)
high_growth = df[df['growth_rate'] > 2.0].sort_values('growth_rate', ascending=False)
print(f"Found {len(high_growth)} high-growth provinces")

# Top 5 fastest growing
high_growth[['province', 'region_name', 'growth_rate', 'population_2020']].head()
        province  region_name  growth_rate  population_2020
1           Sulu   Bangsamoro        3.92%        1,000,108
2         Cavite   CALABARZON        3.38%        4,344,829
3          Rizal   CALABARZON        2.91%        3,330,143
4  Lanao del Sur   Bangsamoro        2.72%        1,195,518
5      Tawi-Tawi   Bangsamoro        2.45%          441,045

Insight: CALABARZON (Metro Manila suburbs) and BARMM show the highest growth rates.

4. Prediction & Inference

Statistical Inference:

  • Drawing conclusions about populations
  • Hypothesis testing
  • Confidence intervals

Machine Learning (Refresher):

  • Supervised: Regression, Classification
  • Unsupervised: Clustering

Key Difference:

  • Inference: Explains relationships (why?)
  • Prediction: Forecasts outcomes (what?)

Analytics often prioritizes inference for decision-making.

5. Communication

The most critical (and often neglected) facet!

"The goal is not to build models - it's to drive decisions."

Visualization

  • Charts & graphs
  • Interactive dashboards
  • Infographics

Storytelling

  • Narrative structure
  • Context & meaning
  • Call to action

Delivery

  • Reports
  • Presentations
  • Executive summaries

Philippine Data Sources

Source | Data Types | URL
PSA OpenSTAT | GDP, population, labor, poverty | openstat.psa.gov.ph
BSP Statistics | Exchange rates, remittances, banking | bsp.gov.ph/statistics
PAGASA | Weather, typhoons, climate | bagong.pagasa.dost.gov.ph
DOH | Health, COVID-19, diseases | doh.gov.ph
PSE | Stock prices, company data | edge.pse.com.ph

Full dataset guide: Philippine Datasets Reference

Case Study: Moreyball

How Analytics Changed Basketball Forever

Moreyball: Analytics Comes to Basketball

Daryl Morey (Houston Rockets, 2007-2020)

  • MIT Sloan graduate, not a basketball player
  • Applied statistical analysis to the NBA
  • Asked: "Which shots actually win games?"
  • The answer changed basketball forever

2 vs 3: The Math That Changed the Game

Key insight: Not all 2-point shots are equal, and most mid-range shots are bad decisions.

Moreyball: The Expected Value Math

Expected Points = Shot Value × Success Rate

Shot Type | Value | Typical % | Expected Points | Verdict
Layup/Dunk | 2 | 65% | 1.30 | Best
3-Pointer | 3 | 36% | 1.08 | Good
Mid-Range (2pt) | 2 | 40% | 0.80 | Avoid!

Conclusion: A 36% three-pointer is worth more than a 40% mid-range shot!
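The expected-value comparison can be reproduced in a few lines. This is a small sketch using the shot values and typical percentages from the table; the function and variable names are illustrative, not part of any basketball analytics library.

```python
# Shot values and typical success rates taken from the table above.
shots = {
    "Layup/Dunk": (2, 0.65),
    "3-Pointer": (3, 0.36),
    "Mid-Range (2pt)": (2, 0.40),
}

def expected_points(value, success_rate):
    """Expected points per attempt = shot value x success rate."""
    return value * success_rate

# Rank shot types from most to least valuable per attempt.
ranked = sorted(shots.items(), key=lambda kv: expected_points(*kv[1]), reverse=True)
for name, (value, pct) in ranked:
    print(f"{name:<16} {expected_points(value, pct):.2f}")
```

A useful follow-up question: the break-even three-point percentage against a 40% mid-range shot is 0.80 / 3, or about 26.7%, which is why even below-average three-point shooters can be the better option.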

Moreyball: Visualizing Expected Value

[Bar chart: Expected points per attempt. Layup/Dunk 1.30 (2pt × 65%), 3-Pointer 1.08 (3pt × 36%), Mid-Range 0.80 (2pt × 40%)]

The visual makes it obvious: Mid-range shots are inefficient. Even a mediocre 3-pointer beats a good mid-range shot!

Activity: Calculate Expected Value

Think-Pair-Share

Instructions: With a partner, calculate the expected points for each shot type. Which shot should you take?

Shot Type | Point Value | Your Shooting % | Expected Points
Free Throw | 1 | 80% | ____
Corner 3-Pointer | 3 | 42% | ____
Post-up (close 2pt) | 2 | 55% | ____
Floater (mid-range) | 2 | 38% | ____

3 minutes

Discussion: Would your strategy change if you were down by 2 with 10 seconds left?

Moreyball: The Three-Point Revolution

The Rockets' Strategy:

  1. Shoot 3-pointers (high expected value)
  2. Drive to the basket (layups + fouls)
  3. Get to the free throw line
  4. Eliminate mid-range shots

45+ three-point attempts per game (2018-19): league-leading, about 2x more than in 2012

James Harden's step-back 3 became the signature move of the analytics era.

Moreyball: How It Changed the NBA

  • League average 3PA per game: about 18 → 35 (2010 → 2023)
  • Mid-range shot frequency: ↓ 50%
  • Steph Curry: the ultimate Moreyball player

Before and After Analytics

  • Before: Mid-range specialists like Michael Jordan and Kobe Bryant were highly valued
  • After: Teams hunt 3-pointers; players like DeMar DeRozan are seen as "inefficient"
  • Warriors dynasty: Built on Curry/Thompson shooting + analytics

Basketball Analytics: Beyond Shooting

Analytics Application | Traditional View | Analytics View
Player Value | Points per game | Win Shares, PER, VORP, RPM
Draft Picks | Eye test, athleticism | College stats, physical measurements, motor
Defense | Blocks, steals | Defensive Rating, opponent FG% at rim
Lineup Decisions | Coach intuition | Plus/minus data, matchup analytics
Rest/Load Management | Play through injury | Injury prediction models, rest optimization

Analytics Finds Hidden Gems

Draft Steals Found by Data

  • Nikola Jokic (41st pick): Advanced stats loved his passing
  • Draymond Green (35th): Analytics showed defensive versatility
  • Malcolm Brogdon (36th): Efficient college stats overlooked

PBA Application?

  • Which local players are undervalued?
  • Is the PBA still mid-range heavy?
  • Import decisions: stats vs. "name"
  • UAAP/NCAA data for draft picks

Question: Could a PBA team use Moreyball principles to win a championship on a budget?

Moreyball: Lessons for Data Analytics

  1. Challenge the "eye test": What looks good isn't always effective
  2. Expected value matters: Think probabilistically, not emotionally
  3. Market inefficiencies exist: Find what others undervalue
  4. Data changes culture: Analytics is now standard practice across professional sports
  5. But there are limits: The Rockets never won a championship under Morey, and the Warriors lost the 2019 Finals to the Raptors

Key takeaway: Analytics gives an edge, but doesn't guarantee victory. Human execution still matters.

Analytics in the Philippines

Industry | Application | Companies
Fintech | Credit scoring, fraud detection | GCash, Maya, Tonik
Banking | Risk modeling, churn prediction | BDO, BPI, UnionBank
Retail | Demand forecasting, basket analysis | SM, Puregold, Mercury
Telecom | Network optimization, churn | Globe, Smart, DITO
Transport | Route optimization, pricing | Grab, Angkas, Lalamove

Course Overview: 12 Weeks

Foundations (Weeks 1-3)

  • Week 1: Introduction & Lifecycle
  • Week 2: Probability & Statistics
  • Week 3: Data Wrangling

EDA & Visualization (Weeks 4-6)

  • Week 4: Exploratory Data Analysis
  • Week 5: Visualization Principles
  • Week 6: Storytelling & Dashboards

Modeling (Weeks 7-9)

  • Week 7: Regression Analytics
  • Week 8: Tree-Based Methods
  • Week 9: Clustering & Segmentation

Advanced (Weeks 10-12)

  • Week 10: Time Series Analytics
  • Week 11: Text Analytics & Ethics
  • Week 12: Capstone Presentations

Assessment Structure

Component | Weight
Weekly Labs | 25%
Midterm Exam | 20%
Quizzes (3) | 15%
Capstone Project | 30%
Participation | 10%

Capstone Project

End-to-end analytics project:

  • Team of 2-3 students
  • Philippine context preferred
  • EDA + Model + Dashboard
  • 15-minute presentation

Tools We'll Use

  • Python: pandas, numpy, scikit-learn, matplotlib, seaborn
  • SQL: Data extraction, joins, aggregations
  • Tableau/Streamlit: Interactive dashboards

# Your standard analytics stack
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

The Analytics Edge in Industry

Lecture 2: Real-World Case Studies

Case Study: Netflix

The $1 Billion Recommendation Engine

Netflix: The Scale of the Problem

  • 238M: Subscribers worldwide
  • 17,000+: Titles available
  • 90 sec: Average decision time

The Challenge: If a user can't find something to watch in about 90 seconds, they leave. Every lost session adds churn risk. Netflix has estimated that its recommendation system saves more than $1 billion per year by reducing subscriber churn.

Netflix: Subscriber Growth Journey

[Line chart: Netflix subscribers by year. 2010: 20M, 2012: 33M, 2014: 57M, 2016: 83M, 2018: 118M, 2020: 167M, 2023: 238M]

Key Inflection Points:

  • 2013: House of Cards launches (data-driven)
  • 2016: Global expansion complete
  • 2020: COVID-19 boosts streaming

12x growth in 13 years, supported by recommendation algorithms that Netflix credits with driving about 80% of hours watched
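The growth multiple can be checked from the chart's endpoints (20M in 2010, 238M in 2023) and turned into an annual rate. A minimal sketch; the compound annual growth rate is an added illustration, not a figure from the slides.

```python
start_subs, end_subs = 20e6, 238e6   # 2010 and 2023 subscriber counts from the chart
years = 2023 - 2010

multiple = end_subs / start_subs
# Compound annual growth rate: the constant yearly rate that produces the same multiple.
cagr = multiple ** (1 / years) - 1

print(f"Growth multiple: {multiple:.1f}x")
print(f"Implied annual growth: {cagr:.1%}")
```

Reporting both numbers is good practice: "12x" sounds dramatic, while the annualized rate is easier to compare across companies and time periods.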

Netflix: The Netflix Prize (2006-2009)

The Competition

  • Prize: $1,000,000
  • Goal: Beat Netflix's algorithm by 10%
  • Dataset: 100M ratings from 480K users
  • Duration: 3 years
  • Teams: 40,000+ from 186 countries

What They Learned

  • Ensemble methods beat single algorithms
  • Combining diverse approaches works best
  • The 2007 Progress Prize entry blended 107 models; the final winning solution combined even more
  • Helped popularize Kaggle-style data science competitions
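Why blending helps can be seen in a toy example. The ratings and predictions below are made up for illustration; the only detail taken from the competition is the error metric (RMSE). When two models make different mistakes, averaging them can cancel part of the error.

```python
from math import sqrt

def rmse(predictions, actual):
    """Root mean squared error between predicted and actual ratings."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predictions, actual)) / len(actual))

# Made-up ratings and predictions, for illustration only.
actual  = [4.0, 3.0, 5.0, 2.0]
model_a = [5.0, 3.0, 4.0, 2.0]   # wrong on items 1 and 3
model_b = [3.0, 4.0, 5.0, 3.0]   # wrong on items 1, 2 and 4

# Simplest possible ensemble: average the two models' predictions.
blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]

for name, preds in [("Model A", model_a), ("Model B", model_b), ("Blend", blend)]:
    print(f"{name}: RMSE = {rmse(preds, actual):.3f}")
```

The gain depends on the models' errors being at least partly uncorrelated; averaging two copies of the same model would change nothing.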

Netflix: The House of Cards Decision

$100M bet on data: Netflix's first major original series investment

Data Points Analyzed:

  • British "House of Cards" was popular on Netflix
  • Kevin Spacey films performed well
  • David Fincher films had high completion rates
  • Political dramas had dedicated audiences

No pilot: Netflix ordered 2 seasons directly

Result: Massive success, proved data-driven content creation works. Now standard practice for streaming.

Netflix: A/B Testing Everything

Netflix runs 250+ A/B tests simultaneously - every change is tested on real users.

What They Test:

  • Thumbnail images (different ones for different users!)
  • Row ordering on homepage
  • Synopsis text and trailers
  • Autoplay timing
  • Skip intro button placement

Thumbnail Optimization:

Same movie, different thumbnails based on your viewing history:

  • Romance viewer → sees couple
  • Action viewer → sees explosion
  • Comedy viewer → sees funny scene
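An A/B test on thumbnails boils down to comparing two click-through rates. The sketch below uses a two-proportion z-test, one common choice for this comparison; the counts are hypothetical and this is not a description of Netflix's internal tooling.

```python
from math import sqrt, erf

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference between two click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return rate_a, rate_b, z, p_value

# Hypothetical experiment: thumbnail A vs thumbnail B, 10,000 impressions each.
rate_a, rate_b, z, p_value = two_proportion_z_test(500, 10_000, 580, 10_000)
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  z = {z:.2f}  p = {p_value:.4f}")
```

Week 2 covers the statistics behind this test; the point here is that "test everything" means every change gets a measured effect and a measure of uncertainty.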

Netflix: Data They Collect

Viewing Behavior

  • What you watch and when
  • Where you pause, rewind, skip
  • Time of day patterns
  • Device and location
  • Binge vs. casual viewing

Content Analysis

  • Genre micro-tags (76,897 categories!)
  • Scene-by-scene metadata
  • Color palettes and pacing
  • Actor/director connections
  • Audio characteristics

Engagement Signals

  • Completion rates
  • Re-watch behavior
  • Search queries
  • Browse patterns
  • Sharing activity

Fun fact: Netflix knows that 70% of viewers who watch 3 episodes will finish the season.

Netflix: Lessons for Data Analytics

  1. Personalization at scale: One-size-fits-all doesn't work
  2. Test everything: A/B testing removes guesswork
  3. Data for creativity: Analytics can inform content decisions
  4. Micro-segmentation: 76,897 genres > 10 genres
  5. Retention is key: Keeping users > acquiring new ones

Case Study: Spotify

Turning Data Into a Product Feature

Spotify Wrapped: Analytics as Marketing

  • 156M: Users shared Wrapped (2023)
  • $0: Ad spend required

Why It's Brilliant:

  • Free viral marketing every December
  • Makes users feel "seen" by the algorithm
  • Creates social proof and FOMO
  • Competitors (Apple, YouTube) now copy it

Spotify: Analytics Behind Wrapped

Listening Analysis

  • Total minutes played
  • Top artists, songs, genres
  • Discovery vs. familiar ratio
  • Listening time patterns

Behavioral Insights

  • "Listening personality" classification
  • Audio feature preferences
  • Playlist creation habits
  • Skip rate patterns

Social Context

  • How you compare to others
  • Top % of artist fans
  • Genre popularity trends
  • Regional differences

Lesson: Your analytics insights can become a product feature that users love!

Case Study: Framingham

The Study That Changed Medicine

Framingham Heart Study: 75+ Years of Data

Study Design (1948)

  • Location: Framingham, Massachusetts
  • Original cohort: 5,209 adults
  • Now: 3rd generation enrolled
  • Visits: Every 2-4 years for life
  • Variables: 1,000+ health measures

  • 75+: Years of continuous data
  • 3,000+: Published papers

Framingham: Revolutionary Discoveries

Year | Discovery | Impact
1961 | Cholesterol linked to heart disease | Statin drugs, dietary guidelines
1967 | Physical activity reduces risk | Exercise recommendations
1970 | High blood pressure → stroke | Blood pressure medications
1978 | HDL ("good") cholesterol protective | Refined cholesterol guidelines
1988 | Obesity as independent risk factor | Public health campaigns

The term "risk factor" was invented by Framingham researchers!

Framingham: Why This Study Matters

This is the power of longitudinal data analysis:

  1. Correlation → Causation: Decades of data help establish causal relationships
  2. Risk prediction: Framingham Risk Score used by doctors worldwide
  3. Policy impact: Directly influenced FDA, WHO, AHA guidelines
  4. Saved millions of lives: Heart disease death rates have dropped about 60% since the 1950s

In Week 7: We'll build predictive models using actual Framingham data!

Philippine Case Study

GCash: Analytics for Financial Inclusion

GCash: The Philippine Problem

Financial Inclusion Challenge

  • 51% of Filipinos are unbanked (2019)
  • 7,641 islands - physical banks impractical
  • No credit history = no loans
  • Cash-based economy, remittances vital

  • 51%: Filipinos without bank accounts (2019)
  • $36B: Annual OFW remittances

Opportunity: 70M+ smartphone users but only 30M banked citizens

GCash: Explosive Growth Through Analytics

  • 81M+: Registered users (2023), ↑ from 20M in 2019
  • $17B+: Monthly transactions
  • 3.5M: Partner merchants

COVID-19 accelerated adoption: registered users nearly tripled between 2019 and 2021 (20M → 55M) as cash became risky and ayuda needed digital distribution.

GCash: The 4x Growth Story

[Bar chart: GCash registered users by year. 2019 (pre-COVID): 20M, 2020 (ayuda): 33M, 2021 (cashless): 55M, 2022: 76M, 2023: 81M+]

What Drove the Growth?

  • Ayuda distribution - 18M+ beneficiaries
  • QR payments - 3.5M merchants
  • GCash Forest - gamification
  • GLoan/GCredit - financial inclusion

GCash: GScore - Analytics-Powered Credit

Problem: No credit history = No traditional credit score

Solution: Build credit scores from app behavior

Alternative Data Used:

  • Transaction frequency and amounts
  • Bill payment consistency
  • App usage patterns
  • Network of contacts
  • Top-up behavior

Impact

  • GLoan: Instant loans up to ₱25,000
  • GCredit: Buy now, pay later
  • Approval in seconds using ML models
  • First-time borrowers who never had credit access

This is the kind of model we'll build in Weeks 7-8!

GCash: Real-Time Fraud Detection

Challenge: Process millions of transactions per day while catching fraud in real-time

Fraud Signals Analyzed:

  • Transaction velocity (too many, too fast)
  • Geographic anomalies (login from new location)
  • Device fingerprinting
  • Behavioral biometrics (typing patterns)
  • Network analysis (connected to known fraudsters)

ML Models Used:

  • Real-time scoring (< 100ms decision)
  • Anomaly detection for unusual patterns
  • Graph analytics for fraud rings
  • Adaptive models that learn from new fraud patterns
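The transaction-velocity signal ("too many, too fast") is the simplest of these to prototype. The sketch below is a hypothetical rule-based check with made-up thresholds, not GCash's actual system; it assumes timestamps arrive in ascending order and are measured in seconds.

```python
from collections import deque

class VelocityCheck:
    """Flags an account that sends too many transactions within a short window."""

    def __init__(self, max_txns=5, window_seconds=60):
        self.max_txns = max_txns              # hypothetical threshold
        self.window_seconds = window_seconds  # hypothetical window length
        self.recent = {}                      # account_id -> deque of recent timestamps

    def is_suspicious(self, account_id, timestamp):
        window = self.recent.setdefault(account_id, deque())
        window.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        return len(window) > self.max_txns

check = VelocityCheck(max_txns=3, window_seconds=60)
flags = [check.is_suspicious("acct-1", t) for t in [0, 10, 20, 30, 200]]
print(flags)
```

A fixed rule like this is easy to explain but easy to evade, which is why production systems typically combine such signals with the learned models listed above.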

Activity: Design Fraud Detection Signals

Brainstorm Session

Scenario: You're a data analyst at GCash. A new fraud pattern has emerged: scammers are tricking users into sending money via "wrong send" schemes.

The "wrong send" scam:

  1. Scammer "accidentally" sends money
  2. Asks victim to return it
  3. Original transaction was fraudulent (stolen funds)
  4. Victim becomes money mule

Your task:

What data signals would you look for to detect this pattern? Think about:

  • Transaction patterns
  • Account characteristics
  • User behavior

4 minutes

Philippine Case Study

Grab: Real-Time Analytics at Scale

Grab: The Optimization Challenge

Real-Time Decisions Required

  • Matching: Which driver for which rider?
  • Pricing: What fare is fair right now?
  • ETA: How long will it really take?
  • Routing: Best path considering live traffic?

  • Millions: Rides per day across SEA
  • < 2 sec: Driver match decision time

Grab: Analytics in Action

Function | Analytics Method | Business Impact
Dynamic Pricing | Demand prediction, elasticity modeling | Balance supply/demand, reduce wait times
Driver Matching | Optimization algorithms, ML ranking | Faster pickups, higher driver earnings
ETA Prediction | Time series, traffic pattern ML | Accurate expectations, fewer cancellations
GrabFood | Demand forecasting, restaurant recommendations | Reduced food waste, faster delivery
Safety | Anomaly detection, route monitoring | Protect riders and drivers

Grab: The Surge Pricing Algorithm

Factors in the Model:

  • Real-time demand (ride requests)
  • Available drivers in area
  • Weather conditions (rain = high demand)
  • Events (concerts, games)
  • Time of day patterns
  • Historical data for the area

Goal:

Price high enough to attract drivers from other areas, but low enough that riders still book.

This is optimization under uncertainty!
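To make the trade-off concrete, here is a toy surge rule based only on the first two factors (ride requests and available drivers). The formula, the `sensitivity` parameter, and the `cap` are assumptions for illustration; Grab's real model is not public and uses many more inputs.

```python
def surge_multiplier(ride_requests, available_drivers, sensitivity=0.5, cap=3.0):
    """Toy surge rule: price rises with the demand/supply ratio, up to a cap."""
    if available_drivers == 0:
        return cap  # no supply at all: charge the maximum allowed multiplier
    ratio = ride_requests / available_drivers
    # No surge while drivers outnumber requests; linear increase beyond that.
    multiplier = 1.0 + sensitivity * max(0.0, ratio - 1.0)
    return min(multiplier, cap)

print(surge_multiplier(50, 100))     # more drivers than requests
print(surge_multiplier(200, 100))    # twice as many requests as drivers
print(surge_multiplier(1000, 100))   # extreme demand, limited by the cap
```

The cap is where policy enters the model: choosing it is a business and ethical decision, not a statistical one.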

Ethical consideration: Is surge pricing during typhoons fair? We'll discuss in Week 11.

Ethics in Data Analytics

When Algorithms Affect Lives

Why Ethics Matters in Analytics

Algorithms now make decisions about:

High Stakes

  • Who gets a loan
  • Who gets hired
  • Prison sentences
  • Medical diagnoses

Medium Stakes

  • Insurance pricing
  • College admissions
  • Housing applications
  • Content recommendations

Key Questions

  • Who is harmed if wrong?
  • Is there recourse?
  • Can decisions be explained?
  • Is training data biased?

Case Study: COMPAS

When Algorithms Decide Who Goes to Prison

COMPAS: The Algorithm in the Courtroom

What is COMPAS?

  • Name: Correctional Offender Management Profiling for Alternative Sanctions
  • Purpose: Predict likelihood of re-offending
  • Used by: Judges in bail and sentencing decisions
  • States: Used in Florida, New York, Wisconsin, others

Risk Score: 1-10

Based on 137 questions about:

  • Criminal history
  • Family background
  • Education, employment
  • Social relationships

COMPAS: The ProPublica Investigation

2016: ProPublica analyzed 7,000 defendants in Broward County, Florida

Key Findings:

Metric | Black | White
False Positive Rate | 44.9% | 23.5%
False Negative Rate | 28.0% | 47.7%

Translation: Black defendants who didn't re-offend were almost twice as likely to be labeled high-risk.
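Both rates come straight from a confusion matrix. The sketch below uses the per-group counts as commonly reproduced from ProPublica's published Broward County analysis; treat them as illustrative inputs. They match the table's rates to within rounding.

```python
# Confusion-matrix counts per group (illustrative; see the note above).
# tp = re-offended and labeled high-risk, fp = did not re-offend but labeled high-risk,
# tn = did not re-offend and labeled low-risk, fn = re-offended but labeled low-risk.
counts = {
    "Black": {"tp": 1369, "fp": 805, "tn": 990, "fn": 532},
    "White": {"tp": 505, "fp": 349, "tn": 1139, "fn": 461},
}

def error_rates(c):
    fpr = c["fp"] / (c["fp"] + c["tn"])  # share of non-re-offenders labeled high-risk
    fnr = c["fn"] / (c["fn"] + c["tp"])  # share of re-offenders labeled low-risk
    return fpr, fnr

rates = {group: error_rates(c) for group, c in counts.items()}
for group, (fpr, fnr) in rates.items():
    print(f"{group}: FPR = {fpr:.1%}, FNR = {fnr:.1%}")
```

Note the denominators: the false positive rate is computed only over people who did not re-offend, which is why it answers a different question than overall accuracy.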

Real Impact

Higher risk scores mean:

  • Higher bail amounts
  • Longer sentences
  • Less likely parole
  • Real people, real consequences

Activity: Identify the Bias

Group Discussion

If the algorithm doesn't use race as an input, how can it still produce racially biased outcomes?

Consider these factors:

  • Prior arrests (not convictions)
  • Neighborhood crime rates
  • Family members with criminal history
  • Employment history
  • Education level

Discussion questions:

  1. Which factors might correlate with race?
  2. What's the difference between "fair" and "accurate"?
  3. Should algorithms be used for sentencing at all?

5 minutes

COMPAS: The Fairness Paradox

Northpointe's Defense

"The algorithm is calibrated fairly - if we say 70% risk, 70% actually re-offend, regardless of race."

This is true! Calibration is equal.

The Mathematical Reality

When base rates differ between groups, you cannot have:

  • Equal false positive rates AND
  • Equal false negative rates AND
  • Equal calibration

Pick 2, sacrifice 1.
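The trade-off follows from the definitions of the rates. In a simplified binary setting, with calibration expressed as equal positive predictive value (PPV), the false positive rate is fully determined by the base rate p, the PPV, and the false negative rate: FPR = p/(1-p) × (1-PPV)/PPV × (1-FNR). The groups and numbers below are hypothetical.

```python
def implied_fpr(base_rate, ppv, fnr):
    """False positive rate implied by a base rate, PPV, and false negative rate.

    Derived from the confusion-matrix definitions:
    FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR)
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

# Hypothetical groups: same PPV and same FNR, but different base rates.
fpr_group_a = implied_fpr(base_rate=0.5, ppv=0.6, fnr=0.3)
fpr_group_b = implied_fpr(base_rate=0.4, ppv=0.6, fnr=0.3)

print(f"Group A implied FPR: {fpr_group_a:.1%}")
print(f"Group B implied FPR: {fpr_group_b:.1%}")
```

Holding two of the three criteria equal forces the third apart whenever base rates differ, so the disagreement between ProPublica and Northpointe is about which criterion to prioritize, not about arithmetic.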

COMPAS: Lessons for Data Analysts

  1. Historical bias → Model bias: If past data reflects discrimination, models will too
  2. "Fairness" has many definitions: Stakeholders may disagree on which matters
  3. Black-box algorithms are dangerous: If you can't explain it, should it decide freedom?
  4. Context matters: Different applications need different fairness criteria
  5. Human accountability: Someone must be responsible for algorithmic decisions

We'll study fairness in ML formally in Week 11

Ethics in the Philippine Context

Potential Issues:

  • GScore: Alternative credit can perpetuate inequality
  • Facial recognition: Used by malls, may misidentify darker skin
  • Surge pricing: Fair during typhoons?
  • Content moderation: Filipino-language hate speech harder to detect

Questions for Your Capstone

  • Who benefits from your analysis?
  • Who might be harmed?
  • Is your data representative?
  • Can your results be misused?

Philippine Data Privacy Act (RA 10173)

Key Provisions (enacted 2012):

  1. Consent: Must be freely given, specific, informed
  2. Purpose limitation: Use only for stated purpose
  3. Data minimization: Collect only what's needed
  4. Accuracy: Keep data up to date
  5. Storage limitation: Delete when no longer needed
  6. Security: Protect against unauthorized access

Penalty: Up to ₱5M fine and imprisonment for violations. The National Privacy Commission (NPC) is the enforcing agency.

Your Ethical Responsibilities

As a data analyst, you should:

  1. Question your data sources and potential biases
  2. Consider who is affected by your analysis
  3. Be transparent about limitations and uncertainties
  4. Protect privacy and follow data protection laws
  5. Speak up when you see analytics being misused

Remember: "Just because we can doesn't mean we should."

The Analytics Job Market

  • 35%: Projected growth, 2022-2032 (much faster than average)
  • PHP 40-80K: Entry-level salary (PH)

Skills in Demand:

  • Python/R programming
  • SQL and databases
  • Data visualization
  • Statistical analysis
  • Communication skills
  • Business acumen

Week 1 Key Takeaways

  1. Data analytics transforms raw data into actionable insights
  2. The lifecycle has 5 facets: Collect → Manage → Explore → Predict → Communicate
  3. Communication is often the most critical (and neglected) facet
  4. Philippine companies (GCash, Grab, SM) are actively using analytics
  5. Ethics must be central to analytics practice
  6. Growing job market with skills shortage = opportunity!

Lab 1 Preview

This Week's Lab

Introduction to Python for Analytics

  • Setting up environment
  • Loading Philippine economic data
  • Basic exploration with pandas
  • Your first visualization
# Lab 1 Preview
import pandas as pd
import matplotlib.pyplot as plt

# Load Philippine GDP data
df = pd.read_csv('ph_gdp.csv')

# Quick exploration
print(df.shape)
print(df.describe())

# First visualization
df.plot(x='year', y='gdp')
plt.title('Philippine GDP Growth')
plt.show()

Next Week Preview

Week 2: Probability & Statistical Foundations

Lecture 3: Probability Review

  • Random variables & distributions
  • Bayes' theorem
  • Central Limit Theorem

Lecture 4: Statistical Inference

  • Hypothesis testing
  • Confidence intervals
  • A/B testing framework

Questions?

CMSC 178DA - Data Analytics

University of the Philippines Cebu
Department of Computer Science