CMSC 178DA | Week 7 · Session 1

Regression Analytics:
Explaining What Drives
the Numbers

From correlation to causal interpretation

Department of Computer Science

University of the Philippines Cebu

"Every coefficient tells a story about cause and effect."

"All models are wrong, but some are useful."

— George E.P. Box, 1976

This week: how to make regression useful for explaining and predicting in the Philippine context.

Agenda

Session 1 Objectives

Regression Fundamentals

The OLS equation, coefficient interpretation, and the meaning of “holding all else constant.”

Diagnostics & Assumptions

LINE assumptions, residual plots, VIF, and when to trust your model’s output.

Philippine Application

Regional poverty model using PSA data: literacy, GDP, and urbanization as predictors.

Part I

What Regression Answers

Regression in analytics is about explanation — not just prediction. The coefficients are the story.

ML Predicts; Analytics Explains

ML prediction vs analytics explanation comparison

The Regression Equation Tells a Story

Every regression equation has three characters: the baseline (b₀), the effects (b₁, b₂…), and the unexplained (ε). Together they tell a story about what drives the outcome.

The Intercept (b₀)

The predicted value when all predictors are zero — often a theoretical baseline.

The Slope (b₁)

The change in y for a 1-unit increase in x, holding everything else constant.

Regression equation with labeled components
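To make the three characters concrete, here is a minimal sketch of a one-predictor least-squares fit using only the Python standard library. The data (ad spend and sales for five stores) and the helper name `fit_simple_ols` are hypothetical, chosen only for illustration.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: co-variation of x and y divided by the variation in x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: forces the line through the point of means.
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data: ad spend (₱K) and sales (₱K) for five stores.
ad_spend = [10, 20, 30, 40, 50]
sales = [130, 205, 290, 370, 455]

b0, b1 = fit_simple_ols(ad_spend, sales)
# The unexplained part: what is left after baseline + effect.
residuals = [y - (b0 + b1 * x) for x, y in zip(ad_spend, sales)]
print(f"baseline b0 = {b0:.2f}, effect b1 = {b1:.2f}")
```

With an intercept in the model, the residuals always sum to zero (up to rounding), which is a quick sanity check on any hand-rolled fit.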

Each Coefficient Is a Ceteris Paribus Statement

Coefficient bar chart with confidence intervals

Interpreting Coefficients for Stakeholders

Predictor	b	Translation
Ad Spend ₱K	+8.4	Each additional ₱1K in ads → ₱8.4K more sales
Store Size m²	+5.7	Larger stores sell more, all else equal
Promotion Days	+3.2	Each promo day adds ₱3.2K
Competitor Count	−1.8	Each nearby competitor costs ₱1.8K

Translation Rule

For every 1-unit increase in X, Y changes by b₁ units — holding all other variables constant.

What Stakeholders Want

They don’t want coefficients. They want: “Should we spend more on ads or open a bigger store?” Your regression answers this.
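As a rough sketch of how the translation rule turns into an answer for stakeholders, the snippet below applies the coefficients from the table to two scenarios. The scenario sizes (₱10K more ads, 10 m² more floor area) and the function name `predicted_change` are assumptions made for illustration.

```python
# Coefficients from the table above; the outcome is sales in ₱K.
coefficients = {
    "ad_spend_k": 8.4,
    "store_size_m2": 5.7,
    "promotion_days": 3.2,
    "competitor_count": -1.8,
}


def predicted_change(changes):
    """Predicted change in sales (₱K) for the given predictor changes.

    Predictors not listed in `changes` are held constant (change of zero).
    """
    return sum(coefficients[name] * delta for name, delta in changes.items())


# Hypothetical scenarios: ₱10K more ads versus 10 m² more floor area.
more_ads = predicted_change({"ad_spend_k": 10})
bigger_store = predicted_change({"store_size_m2": 10})
print(f"More ads: {more_ads:+.1f} ₱K, bigger store: {bigger_store:+.1f} ₱K")
```

The comparison only becomes a recommendation once each scenario is priced: a coefficient is per unit of the predictor, and the units here (₱K of ads, m² of floor) do not cost the same.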

Part II

Trusting the Model

A regression is only as good as its assumptions. Visual diagnostics tell you when to trust — and when to stop.

LINE: Four Assumptions You Must Check

LINE assumptions: Linearity, Independence, Normality, Equal variance

Residual Plots Reveal Hidden Problems

Good Residuals vs Bad Residuals

The residual plot is your first diagnostic check. Random scatter around zero is consistent with the linearity and equal-variance assumptions. Patterns reveal specific problems.

✓ Good

Random scatter around zero — assumptions met.

× Bad

Funnel shape — variance increases with x (heteroscedasticity). Fix: log-transform y or use robust standard errors.

Good vs bad residual patterns side by side
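A funnel can also be flagged numerically. The sketch below is a crude check, not a formal test: it compares the spread of residuals for high-x points against low-x points. Both residual series are hypothetical, and `spread_ratio` is an illustrative helper name.

```python
from statistics import median, pstdev


def spread_ratio(x, residuals):
    """Ratio of residual spread for high-x points versus low-x points."""
    cut = median(x)
    low = [r for xi, r in zip(x, residuals) if xi <= cut]
    high = [r for xi, r in zip(x, residuals) if xi > cut]
    return pstdev(high) / pstdev(low)


x = list(range(1, 21))
# Hypothetical funnel: alternating sign, magnitude grows with x.
funnel = [0.5 * xi * (1 if xi % 2 else -1) for xi in x]
# Hypothetical well-behaved residuals: alternating sign, constant size.
flat = [2.0 * (1 if xi % 2 else -1) for xi in x]

funnel_ratio = spread_ratio(x, funnel)
flat_ratio = spread_ratio(x, flat)
print(f"funnel: {funnel_ratio:.2f}, flat: {flat_ratio:.2f}")
```

A ratio near 1 is what equal variance looks like; a ratio well above 1 suggests the spread grows with x and the plot deserves a closer look.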

Multicollinearity Inflates Your Standard Errors

VIF heatmap showing multicollinearity among predictors
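The variance inflation factor for predictor j is 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing predictor j on the other predictors. With exactly two predictors, R²ⱼ is just their squared correlation, which allows the minimal sketch below. The regional GDP and urbanization figures are hypothetical, not PSA data.

```python
from math import sqrt
from statistics import mean


def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical regional values that rise together.
gdp_per_capita = [80, 95, 110, 150, 200, 260]
urbanization = [25, 31, 33, 45, 58, 70]

vif = vif_two_predictors(gdp_per_capita, urbanization)
print(f"VIF = {vif:.1f}")
```

For three or more predictors the same formula applies, but R²ⱼ then requires a multiple regression of each predictor on all the others.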

Part III

Model Evaluation & Application

R² tells you how much you explained. RMSE tells you how far off you are. Context tells you if the model is useful.

R² Measures Explained Variance

R-squared visual: explained vs unexplained variance
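R² can be computed directly as one minus the ratio of unexplained to total variation. The sketch below uses four hypothetical actual and predicted values; `r_squared` is an illustrative helper.

```python
from statistics import mean


def r_squared(actual, predicted):
    """Share of the variation in `actual` accounted for by `predicted`."""
    y_bar = mean(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))  # unexplained
    ss_tot = sum((y - y_bar) ** 2 for y in actual)                 # total
    return 1.0 - ss_res / ss_tot


# Hypothetical outcome values and model predictions.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

r2 = r_squared(actual, predicted)
# Predicting the mean for every case explains nothing: R² = 0.
r2_baseline = r_squared(actual, [mean(actual)] * len(actual))
print(f"R² = {r2:.3f}, mean-only baseline = {r2_baseline:.3f}")
```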

RMSE and MAE: Error in Units You Care About

Both measure prediction error in the same units as your outcome variable. RMSE squares errors before averaging, so large misses count disproportionately; MAE weights every error in proportion to its size.

Metric	Strength	Weakness
RMSE	Penalizes big misses heavily	Sensitive to outliers
MAE	Robust to outliers	Does not single out large misses

Practical Rule

Report both. If RMSE >> MAE, you have outlier predictions that need investigation.

RMSE vs MAE comparison on sample predictions
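The practical rule is easy to see with numbers. In this sketch, two hypothetical sets of predictions differ only in the last value, where one contains a single large miss.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual))


# Hypothetical outcome and two sets of predictions.
actual = [100, 100, 100, 100, 100]
steady = [98, 102, 99, 101, 100]        # small errors only
one_big_miss = [98, 102, 99, 101, 60]   # same, plus one large miss

for name, pred in [("steady", steady), ("one big miss", one_big_miss)]:
    print(f"{name}: MAE = {mae(actual, pred):.2f}, RMSE = {rmse(actual, pred):.2f}")
```

In the steady case the two metrics stay close; with the single large miss, RMSE pulls far ahead of MAE, which is the signal to go and inspect individual predictions.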

Feature Selection: Which Predictors Earn Their Place?

Philippine Regional Poverty: A Regression Case Study

Scatter plot: literacy rate vs poverty incidence across 17 Philippine regions

Multiple Predictors Paint a Richer Picture

Adding GDP per capita and urbanization rate to the literacy model:

Predictor	Coeff.	Note
Literacy Rate	−2.80*	Strongest predictor
GDP per capita	−0.08*	Significant but small effect
Urbanization	−0.15	Not significant after controlling for GDP

Key Insight

Urbanization becomes non-significant once GDP is in the model — suggesting urbanization’s effect on poverty operates through economic output, not independently.

Coefficient plot for multiple regression poverty model

Diagnostics for the Poverty Model

Residual plot and leverage plot for Philippine poverty model

What We See

Generally clean residuals for 15 of 17 regions; the two exceptions are BARMM and NCR. BARMM appears as a high-leverage point: its poverty rate (63%) is a structural outlier driven by conflict, not just economic factors.

Decision

Model is reasonable for most regions. BARMM and NCR are structural outliers — report them separately rather than letting them distort the model.

Three Regression Pitfalls to Avoid

Overfitting

Too many predictors for your sample size. Rule of thumb: you need 10–15 observations per predictor. With 17 Philippine regions, that supports only 1–2 predictors; treat 3 as a hard ceiling.

"I added 8 predictors to a 17-row dataset and got R² = 0.99!"

Extrapolation

The model predicts well within the data range. Outside it, predictions are meaningless. Don’t predict poverty for a region with 70% literacy if your data only goes down to 82%.

"Our model says poverty would be −5% if literacy reached 100%."

Correlation ≠ Causation

Regression coefficients show association, not causation. Ice cream sales and drowning both rise in summer — regression would show a “significant” relationship.

"The data proves that ice cream causes drowning."

Correlation Does Not Imply Causation

A regression coefficient tells you that X and Y move together after controlling for other variables. It does not tell you that X causes Y.

When Can We Claim Causation?

Randomized controlled trials, natural experiments, or instrumental variables. Observational regression alone — no matter how high the R² — cannot prove causation.

The Analyst’s Responsibility

Always use language like “associated with” or “predicts,” never “causes” or “leads to,” when reporting regression results from observational data.

Spurious correlation example: ice cream vs drowning
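The ice cream example can be simulated. In the sketch below, temperature drives both series and neither affects the other; all numbers are synthetic, generated with a fixed seed. The raw correlation is strong, but it largely disappears once temperature is controlled for by correlating the residuals.

```python
import random
from math import sqrt
from statistics import mean


def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    return cov / sqrt(sum((ai - a_bar) ** 2 for ai in a)
                      * sum((bi - b_bar) ** 2 for bi in b))


def residuals_after(x, y):
    """Residuals of y after a simple least-squares regression on x."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Synthetic data: temperature is the common cause (the confounder).
rng = random.Random(178)
temperature = [rng.uniform(20, 35) for _ in range(500)]
ice_cream = [2.0 * t + rng.gauss(0, 3) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 1) for t in temperature]

raw_r = pearson_r(ice_cream, drownings)
# Correlation left over once temperature is removed from both series.
partial_r = pearson_r(residuals_after(temperature, ice_cream),
                      residuals_after(temperature, drownings))
print(f"raw r = {raw_r:.2f}, controlling for temperature r = {partial_r:.2f}")
```

Controlling for a confounder only works when it is measured and included; an unmeasured one leaves the spurious association in place.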

Session 1 Key Takeaways

Regression for analytics = interpretation first, prediction second.
Every coefficient is a “holding all else constant” statement — translate it for stakeholders.
Check LINE assumptions visually with residual plots before trusting any p-value.
VIF > 10 signals severe multicollinearity and unstable coefficient estimates: drop or combine correlated predictors.
The Philippine poverty model shows literacy, not urbanization, as the stronger predictor of regional poverty.

Next: Session 2 — Logistic Regression & Classification

CMSC 178DA | Week 7 · Session 2

Logistic Regression:
Predicting Yes-or-No
Outcomes

From probabilities to decisions

Department of Computer Science

University of the Philippines Cebu

"Behind every automated approval is a probability and a threshold."

GCash approves or rejects 2 million loan applications per month.

Behind every decision is a logistic regression model trained on borrower features.

2M

decisions/month

0.5 sec

per decision

₱0

human review for standard cases

Agenda

Session 2 Objectives

Logistic Fundamentals

The sigmoid function, log-odds, and how to interpret odds ratios for stakeholders.

Classification Metrics

Confusion matrix, precision, recall, F1, and ROC-AUC — and when each matters most.

Decision-Making

Threshold tuning, cost-benefit analysis, and Philippine case studies in credit and education.

Part I

From Linear to Logistic

When the outcome is yes or no, linear regression breaks. The sigmoid function fixes it — and odds ratios make it interpretable.

Linear Regression Cannot Predict Probabilities

Linear regression vs logistic regression for binary outcome

The Sigmoid Curve Maps Any Score to a Probability

The logistic function transforms any real-valued score z into a probability P between 0 and 1. The S-shape ensures smooth transitions near the decision boundary.

Why Sigmoid?

Natural for binary outcomes: as evidence increases, probability approaches 1 asymptotically but never exceeds it.

The Decision Boundary

Where P crosses 0.5 is the default classification threshold — but it’s rarely the optimal one for real problems.

Sigmoid S-curve with decision boundary at P=0.5
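A minimal sketch of the logistic function, P = 1 / (1 + e^(−z)), written so that very large or very small scores do not overflow. The example scores are arbitrary.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, this equivalent form avoids overflow in exp().
    e = math.exp(z)
    return e / (1.0 + e)


for z in (-4, -1, 0, 1, 4):
    print(f"z = {z:+d} -> P = {sigmoid(z):.3f}")
```

A score of z = 0 maps to P = 0.5, which is why the default decision boundary sits where the linear score changes sign.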

Odds Ratios Make Coefficients Interpretable

In logistic regression, we exponentiate the coefficient (e^b) to get the odds ratio. This tells stakeholders how much the odds change per unit increase in the predictor.

How to Read Odds Ratios

OR = 1: no effect. OR > 1: increases odds. OR < 1: decreases odds. OR = 2.5 means the odds are multiplied by 2.5 for each 1-unit increase in the predictor.

Example

A scholarship holder has OR = 0.35 for dropout — meaning 65% lower odds of dropping out compared to non-scholarship students.

Odds ratio forest plot with confidence intervals
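The scholarship example can be worked through in a few lines. The coefficient below is simply the log of the slide's OR = 0.35; the 20% baseline dropout probability is a hypothetical value added to show how a change in odds maps to a change in probability.

```python
import math


def odds_ratio(b):
    """Odds ratio implied by a logistic regression coefficient b."""
    return math.exp(b)


def percent_change_in_odds(b):
    """Percent change in the odds for a 1-unit increase in the predictor."""
    return (math.exp(b) - 1.0) * 100.0


def apply_odds_ratio(p, ratio):
    """New probability after multiplying the odds of probability p by ratio."""
    new_odds = (p / (1.0 - p)) * ratio
    return new_odds / (1.0 + new_odds)


b_scholarship = math.log(0.35)   # coefficient implied by OR = 0.35
baseline_p = 0.20                # hypothetical dropout rate without scholarship

print(f"OR = {odds_ratio(b_scholarship):.2f}")
print(f"Change in odds = {percent_change_in_odds(b_scholarship):+.0f}%")
print(f"Dropout probability: {baseline_p:.3f} -> "
      f"{apply_odds_ratio(baseline_p, 0.35):.3f}")
```

Odds and probability are different scales: a 65% drop in odds is not a 65% drop in probability, and the size of the probability change depends on the baseline.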

Decision Boundaries Separate the Outcomes

Decision boundary separating two classes in feature space

Part II

Measuring Classification Quality

Accuracy is a lie in imbalanced data. Precision, recall, and AUC tell the real story.

The Confusion Matrix: Every Classifier’s Report Card

Confusion matrix with TP, FP, TN, FN labeled

Accuracy Misleads When Classes Are Imbalanced

Accuracy paradox illustration with imbalanced dataset
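A small sketch of the accuracy paradox, assuming a hypothetical batch of 1,000 transactions with 3 fraud cases and a "model" that never flags anything.

```python
# Hypothetical labels: 1 = fraud, 0 = legitimate.
y_true = [1] * 3 + [0] * 997
# A do-nothing classifier that predicts "no fraud" for every case.
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
fraud_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = fraud_caught / sum(y_true)

print(f"accuracy = {accuracy:.1%}, recall = {recall:.1%}")
```

The classifier scores very high on accuracy while catching none of the cases that matter.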

Precision and Recall Answer Different Questions

Precision asks: “Of everything I flagged, how much was correct?” Recall asks: “Of everything that was positive, how much did I catch?”

✓ High Precision Needed

Spam filter — don’t block legitimate email. Cost of FP > cost of FN.

× High Recall Needed

Disease screening — don’t miss sick patients. Cost of FN > cost of FP.

F1 Score

Harmonic mean of precision and recall. Use when you need to balance both and can’t afford to optimize one at the expense of the other.
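The three metrics follow directly from the confusion-matrix counts. This sketch uses ten hypothetical cases; the helper names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1; each is 0.0 when its denominator is zero."""
    tp, fp, _tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels and predictions for ten cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision = {precision:.2f}, recall = {recall:.2f}, F1 = {f1:.2f}")
```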

The ROC Curve Summarizes Performance Across All Thresholds
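The area under the ROC curve (AUC) has a direct reading: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way, by comparing every positive-negative pair; the four labels and scores are hypothetical.

```python
def auc(y_true, scores):
    """AUC as the share of positive-negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

model_auc = auc(y_true, scores)
print(f"AUC = {model_auc:.2f}")
```

Because it only depends on how cases are ranked, AUC does not change when the classification threshold moves. The pairwise loop is fine for small samples; large datasets are usually handled with a rank-based formula instead.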

Threshold Tuning: The Business Decides, Not the Algorithm

Precision-recall tradeoff across different thresholds
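One way to let the business decide is to attach a cost to each type of error and pick the threshold with the lowest total cost. The sketch below assumes hypothetical labels, predicted probabilities, and cost figures; `total_cost` and `best_threshold` are illustrative helpers.

```python
def total_cost(y_true, probs, threshold, cost_fp, cost_fn):
    """Total misclassification cost when flagging cases with P >= threshold."""
    cost = 0.0
    for t, p in zip(y_true, probs):
        flagged = p >= threshold
        if flagged and t == 0:
            cost += cost_fp      # false positive
        elif not flagged and t == 1:
            cost += cost_fn      # false negative
    return cost


def best_threshold(y_true, probs, cost_fp, cost_fn, candidates):
    """Candidate threshold with the lowest total cost (first one on ties)."""
    return min(candidates,
               key=lambda th: total_cost(y_true, probs, th, cost_fp, cost_fn))


# Hypothetical labels and model probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.85]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Missing a positive is 10x worse than a false alarm (screening-like).
when_fn_costly = best_threshold(y_true, probs, cost_fp=1, cost_fn=10,
                                candidates=candidates)
# A false alarm is 10x worse than a miss (spam-filter-like).
when_fp_costly = best_threshold(y_true, probs, cost_fp=10, cost_fn=1,
                                candidates=candidates)
print(when_fn_costly, when_fp_costly)
```

The same model and the same data yield different thresholds once the cost ratio changes, which is the sense in which the threshold belongs to the business rather than the algorithm.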

Part III

Philippine Applications & What Comes Next

From credit scoring to student retention — logistic regression powers decisions across Philippine institutions.

Predicting Student Dropout: A UP Cebu Scenario

Student dropout logistic regression model results

Credit Default Prediction for Philippine Lenders

BSP-regulated consumer lending (GCash GLoan, Bayad Center) requires explainable models per BSP Circular 855.

Feature	OR	Interpretation
Monthly Income	0.62	Higher income → lower default odds
Outstanding Balance	1.45	Higher balance → higher risk
Employment Length	0.78	Longer tenure → lower risk
Active Loans	1.68	More loans → 68% higher default odds

Regulatory Context

BSP requires that lending models be explainable to regulators and consumers. Logistic regression’s odds ratios satisfy this requirement directly.

ROC curve for Philippine credit default model

From Odds Ratios to Actionable Recommendations

The analyst’s job: translate OR = 1.68 into “every additional active loan is associated with 68% higher odds of default.”

Decision Matrix

Low risk (P < 0.3): auto-approve. Medium (0.3–0.7): manual review. High (P > 0.7): auto-reject. Thresholds set by business, not by data science.

The Deliverable

A stakeholder-ready table mapping model output to business actions — not a confusion matrix.

Business impact chart showing cost at different thresholds
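The decision matrix above is simple enough to express as a lookup from model output to action. The cut-offs are the ones given on the slide; the applicant IDs and probabilities are hypothetical.

```python
def credit_action(p_default):
    """Map a predicted default probability to the action in the matrix."""
    if p_default < 0.3:
        return "auto-approve"
    if p_default <= 0.7:
        return "manual review"
    return "auto-reject"


# Hypothetical applicants and predicted default probabilities.
applicants = {"A-001": 0.08, "A-002": 0.30, "A-003": 0.55, "A-004": 0.91}

# The stakeholder-ready table: one row per applicant, one action per row.
print(f"{'Applicant':<10}{'P(default)':<12}Action")
for applicant_id, p in applicants.items():
    print(f"{applicant_id:<10}{p:<12.2f}{credit_action(p)}")
```

Keeping the cut-offs in one small function makes it straightforward for the business to revise them without touching the model.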

Multiclass Extension: Softmax Regression

When the outcome has more than two categories (e.g., customer segment A/B/C), the sigmoid extends to softmax — one probability per class, summing to 1.

Brief Mention

If you have 3+ classes, use softmax. For most analytics use cases, binary logistic regression covers the majority of decision problems.

Week 8 Preview

Decision trees handle multiclass naturally and don’t require the linearity assumption. Next week we explore when trees beat regression.

Softmax function mapping inputs to multiclass probabilities
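A minimal sketch of softmax for three classes. The class scores are hypothetical, and subtracting the maximum score before exponentiating is a standard numerical-stability step that leaves the result unchanged.

```python
import math


def softmax(scores):
    """Turn a list of real-valued class scores into probabilities summing to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # shift by max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for customer segments A, B, C.
segment_scores = [2.0, 1.0, 0.1]
probs = softmax(segment_scores)
print([round(p, 3) for p in probs], "sum =", round(sum(probs), 6))
```

With two classes and scores [z, 0], softmax gives the same probability as the sigmoid of z, so binary logistic regression is the two-class special case.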

Logistic Regression vs Decision Trees: When to Use Which

Both can classify. The choice depends on your data and your audience.

Criterion	Logistic Regression	Decision Tree
Interpretability	Coefficients + OR	If-then rules
Feature Types	Numeric (categorical features need encoding)	Numeric + categorical natively
Linearity	Assumes linear log-odds	No linearity assumption
Interactions	Must add manually	Discovers automatically
Speed	Very fast	Fast (slower for ensembles)

Preview

Next week we explore decision trees, random forests, and gradient boosting — and when they outperform logistic regression.

Side-by-side logistic regression vs decision tree boundaries

Three Logistic Regression Mistakes to Avoid

Using Accuracy on Imbalanced Data

Always check class balance first. If 99% of cases are negative, accuracy is useless. Use F1, AUC, or precision-recall instead.

"Our fraud model has 99.7% accuracy!" — it predicts “no fraud” for everything.

Ignoring Odds Ratio Direction

OR = 0.5 means protective (50% lower odds), not harmful. Misreading direction leads to exactly wrong recommendations.

"OR = 0.35, so scholarships increase dropout risk!" — it’s the opposite.

One Threshold for All

Different business contexts require different thresholds. Fraud detection (low threshold) ≠ churn prediction (balanced) ≠ medical screening (very low).

"We use 0.5 for everything." — one size never fits all.

Session 2 Key Takeaways

Logistic regression predicts probabilities for binary (yes/no) outcomes — bounded between 0 and 1.
Odds ratios (e^b) make coefficients interpretable for stakeholders: no math degree required.
Accuracy lies in imbalanced data — always check precision, recall, F1, and AUC.
Threshold choice is a business decision, not a statistical one — align it with the cost of errors.
Decision trees (Week 8) handle non-linear boundaries and multiclass outcomes naturally.

Next Week: Tree-Based Methods — Decision Trees, Random Forests, and Gradient Boosting

Week 8 Preview

Tree-Based Methods

Decision Trees — interpretable if-then rules

Random Forests — ensemble power

Gradient Boosting — state-of-the-art tabular performance

Lab 7: Build a Philippine credit scoring model, interpret odds ratios, and optimize the threshold for a business objective.

