CMSC 178DA | Week 11 · Session 1

Mining Meaning
from Text

From raw words to actionable insights

Department of Computer Science

University of the Philippines Cebu

"80% of enterprise data is unstructured — and much of it is text."

"The limits of my language mean the limits of my world."

— Ludwig Wittgenstein, 1921

Today: how to teach machines to understand text — and how to use that power responsibly.

Case Study

When NLP Goes Wrong

Amazon's AI Recruiting Tool (2018)

Reuters exclusive report, October 2018

Amazon trained an NLP model on 10 years of resumes to automate hiring. The model learned to penalize resumes containing the word "women's" (e.g., "women's chess club captain") and downgrade graduates of all-women's colleges.

Why? The training data was 10 years of mostly male hires. The model learned that "male" patterns predicted success. Amazon scrapped the entire program.

Text analytics is powerful — but the data you train on encodes the biases of the world that produced it.

Agenda

Session 1 Objectives

Text Preprocessing

Tokenize, clean, and normalize raw text into analysis-ready tokens.

Vectorization

Convert words into numbers using Bag of Words and TF-IDF representations.

Sentiment & Topics

Classify polarity with sentiment analysis and discover themes with LDA topic modeling.

Running Example

Meet Our Data: PH Social Media

Throughout this session, we will preprocess, vectorize, and analyze these posts. They represent real patterns in Philippine social media data.

The Challenge

Code-switching (Taglish), informal spelling, emojis, and sarcasm make PH social media data uniquely difficult for NLP tools built for English.

#	Platform	Post	Sentiment
1	Twitter	"Grabe ang bilis ng GCash today! Love it"	Positive
2	Facebook	"Ang bagal ng internet dito sa probinsya"	Negative
3	Twitter	"Just tried the new Jollibee menu, it's okay naman"	Neutral
4	Facebook	"Sobrang init ngayon, parang oven ang Cebu"	Negative
5	Twitter	"Congrats sa mga bagong graduates! Proud kami!"	Positive
6	Facebook	"The new MRT extension is a game changer"	Positive
7	Twitter	"Nag-update na ba kayo ng PhilSys ID? Hassle amp"	Negative
8	Facebook	"May pasok ba bukas? Walang announcement eh"	Neutral

Part I

Turning Noise
Into Signal

Natural language is messy. Preprocessing cleans, normalizes, and tokenizes text before any model can learn.

Part I — Text Preprocessing

Why Text Analytics?

Unstructured text data is everywhere — and growing faster than any other data type.

Text Data Sources

Customer reviews & feedback
Social media posts & comments
Support tickets & emails
News articles & reports
Survey open-ended responses

Bar chart showing 80% of enterprise data is unstructured text

Part I — Brief History

70 Years of Text Analytics

Why this history matters

Each era didn't replace the previous — they layered. Today's production pipelines mix all of them: regex preprocessing (1960s), TF-IDF retrieval (1990s), and LLM generation (2020s).

Where this lecture sits

We focus on the Classical ML era (★) — BoW, TF-IDF, VADER, TextBlob. These are the foundations every modern system still relies on for cheap, interpretable, fast text features.

Part I — Brief History

Three Paradigms, Side by Side

Paradigm	Era	Core Idea	Strengths	Weaknesses
Symbolic / Rule-based ELIZA, hand-rules	1950s–80s	Encode language as explicit grammars and dictionaries	Transparent, debuggable, no training data needed	Brittle; doesn't generalize; rules explode in complexity
Statistical / Classical ML ★ BoW, TF-IDF, VADER, LDA, SVM	1990s–2010s	Count word frequencies, learn weights from labeled corpora	Fast, cheap, interpretable; works on small data	Loses word order & meaning; lexicons are language-bound (e.g. no Filipino)
Neural / Transformers BERT, GPT, Claude, Llama	2017–today	Learn contextual representations from billions of tokens via self-attention	State-of-the-art on most benchmarks; multilingual; multimodal	Expensive; opaque; can hallucinate; data- and compute-hungry

The "embarrassingly effective" baseline

A 1990s-style TF-IDF + logistic regression can match or even beat a fine-tuned BERT on short, domain-specific text classification, at a fraction (on the order of 1/1000th) of the compute cost. Always benchmark the simple thing first.

Today's stack is hybrid

Modern RAG systems use TF-IDF/BM25 (Era 3) to retrieve documents, then an LLM (Era 7) to generate the answer. Old methods aren't dead — they're load-bearing for the new ones.

Part I — Text Preprocessing

The Preprocessing Pipeline

Raw text needs cleaning before any algorithm can use it. Each step transforms the data into a more useful form.

Pipeline (4 Steps)

1. lowercase: text → text.lower()

2. tokenize: split into words

3. remove stopwords (the, ang, ng, ...)

4. lemmatize: word → base form

Order matters: lowercase before tokenize; stopwords before lemmatize.

Part I — Text Preprocessing

Preprocessing in Python

NLTK Library

The Natural Language Toolkit provides tokenizers, stemmers, lemmatizers, and stopword lists for 20+ languages.

Key Functions

word_tokenize() — split into words
stopwords.words() — common words list
WordNetLemmatizer() — dictionary lookup

Caveat: POS Tagging

WordNet defaults to noun POS — “running” stays as-is. Pass pos='v' for verb lemmatization to get “run”.

In [1]:

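import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time setup: download the tokenizer model, stopword lists, and WordNet
# (newer NLTK releases may also ask for 'punkt_tab')
for resource in ('punkt', 'stopwords', 'wordnet'):
    nltk.download(resource)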

def preprocess(text):
    text = text.lower()
    tokens = word_tokenize(text)
    stops = set(stopwords.words('english'))
    tokens = [t for t in tokens
              if t not in stops and t.isalpha()]
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]

preprocess("The dogs are running FAST!")

Out[1]:

['dog', 'running', 'fast']

Trace: "The dogs are running FAST!" → lowercase → tokenize → ["the","dogs","are","running","fast","!"] → drop stopwords "the","are" and non-alphabetic "!" → lemmatize ("dogs" → "dog") → 3 tokens

Worked Example

Preprocessing Step by Step

Watch a real Taglish tweet get transformed through each pipeline stage.

Step	Operation	Result
Raw	—	Grabe ang ganda ng new GCash update!! 💯🔥 #fintech
1. Lowercase	`.lower()`	grabe ang ganda ng new gcash update!! 💯🔥 #fintech
2. Tokenize	`word_tokenize()`	['grabe', 'ang', 'ganda', 'ng', 'new', 'gcash', 'update', '!', '!', '💯', '🔥', '#', 'fintech']
3. Remove stopwords + non-alpha	`isalpha()` + stopword filter	['grabe', 'ganda', 'new', 'gcash', 'update', 'fintech']
4. Lemmatize	`WordNetLemmatizer()`	['grabe', 'ganda', 'new', 'gcash', 'update', 'fintech']
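
The table above can be reproduced without NLTK as a minimal sketch. The stopword set below is a small hand-picked assumption (not an official Filipino list), the regex tokenizer keeps only runs of ASCII letters (so it would split words containing "ñ"), and lemmatization is left out because it needs a dictionary such as WordNet.

```python
import re

# Hypothetical mini stopword list mixing English and Filipino function words.
TAGLISH_STOPWORDS = {"the", "is", "a", "of", "ang", "ng", "sa", "mga", "na"}

def preprocess_taglish(text):
    """Lowercase, tokenize on letter runs, and drop stopwords."""
    lowered = text.lower()                    # step 1: lowercase
    tokens = re.findall(r"[a-z]+", lowered)   # step 2: tokenize; drops '!', '#', emoji
    return [t for t in tokens if t not in TAGLISH_STOPWORDS]  # step 3: stopwords

tweet = "Grabe ang ganda ng new GCash update!! 💯🔥 #fintech"
result = preprocess_taglish(tweet)
print(result)
# ['grabe', 'ganda', 'new', 'gcash', 'update', 'fintech']
```

Using a regex instead of `word_tokenize()` folds the tokenize and non-alpha filtering steps into one pass, which is a simplification for illustration.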

Live Demo

Text Preprocessing Simulator

Algorithm: Text Preprocessing

1. lowercase: x = x.lower()

2. tokenize: split on whitespace + punctuation

3. remove stopwords (ang, ng, sa, the, is…)

4. lemmatize: dictionary lookup → base form

Part I — Text Preprocessing

Stemming vs. Lemmatization

Comparison of stemmer vs lemmatizer outputs for various words

Part II

Making Words
Countable

Machines need numbers, not words. Bag of Words and TF-IDF convert text into vector space.

Part II — Why BoW?

From Words to Numbers

The Core Problem

Algorithms work on numbers, not text. We need a way to convert "The food was good" into a vector a computer can crunch.

The Idea

Treat each document as an unordered "bag" of words. Build a vocabulary, then count occurrences.

Vector Representation

document d = ($\text{count}(w_1, d), \text{count}(w_2, d), \dots, \text{count}(w_V, d)$)

where $V$ = vocabulary size

d ∈ $\mathbb{R}^V$ (a $V$-dimensional vector)
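
To make the vector definition concrete, here is a from-scratch sketch in plain Python, assuming three short review-style sentences and naive whitespace tokenization. The helper name `bow_vector` is hypothetical; the vocabulary is sorted alphabetically so the column order is reproducible.

```python
from collections import Counter

corpus = [
    "The food was good",
    "The service was bad",
    "Good food but bad service",
]

# Tokenize naively: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in corpus]

# Vocabulary = sorted set of distinct words; its length is V.
vocab = sorted({word for doc in tokenized for word in doc})

def bow_vector(tokens, vocab):
    """Return count(w_1, d), ..., count(w_V, d) for one document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]   # Counter gives 0 for absent words

matrix = [bow_vector(doc, vocab) for doc in tokenized]

print(vocab)
for row in matrix:
    print(row)
```

Each row is one document as a $V$-dimensional count vector; word order inside the sentence plays no role.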

Part II — Text Representation

Bag of Words in Practice

The simplest text representation: count how many times each word appears.

Limitations

Loses word order entirely
Common words dominate the counts
High-dimensional, sparse matrices

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The food was good",
    "The service was bad",
    "Good food but bad service"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['bad', 'but', 'food', 'good', 'service', 'the', 'was']

print(X.toarray())
# [[0, 0, 1, 1, 0, 1, 1],
#  [1, 0, 0, 0, 1, 1, 1],
#  [1, 1, 1, 1, 1, 0, 0]]

Part II — Why TF-IDF?

The Problem with Just Counting Words

The Issue

In Bag of Words, "the" appears 50 times in a document — looks like the most "important" word. But it appears in every document. Useless signal.

The Insight

A word is truly important when it's:

Frequent in this document (the topic)
Rare across all documents (distinguishing)

Goal: Build a score that REWARDS frequent-in-doc and PENALIZES common-everywhere.

Part II — Building TF-IDF

TF-IDF: The Formula, Step by Step

① Term Frequency (TF)

"How frequent is term t in document d?" Normalize by doc length so long docs don't dominate.

$\text{TF}(t,d) = \dfrac{\text{count}(t \in d)}{|d|}$

② Inverse Document Frequency (IDF)

"How rare is t across all N documents?" Use log so it grows slowly with N.

$\text{IDF}(t) = \log\dfrac{N}{df_t}$

③ Multiply Them

High TF × High IDF = important word for this doc.

$\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t)$

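Reading the product: a word that appears in every document has IDF = log(N/N) = 0, so its TF-IDF is zero no matter how high its TF is. Canceled. By contrast, "regression" with moderate TF (0.2) × high IDF (1.3) = 0.26, the highest score. Words common everywhere → IDF = 0 → eliminated. Words rare overall but used here → big TF-IDF score.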
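
The three formulas translate almost line for line into code. This is a minimal sketch under stated assumptions: the toy corpus is hypothetical, tokens are split on whitespace, and the logarithm is base 10 (the slides do not fix a base; any base preserves the ranking).

```python
import math

def tf(term, doc_tokens):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """Inverse document frequency; assumes the term occurs in at least one doc."""
    n_docs = len(corpus_tokens)
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log10(n_docs / df)

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Hypothetical toy corpus: "the" is in every document, "regression" in one.
docs = [
    "the model uses regression on the data",
    "the team cleaned the data",
    "the report is due on friday",
]
corpus_tokens = [d.split() for d in docs]

score_the = tf_idf("the", corpus_tokens[0], corpus_tokens)
score_regression = tf_idf("regression", corpus_tokens[0], corpus_tokens)
print(score_the, score_regression)
```

Even though "the" has the higher TF in the first document, its IDF of zero wipes out the score, while "regression" keeps a positive one.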

Part II — Text Representation

TF-IDF in Python

Key Parameters

max_features — limit vocabulary size
min_df — ignore very rare terms
max_df — ignore very common terms
ngram_range — include bigrams

TL;DR

TF-IDF = TF × log(N/df). Words that are frequent in a document but rare across the corpus get the highest score.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=1000,
    min_df=5,
    max_df=0.95,
    ngram_range=(1, 2)  # Unigrams + bigrams
)

# `documents` is assumed to be a list of strings defined elsewhere
X_tfidf = tfidf.fit_transform(documents)

# Top 5 terms per document, highest score first
feature_names = tfidf.get_feature_names_out()
for i in range(X_tfidf.shape[0]):
    row = X_tfidf[i].toarray().ravel()
    top_indices = row.argsort()[-5:][::-1]
    top_terms = [feature_names[j] for j in top_indices]
    print(f"Doc {i}: {top_terms}")
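
One caveat when comparing hand calculations with library output: with its default settings, scikit-learn's `TfidfVectorizer` uses a smoothed IDF, $\ln\frac{1+N}{1+df_t} + 1$, and then L2-normalizes each document vector, so its numbers will not match the textbook formula. The sketch below contrasts the two IDF variants only; the function names are hypothetical and the normalization step is not shown.

```python
import math

def idf_textbook(n_docs, df):
    """Plain IDF: log(N / df). Zero for a term found in every document."""
    return math.log(n_docs / df)

def idf_sklearn_default(n_docs, df):
    """Smoothed IDF as documented for smooth_idf=True: ln((1+N)/(1+df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 100
for df in (100, 80, 2):
    print(df,
          round(idf_textbook(n_docs, df), 3),
          round(idf_sklearn_default(n_docs, df), 3))
```

The practical difference is that the smoothed variant never reaches zero: a term found in every document still gets an IDF of 1 instead of being eliminated.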

Worked Example

TF-IDF by Hand

3 mini-documents. Compute TF-IDF for the word "gcash" in Doc 1.

Documents

"gcash users love the new gcash savings feature" (8 words)
"gcash reports strong quarterly growth numbers today" (7 words)
"bitcoin prices surge amid global market rally today" (8 words)

Formulas

$\text{TF}(t,d) = \frac{\text{count of } t \text{ in } d}{|d|}$

$\text{IDF}(t) = \log_{10}\!\left(\frac{N}{df_t}\right)$

$\text{TF-IDF} = \text{TF} \times \text{IDF}$

Step	Calculation	Value
TF("gcash", Doc 1)	2 occurrences / 8 words	$2/8 = 0.250$
df("gcash")	Appears in Doc 1, Doc 2	2 documents
IDF("gcash")	$\log_{10}(3/2)$	$0.176$
TF-IDF	$0.250 \times 0.176$	$\mathbf{0.044}$

Interpretation

"gcash" is moderately important in Doc 1: it appears often (high TF) but also in another doc (lowers IDF). A word like "savings" (only in Doc 1) would score higher.

Part II — Text Representation

TF-IDF Scores Across Documents

Heatmap of TF-IDF scores showing common words score low and distinctive words score high

Activity 1

TF-IDF Ranking Challenge

Instructions (5 min, pairs)

Given these 3 PH news headlines from a corpus of 100 articles:

"GCash launches new savings feature for Filipino users"
"BSP raises interest rates amid inflation concerns"
"Filipino fintech GCash reports 86M registered users"

Rank these words by likely TF-IDF score (highest first):

savings Filipino the GCash inflation reports

Hint: Think about which words are frequent in their document but rare across all 100 articles.

Live Demo

TF-IDF Explorer

Algorithm: TF-IDF

1. TF(t,d) = count(t in d) / |d|

2. IDF(t) = log(N / df(t))

3. TF-IDF = TF × IDF

Knowledge Check

Which term has the HIGHEST TF-IDF score?

A) “the” — appears in every document

B) “analytics” — appears in 2 of 100 docs, 5× each

C) “data” — appears in 80 of 100 docs

D) “I” — common stopword

✓ Correct: B) “analytics”

High TF (appears 5 times in those docs) × high IDF (only 2/100 docs) = highest TF-IDF. Terms A, C, D have low IDF because they appear in most documents.
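
The IDF gap can be checked directly for the options that state a document count, assuming base-10 logs and $N = 100$:

$\text{IDF}(\text{the}) = \log_{10}(100/100) = 0$

$\text{IDF}(\text{data}) = \log_{10}(100/80) \approx 0.097$

$\text{IDF}(\text{analytics}) = \log_{10}(100/2) \approx 1.699$

At the same TF, "analytics" would therefore score roughly 17 times higher than "data".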

Part III

Reading Between
the Lines

Sentiment analysis classifies text polarity — positive, negative, or neutral — using lexicons or machine learning.

Part III — Sentiment Analysis

Three Approaches to Sentiment Analysis

Comparison of lexicon-based, ML-based, and pre-trained sentiment approaches

Part III — Sentiment Analysis

VADER Sentiment

What is VADER?

Valence Aware Dictionary and sEntiment Reasoner. Rule-based, tuned for social media text.

Compound Score

Ranges from -1 (most negative) to +1 (most positive). Thresholds: ≥ 0.05 positive, ≤ -0.05 negative, otherwise neutral.

VADER Scoring Algorithm

for each word in text:

Look up valence in lexicon

apply modifiers (caps, !, degree adverbs)

sum adjusted valences

normalize to [-1, +1] → compound score

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

texts = [
    "This product is amazing! I love it!",
    "Terrible experience, never buying again",
    "It's okay, nothing special"
]

for text in texts:
    scores = sia.polarity_scores(text)
    print(f"{text}")
    print(f"  Compound: {scores['compound']:.2f}")

# Output:
# "This product is amazing..." → Compound: 0.86
# "Terrible experience..."     → Compound: -0.48
# "It's okay, nothing..."      → Compound: -0.09

Aha Moment

VADER Doesn't Speak Filipino

Same meaning, different language, completely different result.

This is why off-the-shelf NLP tools fail for Philippine social media data.

Part III — How TextBlob Works

Inside TextBlob's Two Scores

① Polarity (−1 to +1)

Average word polarity from the Pattern lexicon, weighted by intensifiers.

$\text{Polarity} = \dfrac{\sum_w p_w \cdot m_w}{\sum_w m_w}$

$p_w$ = word polarity, $m_w$ = intensifier modifier (e.g., "very" = 1.3×)

② Subjectivity (0 to 1)

Ratio of opinion-bearing words to all scored words.

$\text{Subjectivity} = \dfrac{|\text{opinion words}|}{|\text{scored words}|}$

Lexicon-based: No ML. Just dictionary lookup + averaging. Fast but limited to English & specific domains.

Part III — Sentiment Analysis

TextBlob Sentiment

Two Dimensions

Polarity: -1 (negative) to +1 (positive)
Subjectivity: 0 (factual) to 1 (opinionated)

Per-Sentence Analysis

Analyze each sentence separately for mixed-sentiment texts like product reviews.

from textblob import TextBlob

# Whole-sentence analysis
text = "The food was delicious but the service was slow"
blob = TextBlob(text)
print(f"Polarity:     {blob.sentiment.polarity:.2f}")
print(f"Subjectivity: {blob.sentiment.subjectivity:.2f}")
# Output: Polarity: 0.35, Subjectivity: 0.80

# Per-clause analysis (split manually)
for clause in ["The food was delicious",
               "The service was slow"]:
    pol = TextBlob(clause).sentiment.polarity
    print(f"'{clause}' → {pol:.2f}")

# Output:
# "The food was delicious" → 1.00
# "The service was slow"   → -0.30

Part III — Philippine Context

Philippine Social Media Sentiment

Philippine Social Media

With 98M internet users and 95M social media accounts (DataReportal 2026), the Philippines is one of the most active countries online. Sentiment analysis helps brands, government, and researchers understand public opinion.

Caveat

VADER was designed for English text. Filipino/Taglish posts may need custom lexicons or translated models for accurate results.

Live Demo

Sentiment Analyzer

Algorithm: VADER-like Scoring

1. for each word: score = lexicon[word] or 0

2. compound = normalize(Σ scores)

3. if compound ≥ 0.05 → POSITIVE

elif ≤ -0.05 → NEGATIVE

else → NEUTRAL
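
The three steps above can be sketched in a few lines. Everything in the lexicon is an illustrative assumption: the words and valences are made up for this example and are not copied from the real VADER lexicon, and the modifier rules (caps, "!", degree adverbs) are left out. The squashing constant `alpha=15` mirrors the one used in VADER's reference implementation; treat it as an assumption here.

```python
import math

# Toy valence lexicon (hypothetical values, English only).
TOY_LEXICON = {
    "love": 3.0, "amazing": 2.5, "good": 2.0,
    "terrible": -2.5, "slow": -1.0,
}

def compound_score(text, alpha=15):
    """Sum word valences, then squash the sum into the open interval (-1, 1)."""
    words = [w.strip("!?.,") for w in text.lower().split()]
    total = sum(TOY_LEXICON.get(w, 0.0) for w in words)  # unknown words score 0
    return total / math.sqrt(total * total + alpha)

def classify(text):
    c = compound_score(text)
    if c >= 0.05:
        return "POSITIVE"
    if c <= -0.05:
        return "NEGATIVE"
    return "NEUTRAL"

print(classify("I love it, amazing!"))        # POSITIVE
print(classify("Terrible and slow"))          # NEGATIVE
print(classify("Grabe ang ganda ng GCash"))   # NEUTRAL: no word is in the lexicon
```

The last call shows the failure mode for Filipino text: a clearly positive post scores exactly zero because none of its words exist in an English lexicon.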

Part III — Sentiment Analysis

ML Sentiment Pipeline

TF-IDF + Naive Bayes

A simple yet effective pipeline: vectorize text with TF-IDF, then classify with Naive Bayes (a classifier that assumes features are independent given the class — "naive" because this is rarely true, yet it works surprisingly well). The Multinomial variant models word counts/frequencies, making it ideal for text.

When to Use ML-Based

Domain-specific language (medical, legal)
You have labeled training data
Lexicon approaches underperform

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Labeled training data
X_train = ["great product", "terrible quality", ...]
y_train = [1, 0, ...]  # 1=positive, 0=negative

# Pipeline: TF-IDF + Naive Bayes
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB())
])

pipeline.fit(X_train, y_train)

# Predict new text
predictions = pipeline.predict(["I love this!"])
# Output: [1]  (positive)
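
To show what the classifier computes internally, here is a from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is not scikit-learn's implementation: it works on raw word counts rather than TF-IDF weights, and the four training sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled toy data: 1 = positive, 0 = negative.
train = [
    ("great product love it", 1),
    ("love the great quality", 1),
    ("terrible quality", 0),
    ("terrible product hate it", 0),
]

class_docs = Counter()               # number of documents per class
word_counts = defaultdict(Counter)   # word counts per class
for text, label in train:
    class_docs[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def log_posterior(text, label):
    """log P(class) + sum of log P(word | class), with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_docs[label] / len(train))
    for w in text.split():
        if w not in vocab:
            continue                 # ignore words never seen in training
        p = (word_counts[label][w] + 1) / (total_words + len(vocab))
        score += math.log(p)         # "naive": word probabilities just multiply
    return score

def predict(text):
    return max(class_docs, key=lambda label: log_posterior(text, label))

print(predict("I love this!"))   # 1
print(predict("terrible"))       # 0
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing keeps a word unseen in one class from zeroing out that class entirely.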

Part IV

Discovering
Hidden Themes

LDA topic modeling reveals latent topics. NER extracts named entities. Word clouds visualize term frequencies.

Part IV — Topic Modeling

LDA Topic Modeling

Quick Reminder

Supervised = labeled data, model learns input→output mapping. Unsupervised = no labels, model discovers structure on its own. LDA is unsupervised.

Latent Dirichlet Allocation

Unsupervised algorithm that discovers hidden topics in a collection of documents.

Key Assumptions

Documents are mixtures of topics
Topics are distributions over words
You choose number of topics (k)
Input must be word counts, not TF-IDF

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# LDA needs word COUNTS (not TF-IDF!)
cv = CountVectorizer(max_features=5000,
                     stop_words='english')
X_counts = cv.fit_transform(documents)

lda = LatentDirichletAllocation(
    n_components=5, random_state=42)
lda.fit(X_counts)

# Print top words per topic
names = cv.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [names[j] for j in topic.argsort()[-10:][::-1]]  # highest weight first
    print(f"Topic {i}: {', '.join(top)}")

# Topic 0: price, quality, value, product...
# Topic 1: delivery, shipping, days, arrive...

LDA Algorithm (Simplified)

input documents, k topics

initialize random topic assignments for all words

repeat until convergence:

for each word w in each document:

P(topic | doc) × P(word | topic) → reassign

output topic-word and doc-topic distributions
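
The simplified loop above corresponds closely to collapsed Gibbs sampling, sketched here in plain Python. Assumptions to note: the function name, the hyperparameters `alpha` and `beta`, and the toy documents are all hypothetical, and the "reassign" step samples a topic in proportion to the product of the two probabilities rather than taking the single best one. scikit-learn's `LatentDirichletAllocation` uses variational inference instead, so this is not how that library fits the model.

```python
import random

def lda_gibbs(docs, k, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized docs (lists of words)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic_counts = [[0] * k for _ in docs]
    topic_word_counts = [dict.fromkeys(vocab, 0) for _ in range(k)]
    topic_totals = [0] * k

    def update(d, w, t, delta):
        doc_topic_counts[d][t] += delta
        topic_word_counts[t][w] += delta
        topic_totals[t] += delta

    # Initialize: assign every word occurrence a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(k) for _ in doc])
        for w, t in zip(doc, assignments[d]):
            update(d, w, t, +1)

    # Repeat: remove one word's assignment, then resample its topic.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assignments[d][i], -1)
                weights = [
                    (doc_topic_counts[d][t] + alpha)       # ~ P(topic | doc)
                    * (topic_word_counts[t][w] + beta)     # ~ P(word | topic)
                    / (topic_totals[t] + n_vocab * beta)
                    for t in range(k)
                ]
                new_topic = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = new_topic
                update(d, w, new_topic, +1)

    # Output: normalized doc-topic and topic-word distributions.
    doc_topic = [[(c + alpha) / (len(doc) + k * alpha) for c in doc_topic_counts[d]]
                 for d, doc in enumerate(docs)]
    topic_word = [{w: (topic_word_counts[t][w] + beta) / (topic_totals[t] + n_vocab * beta)
                   for w in vocab} for t in range(k)]
    return doc_topic, topic_word

# Hypothetical tokenized review snippets.
docs = [
    ["price", "value", "quality", "price"],
    ["delivery", "shipping", "days", "delivery"],
    ["quality", "value", "price"],
    ["shipping", "days", "arrive", "delivery"],
]
doc_topic, topic_word = lda_gibbs(docs, k=2)
for d, row in enumerate(doc_topic):
    print(d, [round(p, 2) for p in row])
```

Each row of `doc_topic` is a document's mixture over topics, and each entry of `topic_word` is a topic's distribution over the vocabulary, matching the two key assumptions listed for LDA.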

Part IV — Philippine Context

Handling Filipino Text

Before and after Filipino text preprocessing showing stopword removal

Summary

Method Comparison: Text Analytics Toolkit

Method	Input	Output	Best For	PH Limitation
Bag of Words	Raw text	Word count vectors	Simple classification, baselines	Ignores Taglish word order
TF-IDF	Raw text	Weighted term vectors	Search, document similarity	No Filipino IDF corpora available
VADER	English text	Polarity scores (-1 to +1)	Social media, informal text	Zero Filipino coverage
TextBlob	English text	Polarity + subjectivity	Quick sentiment + objectivity	English-only lexicon
ML Pipeline	Labeled data	Predicted classes	Domain-specific sentiment	Needs Filipino labeled dataset
LDA	Word counts	Topic distributions	Theme discovery, exploration	Filipino stopwords needed

Live Demo

LDA Topic Discovery

Algorithm: LDA

1. assign each word a random topic

2. repeat until stable:

for each word w:

new = argmax P(topic|doc) × P(w|topic)

reassign w → new topic

3. Output: K clusters of related words

Session 1: Key Takeaways

Preprocessing is critical — lowercase, tokenize, remove stopwords, lemmatize
TF-IDF weights important terms higher than common words
Sentiment analysis has three approaches: lexicon, ML, and pre-trained
LDA discovers hidden topics in document collections
Filipino text needs custom stopwords and Taglish handling

Next: Analytics at Scale & Ethics

Big data tools, privacy regulations, algorithmic bias, and responsible AI.

CMSC 178DA | Week 11 · Session 2

Scale, Privacy
& Fairness

When data gets big, ethics must get bigger

"With great data comes great responsibility."

The Philippine Data Explosion

86M+

GCash registered users

98M

Internet users in PH

95M

Social media accounts

Who protects this data?

Agenda

Session 2 Objectives

Big Data Tools

When pandas isn’t enough: Spark, cloud platforms, and the 5 Vs of big data.

Privacy & Compliance

GDPR, Philippine DPA (RA 10173), anonymization, and consent requirements.

Bias & Fairness

Sources of algorithmic bias, fairness metrics, mitigation strategies, and responsible AI.

Part I

When pandas
Is Not Enough

Big data demands distributed computing. Learn when and why to scale beyond a single machine.

Part I — Analytics at Scale

The 5 Vs of Big Data

Pentagon diagram showing Volume, Velocity, Variety, Veracity, Value

Part I — Analytics at Scale

When Do You Need Big Data Tools?

× You DON’T need Spark if:

Data fits in memory (<16 GB)
Processing is one-time, ad-hoc
Simple aggregations / filters
pandas + SQL handles it fine

✓ You DO need distributed tools when:

Data exceeds single machine memory
Processing must be parallelized
Real-time streaming is required
ML at scale (millions of records)

TL;DR

Most analytics tasks (<10 GB) don’t need Spark. Use the simplest tool that works.

Part I — Analytics at Scale

Apache Spark

Distributed Computing

In-memory processing (up to 100× faster than MapReduce in memory; typically 3–10× in practice)
Supports Python (PySpark), SQL, Scala, R
MLlib for machine learning at scale

When to Choose Spark

Datasets >100 GB, iterative ML algorithms, streaming data, or when a single machine can’t keep up.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Analytics") \
    .getOrCreate()

# Load CSV (distributed across cluster)
df = spark.read.csv("large_data.csv",
                    header=True,
                    inferSchema=True)

# SQL-like operations at scale
df.groupBy("region") \
  .agg({"sales": "sum"}) \
  .orderBy("sum(sales)", ascending=False) \
  .show()

# The aggregation runs across worker nodes; show() prints the first 20 rows on the driver

Part I — Analytics at Scale

Cloud Analytics Platforms

Platform	Service	Strength	Pricing Model
AWS	Redshift, Athena, EMR	Most comprehensive ecosystem	Per-query or provisioned
GCP	BigQuery	Serverless SQL at scale	Per-TB scanned
Azure	Synapse Analytics	Enterprise integration (Office 365)	DWU-based

-- BigQuery example: analyze sales by region (serverless, no cluster setup)
SELECT region, SUM(sales) as total_sales, COUNT(*) as num_transactions
FROM `project.dataset.sales_table`
WHERE date >= '2024-01-01'
GROUP BY region
ORDER BY total_sales DESC

Knowledge Check

Your dataset is 500 MB of CSV files.
Which tool should you use?

A) Apache Spark cluster

B) pandas on your laptop

C) Google BigQuery

D) Hadoop MapReduce

✓ Correct: B) pandas on your laptop

500 MB fits comfortably in memory. No need for distributed computing overhead. Use the simplest tool that works!

Part II

The Right to
Be Forgotten

Privacy regulations like GDPR and the Philippine DPA define how data can be collected, used, and stored.

Part II — Data Privacy

Data Privacy Regulation Timeline

Timeline of privacy regulations from 1995 EU Directive to 2020 CCPA

GDPR (EU, 2018)

Widely treated as the gold standard for privacy law. Right to deletion, data portability, fines up to €20 million or 4% of global annual turnover, whichever is higher.

Philippine DPA (2012)

RA 10173. Enforced by NPC. Consent, purpose limitation, breach notification within 72 hours.

CCPA (California, enacted 2018, effective 2020)

Consumer right to know, delete, and opt-out of data sales. Applies to large businesses.

Part II — Philippine Context

Philippine Data Privacy Act (RA 10173)

Consent

Freely given, specific, and informed

Purpose

Use only for stated purpose

Minimization

Collect only what’s needed

Accuracy

Keep data up to date

Retention

Delete when no longer needed

Part II — Data Privacy

Anonymization Techniques

Technique	Description	Example
Masking	Hide partial data	“Juan D.”
Generalization	Broaden categories	Age 25 → “20–30”
Suppression	Remove identifiers	Remove SSN column
Noise Addition	Add random values	Salary ± 5%
K-Anonymity	Ensure k similar records	5+ with same quasi-identifiers

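import pandas as pd

# `df` is assumed to be an existing DataFrame with 'name' and 'age' columns
df['name'] = df['name'].str[0] + '***'           # Masking
df['age_group'] = pd.cut(df['age'],               # Generalization
    bins=[0, 20, 30, 40, 50, 100],                # bins are right-closed: (20, 30] → '20-30'
    labels=['<20', '20-30', '30-40', '40-50', '50+'])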

Part II — Data Privacy

K-Anonymity

Definition

A dataset satisfies k-anonymity if every combination of quasi-identifiers (attributes that alone are harmless but together can identify someone, e.g. age, ZIP, gender) is shared by at least k records, so each person is indistinguishable from at least k − 1 others.

Why k ≥ 5?

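With k=5, an attacker who knows the quasi-identifiers can narrow a person down only to a group of at least 5 records, a 1-in-5 guess at best. This is a common rule of thumb, not a guarantee: if everyone in a group shares the same sensitive value, that value still leaks.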

K-Anonymity Check

group by quasi-identifiers (age, zip, gender)

for each group:

if count < k: fail

return all groups ≥ k → dataset is k-anonymous

Before and after k-anonymity showing generalized data

Worked Example

K-Anonymity Step by Step (k=3)

Before: Identifiable

Name	Age	ZIP	Disease
Maria Santos	28	6000	Flu
Juan Cruz	29	6001	Diabetes
Ana Reyes	27	6000	Flu
Pedro Lim	42	6045	Asthma
Rosa Garcia	44	6046	Flu
Carlo Tan	43	6045	Asthma

After: k=3 Anonymized

Name	Age	ZIP	Disease
***	25-30	600*	Flu
***	25-30	600*	Diabetes
***	25-30	600*	Flu
***	40-45	604*	Asthma
***	40-45	604*	Flu
***	40-45	604*	Asthma
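
The before and after tables can be checked mechanically. This is a minimal sketch in plain Python, assuming age and ZIP are the quasi-identifiers and that the name column has already been suppressed; the function name is hypothetical.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination appears in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

before = [
    {"age": 28, "zip": "6000", "disease": "Flu"},
    {"age": 29, "zip": "6001", "disease": "Diabetes"},
    {"age": 27, "zip": "6000", "disease": "Flu"},
    {"age": 42, "zip": "6045", "disease": "Asthma"},
    {"age": 44, "zip": "6046", "disease": "Flu"},
    {"age": 43, "zip": "6045", "disease": "Asthma"},
]

after = [
    {"age": "25-30", "zip": "600*", "disease": "Flu"},
    {"age": "25-30", "zip": "600*", "disease": "Diabetes"},
    {"age": "25-30", "zip": "600*", "disease": "Flu"},
    {"age": "40-45", "zip": "604*", "disease": "Asthma"},
    {"age": "40-45", "zip": "604*", "disease": "Flu"},
    {"age": "40-45", "zip": "604*", "disease": "Asthma"},
]

print(is_k_anonymous(before, ["age", "zip"], k=3))  # False: every (age, zip) pair is unique
print(is_k_anonymous(after, ["age", "zip"], k=3))   # True: two groups of three records
```

Generalizing age into ranges and truncating ZIP codes is what merges six unique rows into two groups of three.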

Part III

When Algorithms
Discriminate

Bias in data becomes bias in decisions. Understanding sources and metrics is the first step toward fairness.

Part III — Algorithmic Bias

Four Sources of Algorithmic Bias

Four-quadrant diagram of bias sources: Historical, Representation, Measurement, Aggregation

Part III — Case Studies

Real-World Bias Failures

COMPAS (Criminal Justice)

Predicted recidivism risk for sentencing. ProPublica found it produced higher false positive rates for Black defendants than white defendants. Used in real sentencing decisions.

Amazon Hiring Tool (HR)

Trained on 10 years of (mostly male) hiring data. The model learned to penalize resumes containing the word “women’s”. Amazon scrapped the entire program.

Part III — Algorithmic Bias

Detecting Bias

Confusion Matrix

A 2x2 table comparing predicted vs. actual labels. Four cells: TP (correctly flagged), FP (wrongly flagged), TN (correctly cleared), FN (wrongly missed).

Check Metrics by Group

Split predictions by demographic and compare error rates. Significant differences indicate disparate impact (when a model's error rates differ substantially across protected groups).

What to Look For

Unequal false positive rates (FPR)
Unequal false negative rates (FNR)
Different accuracy across groups

Bias Detection Algorithm

for each group g in demographics:

Compute confusion matrix for g

$\text{FPR}_g = \frac{FP}{FP+TN}$

$\text{FNR}_g = \frac{FN}{FN+TP}$

if max(FPR) / min(FPR) > 1.25:

flag disparate impact

from sklearn.metrics import confusion_matrix

# Check metrics by demographic group
for group in df['demographic'].unique():
    subset = df[df['demographic'] == group]
    cm = confusion_matrix(
        subset['actual'],
        subset['predicted'],
        labels=[0, 1]  # keep a 2x2 matrix even if a group lacks one class
    )

    # False positive rate
    fpr = cm[0, 1] / (cm[0, 0] + cm[0, 1])
    # False negative rate
    fnr = cm[1, 0] / (cm[1, 0] + cm[1, 1])

    print(f"{group}: FPR={fpr:.3f}, FNR={fnr:.3f}")

# If FPR differs significantly across groups
# → your model has disparate impact

Worked Example

FPR by Gender: Is This Fair?

A GCash credit model flags applicants as "high risk." Let's compute FPR by gender.

Male Applicants

FP = 3, TN = 47 → $\text{FPR}_{\text{male}} = \frac{3}{3+47} = 0.06$ (6%)

Female Applicants

FP = 8, TN = 42 → $\text{FPR}_{\text{female}} = \frac{8}{8+42} = 0.16$ (16%)

× Disparate Impact Detected

Female FPR is 2.7x higher

Women are incorrectly flagged as "high risk" almost 3 times more often than men.

What This Means

16% of creditworthy women are wrongly denied vs. only 6% of men. Same model, same threshold, very different outcomes.

The Fix?

Adjust thresholds per group (post-processing), rebalance training data (pre-processing), or add fairness constraints (in-processing).
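
The arithmetic in this worked example fits in a few lines of Python. The counts come from the slide; the 1.25 ratio cut-off is the heuristic from the bias detection algorithm, not a legal standard, and the function name is hypothetical.

```python
def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN): share of true negatives that were wrongly flagged."""
    return fp / (fp + tn)

fpr_male = false_positive_rate(fp=3, tn=47)
fpr_female = false_positive_rate(fp=8, tn=42)

ratio = fpr_female / fpr_male
flagged = ratio > 1.25   # heuristic threshold for disparate impact

print(f"male FPR={fpr_male:.2f}, female FPR={fpr_female:.2f}, ratio={ratio:.2f}")
# male FPR=0.06, female FPR=0.16, ratio=2.67
print("disparate impact" if flagged else "within threshold")
```

The ratio of 2.67 is where the "2.7x higher" figure comes from.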

Part III — Why Fairness?

"Fair" Means Different Things

There's No Single Definition

"Treat people equally" sounds simple. But equally how? Equal treatment? Equal outcomes? Equal error rates? Each is a different mathematical formula — and they're incompatible.

Kleinberg's Impossibility (2017)

Mathematically proven: you cannot satisfy all common fairness metrics simultaneously, unless the model predicts perfectly or the groups have equal base rates (both rare in practice).

The real question: Which fairness definition matters most for your problem? Lending? Medical diagnosis? Criminal sentencing? Each requires a different choice.

Part III — Algorithmic Bias

Fairness Metrics

Metric	Definition	When to Use	Example
Demographic Parity	Equal positive prediction rates across groups	Hiring, lending	Same loan approval rate for all demographics
Equalized Odds	Equal TPR and FPR across groups	Criminal justice	Same error rates regardless of race
Calibration	Predicted scores mean the same thing in every group	Medical diagnosis	A predicted 70% risk is correct about 70% of the time in every group
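
The first two rows of the table can be computed from labels and predictions alone. Below is a minimal sketch with hypothetical data for two groups, A and B; calibration is left out because it needs predicted probabilities rather than hard labels.

```python
def group_rates(y_true, y_pred):
    """Positive prediction rate, TPR, and FPR for one group's binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "positive_rate": (tp + fp) / len(pairs),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }

# Hypothetical audit data (1 = positive class).
a = group_rates(y_true=[1, 1, 0, 0, 1, 0, 0, 1], y_pred=[1, 1, 0, 0, 1, 1, 0, 0])
b = group_rates(y_true=[1, 0, 0, 1, 0, 0, 1, 0], y_pred=[1, 1, 1, 1, 0, 0, 0, 1])

# Demographic parity: compare positive prediction rates.
parity_gap = abs(a["positive_rate"] - b["positive_rate"])

# Equalized odds: compare both TPR and FPR.
tpr_gap = abs(a["tpr"] - b["tpr"])
fpr_gap = abs(a["fpr"] - b["fpr"])

print(f"parity gap={parity_gap:.3f}, TPR gap={tpr_gap:.3f}, FPR gap={fpr_gap:.3f}")
```

In this toy data the parity gap is modest while the FPR gap is large, which illustrates how a model can look acceptable under one definition and fail another.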

Part III — Algorithmic Bias

Disparate Error Rates Across Groups

Grouped bar chart showing FPR and FNR across demographic groups

Activity + Discussion

Bias in Practice

Activity 2: Ethical Dilemma (4 min, pairs)

Your team builds a GCash credit scoring model. After auditing, you discover the false positive rate for female applicants is 2x higher than for males — women are incorrectly denied credit twice as often.

What stage of mitigation would you apply (pre-, in-, or post-processing)?
What specific technique would you use?
Who should be notified about this finding?

Discussion: Facial Recognition in PH Malls (4 min)

Several Philippine malls have deployed facial recognition for security. Commercial systems have higher error rates for darker skin tones and women.

Should facial recognition be allowed in PH malls? Under what conditions?
Who benefits? Who is harmed?
What safeguards would you require under RA 10173?

Live Demo

Bias Detector Simulator

Algorithm: Bias Detection

1. split data by protected group

2. for each group:

FPR = FP / (FP + TN)

FNR = FN / (FN + TP)

3. if |FPR_a − FPR_b| > threshold

→ DISPARATE IMPACT ⚠

Part III — Algorithmic Bias

Mitigating Bias: Three Stages

Pre-processing

Fix the data before training. Rebalance datasets, remove proxy features, use synthetic oversampling (SMOTE — generates synthetic minority-class samples by interpolating between existing ones).

In-processing

Fix the algorithm. Add fairness constraints to the loss function, use adversarial debiasing (train an adversary to detect demographic info from predictions — penalize the model if it leaks), or fair representation learning.

Post-processing

Fix the output. Adjust decision thresholds per group, calibrate probabilities, or audit and correct predictions.

Part IV

Building AI That
Serves Everyone

Explainability, accountability, and ethics must be designed in — not bolted on after deployment.

Part IV — Why Explainability?

The Black Box Problem

The Issue

Deep models & ensembles can have millions of parameters. Even the people who built them can't easily explain a single prediction.

High-Stakes Settings

"Why was my loan denied?"
"Why did this AI recommend prison?"
"Why was my resume rejected?"

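GDPR Article 22 restricts decisions based solely on automated processing, and Articles 13–15 give users a right to "meaningful information about the logic involved". The PH DPA (Sec. 16) grants data subjects similar rights.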

Part IV — SHAP Derivation

SHAP: Fair Credit from Game Theory

Origin: Shapley (1953)

Cooperative game theory: how to fairly divide rewards among teammates whose individual contributions are entangled.

The Formula

$\phi_i = \displaystyle\sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \bigl[f(S \cup \{i\}) - f(S)\bigr]$

$\phi_i$ = SHAP value for feature $i$; $F$ = all features; $S$ = subset without $i$

Average the marginal contribution of $i$ across all possible orderings in which it could be added.

Three guarantees: efficiency (the values sum to the prediction minus the baseline, i.e. the average prediction), symmetry (interchangeable features get equal credit), null (zero-impact features get zero).
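To make the formula concrete, here is a brute-force sketch that enumerates every subset $S$ and applies the weight $|S|!\,(|F|-|S|-1)!/|F|!$ directly. The toy value function and its feature names are hypothetical; real SHAP libraries use approximations because this enumeration grows exponentially with the number of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values; `value` maps a frozenset of features to a payoff f(S)."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        # Enumerate every subset S of F \ {i}.
        for size in range(n):
            for subset in combinations(others, size):
                s = frozenset(subset)
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                # Marginal contribution of i when added to S.
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi

def toy_value(subset):
    """Hypothetical payoff: two useful features with an interaction, one useless feature."""
    v = 0.0
    if "income" in subset:
        v += 10.0
    if "age" in subset:
        v += 4.0
    if "income" in subset and "age" in subset:
        v += 2.0  # interaction bonus, shared between the two features
    return v  # "zip" never changes the payoff

phi = shapley_values(["income", "age", "zip"], toy_value)
```

In this toy game the interaction bonus of 2 is split evenly, so income gets 11, age gets 5, and zip gets 0; the three values add up to the full payoff of 16, which is the efficiency guarantee in action.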

Part IV — LIME Derivation

LIME: Local Linear Approximation

Core Idea (Ribeiro et al., 2016)

The complex model $f$ is non-linear globally — but in a small neighborhood around any point $x$, you can fit a simple linear model $g$ that explains it.

The Optimization

$\xi(x) = \arg\min_{g \in G} \mathcal{L}\bigl(f, g, \pi_x\bigr) + \Omega(g)$

$g$ = simple model (linear); $\pi_x$ = neighborhood weights (closer = more weight); $\Omega$ = simplicity penalty

Trade-off: exact SHAP is mathematically rigorous but slow, since the sum runs over every feature subset (exponential in $|F|$). LIME is fast, but its approximation is valid only locally, one point at a time.
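The optimization can be sketched with NumPy alone: sample points near $x$, weight them with a kernel $\pi_x$, and fit a weighted linear model $g$. This is a simplified illustration, not the `lime` package: the Gaussian sampling, the kernel width, the small ridge term standing in for $\Omega(g)$, and the toy black-box function are all assumptions.

```python
import numpy as np

def lime_explain(f, x, n_samples=500, scale=0.3, kernel_width=0.75, ridge=1e-3, seed=0):
    """Fit a weighted linear surrogate g around x; return (intercept, coefficients)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Sample perturbed points in a neighborhood of x.
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    y = f(Z)
    # pi_x: exponential kernel, so closer samples get more weight.
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # Design matrix: intercept column plus offsets from x.
    X = np.hstack([np.ones((n_samples, 1)), Z - x])
    Xw = X * weights[:, None]
    # Small ridge penalty on the slopes only (stand-in for the simplicity term).
    penalty = ridge * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    beta = np.linalg.solve(X.T @ Xw + penalty, Xw.T @ y)
    return beta[0], beta[1:]

def black_box(Z):
    """Hypothetical non-linear model: sin(x0) + x1^2."""
    return np.sin(Z[:, 0]) + Z[:, 1] ** 2

intercept, coefs = lime_explain(black_box, [0.0, 1.0])
```

At the point $(0, 1)$ the true gradient of this toy function is $(1, 2)$, so the fitted coefficients should land close to those values. Explaining a different point requires a fresh fit, which is the "local only" limitation in practice.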

Part IV — Responsible AI

Explainability: SHAP & LIME

Why Explainability?

GDPR right to “meaningful information about the logic involved” (Art. 13–15)
Build trust with stakeholders
Debug model errors and biases
Regulatory compliance

Key Tools

SHAP: Shapley values — fair attribution of each feature’s contribution
LIME: Local explanations via interpretable surrogate models

Figure: SHAP waterfall plot showing each feature's contribution to the prediction.

Part IV — Why FATE(S)?

From FAT to FATE(S): An Evolving Framework

2018: FAT

ACM launches the FAT* conference (later renamed FAccT). Three pillars: Fairness, Accountability, Transparency. A reaction to ProPublica's COMPAS exposé and other public failures.

2018: FATE (+ Ethics)

Microsoft Research adds Ethics — recognizing that legal compliance ≠ moral correctness. The right question: should we even build this?

2020+: FATE(S) (+ Safety)

Columbia DSI adds Safety — preventing harm to users and bystanders, especially as models get deployed in safety-critical settings (healthcare, autonomous vehicles).

Important: these pillars often conflict. Maximizing one can hurt another. Next slide shows the tensions.

Part IV — Real Conflicts

FATE(S) is Not a Checklist — Pillars Conflict

Maximizing one pillar can hurt another. Real engineering means choosing trade-offs, not satisfying all five.

Fairness ⟷ Accuracy

Example: A loan model has 92% accuracy overall — but enforcing demographic parity drops it to 87%.

Some accuracy must be "spent" on fairness. (Kleinberg et al., 2017)

Transparency ⟷ Privacy

Example: Showing how the model uses your medical history is transparent — but reveals sensitive personal data.

Full transparency can leak protected information.

Safety ⟷ Helpfulness

Example: A chatbot that refuses medical questions is "safe" but useless for the user actually asking.

Over-restriction = unhelpful. Under-restriction = harm.

Accountability ⟷ Speed

Example: Auditing every prediction creates a paper trail (good!) — but slows deployment from minutes to weeks.

Bureaucracy is the cost of accountability.

Ethics ⟷ Profit

Example: "Should we train on user data we collected?" — legal under TOS but ethically dubious without clear consent.

Just because it's legal doesn't mean it's right.

The Right Mindset

FATE(S) is a compass, not a checklist. Engineers make explicit trade-offs and document the reasoning.

Document what you sacrificed and why.

Part IV — Responsible AI

The FATE(S) Framework

Fairness

Equal treatment across demographic groups

Accountability

Clear ownership and responsibility for outcomes

Transparency

Explainable decisions and open processes

Ethics

Consider societal impact and human values

Safety

Prevent harm to users and communities

Part IV — Philippine Context

Ethics Challenges in the Philippines

GCash Credit Scoring

How do you score creditworthiness for informal economy workers with no traditional credit history? What biases might emerge?

DOH Disease Prediction

Rural areas have less data, worse connectivity. Models trained on urban data may fail in provinces where they’re needed most.

Facial Recognition

Commercial facial recognition has higher error rates for darker skin tones and for women, yet it is deployed in Philippine malls and airports.

Social Media Monitoring

With 95M accounts, social media surveillance raises privacy concerns. Where is the line between public safety and privacy?

Session 2: Key Takeaways

Big data tools are needed only when scale demands it — don’t over-engineer
Philippine DPA (RA 10173) governs data privacy; know its five principles
Algorithmic bias enters through data, features, and modeling choices
Fairness metrics help quantify bias — but you can’t satisfy all of them
Responsible AI (FATE(S)) requires continuous attention, not one-time audits

Lab 11: Bias Audit Project

Audit a model for bias, calculate fairness metrics, and propose mitigation strategies.

Appendix A — Optional

Sentiment Math Deep-Dive

VADER and TextBlob both look like simple lexicon lookups, but their published algorithms specify exact constants. These three slides show the formulas behind the high-level rules.

Appendix A — VADER Math

The Compound Score Formula

Step 1 — Sum modified valences

$x = \displaystyle\sum_i v_i'$

$v_i'$ = lexicon valence after CAPS, punctuation, booster, and negation modifiers.

Step 2 — Squash to [−1, +1]

$\text{compound} = \dfrac{x}{\sqrt{x^2 + \alpha}}, \quad \alpha = 15$

Softsign-style squashing. As $|x| \to \infty$, compound → ±1. At $x=0$, slope is $1/\sqrt{15} \approx 0.258$.

Why $\alpha = 15$? Empirically chosen so that typical sentence-level sums saturate near ±1. Larger $\alpha$ = gentler curve, smaller $\alpha$ = more aggressive saturation.
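A minimal sketch of Step 2, assuming only the formula above; the function name is invented for illustration and the inputs are arbitrary sums, not outputs of the real lexicon.

```python
import math

def vader_compound(x, alpha=15.0):
    """Squash a summed valence x into (-1, +1) using x / sqrt(x^2 + alpha)."""
    return x / math.sqrt(x * x + alpha)

neutral = vader_compound(0.0)   # no sentiment-bearing words
mild = vader_compound(1.0)      # 1 / sqrt(16)
strong = vader_compound(7.0)    # 7 / sqrt(64)
negative = vader_compound(-1.0) # mirror image of `mild`
```

A sum of 1 maps to 0.25 and a sum of 7 maps to 0.875, which shows how quickly the curve flattens once a sentence contains several strongly valenced words.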

Appendix A — VADER Math

The Magic Numbers Behind "Apply Modifiers"

① Punctuation

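$x \leftarrow x + \text{sign}(x) \cdot \min(n_!, 4) \cdot 0.292$

Each `!` (up to 4) adds 0.292 in the direction of the existing sentiment. Question marks count only in runs of two or more: 0.18 each for two or three, a flat 0.96 beyond that. Unlike the modifiers below, the reference implementation applies this boost once to the summed valence $x$, not to each token.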

② ALL CAPS

$v_i \leftarrow v_i + \text{sign}(v_i) \cdot 0.733$

Only triggers when sentence is mixed case. "This is GREAT" > "this is great"; "THIS IS GREAT" ≈ "this is great".

③ Degree Adverbs (boosters)

$v_i \leftarrow v_i + \text{sign}(v_i) \cdot B(t_{i-k}) \cdot s_k$

$B$ = ±0.293 (very/barely). Distance damping $s_1{=}1.0,\ s_2{=}0.95,\ s_3{=}0.9$ over the previous 3 tokens.

④ Negation

$v_i \leftarrow v_i \cdot (-0.74)$

Applies when not / never / n't appears among the 3 preceding tokens. The scalar is −0.74 rather than −1, so "not great" turns negative but weaker than a full flip.

⑤ Contrastive "but"

$v_i \leftarrow \begin{cases} 0.5 \cdot v_i & i < \text{idx(but)} \\ 1.5 \cdot v_i & i > \text{idx(but)} \end{cases}$

Pre-"but" valences shrink, post-"but" valences amplify — matches the human reading that the second clause carries the speaker's true position. "Food was great BUT service was terrible" → net negative.

Source: Hutto & Gilbert (2014), "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text", ICWSM. Constants are tuned on a 5,000-tweet labeled dataset.
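The token-level rules for ALL CAPS, boosters, negation, and "but" can be sketched as small functions using the constants from this slide. This is a simplified illustration, not the `vaderSentiment` source: the function names are invented, and the valences 3.1 and −2.1 are assumed stand-ins for lexicon entries.

```python
import math

CAPS_BOOST = 0.733
BOOSTER = 0.293
DAMPING = {1: 1.0, 2: 0.95, 3: 0.9}  # keyed by distance (in tokens) to the adverb
NEGATION = -0.74

def sign(v):
    """Return +1.0, -1.0, or 0.0 for a neutral token."""
    return math.copysign(1.0, v) if v else 0.0

def emphasize_caps(v):
    """ALL CAPS word in a mixed-case sentence: push the valence away from zero."""
    return v + sign(v) * CAPS_BOOST

def boost_adverb(v, direction, distance):
    """direction=+1 for an intensifier ('very'), -1 for a dampener ('barely')."""
    return v + sign(v) * direction * BOOSTER * DAMPING[distance]

def negate(v):
    """Negation within the 3 preceding tokens: flip and shrink."""
    return v * NEGATION

def apply_but(valences, but_index):
    """Halve valences before 'but', amplify those after it by 1.5."""
    return [
        0.5 * v if i < but_index else 1.5 * v if i > but_index else v
        for i, v in enumerate(valences)
    ]

# "Food was great BUT service was terrible" with assumed valences for
# "great" (3.1) and "terrible" (-2.1); every other token is neutral.
valences = [0.0, 0.0, 3.1, 0.0, 0.0, 0.0, -2.1]
adjusted = apply_but(valences, but_index=3)
total = sum(adjusted)
compound = total / math.sqrt(total * total + 15)
```

With these assumed valences the sentence starts out net positive (3.1 − 2.1), and the "but" rule alone is enough to turn the total negative.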

Appendix A — TextBlob Math

Corrected Worked Example

What Slide 18a glossed over

The earlier slide showed 0.8 × 1.3 = 1.04 → clipped to 1.0. In real Pattern code, clipping happens after the weighted average — not in the numerator.

Polarity (faithful)

$P = \text{clip}\!\left(\dfrac{\sum_j p_j m_j}{\sum_j m_j},\ -1,\ +1\right)$

For the example: $\dfrac{0.8 \cdot 1.3}{1.3} = 0.8$. The 1.3× modifier cancels in numerator and denominator when there's only one assessment.

Subjectivity (unweighted!)

$S = \dfrac{1}{N}\displaystyle\sum_j s_j$

Plain average over $N$ scored adjectives — no modifier weighting.
