CMSC 178DA | Week 11 · Session 1

Mining Meaning
from Text

From raw words to actionable insights

Department of Computer Science

University of the Philippines Cebu

"80% of enterprise data is unstructured — and much of it is text."

"The limits of my language mean the limits of my world."

— Ludwig Wittgenstein, 1921

Today: how to teach machines to understand text — and how to use that power responsibly.

Case Study

When NLP Goes Wrong

Amazon's AI Recruiting Tool (2018)

Reuters exclusive report, October 2018

Amazon trained an NLP model on 10 years of resumes to automate hiring. The model learned to penalize resumes containing the word "women's" (e.g., "women's chess club captain") and to downgrade graduates of two all-women's colleges.

Why? The training data was 10 years of mostly male hires. The model learned that "male" patterns predicted success. Amazon scrapped the entire program.

Text analytics is powerful — but the data you train on encodes the biases of the world that produced it.

Agenda

Session 1 Objectives

Text Preprocessing

Tokenize, clean, and normalize raw text into analysis-ready tokens.

Vectorization

Convert words into numbers using Bag of Words and TF-IDF representations.

Sentiment & Topics

Classify polarity with sentiment analysis and discover themes with LDA topic modeling.

Running Example

Meet Our Data: PH Social Media

Throughout this session, we will preprocess, vectorize, and analyze these posts. They represent real patterns in Philippine social media data.

The Challenge

Code-switching (Taglish), informal spelling, emojis, and sarcasm make PH social media data uniquely difficult for NLP tools built for English.

# | Platform | Post | Sentiment
1 | Twitter | "Grabe ang bilis ng GCash today! Love it" | Positive
2 | Facebook | "Ang bagal ng internet dito sa probinsya" | Negative
3 | Twitter | "Just tried the new Jollibee menu, it's okay naman" | Neutral
4 | Facebook | "Sobrang init ngayon, parang oven ang Cebu" | Negative
5 | Twitter | "Congrats sa mga bagong graduates! Proud kami!" | Positive
6 | Facebook | "The new MRT extension is a game changer" | Positive
7 | Twitter | "Nag-update na ba kayo ng PhilSys ID? Hassle amp" | Negative
8 | Facebook | "May pasok ba bukas? Walang announcement eh" | Neutral

Part I

Turning Noise
Into Signal

Natural language is messy. Preprocessing cleans, normalizes, and tokenizes text before any model can learn.

Part I — Text Preprocessing

Why Text Analytics?

Unstructured text data is everywhere — and growing faster than any other data type.

Text Data Sources

  • Customer reviews & feedback
  • Social media posts & comments
  • Support tickets & emails
  • News articles & reports
  • Survey open-ended responses
Bar chart showing 80% of enterprise data is unstructured text
Part I — Brief History

70 Years of Text Analytics

  • 1950s · Symbolic era: hand-coded grammars; ELIZA (1966), SHRDLU (1970)
  • 1970s · Expert systems: lexicons + parsers; WordNet (1985). Hand-rules don't scale.
  • 1990s · Statistical NLP: n-grams, HMMs; IBM MT, Penn Treebank. Data > rules.
  • 2003 · Classical ML: TF-IDF, LDA, SVMs; VADER (2014), TextBlob. ★ Today's lecture lives here.
  • 2013 · Embeddings: word2vec, GloVe; RNNs, LSTMs, seq2seq. "king − man + woman ≈ queen"
  • 2017 · Transformers: BERT, GPT, T5; "Attention is All You Need", Vaswani et al. (2017)
  • 2026 · LLM era (now): GPT-4, Claude 4, Gemini; ChatGPT (Nov 2022); multimodal, agentic, RAG
Part I — Brief History

Three Paradigms, Side by Side

Paradigm | Era | Core Idea | Strengths | Weaknesses
Symbolic / Rule-based (ELIZA, hand-rules) | 1950s–80s | Encode language as explicit grammars and dictionaries | Transparent, debuggable, no training data needed | Brittle; doesn't generalize; rules explode in complexity
Statistical / Classical ML (BoW, TF-IDF, VADER, LDA, SVM) | 1990s–2010s | Count word frequencies, learn weights from labeled corpora | Fast, cheap, interpretable; works on small data | Loses word order & meaning; lexicons are language-bound (e.g. no Filipino)
Neural / Transformers (BERT, GPT, Claude, Llama) | 2017–today | Learn contextual representations from billions of tokens via self-attention | State-of-the-art on most benchmarks; multilingual; multimodal | Expensive; opaque; hallucinates; data & compute hungry

The "embarrassingly effective" baseline

A 1990s-style TF-IDF + logistic regression baseline is often competitive with a fine-tuned BERT on short, domain-specific text classification, at a small fraction of the compute cost. Always benchmark the simple thing first.

Today's stack is hybrid

Modern RAG systems often use TF-IDF/BM25 (classical-era retrieval) to fetch documents, then an LLM (the current era) to generate the answer. Old methods aren't dead; they're load-bearing for the new ones.

Part I — Text Preprocessing

The Preprocessing Pipeline

Raw text needs cleaning before any algorithm can use it. Each step transforms the data into a more useful form.

Pipeline (4 Steps)
1. lowercase: text → text.lower()
2. tokenize: split into words
3. remove stopwords (the, ang, ng, ...)
4. lemmatize: word → base form
Order matters: lowercase before tokenize; stopwords before lemmatize.
Raw text: "The dogs are Running FAST!"
  → .lower() → Lowercase: "the dogs are running fast!"
  → word_tokenize() → Tokenize: ["the", "dogs", "are", "running", "fast"]
  → remove stopwords (drop "the", "are") → ["dogs", "running", "fast"]
  → lemmatize() → Lemmatize: ["dog", "run", "fast"]
Clean tokens ready for vectorization: 5 raw words → 3 meaningful tokens (40% reduction)
Part I — Text Preprocessing

Preprocessing in Python

NLTK Library

The Natural Language Toolkit provides tokenizers, stemmers, lemmatizers, and stopword lists for 20+ languages.

Key Functions

  • word_tokenize() — split into words
  • stopwords.words() — common words list
  • WordNetLemmatizer() — dictionary lookup

Caveat: POS Tagging

WordNet defaults to noun POS — “running” stays as-is. Pass pos='v' for verb lemmatization to get “run”.

In [1]:
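import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time setup (assumed): nltk.download('punkt'), nltk.download('stopwords'),
# nltk.download('wordnet'); newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.

def preprocess(text):
    text = text.lower()                       # 1. lowercase
    tokens = word_tokenize(text)              # 2. tokenize
    stops = set(stopwords.words('english'))
    # 3. drop stopwords and non-alphabetic tokens (punctuation, numbers)
    tokens = [t for t in tokens if t not in stops and t.isalpha()]
    lemmatizer = WordNetLemmatizer()
    # 4. lemmatize (default POS is noun, so "running" is left unchanged)
    return [lemmatizer.lemmatize(t) for t in tokens]

preprocess("The dogs are running FAST!")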
Out[1]:

['dog', 'running', 'fast']

Trace: "The dogs are running FAST!" → lowercase → ["the","dogs","are","running","fast"] → drop "the","are" → lemmatize → 3 tokens
Worked Example

Preprocessing Step by Step

Watch a real Taglish tweet get transformed through each pipeline stage.

Step | Operation | Result
Raw | - | Grabe ang ganda ng new GCash update!! 💯🔥 #fintech
1. Lowercase | .lower() | grabe ang ganda ng new gcash update!! 💯🔥 #fintech
2. Tokenize | word_tokenize() | ['grabe', 'ang', 'ganda', 'ng', 'new', 'gcash', 'update', '!', '!', '💯', '🔥', '#', 'fintech']
3. Remove stopwords + non-alpha | isalpha() + stopword filter | ['grabe', 'ganda', 'new', 'gcash', 'update', 'fintech']
4. Lemmatize | WordNetLemmatizer() | ['grabe', 'ganda', 'new', 'gcash', 'update', 'fintech']

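Steps 1 to 3 of this walkthrough can be approximated without NLTK. The sketch below is a minimal illustration, not the course's reference pipeline: it assumes a regex tokenizer that keeps only runs of ASCII letters (so punctuation, emoji, and `#` are dropped at tokenization time), and `TAGLISH_STOPWORDS` is a hypothetical five-word list standing in for a real Filipino + English stopword list.

```python
import re

# Hypothetical mini stopword list mixing Filipino and English function words
TAGLISH_STOPWORDS = {"ang", "ng", "sa", "the", "is"}

def preprocess_taglish(text):
    lowered = text.lower()                   # 1. lowercase
    tokens = re.findall(r"[a-z]+", lowered)  # 2. tokenize: letter runs only
    # 3. remove stopwords (non-alphabetic symbols never became tokens)
    return [t for t in tokens if t not in TAGLISH_STOPWORDS]

tweet = "Grabe ang ganda ng new GCash update!! 💯🔥 #fintech"
clean_tokens = preprocess_taglish(tweet)
print(clean_tokens)
# ['grabe', 'ganda', 'new', 'gcash', 'update', 'fintech']
```

The regex is cruder than `word_tokenize()` (for example, it splits "it's" into "it" and "s"), so treat it as a quick baseline for informal text rather than a drop-in replacement.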
Live Demo

Text Preprocessing Simulator

Algorithm: Text Preprocessing
1. lowercase: x = x.lower()
2. tokenize: split on whitespace + punctuation
3. remove stopwords (ang, ng, sa, the, is…)
4. lemmatize: dictionary lookup → base form
Part I — Text Preprocessing

Stemming vs. Lemmatization

Comparison of stemmer vs lemmatizer outputs for various words
Part II

Making Words
Countable

Machines need numbers, not words. Bag of Words and TF-IDF convert text into vector space.

Part II — Why BoW?

From Words to Numbers

The Core Problem

Algorithms work on numbers, not text. We need a way to convert "The food was good" into a vector a computer can crunch.

The Idea

Treat each document as an unordered "bag" of words. Build a vocabulary, then count occurrences.

Vector Representation
document d = ($\text{count}(w_1, d), \text{count}(w_2, d), \dots, \text{count}(w_V, d)$)
where $V$ = vocabulary size
d ∈ $\mathbb{R}^V$   (a $V$-dimensional vector)
Step-by-Step: Text → Vector

① Raw documents
d₁: "The food was good"
d₂: "The service was bad"
d₃: "Good food but bad service"

② Vocabulary V (sorted, unique; built from all documents)
["bad", "but", "food", "good", "service", "the", "was"], |V| = 7

③ Document-term matrix (count occurrences)
   | bad | but | food | good | service | the | was
d₁ | 0 | 0 | 1 | 1 | 0 | 1 | 1
d₂ | 1 | 0 | 0 | 0 | 1 | 1 | 1
d₃ | 1 | 1 | 1 | 1 | 1 | 0 | 0

Each row is a vector in $\mathbb{R}^7$ that ML algorithms can use.
⚠ "the" appears in 2/3 docs and is uninformative. Next slide: TF-IDF fixes this.
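The three steps above can be reproduced with the standard library alone. This is a minimal sketch, assuming lowercasing plus whitespace tokenization; the names `tokenized`, `vocabulary`, and `matrix` are illustrative.

```python
from collections import Counter

corpus = [
    "The food was good",
    "The service was bad",
    "Good food but bad service",
]

# Step 1: lowercase and split on whitespace (a simplifying assumption)
tokenized = [doc.lower().split() for doc in corpus]

# Step 2: sorted, unique vocabulary across all documents
vocabulary = sorted({token for doc in tokenized for token in doc})

# Step 3: one row of counts per document, one column per vocabulary term
matrix = []
for doc in tokenized:
    counts = Counter(doc)  # missing terms count as 0
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each printed row should match the corresponding row of the document-term matrix.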
Part II — Text Representation

Bag of Words in Practice

The simplest text representation: count how many times each word appears.

Limitations

  • Loses word order entirely
  • Common words dominate the counts
  • High-dimensional, sparse matrices
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The food was good",
    "The service was bad",
    "Good food but bad service",
]

# Lowercases, tokenizes, builds the vocabulary, and counts terms
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['bad' 'but' 'food' 'good' 'service' 'the' 'was']

print(X.toarray())
# [[0 0 1 1 0 1 1]
#  [1 0 0 0 1 1 1]
#  [1 1 1 1 1 0 0]]
Part II — Why TF-IDF?

The Problem with Just Counting Words

The Issue

In Bag of Words, "the" appears 50 times in a document — looks like the most "important" word. But it appears in every document. Useless signal.

The Insight

A word is truly important when it's:

  • Frequent in this document (the topic)
  • Rare across all documents (distinguishing)
Goal: Build a score that REWARDS frequent-in-doc and PENALIZES common-everywhere.

Example: "the" vs "data" vs "regression" in 1 doc, 100 docs total
❌ Bag of Words (raw counts in document): "the" = 50, "data" = 5, "regression" = 3. Says "the" is most important; TF-IDF corrects this.
✓ Ideal weighting (important = frequent + rare): "the" = low, "data" = HIGH, "regression" = HIGH. Says topic words matter most.

The goal: a score that penalizes common words
  • If "the" appears in 100/100 docs → it carries no info → DOWN-WEIGHT it
  • If "regression" appears in 5/100 docs → it's distinguishing → UP-WEIGHT it
TF-IDF = "frequent here" × "rare overall". Two signals multiplied: penalizes the common, rewards the rare.
Part II — Building TF-IDF

TF-IDF: The Formula, Step by Step

① Term Frequency (TF)

"How frequent is term t in document d?" Normalize by doc length so long docs don't dominate.

$\text{TF}(t,d) = \dfrac{\text{count}(t \in d)}{|d|}$

② Inverse Document Frequency (IDF)

"How rare is t across all N documents?" Use log so it grows slowly with N.

$\text{IDF}(t) = \log\dfrac{N}{df_t}$

③ Multiply Them

High TF × High IDF = important word for this doc.

$\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t)$
3 words × 3 metrics → final score (N = 100 docs, log base 10)

Term | ① TF in this doc | ② IDF across N = 100 docs | ③ TF × IDF (final weight)
"the" | 0.50 (high) | log(100/100) = 0 | 0.00 (canceled!)
"data" | 0.30 | log(100/20) = 0.70 | 0.21
"regression" | 0.20 | log(100/5) = 1.30 | 0.26 (HIGHEST) ✓

💡 The magic of multiplication
"the": high TF (0.5) × zero IDF = zero. Canceled.
"regression": moderate TF (0.2) × high IDF (1.3) = highest score.
Words common everywhere → IDF = 0 → eliminated. Words rare overall but used here → big TF-IDF score.
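The three scores in this example can be checked in a few lines. This is a minimal sketch, assuming a base-10 logarithm (which is what the IDF values 0.7 and 1.30 imply); the `terms` dictionary simply restates the TF and document-frequency numbers from the example.

```python
import math

N = 100  # total documents in the corpus

# term -> (TF in this document, number of documents containing the term)
terms = {"the": (0.50, 100), "data": (0.30, 20), "regression": (0.20, 5)}

def tf_idf(tf, df, n_docs):
    # TF-IDF = TF × log10(N / df)
    return tf * math.log10(n_docs / df)

scores = {term: tf_idf(tf, df, N) for term, (tf, df) in terms.items()}
for term, score in scores.items():
    print(f"{term}: {score:.2f}")
# the: 0.00
# data: 0.21
# regression: 0.26
```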
Part II — Text Representation

TF-IDF in Python

Key Parameters

  • max_features — limit vocabulary size
  • min_df — ignore very rare terms
  • max_df — ignore very common terms
  • ngram_range — include bigrams
TL;DR

TF-IDF = TF × log(N/df). Words that are frequent in a document but rare across the corpus get the highest score.

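from sklearn.feature_extraction.text import TfidfVectorizer

# `documents` is assumed to be a list of strings (one per document), defined elsewhere
tfidf = TfidfVectorizer(
    max_features=1000,   # keep the 1,000 most frequent terms
    min_df=5,            # ignore terms in fewer than 5 documents
    max_df=0.95,         # ignore terms in more than 95% of documents
    ngram_range=(1, 2),  # unigrams + bigrams
)
X_tfidf = tfidf.fit_transform(documents)

# Most important terms per document (top 5, in ascending order of score)
feature_names = tfidf.get_feature_names_out()
for i, doc in enumerate(X_tfidf):
    top_indices = doc.toarray().argsort()[0][-5:]
    top_terms = [feature_names[j] for j in top_indices]
    print(f"Doc {i}: {top_terms}")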
Worked Example

TF-IDF by Hand

3 mini-documents. Compute TF-IDF for the word "gcash" in Doc 1.

Documents

  1. "gcash users love the new gcash savings feature" (8 words)
  2. "gcash reports strong quarterly growth numbers today" (7 words)
  3. "bitcoin prices surge amid global market rally today" (8 words)

Formulas

$\text{TF}(t,d) = \frac{\text{count of } t \text{ in } d}{|d|}$

$\text{IDF}(t) = \log\!\left(\frac{N}{df_t}\right)$

$\text{TF-IDF} = \text{TF} \times \text{IDF}$

Step | Calculation | Value
TF("gcash", Doc 1) | 2 occurrences / 8 words | $2/8 = 0.250$
df("gcash") | Appears in Doc 1, Doc 2 | 2 documents
IDF("gcash") | $\log_{10}(3/2)$ | $0.176$
TF-IDF | $0.250 \times 0.176$ | $\mathbf{0.044}$
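The hand calculation can be cross-checked from the raw documents. This is a minimal sketch under two assumptions: tokens come from a plain whitespace split (which gives 8 tokens for Doc 1, so TF = 2/8), and the logarithm is base 10 (matching the IDF value 0.176). The helper names `tf`, `idf`, and `tf_idf` are illustrative.

```python
import math

docs = [
    "gcash users love the new gcash savings feature",
    "gcash reports strong quarterly growth numbers today",
    "bitcoin prices surge amid global market rally today",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    # Share of the document's tokens that are `term`
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # log10(N / df), where df = number of documents containing `term`
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log10(len(corpus) / df)

def tf_idf(term, doc_index, corpus):
    return tf(term, corpus[doc_index]) * idf(term, corpus)

gcash_score = tf_idf("gcash", 0, tokenized)
savings_score = tf_idf("savings", 0, tokenized)
print(f"gcash: {gcash_score:.3f}, savings: {savings_score:.3f}")
# gcash: 0.044, savings: 0.060
```

"savings" scores higher than "gcash" even though it appears only once, because it occurs in no other document. Note that scikit-learn's `TfidfVectorizer` uses a smoothed natural-log IDF and L2-normalizes each row by default, so its numbers will not match this textbook formula.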

Interpretation

"gcash" is moderately important in Doc 1: it appears often (high TF) but also in another doc (lowers IDF). A word like "savings" (only in Doc 1) would score higher.

Part II — Text Representation

TF-IDF Scores Across Documents

Heatmap of TF-IDF scores showing common words score low and distinctive words score high
Activity 1

TF-IDF Ranking Challenge

Instructions (5 min, pairs)

Given these 3 PH news headlines from a corpus of 100 articles:

  1. "GCash launches new savings feature for Filipino users"
  2. "BSP raises interest rates amid inflation concerns"
  3. "Filipino fintech GCash reports 86M registered users"

Rank these words by likely TF-IDF score (highest first):

savings   Filipino   the   GCash   inflation   reports

Hint: Think about which words are frequent in their document but rare across all 100 articles.

Live Demo

TF-IDF Explorer

Algorithm: TF-IDF
1. TF(t,d) = count(t in d) / |d|
2. IDF(t) = log(N / df(t))
3. TF-IDF = TF × IDF
TF-IDF Heatmap
Knowledge Check

Which term has the HIGHEST TF-IDF score?

A) “the” — appears in every document
B) “analytics” — appears in 2 of 100 docs, 5× each
C) “data” — appears in 80 of 100 docs
D) “I” — common stopword

✓ Correct: B) “analytics”

High TF (appears 5 times in those docs) × high IDF (only 2/100 docs) = highest TF-IDF. Terms A, C, D have low IDF because they appear in most documents.

Part III

Reading Between
the Lines

Sentiment analysis classifies text polarity — positive, negative, or neutral — using lexicons or machine learning.

Part III — Sentiment Analysis

Three Approaches to Sentiment Analysis

Comparison of lexicon-based, ML-based, and pre-trained sentiment approaches
Part III — Sentiment Analysis

VADER Sentiment

What is VADER?

Valence Aware Dictionary and sEntiment Reasoner. Rule-based, tuned for social media text.

Compound Score

Ranges from -1 (most negative) to +1 (most positive). Conventional thresholds: ≥ 0.05 positive, ≤ -0.05 negative, anything in between neutral.

VADER Scoring Algorithm
for each word in text:
    look up valence in lexicon
    apply modifiers (caps, !, degree adverbs)
sum adjusted valences
normalize to [-1, +1] → compound score
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

texts = [
    "This product is amazing! I love it!",
    "Terrible experience, never buying again",
    "It's okay, nothing special",
]

for text in texts:
    scores = sia.polarity_scores(text)
    print(f"{text}")
    print(f"  Compound: {scores['compound']:.2f}")

# Output:
# "This product is amazing..." → Compound: 0.86
# "Terrible experience..."     → Compound: -0.48
# "It's okay, nothing..."      → Compound: -0.09
Aha Moment

VADER Doesn't Speak Filipino

Same meaning, different language, completely different result.

This is why off-the-shelf NLP tools fail for Philippine social media data.

Part III — How TextBlob Works

Inside TextBlob's Two Scores

① Polarity (−1 to +1)

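Average word polarity from the Pattern lexicon, with each word's polarity scaled by any preceding intensifier and clipped to [−1, +1].

$\text{Polarity} = \dfrac{1}{n}\sum_{w} \operatorname{clip}(p_w \cdot m_w,\,-1,\,+1)$

$p_w$ = word polarity, $m_w$ = intensifier modifier (e.g., "very" = 1.3×; 1 if none), $n$ = number of scored words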

② Subjectivity (0 to 1)

Average subjectivity of the scored words, where each lexicon entry carries a subjectivity value $s_w$ between 0 and 1.

$\text{Subjectivity} = \dfrac{1}{n}\sum_{w} s_w$
Lexicon-based: No ML. Just dictionary lookup + averaging. Fast but limited to English & specific domains.
Worked Example: "The food was very delicious"

① Look up each word in the lexicon
Word | Polarity (p) | Subjectivity | Modifier (m)
"the" | (stopword) | (stopword) | -
"food" | (neutral) | - | -
"was" | - | - | -
"very" | - | - | 1.3 (boosts next)
"delicious" | +0.8 | 0.9 | -
→ "very delicious": 0.8 × 1.3 = 1.04 → clipped to 1.0

② Compute averages
Polarity = (1.0) / 1 scored word = +1.00
Subjectivity = (0.9) / 1 scored word = 0.90

Only "delicious" contributes; stopwords and neutrals don't count. "very" boosts the score but doesn't have its own polarity.
Result: VERY POSITIVE (+1.0) and HIGHLY OPINIONATED (0.9)
⚠ Same logic on "Grabe ang sarap!" → polarity 0 (no Filipino words in lexicon)
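The lookup-and-average logic of this worked example can be sketched in a few lines. This is a toy illustration, not TextBlob's actual implementation: `TOY_LEXICON` and `INTENSIFIERS` hold only the example's illustrative values, and the function name `toy_sentiment` is hypothetical.

```python
# Toy lexicon with the example's illustrative values: word -> (polarity, subjectivity)
TOY_LEXICON = {"delicious": (0.8, 0.9)}
INTENSIFIERS = {"very": 1.3}

def clip(x, lo=-1.0, hi=1.0):
    return max(lo, min(x, hi))

def toy_sentiment(text):
    polarities, subjectivities = [], []
    modifier = 1.0
    for word in text.lower().strip("!.?").split():
        if word in INTENSIFIERS:
            modifier = INTENSIFIERS[word]  # boost applies to the next word
            continue
        if word in TOY_LEXICON:
            p, s = TOY_LEXICON[word]
            polarities.append(clip(p * modifier))  # 0.8 × 1.3 = 1.04 → 1.0
            subjectivities.append(s)
        modifier = 1.0  # the boost is used up after one word
    if not polarities:
        return 0.0, 0.0  # nothing in the lexicon: neutral by default
    n = len(polarities)
    return sum(polarities) / n, sum(subjectivities) / n

print(toy_sentiment("The food was very delicious"))  # (1.0, 0.9)
print(toy_sentiment("Grabe ang sarap!"))             # (0.0, 0.0)
```

The second call shows the failure mode for Filipino text: with no lexicon hits, the score silently defaults to neutral rather than signalling that the input was not understood.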
Part III — Sentiment Analysis

TextBlob Sentiment

Two Dimensions

  • Polarity: -1 (negative) to +1 (positive)
  • Subjectivity: 0 (factual) to 1 (opinionated)

Per-Sentence Analysis

Analyze each sentence separately for mixed-sentiment texts like product reviews.

from textblob import TextBlob

# Whole-sentence analysis
text = "The food was delicious but the service was slow"
blob = TextBlob(text)
print(f"Polarity: {blob.sentiment.polarity:.2f}")
print(f"Subjectivity: {blob.sentiment.subjectivity:.2f}")
# Output: Polarity: 0.35, Subjectivity: 0.80

# Per-clause analysis (split manually)
for clause in ["The food was delicious", "The service was slow"]:
    pol = TextBlob(clause).sentiment.polarity
    print(f"'{clause}' → {pol:.2f}")
# Output:
# 'The food was delicious' → 1.00
# 'The service was slow' → -0.30
Part III — Philippine Context

Philippine Social Media Sentiment

Bar chart showing sentiment distribution of Philippine social media posts
Live Demo

Sentiment Analyzer

Algorithm: VADER-like Scoring
1. for each word: score = lexicon[word] or 0
2. compound = normalize(Σ scores)
3. if compound ≥ 0.05 → POSITIVE
   elif compound ≤ -0.05 → NEGATIVE
   else → NEUTRAL
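The three steps above translate almost directly into Python. This is a simplified sketch, not the VADER library itself: `TOY_LEXICON` is a hypothetical four-word lexicon, the modifier rules (caps, punctuation, degree adverbs) are left out, and the normalization `x / sqrt(x² + 15)` is the squashing function VADER applies to the summed valences.

```python
import math
import re

# Hypothetical mini-lexicon; real VADER ships thousands of human-rated entries
TOY_LEXICON = {"love": 3.2, "amazing": 2.8, "okay": 0.9, "terrible": -2.1}

def compound(text, alpha=15.0):
    words = re.findall(r"[a-z]+", text.lower())
    total = sum(TOY_LEXICON.get(word, 0.0) for word in words)  # unknown words score 0
    # Squash the unbounded sum into (-1, 1)
    return total / math.sqrt(total * total + alpha)

def label(score):
    if score >= 0.05:
        return "POSITIVE"
    if score <= -0.05:
        return "NEGATIVE"
    return "NEUTRAL"

for text in ["I love it, amazing!", "Terrible experience", "Grabe ang sarap!"]:
    print(text, "→", label(compound(text)))
# I love it, amazing! → POSITIVE
# Terrible experience → NEGATIVE
# Grabe ang sarap! → NEUTRAL
```

The Filipino sentence is enthusiastic praise, yet it is labelled NEUTRAL because none of its words are in the lexicon.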
Sentiment Gauge
Part III — Sentiment Analysis

ML Sentiment Pipeline

TF-IDF + Naive Bayes

A simple yet effective pipeline: vectorize text with TF-IDF, then classify with Naive Bayes (a classifier that assumes features are independent given the class — "naive" because this is rarely true, yet it works surprisingly well). The Multinomial variant models word counts/frequencies, making it ideal for text.

When to Use ML-Based

  • Domain-specific language (medical, legal)
  • You have labeled training data
  • Lexicon approaches underperform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Labeled training data (only the first examples are shown; the rest is elided)
X_train = ["great product", "terrible quality"]  # ...
y_train = [1, 0]                                  # ... 1=positive, 0=negative

# Pipeline: TF-IDF + Naive Bayes
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])
pipeline.fit(X_train, y_train)

# Predict new text
predictions = pipeline.predict(["I love this!"])
# Expected with a realistic training set: [1] (positive)
Part IV

Discovering
Hidden Themes

LDA topic modeling reveals latent topics. NER extracts named entities. Word clouds visualize term frequencies.

Part IV — Topic Modeling

LDA Topic Modeling

Quick Reminder

Supervised = labeled data, model learns input→output mapping. Unsupervised = no labels, model discovers structure on its own. LDA is unsupervised.

Latent Dirichlet Allocation

Unsupervised algorithm that discovers hidden topics in a collection of documents.

Key Assumptions

  • Documents are mixtures of topics
  • Topics are distributions over words
  • You choose number of topics (k)
  • Input must be word counts, not TF-IDF
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# LDA needs word COUNTS (not TF-IDF!)
# `documents` is assumed to be a list of strings defined elsewhere
cv = CountVectorizer(max_features=5000, stop_words='english')
X_counts = cv.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=5, random_state=42)
lda.fit(X_counts)

# Print top words per topic
names = cv.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [names[j] for j in topic.argsort()[-10:]]
    print(f"Topic {i}: {', '.join(top)}")
# Topic 0: price, quality, value, product...
# Topic 1: delivery, shipping, days, arrive...
LDA Algorithm (Simplified)
input: documents, k topics
initialize random topic assignments for all words
repeat until convergence:
    for each word w in each document:
        P(topic | doc) × P(word | topic) → reassign
output: topic-word and doc-topic distributions
Part IV — Philippine Context

Handling Filipino Text

Before and after Filipino text preprocessing showing stopword removal
Summary

Method Comparison: Text Analytics Toolkit

Method | Input | Output | Best For | PH Limitation
Bag of Words | Raw text | Word count vectors | Simple classification, baselines | Ignores Taglish word order
TF-IDF | Raw text | Weighted term vectors | Search, document similarity | No Filipino IDF corpora available
VADER | English text | Polarity scores (-1 to +1) | Social media, informal text | Zero Filipino coverage
TextBlob | English text | Polarity + subjectivity | Quick sentiment + objectivity | English-only lexicon
ML Pipeline | Labeled data | Predicted classes | Domain-specific sentiment | Needs Filipino labeled dataset
LDA | Word counts | Topic distributions | Theme discovery, exploration | Filipino stopwords needed

Live Demo

LDA Topic Discovery

Algorithm: LDA
1. assign each word a random topic
2. repeat until stable:
   for each word w:
       new = argmax P(topic|doc) × P(w|topic)
       reassign w → new topic
3. Output: K clusters of related words

Session 1: Key Takeaways

  1. Preprocessing is critical — lowercase, tokenize, remove stopwords, lemmatize
  2. TF-IDF weights important terms higher than common words
  3. Sentiment analysis has three approaches: lexicon, ML, and pre-trained
  4. LDA discovers hidden topics in document collections
  5. Filipino text needs custom stopwords and Taglish handling

Next: Analytics at Scale & Ethics

Big data tools, privacy regulations, algorithmic bias, and responsible AI.

CMSC 178DA | Week 11 · Session 2

Scale, Privacy
& Fairness

When data gets big, ethics must get bigger

Department of Computer Science

University of the Philippines Cebu

"With great data comes great responsibility."

The Philippine Data Explosion

86M+

GCash registered users

98M

Internet users in PH

95M

Social media accounts

Who protects this data?

Agenda

Session 2 Objectives

Big Data Tools

When pandas isn’t enough: Spark, cloud platforms, and the 5 Vs of big data.

Privacy & Compliance

GDPR, Philippine DPA (RA 10173), anonymization, and consent requirements.

Bias & Fairness

Sources of algorithmic bias, fairness metrics, mitigation strategies, and responsible AI.

Part I

When pandas
Is Not Enough

Big data demands distributed computing. Learn when and why to scale beyond a single machine.

Part I — Analytics at Scale

The 5 Vs of Big Data

Pentagon diagram showing Volume, Velocity, Variety, Veracity, Value
Part I — Analytics at Scale

When Do You Need Big Data Tools?

× You DON’T need Spark if:
  • Data fits in memory (<16 GB)
  • Processing is one-time, ad-hoc
  • Simple aggregations / filters
  • pandas + SQL handles it fine
✓ You DO need distributed tools when:
  • Data exceeds single machine memory
  • Processing must be parallelized
  • Real-time streaming is required
  • ML at scale (millions of records)
TL;DR

Most analytics tasks (<10 GB) don’t need Spark. Use the simplest tool that works.

Part I — Analytics at Scale

Apache Spark

Distributed Computing

  • In-memory processing (up to 100× faster than MapReduce in memory; typically 3–10× in practice)
  • Supports Python (PySpark), SQL, Scala, R
  • MLlib for machine learning at scale

When to Choose Spark

Datasets >100 GB, iterative ML algorithms, streaming data, or when a single machine can’t keep up.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Analytics") \
    .getOrCreate()

# Load CSV (distributed across cluster)
df = spark.read.csv("large_data.csv", header=True, inferSchema=True)

# SQL-like operations at scale
df.groupBy("region") \
    .agg({"sales": "sum"}) \
    .orderBy("sum(sales)", ascending=False) \
    .show()
# Result: distributed across worker nodes
Part I — Analytics at Scale

Cloud Analytics Platforms

Platform | Service | Strength | Pricing Model
AWS | Redshift, Athena, EMR | Most comprehensive ecosystem | Per-query or provisioned
GCP | BigQuery | Serverless SQL at scale | Per-TB scanned
Azure | Synapse Analytics | Enterprise integration (Office 365) | DWU-based

-- BigQuery example: analyze sales by region (serverless, no cluster setup)
SELECT
  region,
  SUM(sales) AS total_sales,
  COUNT(*) AS num_transactions
FROM `project.dataset.sales_table`
WHERE date >= '2024-01-01'
GROUP BY region
ORDER BY total_sales DESC

Knowledge Check

Your dataset is 500 MB of CSV files.
Which tool should you use?

A) Apache Spark cluster
B) pandas on your laptop
C) Google BigQuery
D) Hadoop MapReduce

✓ Correct: B) pandas on your laptop

500 MB fits comfortably in memory. No need for distributed computing overhead. Use the simplest tool that works!

Part II

The Right to
Be Forgotten

Privacy regulations like GDPR and the Philippine DPA define how data can be collected, used, and stored.

Part II — Data Privacy

Data Privacy Regulation Timeline

Timeline of privacy regulations from 1995 EU Directive to 2020 CCPA
Part II — Philippine Context

Philippine Data Privacy Act (RA 10173)

Consent

Freely given, specific, and informed

Purpose

Use only for stated purpose

Minimization

Collect only what’s needed

Accuracy

Keep data up to date

Retention

Delete when no longer needed

Part II — Data Privacy

Anonymization Techniques

Technique | Description | Example
Masking | Hide partial data | “Juan D.”
Generalization | Broaden categories | Age 25 → “20–30”
Suppression | Remove identifiers | Remove SSN column
Noise Addition | Add random values | Salary ± 5%
K-Anonymity | Ensure k similar records | 5+ with same quasi-identifiers

import pandas as pd

df['name'] = df['name'].str[0] + '***'      # Masking
# Generalization; right=False puts age 20 in '20-30' rather than '<20'
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 20, 30, 40, 50, 100],
                         labels=['<20', '20-30', '30-40', '40-50', '50+'],
                         right=False)
Part II — Data Privacy

K-Anonymity

Definition

A dataset satisfies k-anonymity if every combination of quasi-identifiers (attributes that alone are harmless but together can identify someone, e.g. age, ZIP, gender) is shared by at least k records, so each person is indistinguishable from at least k−1 others.

Why k ≥ 5?

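With k=5, an attacker who knows someone's quasi-identifiers can narrow that person down only to a group of at least 5 records, so the chance of picking the right record is at most 1 in 5. Sensitive values can still leak if everyone in a group shares the same value.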

K-Anonymity Check
group by quasi-identifiers (age, zip, gender)
for each group:
    if count < k: fail
if all groups have count ≥ k → dataset is k-anonymous
Before and after k-anonymity showing generalized data
Worked Example

K-Anonymity Step by Step (k=3)

Before: Identifiable

Name | Age | ZIP | Disease
Maria Santos | 28 | 6000 | Flu
Juan Cruz | 29 | 6001 | Diabetes
Ana Reyes | 27 | 6000 | Flu
Pedro Lim | 42 | 6045 | Asthma
Rosa Garcia | 44 | 6046 | Flu
Carlo Tan | 43 | 6045 | Asthma

After: k=3 Anonymized

Name | Age | ZIP | Disease
*** | 25-30 | 600* | Flu
*** | 25-30 | 600* | Diabetes
*** | 25-30 | 600* | Flu
*** | 40-45 | 604* | Asthma
*** | 40-45 | 604* | Flu
*** | 40-45 | 604* | Asthma

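The before/after tables can be checked mechanically. This is a minimal sketch of the group-and-count check, assuming age and ZIP are the quasi-identifiers; the helper names `smallest_group_size` and `is_k_anonymous` are illustrative.

```python
from collections import Counter

def smallest_group_size(records, quasi_identifiers):
    # Count how many records share each quasi-identifier combination
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def is_k_anonymous(records, quasi_identifiers, k):
    return smallest_group_size(records, quasi_identifiers) >= k

before = [
    {"age": 28, "zip": "6000", "disease": "Flu"},
    {"age": 29, "zip": "6001", "disease": "Diabetes"},
    {"age": 27, "zip": "6000", "disease": "Flu"},
    {"age": 42, "zip": "6045", "disease": "Asthma"},
    {"age": 44, "zip": "6046", "disease": "Flu"},
    {"age": 43, "zip": "6045", "disease": "Asthma"},
]
after = [
    {"age": "25-30", "zip": "600*", "disease": "Flu"},
    {"age": "25-30", "zip": "600*", "disease": "Diabetes"},
    {"age": "25-30", "zip": "600*", "disease": "Flu"},
    {"age": "40-45", "zip": "604*", "disease": "Asthma"},
    {"age": "40-45", "zip": "604*", "disease": "Flu"},
    {"age": "40-45", "zip": "604*", "disease": "Asthma"},
]

QUASI_IDENTIFIERS = ["age", "zip"]
print(is_k_anonymous(before, QUASI_IDENTIFIERS, k=3))  # False
print(is_k_anonymous(after, QUASI_IDENTIFIERS, k=3))   # True
```

In the raw table every (age, ZIP) pair is unique, so the smallest group has size 1. After generalization there are two groups of three, which satisfies k=3 but not k=5.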
Part III

When Algorithms
Discriminate

Bias in data becomes bias in decisions. Understanding sources and metrics is the first step toward fairness.

Part III — Algorithmic Bias

Four Sources of Algorithmic Bias

Four-quadrant diagram of bias sources: Historical, Representation, Measurement, Aggregation
Part III — Case Studies

Real-World Bias Failures

COMPAS (Criminal Justice)

Predicted recidivism risk for sentencing. ProPublica found it produced higher false positive rates for Black defendants than white defendants. Used in real sentencing decisions.

Amazon Hiring Tool (HR)

Trained on 10 years of (mostly male) hiring data. The model learned to penalize resumes containing the word “women’s”. Amazon scrapped the entire program.

Part III — Algorithmic Bias

Detecting Bias

Confusion Matrix

A 2x2 table comparing predicted vs. actual labels. Four cells: TP (correctly flagged), FP (wrongly flagged), TN (correctly cleared), FN (wrongly missed).

Check Metrics by Group

Split predictions by demographic and compare error rates. Significant differences indicate disparate impact (when a model's error rates differ substantially across protected groups).

What to Look For

  • Unequal false positive rates (FPR)
  • Unequal false negative rates (FNR)
  • Different accuracy across groups
Bias Detection Algorithm
for each group g in demographics:
    compute confusion matrix for g
    $\text{FPR}_g = \frac{FP}{FP+TN}$
    $\text{FNR}_g = \frac{FN}{FN+TP}$
if max(FPR) / min(FPR) > 1.25:
    flag disparate impact
from sklearn.metrics import confusion_matrix

# Check metrics by demographic group
# (df is assumed to hold 0/1 'actual' and 'predicted' columns)
for group in df['demographic'].unique():
    subset = df[df['demographic'] == group]
    # labels=[0, 1] keeps the matrix 2x2 even if a group lacks one class
    cm = confusion_matrix(subset['actual'], subset['predicted'], labels=[0, 1])
    # Rows are actual labels, columns are predictions: [[TN, FP], [FN, TP]]
    fpr = cm[0, 1] / (cm[0, 0] + cm[0, 1])   # false positive rate
    fnr = cm[1, 0] / (cm[1, 0] + cm[1, 1])   # false negative rate
    print(f"{group}: FPR={fpr:.3f}, FNR={fnr:.3f}")

# If FPR differs significantly across groups
# → your model has disparate impact
Worked Example

FPR by Gender: Is This Fair?

A GCash credit model flags applicants as "high risk." Let's compute FPR by gender.

Male Applicants

FP = 3, TN = 47 → $\text{FPR}_{\text{male}} = \frac{3}{3+47} = 0.06$ (6%)

Female Applicants

FP = 8, TN = 42 → $\text{FPR}_{\text{female}} = \frac{8}{8+42} = 0.16$ (16%)

× Disparate Impact Detected

Female FPR is 2.7x higher

Women are incorrectly flagged as "high risk" almost 3 times more often than men.

What This Means

16% of creditworthy women are wrongly denied vs. only 6% of men. Same model, same threshold, very different outcomes.

The Fix?

Adjust thresholds per group (post-processing), rebalance training data (pre-processing), or add fairness constraints (in-processing).
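The arithmetic in this worked example can be verified in a few lines. This is a minimal sketch that reuses the FP/TN counts given above and the max/min ratio test with the 1.25 cutoff from the bias detection algorithm; the `counts` dictionary layout is an illustrative choice.

```python
def false_positive_rate(fp, tn):
    # Share of actual negatives that were wrongly flagged
    return fp / (fp + tn)

# FP and TN counts per group, taken from the worked example
counts = {
    "male": {"fp": 3, "tn": 47},
    "female": {"fp": 8, "tn": 42},
}

fpr = {group: false_positive_rate(c["fp"], c["tn"]) for group, c in counts.items()}
ratio = max(fpr.values()) / min(fpr.values())
flagged = ratio > 1.25  # cutoff from the bias detection algorithm

for group, value in fpr.items():
    print(f"{group}: FPR = {value:.2f}")
print(f"ratio = {ratio:.2f}, disparate impact flagged: {flagged}")
# male: FPR = 0.06
# female: FPR = 0.16
# ratio = 2.67, disparate impact flagged: True
```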

Part III — Why Fairness?

"Fair" Means Different Things

There's No Single Definition

"Treat people equally" sounds simple. But equally how? Equal treatment? Equal outcomes? Equal error rates? Each is a different mathematical criterion, and in general they cannot all hold at once.

Kleinberg's Impossibility (2017)

Mathematically proven: you cannot satisfy all common fairness metrics simultaneously unless the groups have equal base rates or your model predicts perfectly (which it never does).

The real question: Which fairness definition matters most for your problem? Lending? Medical diagnosis? Criminal sentencing? Each requires a different choice.
Three Competing Definitions of "Fair"

① Demographic Parity: "Equal positive rates across groups"
P(ŷ=1 | A=male) = P(ŷ=1 | A=female)
Use case: hiring, university admissions (equal access)
⚠ Ignores ground truth: could deny qualified candidates to balance numbers

② Equalized Odds: "Equal TPR and FPR across groups"
TPR_male = TPR_female AND FPR_male = FPR_female
Use case: criminal justice (equal accuracy regardless of group)
⚠ Conflicts with calibration: can't have both

③ Calibration: "Equal precision across groups"
P(y=1 | ŷ=p, A=group) = p for all groups
Use case: medical risk scores ("70% confidence" means the same thing)
⚠ Conflicts with equalized odds when base rates differ

Pick ONE based on what your stakeholders value most.
Part III — Algorithmic Bias

Fairness Metrics

Metric | Definition | When to Use | Example
Demographic Parity | Equal positive prediction rates across groups | Hiring, lending | Same loan approval rate for all demographics
Equalized Odds | Equal TPR and FPR across groups | Criminal justice | Same error rates regardless of race
Calibration | Equal precision across groups | Medical diagnosis | 70% confidence means 70% correct for all groups

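A small example shows how two of these metrics can disagree on the same predictions. This is a sketch on hypothetical toy data (four labelled cases per group); `group_metrics` is an illustrative helper, and calibration is left out because it needs predicted probabilities rather than hard labels.

```python
def rate(values):
    # Mean of a list of 0/1 values; 0.0 for an empty list
    return sum(values) / len(values) if values else 0.0

def group_metrics(y_true, y_pred):
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    preds_for_negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    return {
        "positive_rate": rate(y_pred),       # compared by demographic parity
        "tpr": rate(preds_for_positives),    # compared by equalized odds
        "fpr": rate(preds_for_negatives),    # compared by equalized odds
    }

# Hypothetical predictions for two groups with identical true labels
group_a = group_metrics(y_true=[1, 1, 0, 0], y_pred=[1, 0, 1, 0])
group_b = group_metrics(y_true=[1, 1, 0, 0], y_pred=[1, 1, 0, 0])

parity_gap = abs(group_a["positive_rate"] - group_b["positive_rate"])
tpr_gap = abs(group_a["tpr"] - group_b["tpr"])
fpr_gap = abs(group_a["fpr"] - group_b["fpr"])
print(f"parity gap = {parity_gap}, TPR gap = {tpr_gap}, FPR gap = {fpr_gap}")
# parity gap = 0.0, TPR gap = 0.5, FPR gap = 0.5
```

Both groups receive positive predictions at the same rate, so demographic parity holds, yet the model is right for every case in group B and only half the cases in group A, so equalized odds fails.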
Part III — Algorithmic Bias

Disparate Error Rates Across Groups

Grouped bar chart showing FPR and FNR across demographic groups
Activity + Discussion

Bias in Practice

Live Demo

Bias Detector Simulator

Algorithm: Bias Detection
1. split data by protected group
2. for each group:
   FPR = FP / (FP + TN)
   FNR = FN / (FN + TP)
3. if |FPR_a − FPR_b| > threshold
   → DISPARATE IMPACT ⚠
Fairness Metrics by Gender
Part III — Algorithmic Bias

Mitigating Bias: Three Stages

Pre-processing

Fix the data before training. Rebalance datasets, remove proxy features, use synthetic oversampling (SMOTE — generates synthetic minority-class samples by interpolating between existing ones).

In-processing

Fix the algorithm. Add fairness constraints to the loss function, use adversarial debiasing (train an adversary to detect demographic info from predictions — penalize the model if it leaks), or fair representation learning.

Post-processing

Fix the output. Adjust decision thresholds per group, calibrate probabilities, or audit and correct predictions.
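As one illustration of the post-processing stage, the sketch below applies a separate decision threshold to each group. The scores, group labels, and cutoff values are invented, and `apply_group_thresholds` is a hypothetical helper rather than a library function.

```python
import numpy as np

def apply_group_thresholds(scores, group, thresholds):
    """Turn model scores into 0/1 decisions using a per-group cutoff."""
    scores = np.asarray(scores, dtype=float)
    cutoffs = np.array([thresholds[g] for g in group])
    return (scores >= cutoffs).astype(int)

# Hypothetical model scores for six applicants in two groups
scores = [0.62, 0.48, 0.55, 0.41, 0.70, 0.35]
group = ["a", "a", "a", "b", "b", "b"]

# One shared cutoff: group a is approved twice as often as group b
single = apply_group_thresholds(scores, group, {"a": 0.5, "b": 0.5})

# Lowering group b's cutoff equalizes approval rates (demographic parity)
adjusted = apply_group_thresholds(scores, group, {"a": 0.5, "b": 0.4})

print(single.tolist(), adjusted.tolist())
```

The adjustment equalizes selection rates without retraining, but it treats identical scores differently depending on group, which is exactly the kind of trade-off the fairness definitions force you to choose between.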

Part IV

Building AI That
Serves Everyone

Explainability, accountability, and ethics must be designed in — not bolted on after deployment.

Part IV — Why Explainability?

The Black Box Problem

The Issue

Deep models & ensembles can have millions of parameters. Even the people who built them can't easily explain a single prediction.

High-Stakes Settings

  • "Why was my loan denied?"
  • "Why did this AI recommend prison?"
  • "Why was my resume rejected?"
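GDPR: Article 22 limits decisions based solely on automated processing, and Articles 13–15 give a right to "meaningful information about the logic involved". The Philippine DPA (RA 10173), Sec. 16, gives data subjects comparable rights.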
Without Explainability vs With Explainability

❌ Black Box: income, age, credit history, zip code → model (opaque) → DENY. "Why?" No answer.

✓ With SHAP/LIME: income, age, credit history, zip code → model (explanations attached)
  • income contributed +0.4
  • credit history contributed −0.6
  • age, zip ≈ 0

Two Approaches to Open the Black Box

SHAP (Global & Local): game theory, fair credit for each feature. "How much did income contribute to THIS prediction?"

LIME (Local Only): approximate the model with a simple model. "In this neighborhood, model behaves like a line."

Both are model-agnostic: they work with ANY classifier (NN, RF, XGBoost, ...)
Part IV — SHAP Derivation

SHAP: Fair Credit from Game Theory

Origin: Shapley (1953)

Cooperative game theory: how to fairly divide rewards among teammates whose individual contributions are entangled.

The Formula

$\phi_i = \displaystyle\sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \bigl[f(S \cup \{i\}) - f(S)\bigr]$

$\phi_i$ = SHAP value for feature $i$   $F$ = all features   $S$ = subset without $i$

Average the marginal contribution of $i$ across all possible orderings in which it could be added.

Three guarantees: efficiency (values sum to prediction minus baseline), symmetry (equal features get equal credit), null (zero-impact features get zero).
Worked Example: Loan Denial Prediction (P = 0.18)

Baseline (avg approval rate): 0.50 → final prediction: 0.18 → drop of 0.32

Waterfall chart (axis 0.0 to 0.75):
  • Baseline: 0.50
  • credit hist: −0.30 (poor history)
  • income: +0.10 (decent)
  • age: −0.08 (young)
  • zip code: −0.04 (low-credit area)
  • Final: 0.18

Sum of SHAP values = Prediction − Baseline (efficiency property)
0.50 + (−0.30) + (+0.10) + (−0.08) + (−0.04) = 0.18 ✓

Story: "Loan denied because credit history dragged prediction down by 0.30"

Auditable, defensible, GDPR-compliant explanation per individual prediction.
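The Shapley formula can be evaluated exactly for a handful of features by enumerating subsets. The sketch below does that for a toy value function built from the worked example's numbers. The slide does not give the underlying model, so the additive value function, the feature names, and the optional `interaction` term are assumptions for illustration only.

```python
from itertools import combinations
from math import factorial

BASELINE = 0.50
EFFECTS = {"credit_hist": -0.30, "income": 0.10, "age": -0.08, "zip_code": -0.04}

def make_value(interaction=0.0):
    """Toy f(S): baseline plus each present feature's effect, plus an
    optional credit_hist x income interaction (hypothetical)."""
    def value(subset):
        v = BASELINE + sum(EFFECTS[f] for f in subset)
        if "credit_hist" in subset and "income" in subset:
            v += interaction
        return v
    return value

def shapley_values(features, value):
    """Exact Shapley values: weighted marginal contribution over all subsets."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for k in range(n):  # subset sizes 0 .. n-1
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi

features = list(EFFECTS)
phi_additive = shapley_values(features, make_value())
phi_interact = shapley_values(features, make_value(interaction=0.04))

print(phi_additive)
print("baseline + sum:", BASELINE + sum(phi_additive.values()))
print(phi_interact)
```

With a purely additive model each feature simply gets its own effect back. Once the interaction term is switched on, the extra 0.04 is split evenly between the two interacting features, which is the symmetry property at work. The loop visits 2^(n−1) subsets per feature, which is why exact computation does not scale past a few dozen features.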
Part IV — LIME Derivation

LIME: Local Linear Approximation

Core Idea (Ribeiro et al., 2016)

The complex model $f$ is non-linear globally — but in a small neighborhood around any point $x$, you can fit a simple linear model $g$ that explains it.

The Optimization

$\xi(x) = \arg\min_{g \in G} \mathcal{L}\bigl(f, g, \pi_x\bigr) + \Omega(g)$

$g$ = simple model (linear)   $\pi_x$ = neighborhood weights (closer = more weight)   $\Omega$ = simplicity penalty

Trade-off: SHAP is mathematically rigorous but slow, since the exact sum runs over every subset of features. LIME is fast, but its approximation is only valid locally (one point at a time).
Visual Intuition: Approximate Locally

Figure labels: f (complex), Class A, Class B, x (point to explain), π_x neighborhood, g (linear, local)

3 Steps to Explain Point x
  1. Sample perturbations around x (slightly modify each feature)
  2. Get f's prediction for each perturbed sample
  3. Fit linear model g on (sample, prediction) pairs, weighted by distance to x

→ Coefficients of g show which features push toward each class: that's the explanation.
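The three steps can be sketched with NumPy alone. This is a simplified illustration, not the `lime` package: it uses Gaussian perturbations, an exponential distance kernel for π_x, and plain weighted least squares, and it leaves out the simplicity penalty Ω. The function name and all default parameter values are assumptions.

```python
import numpy as np

def lime_explain(f, x, n_samples=500, scale=0.1, kernel_width=0.25, seed=0):
    """Fit a local linear surrogate g around x for a black-box function f.

    f must accept a 2D array of shape (n_samples, n_features) and return
    a 1D array of predictions.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)

    # Step 1: sample perturbations around x
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.size))

    # Step 2: query the black box on each perturbed sample
    y = f(Z)

    # pi_x: closer samples get more weight
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)

    # Step 3: weighted least squares on [1, z - x]
    A = np.hstack([np.ones((n_samples, 1)), Z - x])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1:]  # intercept (approx. f(x)) and local slopes

# Hypothetical non-linear "model" with two features
def f(Z):
    return Z[:, 0] ** 2 + np.sin(Z[:, 1])

intercept, slopes = lime_explain(f, [1.0, 0.0])
print("local slopes:", slopes)
```

At the point (1, 0) the true gradient of this toy function is (2, 1), so the fitted slopes should land close to those values. Rerunning at a different point gives different slopes, which is the sense in which the explanation is local.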
Part IV — Responsible AI

Explainability: SHAP & LIME

Why Explainability?

  • GDPR right to “meaningful information about the logic involved” (Art. 13–15)
  • Build trust with stakeholders
  • Debug model errors and biases
  • Regulatory compliance

Key Tools

  • SHAP: Shapley values — fair attribution of each feature’s contribution
  • LIME: Local explanations via interpretable surrogate models
SHAP waterfall plot showing feature contributions to prediction
Part IV — Why FATE(S)?

From FAT to FATE(S): An Evolving Framework

2018: FAT

ACM launches FAccT conference. Three pillars: Fairness, Accountability, Transparency. Reaction to ProPublica's COMPAS exposé and other public failures.

2018: FATE (+ Ethics)

Microsoft Research adds Ethics — recognizing that legal compliance ≠ moral correctness. The right question: should we even build this?

2020+: FATE(S) (+ Safety)

Columbia DSI adds Safety — preventing harm to users and bystanders, especially as models get deployed in safety-critical settings (healthcare, autonomous vehicles).

Important: these pillars often conflict. Maximizing one can hurt another. Next slide shows the tensions.
The 5 Pillars (and How They Pull Against Each Other)
  • Fairness: equal treatment
  • Accountability: clear ownership
  • Ethics: human values
  • Transparency: explainable
  • Safety: prevent harm (2020+)
Diagram edges between pillars are labeled "tension".
Part IV — Real Conflicts

FATE(S) is Not a Checklist — Pillars Conflict

Maximizing one pillar can hurt another. Real engineering means choosing trade-offs, not satisfying all five.

Part IV — Responsible AI

The FATE(S) Framework

Fairness

Equal treatment across demographic groups

Accountability

Clear ownership and responsibility for outcomes

Transparency

Explainable decisions and open processes

Ethics

Consider societal impact and human values

Safety

Prevent harm to users and communities

Part IV — Philippine Context

Ethics Challenges in the Philippines

GCash Credit Scoring

How do you score creditworthiness for informal economy workers with no traditional credit history? What biases might emerge?

DOH Disease Prediction

Rural areas have less data, worse connectivity. Models trained on urban data may fail in provinces where they’re needed most.

Facial Recognition

Commercial facial recognition has higher error rates for darker skin tones and women. Deployed in Philippine malls and airports.

Social Media Monitoring

With 95M accounts, social media surveillance raises privacy concerns. Where is the line between public safety and privacy?

Session 2: Key Takeaways

  1. Big data tools are needed only when scale demands it — don’t over-engineer
  2. Philippine DPA (RA 10173) governs data privacy; know its five principles
  3. Algorithmic bias enters through data, features, and modeling choices
  4. Fairness metrics help quantify bias — but you can’t satisfy all of them
  5. Responsible AI under FATE(S) requires continuous attention, not one-time audits

Lab 11: Bias Audit Project

Audit a model for bias, calculate fairness metrics, and propose mitigation strategies.

Appendix A — Optional

Sentiment Math Deep-Dive

VADER and TextBlob both look like simple lexicon lookups, but their published algorithms specify exact constants. These three slides show the formulas behind the high-level rules.

Appendix A — VADER Math

The Compound Score Formula

Step 1 — Sum modified valences

$x = \displaystyle\sum_i v_i'$

$v_i'$ = lexicon valence after CAPS, punctuation, booster, and negation modifiers.

Step 2 — Squash to [−1, +1]

$\text{compound} = \dfrac{x}{\sqrt{x^2 + \alpha}}, \quad \alpha = 15$

Softsign-style squashing. As $|x| \to \infty$, compound → ±1. At $x=0$, slope is $1/\sqrt{15} \approx 0.258$.

Why $\alpha = 15$? Empirically chosen so that typical sentence-level sums saturate near ±1. Larger $\alpha$ = gentler curve, smaller $\alpha$ = more aggressive saturation.
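The squashing step is small enough to check by hand. A minimal sketch of the formula above follows; the function name `vader_compound` is our own, not taken from the VADER library.

```python
import math

def vader_compound(x, alpha=15.0):
    """Squash an unbounded valence sum x into the open interval (-1, +1)."""
    return x / math.sqrt(x * x + alpha)

for x in (2, 5, 10, -1):
    print(x, round(vader_compound(x), 2))
# 2 0.46
# 5 0.79
# 10 0.93
# -1 -0.25
```

Going from x = 5 to x = 10 doubles the input but raises the compound score only from 0.79 to 0.93, which is the saturation the formula is designed to produce.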
Compound Squashing Curve: $y = x / \sqrt{x^2 + 15}$ (saturating, monotonic, smooth)

Plot: x = sum of valences (−12 to +12) against compound (−1 to +1), with a "linear x" reference line.

Sample points: x=2 → 0.46, x=5 → 0.79, x=10 → 0.93, x=−1 → −0.25

Doubling x doesn't double compound: the curve damps runaway sums. A 50-word and a 200-word rant can both land near −0.95.
Appendix A — VADER Math

The Magic Numbers Behind "Apply Modifiers"

① Punctuation

$v_i \leftarrow v_i + \text{sign}(v_i) \cdot \min(n_!, 4) \cdot 0.292$

Each `!` (up to 4) adds 0.292 in the same direction. Question marks add 0.18 each when there are two or three; four or more add a flat 0.96.

② ALL CAPS

$v_i \leftarrow v_i + \text{sign}(v_i) \cdot 0.733$

Only triggers when the sentence is mixed case. "This is GREAT" > "this is great"; "THIS IS GREAT" = "this is great".

③ Degree Adverbs (boosters)

$v_i \leftarrow v_i + \text{sign}(v_i) \cdot B(t_{i-k}) \cdot s_k$

$B$ = ±0.293 (very/barely). Distance damping $s_1{=}1.0,\ s_2{=}0.95,\ s_3{=}0.9$ over the previous 3 tokens.

④ Negation

$v_i \leftarrow v_i \cdot (-0.74)$

Applies if not / never / n't appears within the 3 preceding tokens. The scalar is −0.74 rather than −1, so "not great" turns negative but with a smaller magnitude than "great" had.

⑤ Contrastive "but"

$v_i \leftarrow \begin{cases} 0.5 \cdot v_i & i < \text{idx(but)} \\ 1.5 \cdot v_i & i > \text{idx(but)} \end{cases}$

Pre-"but" valences shrink, post-"but" valences amplify — matches the human reading that the second clause carries the speaker's true position. "Food was great BUT service was terrible" → net negative.

Source: Hutto & Gilbert (2014), "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text", ICWSM. Constants are tuned on a 5,000-tweet labeled dataset.
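To see how rules ③ to ⑤ combine with the compound formula, here is a simplified scorer written directly from the formulas on this slide. It is not the `vaderSentiment` package: the three-word lexicon and its valences are illustrative placeholders, the tokenizer is a plain whitespace split, and the punctuation and ALL CAPS rules are left out.

```python
import math

# Illustrative valences only; the real lexicon has thousands of entries
LEXICON = {"great": 3.1, "good": 1.9, "terrible": -2.1}
NEGATIONS = {"not", "never"}
BOOSTERS = {"very": 0.293, "barely": -0.293}
DAMPING = (1.0, 0.95, 0.9)  # s_1, s_2, s_3 for the previous 3 tokens

def sign(v):
    return 1.0 if v > 0 else -1.0

def score(text):
    words = [t.lower().strip(".,!?") for t in text.split()]
    valences = []
    for i, word in enumerate(words):
        v = LEXICON.get(word, 0.0)
        if v != 0.0:
            window = words[max(0, i - 3):i]
            # Rule 3: boosters within the previous 3 tokens, damped by distance
            for k in (1, 2, 3):
                if i - k >= 0 and words[i - k] in BOOSTERS:
                    v += sign(v) * BOOSTERS[words[i - k]] * DAMPING[k - 1]
            # Rule 4: negation within the previous 3 tokens
            if any(p in NEGATIONS or p.endswith("n't") for p in window):
                v *= -0.74
        valences.append(v)
    # Rule 5: shrink valences before "but", amplify those after it
    if "but" in words:
        idx = words.index("but")
        valences = [0.5 * v if i < idx else 1.5 * v if i > idx else v
                    for i, v in enumerate(valences)]
    x = sum(valences)
    return x / math.sqrt(x * x + 15)  # compound squashing

for s in ("great", "very great", "not great",
          "Food was great but service was terrible"):
    print(f"{s!r}: {score(s):+.3f}")
```

With these placeholder valences, "very great" scores above "great", "not great" comes out negative, and the "but" sentence ends up net negative because the second clause is weighted three times as heavily as the first.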
Appendix A — TextBlob Math

Corrected Worked Example

What Slide 18a glossed over

The earlier slide showed 0.8 × 1.3 = 1.04 → clipped to 1.0. In real Pattern code, clipping happens after the weighted average — not in the numerator.

Polarity (faithful)

$P = \text{clip}\!\left(\dfrac{\sum_j p_j m_j}{\sum_j m_j},\ -1,\ +1\right)$

For the example: $\dfrac{0.8 \cdot 1.3}{1.3} = 0.8$. The 1.3× modifier cancels in numerator and denominator when there's only one assessment.
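The cancellation is easy to verify numerically. The sketch below implements the polarity formula exactly as written on this slide; it does not call TextBlob or Pattern, so it illustrates the formula rather than confirming the library's output. The second assessment in the two-item example is hypothetical.

```python
def polarity(assessments):
    """Weighted average of polarities p_j with weights m_j, clipped at the end.

    assessments: list of (p, m) pairs.
    """
    numerator = sum(p * m for p, m in assessments)
    denominator = sum(m for _, m in assessments)
    return max(-1.0, min(1.0, numerator / denominator))

# One assessment: "very delicious" with p = 0.8, m = 1.3
one = polarity([(0.8, 1.3)])

# Two assessments: add a hypothetical unmodified negative adjective
two = polarity([(0.8, 1.3), (-0.5, 1.0)])

print(round(one, 2), round(two, 3))
```

With a single assessment the modifier has no effect on the result. With two, the 1.3 weight pulls the average toward the modified adjective, giving roughly 0.235 instead of the unweighted mean of 0.15.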

Subjectivity (unweighted!)

$S = \dfrac{1}{N}\displaystyle\sum_j s_j$

Plain average over $N$ scored adjectives, with no modifier weighting.

"The food was very delicious": Faithful Trace

① POS-tag & collect adjective assessments
the/DT food/NN was/VBD very/RB delicious/JJ
Adjective scanned: delicious → lexicon: p=0.8, s=0.9
Preceding adverb: very → modifier m=1.3 attaches

② Build assessment list
a₁ = (p, s, m) = (0.8, 0.9, 1.3)    N = 1 assessment

③ Polarity = weighted average, clip at end
numerator = Σ pⱼmⱼ = 0.8 × 1.3 = 1.04
denominator = Σ mⱼ = 1.3
P = clip(1.04 / 1.3, −1, +1) = clip(0.80, −1, +1) = +0.80

④ Subjectivity = simple mean (no weighting)
S = (1 / N) · Σ sⱼ = (1/1) · 0.9 = 0.90

Polarity = +0.80  ·  Subjectivity = 0.90  ·  Slide 18a's "1.0" was a stylization, not the real output.