CMSC 178DA | Week 11 · Session 1

Mining Meaning
from Text

From raw words to actionable insights

Department of Computer Science

University of the Philippines Cebu

"80% of enterprise data is unstructured — and much of it is text."

"The limits of my language mean the limits of my world."

— Ludwig Wittgenstein, 1921

Today: how to teach machines to understand text — and how to use that power responsibly.

Agenda

Session 1 Objectives

Text Preprocessing

Tokenize, clean, and normalize raw text into analysis-ready tokens.

Vectorization

Convert words into numbers using Bag of Words and TF-IDF representations.

Sentiment & Topics

Classify polarity with sentiment analysis and discover themes with LDA topic modeling.

Part I

Turning Noise
Into Signal

Natural language is messy. Preprocessing cleans, normalizes, and tokenizes text before any model can learn.

Part I — Text Preprocessing

Why Text Analytics?

Unstructured text data is everywhere — and growing faster than any other data type.

Text Data Sources

  • Customer reviews & feedback
  • Social media posts & comments
  • Support tickets & emails
  • News articles & reports
  • Survey open-ended responses
Bar chart showing 80% of enterprise data is unstructured text
Part I — Text Preprocessing

The Preprocessing Pipeline

Flowchart: Raw Text to Lowercase to Tokenize to Remove Stopwords to Lemmatize
Part I — Text Preprocessing

Preprocessing in Python

NLTK Library

The Natural Language Toolkit provides tokenizers, stemmers, lemmatizers, and stopword lists for 20+ languages.

Key Functions

  • word_tokenize() — split into words
  • stopwords.words() — common words list
  • WordNetLemmatizer() — dictionary lookup

Caveat: POS Tagging

WordNet defaults to noun POS — “running” stays as-is. Pass pos='v' for verb lemmatization to get “run”.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    text = text.lower()
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens
              if t not in stop_words and t.isalpha()]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens

# Example
preprocess("The dogs are running FAST!")
# Output: ['dog', 'running', 'fast']
Part I — Text Preprocessing

Stemming vs. Lemmatization

Comparison of stemmer vs lemmatizer outputs for various words
Part II

Making Words
Countable

Machines need numbers, not words. Bag of Words and TF-IDF convert text into vector space.

Part II — Text Representation

Bag of Words

The simplest text representation: count how many times each word appears.

Limitations

  • Loses word order entirely
  • Common words dominate the counts
  • High-dimensional, sparse matrices
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The food was good",
    "The service was bad",
    "Good food but bad service"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['bad', 'but', 'food', 'good', 'service', 'the', 'was']
print(X.toarray())
# [[0, 0, 1, 1, 0, 1, 1],
#  [1, 0, 0, 0, 1, 1, 1],
#  [1, 1, 1, 1, 1, 0, 0]]
Part II — Text Representation

TF-IDF: Important Words Get Higher Weight

Three-panel chart showing TF, IDF, and TF-IDF scores for the, data, and regression
Part II — Text Representation

TF-IDF in Python

Key Parameters

  • max_features — limit vocabulary size
  • min_df — ignore very rare terms
  • max_df — ignore very common terms
  • ngram_range — include bigrams
TL;DR

TF-IDF = TF × log(N/df). Words that are frequent in a document but rare across the corpus get the highest score.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=1000,
    min_df=5,
    max_df=0.95,
    ngram_range=(1, 2)  # Unigrams + bigrams
)
X_tfidf = tfidf.fit_transform(documents)

# Most important terms per document
feature_names = tfidf.get_feature_names_out()
for i, doc in enumerate(X_tfidf):
    top_indices = doc.toarray().argsort()[0][-5:]
    top_terms = [feature_names[j] for j in top_indices]
    print(f"Doc {i}: {top_terms}")
Part II — Text Representation

TF-IDF Scores Across Documents

Heatmap of TF-IDF scores showing common words score low and distinctive words score high
Knowledge Check

Which term has the HIGHEST TF-IDF score?

A) “the” — appears in every document
B) “analytics” — appears in 2 of 100 docs, 5× each
C) “data” — appears in 80 of 100 docs
D) “I” — common stopword

✓ Correct: B) “analytics”

High TF (appears 5 times in those docs) × high IDF (only 2/100 docs) = highest TF-IDF. Terms A, C, D have low IDF because they appear in most documents.
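The quiz numbers can be checked directly against the TF × log(N/df) formula from the TL;DR (a raw-count sketch, not sklearn's smoothed variant):

```python
import math

N = 100  # documents in the corpus

def tfidf(tf, df, N=N):
    """Raw-count TF-IDF: term frequency times log inverse document frequency."""
    return tf * math.log(N / df)

print(tfidf(tf=5, df=2))    # 'analytics': high TF, rare  → ~19.56
print(tfidf(tf=5, df=100))  # 'the': appears everywhere   → 0.0
print(tfidf(tf=5, df=80))   # 'data': common              → ~1.12
```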

Part III

Reading Between
the Lines

Sentiment analysis classifies text polarity — positive, negative, or neutral — using lexicons or machine learning.

Part III — Sentiment Analysis

Three Approaches to Sentiment Analysis

Comparison of lexicon-based, ML-based, and pre-trained sentiment approaches
Part III — Sentiment Analysis

VADER Sentiment

What is VADER?

Valence Aware Dictionary and sEntiment Reasoner. Rule-based, tuned for social media text.

Compound Score

Ranges from -1 (most negative) to +1 (most positive). Common thresholds: compound ≥ 0.05 → positive, ≤ -0.05 → negative, otherwise neutral.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

texts = [
    "This product is amazing! I love it!",
    "Terrible experience, never buying again",
    "It's okay, nothing special"
]

for text in texts:
    scores = sia.polarity_scores(text)
    print(f"{text}")
    print(f"  Compound: {scores['compound']:.2f}")

# Output:
# "This product is amazing..."  → Compound: 0.86
# "Terrible experience..."      → Compound: -0.48
# "It's okay, nothing..."       → Compound: -0.09
Part III — Sentiment Analysis

TextBlob Sentiment

Two Dimensions

  • Polarity: -1 (negative) to +1 (positive)
  • Subjectivity: 0 (factual) to 1 (opinionated)

Per-Sentence Analysis

Analyze each sentence separately for mixed-sentiment texts like product reviews.

from textblob import TextBlob

# Whole-sentence analysis
text = "The food was delicious but the service was slow"
blob = TextBlob(text)
print(f"Polarity: {blob.sentiment.polarity:.2f}")
print(f"Subjectivity: {blob.sentiment.subjectivity:.2f}")
# Output: Polarity: 0.35, Subjectivity: 0.80

# Per-clause analysis (split manually)
for clause in ["The food was delicious", "The service was slow"]:
    pol = TextBlob(clause).sentiment.polarity
    print(f"'{clause}' → {pol:.2f}")
# Output:
# "The food was delicious" → 1.00
# "The service was slow"   → -0.30
Part III — Philippine Context

Philippine Social Media Sentiment

Bar chart showing sentiment distribution of Philippine social media posts
Part III — Sentiment Analysis

ML Sentiment Pipeline

TF-IDF + Naive Bayes

A simple yet effective pipeline: vectorize text with TF-IDF, then classify with Multinomial Naive Bayes.

When to Use ML-Based

  • Domain-specific language (medical, legal)
  • You have labeled training data
  • Lexicon approaches underperform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Labeled training data
X_train = ["great product", "terrible quality", ...]
y_train = [1, 0, ...]  # 1=positive, 0=negative

# Pipeline: TF-IDF + Naive Bayes
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB())
])
pipeline.fit(X_train, y_train)

# Predict new text
predictions = pipeline.predict(["I love this!"])
# Output: [1] (positive)
Part IV

Discovering
Hidden Themes

LDA topic modeling reveals latent topics. NER extracts named entities. Word clouds visualize term frequencies.

Part IV — Topic Modeling

LDA Topic Modeling

Latent Dirichlet Allocation

Unsupervised algorithm that discovers hidden topics in a collection of documents.

Key Assumptions

  • Documents are mixtures of topics
  • Topics are distributions over words
  • You choose number of topics (k)
  • Input must be word counts, not TF-IDF
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# LDA needs word COUNTS (not TF-IDF!)
cv = CountVectorizer(max_features=5000, stop_words='english')
X_counts = cv.fit_transform(documents)

lda = LatentDirichletAllocation(
    n_components=5, random_state=42)
lda.fit(X_counts)

# Print top words per topic
names = cv.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [names[j] for j in topic.argsort()[-10:]]
    print(f"Topic {i}: {', '.join(top)}")
# Topic 0: price, quality, value, product...
# Topic 1: delivery, shipping, days, arrive...
Part IV — Philippine Context

Handling Filipino Text

Before and after Filipino text preprocessing showing stopword removal

Session 1: Key Takeaways

  1. Preprocessing is critical — lowercase, tokenize, remove stopwords, lemmatize
  2. TF-IDF weights important terms higher than common words
  3. Sentiment analysis has three approaches: lexicon, ML, and pre-trained
  4. LDA discovers hidden topics in document collections
  5. Filipino text needs custom stopwords and Taglish handling

Next: Analytics at Scale & Ethics

Big data tools, privacy regulations, algorithmic bias, and responsible AI.

CMSC 178DA | Week 11 · Session 2

Scale, Privacy
& Fairness

When data gets big, ethics must get bigger

Department of Computer Science

University of the Philippines Cebu

"With great data comes great responsibility."

The Philippine Data Explosion

86M+

GCash registered users

98M

Internet users in PH

95M

Social media accounts

Who protects this data?

Agenda

Session 2 Objectives

Big Data Tools

When pandas isn’t enough: Spark, cloud platforms, and the 5 Vs of big data.

Privacy & Compliance

GDPR, Philippine DPA (RA 10173), anonymization, and consent requirements.

Bias & Fairness

Sources of algorithmic bias, fairness metrics, mitigation strategies, and responsible AI.

Part I

When pandas
Is Not Enough

Big data demands distributed computing. Learn when and why to scale beyond a single machine.

Part I — Analytics at Scale

The 5 Vs of Big Data

Pentagon diagram showing Volume, Velocity, Variety, Veracity, Value
Part I — Analytics at Scale

When Do You Need Big Data Tools?

× You DON’T need Spark if:
  • Data fits in memory (<16 GB)
  • Processing is one-time, ad-hoc
  • Simple aggregations / filters
  • pandas + SQL handles it fine
✓ You DO need distributed tools when:
  • Data exceeds single machine memory
  • Processing must be parallelized
  • Real-time streaming is required
  • ML at scale (millions of records)
TL;DR

Most analytics tasks (<10 GB) don’t need Spark. Use the simplest tool that works.

Part I — Analytics at Scale

Apache Spark

Distributed Computing

  • In-memory processing (up to 100× faster than MapReduce in memory; typically 3–10× in practice)
  • Supports Python (PySpark), SQL, Scala, R
  • MLlib for machine learning at scale

When to Choose Spark

Datasets >100 GB, iterative ML algorithms, streaming data, or when a single machine can’t keep up.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Analytics") \
    .getOrCreate()

# Load CSV (distributed across cluster)
df = spark.read.csv("large_data.csv",
                    header=True, inferSchema=True)

# SQL-like operations at scale
df.groupBy("region") \
    .agg({"sales": "sum"}) \
    .orderBy("sum(sales)", ascending=False) \
    .show()
# Result: distributed across worker nodes
Part I — Analytics at Scale

Cloud Analytics Platforms

Platform | Service               | Strength                            | Pricing Model
AWS      | Redshift, Athena, EMR | Most comprehensive ecosystem        | Per-query or provisioned
GCP      | BigQuery              | Serverless SQL at scale             | Per-TB scanned
Azure    | Synapse Analytics     | Enterprise integration (Office 365) | DWU-based
-- BigQuery example: analyze sales by region (serverless, no cluster setup)
SELECT
    region,
    SUM(sales) AS total_sales,
    COUNT(*) AS num_transactions
FROM `project.dataset.sales_table`
WHERE date >= '2024-01-01'
GROUP BY region
ORDER BY total_sales DESC
Knowledge Check

Your dataset is 500 MB of CSV files.
Which tool should you use?

A) Apache Spark cluster
B) pandas on your laptop
C) Google BigQuery
D) Hadoop MapReduce

✓ Correct: B) pandas on your laptop

500 MB fits comfortably in memory. No need for distributed computing overhead. Use the simplest tool that works!

Part II

The Right to
Be Forgotten

Privacy regulations like GDPR and the Philippine DPA define how data can be collected, used, and stored.

Part II — Data Privacy

Data Privacy Regulation Timeline

Timeline of privacy regulations from 1995 EU Directive to 2020 CCPA
Part II — Philippine Context

Philippine Data Privacy Act (RA 10173)

Consent

Freely given, specific, and informed

Purpose

Use only for stated purpose

Minimization

Collect only what’s needed

Accuracy

Keep data up to date

Retention

Delete when no longer needed

Part II — Data Privacy

Anonymization Techniques

Technique      | Description              | Example
Masking        | Hide partial data        | "Juan D."
Generalization | Broaden categories       | Age 25 → "20–30"
Suppression    | Remove identifiers       | Remove SSN column
Noise Addition | Add random values        | Salary ± 5%
K-Anonymity    | Ensure k similar records | 5+ with same quasi-identifiers
import pandas as pd

df['name'] = df['name'].str[0] + '***'  # Masking
df['age_group'] = pd.cut(df['age'],     # Generalization
    bins=[0, 20, 30, 40, 50, 100],
    labels=['<20', '20-30', '30-40', '40-50', '50+'])
Part II — Data Privacy

K-Anonymity

Definition

A dataset satisfies k-anonymity if every combination of quasi-identifiers (age, ZIP, gender) matches at least k other records.

Why k ≥ 5?

With k=5, an attacker can narrow a person down only to a group of at least 5 matching records — at best a 1-in-5 guess, not enough to re-identify anyone.
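The definition suggests a simple audit: group by the quasi-identifiers and take the smallest group size (a sketch with made-up data; the column names are illustrative):

```python
import pandas as pd

# Toy dataset with two quasi-identifiers (illustrative values)
df = pd.DataFrame({
    'age_group': ['20-30'] * 5 + ['30-40'],
    'city':      ['Cebu']  * 5 + ['Davao'],
})

# k = size of the smallest quasi-identifier group
k = df.groupby(['age_group', 'city']).size().min()
print(f"Dataset satisfies {k}-anonymity")
# k=1 here: the single ('30-40', 'Davao') record is uniquely identifiable
```

A dataset passes the k ≥ 5 bar only when this minimum is at least 5, so the lone record above would need further generalization or suppression.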

Before and after k-anonymity showing generalized data
Part III

When Algorithms
Discriminate

Bias in data becomes bias in decisions. Understanding sources and metrics is the first step toward fairness.

Part III — Algorithmic Bias

Four Sources of Algorithmic Bias

Four-quadrant diagram of bias sources: Historical, Representation, Measurement, Aggregation
Part III — Case Studies

Real-World Bias Failures

COMPAS (Criminal Justice)

Predicted recidivism risk for sentencing. ProPublica found it produced higher false positive rates for Black defendants than white defendants. Used in real sentencing decisions.

Amazon Hiring Tool (HR)

Trained on 10 years of (mostly male) hiring data. The model learned to penalize resumes containing the word “women’s”. Amazon scrapped the entire program.

Part III — Algorithmic Bias

Detecting Bias

Check Metrics by Group

Split predictions by demographic and compare error rates. Significant differences indicate bias.

What to Look For

  • Unequal false positive rates (FPR)
  • Unequal false negative rates (FNR)
  • Different accuracy across groups
from sklearn.metrics import confusion_matrix

# Check metrics by demographic group
for group in df['demographic'].unique():
    subset = df[df['demographic'] == group]
    cm = confusion_matrix(
        subset['actual'], subset['predicted']
    )
    # False positive rate
    fpr = cm[0, 1] / (cm[0, 0] + cm[0, 1])
    # False negative rate
    fnr = cm[1, 0] / (cm[1, 0] + cm[1, 1])
    print(f"{group}: FPR={fpr:.3f}, FNR={fnr:.3f}")

# If FPR differs significantly across groups
# → your model has disparate impact
Part III — Algorithmic Bias

Fairness Metrics

Metric             | Definition                                    | When to Use       | Example
Demographic Parity | Equal positive prediction rates across groups | Hiring, lending   | Same loan approval rate for all demographics
Equalized Odds     | Equal TPR and FPR across groups               | Criminal justice  | Same error rates regardless of race
Calibration        | Equal precision across groups                 | Medical diagnosis | 70% confidence means 70% correct for all groups
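Demographic parity from the table takes only a few lines to compute — compare the positive-prediction rate per group (toy data; what gap counts as "too large" is a judgment call, though rules of thumb like the 80% rule exist in hiring law):

```python
import pandas as pd

# Toy predictions for two demographic groups (illustrative)
df = pd.DataFrame({
    'group':     ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'predicted': [ 1,   1,   1,   0,   1,   0,   0,   0 ],
})

# Positive prediction rate per group
rates = df.groupby('group')['predicted'].mean()
gap = rates.max() - rates.min()
print(rates)  # A: 0.75, B: 0.25
print(f"Demographic parity gap: {gap:.2f}")  # 0.50 → parity clearly violated
```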
Part III — Algorithmic Bias

Disparate Error Rates Across Groups

Grouped bar chart showing FPR and FNR across demographic groups
Part III — Algorithmic Bias

Mitigating Bias: Three Stages

Pre-processing

Fix the data before training. Rebalance datasets, remove proxy features, use synthetic oversampling (SMOTE).

In-processing

Fix the algorithm. Add fairness constraints to the loss function, use adversarial debiasing, or fair representation learning.

Post-processing

Fix the output. Adjust decision thresholds per group, calibrate probabilities, or audit and correct predictions.
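The post-processing stage can be sketched as per-group decision thresholds on predicted probabilities (the threshold values here are illustrative; in practice they are tuned on a validation set to equalize a chosen fairness metric):

```python
# Per-group cutoffs instead of a single global 0.5 (values are illustrative)
thresholds = {'A': 0.60, 'B': 0.45}

def decide(score, group):
    """Classify positive iff the score clears the group's own cutoff."""
    return int(score >= thresholds[group])

# The same score can flip decisions depending on the group's cutoff
print(decide(0.50, 'A'))  # 0 — below group A's stricter cutoff
print(decide(0.50, 'B'))  # 1 — above group B's lower cutoff
```

Shifting cutoffs this way trades a little overall accuracy for more equal error rates across groups — the core tension the fairness metrics slide describes.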

Part IV

Building AI That
Serves Everyone

Explainability, accountability, and ethics must be designed in — not bolted on after deployment.

Part IV — Responsible AI

Explainability: SHAP & LIME

Why Explainability?

  • GDPR right to “meaningful information about the logic involved” (Art. 13–15)
  • Build trust with stakeholders
  • Debug model errors and biases
  • Regulatory compliance

Key Tools

  • SHAP: Shapley values — fair attribution of each feature’s contribution
  • LIME: Local explanations via interpretable surrogate models
SHAP waterfall plot showing feature contributions to prediction
Part IV — Responsible AI

The FATE(S) Framework

Fairness

Equal treatment across demographic groups

Accountability

Clear ownership and responsibility for outcomes

Transparency

Explainable decisions and open processes

Ethics

Consider societal impact and human values

Safety

Prevent harm to users and communities

Part IV — Philippine Context

Ethics Challenges in the Philippines

GCash Credit Scoring

How do you score creditworthiness for informal economy workers with no traditional credit history? What biases might emerge?

DOH Disease Prediction

Rural areas have less data, worse connectivity. Models trained on urban data may fail in provinces where they’re needed most.

Facial Recognition

Commercial facial recognition has higher error rates for darker skin tones and women. Deployed in Philippine malls and airports.

Social Media Monitoring

With 95M accounts, social media surveillance raises privacy concerns. Where is the line between public safety and privacy?

Session 2: Key Takeaways

  1. Big data tools are needed only when scale demands it — don’t over-engineer
  2. Philippine DPA (RA 10173) governs data privacy; know its five principles
  3. Algorithmic bias enters through data, features, and modeling choices
  4. Fairness metrics help quantify bias — but you can’t satisfy all of them
  5. Responsible AI (FATE) requires continuous attention, not one-time audits

Lab 11: Bias Audit Project

Audit a model for bias, calculate fairness metrics, and propose mitigation strategies.