From raw words to actionable insights
Department of Computer Science
University of the Philippines Cebu
"80% of enterprise data is unstructured — and much of it is text."
— widely cited industry estimate
Today: how to teach machines to understand text — and how to use that power responsibly.
Tokenize, clean, and normalize raw text into analysis-ready tokens.
Convert words into numbers using Bag of Words and TF-IDF representations.
Classify polarity with sentiment analysis and discover themes with LDA topic modeling.
Natural language is messy. Preprocessing cleans, normalizes, and tokenizes text before any model can learn.
Unstructured text data is everywhere — and growing faster than any other data type.
Skipping any step introduces noise. “Running” and “run” should be the same token, not two different features.
A common order: tokenize, lowercase, remove stopwords, then lemmatize. The best pipeline depends on the task; for example, case matters for named-entity recognition.
The Natural Language Toolkit provides tokenizers, stemmers, lemmatizers, and stopword lists for 20+ languages.
word_tokenize() — split into words
stopwords.words() — common words list
WordNetLemmatizer() — dictionary lookup
WordNet defaults to noun POS — “running” stays as-is. Pass pos='v' for verb lemmatization to get “run”.
Rule-based suffix stripping. “studies” → “studi” (not a real word). Use when speed matters more than accuracy.
Dictionary-based lookup. “studies” → “study”. Slower but produces valid words. Preferred for analytics.
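The contrast can be sketched without NLTK. Below, `toy_stem` and `toy_lemmatize` are hypothetical stand-ins for PorterStemmer and WordNetLemmatizer, reduced to a few rules and a tiny dictionary for illustration:

```python
def toy_stem(word):
    """Rule-based suffix stripping: fast, but may produce non-words."""
    if word.endswith("ies"):
        return word[:-3] + "i"          # "studies" -> "studi"
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Dictionary-based lookup: slower to build, but returns valid words.
# This tiny dictionary is illustrative only; WordNet covers far more.
TOY_LEMMA_DICT = {"studies": "study", "running": "run", "better": "good"}

def toy_lemmatize(word):
    return TOY_LEMMA_DICT.get(word, word)

print(toy_stem("studies"))       # "studi" — not a real word
print(toy_lemmatize("studies"))  # "study" — a valid dictionary form
```

The trade-off is visible even in this sketch: the stemmer is a handful of string operations, while the lemmatizer needs a dictionary lookup per token.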
Machines need numbers, not words. Bag of Words and TF-IDF convert text into vector space.
The simplest text representation: count how many times each word appears.
How often does this term appear in this document? Common words like “the” score high.
How rare is this term across all documents? IDF = log(N / df). Rare terms score high.
max_features — limit vocabulary size
min_df — ignore very rare terms
max_df — ignore very common terms
ngram_range — include bigrams
TF-IDF = TF × log(N/df). Words that are frequent in a document but rare across the corpus get the highest score.
Stopwords like “the”, “is”, “of” appear everywhere → low IDF → near-zero TF-IDF.
Distinctive terms like “regression” or “cluster” appear in few documents → high IDF → high TF-IDF.
High TF (appears 5 times in those docs) × high IDF (only 2/100 docs) = highest TF-IDF. Terms A, C, D have low IDF because they appear in most documents.
Sentiment analysis classifies text polarity — positive, negative, or neutral — using lexicons or machine learning.
VADER, SentiWordNet. Fast, no training. Best for social media text.
Naive Bayes, SVM. Needs labeled data. Accurate on custom domains.
BERT, RoBERTa. State-of-the-art. Requires GPU compute.
Valence Aware Dictionary and sEntiment Reasoner. Rule-based, tuned for social media text.
Ranges from -1 (most negative) to +1 (most positive). Threshold: >0.05 positive, <-0.05 negative.
Analyze each sentence separately for mixed-sentiment texts like product reviews.
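The lexicon idea behind VADER can be sketched with a toy scorer. The word scores and normalization below are made up for illustration; the real VADER lexicon and its punctuation/negation heuristics are far richer, but the ±0.05 compound thresholds match the slide:

```python
# Hypothetical mini-lexicon; real VADER scores thousands of terms.
LEXICON = {"love": 0.7, "great": 0.6, "good": 0.4,
           "bad": -0.5, "terrible": -0.8}

def toy_sentiment(text):
    words = text.lower().split()
    compound = sum(LEXICON.get(w, 0.0) for w in words) / max(len(words), 1)
    if compound > 0.05:
        return "positive"
    if compound < -0.05:
        return "negative"
    return "neutral"

print(toy_sentiment("great food and good service"))  # positive
print(toy_sentiment("the delivery was terrible"))    # negative
```

For mixed-sentiment reviews, running this per sentence instead of per document keeps a glowing opening from masking a negative ending.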
With 98M internet users and 95M social media accounts (DataReportal 2026), the Philippines is one of the most active countries online. Sentiment analysis helps brands, government, and researchers understand public opinion.
VADER was designed for English text. Filipino/Taglish posts may need custom lexicons or translated models for accurate results.
A simple yet effective pipeline: vectorize text with TF-IDF, then classify with Multinomial Naive Bayes.
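That pipeline is a few lines with scikit-learn. The four training examples below are hypothetical; a real classifier needs hundreds or thousands of labeled texts:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled data for illustration only.
texts = ["i love this phone", "great product, works well",
         "terrible battery, very bad", "awful service, hate it"]
labels = ["pos", "pos", "neg", "neg"]

# Vectorize with TF-IDF, then classify with Multinomial Naive Bayes.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["love it, great phone"]))  # ['pos']
```

Because the pipeline bundles the vectorizer with the classifier, new text is transformed with the same vocabulary learned at training time.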
LDA topic modeling reveals latent topics. NER extracts named entities. Word clouds visualize term frequencies.
Unsupervised algorithm that discovers hidden topics in a collection of documents.
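A minimal LDA sketch with scikit-learn, on a hypothetical four-document corpus. LDA operates on raw word counts (Bag of Words), not TF-IDF:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stocks market trading profit", "market economy stocks invest",
        "basketball game team score", "team wins basketball season"]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # (n_docs, n_topics)

# Each row is a probability distribution over the 2 latent topics.
print(doc_topics.shape)                  # (4, 2)
print(round(float(doc_topics[0].sum()), 6))  # 1.0
```

With a real corpus you would inspect `lda.components_` to read off each topic's top words and assign it a human label.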
Standard NLTK stopwords are English-only. For Taglish text, combine English + Filipino stopwords: ang, ng, sa, na, ay, mga, at, para, ko, mo.
Filipino social media often mixes English and Filipino (“Taglish”). Both stopword lists must be applied for effective preprocessing.
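Combining the two lists is a simple set union. The English list below is a small illustrative subset; in practice you would use the full `stopwords.words('english')` from NLTK alongside the Filipino list from the slide:

```python
# Illustrative subset of English stopwords; use NLTK's full list in practice.
ENGLISH_STOPWORDS = {"the", "is", "a", "so", "and", "of"}
FILIPINO_STOPWORDS = {"ang", "ng", "sa", "na", "ay", "mga",
                      "at", "para", "ko", "mo"}
STOPWORDS = ENGLISH_STOPWORDS | FILIPINO_STOPWORDS

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "ang food sa resto is so good".split()
print(remove_stopwords(tokens))  # ['food', 'resto', 'good']
```

Without the Filipino list, tokens like "ang" and "sa" would dominate term frequencies and leak into every LDA topic.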
Big data tools, privacy regulations, algorithmic bias, and responsible AI.
When data gets big, ethics must get bigger
Department of Computer Science
University of the Philippines Cebu
"With great data comes great responsibility."
86M+
GCash registered users
98M
Internet users in PH
95M
Social media accounts
Who protects this data?
When pandas isn’t enough: Spark, cloud platforms, and the 5 Vs of big data.
GDPR, Philippine DPA (RA 10173), anonymization, and consent requirements.
Sources of algorithmic bias, fairness metrics, mitigation strategies, and responsible AI.
Big data demands distributed computing. Learn when and why to scale beyond a single machine.
Big data isn’t only about volume. Velocity (speed), variety (types), veracity (quality), and value (insights) all matter.
GCash: 13M daily transactions (velocity) across payments, loans, investments (variety) with fraud detection needs (veracity).
Most analytics tasks (<10 GB) don’t need Spark. Use the simplest tool that works.
Datasets >100 GB, iterative ML algorithms, streaming data, or when a single machine can’t keep up.
| Platform | Service | Strength | Pricing Model |
|---|---|---|---|
| AWS | Redshift, Athena, EMR | Most comprehensive ecosystem | Per-query or provisioned |
| GCP | BigQuery | Serverless SQL at scale | Per-TB scanned |
| Azure | Synapse Analytics | Enterprise integration (Office 365) | DWU-based |
500 MB fits comfortably in memory. No need for distributed computing overhead. Use the simplest tool that works!
Privacy regulations like GDPR and the Philippine DPA define how data can be collected, used, and stored.
Gold standard for privacy. Right to deletion, data portability, fines up to 4% of global revenue.
RA 10173. Enforced by NPC. Consent, purpose limitation, breach notification within 72 hours.
Consumer right to know, delete, and opt-out of data sales. Applies to large businesses.
Freely given, specific, and informed
Use only for stated purpose
Collect only what’s needed
Keep data up to date
Delete when no longer needed
The NPC enforces the DPA, investigates breaches, and can impose fines and imprisonment for violations.
Organizations must notify the NPC and affected individuals within 72 hours of discovering a personal data breach.
| Technique | Description | Example |
|---|---|---|
| Masking | Hide partial data | “Juan D.” |
| Generalization | Broaden categories | Age 25 → “20–30” |
| Suppression | Remove identifiers | Remove SSN column |
| Noise Addition | Add random values | Salary ± 5% |
| K-Anonymity | Ensure k similar records | 5+ with same quasi-identifiers |
A dataset satisfies k-anonymity if every combination of quasi-identifiers (age, ZIP, gender) is shared by at least k records, so each record is indistinguishable from at least k−1 others.
With k=5, an attacker who knows a person’s quasi-identifiers has at best a 1-in-5 chance of picking the right record — not enough to re-identify.
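The k-anonymity check is a group-by over the quasi-identifier columns. A minimal sketch, with hypothetical records:

```python
from collections import Counter

def satisfies_k_anonymity(records, quasi_ids, k):
    """True if every quasi-identifier combination appears in >= k records."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return min(groups.values()) >= k

records = [
    {"age": "20-30", "zip": "6000", "gender": "F"},
    {"age": "20-30", "zip": "6000", "gender": "F"},
    {"age": "20-30", "zip": "6000", "gender": "M"},
]

# The lone (20-30, 6000, M) record breaks 2-anonymity...
print(satisfies_k_anonymity(records, ["age", "zip", "gender"], k=2))  # False
# ...but suppressing gender as a quasi-identifier restores it.
print(satisfies_k_anonymity(records, ["age", "zip"], k=2))            # True
```

This illustrates why generalization and suppression from the table above go hand in hand with k-anonymity: they shrink the quasi-identifier space until every group reaches size k.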
Bias in data becomes bias in decisions. Understanding sources and metrics is the first step toward fairness.
Bias is rarely intentional. It enters through the data, the features, and the modeling choices we make — often invisibly.
Biased models deployed at scale can affect millions: loans denied, jobs not offered, sentences lengthened.
Predicted recidivism risk for sentencing. ProPublica found it produced higher false positive rates for Black defendants than white defendants. Used in real sentencing decisions.
Trained on 10 years of (mostly male) hiring data. The model learned to penalize resumes containing the word “women’s”. Amazon scrapped the entire program.
Split predictions by demographic and compare error rates. Significant differences indicate bias.
| Metric | Definition | When to Use | Example |
|---|---|---|---|
| Demographic Parity | Equal positive prediction rates across groups | Hiring, lending | Same loan approval rate for all demographics |
| Equalized Odds | Equal TPR and FPR across groups | Criminal justice | Same error rates regardless of race |
| Calibration | Equal precision across groups | Medical diagnosis | 70% confidence means 70% correct for all groups |
You cannot satisfy all fairness metrics simultaneously (Chouldechova, 2017). Choose based on your domain’s values.
Criminal justice prioritizes equalized odds (equal error rates). Lending prioritizes demographic parity (equal access).
Group B is falsely flagged at 23% vs. 5–8% for others. This is the kind of disparity COMPAS exhibited.
Lowering FPR for Group B may raise FNR. The question is: which error is more harmful in your context?
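Measuring that disparity is straightforward once predictions are split by group. A sketch with hypothetical labels and predictions (all stdlib, no fairness library assumed):

```python
def fpr_by_group(y_true, y_pred, groups):
    """False positive rate per demographic group."""
    rates = {}
    for g in set(groups):
        fp = sum(1 for t, p, gr in zip(y_true, y_pred, groups)
                 if gr == g and t == 0 and p == 1)
        tn = sum(1 for t, p, gr in zip(y_true, y_pred, groups)
                 if gr == g and t == 0 and p == 0)
        rates[g] = fp / (fp + tn) if (fp + tn) else 0.0
    return rates

# Hypothetical negatives only, to isolate false-positive behavior.
y_true = [0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = fpr_by_group(y_true, y_pred, groups)
print(sorted(rates.items()))  # [('A', 0.25), ('B', 0.75)]
```

The same loop structure extends to TPR, precision, or positive-prediction rate, covering all three metrics in the table above.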
Fix the data before training. Rebalance datasets, remove proxy features, use synthetic oversampling (SMOTE).
Fix the algorithm. Add fairness constraints to the loss function, use adversarial debiasing, or fair representation learning.
Fix the output. Adjust decision thresholds per group, calibrate probabilities, or audit and correct predictions.
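Per-group threshold adjustment, the simplest post-processing fix, can be sketched in a few lines. The threshold values here are made up, and whether group-specific thresholds are appropriate (or even lawful) depends heavily on the domain:

```python
def decide(score, group, thresholds):
    """Apply a group-specific decision threshold to a model score."""
    return int(score >= thresholds[group])

# Hypothetical: a lower cutoff for the group the model under-scores.
thresholds = {"A": 0.5, "B": 0.4}
applicants = [("A", 0.45), ("A", 0.60), ("B", 0.45), ("B", 0.30)]

decisions = [decide(score, group, thresholds) for group, score in applicants]
print(decisions)  # [0, 1, 1, 0]
```

Note that the underlying model is untouched; only the decision rule changes, which is why post-processing is the easiest mitigation to audit and to roll back.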
Explainability, accountability, and ethics must be designed in — not bolted on after deployment.
Equal treatment across demographic groups
Clear ownership and responsibility for outcomes
Explainable decisions and open processes
Consider societal impact and human values
Prevent harm to users and communities
FATE is a continuous practice, not a one-time audit. Models must be monitored after deployment for drift and emerging bias.
Core framework: FATE (Microsoft Research). The “S” for Safety added by Columbia DSI. Also see the ACM FAccT conference on Fairness, Accountability, and Transparency.
How do you score creditworthiness for informal economy workers with no traditional credit history? What biases might emerge?
Rural areas have less data, worse connectivity. Models trained on urban data may fail in provinces where they’re needed most.
Commercial facial recognition has higher error rates for darker skin tones and women. Deployed in Philippine malls and airports.
With 95M accounts, social media surveillance raises privacy concerns. Where is the line between public safety and privacy?
Build AI that works for Filipinos, not just Western populations. Use local data, local context, and local values to create systems that serve everyone equitably.
Audit a model for bias, calculate fairness metrics, and propose mitigation strategies.