Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu
Lecture 3: Probability Review
Probability: A number between 0 and 1 that measures the likelihood of an event occurring. It quantifies uncertainty in data and predictions.
Probability is the foundation for everything we'll learn
Before we combine probabilities, let's review set notation
A ∪ B (Union)
A or B (or both)
A ∩ B (Intersection)
A and B (both)
A' (Complement)
not A
Sample Space (S): The set of all possible outcomes · Event: A subset of outcomes we care about
S = sample space · A, B = events · A∩B = intersection
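Python's built-in sets mirror this notation directly. A quick sketch with a single die roll (the events here are chosen for illustration, not taken from the slides):

```python
# Sample space for one die roll, with two events
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}          # "roll is even"
B = {4, 5, 6}          # "roll is greater than 3"

print(A | B)   # union: A or B (or both) -> {2, 4, 5, 6}
print(A & B)   # intersection: A and B   -> {4, 6}
print(S - A)   # complement of A in S    -> {1, 3, 5}
```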
Fundamental Rules
$$P(S) = 1$$
$$P(A') = 1 - P(A)$$
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
Every probability must be between 0 and 1: $0 \leq P(A) \leq 1$
Before we use this notation, let's understand what it means
$$P(A \mid B)$$
$P(A|B)$
"Probability of A"
$\mid$
"given that"
$B$
"B already happened"
Example: P(Rain | Cloudy) = "If it's cloudy, what's the chance of rain?"
We're not asking about all days — only cloudy days
Conditional Probability: The probability of event A occurring, given that event B has already occurred. It measures how B affects our belief about A.
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
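A minimal sketch of the formula in Python, applied to the rain/cloudy example (the 0.40 and 0.28 are made-up numbers for illustration):

```python
# P(Rain | Cloudy) = P(Rain AND Cloudy) / P(Cloudy)
p_cloudy = 0.40            # P(B): chance a day is cloudy (assumed)
p_rain_and_cloudy = 0.28   # P(A ∩ B): chance a day is rainy AND cloudy (assumed)

p_rain_given_cloudy = p_rain_and_cloudy / p_cloudy
print(p_rain_given_cloudy)  # 0.7 → on cloudy days, rain is much more likely
```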
What's the probability a random user pays bills AND is a woman?
$P(\text{Bills} \cap \text{Women}) = 0.70 \times 0.57 \approx 0.40$ (assuming the two events are independent)
Source: GCash Statistics 2024 · 70% assumption for illustration
Definition: A and B are independent if:
$$P(A \cap B) = P(A) \times P(B)$$
Caution: Independence is often assumed but rarely verified!
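One way to verify (rather than assume) independence on a small sample space is to check the product rule by brute force. A sketch with die-roll events chosen for illustration:

```python
from fractions import Fraction

# One fair die: check whether P(A ∩ B) equals P(A) · P(B)
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}      # "even"
B = {4, 5, 6}      # "greater than 3"

def prob(event):
    # Equally likely outcomes, so probability = |event| / |S|
    return Fraction(len(event), len(S))

print(prob(A & B))         # 1/3
print(prob(A) * prob(B))   # 1/4 → not equal, so A and B are NOT independent
```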
Thomas Bayes (1701-1761)
English minister and mathematician who asked: "How should we update our beliefs when we see new evidence?"
Bayes gives us a systematic way to update what we believe
What you believed before seeing data
Data you just observed
Your updated belief
Everyday example: You hear thunder.
Before hearing it: 30% chance of rain → After hearing thunder: 80% chance of rain
That update is Bayes' Theorem in action.
$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$
| Term | Name |
|---|---|
| $P(A|B)$ | Posterior |
| $P(B|A)$ | Likelihood |
| $P(A)$ | Prior |
| $P(B)$ | Evidence |
Key Insight:
Bayes' theorem lets us update our beliefs when we get new evidence.
Before we apply Bayes' theorem, learn these key terms
| Term | Symbol | What It Means | Example |
|---|---|---|---|
| Prevalence | $P(D)$ | % of population with disease | 1% have diabetes |
| Sensitivity | $P(+|D)$ | % of sick people who test positive | 95% detection rate |
| Specificity | $P(-|\neg D)$ | % of healthy people who test negative | 90% correctly cleared |
Key Question: If someone tests positive, what's the actual probability they have the disease?
(Hint: It's NOT the sensitivity!)
Most people intuitively guess ~90%. The actual probability is only 8.8%!
$$P(+) = P(+|D) \cdot P(D) + P(+|\neg D) \cdot P(\neg D)$$
$= 0.95 \times 0.01 + 0.10 \times 0.99 = \mathbf{0.1085}$
$$P(D|+) = \frac{P(+|D) \cdot P(D)}{P(+)}$$
$= \frac{0.95 \times 0.01}{0.1085} = \mathbf{0.088}$
Result: Only 8.8% chance of disease despite positive test!
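The whole calculation fits in a few lines of Python:

```python
# Diagnostic-test example: why a positive test doesn't mean 95% chance of disease
prevalence = 0.01      # P(D): 1% have the disease
sensitivity = 0.95     # P(+|D): detection rate
specificity = 0.90     # P(-|¬D), so the false-positive rate P(+|¬D) = 0.10

# Law of total probability: P(+) = P(+|D)P(D) + P(+|¬D)P(¬D)
p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem: P(D|+) = P(+|D)P(D) / P(+)
p_disease_given_pos = sensitivity * prevalence / p_pos

print(f"P(+)   = {p_pos:.4f}")                 # 0.1085
print(f"P(D|+) = {p_disease_given_pos:.3f}")   # 0.088
```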
Random Variable: A variable whose value is determined by the outcome of a random process. It maps outcomes to numerical values.
Discrete: countable outcomes
Continuous: infinite possible values
Statistics uses Greek letters as shorthand — here's what they mean
| Symbol | Name | Meaning | Example |
|---|---|---|---|
| μ | mu (mew) | Population mean (true average) | μ = 170 cm (avg height) |
| σ | sigma | Standard deviation (spread) | σ = 10 cm |
| $\bar{x}$ | x-bar | Sample mean (measured average) | $\bar{x}$ = 168 cm |
| Σ | Sigma (big) | Summation ("add up all...") | Σx = sum of all values |
μ and σ describe the population (what we want to know) · $\bar{x}$ comes from our sample (what we can measure)
Expected Value (Mean): The long-run average of a random variable
Variance: How spread out values are from the mean
Discrete: $E[X] = \sum x \cdot P(X=x)$
Continuous: $E[X] = \int x \cdot f(x) \, dx$
$$Var(X) = E[(X - \mu)^2]$$
Shortcut: $Var(X) = E[X^2] - (E[X])^2$
Standard Deviation: $\sigma = \sqrt{Var(X)}$ — same units as the data
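A quick sketch computing $E[X]$, $Var(X)$, and $\sigma$ for a fair six-sided die (the die is my example, not from the slides), using the shortcut formula above:

```python
import numpy as np

# Fair die: outcomes 1..6, each with probability 1/6
x = np.array([1, 2, 3, 4, 5, 6])
p = np.full(6, 1 / 6)

mean = np.sum(x * p)                  # E[X] = Σ x·P(X=x) = 3.5
var = np.sum(x**2 * p) - mean**2      # shortcut: Var(X) = E[X²] − (E[X])²
sd = np.sqrt(var)                     # σ, same units as the data

print(mean, var, sd)   # 3.5  ≈2.9167  ≈1.7078
```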
Let's calculate the expected value of a ₱20 lotto ticket
| Prize | Probability | × Value |
|---|---|---|
| ₱0 (lose) | 0.99 | ₱0.00 |
| ₱50 (minor) | 0.008 | ₱0.40 |
| ₱500 | 0.0019 | ₱0.95 |
| ₱10M (jackpot) | 0.0001 | ₱1,000 |
$E[X] = \sum x \cdot P(X=x) = 0 + 0.40 + 0.95 + 1000$
E[X] ≈ ₱1,001.35
Wait... that seems profitable? 🤔
Actual jackpot probability: 1 in 40,475,358
That jackpot term is then worth only 10,000,000 / 40,475,358 ≈ ₱0.25, so E[X] ≈ ₱1.60, not ₱1,001
You pay ₱20, expect about ₱1.60 back
→ Lose about ₱18.40 on average per ticket!
Lesson: Expected value reveals hidden costs that intuition misses
Source: PCSO Ultra Lotto 6/58 (official odds)
A probability distribution describes how likely each possible outcome is
Think of it as a "shape of uncertainty"
Discrete: countable outcomes (dice sum) → Binomial, Poisson
Continuous: infinite values (heights) → Normal, Exponential
Different phenomena follow different distributions — choosing the right one is key!
Choosing the wrong distribution leads to wrong conclusions
Always check your data's shape before choosing a distribution
Different distributions model different types of data: counts, rates, continuous measurements
Models the number of successes in a fixed number of independent yes/no trials
When to use: Counting successes in fixed number of trials
$$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$$
10 loan applications, 30% approval rate. P(exactly 5 approvals)?
from scipy.stats import binom
# P(exactly 5 approvals out of 10, approval rate 30%)
prob = binom.pmf(5, n=10, p=0.3)
print(f"P(X=5) = {prob:.4f}") # Output: 0.1029
Models the count of events in a fixed interval of time or space, when events occur independently at a constant average rate
When to use: Rare events in fixed time/space
$$P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
Website receives avg 20 visits/hour. P(exactly 25 visits)?
from scipy.stats import poisson
# P(exactly 25 visits, avg rate 20/hour)
prob = poisson.pmf(25, mu=20)
print(f"P(X=25) = {prob:.4f}") # Output: 0.0446
The most important distribution in statistics
Many natural phenomena follow this bell-shaped curve when sample sizes are large enough
Also called the Gaussian distribution, after Carl Friedrich Gauss (1777-1855)
Normal (Gaussian) Distribution: A symmetric, bell-shaped distribution defined by mean ($\mu$) and standard deviation ($\sigma$).
This rule is used to identify outliers and assess data normality
The Most Important Theorem in Statistics
"The sampling distribution of the mean approaches normal as sample size increases, regardless of population distribution."
As sample size increases: exponential population → normal sample means
import numpy as np
import matplotlib.pyplot as plt

# Exponential population (skewed, not normal!)
population = np.random.exponential(scale=2, size=100000)
# Take 1000 samples of size 30, compute their means → approximately Normal!
sample_means = [np.random.choice(population, 30).mean()
                for _ in range(1000)]
plt.hist(sample_means, bins=30)
plt.show()
Skewed exponential population → Normal distribution of sample means (n=30)
Choosing the wrong distribution leads to misleading results
If you assume income is Normal (symmetric), you will:
Better: Log-normal distribution (right-skewed)
If you model server crashes with Normal, you might:
Better: Poisson (handles count data correctly)
Always: (1) Visualize your data first (2) Check skewness and shape (3) Run a normality test if using Normal
Problem:
Solution:
Source: BSP Payment Systems Reports · Approach based on industry fraud detection practices
Given:
Calculate:
Hypothetical rates for illustration
$P(F) = P(F|Flag) \cdot P(Flag) + P(F|\neg Flag) \cdot P(\neg Flag)$
$= 0.20 \times 0.05 + 0.001 \times 0.95$
$= 0.01 + 0.00095 = \mathbf{0.01095}$
≈ 1.1% of transactions are fraudulent
Using Bayes' Theorem:
$P(Flag|F) = \frac{P(F|Flag) \cdot P(Flag)}{P(F)}$
$= \frac{0.20 \times 0.05}{0.01095} = \frac{0.01}{0.01095}$
≈ 91.3% of fraud gets flagged
Insight: The system catches 91% of fraud, but most flagged transactions (80%) are false alarms!
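The same numbers, verified in Python with the law of total probability and Bayes' theorem:

```python
# Fraud-detection example (hypothetical rates, as noted above)
p_flag = 0.05                  # P(Flag): 5% of transactions get flagged
p_fraud_given_flag = 0.20      # P(F|Flag): 20% of flagged are fraud
p_fraud_given_noflag = 0.001   # P(F|¬Flag): 0.1% of unflagged are fraud

# Total probability: P(F) = P(F|Flag)P(Flag) + P(F|¬Flag)P(¬Flag)
p_fraud = p_fraud_given_flag * p_flag + p_fraud_given_noflag * (1 - p_flag)

# Bayes: P(Flag|F) = P(F|Flag)P(Flag) / P(F)
p_flag_given_fraud = p_fraud_given_flag * p_flag / p_fraud

print(f"P(F)      = {p_fraud:.5f}")           # 0.01095
print(f"P(Flag|F) = {p_flag_given_fraud:.3f}")  # 0.913
```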
Lecture 4: Statistical Inference
You survey 200 students at UP Cebu about their preferred payment app.
68% say GCash.
Can you conclude that 68% of ALL Filipino college students prefer GCash?
Statistical inference = using sample data to draw conclusions about a larger population, while quantifying our uncertainty
The fundamental distinction in statistics
Population (N)
Everyone you want to study
e.g., All 109M Filipinos
Sample (n)
Subset you actually measure
e.g., Survey of 1,000 people
Why sample? We can't measure everyone — so we measure a subset and infer about the whole
Population: PSA 2020 Census (109,035,343)
Definition: Drawing conclusions about populations from samples
What's the value?
Is there an effect?
A structured method for deciding whether observed data provides enough evidence to reject a claim about a population
Just like a jury needs "beyond reasonable doubt," we need a p-value below alpha to reject H₀
Status quo, no effect
What we're testing for
We never "prove" H₀ - we either reject it or fail to reject it
The Confusion Matrix of Statistical Decisions
| | H₀ Actually True | H₀ Actually False |
|---|---|---|
| Reject H₀ | ✗ Type I Error (α): False Positive | ✓ Correct: True Positive (Power) |
| Fail to Reject H₀ | ✓ Correct: True Negative | ✗ Type II Error (β): False Negative |
Statistical Power = 1 − β (ability to detect a true effect when it exists)
H₀: "You are NOT pregnant"
Telling a man he's pregnant
Detecting something that doesn't exist
Cost: Unnecessary panic, wasted resources
Telling a pregnant woman she's not
Missing something that does exist
Cost: Delayed prenatal care, health risks
Type II errors are often more dangerous — you miss a real condition!
Definition: Probability of observing data as extreme as ours, assuming H₀ is true
Before computers, statisticians looked up critical values in printed tables
| df | α=0.10 | α=0.05 | α=0.01 |
|---|---|---|---|
| 5 | 2.015 | 2.571 | 4.032 |
| 10 | 1.812 | 2.228 | 3.169 |
| 15 | 1.753 | 2.131 | 2.947 |
| 20 | 1.725 | 2.086 | 2.845 |
| ∞ | 1.645 | 1.960 | 2.576 |
Two-tailed test
| df | α=0.10 | α=0.05 | α=0.01 |
|---|---|---|---|
| 1 | 2.706 | 3.841 | 6.635 |
| 2 | 4.605 | 5.991 | 9.210 |
| 3 | 6.251 | 7.815 | 11.345 |
| 4 | 7.779 | 9.488 | 13.277 |
| 5 | 9.236 | 11.070 | 15.086 |
Right-tailed test
How to use: If |your statistic| > critical value → reject H₀
Example: t = 2.5, df = 10 → critical = 2.228 → Reject
from scipy import stats

# group1, group2: arrays of sample observations
t_stat, p_value = stats.ttest_ind(group1, group2)
if p_value < 0.05:
print("Reject H₀")
No tables needed — exact p-value computed!
Both methods give the same answer — Python just automates the lookup
You flip a coin 10 times and get 9 heads. Is it unfair?
H₀: Coin is fair (P = 0.5)
H₁: Coin is biased
You calculate: p = 0.02
What p = 0.02 means:
"If the coin WERE fair, there's only a 2% chance of getting 9+ heads"
Conclusion: Since 2% < 5% (our α), we reject H₀
The result is "too weird" to happen by chance → evidence of bias
The coin flip uses the Binomial Distribution
P(X ≥ 9) = P(X=9) + P(X=10)
$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$
$P(9) = \binom{10}{9}(0.5)^{10} = 10 \times 0.000977 = 0.00977$ · $P(10) = (0.5)^{10} = 0.000977$
= 0.0107 (one-tailed) → 0.02 (two-tailed)
from scipy.stats import binom
p_value = 1 - binom.cdf(8, n=10, p=0.5)
# One-tailed: 0.0107
p_two = 2 * p_value # Two-tailed: 0.02
Two-tailed p ≈ 0.02 because we'd also reject if we got 9+ tails (equally extreme)
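scipy also provides an exact binomial test, `scipy.stats.binomtest`, whose two-sided p-value sums every outcome at least as unlikely as 9 heads (so it lands at ≈ 0.0215 rather than exactly double the one tail):

```python
from scipy.stats import binomtest

# Exact two-sided test: 9 heads in 10 flips of a supposedly fair coin
result = binomtest(9, n=10, p=0.5)
print(f"two-sided p = {result.pvalue:.4f}")   # ≈ 0.0215 → reject H₀ at α = 0.05
```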
t-test: A statistical test used to compare means and determine if the difference is statistically significant. Used when sample size is small or population standard deviation is unknown.
Compare sample mean to known value
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
Compare two group means
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2(1/n_1 + 1/n_2)}}$$
from scipy.stats import ttest_ind

cebu = [850, 920, 780, 900, 870, 950, 820, 890, 880, 910]
manila = [800, 850, 750, 820, 780, 830, 770, 810, 790, 840]
t_stat, p_value = ttest_ind(cebu, manila)
# Output: t≈3.89, p≈0.001 → Significant!
Data simulated based on LTFRB fare structures (hypothetical for illustration)
Step-by-step calculation for two-sample t-test
$\bar{x}_{Cebu} = 877$ | $\bar{x}_{Manila} = 804$
$s^2_{Cebu} = 2490$ | $s^2_{Manila} \approx 1027$ | $n = 10$
$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2} = \frac{9(2490) + 9(1027)}{18} \approx 1758$
$SE = \sqrt{s_p^2 \cdot (\frac{1}{n_1} + \frac{1}{n_2})} = \sqrt{1758 \cdot 0.2} \approx 18.75$
$t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{877 - 804}{18.75} \approx 3.89$
df = 18 → Critical t at α=0.05 is 2.101 → 3.89 > 2.101 → Reject H₀
Purpose: Test independence of categorical variables
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
| Cash | GCash | Card | |
|---|---|---|---|
| Under 30 | 50 | 120 | 30 |
| 30-50 | 80 | 90 | 50 |
| Over 50 | 100 | 40 | 60 |
Illustrative data based on BSP 2021 Financial Inclusion Survey trends (e-wallets most common among ages 15-49)
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows=age groups, cols=payment methods
table = np.array([[50, 120, 30],   # Under 30
                  [80, 90, 50],    # 30-50
                  [100, 40, 60]])  # Over 50
chi2, p_value, dof, expected = chi2_contingency(table)
# Output: chi2≈66.6, p < 0.0001 → NOT independent!
Compare Observed (O) vs Expected (E) counts
$E = \frac{\text{Row Total} \times \text{Col Total}}{\text{Grand Total}}$
| Cash | GCash | Card | |
|---|---|---|---|
| Under 30 | E=74.2 | E=80.6 | E=45.2 |
| 30-50 | E=81.6 | E=88.7 | E=49.7 |
| Over 50 | E=74.2 | E=80.6 | E=45.2 |
$\chi^2 = \sum \frac{(O - E)^2}{E}$
$= \frac{(50-74.2)^2}{74.2} + \frac{(120-80.6)^2}{80.6} + ...$
χ² ≈ 66.6, df = 4
Critical χ² at df=4, α=0.05 is 9.488 → 66.6 >> 9.488 → Reject H₀
A range of plausible values for an unknown population parameter, based on sample data
"How long is your commute?"
You wouldn't say "exactly 47 minutes"
You'd say "usually between 40-55 minutes"
A confidence interval gives this kind of range for statistical estimates
Why not just report the average? Because a single number hides the uncertainty in your estimate.
Confidence Interval: Range likely to contain the true parameter
$$\bar{x} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$$
Correct: 95% of intervals contain the true mean
Wrong: "95% probability the mean is in this interval"
import numpy as np
from scipy import stats

transactions = [150, 200, 180, 250, 175, 300, 220, 190,
                160, 280, 195, 210, 170, 240, 185]
mean = np.mean(transactions)
se = stats.sem(transactions)  # Standard error = s/√n
ci = stats.t.interval(0.95, len(transactions)-1, loc=mean, scale=se)
# Output: mean=₱207.00, 95% CI ≈ (₱182.8, ₱231.2)
Simulated data based on GCash transaction patterns
A controlled experiment where you compare two versions (A and B) to see which performs better on a specific metric
A/B testing is the gold standard for establishing causation, not just correlation.
It answers: "Did this change actually cause the improvement?"
Definition: Controlled experiment comparing two versions
Control (A): Blue button
Treatment (B): Green button
Result: p = 0.04 → Green button wins!
Hypothetical scenario based on Lazada PH e-commerce patterns
Problem: Running many tests increases false positives!
10 tests at α = 0.05:
$$P(\text{at least one false positive}) = 1 - 0.95^{10} \approx 40\%!$$
$$\alpha' = \frac{\alpha}{n}$$
For 10 tests: α' = 0.005
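Both numbers fit in a few lines:

```python
# Family-wise error rate without correction, and the Bonferroni fix
alpha, n_tests = 0.05, 10

# If all 10 null hypotheses are true, P(≥1 false positive) = 1 − (1 − α)^10
p_any_false_positive = 1 - (1 - alpha) ** n_tests
alpha_bonferroni = alpha / n_tests   # Bonferroni: α' = α / n

print(f"P(≥1 false positive) = {p_any_false_positive:.3f}")  # ≈ 0.401
print(f"Bonferroni-corrected α = {alpha_bonferroni}")        # 0.005
```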
Effect Size: A standardized measure of the magnitude of a difference, independent of sample size. It tells you how big the effect is, not just whether it exists.
Statistical significance ≠ Practical significance
Large samples can make tiny effects "statistically significant"
Example: A drug reduces blood pressure by 0.5 mmHg (p < 0.001 with n=100,000) — statistically significant but clinically meaningless!
$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}$$
Difference in means ÷ Pooled standard deviation
| Interpretation | Cohen's d | Meaning |
|---|---|---|
| Small | 0.2 | Barely noticeable |
| Medium | 0.5 | Visible to careful observer |
| Large | 0.8 | Obvious to anyone |
Always report effect size alongside p-values!
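A sketch computing Cohen's d, reusing the jeepney-fare samples from the earlier t-test example:

```python
import numpy as np

# Jeepney-fare samples from the two-sample t-test example (in centavos)
cebu = np.array([850, 920, 780, 900, 870, 950, 820, 890, 880, 910])
manila = np.array([800, 850, 750, 820, 780, 830, 770, 810, 790, 840])

n1, n2 = len(cebu), len(manila)
# Pooled standard deviation: weighted average of the two sample variances
s_pooled = np.sqrt(((n1 - 1) * cebu.var(ddof=1) + (n2 - 1) * manila.var(ddof=1))
                   / (n1 + n2 - 2))
d = (cebu.mean() - manila.mean()) / s_pooled
print(f"Cohen's d = {d:.2f}")   # ≈ 1.74 → well past the 0.8 "large" threshold
```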
Data:
Tasks:
| Conv | No Conv | |
|---|---|---|
| Control | 45 | 455 |
| Treatment | 60 | 440 |
$\chi^2 = 2.40$, df = 1
p ≈ 0.12 → Not significant at α=0.05
Difference = 12% − 9% = 3%
$SE = \sqrt{\frac{0.09 \times 0.91}{500} + \frac{0.12 \times 0.88}{500}} = 0.019$
95% CI = 3% ± 3.8%
CI: (−0.8%, 6.8%)
Contains 0 → not significant
3. Practical Significance: 3% lift = 15 extra conversions → ₱7,500 gain (if ₱500/conversion). Worth it if cost < ₱7,500!
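The exercise's arithmetic can be checked with scipy (`correction=False` disables the Yates continuity correction so the χ² matches the hand calculation):

```python
import numpy as np
from scipy.stats import chi2_contingency

# A/B test results: rows = variant, cols = (converted, not converted)
table = np.array([[45, 455],    # Control:   45/500 = 9% conversion
                  [60, 440]])   # Treatment: 60/500 = 12% conversion

chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")   # ≈ 2.39, p ≈ 0.122 → not significant

# 95% CI for the difference in proportions (normal approximation)
p1, p2, n = 0.09, 0.12, 500
se = np.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
ci = ((p2 - p1) - 1.96 * se, (p2 - p1) + 1.96 * se)
print(f"95% CI for the lift: ({ci[0]:+.3f}, {ci[1]:+.3f})")   # contains 0
```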
Datasets: PSA OpenSTAT Labor Force Survey, simulated GCash transaction patterns
Topics: