Logistic Regression
CMSC 173 - Module 08
Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu
Binary Classification
Classification problems involve predicting discrete categories rather than continuous values.
Binary Classification: Predict one of two classes (positive/negative, yes/no, 1/0)
Examples
- Email spam detection
- Disease diagnosis
- Loan approval
- Fraud detection
- Image recognition (cat vs dog)
Why Not Linear Regression?
- Outputs should be probabilities in $[0,1]$
- Linear regression can produce values outside this range
- Thresholding a fitted line gives a poor decision rule
- Least squares is sensitive to outliers, which shift the boundary
The Sigmoid Function
The sigmoid (logistic) function maps any real number to the range (0,1), making it perfect for probability estimation.
Sigmoid Formula
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Key Properties
- Output range: $(0, 1)$
- $\sigma(0) = 0.5$
- $\sigma(-\infty) \to 0$
- $\sigma(+\infty) \to 1$
- Smooth, differentiable
Interpretation
- S-shaped curve
- Symmetric around 0.5
- Steep gradient near 0
- Saturates at extremes
Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ — useful for gradient descent!
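As a minimal sketch, the sigmoid and its derivative can be written in NumPy (the function names here are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5 -- the neutral point
print(sigmoid_derivative(0.0))  # 0.25 -- the slope is steepest at z = 0
```

Both functions accept NumPy arrays as well as scalars, which is what vectorized gradient descent later relies on.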
Knowledge Check
Think About It
What is the output of the sigmoid function when z = 0?
When z = 0, $\sigma(0) = \frac{1}{1 + e^0} = \frac{1}{2} = 0.5$. This is the neutral point where the model is equally uncertain between both classes.
Logistic Regression Model
Logistic Regression Hypothesis: Applies sigmoid to linear combination of features.
Model Equation
$$h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
This gives us $P(y=1|x;\theta)$ — the probability that the output is class 1 given input $x$.
Components
- $x$ — feature vector
- $\theta$ — weight vector
- $\theta^T x$ — linear combination
- $h_\theta(x)$ — predicted probability
Making Predictions
- If $h_\theta(x) \geq 0.5$ → predict class 1
- If $h_\theta(x) < 0.5$ → predict class 0
- Threshold can be adjusted
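The hypothesis and thresholding rule above can be sketched as follows; the weights and the two data points are hypothetical toy values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """h_theta(x) = sigma(theta^T x), computed row-wise over X."""
    return sigmoid(X @ theta)

def predict(theta, X, threshold=0.5):
    """Predict class 1 when the probability meets the threshold."""
    return (predict_proba(theta, X) >= threshold).astype(int)

# Hypothetical weights: bias theta_0 = -1, theta_1 = 2.
theta = np.array([-1.0, 2.0])
X = np.array([[1.0, 0.0],    # theta^T x = -1 -> probability ~0.27 -> class 0
              [1.0, 1.0]])   # theta^T x = +1 -> probability ~0.73 -> class 1
print(predict(theta, X))  # [0 1]
```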
Decision Boundaries
Decision Boundary: The line/surface where $P(y=1) = P(y=0) = 0.5$
Since $\sigma(0) = 0.5$, the decision boundary occurs where $\theta^T x = 0$
Linear Decision Boundary
For 2D: $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$
This is a straight line separating the two classes.
Linear Boundaries
- Simple logistic regression
- Straight lines/hyperplanes
- Fast to compute
- Works for linearly separable data
Non-linear Boundaries
- Add polynomial features
- Example: $x_1^2, x_2^2, x_1 x_2$
- Can create circular/curved boundaries
- More flexible but risk overfitting
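For the 2D linear case, the boundary line can be recovered by solving $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$ for $x_2$; the weights below are hypothetical:

```python
# Hypothetical 2D weights for theta_0 + theta_1*x1 + theta_2*x2 = 0.
theta0, theta1, theta2 = -3.0, 1.0, 1.0

def boundary_x2(x1):
    """Solve theta_0 + theta_1*x1 + theta_2*x2 = 0 for x2."""
    return (-theta0 - theta1 * x1) / theta2

# The boundary x1 + x2 = 3 passes through (0, 3) and (3, 0).
print(boundary_x2(0.0))  # 3.0
print(boundary_x2(3.0))  # 0.0
```

With polynomial features such as $x_1^2$ and $x_2^2$, the same solve-for-zero idea yields curved boundaries instead of a line.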
The Log Loss Function
Problem: Mean squared error doesn't work well for classification — the cost surface is non-convex.
Log Loss (Cross-Entropy): Penalizes confident wrong predictions heavily.
Cost Function
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right]$$
When $y=1$
- Cost = $-\log(h_\theta(x))$
- If $h_\theta(x) \to 1$: cost $\to 0$ (good!)
- If $h_\theta(x) \to 0$: cost $\to \infty$ (bad!)
When $y=0$
- Cost = $-\log(1-h_\theta(x))$
- If $h_\theta(x) \to 0$: cost $\to 0$ (good!)
- If $h_\theta(x) \to 1$: cost $\to \infty$ (bad!)
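A minimal log-loss sketch; the clipping step is an added numerical safeguard (not in the formula above) to avoid evaluating $\log(0)$:

```python
import numpy as np

def log_loss(y, h, eps=1e-12):
    """Cross-entropy cost; clipping h avoids log(0)."""
    h = np.clip(h, eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

y = np.array([1.0, 0.0])
confident_right = np.array([0.9, 0.1])  # high prob. for the true class
confident_wrong = np.array([0.1, 0.9])  # high prob. for the wrong class
print(log_loss(y, confident_right))  # ~0.105
print(log_loss(y, confident_wrong))  # ~2.303 -- heavily penalized
```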
Knowledge Check
Think About It
Why do we use log loss instead of mean squared error for classification?
Log loss creates a convex cost surface, ensuring gradient descent finds the global minimum. MSE with sigmoid creates a non-convex surface with local minima. Log loss also heavily penalizes confident wrong predictions, which is desirable for classification.
Gradient Descent for Logistic Regression
Despite a different cost function, the gradient descent update rule looks identical to that of linear regression!
Gradient Formula
$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$
Update Rule
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$
The key difference: $h_\theta(x)$ now means $\sigma(\theta^T x)$, not $\theta^T x$.
Learning Rate: Typically smaller than for linear regression (0.001 to 0.1). Monitor convergence carefully.
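The update rule can be sketched as batch gradient descent on a tiny, hypothetical separable dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Batch gradient descent on log loss; X must include a bias column."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)      # h_theta(x) = sigma(theta^T x)
        grad = X.T @ (h - y) / m    # (1/m) * sum_i (h - y) x_j
        theta -= alpha * grad
    return theta

# Hypothetical 1-D separable data; the first column is the bias term.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)
print(preds)  # [0 0 1 1]
```

The gradient line is the whole difference from linear regression: swap `sigmoid(X @ theta)` for `X @ theta` and this is exactly the linear-regression update.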
Regularization in Logistic Regression
Regularization: Prevents overfitting by penalizing large parameter values.
Regularized Cost Function
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$
L2 Regularization (Ridge)
- Adds $\frac{\lambda}{2m} \sum \theta_j^2$
- Shrinks all weights
- Smooth decision boundaries
- Note: Don't regularize $\theta_0$
Regularization Parameter $\lambda$
- $\lambda = 0$: No regularization
- Small $\lambda$: Slight penalty
- Large $\lambda$: Strong penalty, underfitting
- Use cross-validation to tune
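A sketch of the regularized cost, skipping $\theta_0$ as noted above; the data and weights are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(theta, X, y, lam):
    """Log loss plus an L2 penalty; theta[0] (the bias) is not penalized."""
    m = len(y)
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    loss = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)  # j starts at 1
    return loss + penalty

# Hypothetical data; the first column of X is the bias term.
X = np.array([[1.0, 2.0], [1.0, -2.0]])
y = np.array([1.0, 0.0])
theta = np.array([0.0, 1.0])
print(regularized_cost(theta, X, y, lam=0.0))  # plain log loss
print(regularized_cost(theta, X, y, lam=1.0))  # adds lambda/(2m) * theta_1^2 = 0.25
```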
Multiclass Classification: One-vs-All
Extend binary logistic regression to K classes using multiple binary classifiers.
One-vs-All (One-vs-Rest): Train K separate binary classifiers, each distinguishing one class from all others.
Algorithm
- For each class $k$, train a classifier $h_\theta^{(k)}(x)$
- Treat class $k$ as positive, all others as negative
- Results in K weight vectors: $\theta^{(1)}, \theta^{(2)}, ..., \theta^{(K)}$
- To predict: Choose class with highest probability
$$\text{prediction} = \arg\max_k h_\theta^{(k)}(x)$$
Note: The K probabilities may not sum to 1 since classifiers are independent.
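A one-vs-all sketch reusing the gradient-descent update from earlier; the three-class toy data and the hyperparameters are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, K, alpha=0.5, iters=5000):
    """Train K binary classifiers; classifier k treats class k as positive."""
    m, n = X.shape
    Theta = np.zeros((K, n))            # one weight vector per class
    for k in range(K):
        yk = (y == k).astype(float)     # class k vs. all the rest
        for _ in range(iters):
            h = sigmoid(X @ Theta[k])
            Theta[k] -= alpha * X.T @ (h - yk) / m
    return Theta

def predict_ova(Theta, X):
    """Pick the class whose classifier assigns the highest probability."""
    return np.argmax(sigmoid(X @ Theta.T), axis=1)

# Hypothetical 3-class toy data; the first column is the bias term.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
y = np.array([0, 1, 2])
Theta = train_one_vs_all(X, y, K=3)
print(predict_ova(Theta, X))
```

Note that `sigmoid(X @ Theta.T)` rows need not sum to 1, which is exactly the caveat above; `argmax` sidesteps it by comparing the K scores directly.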
Knowledge Check
Think About It
For 5-class classification, how many binary classifiers do we train with one-vs-all?
We train 5 binary classifiers, one for each class. Each classifier learns to distinguish one class from all the other 4 classes combined.
Softmax Regression (Multinomial Logistic)
Softmax Regression: Direct extension to multiclass that ensures probabilities sum to 1.
Softmax Function
$$P(y=k|x) = \frac{e^{\theta_k^T x}}{\sum_{j=1}^{K} e^{\theta_j^T x}}$$
Each class has its own parameter vector $\theta_k$. The denominator normalizes to ensure valid probabilities.
Properties
- Outputs sum to 1
- Generalizes sigmoid
- Single unified model
- More elegant than one-vs-all
Cost Function
- Cross-entropy loss
- $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \mathbb{1}\{y^{(i)}=k\} \log P(y^{(i)}=k|x^{(i)})$
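A softmax sketch; subtracting the row maximum before exponentiating is a standard numerical-stability trick not shown in the formula above:

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical scores theta_k^T x for two examples and K = 3 classes.
scores = np.array([[2.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0]])
P = softmax(scores)
print(P.sum(axis=1))  # each row sums to 1: a valid distribution
print(P[1])           # equal scores give a uniform distribution
```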
Model Evaluation Metrics
Accuracy alone can be misleading, especially with imbalanced datasets!
Basic Metrics
- Accuracy: Fraction of correct predictions
- Error rate: $1 - \text{accuracy}$
- Simple but insufficient
Confusion Matrix Components
- True Positives (TP)
- True Negatives (TN)
- False Positives (FP)
- False Negatives (FN)
Example: 99% accuracy on spam detection sounds great, but if only 1% of emails are spam, predicting "not spam" for everything gives 99% accuracy while catching zero spam!
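The spam example can be checked numerically; the counts below mirror the 1%-spam scenario:

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

# 1 spam email in 100; the classifier always says "not spam".
y_true = np.array([1] + [0] * 99)
y_pred = np.zeros(100, dtype=int)
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / 100
print(accuracy)  # 0.99 -- yet tp == 0: no spam is ever caught
```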
Advanced Considerations
Optimization Tips
- Feature scaling is crucial
- Use mini-batch or stochastic GD for large datasets
- Monitor log loss during training
- Early stopping to prevent overfitting
Common Pitfalls
- Forgetting to normalize features
- Using MSE instead of log loss
- Not regularizing with many features
- Imbalanced class distributions
When to Use Logistic Regression
- Binary or multiclass classification
- Need probabilistic predictions
- Want interpretable model
- Baseline before complex models
Summary
Key Takeaways
- Sigmoid function maps the linear output to a probability in $(0,1)$
- Log loss (cross-entropy) is the proper cost function
- Decision boundaries separate classes, can be non-linear with feature engineering
- Gradient descent optimizes parameters, similar update rule to linear regression
- Regularization prevents overfitting in high-dimensional spaces
- One-vs-all or softmax extend to multiclass problems
- Evaluation requires more than accuracy
Next: Classification metrics and other classification algorithms.
End of Module 08
Logistic Regression
Questions?