Logistic Regression
CMSC 173 - Module 08
Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu
Binary Classification
Classification problems involve predicting discrete categories rather than continuous values.
Binary Classification: Predict one of two classes (positive/negative, yes/no, 1/0)
Examples
- Email spam detection
- Disease diagnosis
- Loan approval
- Fraud detection
- Image recognition (cat vs dog)
Why Not Linear Regression?
- Outputs should be probabilities in $[0,1]$
- Linear regression can produce values outside this range
- Thresholding a fitted line gives a poor decision rule
- Least squares is sensitive to outliers, which shift the boundary
The Sigmoid Function
The sigmoid (logistic) function maps any real number to the range (0,1), making it perfect for probability estimation.
Sigmoid Formula
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Key Properties
- Output range: $(0, 1)$
- $\sigma(0) = 0.5$
- $\sigma(-\infty) \to 0$
- $\sigma(+\infty) \to 1$
- Smooth, differentiable
Interpretation
- S-shaped curve
- Symmetric around 0.5
- Steep gradient near 0
- Saturates at extremes
Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ — useful for gradient descent!
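As a minimal sketch, the sigmoid and its derivative can be written in NumPy (the function names here are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5 -- the neutral point
print(sigmoid_derivative(0.0))  # 0.25 -- the slope is steepest at z = 0
```

Both functions accept NumPy arrays as well as scalars, which is what vectorized gradient descent later relies on.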
Knowledge Check
Think About It
What is the output of the sigmoid function when z = 0?
When z = 0, $\sigma(0) = \frac{1}{1 + e^0} = \frac{1}{2} = 0.5$. This is the neutral point where the model is equally uncertain between both classes.
Logistic Regression Model
Logistic Regression Hypothesis: Applies sigmoid to linear combination of features.
Model Equation
$$h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
This gives us $P(y=1|x;\theta)$ — the probability that the output is class 1 given input $x$.
Components
- $x$ — feature vector
- $\theta$ — weight vector
- $\theta^T x$ — linear combination
- $h_\theta(x)$ — predicted probability
Making Predictions
- If $h_\theta(x) \geq 0.5$ → predict class 1
- If $h_\theta(x) < 0.5$ → predict class 0
- Threshold can be adjusted
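The hypothesis and thresholding rule above can be sketched as follows; the weights and the two data points are hypothetical toy values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    """h_theta(x) = sigma(theta^T x), computed row-wise over X."""
    return sigmoid(X @ theta)

def predict(theta, X, threshold=0.5):
    """Predict class 1 when the probability meets the threshold."""
    return (predict_proba(theta, X) >= threshold).astype(int)

# Hypothetical weights: bias theta_0 = -1, theta_1 = 2.
theta = np.array([-1.0, 2.0])
X = np.array([[1.0, 0.0],    # theta^T x = -1 -> probability ~0.27 -> class 0
              [1.0, 1.0]])   # theta^T x = +1 -> probability ~0.73 -> class 1
print(predict(theta, X))  # [0 1]
```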
Decision Boundaries
Decision Boundary: The line/surface where $P(y=1) = P(y=0) = 0.5$
Since $\sigma(0) = 0.5$, the decision boundary occurs where $\theta^T x = 0$
Linear Decision Boundary
For 2D: $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$
This is a straight line separating the two classes.
Linear Boundaries
- Simple logistic regression
- Straight lines/hyperplanes
- Fast to compute
- Works for linearly separable data
Non-linear Boundaries
- Add polynomial features
- Example: $x_1^2, x_2^2, x_1 x_2$
- Can create circular/curved boundaries
- More flexible but risk overfitting
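For the 2D linear case, the boundary line can be recovered by solving $\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$ for $x_2$; the weights below are hypothetical:

```python
# Hypothetical 2D weights for theta_0 + theta_1*x1 + theta_2*x2 = 0.
theta0, theta1, theta2 = -3.0, 1.0, 1.0

def boundary_x2(x1):
    """Solve theta_0 + theta_1*x1 + theta_2*x2 = 0 for x2."""
    return (-theta0 - theta1 * x1) / theta2

# The boundary x1 + x2 = 3 passes through (0, 3) and (3, 0).
print(boundary_x2(0.0))  # 3.0
print(boundary_x2(3.0))  # 0.0
```

With polynomial features such as $x_1^2$ and $x_2^2$, the same solve-for-zero idea yields curved boundaries instead of a line.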
The Log Loss Function
Problem: Mean squared error doesn't work well for classification — the cost surface is non-convex.
Log Loss (Cross-Entropy): Penalizes confident wrong predictions heavily.
Cost Function
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right]$$
When $y=1$
- Cost = $-\log(h_\theta(x))$
- If $h_\theta(x) \to 1$: cost $\to 0$ (good!)
- If $h_\theta(x) \to 0$: cost $\to \infty$ (bad!)
When $y=0$
- Cost = $-\log(1-h_\theta(x))$
- If $h_\theta(x) \to 0$: cost $\to 0$ (good!)
- If $h_\theta(x) \to 1$: cost $\to \infty$ (bad!)
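A minimal log-loss sketch; the clipping step is an added numerical safeguard (not in the formula above) to avoid evaluating $\log(0)$:

```python
import numpy as np

def log_loss(y, h, eps=1e-12):
    """Cross-entropy cost; clipping h avoids log(0)."""
    h = np.clip(h, eps, 1.0 - eps)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

y = np.array([1.0, 0.0])
confident_right = np.array([0.9, 0.1])  # high prob. for the true class
confident_wrong = np.array([0.1, 0.9])  # high prob. for the wrong class
print(log_loss(y, confident_right))  # ~0.105
print(log_loss(y, confident_wrong))  # ~2.303 -- heavily penalized
```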
Knowledge Check
Think About It
Why do we use log loss instead of mean squared error for classification?
Log loss creates a convex cost surface, ensuring gradient descent finds the global minimum. MSE with sigmoid creates a non-convex surface with local minima. Log loss also heavily penalizes confident wrong predictions, which is desirable for classification.
Gradient Descent for Logistic Regression
Despite a different cost function, the gradient descent update rule looks identical to that of linear regression!
Gradient Formula
$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$
Update Rule
$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$$
The key difference: $h_\theta(x)$ now means $\sigma(\theta^T x)$, not $\theta^T x$.
Learning Rate: Typically smaller than for linear regression (0.001 to 0.1). Monitor convergence carefully.
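The update rule can be sketched as batch gradient descent on a tiny, hypothetical separable dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Batch gradient descent on log loss; X must include a bias column."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)      # h_theta(x) = sigma(theta^T x)
        grad = X.T @ (h - y) / m    # (1/m) * sum_i (h - y) x_j
        theta -= alpha * grad
    return theta

# Hypothetical 1-D separable data; the first column is the bias term.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)
print(preds)  # [0 0 1 1]
```

The gradient line is the whole difference from linear regression: swap `sigmoid(X @ theta)` for `X @ theta` and this is exactly the linear-regression update.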
Regularization in Logistic Regression
Regularization: Prevents overfitting by penalizing large parameter values.
Regularized Cost Function
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1-h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2$$
L2 Regularization (Ridge)
- Adds $\frac{\lambda}{2m} \sum \theta_j^2$
- Shrinks all weights
- Smooth decision boundaries
- Note: Don't regularize $\theta_0$
Regularization Parameter $\lambda$
- $\lambda = 0$: No regularization
- Small $\lambda$: Slight penalty
- Large $\lambda$: Strong penalty, underfitting
- Use cross-validation to tune
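A sketch of the regularized cost, skipping $\theta_0$ as noted above; the data and weights are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(theta, X, y, lam):
    """Log loss plus an L2 penalty; theta[0] (the bias) is not penalized."""
    m = len(y)
    h = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    loss = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)  # j starts at 1
    return loss + penalty

# Hypothetical data; the first column of X is the bias term.
X = np.array([[1.0, 2.0], [1.0, -2.0]])
y = np.array([1.0, 0.0])
theta = np.array([0.0, 1.0])
print(regularized_cost(theta, X, y, lam=0.0))  # plain log loss
print(regularized_cost(theta, X, y, lam=1.0))  # adds lambda/(2m) * theta_1^2 = 0.25
```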
Multiclass Classification: One-vs-All
Extend binary logistic regression to K classes using multiple binary classifiers.
One-vs-All (One-vs-Rest): Train K separate binary classifiers, each distinguishing one class from all others.
Algorithm
- For each class $k$, train a classifier $h_\theta^{(k)}(x)$
- Treat class $k$ as positive, all others as negative
- Results in K weight vectors: $\theta^{(1)}, \theta^{(2)}, ..., \theta^{(K)}$
- To predict: Choose class with highest probability
$$\text{prediction} = \arg\max_k h_\theta^{(k)}(x)$$
Note: The K probabilities may not sum to 1 since classifiers are independent.
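A one-vs-all sketch reusing the gradient-descent update from earlier; the three-class toy data and the hyperparameters are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, K, alpha=0.5, iters=5000):
    """Train K binary classifiers; classifier k treats class k as positive."""
    m, n = X.shape
    Theta = np.zeros((K, n))            # one weight vector per class
    for k in range(K):
        yk = (y == k).astype(float)     # class k vs. all the rest
        for _ in range(iters):
            h = sigmoid(X @ Theta[k])
            Theta[k] -= alpha * X.T @ (h - yk) / m
    return Theta

def predict_ova(Theta, X):
    """Pick the class whose classifier assigns the highest probability."""
    return np.argmax(sigmoid(X @ Theta.T), axis=1)

# Hypothetical 3-class toy data; the first column is the bias term.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
y = np.array([0, 1, 2])
Theta = train_one_vs_all(X, y, K=3)
print(predict_ova(Theta, X))
```

Note that `sigmoid(X @ Theta.T)` rows need not sum to 1, which is exactly the caveat above; `argmax` sidesteps it by comparing the K scores directly.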
Knowledge Check
Think About It
For 5-class classification, how many binary classifiers do we train with one-vs-all?
We train 5 binary classifiers, one for each class. Each classifier learns to distinguish one class from all the other 4 classes combined.
Softmax Regression (Multinomial Logistic)
Softmax Regression: Direct extension to multiclass that ensures probabilities sum to 1.
Softmax Function
$$P(y=k|x) = \frac{e^{\theta_k^T x}}{\sum_{j=1}^{K} e^{\theta_j^T x}}$$
Each class has its own parameter vector $\theta_k$. The denominator normalizes to ensure valid probabilities.
Properties
- Outputs sum to 1
- Generalizes sigmoid
- Single unified model
- More elegant than one-vs-all
Cost Function
- Cross-entropy loss
- $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \mathbb{1}\{y^{(i)}=k\} \log P(y^{(i)}=k|x^{(i)})$
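A softmax sketch; subtracting the row maximum before exponentiating is a standard numerical-stability trick not shown in the formula above:

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    Z = Z - Z.max(axis=1, keepdims=True)
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical scores theta_k^T x for two examples and K = 3 classes.
scores = np.array([[2.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0]])
P = softmax(scores)
print(P.sum(axis=1))  # each row sums to 1: a valid distribution
print(P[1])           # equal scores give a uniform distribution
```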
Model Evaluation Metrics
Accuracy alone can be misleading, especially with imbalanced datasets!
Basic Metrics
- Accuracy: Fraction of correct predictions
- Error rate: $1 - \text{accuracy}$
- Simple but insufficient
Confusion Matrix Components
- True Positives (TP)
- True Negatives (TN)
- False Positives (FP)
- False Negatives (FN)
Example: 99% accuracy on spam detection sounds great, but if only 1% of emails are spam, predicting "not spam" for everything gives 99% accuracy while catching zero spam!
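The spam example can be checked numerically; the counts below mirror the 1%-spam scenario:

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

# 1 spam email in 100; the classifier always says "not spam".
y_true = np.array([1] + [0] * 99)
y_pred = np.zeros(100, dtype=int)
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / 100
print(accuracy)  # 0.99 -- yet tp == 0: no spam is ever caught
```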
Advanced Considerations
Optimization Tips
- Feature scaling is crucial
- Use mini-batch or stochastic GD for large datasets
- Monitor log loss during training
- Early stopping to prevent overfitting
Common Pitfalls
- Forgetting to normalize features
- Using MSE instead of log loss
- Not regularizing with many features
- Imbalanced class distributions
When to Use Logistic Regression
- Binary or multiclass classification
- Need probabilistic predictions
- Want interpretable model
- Baseline before complex models
Summary
Key Takeaways
- Sigmoid function maps the linear output to a probability in $(0,1)$
- Log loss (cross-entropy) is the proper cost function
- Decision boundaries separate classes, can be non-linear with feature engineering
- Gradient descent optimizes parameters, similar update rule to linear regression
- Regularization prevents overfitting in high-dimensional spaces
- One-vs-all or softmax extend to multiclass problems
- Evaluation requires more than accuracy
Next: Classification metrics and other classification algorithms.
End of Module 08
Logistic Regression
Questions?