CMSC 173

Module 03: Regularization


Regularization

CMSC 173 - Module 03

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

The Problem of Overfitting

A model that fits training data too well may fail to generalize to new data.

Underfitting (High Bias)

  • Model too simple
  • High training error
  • High test error

Overfitting (High Variance)

  • Model too complex
  • Low training error
  • High test error
Goal: Find the sweet spot — complex enough to capture patterns, simple enough to generalize.
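The bias-variance trade-off above is easy to see empirically. A minimal sketch (synthetic data and degree choices are illustrative): fitting polynomials of increasing degree to noisy samples of a sine curve, the training error keeps falling while the test error bottoms out at a moderate degree.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)
X_test = np.linspace(0, 1, 100).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test).ravel()

errors = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    errors[degree] = (mean_squared_error(y, model.predict(X)),
                      mean_squared_error(y_test, model.predict(X_test)))
    print(f"degree {degree:2d}: train MSE {errors[degree][0]:.3f}, "
          f"test MSE {errors[degree][1]:.3f}")
```

Degree 1 underfits (high train and test error), degree 15 overfits (near-zero train error, worse test error), and the moderate degree sits near the sweet spot.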

What is Regularization?

Regularization: Adding a penalty term to the cost function to discourage overly complex models.
$$J_{regularized}(\theta) = J(\theta) + \lambda \cdot R(\theta)$$

Key Components

  • $J(\theta)$: Original loss (e.g., MSE)
  • $\lambda$: Regularization strength (hyperparameter)
  • $R(\theta)$: Penalty term (depends on weights)
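The three components can be sketched directly in NumPy; here $R(\theta)$ is the L2 penalty introduced in the next slide, and the function name is illustrative. By convention the intercept $\theta_0$ is left unpenalized.

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """J(theta) + lam * R(theta): MSE loss plus an L2 penalty.

    The intercept theta[0] is conventionally not penalized.
    """
    m = len(y)
    residuals = X @ theta - y
    mse = (residuals @ residuals) / (2 * m)   # J(theta): original loss
    penalty = lam * np.sum(theta[1:] ** 2)    # lam * R(theta)
    return mse + penalty

# tiny check: a perfect fit leaves only the penalty term
X = np.array([[1.0, 1.0], [1.0, 2.0]])   # first column is the bias feature
y = np.array([1.0, 2.0])
theta = np.array([0.0, 1.0])             # fits y = x exactly
print(ridge_cost(theta, X, y, lam=0.5))  # 0.0 loss + 0.5 * 1^2 = 0.5
```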

Knowledge Check

Think About It

What happens if λ is too large?

Click the blurred area to reveal the answer

Ridge Regression (L2)

Ridge (L2): Penalizes the sum of squared weights.
$$J_{ridge}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^{n}\theta_j^2$$

Properties

  • Shrinks all coefficients
  • Never sets coefficients exactly to zero
  • Keeps all features

When to Use

  • Many small/medium effects
  • All features potentially relevant
  • Multicollinearity present
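Ridge also admits a closed-form solution: for the unnormalized objective $\|y - X\theta\|^2 + \lambda\|\theta\|^2$ (sklearn's convention; the $\frac{1}{2m}$-scaled loss above only rescales $\lambda$), the minimizer is $\theta = (X^TX + \lambda I)^{-1}X^Ty$. A sketch, checked against sklearn:

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y directly."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)

theta = ridge_closed_form(X, y, lam=1.0)
sk = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
print(np.allclose(theta, sk.coef_))  # the two solutions agree
```

Adding $\lambda I$ makes the matrix invertible even when $X^TX$ is singular, which is why Ridge copes well with multicollinearity.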

Lasso Regression (L1)

Lasso (L1): Penalizes the sum of absolute weights.
$$J_{lasso}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^{n}|\theta_j|$$

Properties

  • Can set coefficients to exactly zero
  • Performs feature selection
  • Produces sparse models

When to Use

  • Many features, few important
  • Want automatic feature selection
  • Need interpretable models
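The sparsity claim is easy to verify empirically. A sketch with synthetic data in which only 3 of 20 features actually drive the target (the sizes and penalty strengths are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
# only the first 3 of 20 features matter; the rest are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(0, 0.5, 200)

X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # many exact zeros
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically none
```

Lasso zeroes out most of the irrelevant features, while Ridge keeps every coefficient small but nonzero.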

Ridge vs Lasso: Geometric View

The shape of the constraint region determines which coefficients become zero.

Ridge (Circle)

  • Smooth boundary
  • Optimal point rarely at corners
  • Coefficients shrink smoothly toward zero

Lasso (Diamond)

  • Sharp corners at axes
  • Optimal point often at corners
  • Some coefficients become zero
The optimum lies where the loss contours first touch the constraint region; Lasso's corners sit on the axes, so that touch often zeroes a coefficient.

Knowledge Check

Think About It

Why does Lasso produce sparse solutions while Ridge doesn't?


Elastic Net

Elastic Net: Combines L1 and L2 regularization.
$$J_{elastic}(\theta) = J(\theta) + \lambda_1\sum_{j}|\theta_j| + \lambda_2\sum_{j}\theta_j^2$$

Benefits

  • Feature selection (from L1)
  • Handles correlated features (from L2)
  • More flexible than either alone
Mix ratio: Control balance between L1 and L2 with a single parameter $\alpha \in [0,1]$.
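Setting $\lambda_1 = \lambda\alpha$ and $\lambda_2 = \lambda(1-\alpha)$ collapses the two penalties into one strength $\lambda$ and one mix ratio $\alpha$. A sketch of that parametrization (sklearn's ElasticNet uses the same idea with l1_ratio as $\alpha$, plus an extra ½ on the L2 term):

```python
import numpy as np

def elastic_net_penalty(theta, lam, alpha):
    """lam * (alpha * ||theta||_1 + (1 - alpha) * ||theta||_2^2)."""
    theta = np.asarray(theta, dtype=float)
    l1 = np.abs(theta).sum()
    l2 = (theta ** 2).sum()
    return lam * (alpha * l1 + (1 - alpha) * l2)

theta = np.array([1.0, -2.0])
print(elastic_net_penalty(theta, lam=1.0, alpha=1.0))  # 3.0 (pure Lasso)
print(elastic_net_penalty(theta, lam=1.0, alpha=0.0))  # 5.0 (pure Ridge)
print(elastic_net_penalty(theta, lam=1.0, alpha=0.5))  # 4.0 (even mix)
```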

Choosing λ (Regularization Strength)

$\lambda$ is a hyperparameter: it is not learned from the data, so we must choose it ourselves.

Selection Methods

  • Cross-validation: Try different λ values, pick best validation performance
  • Grid search: Test λ ∈ {0.001, 0.01, 0.1, 1, 10, 100}
  • Regularization path: Plot coefficients vs λ
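The cross-validation and grid-search steps above can be sketched with sklearn's built-in CV estimators (the data and grid here are illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.3, 150)
X_scaled = StandardScaler().fit_transform(X)

# grid of candidate strengths; the best is picked by cross-validation
ridge = RidgeCV(alphas=[0.001, 0.01, 0.1, 1, 10, 100]).fit(X_scaled, y)
lasso = LassoCV(cv=5).fit(X_scaled, y)  # builds its own alpha path

print("best ridge alpha:", ridge.alpha_)
print("best lasso alpha:", round(lasso.alpha_, 4))
```

Note that sklearn calls the strength `alpha`, which plays the role of $\lambda$ in the formulas above.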

λ too small

→ Little regularization

→ Risk of overfitting

λ too large

→ Too much regularization

→ Risk of underfitting

Feature Scaling Matters

Important: Always scale features before applying regularization!
Regularization penalizes large weights. If features have different scales, their weights will differ just due to scale, not importance.

Standard Scaling

$$x_{scaled} = \frac{x - \mu}{\sigma}$$

After scaling, all features compete on equal footing for the regularization penalty.
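A quick sketch of standard scaling by hand, with two features on very different scales; after the transform each column has mean 0 and standard deviation 1:

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# standard scaling: subtract the column mean, divide by the column std
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```

In practice, use sklearn's StandardScaler (as in the next slide) so the same means and stds learned on training data are reused on test data.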

Regularization in Practice

Python Implementation

from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler

# Always scale first!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X_scaled, y)

# Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)

# Elastic Net
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X_scaled, y)

Summary

Key Takeaways

  • Regularization prevents overfitting by penalizing complex models
  • Ridge (L2): Shrinks all coefficients, handles multicollinearity
  • Lasso (L1): Feature selection, produces sparse models
  • Elastic Net: Combines benefits of both
  • λ selection: Use cross-validation
  • Feature scaling: Essential for fair regularization
Next: Exploratory Data Analysis — understanding your data before modeling.

End of Module 03

Regularization

Questions?