CMSC 173

Module 03: Regularization


Regularization

CMSC 173 - Module 03

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

The Problem of Overfitting

A model that fits training data too well may fail to generalize to new data.

Underfitting (High Bias)

  • Model too simple
  • High training error
  • High test error

Overfitting (High Variance)

  • Model too complex
  • Low training error
  • High test error
Goal: Find the sweet spot — complex enough to capture patterns, simple enough to generalize.
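The bias-variance trade-off above is easy to see empirically. A minimal sketch (synthetic data and degree choices are illustrative): fitting polynomials of increasing degree to noisy samples of a sine curve, the training error keeps falling while the test error bottoms out at a moderate degree.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)
X_test = np.linspace(0, 1, 100).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test).ravel()

errors = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    errors[degree] = (mean_squared_error(y, model.predict(X)),
                      mean_squared_error(y_test, model.predict(X_test)))
    print(f"degree {degree:2d}: train MSE {errors[degree][0]:.3f}, "
          f"test MSE {errors[degree][1]:.3f}")
```

Degree 1 underfits (high train and test error), degree 15 overfits (near-zero train error, worse test error), and the moderate degree sits near the sweet spot.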

What is Regularization?

Regularization: Adding a penalty term to the cost function to discourage overly complex models.
$$J_{regularized}(\theta) = J(\theta) + \lambda \cdot R(\theta)$$

Key Components

  • $J(\theta)$: Original loss (e.g., MSE)
  • $\lambda$: Regularization strength (hyperparameter)
  • $R(\theta)$: Penalty term (depends on weights)
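The three components can be sketched directly in NumPy; here $R(\theta)$ is the L2 penalty introduced in the next slide, and the function name is illustrative. By convention the intercept $\theta_0$ is left unpenalized.

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """J(theta) + lam * R(theta): MSE loss plus an L2 penalty.

    The intercept theta[0] is conventionally not penalized.
    """
    m = len(y)
    residuals = X @ theta - y
    mse = (residuals @ residuals) / (2 * m)   # J(theta): original loss
    penalty = lam * np.sum(theta[1:] ** 2)    # lam * R(theta)
    return mse + penalty

# tiny check: a perfect fit leaves only the penalty term
X = np.array([[1.0, 1.0], [1.0, 2.0]])   # first column is the bias feature
y = np.array([1.0, 2.0])
theta = np.array([0.0, 1.0])             # fits y = x exactly
print(ridge_cost(theta, X, y, lam=0.5))  # 0.0 loss + 0.5 * 1^2 = 0.5
```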

Knowledge Check

Think About It

What happens if λ is too large?

Click the blurred area to reveal the answer

Ridge Regression (L2)

Ridge (L2): Penalizes the sum of squared weights.
$$J_{ridge}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^{n}\theta_j^2$$

Properties

  • Shrinks all coefficients
  • Never sets coefficients exactly to zero
  • Keeps all features

When to Use

  • Many small/medium effects
  • All features potentially relevant
  • Multicollinearity present
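Ridge also admits a closed-form solution: for the unnormalized objective $\|y - X\theta\|^2 + \lambda\|\theta\|^2$ (sklearn's convention; the $\frac{1}{2m}$-scaled loss above only rescales $\lambda$), the minimizer is $\theta = (X^TX + \lambda I)^{-1}X^Ty$. A sketch, checked against sklearn:

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y directly."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)

theta = ridge_closed_form(X, y, lam=1.0)
sk = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)
print(np.allclose(theta, sk.coef_))  # the two solutions agree
```

Adding $\lambda I$ makes the matrix invertible even when $X^TX$ is singular, which is why Ridge copes well with multicollinearity.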

Lasso Regression (L1)

Lasso (L1): Penalizes the sum of absolute weights.
$$J_{lasso}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\sum_{j=1}^{n}|\theta_j|$$

Properties

  • Can set coefficients to exactly zero
  • Performs feature selection
  • Produces sparse models

When to Use

  • Many features, few important
  • Want automatic feature selection
  • Need interpretable models
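The sparsity claim is easy to verify empirically. A sketch with synthetic data in which only 3 of 20 features actually drive the target (the sizes and penalty strengths are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
# only the first 3 of 20 features matter; the rest are noise
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(0, 0.5, 200)

X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)
ridge = Ridge(alpha=1.0).fit(X_scaled, y)

print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # many exact zeros
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically none
```

Lasso zeroes out most of the irrelevant features, while Ridge keeps every coefficient small but nonzero.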

Ridge vs Lasso: Geometric View

The shape of the constraint region determines which coefficients become zero.

Ridge (Circle)

  • Smooth boundary
  • Optimal point rarely at corners
  • Coefficients shrink smoothly toward zero

Lasso (Diamond)

  • Sharp corners at axes
  • Optimal point often at corners
  • Some coefficients become zero
The optimum lies where the loss contours first touch the constraint region; Lasso's corners sit on the axes, so that touch often zeroes a coefficient.

Knowledge Check

Think About It

Why does Lasso produce sparse solutions while Ridge doesn't?


Elastic Net

Elastic Net: Combines L1 and L2 regularization.
$$J_{elastic}(\theta) = J(\theta) + \lambda_1\sum_{j}|\theta_j| + \lambda_2\sum_{j}\theta_j^2$$

Benefits

  • Feature selection (from L1)
  • Handles correlated features (from L2)
  • More flexible than either alone
Mix ratio: Control balance between L1 and L2 with a single parameter $\alpha \in [0,1]$.
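Setting $\lambda_1 = \lambda\alpha$ and $\lambda_2 = \lambda(1-\alpha)$ collapses the two penalties into one strength $\lambda$ and one mix ratio $\alpha$. A sketch of that parametrization (sklearn's ElasticNet uses the same idea with l1_ratio as $\alpha$, plus an extra ½ on the L2 term):

```python
import numpy as np

def elastic_net_penalty(theta, lam, alpha):
    """lam * (alpha * ||theta||_1 + (1 - alpha) * ||theta||_2^2)."""
    theta = np.asarray(theta, dtype=float)
    l1 = np.abs(theta).sum()
    l2 = (theta ** 2).sum()
    return lam * (alpha * l1 + (1 - alpha) * l2)

theta = np.array([1.0, -2.0])
print(elastic_net_penalty(theta, lam=1.0, alpha=1.0))  # 3.0 (pure Lasso)
print(elastic_net_penalty(theta, lam=1.0, alpha=0.0))  # 5.0 (pure Ridge)
print(elastic_net_penalty(theta, lam=1.0, alpha=0.5))  # 4.0 (even mix)
```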

Choosing λ (Regularization Strength)

$\lambda$ is a hyperparameter: it is not learned from the data, so we must choose it ourselves.

Selection Methods

  • Cross-validation: Try different λ values, pick best validation performance
  • Grid search: Test λ ∈ {0.001, 0.01, 0.1, 1, 10, 100}
  • Regularization path: Plot coefficients vs λ
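The cross-validation and grid-search steps above can be sketched with sklearn's built-in CV estimators (the data and grid here are illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.3, 150)
X_scaled = StandardScaler().fit_transform(X)

# grid of candidate strengths; the best is picked by cross-validation
ridge = RidgeCV(alphas=[0.001, 0.01, 0.1, 1, 10, 100]).fit(X_scaled, y)
lasso = LassoCV(cv=5).fit(X_scaled, y)  # builds its own alpha path

print("best ridge alpha:", ridge.alpha_)
print("best lasso alpha:", round(lasso.alpha_, 4))
```

Note that sklearn calls the strength `alpha`, which plays the role of $\lambda$ in the formulas above.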

λ too small

→ Little regularization

→ Risk of overfitting

λ too large

→ Too much regularization

→ Risk of underfitting

Feature Scaling Matters

Important: Always scale features before applying regularization!
Regularization penalizes large weights. If features have different scales, their weights will differ just due to scale, not importance.

Standard Scaling

$$x_{scaled} = \frac{x - \mu}{\sigma}$$

After scaling, all features compete on equal footing for the regularization penalty.
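A quick sketch of standard scaling by hand, with two features on very different scales; after the transform each column has mean 0 and standard deviation 1:

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# standard scaling: subtract the column mean, divide by the column std
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```

In practice, use sklearn's StandardScaler (as in the next slide) so the same means and stds learned on training data are reused on test data.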

Regularization in Practice

Python Implementation

from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler

# Always scale first!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X_scaled, y)

# Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_scaled, y)

# Elastic Net
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X_scaled, y)

Summary

Key Takeaways

  • Regularization prevents overfitting by penalizing complex models
  • Ridge (L2): Shrinks all coefficients, handles multicollinearity
  • Lasso (L1): Feature selection, produces sparse models
  • Elastic Net: Combines benefits of both
  • λ selection: Use cross-validation
  • Feature scaling: Essential for fair regularization
Next: Exploratory Data Analysis — understanding your data before modeling.

End of Module 03

Regularization

Questions?