
Model Selection and Evaluation

CMSC 173 - Module 05

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

Outline

\tableofcontents

The Model Selection Problem

Central Question: How do we choose the best model?

Challenges

  • Multiple algorithms available
  • Different hyperparameters
  • Trade-offs between complexity and performance
  • Avoiding overfitting
  • Generalization to unseen data
\begin{exampleblock}{Goals}
  • Select optimal model architecture
  • Tune hyperparameters effectively
  • Ensure reliable performance
  • Balance bias and variance
  • Maximize generalization
\end{exampleblock}

Key Insight

Model selection is not just about training performance, but about how well the model generalizes to new, unseen data.

Model Selection Pipeline

\begin{tikzpicture}[node distance=1.5cm, every node/.style={font=\small}]
  \node[rectangle, draw, fill=blue!20, minimum width=2.5cm, minimum height=0.8cm] (data) {Data};
  \node[rectangle, draw, fill=green!20, minimum width=2.5cm, minimum height=0.8cm, right=of data] (split) {Train/Val/Test Split};
  \node[rectangle, draw, fill=yellow!20, minimum width=2.5cm, minimum height=0.8cm, right=of split] (train) {Train Models};
  \node[rectangle, draw, fill=orange!20, minimum width=2.5cm, minimum height=0.8cm, below=0.8cm of train] (validate) {Validate \& Select};
  \node[rectangle, draw, fill=purple!20, minimum width=2.5cm, minimum height=0.8cm, left=of validate] (test) {Test Final Model};
  \draw[->, thick] (data) -- (split);
  \draw[->, thick] (split) -- (train);
  \draw[->, thick] (train) -- (validate);
  \draw[->, thick] (validate) -- (test);
\end{tikzpicture}
  • Split: Divide data into training, validation, and test sets
  • Train: Fit multiple candidate models
  • Validate: Compare models on validation set
  • Select: Choose best performing model
  • Test: Final evaluation on held-out test set

Train-Validation-Test Split

[Figure: ../figures/train_test_split.png]

Training Set

  • Model fitting
  • Learning parameters
  • 60-70\% of data

Validation Set

  • Model selection
  • Hyperparameter tuning
  • 15-20\% of data

Test Set

  • Final evaluation
  • Unbiased estimate
  • 15-20\% of data
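The split can be sketched in plain Python. The exact 70/15/15 fractions and the fixed `seed` below are illustrative choices within the ranges above, not prescribed values:

```python
import random

def train_val_test_split(n, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/val/test portions."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # fixed seed for reproducibility
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = train_val_test_split(100)   # 70 / 15 / 15 indices
```

In practice, `sklearn.model_selection.train_test_split` (applied twice) does the same job and also supports stratification.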

Understanding Prediction Error

For a regression problem, the expected prediction error can be decomposed:

Error Decomposition

$$ \mathbb{E}[(y - \hat{f}(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2 $$

Bias

$$ \text{Bias}[\hat{f}] = \mathbb{E}[\hat{f}] - f $$ Error from wrong assumptions in the learning algorithm

Variance

$$ \text{Var}[\hat{f}] = \mathbb{E}[(\hat{f} - \mathbb{E}[\hat{f}])^2] $$ Error from sensitivity to training set variations

Irreducible Error

$$ \sigma^2 = \text{Var}[\epsilon] $$ Noise in the data that cannot be reduced
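The decomposition can be checked numerically with a toy Monte Carlo experiment (the true value, noise level, and sample sizes below are all illustrative assumptions): each "trained model" predicts the mean of a small noisy sample, and the average squared error on fresh observations matches bias$^2$ + variance + $\sigma^2$ up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)
f_true, sigma = 2.0, 0.5            # true value at a fixed x; noise std (sigma^2 irreducible)
n_train, n_trials = 10, 20000

# Each "trained model" is the mean of a fresh noisy training sample of size n_train.
preds = rng.normal(f_true, sigma, size=(n_trials, n_train)).mean(axis=1)
y_new = rng.normal(f_true, sigma, size=n_trials)    # independent test observations

mse = np.mean((y_new - preds) ** 2)                 # E[(y - f_hat)^2]
bias_sq = (preds.mean() - f_true) ** 2              # (E[f_hat] - f)^2
var = preds.var()                                   # Var[f_hat]
# mse ~= bias_sq + var + sigma**2, up to Monte Carlo error
```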

The Bias-Variance Tradeoff

[Figure: ../figures/bias_variance_tradeoff.png]

\begin{tipblock}{Key Insight}
  • As model complexity increases, bias decreases but variance increases
  • The optimal model minimizes the total error (bias$^2$ + variance)
  • There exists a sweet spot that balances both sources of error
\end{tipblock}

High Bias vs High Variance

High Bias (Underfitting)

Characteristics:
  • Overly simple model
  • Poor training performance
  • Poor test performance
  • Cannot capture data patterns
Solutions:
  • Increase model complexity
  • Add more features
  • Reduce regularization
  • Train longer

High Variance (Overfitting)

Characteristics:
  • Overly complex model
  • Excellent training performance
  • Poor test performance
  • Memorizes training data
Solutions:
  • Simplify model
  • Get more training data
  • Increase regularization
  • Use early stopping

Visualizing Underfitting and Overfitting

[Figure: ../figures/underfitting_overfitting.png]

  • Left: Underfitting - linear model cannot capture nonlinear relationship
  • Center: Good fit - balanced complexity captures true pattern
  • Right: Overfitting - high-degree polynomial fits noise

Model Complexity and Error

[Figure: ../figures/model_complexity_curve.png]

Observations

  • Training error decreases monotonically with complexity
  • Validation error has a U-shaped curve
  • Gap between curves indicates overfitting
  • Optimal complexity minimizes validation error

Why Do We Need Validation?

The Fundamental Problem

We cannot evaluate model performance on the same data used for training!
\begin{exampleblock}{Training Error is Optimistic}
  • Model has seen the training data
  • Can memorize patterns and noise
  • Does not reflect generalization
  • Always decreases with complexity
\end{exampleblock} \begin{exampleblock}{Validation Error is Realistic}
  • Model has not seen validation data
  • Measures true generalization
  • Enables fair model comparison
  • Guides hyperparameter selection
\end{exampleblock}

[Figure: ../figures/validation_necessity.png]

Learning Curves

[Figure: ../figures/learning_curves.png]

  • Underfitting: Both errors high, converge to high value
  • Well-fitted: Both errors low, small gap between them
  • Overfitting: Large gap between training and validation error

Cross-Validation: Motivation

Problem with Single Train-Val Split

  • Results depend on random split
  • Some data points never used for training
  • Some never used for validation
  • High variance in performance estimates
\begin{exampleblock}{Cross-Validation Solution}
  • Use multiple train-validation splits
  • Every data point used for both training and validation
  • Average results across splits for robust estimate
  • Reduces variance in performance evaluation
\end{exampleblock}

Cross-Validation Schemes

[Figure: ../figures/cross_validation_schemes.png]

K-Fold CV

  • Split data into K folds
  • Train on K-1, validate on 1
  • Repeat K times
  • Average K results

Stratified K-Fold

  • Maintains class distribution
  • Important for imbalanced data
  • Each fold representative
  • Same averaging as K-Fold

K-Fold Cross-Validation Algorithm

\begin{algorithm}[H]
\caption{K-Fold Cross-Validation}
\begin{algorithmic}[1]
\REQUIRE Dataset $D$, Model $M$, Number of folds $K$
\ENSURE Cross-validation score
\STATE Randomly partition $D$ into $K$ equal-sized subsets $D_1, D_2, \ldots, D_K$
\STATE Initialize $\text{scores} = []$
\FOR{$i = 1$ to $K$}
  \STATE $D_{\text{val}} \leftarrow D_i$
  \STATE $D_{\text{train}} \leftarrow D \setminus D_i$
  \STATE Train model $\hat{M}$ on $D_{\text{train}}$
  \STATE $s_i \leftarrow \text{Evaluate}(\hat{M}, D_{\text{val}})$
  \STATE Append $s_i$ to $\text{scores}$
\ENDFOR
\STATE return $\frac{1}{K} \sum_{i=1}^{K} s_i$
\end{algorithmic}
\end{algorithm}
\begin{tipblock}{Common Choices}
$K = 5$ or $K = 10$ are typical values balancing computational cost and variance reduction.
\end{tipblock}
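The pseudocode translates almost line-for-line into plain Python. The toy "model" here (the training mean, scored by negative absolute error) is only an illustration; in practice `sklearn.model_selection.cross_val_score` does this for you.

```python
import statistics

def k_fold_cv(data, train_fn, eval_fn, k=5):
    """K-fold CV following the algorithm above. Folds are deterministic
    round-robin slices; the algorithm shuffles first, omitted here so the
    toy example is reproducible."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        d_val = folds[i]                                           # D_val <- D_i
        d_train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(d_train)                                  # train on D \ D_i
        scores.append(eval_fn(model, d_val))                       # evaluate on D_val
    return statistics.mean(scores)                                 # average K scores

# Toy usage: the "model" is the training mean, scored by negative absolute error.
score = k_fold_cv(
    list(range(20)),
    train_fn=lambda tr: statistics.mean(tr),
    eval_fn=lambda m, va: -statistics.mean(abs(y - m) for y in va),
)
```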

Validation Curve

[Figure: ../figures/validation_curve.png]

Using Validation Curves

  • Plot training and validation scores vs. hyperparameter values
  • Identify optimal hyperparameter setting
  • Diagnose underfitting and overfitting regions
  • Select model with best validation performance

Classification Metrics: Confusion Matrix

[Figure: ../figures/confusion_matrix.png]

Definitions

  • TP: True Positives
  • TN: True Negatives
  • FP: False Positives (Type I error)
  • FN: False Negatives (Type II error)

Key Metrics

$$\begin{aligned}\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \\ \text{Precision} = \frac{TP}{TP + FP} \\ \text{Recall} = \frac{TP}{TP + FN}\end{aligned}$$

Classification Metrics Comparison

[Figure: ../figures/metrics_comparison.png]

Accuracy

Overall correctness; can be misleading with imbalanced classes

F1-Score

Harmonic mean of precision and recall: $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
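All four metrics follow directly from confusion-matrix counts. The counts below are made-up numbers for an imbalanced problem, chosen to show how accuracy can look strong while precision stays mediocre:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)      # of predicted positives, how many are real
    recall = tp / (tp + fn)         # of real positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# 10 true positives vs 90 negatives: accuracy 0.93 despite precision of only 8/13
m = classification_metrics(tp=8, tn=85, fp=5, fn=2)
```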

ROC Curve and AUC

[Figure: ../figures/roc_curve.png]

ROC Curve

  • Plots TPR vs FPR
  • Shows performance across thresholds
  • Diagonal = random classifier
  • Upper-left corner = perfect
\begin{exampleblock}{AUC Score}
  • Area Under ROC Curve
  • Range: [0, 1]
  • 0.5 = random
  • 1.0 = perfect
  • Threshold-independent
\end{exampleblock}
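AUC also has an equivalent probabilistic reading: the chance that a randomly chosen positive is scored above a randomly chosen negative (ties counted half). That makes a tiny threshold-free implementation possible; the scores below are made up for illustration:

```python
def auc_score(labels, scores):
    """AUC = P(score of random positive > score of random negative),
    with ties counted 0.5. Equivalent to the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])   # one mis-ranked pair
```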

Precision-Recall Curve

[Figure: ../figures/precision_recall_curve.png]

When to Use

  • Imbalanced datasets
  • Care about positive class
  • False positives costly
  • Alternative to ROC
\begin{tipblock}{Interpretation}
  • High area = good performance
  • Trade-off between precision and recall
  • Choose threshold based on application needs
\end{tipblock}

Regression Metrics

Mean Squared Error (MSE)

$$ \text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 $$
  • Penalizes large errors heavily
  • Same units as $y^2$
  • Always non-negative
  • Lower is better

Root Mean Squared Error

$$ \text{RMSE} = \sqrt{\text{MSE}} $$
  • Same units as $y$
  • More interpretable than MSE

Mean Absolute Error (MAE)

$$ \text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i| $$
  • Robust to outliers
  • Same units as $y$
  • Easy to interpret

R-Squared ($R^2$)

$$ R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2} $$
  • Proportion of variance explained
  • Range: $(-\infty, 1]$
  • 1 = perfect predictions
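The four regression metrics above, computed in plain Python on a small made-up example:

```python
import math

def regression_metrics(y_true, y_pred):
    """MSE, RMSE, MAE, and R^2 as defined above."""
    n = len(y_true)
    residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mse = sum(e * e for e in residuals) / n
    mae = sum(abs(e) for e in residuals) / n
    y_bar = sum(y_true) / n
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)     # variance of y around its mean
    r2 = 1 - (mse * n) / ss_tot                        # 1 - SS_res / SS_tot
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

m = regression_metrics([3, 5, 7, 9], [2.5, 5.0, 7.5, 9.0])
```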

What is Regularization?

Definition

Regularization is a technique to prevent overfitting by adding a penalty term to the loss function that discourages complex models.
\begin{exampleblock}{General Form} $$ \text{Loss}_{\text{regularized}} = \text{Loss}_{\text{data}} + \lambda \cdot \text{Penalty}(\text{parameters}) $$ where $\lambda \geq 0$ is the regularization parameter controlling the strength of regularization. \end{exampleblock}

Benefits

  • Reduces overfitting
  • Improves generalization
  • Encourages simpler models
  • Can perform feature selection

Trade-off

  • $\lambda$ too small: overfitting
  • $\lambda$ too large: underfitting
  • Must tune $\lambda$ via validation

Ridge Regression (L2 Regularization)

Objective Function

$$ \min_{\mathbf{w}} \sum_{i=1}^{n}(y_i - \mathbf{w}^T\mathbf{x}_i)^2 + \lambda \|\mathbf{w}\|_2^2 $$ where $\|\mathbf{w}\|_2^2 = \sum_{j=1}^{d} w_j^2$ is the L2 norm.
\begin{exampleblock}{Characteristics}
  • Shrinks coefficients towards zero
  • Does not set coefficients exactly to zero
  • Has closed-form solution
  • Stable and computationally efficient
  • Preferred when all features are relevant
\end{exampleblock}

Solution

$$ \hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^T\mathbf{y} $$
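The closed form takes only a few lines of NumPy. The data here is synthetic and there is no intercept term, a simplifying assumption:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam*I) w = X^T y -- the closed-form ridge solution above."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=50)     # low-noise synthetic targets

w_small = ridge_fit(X, y, lam=0.1)   # mild shrinkage: close to w_true
w_heavy = ridge_fit(X, y, lam=1e6)   # strong shrinkage: weights pushed toward 0
```

Using `np.linalg.solve` rather than forming the explicit inverse is the numerically preferred way to evaluate this formula.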

[Figure: ../figures/regularization_effect.png]

Lasso Regression (L1 Regularization)

Objective Function

$$ \min_{\mathbf{w}} \sum_{i=1}^{n}(y_i - \mathbf{w}^T\mathbf{x}_i)^2 + \lambda \|\mathbf{w}\|_1 $$ where $\|\mathbf{w}\|_1 = \sum_{j=1}^{d} |w_j|$ is the L1 norm.
\begin{exampleblock}{Characteristics}
  • Can set coefficients exactly to zero
  • Performs automatic feature selection
  • Produces sparse models
  • No closed-form solution (use optimization)
  • Preferred with many irrelevant features
\end{exampleblock}

Sparsity Property

Lasso's ability to zero out coefficients makes it ideal for interpretable models and high-dimensional data.

[Figure: ../figures/sparsity_comparison.png]
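A useful special case shows where the sparsity comes from: when the features are orthonormal ($\mathbf{X}^T\mathbf{X} = \mathbf{I}$), the lasso solution is just the least-squares solution pushed toward zero by a soft-threshold (with threshold $\lambda/2$ under the loss scaling above), so coefficients inside the threshold land exactly at zero. The coefficients below are made up:

```python
def soft_threshold(w_ols, t):
    """Per-coordinate lasso solution for orthonormal features: shrink each
    least-squares coefficient by t (= lambda/2 for the objective above) and
    clip to exactly zero once its magnitude is used up."""
    return [max(abs(x) - t, 0.0) * (1 if x > 0 else -1) for x in w_ols]

w_lasso = soft_threshold([3.0, -0.4, 0.1, -2.0], t=0.5)
# small coefficients are zeroed out; large ones survive, shrunk by t
```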

L1 vs L2 Regularization: Geometric Interpretation

[Figure: ../figures/l1_vs_l2_geometry.png]

  • L2 (Ridge): Circular constraint region - solution rarely at axes (non-sparse)
  • L1 (Lasso): Diamond constraint region - corners encourage sparse solutions
  • Contours represent loss function, constraint region represents penalty

Regularization Paths

[Figure: ../figures/regularization_paths.png]

Observations

  • Ridge: Coefficients shrink smoothly but never reach exactly zero
  • Lasso: Coefficients can become exactly zero at finite $\lambda$
  • As $\lambda \to \infty$, all coefficients approach zero
  • Different coefficients zero out at different $\lambda$ values in Lasso

Elastic Net: Combining L1 and L2

Objective Function

$$ \min_{\mathbf{w}} \sum_{i=1}^{n}(y_i - \mathbf{w}^T\mathbf{x}_i)^2 + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2 $$ Alternatively parameterized with mixing parameter $\alpha \in [0,1]$: $$ \text{Penalty} = \lambda \left[ \alpha \|\mathbf{w}\|_1 + (1-\alpha) \|\mathbf{w}\|_2^2 \right] $$
\begin{exampleblock}{Advantages}
  • Combines benefits of Ridge and Lasso
  • Handles correlated features better than Lasso
  • Can select groups of correlated features
  • More stable than Lasso
\end{exampleblock}
\begin{tipblock}{When to Use}
  • Many correlated features
  • Want feature selection and grouping
  • Lasso is too aggressive
  • Ridge is not sparse enough
\end{tipblock}
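The mixed penalty in the $\alpha$-parameterization is straightforward to compute directly (toy weights below):

```python
def elastic_net_penalty(w, lam, alpha):
    """lam * [alpha * ||w||_1 + (1 - alpha) * ||w||_2^2], the mixing
    parameterization above. alpha=1 recovers the lasso penalty, alpha=0 ridge."""
    l1 = sum(abs(x) for x in w)
    l2 = sum(x * x for x in w)
    return lam * (alpha * l1 + (1 - alpha) * l2)

p = elastic_net_penalty([1.0, -2.0, 0.5], lam=0.1, alpha=0.5)
```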

Comparing Regularization Methods

[Figure: ../figures/regularization_comparison.png]

  • All methods converge to similar training error with strong regularization
  • Test error differences reveal generalization capabilities
  • Optimal $\lambda$ differs across methods

Regularization in Other Models

Neural Networks

  • Weight decay: L2 penalty on weights
  • Dropout: Randomly drop neurons during training
  • Early stopping: Stop training before overfitting
  • Batch normalization: Normalize activations

Support Vector Machines

  • $C$ parameter controls regularization
  • Small $C$ = strong regularization
  • Large $C$ = weak regularization

Decision Trees/Forests

  • Max depth
  • Min samples per leaf
  • Max number of features
  • Pruning

General Strategies

  • Data augmentation
  • Feature selection
  • Ensemble methods
  • Cross-validation for tuning

Model Selection Best Practices

\begin{exampleblock}{Do's}
  • Always use separate train/validation/test sets
  • Use cross-validation for robust estimates
  • Tune hyperparameters only on validation data
  • Report final performance on test set (once!)
  • Standardize/normalize features appropriately
  • Use stratified splits for classification
  • Track both training and validation metrics
  • Document all preprocessing steps
\end{exampleblock}

Common Pitfalls to Avoid

Don'ts

  • Data leakage: Including test data in preprocessing
  • Peeking at test set: Multiple evaluations on test set
  • Ignoring class imbalance: Using accuracy on imbalanced data
  • Not checking assumptions: Assuming i.i.d. data
  • Overfitting validation set: Excessive hyperparameter tuning
  • Cherry-picking results: Reporting only best-case performance
  • Inadequate splitting: Too small validation/test sets
  • Comparing on training data: Always compare on validation

Data Leakage: A Critical Issue

What is Data Leakage?

Information from the test/validation set leaking into the training process, leading to overly optimistic performance estimates.
\begin{exampleblock}{Common Sources}
  • Normalization using all data
  • Feature selection on all data
  • Imputation using all data
  • Temporal data ordering issues
  • Duplicate samples across splits
\end{exampleblock}
\begin{tipblock}{Prevention}
  • Split data FIRST
  • Fit preprocessing only on training
  • Transform validation/test separately
  • Use pipelines
  • Be careful with time series
\end{tipblock}

Example: Correct Order

1. Split data $\to$ 2. Fit scaler on train $\to$ 3. Transform train/val/test $\to$ 4. Train model
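The correct order, sketched in plain Python with toy numbers: the scaler's parameters come from the training set alone, and the same parameters are reused (never re-fit) on validation/test data:

```python
import statistics

def fit_scaler(train):
    """Step 2: estimate standardization parameters on TRAINING data only."""
    return statistics.mean(train), statistics.pstdev(train)

def transform(values, mu, sigma):
    """Step 3: apply the SAME parameters to any split."""
    return [(v - mu) / sigma for v in values]

train, test = [1.0, 2.0, 3.0, 4.0], [10.0]   # test value lies outside train's range
mu, sigma = fit_scaler(train)                 # no test data involved
train_z = transform(train, mu, sigma)
test_z = transform(test, mu, sigma)           # reuse parameters, don't re-fit
```

`sklearn.pipeline.Pipeline` automates exactly this discipline inside cross-validation.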

Hyperparameter Tuning Strategies

Grid Search

  • Exhaustive search over grid
  • Guarantees finding best in grid
  • Exponential in \# parameters
  • Good for few parameters

Random Search

  • Randomly sample combinations
  • Often finds good solutions faster
  • Better for many parameters
  • Can set computational budget

Bayesian Optimization

  • Models objective function
  • Guides search intelligently
  • Most sample-efficient
  • Good for expensive models
\begin{tipblock}{Practical Tips}
  • Start with coarse grid
  • Refine around best values
  • Use log scale for $\lambda$
  • Parallelize when possible
\end{tipblock}
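Grid search is just an exhaustive loop over the Cartesian product of candidate values. The search space and stand-in scoring function below are hypothetical; in practice the score would come from validation or cross-validation (e.g. `sklearn.model_selection.GridSearchCV`):

```python
import itertools

grid = {
    "lam": [0.001, 0.01, 0.1, 1.0, 10.0],   # log-spaced, per the tip above
    "degree": [1, 2, 3],
}

def val_score(lam, degree):
    """Stand-in for 'train with these settings, score on validation data'."""
    return -(lam - 0.1) ** 2 - (degree - 2) ** 2   # peaks at lam=0.1, degree=2

best = max(itertools.product(grid["lam"], grid["degree"]),
           key=lambda params: val_score(*params))
```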

Nested Cross-Validation

Problem

Using CV for both model selection and performance estimation gives biased results!
\begin{exampleblock}{Solution: Nested CV}
  • Outer loop: Estimates true performance
  • Inner loop: Selects hyperparameters
  • Provides unbiased performance estimate
  • More computationally expensive
\end{exampleblock}

Structure

For each outer fold:
  1. Set aside test fold
  2. Use inner CV to select hyperparameters
  3. Train final model with best hyperparameters
  4. Evaluate on test fold
Average outer fold results
\begin{tikzpicture}[scale=0.7]
  % Outer loop
  \draw[thick, blue] (0,0) rectangle (5,0.6);
  \node at (2.5, 0.3) {\small Outer Fold 1};
  \draw[thick, blue] (0,-1) rectangle (5,-0.4);
  \node at (2.5, -0.7) {\small Outer Fold 2};
  \draw[thick, blue] (0,-2) rectangle (5,-1.4);
  \node at (2.5, -1.7) {\small Outer Fold 3};
  % Inner loops for first outer fold
  \draw[thick, red] (0.2, 1.2) rectangle (1.5, 1.6);
  \node at (0.85, 1.4) {\tiny Inner 1};
  \draw[thick, red] (1.7, 1.2) rectangle (3, 1.6);
  \node at (2.35, 1.4) {\tiny Inner 2};
  \draw[thick, red] (3.2, 1.2) rectangle (4.5, 1.6);
  \node at (3.85, 1.4) {\tiny Inner 3};
  % Arrows
  \draw[->, thick] (2.5, 0.6) -- (2.5, 1.2);
  % Labels
  \node[blue, left] at (-0.2, 0.3) {\small Outer};
  \node[red, right] at (5.2, 1.4) {\small Inner};
\end{tikzpicture}
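The outer/inner structure above as a self-contained plain-Python sketch. Folds are deterministic round-robin slices, and the toy model and scoring function in the usage are made up:

```python
import statistics

def make_folds(data, k):
    return [data[i::k] for i in range(k)]

def nested_cv(data, params, train_fn, eval_fn, k_outer=3, k_inner=3):
    """Nested CV: the inner loop picks a hyperparameter, the outer loop scores it."""
    outer_scores = []
    outer = make_folds(data, k_outer)
    for o in range(k_outer):
        test_fold = outer[o]                               # 1. set aside test fold
        rest = [x for j, f in enumerate(outer) if j != o for x in f]

        def inner_score(p):                                # 2. inner CV on `rest`
            inner = make_folds(rest, k_inner)
            scores = []
            for i in range(k_inner):
                val = inner[i]
                tr = [x for j, f in enumerate(inner) if j != i for x in f]
                scores.append(eval_fn(train_fn(tr, p), val))
            return statistics.mean(scores)

        best_p = max(params, key=inner_score)
        final = train_fn(rest, best_p)                     # 3. refit with best p
        outer_scores.append(eval_fn(final, test_fold))     # 4. score on test fold
    return statistics.mean(outer_scores)                   # average outer results

# Toy usage: the model predicts p * (training mean); choosing p is the "tuning".
score = nested_cv(
    [float(i) for i in range(12)],
    params=[0.0, 1.0],
    train_fn=lambda tr, p: p * statistics.mean(tr),
    eval_fn=lambda m, va: -statistics.mean((y - m) ** 2 for y in va),
)
```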

Model Selection Checklist

Before Training

  • [$\square$] Understand the problem and data
  • [$\square$] Check for class imbalance
  • [$\square$] Handle missing values
  • [$\square$] Split data properly
  • [$\square$] Standardize/normalize features
  • [$\square$] Choose appropriate metrics

During Training

  • [$\square$] Use cross-validation
  • [$\square$] Track train and validation metrics
  • [$\square$] Try multiple model types
  • [$\square$] Tune hyperparameters systematically
  • [$\square$] Check for overfitting

After Training

  • [$\square$] Evaluate on test set (once!)
  • [$\square$] Compare multiple metrics
  • [$\square$] Analyze errors/confusion matrix
  • [$\square$] Check for biases
  • [$\square$] Document results
  • [$\square$] Assess computational requirements

Golden Rule

Never touch the test set until final evaluation, and evaluate on it only once!

Summary: Key Concepts

  1. Bias-Variance Tradeoff
    • Balance between model complexity and generalization
    • Underfitting (high bias) vs Overfitting (high variance)
  2. Model Validation
    • Always use separate train/validation/test sets
    • Cross-validation provides robust performance estimates
    • Learning curves diagnose fitting issues
  3. Evaluation Metrics
    • Choose metrics appropriate for the problem
    • Classification: accuracy, precision, recall, F1, ROC-AUC
    • Regression: MSE, RMSE, MAE, $R^2$
  4. Regularization
    • Ridge (L2): shrinks coefficients, keeps all features
    • Lasso (L1): feature selection via sparsity
    • Elastic Net: combines L1 and L2

Key Takeaways

Critical Principles

  • Generalization is the goal - training performance is not enough
  • Avoid data leakage - fit preprocessing only on training data
  • Use proper validation - cross-validation for robust estimates
  • Test set is sacred - evaluate on it only once at the end
  • Choose appropriate metrics - align with business/research goals
  • Regularize when needed - prevent overfitting proactively
  • Document everything - ensure reproducibility

Next Steps

Practice model selection and evaluation on real datasets using cross-validation, regularization, and proper evaluation protocols.

Additional Resources

Textbooks

  • Hastie, Tibshirani, Friedman - The Elements of Statistical Learning
  • Bishop - Pattern Recognition and Machine Learning
  • James et al. - An Introduction to Statistical Learning

Online Resources

  • scikit-learn documentation: Model selection and evaluation
  • Coursera: Machine Learning by Andrew Ng
  • Fast.ai: Practical Deep Learning for Coders
Thank you!

End of Module 05

Model Selection and Evaluation

Questions?