Cross Validation \& Hyperparameter Tuning
CMSC 173 - Module 06
Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu
The Model Validation Problem
Central Question: How do we assess model performance?
Training Error
- Error on data used to train the model
- Always optimistically biased
- Cannot be used for model selection
- Misleading indicator
Test Error
- Error on unseen data
- True indicator of generalization
- Should only be used once
- Gold standard
Problem
We need to estimate generalization performance without using the test set during model development and hyperparameter tuning.
The Three-Way Data Split
\begin{tikzpicture}[scale=1.0, every node/.style={scale=1.0}]
% Original dataset
\node[rectangle, draw, fill=gray!20, minimum width=10cm, minimum height=1cm] (dataset) at (0,3) {};
\node[above=0.1cm of dataset] {Original Dataset};
% Data split arrows
\draw[fold arrow, very thick] (-3,2.3) -- (-3,1.7);
\draw[fold arrow, very thick] (0,2.3) -- (0,1.7);
\draw[fold arrow, very thick] (3,2.3) -- (3,1.7);
% Split proportions
\node at (-3,2) {\scriptsize 60\%};
\node at (0,2) {\scriptsize 20\%};
\node at (3,2) {\scriptsize 20\%};
% Training set
\node[train fold, minimum width=6cm, minimum height=1cm] (train) at (-3,1) {};
\node[above=0.1cm of train] {Training Set};
\node[below=0.1cm of train] {\scriptsize Used to train models};
% Validation set
\node[valid fold, minimum width=2cm, minimum height=1cm] (valid) at (0,1) {};
\node[above=0.1cm of valid] {Validation Set};
\node[below=0.1cm of valid] {\scriptsize Model selection \& tuning};
% Test set
\node[test fold, minimum width=2cm, minimum height=1cm] (test) at (3,1) {};
\node[above=0.1cm of test] {Test Set};
\node[below=0.1cm of test] {\scriptsize Final evaluation};
% Workflow arrows
\draw[fold arrow, very thick] (train) to[bend left=20] (-1,-0.5);
\draw[fold arrow, very thick] (valid) -- (0,-0.5);
\draw[fold arrow, very thick] (test) to[bend right=20] (1,-0.5);
% Model development process
\node[cv process, minimum width=3cm] (model) at (0,-1) {Model Development};
\node[below=0.1cm of model] {\scriptsize Training + Validation};
% Final evaluation
\draw[fold arrow, very thick] (model) -- (0,-2.5);
\node[cv process, fill=testcolor!20, minimum width=2.5cm] (final) at (0,-3) {Final Test};
\node[below=0.1cm of final] {\scriptsize Unbiased estimate};
\end{tikzpicture}
Key Insight
Validation set acts as a proxy for the test set during model development, allowing us to:
- Compare different models
- Tune hyperparameters
- Avoid overfitting to the test set
Holdout Validation Method
Procedure:
- Randomly split data into training and validation sets
- Train model on training set
- Evaluate on validation set
- Select best performing model/hyperparameters
- Final evaluation on held-out test set
Common Split Ratios:
- 70\% Train, 15\% Validation, 15\% Test
- 60\% Train, 20\% Validation, 20\% Test
- 80\% Train, 10\% Validation, 10\% Test
Mathematical Formulation
Given dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$:
$$\begin{aligned}\mathcal{D}_{train} \cup \mathcal{D}_{val} \cup \mathcal{D}_{test} = \mathcal{D} \\ \mathcal{D}_{train} \cap \mathcal{D}_{val} = \mathcal{D}_{train} \cap \mathcal{D}_{test} = \mathcal{D}_{val} \cap \mathcal{D}_{test} = \emptyset \\ |\mathcal{D}_{train}| = \alpha n \\ |\mathcal{D}_{val}| = \beta n \\ |\mathcal{D}_{test}| = (1-\alpha-\beta) n\end{aligned}$$
where typically $\alpha \in [0.6, 0.8]$ and $\beta \in [0.1, 0.2]$
Advantage
Simple and fast; the validation estimate is reliable (low variance) when the validation set is large enough.
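As a concrete illustration, the three-way split can be sketched in a few lines of Python (the function name `three_way_split` and the 60/20/20 defaults are illustrative, not from any particular library):

```python
import random

def three_way_split(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle indices, then cut them into train/validation/test portions."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)          # seeded for reproducibility
    n_train = int(train_frac * len(data))
    n_val = int(val_frac * len(data))
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]              # remainder becomes the test set
    return train, val, test
```

Because the sets are disjoint slices of one shuffled permutation, the three conditions in the formulation above (union covers $\mathcal{D}$, pairwise disjoint, proportions $\alpha$, $\beta$, $1-\alpha-\beta$) hold by construction.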
Holdout Validation: Advantages \& Disadvantages
Advantages
- Computational efficiency: Train model only once
- Simple implementation: Straightforward to understand and code
- Fast evaluation: Quick assessment of model performance
- Realistic simulation: Mimics real-world deployment scenario
When to Use
- Large datasets (n $>$ 10,000)
- Computationally expensive models
- Time-sensitive applications
Disadvantages
- High variance: Performance estimate depends on specific split
- Data inefficiency: Reduces training data size
- Unreliable for small datasets: Estimates can be very noisy
- Potential bias: Unlucky splits can mislead model selection
When NOT to Use
- Small datasets (n $<$ 1,000)
- Need for robust performance estimates
- Critical applications requiring high confidence
Key Takeaway
Holdout validation trades robustness for computational efficiency.
K-Fold Cross-Validation Method
\begin{tikzpicture}[scale=0.9, every node/.style={scale=0.9}]
% Title
\node at (0,4.5) {5-Fold Cross-Validation Example};
% Original dataset
\node[rectangle, draw, fill=gray!20, minimum width=10cm, minimum height=0.6cm] (dataset) at (0,4) {};
\node[left=0.2cm of dataset] {\scriptsize Original Data:};
% Fold 1
\node[valid fold, minimum width=2cm, minimum height=0.6cm] (f1v) at (-4,3) {};
\node[train fold, minimum width=8cm, minimum height=0.6cm] (f1t) at (0,3) {};
\node[left=0.2cm of f1v] {\scriptsize Fold 1:};
\node[below=0.05cm of f1v] {\tiny Valid};
\node[below=0.05cm of f1t] {\tiny Train};
% Fold 2
\node[train fold, minimum width=2cm, minimum height=0.6cm] (f2t1) at (-4,2.2) {};
\node[valid fold, minimum width=2cm, minimum height=0.6cm] (f2v) at (-2,2.2) {};
\node[train fold, minimum width=6cm, minimum height=0.6cm] (f2t2) at (1,2.2) {};
\node[left=0.2cm of f2t1] {\scriptsize Fold 2:};
\node[below=0.05cm of f2v] {\tiny Valid};
% Fold 3
\node[train fold, minimum width=4cm, minimum height=0.6cm] (f3t1) at (-3,1.4) {};
\node[valid fold, minimum width=2cm, minimum height=0.6cm] (f3v) at (0,1.4) {};
\node[train fold, minimum width=4cm, minimum height=0.6cm] (f3t2) at (3,1.4) {};
\node[left=0.2cm of f3t1] {\scriptsize Fold 3:};
\node[below=0.05cm of f3v] {\tiny Valid};
% Fold 4
\node[train fold, minimum width=6cm, minimum height=0.6cm] (f4t1) at (-2,0.6) {};
\node[valid fold, minimum width=2cm, minimum height=0.6cm] (f4v) at (2,0.6) {};
\node[train fold, minimum width=2cm, minimum height=0.6cm] (f4t2) at (4,0.6) {};
\node[left=0.2cm of f4t1] {\scriptsize Fold 4:};
\node[below=0.05cm of f4v] {\tiny Valid};
% Fold 5
\node[train fold, minimum width=8cm, minimum height=0.6cm] (f5t) at (-1,-0.2) {};
\node[valid fold, minimum width=2cm, minimum height=0.6cm] (f5v) at (4,-0.2) {};
\node[left=0.2cm of f5t] {\scriptsize Fold 5:};
\node[below=0.05cm of f5v] {\tiny Valid};
% Results aggregation
\draw[fold arrow, very thick] (-2,-0.8) -- (-2,-1.5);
\draw[fold arrow, very thick] (0,-0.8) -- (0,-1.5);
\draw[fold arrow, very thick] (2,-0.8) -- (2,-1.5);
\node[cv process, minimum width=6cm] (aggregate) at (0,-2) {Average Results};
\node[below=0.1cm of aggregate] {\scriptsize $CV_{score} = \frac{1}{k}\sum_{i=1}^{k} Score_i$};
\end{tikzpicture}
Core Idea
Use all data for both training and validation by systematically rotating which portion serves as the validation set.
K-Fold Cross-Validation: Mathematical Formulation
Algorithm:
- Partition dataset $\mathcal{D}$ into $k$ equal-sized folds: $\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_k$
- For each fold $i = 1, \ldots, k$:
- Training set: $\mathcal{D}_{train}^{(i)} = \mathcal{D} \setminus \mathcal{D}_i$
- Validation set: $\mathcal{D}_{val}^{(i)} = \mathcal{D}_i$
- Train model $f^{(i)}$ on $\mathcal{D}_{train}^{(i)}$
- Compute validation error: $E^{(i)} = L(f^{(i)}, \mathcal{D}_{val}^{(i)})$
- Estimate generalization error: $\hat{E}_{CV} = \frac{1}{k}\sum_{i=1}^k E^{(i)}$
Error Estimation
$$\hat{E}_{CV} = \frac{1}{k}\sum_{i=1}^k L(f^{(i)}, \mathcal{D}_i)$$
$$\text{Var}(\hat{E}_{CV}) = \frac{1}{k^2}\sum_{i=1}^k \text{Var}(E^{(i)}) \quad \text{(assuming independent folds; in practice folds share training data and are correlated)}$$
Standard Error
$$SE(\hat{E}_{CV}) = \frac{s}{\sqrt{k}}, \qquad s^2 = \frac{1}{k-1}\sum_{i=1}^k \left(E^{(i)} - \hat{E}_{CV}\right)^2$$
Approximate confidence interval: $\hat{E}_{CV} \pm 1.96 \cdot SE$
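A minimal sketch of fold construction and score aggregation, using only the standard library (the names `kfold_indices` and `cv_summary` are illustrative):

```python
import math
import random

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = idx[start:start + size]
        train = idx[:start] + idx[start + size:]   # everything except the fold
        start += size
        yield train, val

def cv_summary(scores):
    """Mean CV score and its standard error s / sqrt(k)."""
    k = len(scores)
    mean = sum(scores) / k
    s = math.sqrt(sum((e - mean) ** 2 for e in scores) / (k - 1))
    return mean, s / math.sqrt(k)
```

Each sample appears in exactly one validation fold, so averaging the $k$ fold scores uses every data point for validation exactly once.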
Choice of k in K-Fold Cross-Validation
Common Values:
- k = 5: Good balance, widely used
- k = 10: Most popular, good bias-variance tradeoff
- k = n (LOOCV): Lowest bias, highest variance
Bias-Variance Tradeoff:
$$\begin{aligned}\text{Training fraction} = \frac{k-1}{k} \text{ (so bias decreases as } k \text{ grows)} \\ \text{Variance} \propto \text{correlation between folds} \\ \text{Computation} \propto k \text{ (increases with k)}\end{aligned}$$
Choosing k
Small k (k = 3-5):
- Higher bias
- Lower variance
- Less computation
- Good for large datasets
Large k (k = 10-20):
- Lower bias
- Higher variance
- More computation
- Good for small datasets
Recommendation
Use k = 10 as default choice. It provides good bias-variance tradeoff for most practical scenarios.
K-Fold vs Holdout: Performance Comparison
Key Observation
K-fold cross-validation provides more robust and reliable performance estimates, especially important for model comparison and selection.
Leave-One-Out Cross-Validation (LOOCV)
Mathematical Formulation
LOOCV is k-fold CV with $k = n$:
$$\hat{E}_{LOOCV} = \frac{1}{n}\sum_{i=1}^n L(f^{(-i)}, (x_i, y_i))$$
where $f^{(-i)}$ is trained on all data except $(x_i, y_i)$.
Properties
- Bias: Nearly unbiased (uses n-1 samples)
- Variance: High (high correlation between folds)
- Computation: Expensive (n model fits)
- Deterministic: No randomness in splits
When to Use LOOCV
Small datasets where every sample is precious, and computational cost is acceptable.
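A from-scratch sketch of the LOOCV loop; the mean-predictor "model" below is a deliberately trivial stand-in so the structure of the $n$ fit/evaluate cycles is the focus:

```python
def loocv_error(xs, ys, fit, loss):
    """LOOCV: train on all points except i, evaluate on point i, average."""
    n = len(xs)
    errors = []
    for i in range(n):
        train_x = xs[:i] + xs[i + 1:]   # leave out sample i
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)   # one full model fit per sample
        errors.append(loss(model(xs[i]), ys[i]))
    return sum(errors) / n

# Toy model: always predict the mean of the training targets.
fit_mean = lambda xs, ys: (lambda x: sum(ys) / len(ys))
squared = lambda yhat, y: (yhat - y) ** 2
```

The determinism noted above is visible here: there is no shuffling at all, so repeated runs give identical estimates, at the cost of $n$ model fits.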
Stratified K-Fold Cross-Validation
Key Advantage
Maintains class distribution across folds, crucial for:
- Imbalanced datasets: Ensures each fold has representatives from all classes
- Small datasets: Prevents folds with missing classes
- Multiclass problems: Maintains proportional representation
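One simple way to realize stratification is to deal each class's samples out round-robin across folds. The sketch below illustrates the idea (it is not scikit-learn's exact algorithm):

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign each sample a fold id, distributing every class round-robin."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [0] * len(labels)
    for members in by_class.values():
        # Dealing round-robin keeps per-fold class counts within one of each other.
        for j, i in enumerate(members):
            fold_of[i] = j % k
    return fold_of
```

With 6 samples of class `a` and 3 of class `b` split into 3 folds, every fold receives exactly two `a`s and one `b`, preserving the 2:1 class ratio.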
Specialized Cross-Validation Methods
Time Series Cross-Validation
Problem: Standard CV violates temporal order
Solution: Forward chaining
- Fold 1: Train[1:100], Test[101:150]
- Fold 2: Train[1:150], Test[151:200]
- Fold 3: Train[1:200], Test[201:250]
Preserves: Temporal dependencies
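The forward-chaining scheme can be sketched as a generator of (train, test) index lists; the initial window and horizon sizes are free parameters, not fixed by the method:

```python
def forward_chaining_splits(n, initial, horizon):
    """Expanding-window splits: train on [0, end), test on the next horizon points."""
    end = initial
    while end + horizon <= n:
        yield list(range(end)), list(range(end, end + horizon))
        end += horizon   # grow the training window; never look backward
```

With `n=250, initial=100, horizon=50` this reproduces the three folds listed above (0-indexed): train sizes 100, 150, 200, each tested on the following 50 points.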
Group K-Fold
Problem: Data points are grouped (e.g., patients, images from same source)
Solution: Ensure entire groups stay together in splits
Prevents: Data leakage between folds
Monte Carlo Cross-Validation
Method: Random sampling approach
- Randomly split data multiple times
- Average performance across all splits
- More flexible than k-fold
Advantage: Can control train/validation ratio
Nested Cross-Validation
Purpose: Unbiased model selection + evaluation
Structure:
- Outer loop: Performance estimation
- Inner loop: Hyperparameter tuning
Result: True generalization estimate
Guideline
Choose validation method based on data characteristics and problem constraints.
The Hyperparameter Optimization Problem
Goal: Find optimal hyperparameters $\lambda^*$
$$\begin{aligned}\lambda^* = \arg\min_{\lambda \in \Lambda} \hat{E}_{CV}(\lambda) \\ \hat{E}_{CV}(\lambda) = \frac{1}{k}\sum_{i=1}^k L(f_\lambda^{(i)}, \mathcal{D}_i)\end{aligned}$$
where:
- $\lambda$: hyperparameter vector
- $\Lambda$: search space
- $f_\lambda^{(i)}$: model trained with $\lambda$ on fold $i$
\begin{exampleblock}{Examples}
SVM: $\lambda = (C, \gamma)$
$$\Lambda = [10^{-3}, 10^3] \times [10^{-6}, 10^1]$$
Random Forest:
$$\lambda = (n_{est}, \text{depth}, \text{features})$$
Neural Network:
$$\lambda = (lr, \text{batch}, \text{layers}, \text{dropout})$$
\end{exampleblock}
Challenge
Hyperparameter spaces are often high-dimensional, mixed-type, and expensive to evaluate.
Grid Search Method
Core Principle
Exhaustive search over a discrete grid of hyperparameter combinations.
Grid Search: Mathematical Formulation
Algorithm:
- Define search grid: $\Lambda_{grid} = \{\lambda_1, \lambda_2, \ldots, \lambda_m\}$
- For each $\lambda_j \in \Lambda_{grid}$:
- Compute $\hat{E}_{CV}(\lambda_j)$ using k-fold cross-validation
- Store result: $(\lambda_j, \hat{E}_{CV}(\lambda_j))$
- Return: $\lambda^* = \arg\min_{\lambda_j} \hat{E}_{CV}(\lambda_j)$
Grid Construction
For parameter $\lambda_i$ with range $[a_i, b_i]$:
Linear spacing:
$$\lambda_i^{(j)} = a_i + \frac{j-1}{n_i-1}(b_i - a_i)$$
Log spacing (preferred for scale-invariant parameters):
$$\lambda_i^{(j)} = a_i \cdot \left(\frac{b_i}{a_i}\right)^{\frac{j-1}{n_i-1}}$$
Computational Complexity
Total evaluations: $\prod_{i=1}^p n_i$
With k-fold CV:
$$\text{Total cost} = k \cdot \prod_{i=1}^p n_i \cdot C_{train}$$
where:
- $p$: number of hyperparameters
- $n_i$: grid points for parameter $i$
- $C_{train}$: cost of training one model
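A self-contained sketch of grid construction and exhaustive search; the number of evaluations is exactly the product of per-parameter grid sizes, matching the complexity formula above (function names are illustrative):

```python
import itertools

def log_grid(a, b, n):
    """n log-spaced points from a to b (the log-spacing formula above)."""
    return [a * (b / a) ** (j / (n - 1)) for j in range(n)]

def grid_search(param_grid, evaluate):
    """Evaluate every combination; return the best (lower score is better)."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)          # e.g. a k-fold CV estimate
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

For the SVM example, `log_grid(1e-3, 1e3, 7)` yields the seven powers of ten from $10^{-3}$ to $10^3$, the usual log-uniform grid for `C`.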
Grid Search: Advantages \& Disadvantages
Advantages
- Exhaustive: Guarantees finding the best combination on the grid
- Parallel: Evaluations are independent
- Reproducible: Deterministic results
- Simple: Easy to implement and understand
- Complete coverage: Systematic exploration
Best Practices
- Use log-uniform grids for scale parameters (C, $\gamma$, learning rate)
- Start with coarse grid, refine around promising regions
- Use domain knowledge for range selection
Disadvantages
- Curse of dimensionality: Exponential growth with parameters
- Computational cost: Can be prohibitively expensive
- Grid limitations: Optimal values might lie between grid points
- Uniform sampling: Wastes effort in unpromising regions
- Manual tuning: Requires careful grid design
When NOT to Use
- High-dimensional hyperparameter spaces ($p > 4$)
- Expensive model training ($C_{train}$ very large)
- Continuous optimization required
Rule of Thumb
Grid search is practical for $\leq$ 3-4 hyperparameters with reasonable computational budget.
Random Search vs Grid Search
Key Insight
Random search often finds better solutions than grid search with the same computational budget, especially in high dimensions.
Random Search: Method \& Theory
Algorithm:
- Define parameter distributions: $\lambda_i \sim p_i(\lambda_i)$
- For $j = 1$ to $m$ (budget):
- Sample: $\lambda^{(j)} \sim \prod_{i=1}^p p_i(\lambda_i)$
- Evaluate: $\hat{E}_{CV}(\lambda^{(j)})$
- Return: $\lambda^* = \arg\min_j \hat{E}_{CV}(\lambda^{(j)})$
Common Distributions
Continuous parameters:
- Uniform: $\lambda \sim U(a, b)$
- Log-uniform: $\log \lambda \sim U(\log a, \log b)$
- Normal: $\lambda \sim \mathcal{N}(\mu, \sigma^2)$
Discrete parameters:
- Uniform: $\lambda \sim \text{Uniform}\{v_1, v_2, \ldots, v_k\}$
Theoretical Advantage
If only $d$ out of $p$ parameters matter:
Grid search: Need $n^p$ points for $n$-point resolution
Random search: Need only $n$ points for same effective resolution in important dimensions
Probability of good solution:
$$P(\text{success}) = 1 - (1 - \epsilon)^m$$
where $\epsilon$ is the fraction of the search space occupied by good configurations
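A sketch of random search with log-uniform sampling, together with the success-probability formula above (function names are illustrative):

```python
import math
import random

def log_uniform(a, b, rng):
    """Sample lambda with log(lambda) ~ U(log a, log b)."""
    return math.exp(rng.uniform(math.log(a), math.log(b)))

def random_search(sample, evaluate, m, seed=0):
    """Draw m random configurations and keep the best (lower is better)."""
    rng = random.Random(seed)
    best_score, best_params = float("inf"), None
    for _ in range(m):
        params = sample(rng)
        score = evaluate(params)
        if score < best_score:
            best_score, best_params = score, params
    return best_score, best_params

def p_success(eps, m):
    """Chance at least one of m draws lands in a good region of mass eps."""
    return 1 - (1 - eps) ** m
```

For example, if good configurations occupy 5% of the space, 60 random draws already give `p_success(0.05, 60)` above 0.95, independent of the space's dimensionality.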
Random Search: Advantages \& Best Practices
Advantages
- Dimension-friendly: Scales well to high dimensions
- Efficient: Often finds good solutions quickly
- Flexible: Works with any parameter distribution
- Parallel: Evaluations are independent
- Anytime: Can stop early if budget is limited
- Robust: Less sensitive to irrelevant parameters
Disadvantages
- No guarantee: May miss optimal regions
- Random: Results vary between runs
- Distribution-dependent: Requires good prior knowledge
Best Practices
- Use log-uniform for scale parameters:
$$C \sim \text{LogUniform}(10^{-3}, 10^3)$$
- Set reasonable bounds based on domain knowledge
- Run multiple times and take the best result
- Start with random search, then refine with local methods
Implementation
\begin{algorithmic}[1]
\STATE Set parameter ranges and distributions
\FOR{$i = 1$ to $n_{trials}$}
\STATE Sample $\lambda^{(i)}$ from distributions
\STATE Evaluate $f(\lambda^{(i)})$ using CV
\ENDFOR
\RETURN $\arg\min_i f(\lambda^{(i)})$
\end{algorithmic}
Recommendation
Use random search as default for $> 3$ hyperparameters or when computational budget is limited.
Bayesian Optimization Intuition
Smart Search Strategy
Idea: Use past evaluations to guide future search, balancing exploitation (refine good areas) and exploration (discover new areas).
Bayesian Optimization: Mathematical Framework
Core Components:
1. Surrogate Model
Gaussian Process: $f(\lambda) \sim \mathcal{GP}(m(\lambda), k(\lambda, \lambda'))$
Predictive distribution:
$$\begin{aligned}\mu_n(\lambda) = k_n^T K_n^{-1} \mathbf{y}_n \\ \sigma_n^2(\lambda) = k(\lambda, \lambda) - k_n^T K_n^{-1} k_n\end{aligned}$$
where:
- $k_n$: kernel vector at $\lambda$
- $K_n$: kernel matrix of observed points
- $\mathbf{y}_n$: observed function values
2. Acquisition Function
Expected Improvement (EI):
$$\begin{aligned}EI(\lambda) = \mathbb{E}[\max(f^* - f(\lambda), 0)] \\ = (f^* - \mu(\lambda))\Phi(z) + \sigma(\lambda)\phi(z)\end{aligned}$$
where $z = \frac{f^* - \mu(\lambda)}{\sigma(\lambda)}$, $f^*$ is current best
Upper Confidence Bound (UCB, for maximization; when minimizing, use the lower bound $\mu(\lambda) - \kappa \sigma(\lambda)$):
$$UCB(\lambda) = \mu(\lambda) + \kappa \sigma(\lambda)$$
Algorithm:
- Initialize with random evaluations
- Fit GP to observed data $\{(\lambda_i, f(\lambda_i))\}_{i=1}^n$
- Find $\lambda_{n+1} = \arg\max_\lambda \text{Acquisition}(\lambda)$
- Evaluate $f(\lambda_{n+1})$ and update data
- Repeat until budget exhausted
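The EI formula above has a closed form; the following is a direct transcription using only `math.erf` (minimization convention, where `f_best` is the best observed value):

```python
import math

def expected_improvement(mu, sigma, f_best):
    """EI for minimization: E[max(f_best - f(lam), 0)] under N(mu, sigma^2)."""
    if sigma <= 0:
        return max(f_best - mu, 0.0)       # no uncertainty: deterministic gain
    z = (f_best - mu) / sigma
    phi = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)   # standard normal pdf
    Phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))          # standard normal cdf
    return (f_best - mu) * Phi + sigma * phi
```

The two EI terms capture the exploitation/exploration balance: the first rewards points whose mean already beats `f_best`, the second rewards points with large predictive uncertainty.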
Optuna: Tree-Structured Parzen Estimator (TPE)
TPE Algorithm:
- Split observations by performance quantile $\gamma$:
- Good: $\mathcal{L} = \{\lambda: f(\lambda) < Q_\gamma\}$
- Bad: $\mathcal{G} = \{\lambda: f(\lambda) \geq Q_\gamma\}$
- Model densities:
- $p(\lambda|\mathcal{L})$: density of good configurations
- $p(\lambda|\mathcal{G})$: density of bad configurations
- Acquisition function:
$$a(\lambda) = \frac{p(\lambda|\mathcal{L})}{p(\lambda|\mathcal{G})}$$
- Select: $\lambda_{next} = \arg\max_\lambda a(\lambda)$
TPE Advantages
- Efficient: Lower computational cost than GP
- Flexible: Handles mixed parameter types
- Scalable: Works well in high dimensions
- Robust: Less sensitive to hyperparameters
Optuna Features
- Pruning: Early stopping of unpromising trials
- Multi-objective optimization
- Distributed optimization
- Integration with ML frameworks
Key Insight
TPE focuses sampling on regions where good configurations are dense relative to bad ones.
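A toy 1-D illustration of the TPE idea: split observations at the performance quantile, model each group's density, and score candidates by the density ratio. To keep the sketch short, a single Gaussian per group stands in for the Parzen (kernel density) estimators Optuna actually uses:

```python
import math
import statistics

def normal_pdf(x, mu, sd):
    return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def tpe_acquisition(observations, gamma=0.25):
    """observations: list of (lam, score). Returns a(lam) = p(lam|good)/p(lam|bad)."""
    scores = sorted(s for _, s in observations)
    cut = scores[int(gamma * len(scores))]              # performance quantile Q_gamma
    good = [l for l, s in observations if s < cut] or [observations[0][0]]
    bad = [l for l, s in observations if s >= cut]
    mu_g, sd_g = statistics.mean(good), statistics.pstdev(good) or 1.0
    mu_b, sd_b = statistics.mean(bad), statistics.pstdev(bad) or 1.0
    # Guard the denominator so far-out candidates don't divide by zero.
    return lambda lam: normal_pdf(lam, mu_g, sd_g) / max(normal_pdf(lam, mu_b, sd_b), 1e-300)
```

Candidates near where good configurations cluster score high, and candidates near the bad cluster score low, which is exactly the "dense relative to bad" behavior described above.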
Hyperparameter Search: Performance Comparison
Method Comparison
- Grid Search: Systematic but inefficient
- Random Search: Better exploration, good baseline
- Bayesian Optimization: Fastest convergence, most efficient
Recommendations
- Few parameters ($\leq 3$): Grid search
- Many parameters ($> 3$): Random search
- Expensive evaluations: Bayesian optimization (Optuna)
- Mixed types: Optuna TPE
Best Practice
Start with random search for quick baseline, then use Bayesian optimization for refinement.
Learning Curves
Purpose
Learning curves help diagnose bias (underfitting) vs variance (overfitting) and determine if more data would help.
Learning Curve Analysis
High Bias (Underfitting)
Symptoms:
- Training and validation curves converge
- Both curves plateau at suboptimal level
- Large training error
- Small gap between curves
Solutions:
- Increase model complexity
- Add more features
- Reduce regularization
- Use more flexible model
High Variance (Overfitting)
Symptoms:
- Large gap between training and validation curves
- Training error very low
- Validation error high
- Curves don't converge
Solutions:
- Get more training data
- Reduce model complexity
- Increase regularization
- Use ensemble methods
Mathematical Insight
For a learning algorithm with $m$ training examples:
$$\text{Generalization Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$$
Learning curves show how this decomposes as $m$ increases.
Validation Curves
Purpose
Validation curves show how model performance varies with a single hyperparameter, helping identify optimal values and diagnose overfitting/underfitting.
Interpreting Validation Curves
Typical Pattern Analysis:
Regularization Parameters
(e.g., SVM C parameter, Ridge $\alpha$)
Low values (high regularization):
- High bias, low variance
- Training and validation errors both high
- Underfitting
High values (low regularization):
- Low bias, high variance
- Large gap between curves
- Overfitting
Optimal region: Minimum validation error
Model Complexity Parameters
(e.g., Random Forest depth, SVM $\gamma$)
Mathematical relationship:
$$\text{Complexity} \propto \text{Overfitting Risk}$$
Sweet spot identification:
- Find minimum validation error
- Check if training error is reasonable
- Ensure curves are not diverging rapidly
Cross-validation confidence:
Error bars show $\pm 1$ standard deviation across CV folds
Best Practice
Always plot both training and validation curves with error bars to make informed hyperparameter choices.
Bias-Variance Decomposition
Fundamental Insight
Cross-validation helps us navigate the bias-variance tradeoff by providing robust estimates of generalization performance across different model complexities.
Bias-Variance Mathematical Framework
Decomposition of Expected Test Error:
For a target function $f(x)$ and prediction $\hat{f}(x)$:
$$\mathbb{E}[(\hat{f}(x) - f(x))^2] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2$$
where:
$$\begin{aligned}\text{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x) \\ \text{Var}[\hat{f}(x)] = \mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2] \\ \sigma^2 = \text{Irreducible error (noise)}\end{aligned}$$
Cross-Validation \& Bias-Variance
CV helps estimate:
- Bias: Through training error analysis
- Variance: Through CV fold variability
- Optimal complexity: Bias-variance tradeoff point
$$\hat{E}_{CV} \approx \text{Bias}^2 + \text{Var} + \sigma^2$$
Model Selection Strategy
- Compute $\hat{E}_{CV}(\lambda)$ for different complexities
- Plot learning/validation curves
- Identify minimum of validation curve
- Check bias-variance indicators:
- Training error (bias)
- CV std deviation (variance)
Cross-Validation Best Practices
Data Splitting
- Stratify for classification tasks
- Preserve temporal order for time series
- Group-aware splitting for clustered data
- Test set isolation: Never use for model selection
Hyperparameter Search
- Start with random search for baseline
- Use Bayesian optimization for expensive evaluations
- Log-uniform sampling for scale parameters
- Early stopping for unpromising configurations
Statistical Considerations
- Report mean $\pm$ std of CV scores
- Use paired t-tests for model comparison
- Account for multiple testing when comparing many models
- Consider McNemar's test for classification
Computational Efficiency
- Parallel evaluation of CV folds
- Caching of expensive preprocessing steps
- Progressive validation for large datasets
- Approximate methods when exact CV is too expensive
Golden Rule
Never use the test set for any decision making during model development. Reserve it solely for final performance evaluation.
Common Pitfalls \& How to Avoid Them
Data Leakage
Problem: Information from validation/test leaks into training
Examples:
- Scaling on entire dataset before splitting
- Feature selection using all data
- Temporal leakage in time series
Solution: Apply preprocessing within each CV fold
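The fix can be made concrete for standardization: fit the scaling statistics on the training fold only, then apply them unchanged to the validation fold (`standardize_train_val` is a hypothetical helper, shown for illustration):

```python
def standardize_train_val(train, val):
    """Fit mean/std on the training fold ONLY, then apply to both folds.
    Computing these statistics on all data before splitting would leak
    validation information into training."""
    mu = sum(train) / len(train)
    var = sum((x - mu) ** 2 for x in train) / len(train)
    sd = var ** 0.5 or 1.0          # guard against a constant feature
    scale = lambda xs: [(x - mu) / sd for x in xs]
    return scale(train), scale(val)
```

In scikit-learn terms this is what wrapping the scaler and the model in a single pipeline achieves: the preprocessing is refit inside every CV fold automatically.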
Look-ahead Bias
Problem: Using future information for prediction
Example: Standard CV on time series data
Solution: Time series CV with forward chaining
Selection Bias
Problem: Multiple testing without correction
Example: Comparing 100 models, reporting best CV score
Solution: Bonferroni correction or separate validation set for selection
Inappropriate CV Method
Problems:
- LOOCV with large datasets (unstable)
- Standard CV with imbalanced data
- Ignoring data structure (groups, time)
Solution: Choose CV method based on data characteristics
Validation
Always ask: "Does my validation procedure realistically simulate deployment conditions?"
Hyperparameter Search Space
Key Insight
Hyperparameter optimization landscapes are often complex with multiple local optima, making smart search strategies essential for finding good solutions efficiently.
Summary: Model Validation Hierarchy
Model Validation Hierarchy:
\begin{tabular}{c}
Original Dataset \\
$\downarrow$ \\
Train / Validation / Test Split \\
$\downarrow$ $\downarrow$ $\downarrow$ \\
\begin{tabular}{ccc}
Training Set & Validation Set & Test Set \\
(60-80\%) & (10-20\%) & (10-20\%) \\
$\downarrow$ & $\downarrow$ & $\downarrow$ \\
Cross-Validation & Hyperparameter & Final \\
(Model Selection) & Tuning & Evaluation
\end{tabular}
\end{tabular}
Best Practice Workflow
- Split data
- Use CV for model selection
- Tune hyperparameters on validation set
- Final evaluation on test set (once!)
Key Takeaways
Validation Methods
- Holdout: Fast but high variance
- K-fold CV: Best general-purpose method
- LOOCV: Unbiased but expensive
- Stratified: Essential for imbalanced data
- Time series CV: Preserves temporal order
Method Selection Guide
- Small data (n $<$ 1,000): 10-fold CV or LOOCV
- Large data (n $>$ 10,000): 5-fold CV or holdout
- Imbalanced: Stratified k-fold
- Time series: Forward chaining
- Grouped data: Group k-fold
Hyperparameter Search
- Grid search: $\leq$3 parameters, comprehensive
- Random search: $>$3 parameters, efficient baseline
- Bayesian optimization: Expensive evaluations, smart search
- Use proper distributions: Log-uniform for scale parameters
Critical Principles
- Never use test set for model selection
- Avoid data leakage at all costs
- Report confidence intervals (mean $\pm$ std)
- Match validation to deployment conditions
- Use learning curves to diagnose bias/variance
Remember
Cross-validation is not just about getting a number—it's about making principled decisions about model selection, hyperparameter tuning, and understanding model behavior.
End of Module 06
Cross Validation \& Hyperparameter Tuning
Questions?