CMSC 173

Module 10: Kernel Methods


Kernel Methods

CMSC 173 - Module 10

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

Outline


Review: Supervised Learning Framework

Supervised Learning: Learning from Labeled Examples

[Figure: ../figures/supervised_learning_overview.png]

Core Concept

Given labeled training data, learn $f: \mathcal{X} \to \mathcal{Y}$.

Regression Tasks

  • Continuous output
  • Ex: price prediction

Classification Tasks

  • Discrete labels
  • Ex: spam detection

Comparison of Supervised Learning Methods

  • Linear Regression — Model: $y = w^T x + b$; Loss: sum-of-squares, $\sum_{i=1}^N (y_i - w^T x_i - b)^2$
  • Logistic Regression — Model: $P(y=1|x) = \sigma(w^T x + b)$, where $\sigma(z) = \frac{1}{1+e^{-z}}$; Loss: cross-entropy, $-\sum_{i=1}^N \left[ y_i \log\sigma(w^T x_i + b) + (1-y_i)\log(1-\sigma(w^T x_i + b)) \right]$
  • Support Vector Machine — Model: $y = \text{sign}(w^T x + b)$; Loss: hinge + L2 regularization, $\frac{1}{2}\|w\|^2 + C\sum_i \max(0, 1-y_i(w^T x_i + b))$

Today's Focus

We'll explore how Support Vector Machines use the kernel trick to solve non-linear problems while maintaining computational efficiency.

What are Kernel Methods?

Definition

Kernel methods are a class of algorithms that use kernel functions to operate in high-dimensional feature spaces without explicitly computing the coordinates in that space.
Key Idea:
  • Transform data to higher dimensions where it becomes linearly separable
  • Use kernel trick to avoid explicit transformation
  • Compute inner products efficiently in feature space

Why Kernel Methods?

Many real-world problems are not linearly separable in their original feature space.
Common Applications:
  • Classification: Support Vector Machines
  • Regression: Support Vector Regression
  • Dimensionality Reduction: Kernel PCA
  • Clustering: Kernel K-means
Advantages:
  • Handle non-linear relationships
  • Computational efficiency via kernel trick
  • Strong theoretical foundations
  • Flexible and powerful
Examples:
  • Image classification
  • Text analysis
  • Bioinformatics
  • Time series analysis

Linear vs Non-linear Separability

[Figure: ../figures/linear_vs_nonlinear_data.png]

Linearly Separable Data

Characteristics:
  • Classes form distinct clusters
  • Linear boundary separates perfectly
  • Decision rule: $w^T x + b = 0$
Advantages:
  • Simple and interpretable
  • Fast training and prediction
  • Good generalization

Non-linearly Separable Data

Characteristics:
  • Complex data patterns (e.g., concentric circles)
  • No linear boundary can separate
  • Requires non-linear decision boundary
Solution: Kernel Methods
  • Transform to higher dimensions
  • Apply linear methods in new space
  • Kernel trick for efficiency

Rosenblatt's Perceptron (1957)

The Perceptron Algorithm

Frank Rosenblatt introduced the first learning algorithm for binary classification in 1957.
Perceptron Model: $$f(x) = \text{sign}(w^T x + b), \qquad \text{Output} = \begin{cases} +1 & \text{if } w^T x + b \geq 0 \\ -1 & \text{if } w^T x + b < 0 \end{cases}$$ Learning Rule: $$\begin{aligned}w^{(t+1)} &= w^{(t)} + \eta \, y_i \, x_i \\ b^{(t+1)} &= b^{(t)} + \eta \, y_i\end{aligned}$$ Update only when $x_i$ is misclassified.
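As a minimal sketch, the update rule above can be implemented in a few lines of NumPy (the toy data, learning rate, and epoch cap are illustrative choices, not part of the original algorithm statement):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=100):
    """Rosenblatt's perceptron: update w, b only on misclassified points."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w += eta * yi * xi       # w^(t+1) = w^(t) + eta * y_i * x_i
                b += eta * yi            # b^(t+1) = b^(t) + eta * y_i
                errors += 1
        if errors == 0:                  # converged: every point classified
            break
    return w, b

# Linearly separable toy data: +1 in the upper-right, -1 in the lower-left
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
assert all(np.sign(w @ xi + b) == yi for xi, yi in zip(X, y))
```

On separable data such as this, the convergence theorem on the next slide guarantees the loop terminates with zero errors.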

Perceptron Convergence Theorem

If data is linearly separable, the perceptron algorithm will converge to a solution in finite steps.
Historical Impact:
  • 1957: Perceptron introduced
  • 1969: Limitations exposed (XOR problem)
  • 1980s-1990s: SVM development
  • Key insight: Maximum margin principle
Motivation for SVMs:
  • Address perceptron limitations
  • Better generalization bounds
  • Handle non-linearly separable data

Perceptron vs SVM: A Visual Comparison

[Figure: ../figures/perceptron_vs_svm.png]

Perceptron Approach

  • Goal: Find any separating hyperplane
  • Method: Iterative weight updates
  • Result: Multiple valid solutions
  • Issue: No optimality criterion

SVM Approach

  • Goal: Find optimal separating hyperplane
  • Method: Maximize margin width
  • Result: Unique optimal solution
  • Benefit: Better generalization

Key Insight

SVMs choose the hyperplane that maximizes the distance to the nearest data points (support vectors), leading to better generalization performance.

From Linear to Non-linear: Evolution of Ideas

Timeline of Development:
  • 1957: Rosenblatt's perceptron
  • 1969: Minsky & Papert expose limitations
  • 1979: Least squares SVM ideas
  • 1992: Boser, Guyon & Vapnik: the kernel trick
  • 1995: Cortes & Vapnik: soft margin
  • 1996: Schölkopf: kernel methods

Minsky & Papert (1969)

Showed that perceptrons cannot solve non-linearly separable problems like XOR, leading to the "AI winter."

The Kernel Revolution

The kernel trick (1992) revived interest by enabling non-linear classification while maintaining computational efficiency.
The XOR Problem:

[Figure: ../figures/xor_problem.png]

Problem Statement

  • Input: $(0,0) \to 0$, $(0,1) \to 1$
  • Input: $(1,0) \to 1$, $(1,1) \to 0$
  • No linear separator exists
Kernel Solution: Transform to 3D space: $$\phi(x_1, x_2) = (x_1, x_2, x_1 x_2)$$ Result: Linearly separable in 3D!
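The slide's feature map can be checked numerically: after mapping to 3D, a single plane separates the XOR classes. The separator coefficients below are chosen by hand for illustration; any plane that puts the point $(1,1,1)$ on the same side as $(0,0,0)$ works:

```python
import numpy as np

# XOR in 2D: no linear separator exists
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Feature map from the slide: phi(x1, x2) = (x1, x2, x1*x2)
Phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

# Hand-picked plane in 3D: score = x1 + x2 - 2*x3, threshold at 0.5
scores = Phi @ np.array([1.0, 1.0, -2.0])
pred = (scores > 0.5).astype(int)
assert (pred == y).all()      # linearly separable after the transform
```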

Support Vector Machines: Overview

Core Concept

SVM finds the optimal hyperplane that separates classes with the maximum margin.
Key Components:
  • Decision boundary: Hyperplane separating classes
  • Margin: Distance between boundary and closest points
  • Support vectors: Points defining the margin
Mathematical Formulation: $$\begin{aligned}\text{Hyperplane: } w^T x + b = 0 \\ \text{Margin: } \frac{2}{\|w\|}\end{aligned}$$

Goal

Maximize margin while correctly classifying all training points.

[Figure: ../figures/linear_svm_margins.png]

Why Maximum Margin?
  • Generalization: Better performance on unseen data
  • Robustness: Less sensitive to noise
  • Uniqueness: Single optimal solution

Geometric Interpretation of Margin

Margin Definition: For a hyperplane $w^T x + b = 0$, the distance from point $x_i$ to the hyperplane is $$d_i = \frac{|w^T x_i + b|}{\|w\|}$$ Margin Width: under the canonical scaling below, $$\text{Margin} = \min_{i} d_i = \frac{1}{\|w\|}$$

Canonical Form

Scale $w$ and $b$ so that for the closest points: $$w^T x_i + b = \pm 1$$ Then margin becomes $\frac{2}{\|w\|}$.
Visual Interpretation:

[Figure: ../figures/svm_geometry.png]

Key Elements

  • Decision boundary: $w^T x + b = 0$
  • Margin boundaries: $w^T x + b = \pm 1$
  • Support vectors: Closest points
  • Margin width: $\frac{2}{\|w\|}$

SVM Optimization: Primal Problem

Hard Margin SVM:

Primal Problem

$$\begin{aligned}&\min_{w,b} \; \frac{1}{2}\|w\|^2 \\ &\text{subject to } y_i(w^T x_i + b) \geq 1, \quad i = 1,\dots,n\end{aligned}$$
Interpretation:
  • Objective: Minimize $\|w\|^2$ $\Rightarrow$ Maximize margin $\frac{2}{\|w\|}$
  • Constraints: All points correctly classified with margin $\geq 1$
Problem Type:
  • Quadratic objective function
  • Linear constraints
  • Convex optimization problem
  • Unique global optimum
Soft Margin SVM:

Primal Problem with Slack Variables

$$\begin{aligned}&\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \\ &\text{subject to } y_i(w^T x_i + b) \geq 1 - \xi_i, \\ &\phantom{\text{subject to }} \xi_i \geq 0, \quad i = 1,\dots,n\end{aligned}$$
Slack Variables $\xi_i$:
  • $\xi_i = 0$: Point correctly classified with margin $\geq 1$
  • $0 < \xi_i < 1$: Point correctly classified but within margin
  • $\xi_i \geq 1$: Point misclassified
Regularization Parameter $C$:
  • Large $C$: heavy penalty for violations (approaches hard margin)
  • Small $C$: allows more violations (softer margin)
  • Controls bias-variance tradeoff

SVM Optimization: Dual Problem

Lagrangian Formulation:

Lagrangian

$$\begin{aligned}L = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \\ - \sum_{i=1}^{n}\alpha_i[y_i(w^T x_i + b) - 1 + \xi_i] \\ - \sum_{i=1}^{n}\mu_i\xi_i\end{aligned}$$
KKT Conditions: $$\begin{aligned}\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{n}\alpha_i y_i x_i \\ \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{n}\alpha_i y_i = 0 \\ \frac{\partial L}{\partial \xi_i} = 0 \Rightarrow \alpha_i + \mu_i = C\end{aligned}$$
Dual Problem:

Dual Formulation

$$\begin{aligned}&\max_{\alpha} \; \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j x_i^T x_j \\ &\text{subject to } \sum_{i=1}^{n}\alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C, \quad i = 1,\dots,n\end{aligned}$$
Key Insights:
  • Only depends on inner products $x_i^T x_j$
  • Sparse solution: many $\alpha_i = 0$
  • Support vectors: $\alpha_i > 0$

Kernel Trick Preview

Replace $x_i^T x_j$ with $K(x_i, x_j)$ to work in higher dimensions!
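One way to see this in code: scikit-learn's SVC accepts a precomputed Gram matrix, so the learner only ever sees kernel values, never raw features. A sketch with synthetic data and an illustrative $\gamma$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # circular boundary

# The dual touches the data only through inner products, so we can
# replace x_i^T x_j with any valid kernel K(x_i, x_j) -- here RBF.
gamma = 1.0
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)                       # 40 x 40 Gram matrix

clf = SVC(kernel="precomputed").fit(K, y)
train_acc = clf.score(K, y)
assert train_acc > 0.7          # fits a boundary no linear model can
```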

Hard vs Soft Margin SVMs

[Figure: ../figures/hard_vs_soft_margin.png]

Hard Margin (C = 1000)

  • No training errors allowed
  • Requires linearly separable data
  • May overfit to training data
  • Complex decision boundary

Soft Margin (C = 0.1)

  • Allows some training errors
  • Works with non-separable data
  • Better generalization
  • Smoother decision boundary

The Kernel Trick

Feature Mapping: Transform input space $\mathcal{X}$ to feature space $\mathcal{H}$: $$\phi: \mathcal{X} \rightarrow \mathcal{H}$$ Example: Polynomial Features $$\begin{aligned}x = (x_1, x_2) \\ \phi(x) = (1, x_1, x_2, x_1^2, x_2^2, \sqrt{2}x_1 x_2)\end{aligned}$$

Kernel Function

Instead of computing $\phi(x_i)^T \phi(x_j)$ explicitly: $$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$

Key Insight

We can compute inner products in high-dimensional space without explicitly mapping to that space!
Computational Advantage: Without Kernel Trick:
  • Map: $\mathcal{O}(d')$ where $d'$ is feature space dimension
  • Inner product: $\mathcal{O}(d')$
  • Total: $\mathcal{O}(d')$ per pair
With Kernel Trick:
  • Direct kernel computation: $\mathcal{O}(d)$ where $d$ is input dimension
  • No explicit mapping needed
  • Total: $\mathcal{O}(d)$ per pair
Example: Polynomial Kernel $$\begin{aligned}K(x, x') = (x^T x' + 1)^2 \\ = 1 + 2x^T x' + (x^T x')^2\end{aligned}$$ This implicitly computes inner product in a $(d+1)(d+2)/2$ dimensional space using only $\mathcal{O}(d)$ operations!

Kernel Trick Visualization

[Figure: ../figures/kernel_trick_transformation.png]

Original Space

Non-linearly separable data in 2D cannot be separated by a straight line.

Transformed Space

Data becomes linearly separable in 3D after polynomial transformation $\phi(x) = (x_1, x_2, x_1^2 + x_2^2)$.

Kernel Result

RBF kernel achieves non-linear separation directly in original 2D space.

Common Kernel Functions

[Figure: ../figures/kernel_functions.png]

Common Kernel Functions: Definitions

Linear Kernel

$$K(x, x') = x^T x'$$
  • Equivalent to no transformation
  • Fastest computation
  • Good baseline

Polynomial Kernel

$$K(x, x') = (\gamma x^T x' + r)^d$$
  • Captures interactions to degree $d$
  • Parameters: $\gamma, r, d$

RBF (Gaussian) Kernel

$$K(x, x') = \exp\left(-\gamma \|x - x'\|^2\right)$$
  • Most popular kernel
  • Infinite-dimensional feature space
  • Smooth, localized similarity

Sigmoid Kernel

$$K(x, x') = \tanh(\gamma x^T x' + r)$$
  • Neural network inspired
  • Less commonly used
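The four definitions above translate directly to code; the parameter defaults here are illustrative, not canonical:

```python
import numpy as np

def linear(x, z):
    return x @ z

def polynomial(x, z, gamma=1.0, r=1.0, d=3):
    return (gamma * (x @ z) + r) ** d

def rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid(x, z, gamma=0.1, r=0.0):
    return np.tanh(gamma * (x @ z) + r)

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])
assert linear(x, z) == 5.0                 # 1*3 + 2*1
assert np.isclose(rbf(x, x), 1.0)          # identical points: maximal similarity
assert np.isclose(polynomial(x, z, d=2), 36.0)
assert -1.0 < sigmoid(x, z) < 1.0          # tanh keeps values bounded
```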

Comparing Different Kernels

[Figure: ../figures/different_kernels.png]

Kernel Selection:
  • Linear: High-dim data
  • Polynomial: Known relationships
  • RBF: General-purpose
  • Sigmoid: Neural network
Performance Factors:
  • Data dimensionality
  • Training set size
  • Noise level
  • Domain knowledge

RBF Kernel Parameter Effects

[Figure: ../figures/rbf_kernel_parameters.png]

Gamma ($\gamma$): small $\Rightarrow$ smooth boundary; large $\Rightarrow$ complex boundary.
Regularization ($C$): small $\Rightarrow$ soft margin; large $\Rightarrow$ hard margin.

Mercer's Theorem: Valid Kernels

Mercer's Theorem

A function $K(x, x')$ is a valid kernel (corresponds to an inner product in some feature space) if and only if for any finite set of points $\{x_1, …, x_n\}$, the kernel matrix $\mathbf{K}$ is symmetric and positive semi-definite.
Kernel Matrix: $$\mathbf{K}_{ij} = K(x_i, x_j)$$ Positive Semi-definite: $$\mathbf{K} \succeq 0 \iff \sum_{i,j} c_i c_j K(x_i, x_j) \geq 0$$ for all real numbers $c_1, …, c_n$.

Practical Implication

Not every function can be used as a kernel! Only those satisfying Mercer's condition.
Properties of Valid Kernels:
  • Symmetry: $K(x, x') = K(x', x)$
  • Positive semi-definiteness: Kernel matrix $\mathbf{K} \succeq 0$
Constructing New Kernels: If $K_1$ and $K_2$ are valid kernels, then: $$\begin{aligned}K(x, x') = K_1(x, x') + K_2(x, x') \\ K(x, x') = c \cdot K_1(x, x'), c > 0 \\ K(x, x') = K_1(x, x') \cdot K_2(x, x') \\ K(x, x') = \exp(K_1(x, x')) \\ K(x, x') = f(x) K_1(x, x') f(x')\end{aligned}$$ Domain-Specific Kernels:
  • String kernels for text
  • Graph kernels for networks
  • Tree kernels for structured data

Examples of Valid and Invalid Kernels

Valid Kernels:

Linear

$$K(x, x') = x^T x'$$

Polynomial

$$K(x, x') = (x^T x' + 1)^d, d \geq 1$$

RBF/Gaussian

$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$

Exponential

$$K(x, x') = \exp(-\gamma \|x - x'\|), \gamma > 0$$
Why Valid? All can be shown to correspond to inner products in (possibly infinite-dimensional) feature spaces.
Invalid Kernels:

Negative Power

$$K(x, x') = (x^T x')^{-1}$$ Not positive semi-definite.

Logarithmic

$$K(x, x') = \log(x^T x' + 1)$$ Can produce negative eigenvalues.
Checking Validity:
  1. Theoretical: Prove positive semi-definiteness
  2. Computational: Check eigenvalues of kernel matrix
  3. Construction: Build from known valid kernels

Practical Note

Most standard kernels in ML libraries are guaranteed to be valid. Custom kernels need verification.
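The computational check (item 2 above) is a short eigenvalue test. The squared-distance "kernel" below is a deliberately invalid example: its Gram matrix has zero trace but nonzero entries, so it must have a negative eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))

def min_eigenvalue(kernel):
    """Smallest eigenvalue of the kernel matrix K_ij = K(x_i, x_j)."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    K = 0.5 * (K + K.T)                  # symmetrize against round-off
    return np.linalg.eigvalsh(K).min()

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
sq_dist = lambda a, b: np.sum((a - b) ** 2)   # NOT a valid kernel

assert min_eigenvalue(rbf) > -1e-8       # PSD: passes Mercer's condition
assert min_eigenvalue(sq_dist) < 0       # indefinite: fails the check
```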

Multiple Kernel Learning (MKL)

Motivation

Different kernels capture different aspects of data:
  • RBF kernel: local similarities
  • Linear kernel: global structure
  • Polynomial kernel: feature interactions
Idea: Combine multiple kernels to get better performance than any single kernel. Linear Combination: $$K(x, x') = \sum_{m=1}^M \beta_m K_m(x, x')$$ where $\beta_m \geq 0$ and $\sum_{m=1}^M \beta_m = 1$. Applications:
  • Multi-modal data (text + images)
  • Feature selection
  • Domain adaptation
MKL Optimization:

Joint Optimization

$$\begin{aligned}\min_{w,b,\xi,\beta} \frac{1}{2}\sum_{m=1}^M \frac{\|w_m\|^2}{\beta_m} + C\sum_{i=1}^n \xi_i \\ \text{subject to} y_i\left(\sum_{m=1}^M w_m^T \phi_m(x_i) + b\right) \geq 1 - \xi_i \\ \xi_i \geq 0, \beta_m \geq 0, \sum_{m=1}^M \beta_m = 1\end{aligned}$$
Solution Methods:
  • Alternating optimization: Fix $\beta$, solve for $w$; fix $w$, solve for $\beta$
  • Semi-definite programming: Convex formulation
  • Gradient-based methods: Efficient for large-scale
Kernel Weight Interpretation:
  • $\beta_m$ close to 1: Kernel $m$ is most important
  • $\beta_m$ close to 0: Kernel $m$ is irrelevant
  • Automatic feature/kernel selection

MKL Example: Combining Kernels

Example Setup: Consider combining three kernels: $$\begin{aligned}K_1(x, x') = x^T x' \text{(Linear)} \\ K_2(x, x') = (x^T x' + 1)^2 \text{(Polynomial)} \\ K_3(x, x') = \exp(-\|x-x'\|^2) \text{(RBF)}\end{aligned}$$ Combined Kernel: $$K(x, x') = \beta_1 K_1(x, x') + \beta_2 K_2(x, x') + \beta_3 K_3(x, x')$$ Learning Process:
  1. Start with equal weights: $\beta_1 = \beta_2 = \beta_3 = \frac{1}{3}$
  2. Iteratively optimize weights and SVM parameters
  3. Converge to optimal combination
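Step 1 can be sketched numerically on synthetic data: building the three Gram matrices, combining them with equal weights, and confirming the combination is still a valid (PSD) kernel, per the closure properties from Mercer's theorem:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))

K_lin = X @ X.T                                  # linear
K_poly = (X @ X.T + 1) ** 2                      # polynomial, degree 2
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_rbf = np.exp(-sq)                              # RBF, gamma = 1

beta = np.array([1 / 3, 1 / 3, 1 / 3])           # step 1: equal weights
K = beta[0] * K_lin + beta[1] * K_poly + beta[2] * K_rbf

# A convex combination of valid kernels is itself valid (still PSD)
min_eig = np.linalg.eigvalsh(0.5 * (K + K.T)).min()
assert min_eig > -1e-8
assert np.isclose(beta.sum(), 1.0)               # weights on the simplex
```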
Sample Results:
  • Linear: weight 0.1, accuracy 0.78
  • Polynomial: weight 0.3, accuracy 0.82
  • RBF: weight 0.6, accuracy 0.85
  • Combined MKL: accuracy 0.89
Advantages:
  • Better performance than individual kernels
  • Automatic selection of relevant kernels
  • Interpretability through weights
Challenges:
  • Increased computational complexity
  • More hyperparameters to tune
  • Risk of overfitting with many kernels

Multi-class SVM Strategies

[Figure: ../figures/multiclass_strategies.png]

One-vs-Rest (OvR)

  • Train $k$ binary classifiers
  • Each separates one class from all others
  • Prediction: class with highest score

One-vs-One (OvO)

  • Train $\binom{k}{2}$ binary classifiers
  • Each separates pair of classes
  • Prediction: majority voting

Direct Multiclass

  • Single optimization problem
  • Simultaneous separation
  • More complex but unified

One-vs-Rest Detailed Analysis

[Figure: ../figures/ovr_detailed.png]

OvR Strategy: For $k$ classes:
  1. Class 0 vs {1,2,3}
  2. Class 1 vs {0,2,3}
  3. Class 2 vs {0,1,3}
  4. Class 3 vs {0,1,2}
Prediction: $$\hat{y} = \arg\max_{j} f_j(x)$$
Advantages:
  • Simple implementation
  • Linear scaling with $k$
  • Any binary classifier
Disadvantages:
  • Class imbalance
  • Inconsistent regions
Complexity: $$\mathcal{O}(k \cdot \text{binary})$$
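The OvR strategy above can be sketched by hand with scikit-learn's binary SVC: one classifier per class, with the argmax prediction rule. The three-blob dataset is synthetic:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# Three well-separated Gaussian blobs, 20 points each
centers = np.array([[0, 0], [5, 0], [0, 5]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
y = np.repeat([0, 1, 2], 20)

# One-vs-Rest by hand: binary SVM per class ("class j vs rest")
clfs = [SVC(kernel="rbf").fit(X, (y == j).astype(int)) for j in range(3)]

# Prediction: \hat{y} = argmax_j f_j(x) over the decision scores
scores = np.column_stack([c.decision_function(X) for c in clfs])
pred = scores.argmax(axis=1)
assert (pred == y).mean() > 0.9
```

In practice `sklearn.multiclass.OneVsRestClassifier` wraps exactly this loop.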

Multi-class Kernel Comparison

[Figure: ../figures/kernel_multiclass_comparison.png]

Performance:
  • Linear: High-dim data
  • Polynomial: Feature interactions
  • RBF: Most flexible
Selection Criteria:
  • Dataset size
  • Computational cost
  • Cross-validation

Best Practice

Start with RBF kernel and tune $C$, $\gamma$ via cross-validation.

Support Vector Regression (SVR)

SVR Concept

Extend SVM to regression by finding a function that deviates from target values by at most $\epsilon$, while being as flat as possible.
Linear SVR: $$f(x) = w^T x + b$$ Optimization Problem: $$\begin{aligned}&\min_{w,b,\xi,\xi^*} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n(\xi_i + \xi_i^*) \\ &\text{subject to } y_i - w^T x_i - b \leq \epsilon + \xi_i, \\ &\phantom{\text{subject to }} w^T x_i + b - y_i \leq \epsilon + \xi_i^*, \\ &\phantom{\text{subject to }} \xi_i, \xi_i^* \geq 0\end{aligned}$$ $\epsilon$-insensitive Loss: $$L_\epsilon(y, f(x)) = \max(0, |y - f(x)| - \epsilon)$$
Key Parameters:
  • $\epsilon$: Width of insensitive zone
  • $C$: Regularization parameter
  • Kernel parameters: $\gamma$ for RBF, etc.
Dual Formulation: $$f(x) = \sum_{i=1}^n (\alpha_i - \alpha_i^*) K(x_i, x) + b$$ where $\alpha_i, \alpha_i^* \geq 0$ are Lagrange multipliers. Support Vectors:
  • Points outside $\epsilon$-tube
  • $\alpha_i > 0$ or $\alpha_i^* > 0$
  • Determine the regression function

Sparsity

Many $\alpha_i = \alpha_i^* = 0$, leading to sparse solutions.
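The $\epsilon$-insensitive loss is simple enough to compute directly; the targets and predictions below are illustrative values showing how residuals inside the tube incur zero loss:

```python
import numpy as np

def eps_insensitive(y, f, eps=0.1):
    """L_eps(y, f(x)) = max(0, |y - f(x)| - eps): zero inside the eps-tube."""
    return np.maximum(0.0, np.abs(y - f) - eps)

y = np.array([1.0, 1.0, 1.0])
f = np.array([1.05, 1.10, 1.50])          # residuals 0.05, 0.10, 0.50
loss = eps_insensitive(y, f, eps=0.1)
assert np.allclose(loss, [0.0, 0.0, 0.4])  # only the point outside the tube
```

Points with zero loss contribute nothing to the solution, which is exactly why many $\alpha_i = \alpha_i^* = 0$.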

SVR Demonstration

[Figure: ../figures/svr_demonstration.png]

Linear SVR

  • Simple linear relationship
  • Good for linear trends
  • Fast computation

Polynomial SVR

  • Captures polynomial trends
  • Risk of overfitting
  • Degree selection important

RBF SVR

  • Most flexible
  • Handles non-linear patterns
  • Requires parameter tuning

Effect of $\epsilon$ Parameter in SVR

[Figure: ../figures/epsilon_parameter_effect.png]

$\epsilon$ Parameter Effects:
  • Small $\epsilon$ (0.01): Tight fit, many support vectors
  • Medium $\epsilon$ (0.1): Balanced complexity
  • Large $\epsilon$ (0.5): Loose fit, fewer support vectors
Trade-offs:
  • Small $\epsilon$: Low bias, high variance
  • Large $\epsilon$: High bias, low variance
  • Sparsity: Larger $\epsilon$ $\Rightarrow$ fewer support vectors
Selection Guidelines:
  • Cross-validation for optimal $\epsilon$
  • Consider noise level in data
  • Balance accuracy vs complexity
Practical Values:
  • Start with $\epsilon = 0.1$
  • Scale with target variable range
  • Grid search with $C$ and kernel parameters

Rule of Thumb

Set $\epsilon \approx \frac{\text{std}(y)}{10}$ as starting point.

Kernel Ridge Regression vs SVR

[Figure: ../figures/kernel_ridge_vs_svr.png]

Kernel Ridge Regression vs SVR: Comparison

Kernel Ridge

Objective: $$\min_\alpha \|\mathbf{K}\alpha - y\|^2 + \lambda \alpha^T \mathbf{K} \alpha$$ Solution: $$\alpha = (\mathbf{K} + \lambda \mathbf{I})^{-1} y$$
Properties:
  • Non-sparse
  • Closed-form
  • Fast for small data

SVR

Objective: $$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_i (\xi_i + \xi_i^*)$$ Constraints: $$|y_i - f(x_i)| \leq \epsilon + \xi_i$$
Properties:
  • Sparse (SVs)
  • Robust to outliers
  • Quadratic program

Regularization in Kernel Regression

[Figure: ../figures/regularization_comparison.png]

Kernel Ridge ($\alpha$):
  • Small: Overfitting risk
  • Medium: Balanced
  • Large: Underfitting risk
$$\min_f \sum_{i} (y_i - f(x_i))^2 + \alpha \|f\|^2$$
SVR ($C$):
  • High: Complex model
  • Medium: Balanced
  • Low: Simple model
Relationship: $C \approx \frac{1}{\alpha}$

Worked Example: RBF Kernel Computation

Problem Setup: Given two points: $$\begin{aligned}x_1 = (1, 2) \\ x_2 = (3, 1)\end{aligned}$$ Compute RBF kernel with $\gamma = 0.5$: $$K(x_1, x_2) = \exp(-\gamma \|x_1 - x_2\|^2)$$ Step 1: Compute distance $$\begin{aligned}x_1 - x_2 = (1, 2) - (3, 1) = (-2, 1) \\ \|x_1 - x_2\|^2 = (-2)^2 + 1^2 = 4 + 1 = 5\end{aligned}$$ Step 2: Apply kernel $$\begin{aligned}K(x_1, x_2) = \exp(-0.5 \times 5) \\ = \exp(-2.5) \\ \approx 0.082\end{aligned}$$
Interpretation:
  • Points are moderately far apart
  • Kernel value is small (0.082)
  • Indicates low similarity
Compare with closer points: For $x_1 = (1, 2)$ and $x_3 = (1.1, 2.1)$: $$\begin{aligned}\|x_1 - x_3\|^2 = (0.1)^2 + (0.1)^2 = 0.02 \\ K(x_1, x_3) = \exp(-0.5 \times 0.02) = \exp(-0.01) \\ \approx 0.99\end{aligned}$$ Effect of $\gamma$:
  • Large $\gamma$: Rapid decay, local influence
  • Small $\gamma$: Slow decay, global influence

Key Insight

RBF kernel measures similarity through Euclidean distance in input space.
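The worked example above checks out in code:

```python
import numpy as np

def rbf(x, z, gamma):
    return np.exp(-gamma * np.sum((np.asarray(x, float) - np.asarray(z, float)) ** 2))

k12 = rbf([1, 2], [3, 1], gamma=0.5)     # squared distance 5 -> exp(-2.5)
k13 = rbf([1, 2], [1.1, 2.1], gamma=0.5) # squared distance 0.02 -> exp(-0.01)

assert np.isclose(k12, np.exp(-2.5))
assert round(float(k12), 3) == 0.082      # moderately far: low similarity
assert round(float(k13), 2) == 0.99       # very close: high similarity
```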

Practical Implementation Tips

Data Preprocessing:
  • Feature Scaling: Critical for RBF kernels
  • Normalization: StandardScaler or MinMaxScaler
  • Missing Values: Handle before kernel computation
Hyperparameter Tuning:

Grid Search Example

param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf', 'poly', 'linear']
}
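A grid like this plugs into scikit-learn's GridSearchCV. In the sketch below the dataset is synthetic, and scaling is done inside a Pipeline so each cross-validation fold is scaled independently (parameter names gain a `svc__` prefix for the pipeline step):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
    "svc__kernel": ["rbf", "poly", "linear"],
}
# Scaler inside the pipeline: no information leaks across CV folds
pipe = make_pipeline(StandardScaler(), SVC())
search = GridSearchCV(pipe, param_grid, cv=3).fit(X, y)

assert search.best_score_ > 0.7
assert "svc__C" in search.best_params_
```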
Performance Considerations:
  • Linear kernel: $\mathcal{O}(n \times d)$
  • RBF kernel: $\mathcal{O}(n \times d)$ per evaluation
  • Training complexity: $\mathcal{O}(n^2)$ to $\mathcal{O}(n^3)$
Model Selection:
  1. Start with RBF kernel
  2. Use cross-validation
  3. Compare with linear kernel
  4. Consider computational constraints
Common Pitfalls:

Avoid These

  • Forgetting to scale features
  • Using default parameters
  • Ignoring class imbalance
  • Overfitting with complex kernels
Debugging Tips:
  • Check kernel matrix properties
  • Visualize decision boundaries
  • Monitor support vector counts
  • Validate on holdout set
Software Libraries:
  • scikit-learn: General purpose
  • libsvm: High performance
  • thundersvm: GPU acceleration

Understanding Model Types

Parametric Models:

Definition

Fixed number of parameters independent of training set size. Make strong assumptions about functional form.
Examples:
  • Linear Regression: $f(x) = w^T x + b$
  • Logistic Regression: $p = \sigma(w^T x + b)$
  • Perceptron: Fixed decision boundary
  • Neural Networks: Fixed architecture
Characteristics:
  • Fast training and prediction
  • Strong inductive bias
  • May underfit complex data
  • Interpretable parameters
Non-parametric Models:

Definition

Number of parameters grows with training data size. Make minimal assumptions about functional form.
Examples:
  • k-NN: Stores all training data
  • Decision Trees: Adaptive structure
  • Kernel Methods: Support vector representation
  • Gaussian Processes: Infinite parameters
Characteristics:
  • Flexible representation
  • Can fit complex patterns
  • Risk of overfitting
  • Higher computational cost

Kernel Methods: The Non-parametric Perspective

Why SVMs are Non-parametric:

Key Insight

SVM decision function depends on support vectors, whose number grows with data complexity, not fixed in advance.
Decision Function: $$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$$
  • Number of support vectors $|SV|$ varies
  • Complex data $\Rightarrow$ more support vectors
  • Simple data $\Rightarrow$ fewer support vectors
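A fitted scikit-learn SVC exposes exactly these quantities: `dual_coef_` stores $\alpha_i y_i$ for the support vectors only, and the decision function can be rebuilt by hand from them (data and $\gamma$ below are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # non-linear boundary

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# f(x) = sum_{i in SV} alpha_i y_i K(x_i, x) + b, summing over SVs only
x_new = np.array([0.0, 0.0])
sq = ((clf.support_vectors_ - x_new) ** 2).sum(axis=1)
k = np.exp(-0.5 * sq)                                # K(x_i, x_new), gamma=0.5
f_manual = clf.dual_coef_[0] @ k + clf.intercept_[0]

assert np.isclose(f_manual, clf.decision_function([x_new])[0])
assert len(clf.support_) < len(X)    # sparse: only some points are SVs
```

The second assertion is the non-parametric point: how many support vectors survive is determined by the data, not fixed in advance.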
Adaptive Complexity:
  • Model complexity adapts to data
  • Automatic feature selection
  • Sparse representation via support vectors
Comparison with Other Methods:

Parametric Linear Classifier

$$f(x) = w^T x + b$$ Fixed $d+1$ parameters regardless of training set size.

Non-parametric SVM

$$f(x) = \sum_{i=1}^{n_{sv}} \alpha_i y_i K(x_i, x) + b$$ $n_{sv}$ support vectors determined by data.
Benefits of Non-parametric Approach:
  • Flexibility: No assumptions about decision boundary shape
  • Universality: Can approximate any function (with appropriate kernel)
  • Robustness: Less sensitive to model specification

Kernel Functions and Function Spaces

Reproducing Kernel Hilbert Space (RKHS):

Mathematical Framework

Kernel $K$ defines an infinite-dimensional feature space $\mathcal{H}$ where linear methods become non-linear in original space.
Key Properties: $$\begin{aligned}\phi: \mathcal{X} \to \mathcal{H} \\ K(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}\end{aligned}$$ Universal Approximation:
  • RBF kernels are universal approximators
  • Can represent any continuous function
  • Given sufficient training data

Non-parametric Power

Kernel methods can learn arbitrarily complex decision boundaries without specifying the form in advance.
Practical Implications: Model Selection Strategy:
  • Start Simple: Linear kernel first
  • Add Complexity: Polynomial → RBF
  • Cross-validate: Choose optimal kernel and parameters
Trade-offs:

Parametric Advantage

  • Fast training and prediction
  • Lower memory requirements
  • Better interpretability

Non-parametric Advantage

  • Higher representational power
  • Better fit to complex data
  • Fewer modeling assumptions
When to Use Kernel Methods:
  • Non-linear relationships in data
  • High-dimensional feature spaces
  • Need for flexible decision boundaries

Summary and Key Takeaways

Core Concepts Learned:
  • Kernel Trick: Implicit high-dimensional mapping
  • Support Vectors: Sparse representation
  • Margin Maximization: Generalization principle
  • Non-linear Separation: Via kernels
Main Algorithms:
  • SVM: Classification with maximum margin
  • SVR: Regression with $\epsilon$-insensitive loss
  • Kernel Ridge: Regularized regression
  • Multi-class: Extensions to multiple classes
Key Kernels:
  • Linear, Polynomial, RBF, Sigmoid
  • Mercer's theorem for validity
  • Multiple kernel learning
Practical Guidelines:

When to Use Kernel Methods

  • Non-linear relationships in data
  • Need for sparse solutions
  • Strong theoretical guarantees required
  • Medium-sized datasets
Parameter Selection:
  • C: Start with 1.0, tune via CV
  • $\gamma$: Start with $1/n_{\text{features}}$
  • $\epsilon$: Start with 0.1 for SVR
Limitations:
  • Computational complexity: $\mathcal{O}(n^2)$ to $\mathcal{O}(n^3)$
  • Memory requirements: Store kernel matrix
  • Parameter sensitivity
  • Not suitable for very large datasets

Next Steps

Explore deep learning for automatic feature learning in high-dimensional spaces.

End of Module 10

Kernel Methods

Questions?