CMSC 173

Module 07: Principal Component Analysis


Principal Component Analysis

CMSC 173 - Module 07

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

Outline

\tableofcontents

What is Principal Component Analysis?

Definition

Principal Component Analysis (PCA) is a statistical technique that transforms high-dimensional data into a lower-dimensional representation while preserving as much variance as possible.
\begin{exampleblock}{Key Characteristics}
  • Unsupervised learning method
  • Linear dimensionality reduction
  • Orthogonal transformation
  • Variance maximization
\end{exampleblock}

[Figure: ../figures/dimensionality_reduction_concept.png]

Applications:
  • Data visualization
  • Feature extraction
  • Noise reduction
  • Image compression

Goal

Find new axes (principal components) that maximize variance and are uncorrelated with each other.
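As a first look, the whole pipeline fits in a few lines with scikit-learn (a sketch on synthetic data; the stretch factors 3.0 and 0.5 are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2-D data stretched along one axis (hypothetical example).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                 # data expressed in PC coordinates
print(pca.explained_variance_ratio_)     # PC1 carries most of the variance
```

The columns of `Z` are uncorrelated by construction, which is exactly the "uncorrelated with each other" property above.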

The Curse of Dimensionality

Problem

As the number of features increases, data becomes increasingly sparse, making analysis difficult and computationally expensive.
Challenges in High Dimensions:
  • Data sparsity: Points become isolated
  • Distance measures: Lose meaning
  • Overfitting: Models fit noise
  • Computation: Time/memory grows exponentially

Hughes Phenomenon

Model performance initially improves with more features, then degrades beyond an optimal point.

[Figure: ../figures/curse_of_dimensionality.png]

Example Impact:
  • 10 features: $10^2 = 100$ pairwise interactions
  • 100 features: $100^2 = 10,000$ interactions
  • 1000 features: $1000^2 = 1,000,000$ interactions!
\begin{exampleblock}{Solution: PCA} Reduce dimensions while retaining information. \end{exampleblock}

Why Dimensionality Reduction?

Motivation for PCA:
  • Visualization: Reduce to 2D/3D for plotting
  • Storage: Compress data efficiently
  • Speed: Faster training and inference
  • Noise removal: Filter out low-variance components
  • Feature extraction: Create meaningful features
  • Collinearity: Remove redundant features

Core Principle

Most real-world data has intrinsic dimensionality much lower than its ambient dimensionality.
Example: Images of faces
  • Ambient: 10,000 pixels
  • Intrinsic: 100-200 dimensions
Benefits vs Trade-offs: \begin{exampleblock}{Advantages}
  • Reduced computational cost
  • Easier visualization
  • Noise reduction
  • Better generalization
\end{exampleblock}

Trade-offs

  • Information loss
  • Interpretability reduced
  • Linear assumptions
  • Sensitive to scaling

When to Use PCA

  • Many correlated features
  • Need for visualization
  • Computational constraints
  • Preprocessing for ML

Real-World Applications

Computer Vision:
  • Face recognition: Eigenfaces
  • Image compression: Reduce storage
  • Object detection: Feature extraction
  • Image denoising: Remove noise
Finance:
  • Portfolio optimization
  • Risk assessment
  • Market trend analysis
  • Fraud detection
Biology \& Medicine:
  • Gene expression analysis
  • Medical imaging
  • Drug discovery
  • Disease classification
Natural Language Processing:
  • Document clustering
  • Topic modeling
  • Sentiment analysis
  • Information retrieval
Signal Processing:
  • Audio compression
  • Speech recognition
  • Sensor data analysis
  • Anomaly detection
Recommendation Systems:
  • User preference modeling
  • Content filtering
  • Collaborative filtering
  • Feature engineering
\begin{exampleblock}{Success Story} Netflix Prize: Top teams relied on SVD-style matrix factorization, closely related to PCA, to model latent user and movie factors. \end{exampleblock}

Variance: Measuring Spread

Definition

Variance measures how far data points spread from their mean value.
For a single variable: $$\begin{aligned}\text{Var}(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \\ = \mathbb{E}[(X - \mathbb{E}[X])^2]\end{aligned}$$ Properties:
  • Always non-negative: $\text{Var}(X) \geq 0$
  • Zero variance: constant values
  • Units: squared original units
  • Standard deviation: $\sigma = \sqrt{\text{Var}(X)}$

PCA Goal

Find directions of maximum variance in the data.

[Figure: ../figures/variance_visualization.png]

Example Calculation: Data: $[2, 4, 6, 8, 10]$ $$\begin{aligned}\bar{x} = \frac{2+4+6+8+10}{5} = 6 \\ \text{Var}(X) = \frac{(2-6)^2 + (4-6)^2 + \cdots}{5} \\ = \frac{16 + 4 + 0 + 4 + 16}{5} = 8\end{aligned}$$ \begin{exampleblock}{Intuition} High variance direction contains more information than low variance direction. \end{exampleblock}
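The calculation above can be checked directly (NumPy's `np.var` uses the same $1/n$ population formula by default):

```python
import numpy as np

x = np.array([2, 4, 6, 8, 10])
var = np.mean((x - x.mean()) ** 2)   # population variance, the 1/n formula above
print(var)          # 8.0
print(np.var(x))    # same: np.var defaults to ddof=0, i.e. divide by n
```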

Covariance: Measuring Linear Relationship

Definition

Covariance measures the joint variability of two variables.
Formula: $$\begin{aligned}\text{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \\ = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]\end{aligned}$$ Interpretation:
  • $\text{Cov}(X, Y) > 0$: Positive relationship
  • $\text{Cov}(X, Y) < 0$: Negative relationship
  • $\text{Cov}(X, Y) = 0$: No linear relationship
  • $\text{Cov}(X, X) = \text{Var}(X)$
\begin{exampleblock}{Correlation} Normalized covariance: $$\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} \in [-1, 1]$$ \end{exampleblock}
Covariance Matrix: For centered data $\mathbf{X} \in \mathbb{R}^{n \times d}$ (each column has zero mean): $$\mathbf{\Sigma} = \frac{1}{n}\mathbf{X}^T\mathbf{X} \in \mathbb{R}^{d \times d}$$ Structure: $$\mathbf{\Sigma} = \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1,X_2) & \cdots \\ \text{Cov}(X_2,X_1) & \text{Var}(X_2) & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$$ Properties:
  • Symmetric: $\Sigma_{ij} = \Sigma_{ji}$
  • Positive semi-definite
  • Diagonal: variances
  • Off-diagonal: covariances
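A quick check of the formula (the data must be centered first; `np.cov` with `bias=True` uses the same $1/n$ convention, with rows as variables):

```python
import numpy as np

# Small hypothetical dataset: two positively related features.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
Xc = X - X.mean(axis=0)          # centering is required before X^T X
Sigma = Xc.T @ Xc / len(X)       # the 1/n covariance formula
print(Sigma)
print(np.cov(X.T, bias=True))    # identical result
```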

PCA Key

Diagonalize covariance matrix to find uncorrelated components.

Eigendecomposition: The Core of PCA

Definition

For a square matrix $\mathbf{A}$, eigendecomposition finds vectors and scalars such that: $$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$$
Components:
  • $\mathbf{v}$: Eigenvector (direction)
  • $\lambda$: Eigenvalue (scaling factor)
  • Matrix only changes magnitude, not direction
For Covariance Matrix: $$\mathbf{\Sigma} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^T$$ where:
  • $\mathbf{V}$: Eigenvector matrix
  • $\mathbf{\Lambda}$: Diagonal eigenvalue matrix
  • $\mathbf{V}^T\mathbf{V} = \mathbf{I}$ (orthonormal)

[Figure: ../figures/covariance_eigenvectors.png]

Geometric Interpretation:
  • Eigenvectors: Principal component directions
  • Eigenvalues: Variance along each PC
  • Largest eigenvalue: Most variance
  • Smallest eigenvalue: Least variance
\begin{exampleblock}{PCA Connection}
  • PC directions = Eigenvectors
  • PC importance = Eigenvalues
  • Sorted by decreasing eigenvalue
\end{exampleblock}

Eigendecomposition Example

Problem: Find eigenvalues and eigenvectors $$\mathbf{\Sigma} = \begin{bmatrix} 4 & 2 \\ 2 & 3 \end{bmatrix}$$ Step 1: Characteristic equation $$\det(\mathbf{\Sigma} - \lambda \mathbf{I}) = 0$$ $$\det\begin{bmatrix} 4-\lambda & 2 \\ 2 & 3-\lambda \end{bmatrix} = 0$$ $$(4-\lambda)(3-\lambda) - 4 = 0$$ $$\lambda^2 - 7\lambda + 8 = 0$$ Step 2: Solve via the quadratic formula $$\lambda = \frac{7 \pm \sqrt{17}}{2} \implies \lambda_1 \approx 5.56,\; \lambda_2 \approx 1.44$$
Step 3: Find eigenvectors For $\lambda_1 = 5.56$: $$(\mathbf{\Sigma} - 5.56\mathbf{I})\mathbf{v}_1 = 0$$ $$\mathbf{v}_1 = \begin{bmatrix} 0.79 \\ 0.61 \end{bmatrix}$$ For $\lambda_2 = 1.44$: $$\mathbf{v}_2 = \begin{bmatrix} -0.61 \\ 0.79 \end{bmatrix}$$ Interpretation:
  • PC1 direction: $(0.79, 0.61)$
  • PC1 variance: $5.56$
  • PC2 direction: $(-0.61, 0.79)$
  • PC2 variance: $1.44$
  • Total variance: $5.56 + 1.44 = 7$

Note

Eigenvectors are orthogonal: $\mathbf{v}_1 \perp \mathbf{v}_2$
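The hand computation above can be verified numerically (`np.linalg.eigh` is the right routine for symmetric matrices; it returns eigenvalues in ascending order, so we reverse them):

```python
import numpy as np

Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])
lam, V = np.linalg.eigh(Sigma)        # eigh: symmetric input, ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]        # sort descending, PC1 first

print(lam)                 # approximately [5.56, 1.44] = (7 +- sqrt(17)) / 2
print(lam.sum())           # 7.0 = trace(Sigma): total variance is preserved
print(V[:, 0] @ V[:, 1])   # ~0: eigenvectors are orthogonal
```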

Singular Value Decomposition (SVD)

Definition

Any matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ can be decomposed as: $$\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$$
Components:
  • $\mathbf{U} \in \mathbb{R}^{n \times n}$: Left singular vectors
  • $\mathbf{\Sigma} \in \mathbb{R}^{n \times d}$: Singular values (diagonal)
  • $\mathbf{V} \in \mathbb{R}^{d \times d}$: Right singular vectors
  • $\mathbf{U}^T\mathbf{U} = \mathbf{I}$, $\mathbf{V}^T\mathbf{V} = \mathbf{I}$
Relationship to PCA: $$\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{\Sigma}^T\mathbf{\Sigma}\mathbf{V}^T$$ Therefore:
  • PC directions: Columns of $\mathbf{V}$
  • PC variances: $\sigma_i^2 / n$

[Figure: ../figures/svd_decomposition.png]

Advantages of SVD for PCA:
  • More numerically stable
  • Works for $n < d$ or $n > d$
  • No need to form $\mathbf{X}^T\mathbf{X}$
  • Efficient algorithms available
\begin{exampleblock}{Truncated SVD} Keep only top $k$ components: $$\mathbf{X}_k = \mathbf{U}_k\mathbf{\Sigma}_k\mathbf{V}_k^T$$ Best rank-$k$ approximation in Frobenius norm. \end{exampleblock}
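The two routes give the same components, which we can confirm on random data (a sketch; signs of individual components may differ between methods):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix
lam, V = np.linalg.eigh(Xc.T @ Xc / len(X))
lam, V = lam[::-1], V[:, ::-1]

# Route 2: SVD of the centered data, never forming X^T X
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(s**2 / len(X), lam))        # variances agree: sigma_i^2 / n
print(np.allclose(np.abs(Vt), np.abs(V.T)))   # directions agree up to sign
```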

SVD vs Eigendecomposition

Eigendecomposition Approach

Steps:
  1. Center data: $\tilde{\mathbf{X}} = \mathbf{X} - \bar{\mathbf{X}}$
  2. Compute covariance: $\mathbf{\Sigma} = \frac{1}{n}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$
  3. Eigendecomposition: $\mathbf{\Sigma} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^T$
  4. Sort by eigenvalues
Complexity: $\mathcal{O}(d^2n + d^3)$ Issues:
  • Numerical instability for $\mathbf{X}^T\mathbf{X}$
  • Squaring condition number
  • Memory: storing $d \times d$ matrix

SVD Approach

Steps:
  1. Center data: $\tilde{\mathbf{X}} = \mathbf{X} - \bar{\mathbf{X}}$
  2. SVD: $\tilde{\mathbf{X}} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$
  3. PCs: columns of $\mathbf{V}$
  4. Variance: $\sigma_i^2 / n$
Complexity: $\mathcal{O}(\min(nd^2, n^2d))$ Advantages:
  • Better numerical stability
  • Works directly on $\mathbf{X}$
  • Better for $n \ll d$ or $n \gg d$
  • Modern implementations optimized

Recommendation

Use SVD for PCA in practice.

Projection and Reconstruction

Projection onto PCs: Transform to $k$ principal components: $$\mathbf{Z} = \mathbf{X}\mathbf{V}_k$$ where $\mathbf{V}_k \in \mathbb{R}^{d \times k}$ are first $k$ PCs. Properties:
  • $\mathbf{Z} \in \mathbb{R}^{n \times k}$: Reduced representation
  • Columns of $\mathbf{Z}$ are uncorrelated
  • Maximum variance preserved
  • Linear transformation
Reconstruction (for centered $\mathbf{X}$; add the mean back to return to original coordinates): $$\hat{\mathbf{X}} = \mathbf{Z}\mathbf{V}_k^T = \mathbf{X}\mathbf{V}_k\mathbf{V}_k^T$$

Reconstruction Error

$$\frac{1}{n}\|\mathbf{X} - \hat{\mathbf{X}}\|_F^2 = \sum_{i=k+1}^{d}\lambda_i$$ (the $1/n$ matches the covariance convention used throughout)

[Figure: ../figures/pca_reconstruction.png]

Example: Original: $\mathbf{x} = [5, 3]$ PCs: $\mathbf{v}_1 = [0.8, 0.6]$ Projection: $$z_1 = \mathbf{x}^T\mathbf{v}_1 = 5(0.8) + 3(0.6) = 5.8$$ Reconstruction: $$\hat{\mathbf{x}} = z_1\mathbf{v}_1 = 5.8[0.8, 0.6] = [4.64, 3.48]$$ Error: $$\|\mathbf{x} - \hat{\mathbf{x}}\| = \|[0.36, -0.48]\| = 0.6$$
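The numbers in the example check out in code:

```python
import numpy as np

x = np.array([5.0, 3.0])
v1 = np.array([0.8, 0.6])            # unit vector: 0.8^2 + 0.6^2 = 1

z1 = x @ v1                          # projection score onto PC1
x_hat = z1 * v1                      # reconstruction from one component
err = np.linalg.norm(x - x_hat)

print(z1, x_hat, err)                # approximately 5.8, [4.64, 3.48], 0.6
```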

Data Centering and Standardization

Centering (Required): Remove mean from each feature: $$\tilde{x}_{ij} = x_{ij} - \bar{x}_j$$ Why?
  • PCA finds directions of variance
  • Variance computed about mean
  • Ensures PC1 passes through origin
  • Mathematical requirement
Standardization (Optional): Scale to unit variance: $$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j}$$ When to standardize?
  • Features have different scales
  • Different units of measurement
  • Want equal feature importance

[Figure: ../figures/standardization_importance.png]

\begin{exampleblock}{Example Impact} Without standardization:
  • Income: \$20,000-\$200,000
  • Age: 20-80 years
  • PC1 dominated by income scale
With standardization:
  • Both scaled to mean 0, std 1
  • Equal contribution opportunity
\end{exampleblock}

Best Practice

Always center. Standardize if features have different scales or units.
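The scale effect from the example is easy to reproduce (a sketch with synthetic income/age data; the ranges follow the slide):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = np.column_stack([
    rng.uniform(20_000, 200_000, size=300),   # income, in dollars
    rng.uniform(20, 80, size=300),            # age, in years
])

raw = PCA(n_components=2).fit(X)
std = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print(raw.explained_variance_ratio_)   # PC1 ~ 100%: an artifact of income's scale
print(std.explained_variance_ratio_)   # roughly balanced after standardization
```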

PCA Algorithm: Step-by-Step

\begin{algorithm}[H] \caption{Principal Component Analysis} \begin{algorithmic}[1] \REQUIRE Data matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, number of components $k$ \ENSURE Principal components $\mathbf{V}_k$, transformed data $\mathbf{Z}$ \STATE Center the data: \STATE $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$ \STATE $\tilde{\mathbf{X}} = \mathbf{X} - \mathbf{1}_n\bar{\mathbf{x}}^T$ \STATE Optional: Standardize features \STATE Compute covariance matrix: \STATE $\mathbf{\Sigma} = \frac{1}{n}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$ \STATE OR compute SVD: \STATE $\tilde{\mathbf{X}} = \mathbf{U}\mathbf{D}\mathbf{V}^T$ \STATE Extract top $k$ eigenvectors/singular vectors: \STATE $\mathbf{V}_k = [\mathbf{v}_1, …, \mathbf{v}_k]$ \STATE Project data onto PCs: \STATE $\mathbf{Z} = \tilde{\mathbf{X}}\mathbf{V}_k$ \RETURN $\mathbf{V}_k$, $\mathbf{Z}$ \end{algorithmic} \end{algorithm}
Key Steps Explained: \begin{exampleblock}{Step 1-2: Centering} Essential for computing variance correctly. Shift data so mean is at origin. \end{exampleblock} \begin{exampleblock}{Step 3-4: Covariance/SVD} Two equivalent approaches. SVD more stable numerically. \end{exampleblock} \begin{exampleblock}{Step 5: Top-k Selection} Sort by eigenvalues (descending) and keep largest $k$ components. \end{exampleblock} \begin{exampleblock}{Step 6: Projection} Transform original data to new coordinate system defined by PCs. \end{exampleblock} Complexity:
  • Eigendecomposition: $\mathcal{O}(d^3)$
  • SVD: $\mathcal{O}(\min(nd^2, n^2d))$
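A direct NumPy translation of the algorithm above (SVD route), checked against scikit-learn on illustrative random data:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_svd(X, k):
    """Steps 1-6 above: center, SVD, keep top-k directions, project."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                                      # center
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # decompose
    Vk = Vt[:k].T                                       # top-k components (d x k)
    Z = Xc @ Vk                                         # project
    return Vk, Z

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
Vk, Z = pca_svd(X, 2)

Z_sk = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(Z), np.abs(Z_sk)))   # same scores up to sign
```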

Worked Example: PCA on Toy Dataset (Part 1)

Problem Setup: Dataset with 5 points in 2D: $$\mathbf{X} = \begin{bmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 4 \\ 4 & 5 \\ 5 & 6 \end{bmatrix}$$ Goal: Find principal components Step 1: Center the data Compute mean: $$\bar{\mathbf{x}} = \frac{1}{5}[15, 20]^T = [3, 4]^T$$ Centered data: $$\tilde{\mathbf{X}} = \begin{bmatrix} -2 & -2 \\ -1 & -1 \\ 0 & 0 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$$
Step 2: Compute covariance matrix $$\mathbf{\Sigma} = \frac{1}{5}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$$ $$= \frac{1}{5}\begin{bmatrix} -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix} \begin{bmatrix} -2 & -2 \\ -1 & -1 \\ 0 & 0 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$$ $$= \frac{1}{5}\begin{bmatrix} 10 & 10 \\ 10 & 10 \end{bmatrix} = \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix}$$

Observation

Perfect correlation: $\text{Cov}(X_1, X_2) = \text{Var}(X_1) = \text{Var}(X_2) = 2$

Worked Example: PCA on Toy Dataset (Part 2)

Step 3: Find eigenvalues Solve: $\det(\mathbf{\Sigma} - \lambda\mathbf{I}) = 0$ $$\det\begin{bmatrix} 2-\lambda & 2 \\ 2 & 2-\lambda \end{bmatrix} = 0$$ $$(2-\lambda)^2 - 4 = 0$$ $$\lambda^2 - 4\lambda = 0$$ $$\lambda(\lambda - 4) = 0$$ Eigenvalues: $$\lambda_1 = 4, \lambda_2 = 0$$ Interpretation:
  • All variance in first PC: $\lambda_1 = 4$
  • Zero variance in second PC: $\lambda_2 = 0$
  • Data lies on a line!
Step 4: Find eigenvectors For $\lambda_1 = 4$: $$(\mathbf{\Sigma} - 4\mathbf{I})\mathbf{v}_1 = 0$$ $$\begin{bmatrix} -2 & 2 \\ 2 & -2 \end{bmatrix}\mathbf{v}_1 = 0$$ Solution: $\mathbf{v}_1 = \frac{1}{\sqrt{2}}[1, 1]^T$ For $\lambda_2 = 0$: $$\mathbf{v}_2 = \frac{1}{\sqrt{2}}[1, -1]^T$$ Verification: $$\mathbf{v}_1^T\mathbf{v}_2 = \frac{1}{2}(1 - 1) = 0 \checkmark$$ \begin{exampleblock}{Result} PC1: $[0.707, 0.707]$ (diagonal direction)\\ PC2: $[0.707, -0.707]$ (perpendicular) \end{exampleblock}

Worked Example: PCA on Toy Dataset (Part 3)

Step 5: Project data onto PCs $$\mathbf{Z} = \tilde{\mathbf{X}}\mathbf{V}$$ $$= \begin{bmatrix} -2 & -2 \\ -1 & -1 \\ 0 & 0 \\ 1 & 1 \\ 2 & 2 \end{bmatrix} \begin{bmatrix} 0.707 & 0.707 \\ 0.707 & -0.707 \end{bmatrix}$$ $$= \begin{bmatrix} -2.83 & 0 \\ -1.41 & 0 \\ 0 & 0 \\ 1.41 & 0 \\ 2.83 & 0 \end{bmatrix}$$ Observation: All points lie on PC1 axis (PC2 coordinates are zero).
Step 6: Explained variance Total variance: $$\text{Total} = \lambda_1 + \lambda_2 = 4 + 0 = 4$$ Explained by PC1: $$\frac{\lambda_1}{\text{Total}} = \frac{4}{4} = 100\%$$ Explained by PC2: $$\frac{\lambda_2}{\text{Total}} = \frac{0}{4} = 0\%$$ Reconstruction (k=1): $$\hat{\mathbf{X}} = \mathbf{Z}_{:,1}\mathbf{v}_1^T + \bar{\mathbf{X}}$$ Perfect reconstruction since all variance in PC1!

Conclusion

This toy example shows perfectly correlated data requiring only 1 dimension.
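The whole worked example (Parts 1-3) can be replayed in a few lines:

```python
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]], dtype=float)
Xc = X - X.mean(axis=0)                     # mean is [3, 4]
Sigma = Xc.T @ Xc / len(X)                  # [[2, 2], [2, 2]]

lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]
print(lam)                                  # [4, 0]: all variance in PC1

Z = Xc @ V
print(np.allclose(Z[:, 1], 0))              # PC2 coordinates vanish
```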

Worked Example: Non-Trivial 2D Case (Part 1)

New Dataset (4 points): $$\mathbf{X} = \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{bmatrix}$$ Step 1: Center data Mean: $\bar{\mathbf{x}} = [2.5, 2.5]^T$ $$\tilde{\mathbf{X}} = \begin{bmatrix} -1.5 & -0.5 \\ -0.5 & -1.5 \\ 0.5 & 1.5 \\ 1.5 & 0.5 \end{bmatrix}$$ Step 2: Covariance $$\mathbf{\Sigma} = \frac{1}{4}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$$ $$= \frac{1}{4}\begin{bmatrix} 5 & 3 \\ 3 & 5 \end{bmatrix} = \begin{bmatrix} 1.25 & 0.75 \\ 0.75 & 1.25 \end{bmatrix}$$
Step 3: Eigenvalues $$\det(\mathbf{\Sigma} - \lambda\mathbf{I}) = 0$$ $$(1.25-\lambda)^2 - 0.5625 = 0$$ $$\lambda^2 - 2.5\lambda + 1 = 0$$ Using quadratic formula: $$\lambda = \frac{2.5 \pm \sqrt{6.25-4}}{2} = \frac{2.5 \pm 1.5}{2}$$ Eigenvalues: $$\lambda_1 = 2.0, \lambda_2 = 0.5$$ Variance explained:
  • PC1: $\frac{2.0}{2.5} = 80\%$
  • PC2: $\frac{0.5}{2.5} = 20\%$
  • Both components carry information!

Worked Example: Non-Trivial 2D Case (Part 2)

Step 4: Eigenvectors For $\lambda_1 = 2.0$: $$\begin{bmatrix} -0.75 & 0.75 \\ 0.75 & -0.75 \end{bmatrix}\mathbf{v}_1 = 0$$ Normalize: $\mathbf{v}_1 = \frac{1}{\sqrt{2}}[1, 1]^T = [0.707, 0.707]^T$ For $\lambda_2 = 0.5$: $$\mathbf{v}_2 = \frac{1}{\sqrt{2}}[1, -1]^T = [0.707, -0.707]^T$$ Step 5: Transform data $$\mathbf{Z} = \tilde{\mathbf{X}}\mathbf{V}$$ $$= \begin{bmatrix} -1.414 & -0.707 \\ -1.414 & 0.707 \\ 1.414 & -0.707 \\ 1.414 & 0.707 \end{bmatrix}$$
Step 6: Reconstruction (k=1) Keep only PC1: $$\hat{\tilde{\mathbf{X}}} = \mathbf{Z}_{:,1}\mathbf{v}_1^T$$ $$= \begin{bmatrix} -1.414 \\ -1.414 \\ 1.414 \\ 1.414 \end{bmatrix} \begin{bmatrix} 0.707 & 0.707 \end{bmatrix}$$ $$= \begin{bmatrix} -1.0 & -1.0 \\ -1.0 & -1.0 \\ 1.0 & 1.0 \\ 1.0 & 1.0 \end{bmatrix}$$ Reconstruction error: $$\text{Error} = \lambda_2 = 0.5$$ \begin{exampleblock}{Summary} Reduced from 2D to 1D, preserving 80\% of variance. \end{exampleblock}
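Replaying this example numerically also confirms that the $k{=}1$ reconstruction error equals the discarded eigenvalue (with the $1/n$ averaging convention used here):

```python
import numpy as np

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3]], dtype=float)
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(X)

lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]
print(lam)                                   # [2.0, 0.5] -> 80% / 20%

Z = Xc @ V
Xc_hat = np.outer(Z[:, 0], V[:, 0])          # keep PC1 only
err = ((Xc - Xc_hat) ** 2).sum() / len(X)    # mean squared residual per point
print(err)                                   # 0.5 = lambda_2
```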

Kernel PCA: Non-linear Extension

Motivation

Standard PCA finds only linear relationships. Many real-world datasets have non-linear structure.
Key Idea:
  • Map data to high-dimensional space: $\phi: \mathbb{R}^d \to \mathcal{H}$
  • Apply PCA in feature space $\mathcal{H}$
  • Use kernel trick: $K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$
  • Never explicitly compute $\phi(x)$
Algorithm:
  1. Compute kernel matrix: $\mathbf{K}_{ij} = K(x_i, x_j)$
  2. Center kernel matrix
  3. Eigendecompose: $\mathbf{K} = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^T$
  4. Project: $z_i = \mathbf{\Lambda}^{-1/2}\mathbf{U}^T\mathbf{k}(x_i)$, where $\mathbf{k}(x_i) = [K(x_1, x_i), \ldots, K(x_n, x_i)]^T$

[Figure: ../figures/kernel_pca_concept.png]

Common Kernels:
  • RBF: $K(x,y) = \exp(-\gamma\|x-y\|^2)$
  • Polynomial: $K(x,y) = (x^Ty + c)^d$
  • Sigmoid: $K(x,y) = \tanh(\alpha x^Ty + c)$
\begin{exampleblock}{Applications}
  • Manifold learning
  • Non-linear dimensionality reduction
  • Feature extraction for non-linear data
\end{exampleblock}

Trade-off

More flexible but computationally expensive: $\mathcal{O}(n^3)$
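A sketch with scikit-learn's `KernelPCA` on a classic non-linear dataset (two concentric circles; the RBF `gamma=10` follows the scikit-learn example for this data):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: no straight line (or linear PC) separates them.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

Z_lin = PCA(n_components=2).fit_transform(X)
Z_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)
# Linear PCA merely rotates the circles; in the RBF feature space the first
# components typically pull the inner and outer circles apart.
```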

Sparse PCA

Motivation

Standard PCA produces components that are linear combinations of all original features, making interpretation difficult.
Goal: Find PCs with sparse loadings
  • Most loadings are exactly zero
  • Only few features contribute to each PC
  • Easier interpretation
  • Feature selection built-in
Formulation: $$\min_{\mathbf{V}} \|\mathbf{X} - \mathbf{X}\mathbf{V}\mathbf{V}^T\|_F^2 + \lambda\sum_{j=1}^{k}\|\mathbf{v}_j\|_1$$
  • First term: Reconstruction error
  • Second term: Sparsity penalty (L1 regularization)
  • $\lambda$: Controls sparsity level

[Figure: ../figures/sparse_pca.png]

Comparison: \begin{tabular}{|l|c|c|} \hline Method & Variance & Sparsity \\ \hline Standard PCA & 100\% & 0\% \\ Sparse PCA & 95-98\% & 70-90\% \\ \hline \end{tabular} Applications:
  • Gene expression analysis
  • Financial data analysis
  • Text mining
  • Any domain requiring interpretability

Trade-off

Slight loss of explained variance for much better interpretability.
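A sketch with `SparsePCA` on synthetic data built from two independent groups of correlated features (the group structure and `alpha=1.0` are illustrative choices):

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
n = 200
# Features 0-2 share one latent factor, features 3-5 share another.
f1 = rng.normal(size=(n, 1))
f2 = rng.normal(size=(n, 1))
X = np.hstack([f1 + 0.1 * rng.normal(size=(n, 3)),
               f2 + 0.1 * rng.normal(size=(n, 3))])

spca = SparsePCA(n_components=2, alpha=1.0, random_state=0)
spca.fit(X)
print(spca.components_)                  # each component loads on one feature group
print(np.mean(spca.components_ == 0))    # fraction of exactly-zero loadings
```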

Incremental PCA

Motivation

Standard PCA requires entire dataset in memory. Not feasible for:
  • Very large datasets
  • Streaming data
  • Online learning scenarios
  • Limited memory systems
Key Idea: Process data in mini-batches:
  1. Initialize with first batch
  2. Update PCs incrementally with new batches
  3. Maintain approximate PCs online
  4. Never load full dataset
Algorithm Sketch: \begin{algorithmic}[1] \STATE Initialize: $\mathbf{V}_0$, $\mathbf{\Sigma}_0$, $n_0$ \FOR{each batch $\mathbf{X}_b$} \STATE Update mean: $\bar{\mathbf{x}}_{new}$ \STATE Update covariance: $\mathbf{\Sigma}_{new}$ \STATE SVD: Update $\mathbf{V}$ \STATE $n \gets n + n_b$ \ENDFOR \end{algorithmic}

[Figure: ../figures/incremental_pca.png]

Advantages:
  • Memory efficient: Constant memory
  • Scalable: Handles large data
  • Online: Updates with new data
  • Streaming: Real-time processing
Complexity:
  • Per batch: $\mathcal{O}(bd^2)$ where $b$ = batch size
  • Memory: $\mathcal{O}(d^2)$ instead of $\mathcal{O}(nd)$
  • Nearly identical results to batch PCA
\begin{exampleblock}{Use Cases}
  • Large-scale image processing
  • Sensor data streams
  • Online recommendation systems
\end{exampleblock}
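A sketch of the streaming pattern with scikit-learn's `IncrementalPCA` (the batch split and data are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 20))

ipca = IncrementalPCA(n_components=5)
for batch in np.array_split(X, 10):     # simulate a stream of mini-batches
    ipca.partial_fit(batch)             # constant memory: one batch at a time

full = PCA(n_components=5).fit(X)       # batch PCA on the full dataset
print(np.max(np.abs(ipca.explained_variance_ratio_
                    - full.explained_variance_ratio_)))   # small difference
```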

Probabilistic PCA

Motivation

Standard PCA is deterministic. Probabilistic PCA provides:
  • Generative model of data
  • Handling of missing values
  • Uncertainty quantification
  • Bayesian extensions
Model: $$\begin{aligned}\mathbf{z} \sim \mathcal{N}(0, \mathbf{I}_k) \\ \mathbf{x}|\mathbf{z} \sim \mathcal{N}(\mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2\mathbf{I}_d)\end{aligned}$$ where:
  • $\mathbf{z} \in \mathbb{R}^k$: Latent variables
  • $\mathbf{W} \in \mathbb{R}^{d \times k}$: Loading matrix
  • $\sigma^2$: Isotropic noise
Maximum Likelihood Solution: $$\mathbf{W}_{ML} = \mathbf{V}_k(\mathbf{\Lambda}_k - \sigma^2\mathbf{I})^{1/2}\mathbf{R}$$ where $\mathbf{R}$ is arbitrary rotation.
EM Algorithm: \begin{algorithmic}[1] \STATE Initialize $\mathbf{W}$, $\sigma^2$ \REPEAT \STATE E-step: Compute $\mathbb{E}[\mathbf{z}|\mathbf{x}]$ \STATE M-step: Update $\mathbf{W}$, $\sigma^2$ \UNTIL{convergence} \end{algorithmic} Advantages:
  • Handle missing data naturally
  • Mixture of PPCAs for clustering
  • Bayesian inference possible
  • Uncertainty quantification
Relationship to Standard PCA:
  • As $\sigma^2 \to 0$: Recovers standard PCA
  • Same subspace in limit
  • Additional noise model
\begin{exampleblock}{Extensions}
  • Factor Analysis: Non-isotropic noise
  • Variational PCA: Variational inference
  • Sparse PPCA: Sparse loadings
\end{exampleblock}
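scikit-learn's `PCA` actually fits this model: its `noise_variance_` attribute is the maximum-likelihood $\sigma^2$ (the mean of the discarded eigenvalues), and `score` returns the average log-likelihood. A sketch on data generated from the PPCA model (the loading matrix and $\sigma = 0.5$ are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
W = rng.normal(size=(10, 2))                       # hypothetical loading matrix
Zl = rng.normal(size=(500, 2))                     # latent variables z ~ N(0, I)
X = Zl @ W.T + 0.5 * rng.normal(size=(500, 10))    # x = W z + noise, sigma^2 = 0.25

ppca = PCA(n_components=2).fit(X)
print(ppca.noise_variance_)    # ML estimate of sigma^2, close to 0.25
print(ppca.score(X))           # average log-likelihood under the fitted model
```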

Robust PCA

Motivation

Standard PCA is sensitive to:
  • Outliers: Anomalous data points
  • Corrupted entries: Missing or noisy values
  • Sparse errors: Localized corruption
Problem Formulation: Decompose $\mathbf{X}$ into: $$\mathbf{X} = \mathbf{L} + \mathbf{S}$$ where:
  • $\mathbf{L}$: Low-rank (clean data)
  • $\mathbf{S}$: Sparse (outliers/corruption)
Optimization: $$\min_{\mathbf{L},\mathbf{S}} \|\mathbf{L}\|_* + \lambda\|\mathbf{S}\|_1$$ $$\text{subject to } \mathbf{X} = \mathbf{L} + \mathbf{S}$$
  • $\|\mathbf{L}\|_*$: Nuclear norm (rank proxy)
  • $\|\mathbf{S}\|_1$: L1 norm (sparsity)
Applications: \begin{exampleblock}{Video Surveillance}
  • $\mathbf{L}$: Static background
  • $\mathbf{S}$: Moving objects
  • Separates foreground/background
\end{exampleblock} \begin{exampleblock}{Face Recognition}
  • $\mathbf{L}$: Face structure
  • $\mathbf{S}$: Occlusions, shadows
  • Robust to partial occlusion
\end{exampleblock} Solution Methods:
  • Principal Component Pursuit (PCP)
  • Alternating Direction Method (ADMM)
  • Proximal gradient methods
Complexity:
  • $\mathcal{O}(nd^2k)$ per iteration
  • Slower than standard PCA
  • Worth it for corrupted data

Scree Plot: Visualizing Variance

Definition

A scree plot displays eigenvalues (or explained variance) for each principal component in decreasing order.
How to Read:
  • X-axis: Principal component number
  • Y-axis: Eigenvalue or explained variance
  • Look for "elbow" point
  • Steep drop indicates important PCs
  • Flat tail indicates noise
Elbow Method:
  • Find point where curve bends
  • Keep PCs before elbow
  • Remaining PCs contribute little
  • Subjective but practical

Rule of Thumb

Keep components before the "elbow" where variance drops sharply.

[Figure: ../figures/scree_plot.png]

Example Interpretation:
  • PC1: Large eigenvalue (most variance)
  • PC2-3: Moderate decrease
  • PC4+: Flat (noise level)
  • Decision: Keep 3-4 components
\begin{exampleblock}{Kaiser Criterion} Alternative rule: Keep components with eigenvalue $> 1$ (for correlation matrix). \end{exampleblock}

Explained Variance Ratio

Individual Explained Variance: For component $i$: $$\text{EVR}_i = \frac{\lambda_i}{\sum_{j=1}^{d}\lambda_j}$$ Cumulative Explained Variance: For first $k$ components: $$\text{CEV}_k = \frac{\sum_{i=1}^{k}\lambda_i}{\sum_{j=1}^{d}\lambda_j}$$ Selection Criteria:
  • Fixed threshold: $\text{CEV}_k \geq 0.95$ (95\%)
  • Fixed number: $k = 10$ components
  • Elbow method: Visual inspection
  • Cross-validation: Best downstream performance

Common Choice

Keep components explaining 90-95\% of total variance.
Example Calculation: Eigenvalues: $[5.2, 2.1, 0.8, 0.3, 0.1]$ Total: $\sum \lambda_i = 8.5$ \begin{tabular}{|c|c|c|} \hline PC & EVR & CEV \\ \hline 1 & 61.2\% & 61.2\% \\ 2 & 24.7\% & 85.9\% \\ 3 & 9.4\% & 95.3\% \\ 4 & 3.5\% & 98.8\% \\ 5 & 1.2\% & 100.0\% \\ \hline \end{tabular} Decision:
  • For 85\% variance: Keep 2 PCs
  • For 95\% variance: Keep 3 PCs
  • For 99\% variance: Keep 4 PCs
\begin{exampleblock}{Trade-off} More components: Better reconstruction, less compression \end{exampleblock}
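The table computes directly:

```python
import numpy as np

eigvals = np.array([5.2, 2.1, 0.8, 0.3, 0.1])
evr = eigvals / eigvals.sum()          # individual explained variance ratio
cev = np.cumsum(evr)                   # cumulative explained variance

print(np.round(evr * 100, 1))          # [61.2 24.7  9.4  3.5  1.2]
print(np.round(cev * 100, 1))          # [ 61.2  85.9  95.3  98.8 100. ]

k = int(np.searchsorted(cev, 0.95) + 1)   # smallest k with CEV >= 95%
print(k)                               # 3
```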

Reconstruction Error

Definition

Reconstruction error measures information lost when using only $k < d$ principal components.
Formula: $$E_k = \frac{1}{n}\|\mathbf{X} - \hat{\mathbf{X}}_k\|_F^2 = \sum_{i=k+1}^{d}\lambda_i$$ where $\hat{\mathbf{X}}_k = \mathbf{X}\mathbf{V}_k\mathbf{V}_k^T$ (centered $\mathbf{X}$) Normalized Error: $$\text{Relative Error} = \frac{E_k}{\sum_{i=1}^{d}\lambda_i} = \frac{\sum_{i=k+1}^{d}\lambda_i}{\sum_{i=1}^{d}\lambda_i}$$ Properties:
  • Monotonically decreasing in $k$
  • Zero when $k = d$ (perfect reconstruction)
  • Directly related to discarded eigenvalues

[Figure: ../figures/reconstruction_error.png]

Selection Strategy: Set error tolerance $\epsilon$: $$E_k \leq \epsilon \cdot \|\mathbf{X}\|_F^2$$ Choose smallest $k$ satisfying constraint. Example:
  • Total variance: 100
  • Target: 5\% error tolerance
  • Need: CEV $\geq$ 95\%
  • $\sum_{i=1}^{k}\lambda_i \geq 95$
\begin{exampleblock}{Application} Image compression: Balance file size vs quality using reconstruction error. \end{exampleblock}

Cross-Validation for Component Selection

Task-Specific Selection: When PCA is used for downstream task:
  1. Apply PCA with different $k$ values
  2. Train model on reduced data
  3. Evaluate on validation set
  4. Choose $k$ with best performance
Procedure: \begin{algorithmic}[1] \STATE Split data: Train/Validation \FOR{$k = 1$ to $k_{max}$} \STATE Fit PCA on training set \STATE Transform train \& validation \STATE Train classifier/regressor \STATE Evaluate performance \ENDFOR \STATE Select $k$ with best validation score \end{algorithmic}

Important

Fit PCA only on training data to avoid data leakage!
Example: Classification Task \begin{tabular}{|c|c|c|} \hline k & Train Acc & Val Acc \\ \hline 5 & 0.75 & 0.72 \\ 10 & 0.82 & 0.79 \\ 20 & 0.89 & 0.85 \\ 50 & 0.95 & 0.84 \\ 100 & 0.98 & 0.82 \\ \hline \end{tabular} Analysis:
  • Best validation: $k = 20$
  • $k = 50, 100$: Overfitting
  • Task-specific optimum
Considerations:
  • Variance explained vs task performance
  • Computational cost
  • Interpretability needs
  • Domain constraints
\begin{exampleblock}{Best Practice} Use cross-validation when PCA is preprocessing step for supervised learning. \end{exampleblock}
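The leakage-safe version uses a `Pipeline`, so PCA is refit inside every CV fold (a sketch on the digits dataset; the grid values are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

pipe = Pipeline([
    ("pca", PCA()),                                 # refit per fold: no leakage
    ("clf", LogisticRegression(max_iter=2000)),
])
grid = {"pca__n_components": [5, 10, 20, 40]}
search = GridSearchCV(pipe, grid, cv=3).fit(X, y)

print(search.best_params_)      # task-specific optimum, not max variance
print(round(search.best_score_, 3))
```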

Component Correlation Analysis

Goal: Understand relationships between original features and principal components. Loading Matrix: $$\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, …, \mathbf{v}_d]$$ Interpretation:
  • $v_{ij}$: Contribution of feature $j$ to PC $i$
  • Large $|v_{ij}|$: Feature $j$ important for PC $i$
  • Sign indicates direction of contribution
  • For standardized data, $v_{ij}\sqrt{\lambda_i}$ gives the correlation between feature $j$ and PC $i$
Biplot:
  • Visualize data and loadings together
  • Arrow length: Variable importance
  • Arrow direction: Correlation with PCs
  • Angle between arrows: Variable correlation

Interpretation

Nearly parallel arrows indicate highly correlated variables.

[Figure: ../figures/component_correlation.png]

Example Loading Matrix: \scriptsize \begin{tabular}{|l|c|c|c|} \hline Feature & PC1 & PC2 & PC3 \\ \hline Height & 0.82 & 0.15 & -0.05 \\ Weight & 0.79 & 0.20 & 0.08 \\ Age & 0.12 & 0.91 & 0.10 \\ Income & -0.05 & 0.08 & 0.95 \\ \hline \end{tabular} Interpretation:
  • PC1: "Body size" (height, weight)
  • PC2: "Age" dimension
  • PC3: "Income" dimension
  • Clear separation of concepts

Application: Image Compression

Problem: Images contain redundant information. Can we store them more efficiently? PCA Approach:
  1. Treat each image as vector (flatten pixels)
  2. Build dataset: $\mathbf{X} \in \mathbb{R}^{n \times d}$
  3. Apply PCA: Find eigenvectors
  4. Project: $\mathbf{Z} = \mathbf{X}\mathbf{V}_k$
  5. Store: $\mathbf{Z}$ and $\mathbf{V}_k$ instead of $\mathbf{X}$
Compression Ratio: Original: $n \times d$ values Compressed: $n \times k + k \times d$ values Ratio: $\frac{nd}{nk + kd} \approx \frac{d}{k}$ (for large $n$) Example:
  • $d = 10,000$ pixels
  • $k = 100$ components
  • Compression: $100:1$

[Figure: ../figures/image_compression.png]

Quality vs Compression: \scriptsize \begin{tabular}{|c|c|c|} \hline PCs & Compression & Quality \\ \hline 10 & 95\% & Poor \\ 50 & 75\% & Fair \\ 100 & 50\% & Good \\ 200 & 0\% & Excellent \\ \hline \end{tabular} \begin{exampleblock}{Trade-off} Balance between file size and visual quality. Typical choice: 80-90\% variance retained. \end{exampleblock}
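A sketch on scikit-learn's 8$\times$8 digits images ($d = 64$; keeping $k = 16$ components is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 1797 images, d = 64 pixels each
n, d, k = X.shape[0], X.shape[1], 16

pca = PCA(n_components=k).fit(X)
Z = pca.transform(X)                       # store Z (n x k) and V_k (k x d)

ratio = (n * d) / (n * k + k * d)          # ~ d/k = 4 for large n
print(round(ratio, 2))

X_hat = pca.inverse_transform(Z)           # decompress
retained = pca.explained_variance_ratio_.sum()
print(round(retained, 3))                  # fraction of variance kept
```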

Application: Face Recognition (Eigenfaces)

Eigenfaces Method:
  1. Collect face images: $n$ images, $d$ pixels each
  2. Apply PCA: Find "eigenfaces" (principal components)
  3. Each eigenface captures facial variation
  4. Represent faces in eigenface space
  5. Recognition: Nearest neighbor in PC space
Advantages:
  • Dimensionality reduction: $10,000 \to 100$
  • Fast matching in low-dimensional space
  • Captures important facial features
  • Robust to minor variations
Process:
  • Training: Build eigenface basis
  • Enrollment: Project new face to PC space
  • Recognition: Compare with database
  • Decision: Threshold distance

[Figure: ../figures/eigenfaces_application.png]

Typical Eigenfaces:
  • PC1: Average lighting
  • PC2: Left-right contrast
  • PC3: Facial expression
  • PC4-10: Detailed features
  • PC11+: Fine details/noise
\begin{exampleblock}{Historical Note} Eigenfaces pioneered in 1991 by Turk and Pentland. Still used as baseline method. \end{exampleblock}

Limitations

Sensitive to: lighting, pose, facial expression. Modern methods use deep learning.

Application: Noise Filtering

Principle: Noise typically has:
  • Low variance
  • High frequency
  • Random direction
  • Captured by small eigenvalues
Signal has:
  • High variance
  • Low frequency
  • Structured patterns
  • Captured by large eigenvalues
Denoising Procedure:
  1. Apply PCA to noisy data
  2. Keep only top $k$ components (signal)
  3. Discard remaining components (noise)
  4. Reconstruct: $\hat{\mathbf{X}} = \mathbf{X}\mathbf{V}_k\mathbf{V}_k^T$
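The procedure above can be sketched on synthetic signals: each row is a noisy copy of a smooth two-component waveform, so the signal genuinely lives in $k = 2$ dimensions (amplitudes and noise level are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
t = np.linspace(0, 1, 50)
# Clean signals: random mixtures of a sine and a cosine (rank-2 structure).
clean = (np.outer(rng.normal(size=300), np.sin(2 * np.pi * t))
         + np.outer(rng.normal(size=300), np.cos(2 * np.pi * t)))
noisy = clean + 0.3 * rng.normal(size=clean.shape)

pca = PCA(n_components=2).fit(noisy)            # keep the 2 signal components
denoised = pca.inverse_transform(pca.transform(noisy))

mse_before = np.mean((noisy - clean) ** 2)      # ~ noise variance 0.09
mse_after = np.mean((denoised - clean) ** 2)    # much smaller
print(mse_before, mse_after)
```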

Key Insight

PCA acts as low-pass filter, removing high-frequency noise.

[Figure: ../figures/noise_filtering.png]

Example: Signal Denoising Original signal + Gaussian noise
  • Noise variance: 0.1
  • PCA: Keep 5 components
  • SNR improvement: 15 dB
Applications:
  • Image denoising
  • Audio signal processing
  • Sensor data cleaning
  • Medical imaging
  • Financial time series
\begin{exampleblock}{Challenge} Choosing $k$: Too small loses signal, too large keeps noise. Use cross-validation or domain knowledge. \end{exampleblock}

Application: Data Visualization

Challenge: Cannot visualize data in $d > 3$ dimensions. PCA Solution: Project to 2D or 3D for visualization:
  1. Apply PCA: Get principal components
  2. Keep PC1 and PC2 (or PC1, PC2, PC3)
  3. Plot data in reduced space
  4. Preserves maximum variance
Interpretation:
  • PC1 (x-axis): Direction of most variance
  • PC2 (y-axis): Second most variance
  • Distances approximately preserved
  • Clusters may emerge
Use Cases:
  • Exploratory data analysis
  • Cluster visualization
  • Outlier detection
  • Pattern discovery
  • Presentation to stakeholders

[Figure: ../figures/data_visualization_application.png]

\begin{exampleblock}{Example: Iris Dataset}
  • Original: 4 dimensions
  • PCA: 2D projection
  • Explained variance: 95.8\%
  • Clear species separation visible
\end{exampleblock} Comparison with Other Methods: \scriptsize \begin{tabular}{|l|c|c|} \hline Method & Linear & Global \\ \hline PCA & Yes & Yes \\ t-SNE & No & No \\ UMAP & No & Local \\ MDS & Yes/No & Yes \\ \hline \end{tabular}
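The Iris projection above takes only a few lines with scikit-learn; this sketch standardizes first and then keeps PC1 and PC2 (the plotting call itself is elided):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)            # 150 samples, 4 features
X_std = StandardScaler().fit_transform(X)

# Project onto the first two principal components for a 2D scatter plot.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)              # (150, 2)

# Cumulative explained variance of PC1 + PC2 is roughly 0.96 here.
explained = pca.explained_variance_ratio_.sum()
```

A scatter plot of `X_2d[:, 0]` against `X_2d[:, 1]`, colored by `y`, shows the species separation described in the example.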

Limitation

PCA preserves global structure but may miss non-linear patterns.

Application: Exploratory Data Analysis

Digits Dataset Visualization: High-dimensional handwritten digit images (64 dimensions) projected to 2D. Insights from PCA:
  • Digit clusters visible in PC space
  • Similar digits closer together
  • Outliers easily identified
  • Confusion patterns apparent
Practical Workflow:
  1. Load dataset
  2. Standardize features
  3. Apply PCA (keep 2-3 PCs)
  4. Visualize with scatter plot
  5. Color by labels (if available)
  6. Identify patterns/outliers
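The workflow above, applied to the 64-dimensional digits dataset, looks like this in scikit-learn (the scatter-plot step is left as a comment):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Steps 1-2: load and standardize the 8x8 handwritten digit images.
X, y = load_digits(return_X_y=True)          # (1797, 64)
X_std = StandardScaler().fit_transform(X)

# Step 3: keep the first two principal components.
Z = PCA(n_components=2).fit_transform(X_std)  # (1797, 2)

# Steps 4-6: scatter-plot Z[:, 0] vs Z[:, 1], colored by the label y,
# then inspect the plot for clusters and outliers.
```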

EDA Benefit

Quick visual check before applying complex ML models.

[Figure: ../figures/digits_visualization.png]

Observations:
  • Digits 0 and 1 well-separated
  • Digits 3, 5, 8 overlap slightly
  • Some outliers (miswritten digits)
  • PC1 and PC2 capture 30\% variance
\begin{exampleblock}{Feature Engineering} PCA projections can be used as features for downstream classification:
  • Reduce 64D to 20D
  • Train classifier on PC scores
  • Faster training
  • Often better generalization
\end{exampleblock}
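The 64D-to-20D idea in the example block can be sketched as a pipeline; the exact accuracy depends on the split and classifier, so treat the numbers as indicative only:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)          # 1797 samples, 64 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Reduce the 64 pixel features to 20 PC scores, then train a classifier
# on the scores. The pipeline fits PCA on the training fold only.
clf = make_pipeline(StandardScaler(),
                    PCA(n_components=20),
                    LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

Wrapping PCA inside the pipeline also guarantees the test set is transformed with the training-set projection, which matters for the leakage warning below.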

Application: Feature Engineering

PCA as Preprocessing: Use PCA-transformed features for ML models. Benefits:
  • Decorrelation: Remove multicollinearity
  • Dimensionality: Reduce feature count
  • Noise reduction: Filter out noisy components
  • Speed: Faster model training
  • Regularization: Implicit regularization effect
Pipeline: \begin{algorithmic}[1] \STATE Train set: Fit PCA \STATE Train set: Transform with PCA \STATE Test set: Transform with same PCA \STATE Train classifier on PC scores \STATE Evaluate on test PC scores \end{algorithmic}

Warning

Never fit PCA on test data! This causes data leakage.
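Spelled out without a pipeline, the leakage-free pattern is: fit on the training split, then reuse that same fitted transformation for both splits. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
X_test = rng.normal(size=(20, 10))

# Correct: fit PCA on the training set ONLY...
pca = PCA(n_components=3).fit(X_train)

# ...then apply the same fitted mean and components to both splits.
Z_train = pca.transform(X_train)     # (100, 3)
Z_test = pca.transform(X_test)       # (20, 3)
```

Calling `fit` (or `fit_transform`) on the test set, or on the concatenated data, would leak test-set statistics into the learned components.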

[Figure: ../figures/feature_engineering_application.png]

Example Results: \scriptsize \begin{tabular}{|l|c|c|} \hline Features & Accuracy & Time \\ \hline Original (1000D) & 0.85 & 120s \\ PCA (100D) & 0.87 & 15s \\ PCA (50D) & 0.86 & 8s \\ PCA (10D) & 0.79 & 2s \\ \hline \end{tabular} Observations:
  • Sweet spot: 50-100 components
  • Improved accuracy with PCA
  • Much faster training
  • Accuracy drops noticeably with too few PCs
\begin{exampleblock}{Use Case} High-dimensional datasets where features are correlated (genomics, text, images). \end{exampleblock}

Best Practices: Standardization

Critical Decision

Should you standardize features before PCA?
Standardization: $$z_j = \frac{x_j - \mu_j}{\sigma_j}$$ When to Standardize:
  • Features have different units
  • Features have different scales
  • Want equal weight for all features
  • Domain knowledge suggests equal importance
When NOT to Standardize:
  • Features already on same scale
  • Scale carries information
  • Domain knowledge: some features more important
  • Interpretation requires original scale

[Figure: ../figures/standardization_importance.png]

Example Impact: Dataset: Height (cm), Weight (kg), Age (years) Without standardization:
  • Variance is tied to the measurement units
  • Recording height in mm would inflate its variance $100\times$
  • PC1 dominated by whichever feature has the largest numeric spread
With standardization:
  • Every feature has unit variance
  • All features contribute on an equal footing
  • More balanced PCs
\begin{exampleblock}{Recommendation} Default: Standardize unless you have good reason not to. \end{exampleblock}
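The effect of standardization on loadings can be seen on synthetic data. This sketch records the same height in millimetres (large variance) next to weight in kilograms (small variance); the numbers and variable names are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic, correlated height/weight measurements.
height_cm = rng.normal(170, 8, 500)
weight_kg = 0.5 * height_cm + rng.normal(0, 5, 500)
X = np.column_stack([height_cm * 10, weight_kg])   # height stored in mm

# Without standardization, PC1 is dominated by the large-variance column.
raw_pc1 = np.abs(PCA(n_components=1).fit(X).components_[0])

# After standardization both features share PC1: a 2x2 correlation matrix
# has equal diagonal entries, so its top eigenvector is [1, 1]/sqrt(2).
Z = StandardScaler().fit_transform(X)
std_pc1 = np.abs(PCA(n_components=1).fit(Z).components_[0])
```

Here `raw_pc1` loads almost entirely on the mm-scaled height, while `std_pc1` has both loadings near $0.707$.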

Computational Complexity

Standard PCA Complexity: \begin{tabular}{|l|c|} \hline Operation & Complexity \\ \hline Centering & $\mathcal{O}(nd)$ \\ Covariance & $\mathcal{O}(nd^2)$ \\ Eigendecomp & $\mathcal{O}(d^3)$ \\ SVD & $\mathcal{O}(\min(nd^2, n^2d))$ \\ \hline Total & $\mathcal{O}(nd^2 + d^3)$ \\ \hline \end{tabular} Memory Requirements:
  • Data: $\mathcal{O}(nd)$
  • Covariance: $\mathcal{O}(d^2)$
  • Eigenvectors: $\mathcal{O}(d^2)$
  • Total: $\mathcal{O}(nd + d^2)$
Scalability Issues:
  • Large $d$: Covariance matrix huge
  • Large $n$: Memory for data matrix
  • Both large: Computational bottleneck

[Figure: ../figures/computational_complexity.png]

Solutions for Large-Scale: \begin{exampleblock}{Incremental PCA}
  • Process mini-batches of size $b$
  • Memory: $\mathcal{O}(bd + d^2)$
  • Time: $\mathcal{O}(ndk)$
\end{exampleblock} \begin{exampleblock}{Randomized PCA}
  • Approximate top-$k$ PCs
  • Time: $\mathcal{O}(ndk)$
  • Much faster for $k \ll d$
\end{exampleblock} \begin{exampleblock}{Sparse PCA}
  • Exploit data sparsity
  • Reduce effective dimensionality
\end{exampleblock}

Rule of Thumb

Standard PCA: $n, d < 10,000$\\ Incremental: $n > 100,000$\\ Randomized: $d > 10,000$, $k \ll d$
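Both large-scale variants are available in scikit-learn; a short sketch on synthetic data (shapes chosen for illustration only):

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 200))

# Randomized PCA: approximate top-k components in O(ndk) instead of O(nd^2).
rpca = PCA(n_components=10, svd_solver="randomized", random_state=0).fit(X)

# Incremental PCA: stream the data in mini-batches to bound memory use.
ipca = IncrementalPCA(n_components=10)
for batch in np.array_split(X, 4):      # 4 mini-batches of 500 rows each
    ipca.partial_fit(batch)

# Both yield a (10, 200) component matrix.
```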

Common Pitfalls and How to Avoid Them

1. Data Leakage:

Mistake

Fitting PCA on entire dataset including test set.
Correct approach:
  • Fit PCA only on training set
  • Transform train and test separately
  • Use same transformation for both
2. Forgetting to Center:

Mistake

Applying PCA without centering data.
Why it matters:
  • PCA assumes zero mean
  • Results will be incorrect
  • Always center first!
3. Wrong Scaling Choice:

Mistake

Not standardizing when features have different scales.
Impact:
  • PCs dominated by large-scale features
  • Misleading variance explanation
4. Interpreting PCs as Features:

Mistake

Assuming PCs have same meaning as original features.
Reality:
  • PCs are linear combinations
  • May not have intuitive interpretation
  • Use loadings for understanding
5. Assuming Linearity:

Mistake

Applying PCA to manifold data with non-linear structure.
Solution:
  • Check for non-linearity first
  • Consider Kernel PCA or manifold methods
6. Ignoring Outliers:

Mistake

Not handling outliers before PCA.
Impact:
  • PCs skewed by outliers
  • Variance misrepresented
  • Use robust PCA if needed

Interpreting PCA Results

What to Report:
  1. Number of components: $k$ chosen
  2. Explained variance: Per component and cumulative
  3. Scree plot: Visualize eigenvalue decay
  4. Loading matrix: Top features per PC
  5. Biplot: If $d$ is small
  6. Reconstruction error: If applicable
Interpreting Loadings:
  • Examine largest magnitude loadings
  • Group features by sign
  • Name PC based on dominant features
  • Example: "Size component", "Age component"
\begin{exampleblock}{Example} PC1 with high loadings on [height, weight, BMI]:
  • Interpretation: "Body size" component
  • Positive values: Larger individuals
  • Negative values: Smaller individuals
\end{exampleblock}
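Extracting and ranking loadings takes a few lines; this sketch builds hypothetical data in which height, weight, and BMI share a latent "size" factor, then names PC1 from its largest-magnitude loadings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical measurements driven by a common "body size" factor.
size = rng.normal(size=500)
X = np.column_stack([170 + 8 * size + rng.normal(0, 2, 500),    # height
                     70 + 10 * size + rng.normal(0, 3, 500),    # weight
                     24 + 3 * size + rng.normal(0, 1, 500),     # BMI
                     rng.normal(40, 10, 500)])                  # age (unrelated)
features = ["height", "weight", "BMI", "age"]

pca = PCA().fit(StandardScaler().fit_transform(X))
loadings = pca.components_[0]                   # PC1 loadings

# Rank features by |loading| and name the component after the dominant ones.
order = np.argsort(-np.abs(loadings))
top = [features[i] for i in order[:3]]
```

Here `top` recovers height, weight, and BMI, supporting the "body size" interpretation, while age has a near-zero PC1 loading.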
Statistical Significance: Bootstrap approach:
  • Resample data multiple times
  • Compute PCA on each sample
  • Check stability of components
  • Report confidence intervals
Permutation test:
  • Randomly permute features
  • Compare eigenvalues to null distribution
  • Test if variance is significant
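The bootstrap stability check above can be sketched as follows: refit PCA on resampled datasets and measure how well each refitted PC1 aligns with the original (absolute cosine similarity absorbs arbitrary sign flips). The anisotropic toy data is an assumption for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy 2D data with a clearly dominant direction (std 3 vs std 1).
X = rng.normal(size=(300, 2)) @ np.diag([3.0, 1.0])

base = PCA(n_components=1).fit(X).components_[0]

# Bootstrap: resample rows with replacement, refit, compare directions.
sims = []
for _ in range(50):
    idx = rng.integers(0, len(X), len(X))
    pc1 = PCA(n_components=1).fit(X[idx]).components_[0]
    sims.append(abs(base @ pc1))        # |cos| between unit vectors

stability = float(np.mean(sims))        # near 1.0 means a stable component
```

When the leading eigenvalues are well separated, `stability` stays close to 1; near-degenerate eigenvalues make the components rotate freely across resamples and push it down.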
Practical Checklist:
  • $\checkmark$ Data centered/standardized?
  • $\checkmark$ Scree plot shows elbow?
  • $\checkmark$ Enough variance explained?
  • $\checkmark$ PCs interpretable?
  • $\checkmark$ No data leakage?
  • $\checkmark$ Outliers addressed?
  • $\checkmark$ Results validated?

Documentation

Always document preprocessing choices and justification for $k$.

Key Takeaways

Core Concepts:
  • PCA: Linear dimensionality reduction via variance maximization
  • Principal components: Orthogonal directions of maximum variance
  • Eigendecomposition: Mathematical foundation
  • SVD: Practical computation method
  • Variance explained: Quantifies information retention
Key Steps:
  1. Center (and optionally standardize) data
  2. Compute covariance or apply SVD
  3. Extract eigenvectors/singular vectors
  4. Project data onto top-$k$ components
  5. Evaluate and interpret results
Variants:
  • Kernel PCA: Non-linear extension
  • Sparse PCA: Interpretable loadings
  • Incremental PCA: Large-scale data
  • Robust PCA: Handle outliers
Applications:
  • Data visualization and exploration
  • Image compression and processing
  • Face recognition (eigenfaces)
  • Noise filtering and denoising
  • Feature engineering for ML
  • Dimensionality reduction
Best Practices:
  • Always center data, standardize if needed
  • Use scree plot and explained variance
  • Avoid data leakage in train/test split
  • Check for outliers and non-linearity
  • Validate component selection
  • Document all preprocessing choices
Limitations:
  • Assumes linear relationships
  • Sensitive to outliers and scaling
  • Loses interpretability
  • May not preserve non-linear structure

Further Reading and Resources

Classic Papers:
  • Pearson (1901): "On lines and planes of closest fit"
  • Hotelling (1933): "Analysis of complex statistical variables"
  • Turk \& Pentland (1991): "Eigenfaces for recognition"
  • Jolliffe (2002): "Principal Component Analysis" (book)
Advanced Topics:
  • Independent Component Analysis (ICA)
  • Non-negative Matrix Factorization (NMF)
  • t-SNE and UMAP for visualization
  • Autoencoders for non-linear PCA
  • Gaussian Process Latent Variable Models
Software Libraries:
  • \texttt{scikit-learn}: PCA, KernelPCA, IncrementalPCA
  • \texttt{numpy/scipy}: Low-level linear algebra
  • \texttt{statsmodels}: Statistical PCA
Related Methods:
  • Linear Discriminant Analysis (LDA): Supervised dimensionality reduction
  • Factor Analysis: Assumes latent variables
  • Canonical Correlation Analysis: Multi-view learning
  • Isomap: Geodesic distances
  • Locally Linear Embedding: Manifold learning
When to Use Alternatives:
  • Non-linear structure: Kernel PCA, manifold methods
  • Labeled data: LDA, supervised methods
  • Visualization only: t-SNE, UMAP
  • Interpretability: Sparse methods, NMF
  • Very large data: Random projections, sketching
\begin{exampleblock}{Next Steps}
  • Practice on real datasets
  • Compare with other methods
  • Explore kernel and sparse variants
  • Study deep learning autoencoders
\end{exampleblock}
Thank you for your attention!

End of Module 07

Principal Component Analysis

Questions?