CMSC 173

Module 07: Principal Component Analysis


Principal Component Analysis

CMSC 173 - Module 07

Noel Jeffrey Pinton
Department of Computer Science
University of the Philippines Cebu

Outline

\tableofcontents

What is Principal Component Analysis?

Definition

Principal Component Analysis (PCA) is a statistical technique that transforms high-dimensional data into a lower-dimensional representation while preserving as much variance as possible.
\begin{exampleblock}{Key Characteristics}
  • Unsupervised learning method
  • Linear dimensionality reduction
  • Orthogonal transformation
  • Variance maximization
\end{exampleblock}

[Figure: ../figures/dimensionality_reduction_concept.png]

Applications:
  • Data visualization
  • Feature extraction
  • Noise reduction
  • Image compression

Goal

Find new axes (principal components) that maximize variance and are uncorrelated with each other.
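As a first look, the whole pipeline fits in a few lines with scikit-learn (a sketch on synthetic data; the stretch factors 3.0 and 0.5 are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2-D data stretched along one axis (hypothetical example).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                 # data expressed in PC coordinates
print(pca.explained_variance_ratio_)     # PC1 carries most of the variance
```

The columns of `Z` are uncorrelated by construction, which is exactly the "uncorrelated with each other" property above.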

The Curse of Dimensionality

Problem

As the number of features increases, data becomes increasingly sparse, making analysis difficult and computationally expensive.
Challenges in High Dimensions:
  • Data sparsity: Points become isolated
  • Distance measures: Lose meaning
  • Overfitting: Models fit noise
  • Computation: Time/memory grows exponentially

Hughes Phenomenon

Model performance initially improves with more features, then degrades beyond an optimal point.

[Figure: ../figures/curse_of_dimensionality.png]

Example Impact:
  • 10 features: $10^2 = 100$ pairwise interactions
  • 100 features: $100^2 = 10,000$ interactions
  • 1000 features: $1000^2 = 1,000,000$ interactions!
\begin{exampleblock}{Solution: PCA} Reduce dimensions while retaining information. \end{exampleblock}

Why Dimensionality Reduction?

Motivation for PCA:
  • Visualization: Reduce to 2D/3D for plotting
  • Storage: Compress data efficiently
  • Speed: Faster training and inference
  • Noise removal: Filter out low-variance components
  • Feature extraction: Create meaningful features
  • Collinearity: Remove redundant features

Core Principle

Most real-world data has intrinsic dimensionality much lower than its ambient dimensionality.
Example: Images of faces
  • Ambient: 10,000 pixels
  • Intrinsic: 100-200 dimensions
Benefits vs Trade-offs: \begin{exampleblock}{Advantages}
  • Reduced computational cost
  • Easier visualization
  • Noise reduction
  • Better generalization
\end{exampleblock}

Trade-offs

  • Information loss
  • Interpretability reduced
  • Linear assumptions
  • Sensitive to scaling

When to Use PCA

  • Many correlated features
  • Need for visualization
  • Computational constraints
  • Preprocessing for ML

Real-World Applications

Computer Vision:
  • Face recognition: Eigenfaces
  • Image compression: Reduce storage
  • Object detection: Feature extraction
  • Image denoising: Remove noise
Finance:
  • Portfolio optimization
  • Risk assessment
  • Market trend analysis
  • Fraud detection
Biology \& Medicine:
  • Gene expression analysis
  • Medical imaging
  • Drug discovery
  • Disease classification
Natural Language Processing:
  • Document clustering
  • Topic modeling
  • Sentiment analysis
  • Information retrieval
Signal Processing:
  • Audio compression
  • Speech recognition
  • Sensor data analysis
  • Anomaly detection
Recommendation Systems:
  • User preference modeling
  • Content filtering
  • Collaborative filtering
  • Feature engineering
\begin{exampleblock}{Success Story} Netflix Prize: Top teams relied on SVD-style matrix factorization, closely related to PCA, to model latent user and movie factors. \end{exampleblock}

Variance: Measuring Spread

Definition

Variance measures how far data points spread from their mean value.
For a single variable: $$\begin{aligned}\text{Var}(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \\ = \mathbb{E}[(X - \mathbb{E}[X])^2]\end{aligned}$$ Properties:
  • Always non-negative: $\text{Var}(X) \geq 0$
  • Zero variance: constant values
  • Units: squared original units
  • Standard deviation: $\sigma = \sqrt{\text{Var}(X)}$

PCA Goal

Find directions of maximum variance in the data.

[Figure: ../figures/variance_visualization.png]

Example Calculation: Data: $[2, 4, 6, 8, 10]$ $$\begin{aligned}\bar{x} = \frac{2+4+6+8+10}{5} = 6 \\ \text{Var}(X) = \frac{(2-6)^2 + (4-6)^2 + \cdots}{5} \\ = \frac{16 + 4 + 0 + 4 + 16}{5} = 8\end{aligned}$$ \begin{exampleblock}{Intuition} High variance direction contains more information than low variance direction. \end{exampleblock}
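The calculation above can be checked directly (NumPy's `np.var` uses the same $1/n$ population formula by default):

```python
import numpy as np

x = np.array([2, 4, 6, 8, 10])
var = np.mean((x - x.mean()) ** 2)   # population variance, the 1/n formula above
print(var)          # 8.0
print(np.var(x))    # same: np.var defaults to ddof=0, i.e. divide by n
```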

Covariance: Measuring Linear Relationship

Definition

Covariance measures the joint variability of two variables.
Formula: $$\begin{aligned}\text{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \\ = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]\end{aligned}$$ Interpretation:
  • $\text{Cov}(X, Y) > 0$: Positive relationship
  • $\text{Cov}(X, Y) < 0$: Negative relationship
  • $\text{Cov}(X, Y) = 0$: No linear relationship
  • $\text{Cov}(X, X) = \text{Var}(X)$
\begin{exampleblock}{Correlation} Normalized covariance: $$\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} \in [-1, 1]$$ \end{exampleblock}
Covariance Matrix: For centered data $\mathbf{X} \in \mathbb{R}^{n \times d}$ (each column has zero mean): $$\mathbf{\Sigma} = \frac{1}{n}\mathbf{X}^T\mathbf{X} \in \mathbb{R}^{d \times d}$$ Structure: $$\mathbf{\Sigma} = \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1,X_2) & \cdots \\ \text{Cov}(X_2,X_1) & \text{Var}(X_2) & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}$$ Properties:
  • Symmetric: $\Sigma_{ij} = \Sigma_{ji}$
  • Positive semi-definite
  • Diagonal: variances
  • Off-diagonal: covariances
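A quick check of the formula (the data must be centered first; `np.cov` with `bias=True` uses the same $1/n$ convention, with rows as variables):

```python
import numpy as np

# Small hypothetical dataset: two positively related features.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
Xc = X - X.mean(axis=0)          # centering is required before X^T X
Sigma = Xc.T @ Xc / len(X)       # the 1/n covariance formula
print(Sigma)
print(np.cov(X.T, bias=True))    # identical result
```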

PCA Key

Diagonalize covariance matrix to find uncorrelated components.

Eigendecomposition: The Core of PCA

Definition

For a square matrix $\mathbf{A}$, eigendecomposition finds vectors and scalars such that: $$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$$
Components:
  • $\mathbf{v}$: Eigenvector (direction)
  • $\lambda$: Eigenvalue (scaling factor)
  • Matrix only changes magnitude, not direction
For Covariance Matrix: $$\mathbf{\Sigma} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^T$$ where:
  • $\mathbf{V}$: Eigenvector matrix
  • $\mathbf{\Lambda}$: Diagonal eigenvalue matrix
  • $\mathbf{V}^T\mathbf{V} = \mathbf{I}$ (orthonormal)

[Figure: ../figures/covariance_eigenvectors.png]

Geometric Interpretation:
  • Eigenvectors: Principal component directions
  • Eigenvalues: Variance along each PC
  • Largest eigenvalue: Most variance
  • Smallest eigenvalue: Least variance
\begin{exampleblock}{PCA Connection}
  • PC directions = Eigenvectors
  • PC importance = Eigenvalues
  • Sorted by decreasing eigenvalue
\end{exampleblock}

Eigendecomposition Example

Problem: Find eigenvalues and eigenvectors $$\mathbf{\Sigma} = \begin{bmatrix} 4 & 2 \\ 2 & 3 \end{bmatrix}$$ Step 1: Characteristic equation $$\det(\mathbf{\Sigma} - \lambda \mathbf{I}) = 0$$ $$\det\begin{bmatrix} 4-\lambda & 2 \\ 2 & 3-\lambda \end{bmatrix} = 0$$ $$(4-\lambda)(3-\lambda) - 4 = 0$$ $$\lambda^2 - 7\lambda + 8 = 0$$ Step 2: Solve via the quadratic formula $$\lambda = \frac{7 \pm \sqrt{17}}{2} \implies \lambda_1 \approx 5.56,\; \lambda_2 \approx 1.44$$
Step 3: Find eigenvectors For $\lambda_1 = 5.56$: $$(\mathbf{\Sigma} - 5.56\mathbf{I})\mathbf{v}_1 = 0$$ $$\mathbf{v}_1 = \begin{bmatrix} 0.79 \\ 0.61 \end{bmatrix}$$ For $\lambda_2 = 1.44$: $$\mathbf{v}_2 = \begin{bmatrix} -0.61 \\ 0.79 \end{bmatrix}$$ Interpretation:
  • PC1 direction: $(0.79, 0.61)$
  • PC1 variance: $5.56$
  • PC2 direction: $(-0.61, 0.79)$
  • PC2 variance: $1.44$
  • Total variance: $5.56 + 1.44 = 7$

Note

Eigenvectors are orthogonal: $\mathbf{v}_1 \perp \mathbf{v}_2$
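The hand computation above can be verified numerically (`np.linalg.eigh` is the right routine for symmetric matrices; it returns eigenvalues in ascending order, so we reverse them):

```python
import numpy as np

Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])
lam, V = np.linalg.eigh(Sigma)        # eigh: symmetric input, ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]        # sort descending, PC1 first

print(lam)                 # approximately [5.56, 1.44] = (7 +- sqrt(17)) / 2
print(lam.sum())           # 7.0 = trace(Sigma): total variance is preserved
print(V[:, 0] @ V[:, 1])   # ~0: eigenvectors are orthogonal
```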

Singular Value Decomposition (SVD)

Definition

Any matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ can be decomposed as: $$\mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$$
Components:
  • $\mathbf{U} \in \mathbb{R}^{n \times n}$: Left singular vectors
  • $\mathbf{\Sigma} \in \mathbb{R}^{n \times d}$: Singular values (diagonal)
  • $\mathbf{V} \in \mathbb{R}^{d \times d}$: Right singular vectors
  • $\mathbf{U}^T\mathbf{U} = \mathbf{I}$, $\mathbf{V}^T\mathbf{V} = \mathbf{I}$
Relationship to PCA: $$\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{\Sigma}^T\mathbf{\Sigma}\mathbf{V}^T$$ Therefore:
  • PC directions: Columns of $\mathbf{V}$
  • PC variances: $\sigma_i^2 / n$

[Figure: ../figures/svd_decomposition.png]

Advantages of SVD for PCA:
  • More numerically stable
  • Works for $n < d$ or $n > d$
  • No need to form $\mathbf{X}^T\mathbf{X}$
  • Efficient algorithms available
\begin{exampleblock}{Truncated SVD} Keep only top $k$ components: $$\mathbf{X}_k = \mathbf{U}_k\mathbf{\Sigma}_k\mathbf{V}_k^T$$ Best rank-$k$ approximation in Frobenius norm. \end{exampleblock}
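The two routes give the same components, which we can confirm on random data (a sketch; signs of individual components may differ between methods):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix
lam, V = np.linalg.eigh(Xc.T @ Xc / len(X))
lam, V = lam[::-1], V[:, ::-1]

# Route 2: SVD of the centered data, never forming X^T X
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(s**2 / len(X), lam))        # variances agree: sigma_i^2 / n
print(np.allclose(np.abs(Vt), np.abs(V.T)))   # directions agree up to sign
```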

SVD vs Eigendecomposition

Eigendecomposition Approach

Steps:
  1. Center data: $\tilde{\mathbf{X}} = \mathbf{X} - \bar{\mathbf{X}}$
  2. Compute covariance: $\mathbf{\Sigma} = \frac{1}{n}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$
  3. Eigendecomposition: $\mathbf{\Sigma} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^T$
  4. Sort by eigenvalues
Complexity: $\mathcal{O}(d^2n + d^3)$ Issues:
  • Numerical instability for $\mathbf{X}^T\mathbf{X}$
  • Squaring condition number
  • Memory: storing $d \times d$ matrix

SVD Approach

Steps:
  1. Center data: $\tilde{\mathbf{X}} = \mathbf{X} - \bar{\mathbf{X}}$
  2. SVD: $\tilde{\mathbf{X}} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T$
  3. PCs: columns of $\mathbf{V}$
  4. Variance: $\sigma_i^2 / n$
Complexity: $\mathcal{O}(\min(nd^2, n^2d))$ Advantages:
  • Better numerical stability
  • Works directly on $\mathbf{X}$
  • Better for $n \ll d$ or $n \gg d$
  • Modern implementations optimized

Recommendation

Use SVD for PCA in practice.

Projection and Reconstruction

Projection onto PCs: Transform to $k$ principal components: $$\mathbf{Z} = \mathbf{X}\mathbf{V}_k$$ where $\mathbf{V}_k \in \mathbb{R}^{d \times k}$ are first $k$ PCs. Properties:
  • $\mathbf{Z} \in \mathbb{R}^{n \times k}$: Reduced representation
  • Columns of $\mathbf{Z}$ are uncorrelated
  • Maximum variance preserved
  • Linear transformation
Reconstruction (for centered $\mathbf{X}$; add the mean back to return to original coordinates): $$\hat{\mathbf{X}} = \mathbf{Z}\mathbf{V}_k^T = \mathbf{X}\mathbf{V}_k\mathbf{V}_k^T$$

Reconstruction Error

$$\frac{1}{n}\|\mathbf{X} - \hat{\mathbf{X}}\|_F^2 = \sum_{i=k+1}^{d}\lambda_i$$ (the $1/n$ matches the covariance convention used throughout)

[Figure: ../figures/pca_reconstruction.png]

Example: Original: $\mathbf{x} = [5, 3]$ PCs: $\mathbf{v}_1 = [0.8, 0.6]$ Projection: $$z_1 = \mathbf{x}^T\mathbf{v}_1 = 5(0.8) + 3(0.6) = 5.8$$ Reconstruction: $$\hat{\mathbf{x}} = z_1\mathbf{v}_1 = 5.8[0.8, 0.6] = [4.64, 3.48]$$ Error: $$\|\mathbf{x} - \hat{\mathbf{x}}\| = \|[0.36, -0.48]\| = 0.6$$
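The numbers in the example check out in code:

```python
import numpy as np

x = np.array([5.0, 3.0])
v1 = np.array([0.8, 0.6])            # unit vector: 0.8^2 + 0.6^2 = 1

z1 = x @ v1                          # projection score onto PC1
x_hat = z1 * v1                      # reconstruction from one component
err = np.linalg.norm(x - x_hat)

print(z1, x_hat, err)                # approximately 5.8, [4.64, 3.48], 0.6
```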

Data Centering and Standardization

Centering (Required): Remove mean from each feature: $$\tilde{x}_{ij} = x_{ij} - \bar{x}_j$$ Why?
  • PCA finds directions of variance
  • Variance computed about mean
  • Ensures PC1 passes through origin
  • Mathematical requirement
Standardization (Optional): Scale to unit variance: $$\tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j}$$ When to standardize?
  • Features have different scales
  • Different units of measurement
  • Want equal feature importance

[Figure: ../figures/standardization_importance.png]

\begin{exampleblock}{Example Impact} Without standardization:
  • Income: \$20,000-\$200,000
  • Age: 20-80 years
  • PC1 dominated by income scale
With standardization:
  • Both scaled to mean 0, std 1
  • Equal contribution opportunity
\end{exampleblock}

Best Practice

Always center. Standardize if features have different scales or units.
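The scale effect from the example is easy to reproduce (a sketch with synthetic income/age data; the ranges follow the slide):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = np.column_stack([
    rng.uniform(20_000, 200_000, size=300),   # income, in dollars
    rng.uniform(20, 80, size=300),            # age, in years
])

raw = PCA(n_components=2).fit(X)
std = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print(raw.explained_variance_ratio_)   # PC1 ~ 100%: an artifact of income's scale
print(std.explained_variance_ratio_)   # roughly balanced after standardization
```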

PCA Algorithm: Step-by-Step

\begin{algorithm}[H] \caption{Principal Component Analysis} \begin{algorithmic}[1] \REQUIRE Data matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, number of components $k$ \ENSURE Principal components $\mathbf{V}_k$, transformed data $\mathbf{Z}$ \STATE Center the data: \STATE $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$ \STATE $\tilde{\mathbf{X}} = \mathbf{X} - \mathbf{1}_n\bar{\mathbf{x}}^T$ \STATE Optional: Standardize features \STATE Compute covariance matrix: \STATE $\mathbf{\Sigma} = \frac{1}{n}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$ \STATE OR compute SVD: \STATE $\tilde{\mathbf{X}} = \mathbf{U}\mathbf{D}\mathbf{V}^T$ \STATE Extract top $k$ eigenvectors/singular vectors: \STATE $\mathbf{V}_k = [\mathbf{v}_1, …, \mathbf{v}_k]$ \STATE Project data onto PCs: \STATE $\mathbf{Z} = \tilde{\mathbf{X}}\mathbf{V}_k$ \RETURN $\mathbf{V}_k$, $\mathbf{Z}$ \end{algorithmic} \end{algorithm}
Key Steps Explained: \begin{exampleblock}{Step 1-2: Centering} Essential for computing variance correctly. Shift data so mean is at origin. \end{exampleblock} \begin{exampleblock}{Step 3-4: Covariance/SVD} Two equivalent approaches. SVD more stable numerically. \end{exampleblock} \begin{exampleblock}{Step 5: Top-k Selection} Sort by eigenvalues (descending) and keep largest $k$ components. \end{exampleblock} \begin{exampleblock}{Step 6: Projection} Transform original data to new coordinate system defined by PCs. \end{exampleblock} Complexity:
  • Eigendecomposition: $\mathcal{O}(d^3)$
  • SVD: $\mathcal{O}(\min(nd^2, n^2d))$
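A direct NumPy translation of the algorithm above (SVD route), checked against scikit-learn on illustrative random data:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_svd(X, k):
    """Steps 1-6 above: center, SVD, keep top-k directions, project."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                                      # center
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)   # decompose
    Vk = Vt[:k].T                                       # top-k components (d x k)
    Z = Xc @ Vk                                         # project
    return Vk, Z

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
Vk, Z = pca_svd(X, 2)

Z_sk = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(Z), np.abs(Z_sk)))   # same scores up to sign
```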

Worked Example: PCA on Toy Dataset (Part 1)

Problem Setup: Dataset with 5 points in 2D: $$\mathbf{X} = \begin{bmatrix} 1 & 2 \\ 2 & 3 \\ 3 & 4 \\ 4 & 5 \\ 5 & 6 \end{bmatrix}$$ Goal: Find principal components Step 1: Center the data Compute mean: $$\bar{\mathbf{x}} = \frac{1}{5}[15, 20]^T = [3, 4]^T$$ Centered data: $$\tilde{\mathbf{X}} = \begin{bmatrix} -2 & -2 \\ -1 & -1 \\ 0 & 0 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$$
Step 2: Compute covariance matrix $$\mathbf{\Sigma} = \frac{1}{5}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$$ $$= \frac{1}{5}\begin{bmatrix} -2 & -1 & 0 & 1 & 2 \\ -2 & -1 & 0 & 1 & 2 \end{bmatrix} \begin{bmatrix} -2 & -2 \\ -1 & -1 \\ 0 & 0 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$$ $$= \frac{1}{5}\begin{bmatrix} 10 & 10 \\ 10 & 10 \end{bmatrix} = \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix}$$

Observation

Perfect correlation: $\text{Cov}(X_1, X_2) = \text{Var}(X_1) = \text{Var}(X_2) = 2$

Worked Example: PCA on Toy Dataset (Part 2)

Step 3: Find eigenvalues Solve: $\det(\mathbf{\Sigma} - \lambda\mathbf{I}) = 0$ $$\det\begin{bmatrix} 2-\lambda & 2 \\ 2 & 2-\lambda \end{bmatrix} = 0$$ $$(2-\lambda)^2 - 4 = 0$$ $$\lambda^2 - 4\lambda = 0$$ $$\lambda(\lambda - 4) = 0$$ Eigenvalues: $$\lambda_1 = 4, \lambda_2 = 0$$ Interpretation:
  • All variance in first PC: $\lambda_1 = 4$
  • Zero variance in second PC: $\lambda_2 = 0$
  • Data lies on a line!
Step 4: Find eigenvectors For $\lambda_1 = 4$: $$(\mathbf{\Sigma} - 4\mathbf{I})\mathbf{v}_1 = 0$$ $$\begin{bmatrix} -2 & 2 \\ 2 & -2 \end{bmatrix}\mathbf{v}_1 = 0$$ Solution: $\mathbf{v}_1 = \frac{1}{\sqrt{2}}[1, 1]^T$ For $\lambda_2 = 0$: $$\mathbf{v}_2 = \frac{1}{\sqrt{2}}[1, -1]^T$$ Verification: $$\mathbf{v}_1^T\mathbf{v}_2 = \frac{1}{2}(1 - 1) = 0 \checkmark$$ \begin{exampleblock}{Result} PC1: $[0.707, 0.707]$ (diagonal direction)\\ PC2: $[0.707, -0.707]$ (perpendicular) \end{exampleblock}

Worked Example: PCA on Toy Dataset (Part 3)

Step 5: Project data onto PCs $$\mathbf{Z} = \tilde{\mathbf{X}}\mathbf{V}$$ $$= \begin{bmatrix} -2 & -2 \\ -1 & -1 \\ 0 & 0 \\ 1 & 1 \\ 2 & 2 \end{bmatrix} \begin{bmatrix} 0.707 & 0.707 \\ 0.707 & -0.707 \end{bmatrix}$$ $$= \begin{bmatrix} -2.83 & 0 \\ -1.41 & 0 \\ 0 & 0 \\ 1.41 & 0 \\ 2.83 & 0 \end{bmatrix}$$ Observation: All points lie on PC1 axis (PC2 coordinates are zero).
Step 6: Explained variance Total variance: $$\text{Total} = \lambda_1 + \lambda_2 = 4 + 0 = 4$$ Explained by PC1: $$\frac{\lambda_1}{\text{Total}} = \frac{4}{4} = 100\%$$ Explained by PC2: $$\frac{\lambda_2}{\text{Total}} = \frac{0}{4} = 0\%$$ Reconstruction (k=1): $$\hat{\mathbf{X}} = \mathbf{Z}_{:,1}\mathbf{v}_1^T + \bar{\mathbf{X}}$$ Perfect reconstruction since all variance in PC1!

Conclusion

This toy example shows perfectly correlated data requiring only 1 dimension.
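The whole worked example (Parts 1-3) can be replayed in a few lines:

```python
import numpy as np

X = np.array([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]], dtype=float)
Xc = X - X.mean(axis=0)                     # mean is [3, 4]
Sigma = Xc.T @ Xc / len(X)                  # [[2, 2], [2, 2]]

lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]
print(lam)                                  # [4, 0]: all variance in PC1

Z = Xc @ V
print(np.allclose(Z[:, 1], 0))              # PC2 coordinates vanish
```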

Worked Example: Non-Trivial 2D Case (Part 1)

New Dataset (4 points): $$\mathbf{X} = \begin{bmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{bmatrix}$$ Step 1: Center data Mean: $\bar{\mathbf{x}} = [2.5, 2.5]^T$ $$\tilde{\mathbf{X}} = \begin{bmatrix} -1.5 & -0.5 \\ -0.5 & -1.5 \\ 0.5 & 1.5 \\ 1.5 & 0.5 \end{bmatrix}$$ Step 2: Covariance $$\mathbf{\Sigma} = \frac{1}{4}\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}$$ $$= \frac{1}{4}\begin{bmatrix} 5 & 3 \\ 3 & 5 \end{bmatrix} = \begin{bmatrix} 1.25 & 0.75 \\ 0.75 & 1.25 \end{bmatrix}$$
Step 3: Eigenvalues $$\det(\mathbf{\Sigma} - \lambda\mathbf{I}) = 0$$ $$(1.25-\lambda)^2 - 0.5625 = 0$$ $$\lambda^2 - 2.5\lambda + 1 = 0$$ Using quadratic formula: $$\lambda = \frac{2.5 \pm \sqrt{6.25-4}}{2} = \frac{2.5 \pm 1.5}{2}$$ Eigenvalues: $$\lambda_1 = 2.0, \lambda_2 = 0.5$$ Variance explained:
  • PC1: $\frac{2.0}{2.5} = 80\%$
  • PC2: $\frac{0.5}{2.5} = 20\%$
  • Both components carry information!

Worked Example: Non-Trivial 2D Case (Part 2)

Step 4: Eigenvectors For $\lambda_1 = 2.0$: $$\begin{bmatrix} -0.75 & 0.75 \\ 0.75 & -0.75 \end{bmatrix}\mathbf{v}_1 = 0$$ Normalize: $\mathbf{v}_1 = \frac{1}{\sqrt{2}}[1, 1]^T = [0.707, 0.707]^T$ For $\lambda_2 = 0.5$: $$\mathbf{v}_2 = \frac{1}{\sqrt{2}}[1, -1]^T = [0.707, -0.707]^T$$ Step 5: Transform data $$\mathbf{Z} = \tilde{\mathbf{X}}\mathbf{V}$$ $$= \begin{bmatrix} -1.414 & -0.707 \\ -1.414 & 0.707 \\ 1.414 & -0.707 \\ 1.414 & 0.707 \end{bmatrix}$$
Step 6: Reconstruction (k=1) Keep only PC1: $$\hat{\tilde{\mathbf{X}}} = \mathbf{Z}_{:,1}\mathbf{v}_1^T$$ $$= \begin{bmatrix} -1.414 \\ -1.414 \\ 1.414 \\ 1.414 \end{bmatrix} \begin{bmatrix} 0.707 & 0.707 \end{bmatrix}$$ $$= \begin{bmatrix} -1.0 & -1.0 \\ -1.0 & -1.0 \\ 1.0 & 1.0 \\ 1.0 & 1.0 \end{bmatrix}$$ Reconstruction error: $$\text{Error} = \lambda_2 = 0.5$$ \begin{exampleblock}{Summary} Reduced from 2D to 1D, preserving 80\% of variance. \end{exampleblock}
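Replaying this example numerically also confirms that the $k{=}1$ reconstruction error equals the discarded eigenvalue (with the $1/n$ averaging convention used here):

```python
import numpy as np

X = np.array([[1, 2], [2, 1], [3, 4], [4, 3]], dtype=float)
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(X)

lam, V = np.linalg.eigh(Sigma)
lam, V = lam[::-1], V[:, ::-1]
print(lam)                                   # [2.0, 0.5] -> 80% / 20%

Z = Xc @ V
Xc_hat = np.outer(Z[:, 0], V[:, 0])          # keep PC1 only
err = ((Xc - Xc_hat) ** 2).sum() / len(X)    # mean squared residual per point
print(err)                                   # 0.5 = lambda_2
```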

Kernel PCA: Non-linear Extension

Motivation

Standard PCA finds only linear relationships. Many real-world datasets have non-linear structure.
Key Idea:
  • Map data to high-dimensional space: $\phi: \mathbb{R}^d \to \mathcal{H}$
  • Apply PCA in feature space $\mathcal{H}$
  • Use kernel trick: $K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$
  • Never explicitly compute $\phi(x)$
Algorithm:
  1. Compute kernel matrix: $\mathbf{K}_{ij} = K(x_i, x_j)$
  2. Center kernel matrix
  3. Eigendecompose: $\mathbf{K} = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^T$
  4. Project: $z_i = \mathbf{\Lambda}^{-1/2}\mathbf{U}^T\mathbf{k}(x_i)$, where $\mathbf{k}(x_i) = [K(x_1, x_i), \ldots, K(x_n, x_i)]^T$

[Figure: ../figures/kernel_pca_concept.png]

Common Kernels:
  • RBF: $K(x,y) = \exp(-\gamma\|x-y\|^2)$
  • Polynomial: $K(x,y) = (x^Ty + c)^d$
  • Sigmoid: $K(x,y) = \tanh(\alpha x^Ty + c)$
\begin{exampleblock}{Applications}
  • Manifold learning
  • Non-linear dimensionality reduction
  • Feature extraction for non-linear data
\end{exampleblock}

Trade-off

More flexible but computationally expensive: $\mathcal{O}(n^3)$
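A sketch with scikit-learn's `KernelPCA` on a classic non-linear dataset (two concentric circles; the RBF `gamma=10` follows the scikit-learn example for this data):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: no straight line (or linear PC) separates them.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

Z_lin = PCA(n_components=2).fit_transform(X)
Z_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)
# Linear PCA merely rotates the circles; in the RBF feature space the first
# components typically pull the inner and outer circles apart.
```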

Sparse PCA

Motivation

Standard PCA produces components that are linear combinations of all original features, making interpretation difficult.
Goal: Find PCs with sparse loadings
  • Most loadings are exactly zero
  • Only few features contribute to each PC
  • Easier interpretation
  • Feature selection built-in
Formulation: $$\min_{\mathbf{V}} \|\mathbf{X} - \mathbf{X}\mathbf{V}\mathbf{V}^T\|_F^2 + \lambda\sum_{j=1}^{k}\|\mathbf{v}_j\|_1$$
  • First term: Reconstruction error
  • Second term: Sparsity penalty (L1 regularization)
  • $\lambda$: Controls sparsity level

[Figure: ../figures/sparse_pca.png]

Comparison: \begin{tabular}{|l|c|c|} \hline Method & Variance & Sparsity \\ \hline Standard PCA & 100\% & 0\% \\ Sparse PCA & 95-98\% & 70-90\% \\ \hline \end{tabular} Applications:
  • Gene expression analysis
  • Financial data analysis
  • Text mining
  • Any domain requiring interpretability

Trade-off

Slight loss of explained variance for much better interpretability.
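A sketch with `SparsePCA` on synthetic data built from two independent groups of correlated features (the group structure and `alpha=1.0` are illustrative choices):

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
n = 200
# Features 0-2 share one latent factor, features 3-5 share another.
f1 = rng.normal(size=(n, 1))
f2 = rng.normal(size=(n, 1))
X = np.hstack([f1 + 0.1 * rng.normal(size=(n, 3)),
               f2 + 0.1 * rng.normal(size=(n, 3))])

spca = SparsePCA(n_components=2, alpha=1.0, random_state=0)
spca.fit(X)
print(spca.components_)                  # each component loads on one feature group
print(np.mean(spca.components_ == 0))    # fraction of exactly-zero loadings
```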

Incremental PCA

Motivation

Standard PCA requires entire dataset in memory. Not feasible for:
  • Very large datasets
  • Streaming data
  • Online learning scenarios
  • Limited memory systems
Key Idea: Process data in mini-batches:
  1. Initialize with first batch
  2. Update PCs incrementally with new batches
  3. Maintain approximate PCs online
  4. Never load full dataset
Algorithm Sketch: \begin{algorithmic}[1] \STATE Initialize: $\mathbf{V}_0$, $\mathbf{\Sigma}_0$, $n_0$ \FOR{each batch $\mathbf{X}_b$} \STATE Update mean: $\bar{\mathbf{x}}_{new}$ \STATE Update covariance: $\mathbf{\Sigma}_{new}$ \STATE SVD: Update $\mathbf{V}$ \STATE $n \gets n + n_b$ \ENDFOR \end{algorithmic}

[Figure: ../figures/incremental_pca.png]

Advantages:
  • Memory efficient: Constant memory
  • Scalable: Handles large data
  • Online: Updates with new data
  • Streaming: Real-time processing
Complexity:
  • Per batch: $\mathcal{O}(bd^2)$ where $b$ = batch size
  • Memory: $\mathcal{O}(d^2)$ instead of $\mathcal{O}(nd)$
  • Nearly identical results to batch PCA
\begin{exampleblock}{Use Cases}
  • Large-scale image processing
  • Sensor data streams
  • Online recommendation systems
\end{exampleblock}
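A sketch of the streaming pattern with scikit-learn's `IncrementalPCA` (the batch split and data are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 20))

ipca = IncrementalPCA(n_components=5)
for batch in np.array_split(X, 10):     # simulate a stream of mini-batches
    ipca.partial_fit(batch)             # constant memory: one batch at a time

full = PCA(n_components=5).fit(X)       # batch PCA on the full dataset
print(np.max(np.abs(ipca.explained_variance_ratio_
                    - full.explained_variance_ratio_)))   # small difference
```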

Probabilistic PCA

Motivation

Standard PCA is deterministic. Probabilistic PCA provides:
  • Generative model of data
  • Handling of missing values
  • Uncertainty quantification
  • Bayesian extensions
Model: $$\begin{aligned}\mathbf{z} \sim \mathcal{N}(0, \mathbf{I}_k) \\ \mathbf{x}|\mathbf{z} \sim \mathcal{N}(\mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2\mathbf{I}_d)\end{aligned}$$ where:
  • $\mathbf{z} \in \mathbb{R}^k$: Latent variables
  • $\mathbf{W} \in \mathbb{R}^{d \times k}$: Loading matrix
  • $\sigma^2$: Isotropic noise
Maximum Likelihood Solution: $$\mathbf{W}_{ML} = \mathbf{V}_k(\mathbf{\Lambda}_k - \sigma^2\mathbf{I})^{1/2}\mathbf{R}$$ where $\mathbf{R}$ is arbitrary rotation.
EM Algorithm: \begin{algorithmic}[1] \STATE Initialize $\mathbf{W}$, $\sigma^2$ \REPEAT \STATE E-step: Compute $\mathbb{E}[\mathbf{z}|\mathbf{x}]$ \STATE M-step: Update $\mathbf{W}$, $\sigma^2$ \UNTIL{convergence} \end{algorithmic} Advantages:
  • Handle missing data naturally
  • Mixture of PPCAs for clustering
  • Bayesian inference possible
  • Uncertainty quantification
Relationship to Standard PCA:
  • As $\sigma^2 \to 0$: Recovers standard PCA
  • Same subspace in limit
  • Additional noise model
\begin{exampleblock}{Extensions}
  • Factor Analysis: Non-isotropic noise
  • Variational PCA: Variational inference
  • Sparse PPCA: Sparse loadings
\end{exampleblock}
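scikit-learn's `PCA` actually fits this model: its `noise_variance_` attribute is the maximum-likelihood $\sigma^2$ (the mean of the discarded eigenvalues), and `score` returns the average log-likelihood. A sketch on data generated from the PPCA model (the loading matrix and $\sigma = 0.5$ are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
W = rng.normal(size=(10, 2))                       # hypothetical loading matrix
Zl = rng.normal(size=(500, 2))                     # latent variables z ~ N(0, I)
X = Zl @ W.T + 0.5 * rng.normal(size=(500, 10))    # x = W z + noise, sigma^2 = 0.25

ppca = PCA(n_components=2).fit(X)
print(ppca.noise_variance_)    # ML estimate of sigma^2, close to 0.25
print(ppca.score(X))           # average log-likelihood under the fitted model
```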

Robust PCA

Motivation

Standard PCA is sensitive to:
  • Outliers: Anomalous data points
  • Corrupted entries: Missing or noisy values
  • Sparse errors: Localized corruption
Problem Formulation: Decompose $\mathbf{X}$ into: $$\mathbf{X} = \mathbf{L} + \mathbf{S}$$ where:
  • $\mathbf{L}$: Low-rank (clean data)
  • $\mathbf{S}$: Sparse (outliers/corruption)
Optimization: $$\min_{\mathbf{L},\mathbf{S}} \|\mathbf{L}\|_* + \lambda\|\mathbf{S}\|_1$$ $$\text{subject to } \mathbf{X} = \mathbf{L} + \mathbf{S}$$
  • $\|\mathbf{L}\|_*$: Nuclear norm (rank proxy)
  • $\|\mathbf{S}\|_1$: L1 norm (sparsity)
Applications: \begin{exampleblock}{Video Surveillance}
  • $\mathbf{L}$: Static background
  • $\mathbf{S}$: Moving objects
  • Separates foreground/background
\end{exampleblock} \begin{exampleblock}{Face Recognition}
  • $\mathbf{L}$: Face structure
  • $\mathbf{S}$: Occlusions, shadows
  • Robust to partial occlusion
\end{exampleblock} Solution Methods:
  • Principal Component Pursuit (PCP)
  • Alternating Direction Method (ADMM)
  • Proximal gradient methods
Complexity:
  • $\mathcal{O}(nd^2k)$ per iteration
  • Slower than standard PCA
  • Worth it for corrupted data

Scree Plot: Visualizing Variance

Definition

A scree plot displays eigenvalues (or explained variance) for each principal component in decreasing order.
How to Read:
  • X-axis: Principal component number
  • Y-axis: Eigenvalue or explained variance
  • Look for "elbow" point
  • Steep drop indicates important PCs
  • Flat tail indicates noise
Elbow Method:
  • Find point where curve bends
  • Keep PCs before elbow
  • Remaining PCs contribute little
  • Subjective but practical

Rule of Thumb

Keep components before the "elbow" where variance drops sharply.

[Figure: ../figures/scree_plot.png]

Example Interpretation:
  • PC1: Large eigenvalue (most variance)
  • PC2-3: Moderate decrease
  • PC4+: Flat (noise level)
  • Decision: Keep 3-4 components
\begin{exampleblock}{Kaiser Criterion} Alternative rule: Keep components with eigenvalue $> 1$ (for correlation matrix). \end{exampleblock}

Explained Variance Ratio

Individual Explained Variance: For component $i$: $$\text{EVR}_i = \frac{\lambda_i}{\sum_{j=1}^{d}\lambda_j}$$ Cumulative Explained Variance: For first $k$ components: $$\text{CEV}_k = \frac{\sum_{i=1}^{k}\lambda_i}{\sum_{j=1}^{d}\lambda_j}$$ Selection Criteria:
  • Fixed threshold: $\text{CEV}_k \geq 0.95$ (95\%)
  • Fixed number: $k = 10$ components
  • Elbow method: Visual inspection
  • Cross-validation: Best downstream performance

Common Choice

Keep components explaining 90-95\% of total variance.
Example Calculation: Eigenvalues: $[5.2, 2.1, 0.8, 0.3, 0.1]$ Total: $\sum \lambda_i = 8.5$ \begin{tabular}{|c|c|c|} \hline PC & EVR & CEV \\ \hline 1 & 61.2\% & 61.2\% \\ 2 & 24.7\% & 85.9\% \\ 3 & 9.4\% & 95.3\% \\ 4 & 3.5\% & 98.8\% \\ 5 & 1.2\% & 100.0\% \\ \hline \end{tabular} Decision:
  • For 85\% variance: Keep 2 PCs
  • For 95\% variance: Keep 3 PCs
  • For 99\% variance: Keep 4 PCs
\begin{exampleblock}{Trade-off} More components: Better reconstruction, less compression \end{exampleblock}
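The table computes directly:

```python
import numpy as np

eigvals = np.array([5.2, 2.1, 0.8, 0.3, 0.1])
evr = eigvals / eigvals.sum()          # individual explained variance ratio
cev = np.cumsum(evr)                   # cumulative explained variance

print(np.round(evr * 100, 1))          # [61.2 24.7  9.4  3.5  1.2]
print(np.round(cev * 100, 1))          # [ 61.2  85.9  95.3  98.8 100. ]

k = int(np.searchsorted(cev, 0.95) + 1)   # smallest k with CEV >= 95%
print(k)                               # 3
```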

Reconstruction Error

Definition

Reconstruction error measures information lost when using only $k < d$ principal components.
Formula: $$E_k = \frac{1}{n}\|\mathbf{X} - \hat{\mathbf{X}}_k\|_F^2 = \sum_{i=k+1}^{d}\lambda_i$$ where $\hat{\mathbf{X}}_k = \mathbf{X}\mathbf{V}_k\mathbf{V}_k^T$ (centered $\mathbf{X}$) Normalized Error: $$\text{Relative Error} = \frac{E_k}{\sum_{i=1}^{d}\lambda_i} = \frac{\sum_{i=k+1}^{d}\lambda_i}{\sum_{i=1}^{d}\lambda_i}$$ Properties:
  • Monotonically decreasing in $k$
  • Zero when $k = d$ (perfect reconstruction)
  • Directly related to discarded eigenvalues

[Figure: ../figures/reconstruction_error.png]

Selection Strategy: Set error tolerance $\epsilon$: $$E_k \leq \epsilon \cdot \|\mathbf{X}\|_F^2$$ Choose smallest $k$ satisfying constraint. Example:
  • Total variance: 100
  • Target: 5\% error tolerance
  • Need: CEV $\geq$ 95\%
  • $\sum_{i=1}^{k}\lambda_i \geq 95$
\begin{exampleblock}{Application} Image compression: Balance file size vs quality using reconstruction error. \end{exampleblock}

Cross-Validation for Component Selection

Task-Specific Selection: When PCA is used for downstream task:
  1. Apply PCA with different $k$ values
  2. Train model on reduced data
  3. Evaluate on validation set
  4. Choose $k$ with best performance
Procedure: \begin{algorithmic}[1] \STATE Split data: Train/Validation \FOR{$k = 1$ to $k_{max}$} \STATE Fit PCA on training set \STATE Transform train \& validation \STATE Train classifier/regressor \STATE Evaluate performance \ENDFOR \STATE Select $k$ with best validation score \end{algorithmic}

Important

Fit PCA only on training data to avoid data leakage!
Example: Classification Task \begin{tabular}{|c|c|c|} \hline k & Train Acc & Val Acc \\ \hline 5 & 0.75 & 0.72 \\ 10 & 0.82 & 0.79 \\ 20 & 0.89 & 0.85 \\ 50 & 0.95 & 0.84 \\ 100 & 0.98 & 0.82 \\ \hline \end{tabular} Analysis:
  • Best validation: $k = 20$
  • $k = 50, 100$: Overfitting
  • Task-specific optimum
Considerations:
  • Variance explained vs task performance
  • Computational cost
  • Interpretability needs
  • Domain constraints
\begin{exampleblock}{Best Practice} Use cross-validation when PCA is preprocessing step for supervised learning. \end{exampleblock}
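The leakage-safe version uses a `Pipeline`, so PCA is refit inside every CV fold (a sketch on the digits dataset; the grid values are illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

pipe = Pipeline([
    ("pca", PCA()),                                 # refit per fold: no leakage
    ("clf", LogisticRegression(max_iter=2000)),
])
grid = {"pca__n_components": [5, 10, 20, 40]}
search = GridSearchCV(pipe, grid, cv=3).fit(X, y)

print(search.best_params_)      # task-specific optimum, not max variance
print(round(search.best_score_, 3))
```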

Component Correlation Analysis

Goal: Understand relationships between original features and principal components. Loading Matrix: $$\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, …, \mathbf{v}_d]$$ Interpretation:
  • $v_{ij}$: Contribution of feature $j$ to PC $i$
  • Large $|v_{ij}|$: Feature $j$ important for PC $i$
  • Sign indicates direction of contribution
  • For standardized data, $v_{ij}\sqrt{\lambda_i}$ gives the correlation between feature $j$ and PC $i$
Biplot:
  • Visualize data and loadings together
  • Arrow length: Variable importance
  • Arrow direction: Correlation with PCs
  • Angle between arrows: Variable correlation

Interpretation

Nearly parallel arrows indicate highly correlated variables.

[Figure: ../figures/component_correlation.png]

Example Loading Matrix: \scriptsize \begin{tabular}{|l|c|c|c|} \hline Feature & PC1 & PC2 & PC3 \\ \hline Height & 0.82 & 0.15 & -0.05 \\ Weight & 0.79 & 0.20 & 0.08 \\ Age & 0.12 & 0.91 & 0.10 \\ Income & -0.05 & 0.08 & 0.95 \\ \hline \end{tabular} Interpretation:
  • PC1: "Body size" (height, weight)
  • PC2: "Age" dimension
  • PC3: "Income" dimension
  • Clear separation of concepts

Application: Image Compression

Problem: Images contain redundant information. Can we store them more efficiently? PCA Approach:
  1. Treat each image as vector (flatten pixels)
  2. Build dataset: $\mathbf{X} \in \mathbb{R}^{n \times d}$
  3. Apply PCA: Find eigenvectors
  4. Project: $\mathbf{Z} = \mathbf{X}\mathbf{V}_k$
  5. Store: $\mathbf{Z}$ and $\mathbf{V}_k$ instead of $\mathbf{X}$
Compression Ratio: Original: $n \times d$ values Compressed: $n \times k + k \times d$ values Ratio: $\frac{nd}{nk + kd} \approx \frac{d}{k}$ (for large $n$) Example:
  • $d = 10,000$ pixels
  • $k = 100$ components
  • Compression: $100:1$

[Figure: ../figures/image_compression.png]

Quality vs Compression: \scriptsize \begin{tabular}{|c|c|c|} \hline PCs & Compression & Quality \\ \hline 10 & 95\% & Poor \\ 50 & 75\% & Fair \\ 100 & 50\% & Good \\ 200 & 0\% & Excellent \\ \hline \end{tabular} \begin{exampleblock}{Trade-off} Balance between file size and visual quality. Typical choice: 80-90\% variance retained. \end{exampleblock}
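A sketch on scikit-learn's 8$\times$8 digits images ($d = 64$; keeping $k = 16$ components is an arbitrary choice):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)        # 1797 images, d = 64 pixels each
n, d, k = X.shape[0], X.shape[1], 16

pca = PCA(n_components=k).fit(X)
Z = pca.transform(X)                       # store Z (n x k) and V_k (k x d)

ratio = (n * d) / (n * k + k * d)          # ~ d/k = 4 for large n
print(round(ratio, 2))

X_hat = pca.inverse_transform(Z)           # decompress
retained = pca.explained_variance_ratio_.sum()
print(round(retained, 3))                  # fraction of variance kept
```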

Application: Face Recognition (Eigenfaces)

Eigenfaces Method:
  1. Collect face images: $n$ images, $d$ pixels each
  2. Apply PCA: Find "eigenfaces" (principal components)
  3. Each eigenface captures facial variation
  4. Represent faces in eigenface space
  5. Recognition: Nearest neighbor in PC space
Advantages:
  • Dimensionality reduction: $10,000 \to 100$
  • Fast matching in low-dimensional space
  • Captures important facial features
  • Robust to minor variations
Process:
  • Training: Build eigenface basis
  • Enrollment: Project new face to PC space
  • Recognition: Compare with database
  • Decision: Threshold distance

[Figure: ../figures/eigenfaces_application.png]

Typical Eigenfaces:
  • PC1: Average lighting
  • PC2: Left-right contrast
  • PC3: Facial expression
  • PC4-10: Detailed features
  • PC11+: Fine details/noise
\begin{exampleblock}{Historical Note} Eigenfaces pioneered in 1991 by Turk and Pentland. Still used as baseline method. \end{exampleblock}

Limitations

Sensitive to: lighting, pose, facial expression. Modern methods use deep learning.

Application: Noise Filtering

Principle: Noise typically has:
  • Low variance
  • High frequency
  • Random direction
  • Captured by small eigenvalues
Signal has:
  • High variance
  • Low frequency
  • Structured patterns
  • Captured by large eigenvalues
Denoising Procedure:
  1. Apply PCA to noisy data
  2. Keep only top $k$ components (signal)
  3. Discard remaining components (noise)
  4. Reconstruct: $\hat{\mathbf{X}} = \mathbf{X}\mathbf{V}_k\mathbf{V}_k^T$
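The procedure above can be sketched on synthetic signals: each row is a noisy copy of a smooth two-component waveform, so the signal genuinely lives in $k = 2$ dimensions (amplitudes and noise level are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
t = np.linspace(0, 1, 50)
# Clean signals: random mixtures of a sine and a cosine (rank-2 structure).
clean = (np.outer(rng.normal(size=300), np.sin(2 * np.pi * t))
         + np.outer(rng.normal(size=300), np.cos(2 * np.pi * t)))
noisy = clean + 0.3 * rng.normal(size=clean.shape)

pca = PCA(n_components=2).fit(noisy)            # keep the 2 signal components
denoised = pca.inverse_transform(pca.transform(noisy))

mse_before = np.mean((noisy - clean) ** 2)      # ~ noise variance 0.09
mse_after = np.mean((denoised - clean) ** 2)    # much smaller
print(mse_before, mse_after)
```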

Key Insight

PCA acts as low-pass filter, removing high-frequency noise.

[Figure: ../figures/noise_filtering.png]

Example: Signal Denoising Original signal + Gaussian noise
  • Noise variance: 0.1
  • PCA: Keep 5 components
  • SNR improvement: 15 dB
Applications:
  • Image denoising
  • Audio signal processing
  • Sensor data cleaning
  • Medical imaging
  • Financial time series
\begin{exampleblock}{Challenge} Choosing $k$: Too small loses signal, too large keeps noise. Use cross-validation or domain knowledge. \end{exampleblock}

Application: Data Visualization

Challenge: Cannot visualize data in $d > 3$ dimensions. PCA Solution: Project to 2D or 3D for visualization:
  1. Apply PCA: Get principal components
  2. Keep PC1 and PC2 (or PC1, PC2, PC3)
  3. Plot data in reduced space
  4. Preserves maximum variance
Interpretation:
  • PC1 (x-axis): Direction of most variance
  • PC2 (y-axis): Second most variance
  • Distances approximately preserved
  • Clusters may emerge
Use Cases:
  • Exploratory data analysis
  • Cluster visualization
  • Outlier detection
  • Pattern discovery
  • Presentation to stakeholders

[Figure: ../figures/data_visualization_application.png]

\begin{exampleblock}{Example: Iris Dataset}
  • Original: 4 dimensions
  • PCA: 2D projection
  • Explained variance: 95.8\%
  • Clear species separation visible
\end{exampleblock} Comparison with Other Methods: \scriptsize \begin{tabular}{|l|c|c|} \hline Method & Linear & Global \\ \hline PCA & Yes & Yes \\ t-SNE & No & No \\ UMAP & No & Local \\ MDS & Yes/No & Yes \\ \hline \end{tabular}
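The Iris projection above takes only a few lines with scikit-learn; this sketch standardizes first and then keeps PC1 and PC2 (the plotting call itself is elided):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)            # 150 samples, 4 features
X_std = StandardScaler().fit_transform(X)

# Project onto the first two principal components for a 2D scatter plot.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)              # (150, 2)

# Cumulative explained variance of PC1 + PC2 is roughly 0.96 here.
explained = pca.explained_variance_ratio_.sum()
```

A scatter plot of `X_2d[:, 0]` against `X_2d[:, 1]`, colored by `y`, shows the species separation described in the example.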

Limitation

PCA preserves global structure but may miss non-linear patterns.

Application: Exploratory Data Analysis

Digits Dataset Visualization: High-dimensional handwritten digit images (64 dimensions) projected to 2D. Insights from PCA:
  • Digit clusters visible in PC space
  • Similar digits closer together
  • Outliers easily identified
  • Confusion patterns apparent
Practical Workflow:
  1. Load dataset
  2. Standardize features
  3. Apply PCA (keep 2-3 PCs)
  4. Visualize with scatter plot
  5. Color by labels (if available)
  6. Identify patterns/outliers
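The workflow above, applied to the 64-dimensional digits dataset, looks like this in scikit-learn (the scatter-plot step is left as a comment):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Steps 1-2: load and standardize the 8x8 handwritten digit images.
X, y = load_digits(return_X_y=True)          # (1797, 64)
X_std = StandardScaler().fit_transform(X)

# Step 3: keep the first two principal components.
Z = PCA(n_components=2).fit_transform(X_std)  # (1797, 2)

# Steps 4-6: scatter-plot Z[:, 0] vs Z[:, 1], colored by the label y,
# then inspect the plot for clusters and outliers.
```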

EDA Benefit

Quick visual check before applying complex ML models.

[Figure: ../figures/digits_visualization.png]

Observations:
  • Digits 0 and 1 well-separated
  • Digits 3, 5, 8 overlap slightly
  • Some outliers (miswritten digits)
  • PC1 and PC2 capture 30\% variance
\begin{exampleblock}{Feature Engineering} PCA projections can be used as features for downstream classification:
  • Reduce 64D to 20D
  • Train classifier on PC scores
  • Faster training
  • Often better generalization
\end{exampleblock}
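The 64D-to-20D idea in the example block can be sketched as a pipeline; the exact accuracy depends on the split and classifier, so treat the numbers as indicative only:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)          # 1797 samples, 64 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Reduce the 64 pixel features to 20 PC scores, then train a classifier
# on the scores. The pipeline fits PCA on the training fold only.
clf = make_pipeline(StandardScaler(),
                    PCA(n_components=20),
                    LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

Wrapping PCA inside the pipeline also guarantees the test set is transformed with the training-set projection, which matters for the leakage warning below.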

Application: Feature Engineering

PCA as Preprocessing: Use PCA-transformed features for ML models. Benefits:
  • Decorrelation: Remove multicollinearity
  • Dimensionality: Reduce feature count
  • Noise reduction: Filter out noisy components
  • Speed: Faster model training
  • Regularization: Implicit regularization effect
Pipeline: \begin{algorithmic}[1] \STATE Train set: Fit PCA \STATE Train set: Transform with PCA \STATE Test set: Transform with same PCA \STATE Train classifier on PC scores \STATE Evaluate on test PC scores \end{algorithmic}

Warning

Never fit PCA on test data! This causes data leakage.
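Spelled out without a pipeline, the leakage-free pattern is: fit on the training split, then reuse that same fitted transformation for both splits. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
X_test = rng.normal(size=(20, 10))

# Correct: fit PCA on the training set ONLY...
pca = PCA(n_components=3).fit(X_train)

# ...then apply the same fitted mean and components to both splits.
Z_train = pca.transform(X_train)     # (100, 3)
Z_test = pca.transform(X_test)       # (20, 3)
```

Calling `fit` (or `fit_transform`) on the test set, or on the concatenated data, would leak test-set statistics into the learned components.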

[Figure: ../figures/feature_engineering_application.png]

Example Results: \scriptsize \begin{tabular}{|l|c|c|} \hline Features & Accuracy & Time \\ \hline Original (1000D) & 0.85 & 120s \\ PCA (100D) & 0.87 & 15s \\ PCA (50D) & 0.86 & 8s \\ PCA (10D) & 0.79 & 2s \\ \hline \end{tabular} Observations:
  • Sweet spot: 50-100 components
  • Improved accuracy with PCA
  • Much faster training
  • Accuracy drops noticeably with too few PCs
\begin{exampleblock}{Use Case} High-dimensional datasets where features are correlated (genomics, text, images). \end{exampleblock}

Best Practices: Standardization

Critical Decision

Should you standardize features before PCA?
Standardization: $$z_j = \frac{x_j - \mu_j}{\sigma_j}$$ When to Standardize:
  • Features have different units
  • Features have different scales
  • Want equal weight for all features
  • Domain knowledge suggests equal importance
When NOT to Standardize:
  • Features already on same scale
  • Scale carries information
  • Domain knowledge: some features more important
  • Interpretation requires original scale

[Figure: ../figures/standardization_importance.png]

Example Impact: Dataset: Height (cm), Weight (kg), Age (years) Without standardization:
  • Variance is tied to the measurement units
  • Recording height in mm would inflate its variance $100\times$
  • PC1 dominated by whichever feature has the largest numeric spread
With standardization:
  • Every feature has unit variance
  • All features contribute on an equal footing
  • More balanced PCs
\begin{exampleblock}{Recommendation} Default: Standardize unless you have good reason not to. \end{exampleblock}
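The effect of standardization on loadings can be seen on synthetic data. This sketch records the same height in millimetres (large variance) next to weight in kilograms (small variance); the numbers and variable names are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic, correlated height/weight measurements.
height_cm = rng.normal(170, 8, 500)
weight_kg = 0.5 * height_cm + rng.normal(0, 5, 500)
X = np.column_stack([height_cm * 10, weight_kg])   # height stored in mm

# Without standardization, PC1 is dominated by the large-variance column.
raw_pc1 = np.abs(PCA(n_components=1).fit(X).components_[0])

# After standardization both features share PC1: a 2x2 correlation matrix
# has equal diagonal entries, so its top eigenvector is [1, 1]/sqrt(2).
Z = StandardScaler().fit_transform(X)
std_pc1 = np.abs(PCA(n_components=1).fit(Z).components_[0])
```

Here `raw_pc1` loads almost entirely on the mm-scaled height, while `std_pc1` has both loadings near $0.707$.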

Computational Complexity

Standard PCA Complexity: \begin{tabular}{|l|c|} \hline Operation & Complexity \\ \hline Centering & $\mathcal{O}(nd)$ \\ Covariance & $\mathcal{O}(nd^2)$ \\ Eigendecomp & $\mathcal{O}(d^3)$ \\ SVD & $\mathcal{O}(\min(nd^2, n^2d))$ \\ \hline Total & $\mathcal{O}(nd^2 + d^3)$ \\ \hline \end{tabular} Memory Requirements:
  • Data: $\mathcal{O}(nd)$
  • Covariance: $\mathcal{O}(d^2)$
  • Eigenvectors: $\mathcal{O}(d^2)$
  • Total: $\mathcal{O}(nd + d^2)$
Scalability Issues:
  • Large $d$: Covariance matrix huge
  • Large $n$: Memory for data matrix
  • Both large: Computational bottleneck

[Figure: ../figures/computational_complexity.png]

Solutions for Large-Scale: \begin{exampleblock}{Incremental PCA}
  • Process mini-batches of size $b$
  • Memory: $\mathcal{O}(bd + d^2)$
  • Time: $\mathcal{O}(ndk)$
\end{exampleblock} \begin{exampleblock}{Randomized PCA}
  • Approximate top-$k$ PCs
  • Time: $\mathcal{O}(ndk)$
  • Much faster for $k \ll d$
\end{exampleblock} \begin{exampleblock}{Sparse PCA}
  • Exploit data sparsity
  • Reduce effective dimensionality
\end{exampleblock}

Rule of Thumb

Standard PCA: $n, d < 10,000$\\ Incremental: $n > 100,000$\\ Randomized: $d > 10,000$, $k \ll d$
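Both large-scale variants are available in scikit-learn; a short sketch on synthetic data (shapes chosen for illustration only):

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 200))

# Randomized PCA: approximate top-k components in O(ndk) instead of O(nd^2).
rpca = PCA(n_components=10, svd_solver="randomized", random_state=0).fit(X)

# Incremental PCA: stream the data in mini-batches to bound memory use.
ipca = IncrementalPCA(n_components=10)
for batch in np.array_split(X, 4):      # 4 mini-batches of 500 rows each
    ipca.partial_fit(batch)

# Both yield a (10, 200) component matrix.
```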

Common Pitfalls and How to Avoid Them

1. Data Leakage:

Mistake

Fitting PCA on entire dataset including test set.
Correct approach:
  • Fit PCA only on training set
  • Transform train and test separately
  • Use same transformation for both
2. Forgetting to Center:

Mistake

Applying PCA without centering data.
Why it matters:
  • PCA assumes zero mean
  • Results will be incorrect
  • Always center first!
3. Wrong Scaling Choice:

Mistake

Not standardizing when features have different scales.
Impact:
  • PCs dominated by large-scale features
  • Misleading variance explanation
4. Interpreting PCs as Features:

Mistake

Assuming PCs have same meaning as original features.
Reality:
  • PCs are linear combinations
  • May not have intuitive interpretation
  • Use loadings for understanding
5. Assuming Linearity:

Mistake

Applying PCA to manifold data with non-linear structure.
Solution:
  • Check for non-linearity first
  • Consider Kernel PCA or manifold methods
6. Ignoring Outliers:

Mistake

Not handling outliers before PCA.
Impact:
  • PCs skewed by outliers
  • Variance misrepresented
  • Use robust PCA if needed

Interpreting PCA Results

What to Report:
  1. Number of components: $k$ chosen
  2. Explained variance: Per component and cumulative
  3. Scree plot: Visualize eigenvalue decay
  4. Loading matrix: Top features per PC
  5. Biplot: If $d$ is small
  6. Reconstruction error: If applicable
Interpreting Loadings:
  • Examine largest magnitude loadings
  • Group features by sign
  • Name PC based on dominant features
  • Example: "Size component", "Age component"
\begin{exampleblock}{Example} PC1 with high loadings on [height, weight, BMI]:
  • Interpretation: "Body size" component
  • Positive values: Larger individuals
  • Negative values: Smaller individuals
\end{exampleblock}
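Extracting and ranking loadings takes a few lines; this sketch builds hypothetical data in which height, weight, and BMI share a latent "size" factor, then names PC1 from its largest-magnitude loadings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical measurements driven by a common "body size" factor.
size = rng.normal(size=500)
X = np.column_stack([170 + 8 * size + rng.normal(0, 2, 500),    # height
                     70 + 10 * size + rng.normal(0, 3, 500),    # weight
                     24 + 3 * size + rng.normal(0, 1, 500),     # BMI
                     rng.normal(40, 10, 500)])                  # age (unrelated)
features = ["height", "weight", "BMI", "age"]

pca = PCA().fit(StandardScaler().fit_transform(X))
loadings = pca.components_[0]                   # PC1 loadings

# Rank features by |loading| and name the component after the dominant ones.
order = np.argsort(-np.abs(loadings))
top = [features[i] for i in order[:3]]
```

Here `top` recovers height, weight, and BMI, supporting the "body size" interpretation, while age has a near-zero PC1 loading.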
Statistical Significance: Bootstrap approach:
  • Resample data multiple times
  • Compute PCA on each sample
  • Check stability of components
  • Report confidence intervals
Permutation test:
  • Randomly permute features
  • Compare eigenvalues to null distribution
  • Test if variance is significant
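The bootstrap stability check above can be sketched as follows: refit PCA on resampled datasets and measure how well each refitted PC1 aligns with the original (absolute cosine similarity absorbs arbitrary sign flips). The anisotropic toy data is an assumption for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy 2D data with a clearly dominant direction (std 3 vs std 1).
X = rng.normal(size=(300, 2)) @ np.diag([3.0, 1.0])

base = PCA(n_components=1).fit(X).components_[0]

# Bootstrap: resample rows with replacement, refit, compare directions.
sims = []
for _ in range(50):
    idx = rng.integers(0, len(X), len(X))
    pc1 = PCA(n_components=1).fit(X[idx]).components_[0]
    sims.append(abs(base @ pc1))        # |cos| between unit vectors

stability = float(np.mean(sims))        # near 1.0 means a stable component
```

When the leading eigenvalues are well separated, `stability` stays close to 1; near-degenerate eigenvalues make the components rotate freely across resamples and push it down.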
Practical Checklist:
  • $\checkmark$ Data centered/standardized?
  • $\checkmark$ Scree plot shows elbow?
  • $\checkmark$ Enough variance explained?
  • $\checkmark$ PCs interpretable?
  • $\checkmark$ No data leakage?
  • $\checkmark$ Outliers addressed?
  • $\checkmark$ Results validated?

Documentation

Always document preprocessing choices and justification for $k$.

Key Takeaways

Core Concepts:
  • PCA: Linear dimensionality reduction via variance maximization
  • Principal components: Orthogonal directions of maximum variance
  • Eigendecomposition: Mathematical foundation
  • SVD: Practical computation method
  • Variance explained: Quantifies information retention
Key Steps:
  1. Center (and optionally standardize) data
  2. Compute covariance or apply SVD
  3. Extract eigenvectors/singular vectors
  4. Project data onto top-$k$ components
  5. Evaluate and interpret results
Variants:
  • Kernel PCA: Non-linear extension
  • Sparse PCA: Interpretable loadings
  • Incremental PCA: Large-scale data
  • Robust PCA: Handle outliers
Applications:
  • Data visualization and exploration
  • Image compression and processing
  • Face recognition (eigenfaces)
  • Noise filtering and denoising
  • Feature engineering for ML
  • Dimensionality reduction
Best Practices:
  • Always center data, standardize if needed
  • Use scree plot and explained variance
  • Avoid data leakage in train/test split
  • Check for outliers and non-linearity
  • Validate component selection
  • Document all preprocessing choices
Limitations:
  • Assumes linear relationships
  • Sensitive to outliers and scaling
  • Loses interpretability
  • May not preserve non-linear structure

Further Reading and Resources

Classic Papers:
  • Pearson (1901): "On lines and planes of closest fit"
  • Hotelling (1933): "Analysis of complex statistical variables"
  • Turk \& Pentland (1991): "Eigenfaces for recognition"
  • Jolliffe (2002): "Principal Component Analysis" (book)
Advanced Topics:
  • Independent Component Analysis (ICA)
  • Non-negative Matrix Factorization (NMF)
  • t-SNE and UMAP for visualization
  • Autoencoders for non-linear PCA
  • Gaussian Process Latent Variable Models
Software Libraries:
  • \texttt{scikit-learn}: PCA, KernelPCA, IncrementalPCA
  • \texttt{numpy/scipy}: Low-level linear algebra
  • \texttt{statsmodels}: Statistical PCA
Related Methods:
  • Linear Discriminant Analysis (LDA): Supervised dimensionality reduction
  • Factor Analysis: Assumes latent variables
  • Canonical Correlation Analysis: Multi-view learning
  • Isomap: Geodesic distances
  • Locally Linear Embedding: Manifold learning
When to Use Alternatives:
  • Non-linear structure: Kernel PCA, manifold methods
  • Labeled data: LDA, supervised methods
  • Visualization only: t-SNE, UMAP
  • Interpretability: Sparse methods, NMF
  • Very large data: Random projections, sketching
\begin{exampleblock}{Next Steps}
  • Practice on real datasets
  • Compare with other methods
  • Explore kernel and sparse variants
  • Study deep learning autoencoders
\end{exampleblock}
Thank you for your attention!

End of Module 07

Principal Component Analysis

Questions?