Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique used in machine learning and statistics. It transforms high-dimensional data into a lower-dimensional form while retaining as much variance (information) as possible.

PCA is particularly useful when:

  • You have many features and want to reduce computational complexity

  • You need to visualize high-dimensional data in 2D or 3D

  • Features are correlated and you want to eliminate redundancy

  • You want to reduce noise and improve model performance

In this chapter, we will explore the mathematical foundations of PCA, connecting to concepts from linear algebra and calculus, and demonstrate how to apply PCA in practice.

Why Dimensionality Reduction?

In many real-world datasets we encounter the “curse of dimensionality”. As the number of features increases:

  • Computational cost grows rapidly, and the number of samples needed to cover the feature space grows exponentially

  • Visualization becomes impossible beyond 3 dimensions

  • Model complexity increases, leading to overfitting

  • Feature correlation may introduce redundancy

PCA addresses these challenges by finding a new set of uncorrelated variables (principal components) that capture the maximum variance in the data.

Mathematical Foundations of PCA

PCA relies on concepts from linear algebra and calculus. Understanding the mathematical foundation will help you appreciate how PCA works and when to use it.

Variance and Covariance

Variance measures how much a single variable spreads out from its mean:

$$\text{Var}(X) = \frac{1}{n-1} \sum_{i=1}^{n}(x_i - \bar{x})^2$$

Covariance measures how two variables change together:

$$\text{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

Where:

  • $x_i$ and $y_i$ are individual data points

  • $\bar{x}$ and $\bar{y}$ are the means of $X$ and $Y$

  • $n$ is the number of observations

The covariance matrix $\Sigma$ for a dataset with $p$ features is a $p \times p$ symmetric matrix:

$$\Sigma = \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_p) \\ \text{Cov}(X_2, X_1) & \text{Var}(X_2) & \cdots & \text{Cov}(X_2, X_p) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_p, X_1) & \text{Cov}(X_p, X_2) & \cdots & \text{Var}(X_p) \end{bmatrix}$$
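With NumPy, this matrix can be computed directly. The sketch below uses a small made-up dataset and checks the hand-computed result against `np.cov`:

```python
import numpy as np

# Small synthetic dataset: n = 5 samples, p = 3 features (values are illustrative)
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 1.5],
])

# Center each feature, then apply the (n - 1)-denominator formula
X_centered = X - X.mean(axis=0)
Sigma = X_centered.T @ X_centered / (X.shape[0] - 1)

# np.cov uses the same unbiased estimator (rowvar=False: columns are features)
assert np.allclose(Sigma, np.cov(X, rowvar=False))

# The covariance matrix is symmetric, with the variances on its diagonal
assert np.allclose(Sigma, Sigma.T)
print(Sigma)
```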

Eigenvalues and Eigenvectors

PCA finds the directions of maximum variance by computing the eigenvalues and eigenvectors of the covariance matrix.

For a square matrix $A$, a vector $\mathbf{v}$ is an eigenvector and $\lambda$ is its corresponding eigenvalue if:

$$A\mathbf{v} = \lambda\mathbf{v}$$

This means that when matrix $A$ is applied to eigenvector $\mathbf{v}$, the vector is only scaled by $\lambda$; its direction does not change.

To find eigenvalues, we solve the characteristic equation:

$$\det(A - \lambda I) = 0$$

Where:

  • $\det$ is the determinant

  • $I$ is the identity matrix

  • $\lambda$ are the eigenvalues that satisfy this equation

In the context of PCA:

  • Eigenvectors of the covariance matrix represent the directions (principal components) of maximum variance

  • Eigenvalues represent the amount of variance explained by each principal component

  • Larger eigenvalues correspond to more important principal components
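NumPy's symmetric eigensolver gives both quantities at once. A sketch on a hypothetical 2×2 covariance matrix (the entries are made up for illustration):

```python
import numpy as np

# A hypothetical 2x2 covariance matrix (symmetric, positive semi-definite)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# eigh is designed for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# Sort in descending order so the first component explains the most variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Verify the defining property: Sigma v = lambda v for each pair
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(Sigma @ v, lam * v)

# The eigenvalues sum to the total variance (the trace of Sigma)
assert np.isclose(eigenvalues.sum(), np.trace(Sigma))
print(eigenvalues)
```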

The PCA Algorithm

PCA follows these steps to transform the data:

Step 1: Standardize the Data

First, we center the data by subtracting the mean from each feature:

$$X_{\text{centered}} = X - \mu$$

Often, we also scale the data to unit variance (z-score normalization):

$$X_{\text{standardized}} = \frac{X - \mu}{\sigma}$$

Where $\mu$ is the mean vector and $\sigma$ is the standard deviation vector.

Why standardize? Features with larger scales would dominate the principal components. Standardization ensures each feature contributes equally.
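The two preprocessing options can be sketched in NumPy as follows (the sample values are purely illustrative, and the choice of the sample standard deviation with `ddof=1` is one common convention):

```python
import numpy as np

# Illustrative data: e.g. heights (cm) and weights (kg) on very different scales
X = np.array([[180.0, 70.0],
              [165.0, 55.0],
              [172.0, 62.0]])

mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=1)  # sample standard deviation

X_centered = X - mu                 # each feature now has mean 0
X_standardized = (X - mu) / sigma   # each feature also has unit variance

assert np.allclose(X_centered.mean(axis=0), 0.0)
assert np.allclose(X_standardized.std(axis=0, ddof=1), 1.0)
```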

Step 2: Compute the Covariance Matrix

Calculate the covariance matrix of the standardized data:

$$\Sigma = \frac{1}{n-1} X^T X$$

Where $X$ is the centered/standardized data matrix ($n \times p$), with $n$ samples and $p$ features.

Step 3: Compute Eigenvalues and Eigenvectors

Solve the eigenvalue problem for the covariance matrix:

$$\Sigma \mathbf{v}_i = \lambda_i \mathbf{v}_i$$

This yields $p$ eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ and corresponding eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_p$.

Step 4: Sort Eigenvalues and Select Principal Components

Sort eigenvalues in descending order:

$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$$

Select the top $k$ eigenvectors corresponding to the $k$ largest eigenvalues. These form the principal components.

Step 5: Transform the Data

Project the original data onto the selected principal components:

$$Z = X W_k$$

Where:

  • $Z$ is the transformed data ($n \times k$)

  • $X$ is the standardized original data ($n \times p$)

  • $W_k$ is the matrix of $k$ selected eigenvectors ($p \times k$)
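Steps 1–5 can be combined into a short from-scratch sketch. This is an illustration on synthetic data, not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # n = 100 samples, p = 4 features (synthetic)
k = 2

# Step 1: standardize (center and scale each feature)
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2: covariance matrix of the standardized data
Sigma = X_std.T @ X_std / (X_std.shape[0] - 1)

# Steps 3-4: eigendecomposition, sorted by descending eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, W = eigenvalues[order], eigenvectors[:, order]

# Step 5: project onto the top-k eigenvectors
W_k = W[:, :k]
Z = X_std @ W_k

assert Z.shape == (100, k)

# The projected coordinates are uncorrelated: their covariance is diagonal,
# with the top-k eigenvalues on the diagonal
Z_cov = np.cov(Z, rowvar=False)
assert np.allclose(Z_cov, np.diag(eigenvalues[:k]))
```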

Variance Explained

An important aspect of PCA is understanding how much information (variance) is retained after dimensionality reduction.

The proportion of variance explained by the $i$-th principal component is:

$$\text{Variance Explained}_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}$$

The cumulative variance explained by the first $k$ components is:

$$\text{Cumulative Variance}_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{p} \lambda_j}$$

Rule of thumb: Select enough principal components to explain at least 80-95% of the total variance.
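Given the sorted eigenvalues, both quantities are one line of NumPy each. A sketch with hypothetical eigenvalues:

```python
import numpy as np

# Hypothetical eigenvalues of a 4-feature covariance matrix, sorted descending
eigenvalues = np.array([2.5, 1.0, 0.4, 0.1])

explained_ratio = eigenvalues / eigenvalues.sum()  # proportion per component
cumulative = np.cumsum(explained_ratio)            # running total

print(explained_ratio)  # [0.625 0.25  0.1   0.025]
print(cumulative)       # [0.625 0.875 0.975 1.   ]

# Smallest k whose cumulative variance reaches 90%
k = int(np.searchsorted(cumulative, 0.90)) + 1
print(k)  # -> 3
```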

Geometric Interpretation

Geometrically, PCA can be understood as:

  1. Finding new axes: The principal components represent new orthogonal (perpendicular) axes in the feature space

  2. Rotating the data: PCA rotates the data so that the maximum variance lies along the first axis (PC1), the second maximum variance along the second axis (PC2), and so on

  3. Projection: The transformed data are the projections of the original data points onto these new axes

This rotation aligns the data with the directions of maximum variance, making it easier to identify patterns and reduce dimensionality.

Connection to Calculus: Optimization Perspective

PCA can also be viewed as an optimization problem. We want to find the direction $\mathbf{w}$ that maximizes the variance of the projected data:

$$\max_{\mathbf{w}} \mathbf{w}^T \Sigma \mathbf{w} \quad \text{subject to} \quad \|\mathbf{w}\| = 1$$

Using Lagrange multipliers from calculus, we form the Lagrangian:

$$L(\mathbf{w}, \lambda) = \mathbf{w}^T \Sigma \mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)$$

Taking the derivative with respect to $\mathbf{w}$ and setting it to zero:

$$\frac{\partial L}{\partial \mathbf{w}} = 2\Sigma\mathbf{w} - 2\lambda\mathbf{w} = 0$$

This simplifies to:

$$\Sigma\mathbf{w} = \lambda\mathbf{w}$$

This is exactly the eigenvalue equation! The solution is the eigenvector corresponding to the largest eigenvalue. Subsequent principal components are found by maximizing variance in directions orthogonal to previous components.
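This result can be checked numerically: no unit vector should achieve a larger projected variance than the top eigenvector. A sketch using a made-up covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical covariance matrix (symmetric PSD by construction: A A^T)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T

# Top eigenvector: the claimed maximizer of w^T Sigma w over unit vectors
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
w_star = eigenvectors[:, -1]   # eigh sorts eigenvalues in ascending order
max_variance = eigenvalues[-1]

# The top eigenvector attains a projected variance equal to its eigenvalue
assert np.isclose(w_star @ Sigma @ w_star, max_variance)

# No random unit direction achieves a larger projected variance
for _ in range(1000):
    w = rng.normal(size=3)
    w /= np.linalg.norm(w)
    assert w @ Sigma @ w <= max_variance + 1e-12
```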

PCA in Python

In Python, we can implement PCA using:

  • Manual implementation: Using NumPy to compute covariance matrix, eigenvalues, and eigenvectors

  • Scikit-learn: Using the PCA class for efficient implementation

Basic Usage with Scikit-learn

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the data (X is an (n_samples, n_features) array)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)  # Reduce to 2 dimensions
X_pca = pca.fit_transform(X_scaled)

# Access explained variance
print("Variance explained by each component:")
print(pca.explained_variance_ratio_)

# Access principal components (eigenvectors)
print("\nPrincipal components:")
print(pca.components_)
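scikit-learn can also choose the number of components for you: passing a float between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))  # synthetic data for illustration

X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain at least 90% of the variance
pca = PCA(n_components=0.90)
X_pca = pca.fit_transform(X_scaled)

print(pca.n_components_)                    # number of components chosen
print(pca.explained_variance_ratio_.sum())  # >= 0.90 by construction

assert pca.explained_variance_ratio_.sum() >= 0.90
assert X_pca.shape == (200, pca.n_components_)
```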

When to Use PCA

PCA is particularly useful when:

  • Reducing computational cost: With hundreds or thousands of features

  • Visualization: Reducing to 2-3 dimensions for plotting

  • Removing multicollinearity: When features are highly correlated

  • Noise reduction: As minor components often represent noise

  • Feature extraction: Creating new features that capture the most important patterns

Important considerations:

  • PCA assumes linear relationships between features

  • The transformed features (principal components) are linear combinations of original features, which may be harder to interpret

  • PCA is sensitive to scaling, so always standardize your data first

  • PCA is an unsupervised technique - it doesn’t consider the target variable

Applications of PCA

PCA is widely used across various domains:

  • Image Processing: Facial recognition, image compression

  • Genomics: Analyzing gene expression data

  • Finance: Portfolio risk analysis, detecting patterns in stock prices

  • Natural Language Processing: Topic modeling, document similarity

  • Computer Vision: Object detection, image reconstruction

Summary

Principal Component Analysis is a powerful dimensionality reduction technique that:

  • Transforms data into a new coordinate system where axes represent directions of maximum variance

  • Uses eigenvectors and eigenvalues of the covariance matrix to identify these directions

  • Can be understood through optimization using Lagrange multipliers from calculus

  • Helps reduce computational complexity, visualize data, and eliminate feature redundancy

  • Requires standardization and careful selection of the number of components to retain

In the next section, we will explore practical implementations of PCA, both from scratch and using Python libraries.

References and Further Reading

Principal Component Analysis was originally developed by Karl Pearson (1901) and later refined by Harold Hotelling (1933). For a comprehensive modern treatment of PCA, see Jolliffe’s textbook (Jolliffe, 2002) and the review article by Jolliffe and Cadima (2016).

The Python implementation used in this book relies on the scikit-learn library Pedregosa et al. (2011), which provides efficient and well-tested implementations of PCA and other machine learning algorithms.

References
  1. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572. 10.1080/14786440109462720
  2. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417–441. 10.1037/h0071325
  3. Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer. 10.1007/b98835
  4. Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 20150202. 10.1098/rsta.2015.0202
  5. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., & others. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.