Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique used in machine learning and statistics. It transforms high-dimensional data into a lower-dimensional form while retaining as much variance (information) as possible.

PCA is particularly useful when:

  • You have many features and want to reduce computational complexity

  • You need to visualize high-dimensional data in 2D or 3D

  • Features are correlated and you want to eliminate redundancy

  • You want to reduce noise and improve model performance

In this chapter, we will explore the mathematical foundations of PCA, connecting to concepts from linear algebra and calculus, and demonstrate how to apply PCA in practice.

Why Dimensionality Reduction?

In many real-world datasets we encounter the “curse of dimensionality”. As the number of features increases:

  • Computational cost grows rapidly, and the number of samples needed to cover the feature space grows exponentially

  • Visualization becomes impossible beyond 3 dimensions

  • Model complexity increases, leading to overfitting

  • Feature correlation may introduce redundancy

PCA addresses these challenges by finding a new set of uncorrelated variables (principal components) that capture the maximum variance in the data.

Mathematical Foundations of PCA

PCA relies on concepts from linear algebra and calculus. Understanding the mathematical foundation will help you appreciate how PCA works and when to use it.

Variance and Covariance

Variance measures how much a single variable spreads out from its mean:

$$\text{Var}(X) = \frac{1}{n-1} \sum_{i=1}^{n}(x_i - \bar{x})^2$$

Covariance measures how two variables change together:

$$\text{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

Where:

  • $x_i$ and $y_i$ are individual data points

  • $\bar{x}$ and $\bar{y}$ are the means of $X$ and $Y$

  • $n$ is the number of observations

The covariance matrix $\Sigma$ for a dataset with $p$ features is a $p \times p$ symmetric matrix:

$$\Sigma = \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1, X_2) & \cdots & \text{Cov}(X_1, X_p) \\ \text{Cov}(X_2, X_1) & \text{Var}(X_2) & \cdots & \text{Cov}(X_2, X_p) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_p, X_1) & \text{Cov}(X_p, X_2) & \cdots & \text{Var}(X_p) \end{bmatrix}$$
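With NumPy, this matrix can be computed directly. The sketch below uses a small made-up dataset and checks the hand-computed result against `np.cov`:

```python
import numpy as np

# Small synthetic dataset: n = 5 samples, p = 3 features (values are illustrative)
X = np.array([
    [2.5, 2.4, 1.2],
    [0.5, 0.7, 0.3],
    [2.2, 2.9, 1.1],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 1.5],
])

# Center each feature, then apply the (n - 1)-denominator formula
X_centered = X - X.mean(axis=0)
Sigma = X_centered.T @ X_centered / (X.shape[0] - 1)

# np.cov uses the same unbiased estimator (rowvar=False: columns are features)
assert np.allclose(Sigma, np.cov(X, rowvar=False))

# The covariance matrix is symmetric, with the variances on its diagonal
assert np.allclose(Sigma, Sigma.T)
print(Sigma)
```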

Eigenvalues and Eigenvectors

PCA finds the directions of maximum variance by computing the eigenvalues and eigenvectors of the covariance matrix.

For a square matrix $A$, a vector $\mathbf{v}$ is an eigenvector and $\lambda$ is its corresponding eigenvalue if:

$$A\mathbf{v} = \lambda\mathbf{v}$$

This means that when matrix $A$ is applied to eigenvector $\mathbf{v}$, the vector is only scaled by $\lambda$; its direction does not change.

To find eigenvalues, we solve the characteristic equation:

$$\det(A - \lambda I) = 0$$

Where:

  • $\det$ is the determinant

  • $I$ is the identity matrix

  • $\lambda$ are the eigenvalues that satisfy this equation

In the context of PCA:

  • Eigenvectors of the covariance matrix represent the directions (principal components) of maximum variance

  • Eigenvalues represent the amount of variance explained by each principal component

  • Larger eigenvalues correspond to more important principal components
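NumPy's symmetric eigensolver gives both quantities at once. A sketch on a hypothetical 2×2 covariance matrix (the entries are made up for illustration):

```python
import numpy as np

# A hypothetical 2x2 covariance matrix (symmetric, positive semi-definite)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# eigh is designed for symmetric matrices; it returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)

# Sort in descending order so the first component explains the most variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Verify the defining property: Sigma v = lambda v for each pair
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(Sigma @ v, lam * v)

# The eigenvalues sum to the total variance (the trace of Sigma)
assert np.isclose(eigenvalues.sum(), np.trace(Sigma))
print(eigenvalues)
```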

The PCA Algorithm

PCA follows these steps to transform the data:

Step 1: Standardize the Data

First, we center the data by subtracting the mean from each feature:

$$X_{\text{centered}} = X - \mu$$

Often, we also scale the data to unit variance (z-score normalization):

$$X_{\text{standardized}} = \frac{X - \mu}{\sigma}$$

Where $\mu$ is the mean vector and $\sigma$ is the standard deviation vector.

Why standardize? Features with larger scales would dominate the principal components. Standardization ensures each feature contributes equally.
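The two preprocessing options can be sketched in NumPy as follows (the sample values are purely illustrative, and the choice of the sample standard deviation with `ddof=1` is one common convention):

```python
import numpy as np

# Illustrative data: e.g. heights (cm) and weights (kg) on very different scales
X = np.array([[180.0, 70.0],
              [165.0, 55.0],
              [172.0, 62.0]])

mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=1)  # sample standard deviation

X_centered = X - mu                 # each feature now has mean 0
X_standardized = (X - mu) / sigma   # each feature also has unit variance

assert np.allclose(X_centered.mean(axis=0), 0.0)
assert np.allclose(X_standardized.std(axis=0, ddof=1), 1.0)
```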

Step 2: Compute the Covariance Matrix

Calculate the covariance matrix of the standardized data:

$$\Sigma = \frac{1}{n-1} X^T X$$

Where $X$ is the centered/standardized data matrix ($n \times p$), with $n$ samples and $p$ features.

Step 3: Compute Eigenvalues and Eigenvectors

Solve the eigenvalue problem for the covariance matrix:

$$\Sigma \mathbf{v}_i = \lambda_i \mathbf{v}_i$$

This yields $p$ eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ and corresponding eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_p$.

Step 4: Sort Eigenvalues and Select Principal Components

Sort eigenvalues in descending order:

$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$$

Select the top $k$ eigenvectors corresponding to the $k$ largest eigenvalues. These form the principal components.

Step 5: Transform the Data

Project the original data onto the selected principal components:

$$Z = X W_k$$

Where:

  • $Z$ is the transformed data ($n \times k$)

  • $X$ is the standardized original data ($n \times p$)

  • $W_k$ is the matrix of $k$ selected eigenvectors ($p \times k$)
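Steps 1–5 can be combined into a short from-scratch sketch. This is an illustration on synthetic data, not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # n = 100 samples, p = 4 features (synthetic)
k = 2

# Step 1: standardize (center and scale each feature)
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2: covariance matrix of the standardized data
Sigma = X_std.T @ X_std / (X_std.shape[0] - 1)

# Steps 3-4: eigendecomposition, sorted by descending eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, W = eigenvalues[order], eigenvectors[:, order]

# Step 5: project onto the top-k eigenvectors
W_k = W[:, :k]
Z = X_std @ W_k

assert Z.shape == (100, k)

# The projected coordinates are uncorrelated: their covariance is diagonal,
# with the top-k eigenvalues on the diagonal
Z_cov = np.cov(Z, rowvar=False)
assert np.allclose(Z_cov, np.diag(eigenvalues[:k]))
```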

Variance Explained

An important aspect of PCA is understanding how much information (variance) is retained after dimensionality reduction.

The proportion of variance explained by the $i$-th principal component is:

$$\text{Variance Explained}_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}$$

The cumulative variance explained by the first $k$ components is:

$$\text{Cumulative Variance}_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{p} \lambda_j}$$

Rule of thumb: Select enough principal components to explain at least 80-95% of the total variance.
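Given the sorted eigenvalues, both quantities are one line of NumPy each. A sketch with hypothetical eigenvalues:

```python
import numpy as np

# Hypothetical eigenvalues of a 4-feature covariance matrix, sorted descending
eigenvalues = np.array([2.5, 1.0, 0.4, 0.1])

explained_ratio = eigenvalues / eigenvalues.sum()  # proportion per component
cumulative = np.cumsum(explained_ratio)            # running total

print(explained_ratio)  # [0.625 0.25  0.1   0.025]
print(cumulative)       # [0.625 0.875 0.975 1.   ]

# Smallest k whose cumulative variance reaches 90%
k = int(np.searchsorted(cumulative, 0.90)) + 1
print(k)  # -> 3
```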

Geometric Interpretation

Geometrically, PCA can be understood as:

  1. Finding new axes: The principal components represent new orthogonal (perpendicular) axes in the feature space

  2. Rotating the data: PCA rotates the data so that the maximum variance lies along the first axis (PC1), the second maximum variance along the second axis (PC2), and so on

  3. Projection: The transformed data are the projections of the original data points onto these new axes

This rotation aligns the data with the directions of maximum variance, making it easier to identify patterns and reduce dimensionality.

Connection to Calculus: Optimization Perspective

PCA can also be viewed as an optimization problem. We want to find the direction $\mathbf{w}$ that maximizes the variance of the projected data:

$$\max_{\mathbf{w}} \mathbf{w}^T \Sigma \mathbf{w} \quad \text{subject to} \quad \|\mathbf{w}\| = 1$$

Using Lagrange multipliers from calculus, we form the Lagrangian:

$$L(\mathbf{w}, \lambda) = \mathbf{w}^T \Sigma \mathbf{w} - \lambda(\mathbf{w}^T\mathbf{w} - 1)$$

Taking the derivative with respect to $\mathbf{w}$ and setting it to zero:

$$\frac{\partial L}{\partial \mathbf{w}} = 2\Sigma\mathbf{w} - 2\lambda\mathbf{w} = 0$$

This simplifies to:

$$\Sigma\mathbf{w} = \lambda\mathbf{w}$$

This is exactly the eigenvalue equation! The solution is the eigenvector corresponding to the largest eigenvalue. Subsequent principal components are found by maximizing variance in directions orthogonal to previous components.
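This result can be checked numerically: no unit vector should achieve a larger projected variance than the top eigenvector. A sketch using a made-up covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical covariance matrix (symmetric PSD by construction: A A^T)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T

# Top eigenvector: the claimed maximizer of w^T Sigma w over unit vectors
eigenvalues, eigenvectors = np.linalg.eigh(Sigma)
w_star = eigenvectors[:, -1]   # eigh sorts eigenvalues in ascending order
max_variance = eigenvalues[-1]

# The top eigenvector attains a projected variance equal to its eigenvalue
assert np.isclose(w_star @ Sigma @ w_star, max_variance)

# No random unit direction achieves a larger projected variance
for _ in range(1000):
    w = rng.normal(size=3)
    w /= np.linalg.norm(w)
    assert w @ Sigma @ w <= max_variance + 1e-12
```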

PCA in Python

In Python, we can implement PCA using:

  • Manual implementation: Using NumPy to compute covariance matrix, eigenvalues, and eigenvectors

  • Scikit-learn: Using the PCA class for efficient implementation

Basic Usage with Scikit-learn

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the data (X is an (n_samples, n_features) array)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)  # Reduce to 2 dimensions
X_pca = pca.fit_transform(X_scaled)

# Access explained variance
print("Variance explained by each component:")
print(pca.explained_variance_ratio_)

# Access principal components (eigenvectors)
print("\nPrincipal components:")
print(pca.components_)
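scikit-learn can also choose the number of components for you: passing a float between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))  # synthetic data for illustration

X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain at least 90% of the variance
pca = PCA(n_components=0.90)
X_pca = pca.fit_transform(X_scaled)

print(pca.n_components_)                    # number of components chosen
print(pca.explained_variance_ratio_.sum())  # >= 0.90 by construction

assert pca.explained_variance_ratio_.sum() >= 0.90
assert X_pca.shape == (200, pca.n_components_)
```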

When to Use PCA

PCA is particularly useful when:

  • Reducing computational cost: With hundreds or thousands of features

  • Visualization: Reducing to 2-3 dimensions for plotting

  • Removing multicollinearity: When features are highly correlated

  • Noise reduction: As minor components often represent noise

  • Feature extraction: Creating new features that capture the most important patterns

Important considerations:

  • PCA assumes linear relationships between features

  • The transformed features (principal components) are linear combinations of original features, which may be harder to interpret

  • PCA is sensitive to scaling, so always standardize your data first

  • PCA is an unsupervised technique - it doesn’t consider the target variable

Applications of PCA

PCA is widely used across various domains:

  • Image Processing: Facial recognition, image compression

  • Genomics: Analyzing gene expression data

  • Finance: Portfolio risk analysis, detecting patterns in stock prices

  • Natural Language Processing: Topic modeling, document similarity

  • Computer Vision: Object detection, image reconstruction

Summary

Principal Component Analysis is a powerful dimensionality reduction technique that:

  • Transforms data into a new coordinate system where axes represent directions of maximum variance

  • Uses eigenvectors and eigenvalues of the covariance matrix to identify these directions

  • Can be understood through optimization using Lagrange multipliers from calculus

  • Helps reduce computational complexity, visualize data, and eliminate feature redundancy

  • Requires standardization and careful selection of the number of components to retain

In the next section, we will explore practical implementations of PCA, both from scratch and using Python libraries.

References and Further Reading

Principal Component Analysis was originally developed by Karl Pearson (1901) and later refined by Harold Hotelling (1933). For a comprehensive modern treatment of PCA, see Jolliffe’s textbook (Jolliffe, 2002) and the review article by Jolliffe and Cadima (2016).

The Python implementation used in this book relies on the scikit-learn library Pedregosa et al. (2011), which provides efficient and well-tested implementations of PCA and other machine learning algorithms.

References
  1. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572. 10.1080/14786440109462720
  2. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417–441. 10.1037/h0071325
  3. Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer. 10.1007/b98835
  4. Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 20150202. 10.1098/rsta.2015.0202
  5. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., & others. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.