
PCA Implementation Examples

In this notebook, we will implement Principal Component Analysis in two ways:

  1. From Scratch: Using NumPy to understand the mathematics behind PCA

  2. Using Scikit-learn: Leveraging the efficient PCA implementation from the library

We’ll use the famous Iris dataset to demonstrate PCA in action.

Load and Explore the Data

First, let’s load the Iris dataset and examine its structure:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features: sepal length, sepal width, petal length, petal width
y = iris.target  # Target: species (0, 1, 2)
feature_names = iris.feature_names
target_names = iris.target_names

# Create a DataFrame for easier visualization
df = pd.DataFrame(X, columns=feature_names)
df['species'] = pd.Categorical.from_codes(y, target_names)

print("Dataset shape:", X.shape)
print("\nFirst few rows:")
print(df.head())
print("\nDataset statistics:")
print(df.describe())

Let’s visualize the relationships between features:

# Scatter plot of two features
fig = px.scatter(df, x='sepal length (cm)', y='sepal width (cm)', 
                 color='species', title='Iris Dataset: Sepal Length vs Width',
                 height=400)
fig.show()

Part 1: PCA Implementation from Scratch

Now, let’s implement PCA step by step without using any PCA libraries.

Step 1: Standardize the Data

We need to standardize the features to have mean 0 and standard deviation 1:

# Manual standardization
X_mean = np.mean(X, axis=0)
X_std = np.std(X, axis=0)
X_standardized = (X - X_mean) / X_std

print("Original data mean:", X_mean)
print("Original data std:", X_std)
print("\nStandardized data mean:", np.mean(X_standardized, axis=0))
print("Standardized data std:", np.std(X_standardized, axis=0))

Step 2: Compute the Covariance Matrix

The covariance matrix captures how features vary together:

# Compute covariance matrix manually
n_samples = X_standardized.shape[0]
covariance_matrix = (X_standardized.T @ X_standardized) / (n_samples - 1)

print("Covariance Matrix:")
print(covariance_matrix)
print("\nShape:", covariance_matrix.shape)

# Visualize the covariance matrix
fig = px.imshow(covariance_matrix, 
                x=feature_names, y=feature_names,
                labels=dict(color="Covariance"),
                title="Covariance Matrix of Iris Features",
                color_continuous_scale='RdBu_r',
                aspect='auto', height=400)
fig.show()
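As a quick sanity check (not part of the original walkthrough), the same matrix should come out of NumPy's built-in np.cov when samples are treated as rows:

# Optional verification: np.cov with rowvar=False treats rows as samples
# and uses the same (n - 1) denominator by default.
cov_np = np.cov(X_standardized, rowvar=False)
print("Matches np.cov:", np.allclose(covariance_matrix, cov_np))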

Step 3: Compute Eigenvalues and Eigenvectors

The eigenvectors represent the principal components (directions of maximum variance), and the eigenvalues represent the amount of variance explained by each component:

# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)

print("Eigenvalues:")
print(eigenvalues)
print("\nEigenvectors (columns are eigenvectors):")
print(eigenvectors)

# Verify the eigenvalue equation: Σv = λv
print("\nVerification: Σv = λv for first eigenvector:")
v1 = eigenvectors[:, 0]
lambda1 = eigenvalues[0]
print("Σv =", covariance_matrix @ v1)
print("λv =", lambda1 * v1)
print("Are they equal?", np.allclose(covariance_matrix @ v1, lambda1 * v1))

Step 4: Sort Eigenvalues and Select Principal Components

We sort the eigenvalues in descending order and select the corresponding eigenvectors:

# Sort eigenvalues and eigenvectors in descending order
sorted_indices = np.argsort(eigenvalues)[::-1]
eigenvalues_sorted = eigenvalues[sorted_indices]
eigenvectors_sorted = eigenvectors[:, sorted_indices]

print("Sorted Eigenvalues:")
print(eigenvalues_sorted)

# Calculate variance explained by each component
total_variance = np.sum(eigenvalues_sorted)
variance_explained = eigenvalues_sorted / total_variance
cumulative_variance = np.cumsum(variance_explained)

print("\nVariance Explained by Each Component:")
for i, (var, cum_var) in enumerate(zip(variance_explained, cumulative_variance)):
    print(f"PC{i+1}: {var:.4f} ({var*100:.2f}%) - Cumulative: {cum_var:.4f} ({cum_var*100:.2f}%)")

Let’s visualize the explained variance:

# Create a bar plot of variance explained
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Variance explained by each component
ax1.bar(range(1, len(variance_explained) + 1), variance_explained, alpha=0.7, color='steelblue')
ax1.set_xlabel('Principal Component')
ax1.set_ylabel('Variance Explained Ratio')
ax1.set_title('Variance Explained by Each Principal Component')
ax1.set_xticks(range(1, len(variance_explained) + 1))

# Cumulative variance explained
ax2.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='-', color='steelblue')
ax2.axhline(y=0.95, color='r', linestyle='--', label='95% Threshold')
ax2.set_xlabel('Number of Principal Components')
ax2.set_ylabel('Cumulative Variance Explained')
ax2.set_title('Cumulative Variance Explained')
ax2.set_xticks(range(1, len(cumulative_variance) + 1))
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nNumber of components needed for 95% variance: {np.argmax(cumulative_variance >= 0.95) + 1}")

Step 5: Transform the Data

Now we project the original data onto the principal components. Let’s reduce to 2 dimensions for visualization:

# Select the first 2 principal components
n_components = 2
W = eigenvectors_sorted[:, :n_components]  # Matrix of first 2 eigenvectors

print("Shape of W (projection matrix):", W.shape)
print("\nFirst 2 Principal Components (eigenvectors):")
print(W)

# Project the data onto the principal components
X_pca_manual = X_standardized @ W

print("\nShape of transformed data:", X_pca_manual.shape)
print("\nFirst few transformed samples:")
print(X_pca_manual[:5])
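A quick optional check: in the new coordinate system the components should be uncorrelated, so the covariance matrix of the projected data should be (approximately) diagonal, with the two largest eigenvalues on the diagonal:

# Optional: covariance of the projected scores should be ~diag(lambda_1, lambda_2).
cov_projected = np.cov(X_pca_manual, rowvar=False)
print("Covariance of projected data:")
print(np.round(cov_projected, 4))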

Visualize the PCA Results

Let’s visualize the data in the new 2D principal component space:

# Create a DataFrame for the PCA results
pca_df = pd.DataFrame(X_pca_manual, columns=['PC1', 'PC2'])
pca_df['species'] = pd.Categorical.from_codes(y, target_names)

# Scatter plot of PCA results
fig = px.scatter(pca_df, x='PC1', y='PC2', color='species',
                 title='Iris Dataset Projected onto First 2 Principal Components (Manual PCA)',
                 labels={'PC1': f'PC1 ({variance_explained[0]*100:.1f}% variance)',
                        'PC2': f'PC2 ({variance_explained[1]*100:.1f}% variance)'},
                 height=500)
fig.show()

print(f"\nTotal variance captured by 2 components: {cumulative_variance[1]*100:.2f}%")

Interpret the Principal Components

Let’s examine the loadings (coefficients) of each original feature in the principal components:

# Create a DataFrame showing the contribution of each feature to each PC
loadings_df = pd.DataFrame(
    W.T,
    columns=feature_names,
    index=['PC1', 'PC2']
)

print("Feature Loadings (contribution of each feature to PCs):")
print(loadings_df)

# Visualize loadings as a heatmap
fig = px.imshow(loadings_df, 
                labels=dict(x="Feature", y="Principal Component", color="Loading"),
                title="Feature Loadings on Principal Components",
                color_continuous_scale='RdBu_r',
                aspect='auto', height=300)
fig.show()

# Interpretation
print("\nInterpretation:")
print("PC1: All features contribute positively, especially petal length and width.")
print("     This component captures the overall size of the flower.")
print("PC2: Sepal length and width contribute in opposite directions.")
print("     This component captures the shape variation.")

Part 2: PCA Using Scikit-learn

Now let’s implement the same analysis using scikit-learn’s PCA implementation, which is more efficient and more numerically robust, since it computes the decomposition via an SVD of the data rather than forming the covariance matrix explicitly.

Apply PCA with Scikit-learn

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Standardized data mean:", X_scaled.mean(axis=0))
print("Standardized data std:", X_scaled.std(axis=0))
# Apply PCA
pca_sklearn = PCA(n_components=2)
X_pca_sklearn = pca_sklearn.fit_transform(X_scaled)

print("Shape of transformed data:", X_pca_sklearn.shape)
print("\nVariance explained by each component:")
print(pca_sklearn.explained_variance_ratio_)
print("\nCumulative variance explained:")
print(np.cumsum(pca_sklearn.explained_variance_ratio_))
print("\nPrincipal Components (eigenvectors):")
print(pca_sklearn.components_)

Compare Results

Let’s verify that our manual implementation matches scikit-learn’s results:

print("Variance explained (manual vs sklearn):")
print("Manual:", variance_explained[:2])
print("Sklearn:", pca_sklearn.explained_variance_ratio_)
print("\nAre variance ratios close?", np.allclose(variance_explained[:2], pca_sklearn.explained_variance_ratio_))

# Note: the signs of eigenvectors may differ between implementations; a flipped
# vector spans the same axis, so either sign is a valid principal component
print("\nPrincipal components are the same (accounting for possible sign flips):")
for i in range(2):
    manual_pc = W[:, i]
    sklearn_pc = pca_sklearn.components_[i]
    # Check if they are the same or opposite (both are valid)
    same_or_opposite = np.allclose(manual_pc, sklearn_pc) or np.allclose(manual_pc, -sklearn_pc)
    print(f"PC{i+1}: {same_or_opposite}")

Visualize Scikit-learn Results

# Create a DataFrame for sklearn PCA results
pca_sklearn_df = pd.DataFrame(X_pca_sklearn, columns=['PC1', 'PC2'])
pca_sklearn_df['species'] = pd.Categorical.from_codes(y, target_names)

# Scatter plot
fig = px.scatter(pca_sklearn_df, x='PC1', y='PC2', color='species',
                 title='Iris Dataset Projected onto First 2 Principal Components (Scikit-learn PCA)',
                 labels={'PC1': f'PC1 ({pca_sklearn.explained_variance_ratio_[0]*100:.1f}% variance)',
                        'PC2': f'PC2 ({pca_sklearn.explained_variance_ratio_[1]*100:.1f}% variance)'},
                 height=500)
fig.show()

Using PCA for Different Numbers of Components

Let’s explore how many components we need to capture different amounts of variance:

# Apply PCA with all components
pca_full = PCA()
pca_full.fit(X_scaled)

# Find number of components for different variance thresholds
cumvar = np.cumsum(pca_full.explained_variance_ratio_)

print("Number of components needed for:")
for threshold in [0.80, 0.90, 0.95, 0.99]:
    n_comp = np.argmax(cumvar >= threshold) + 1
    actual_var = cumvar[n_comp - 1]
    print(f"  {threshold*100:.0f}% variance: {n_comp} components (actual: {actual_var*100:.2f}%)")

Alternative: Automatically Select Components

Scikit-learn can automatically select the number of components:

# Use PCA to retain 95% of variance
pca_auto = PCA(n_components=0.95)  # Retain 95% variance
X_pca_auto = pca_auto.fit_transform(X_scaled)

print(f"Number of components selected: {pca_auto.n_components_}")
print(f"Total variance explained: {np.sum(pca_auto.explained_variance_ratio_)*100:.2f}%")
print(f"\nVariance explained by each component:")
for i, var in enumerate(pca_auto.explained_variance_ratio_):
    print(f"  PC{i+1}: {var*100:.2f}%")

Inverse Transform: Reconstructing Original Data

One interesting property of PCA is that we can reconstruct an approximation of the original data from the principal components:

# Transform and then inverse transform
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_scaled)
X_reconstructed = pca_2d.inverse_transform(X_pca_2d)

# Calculate reconstruction error
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"Reconstruction error (MSE): {reconstruction_error:.6f}")

# Compare original and reconstructed data for first sample
comparison_df = pd.DataFrame({
    'Feature': feature_names,
    'Original (scaled)': X_scaled[0],
    'Reconstructed': X_reconstructed[0],
    'Difference': X_scaled[0] - X_reconstructed[0]
})
print("\nComparison for first sample:")
print(comparison_df)
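For a PCA fitted without whitening, inverse_transform is simply the linear map back to feature space, X_reconstructed = Z @ components_ + mean_; a short optional check confirms this:

# Optional: reproduce inverse_transform by hand (whiten=False case).
X_manual_recon = X_pca_2d @ pca_2d.components_ + pca_2d.mean_
print("Matches inverse_transform:", np.allclose(X_manual_recon, X_reconstructed))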

PCA for 3D Visualization

Let’s create a 3D visualization using the first three principal components:

# Apply PCA with 3 components
pca_3d = PCA(n_components=3)
X_pca_3d = pca_3d.fit_transform(X_scaled)

# Create a 3D scatter plot
pca_3d_df = pd.DataFrame(X_pca_3d, columns=['PC1', 'PC2', 'PC3'])
pca_3d_df['species'] = pd.Categorical.from_codes(y, target_names)

fig = px.scatter_3d(pca_3d_df, x='PC1', y='PC2', z='PC3', color='species',
                    title='Iris Dataset in 3D PCA Space',
                    labels={'PC1': f'PC1 ({pca_3d.explained_variance_ratio_[0]*100:.1f}%)',
                           'PC2': f'PC2 ({pca_3d.explained_variance_ratio_[1]*100:.1f}%)',
                           'PC3': f'PC3 ({pca_3d.explained_variance_ratio_[2]*100:.1f}%)'},
                    height=600)
fig.show()

print(f"Total variance explained by 3 components: {np.sum(pca_3d.explained_variance_ratio_)*100:.2f}%")

Summary

In this notebook, we:

  1. Implemented PCA from scratch using NumPy:

    • Standardized the data

    • Computed the covariance matrix

    • Found eigenvalues and eigenvectors

    • Selected principal components based on explained variance

    • Transformed the data to the new coordinate system

  2. Used scikit-learn’s PCA implementation:

    • Verified our manual implementation

    • Explored automatic component selection

    • Demonstrated inverse transformation

    • Created 2D and 3D visualizations

Key Takeaways:

  • PCA reduces dimensionality while preserving as much variance as possible

  • The first 2 principal components capture ~95% of the variance in the Iris dataset

  • Principal components are orthogonal linear combinations of original features

  • PCA makes it easier to visualize and understand high-dimensional data

  • Always standardize data before applying PCA to ensure all features contribute equally (a quick check below illustrates why)
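To illustrate the last point with a quick, optional comparison: on the raw centimetre-scale data, petal length has by far the largest variance, so an unscaled PCA is driven almost entirely by that one feature, whereas the standardized version spreads the contributions more evenly:

# Optional: compare explained variance ratios with and without standardization.
pca_raw = PCA(n_components=2).fit(X)            # raw features, no scaling
pca_std = PCA(n_components=2).fit(X_scaled)     # standardized features
print("Explained variance ratio (raw):   ", pca_raw.explained_variance_ratio_)
print("Explained variance ratio (scaled):", pca_std.explained_variance_ratio_)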

References and Further Reading

The examples in this notebook demonstrate both manual and scikit-learn implementations of Principal Component Analysis. PCA was originally developed by Karl Pearson (Pearson, 1901) and later extended by Harold Hotelling (Hotelling, 1933).

For more detailed coverage of PCA theory and applications, we recommend Jolliffe (2002) and the more recent review by Jolliffe & Cadima (2016).

The scikit-learn library (Pedregosa et al., 2011) provides the PCA class used in this notebook, which implements efficient algorithms for computing principal components.

References
  1. Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572. 10.1080/14786440109462720
  2. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417–441. 10.1037/h0071325
  3. Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer. 10.1007/b98835
  4. Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 20150202. 10.1098/rsta.2015.0202
  5. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., & others. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.