In this notebook, we will implement Principal Component Analysis in two ways:
From Scratch: Using NumPy to understand the mathematics behind PCA
Using Scikit-learn: Leveraging the efficient PCA implementation from the library
We’ll use the famous Iris dataset to demonstrate PCA in action.
Load and Explore the Data
First, let’s load the Iris dataset and examine its structure:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
X = iris.data # Features: sepal length, sepal width, petal length, petal width
y = iris.target # Target: species (0, 1, 2)
feature_names = iris.feature_names
target_names = iris.target_names
# Create a DataFrame for easier visualization
df = pd.DataFrame(X, columns=feature_names)
df['species'] = pd.Categorical.from_codes(y, target_names)
print("Dataset shape:", X.shape)
print("\nFirst few rows:")
print(df.head())
print("\nDataset statistics:")
print(df.describe())

Let's visualize the relationships between features:
# Scatter plot of two features
fig = px.scatter(df, x='sepal length (cm)', y='sepal width (cm)',
                 color='species', title='Iris Dataset: Sepal Length vs Width',
                 height=400)
fig.show()

Part 1: PCA Implementation from Scratch
Now, let’s implement PCA step by step without using any PCA libraries.
Step 1: Standardize the Data
We need to standardize the features to have mean 0 and standard deviation 1:
# Manual standardization
X_mean = np.mean(X, axis=0)
X_std = np.std(X, axis=0)
X_standardized = (X - X_mean) / X_std
print("Original data mean:", X_mean)
print("Original data std:", X_std)
print("\nStandardized data mean:", np.mean(X_standardized, axis=0))
print("Standardized data std:", np.std(X_standardized, axis=0))

Step 2: Compute the Covariance Matrix
The covariance matrix captures how features vary together:
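As a sanity check, the manual formula used in this section should agree with NumPy's built-in `np.cov`; a standalone sketch of that comparison:

```python
import numpy as np
from sklearn.datasets import load_iris

# Standardize the Iris features, then compare the manual covariance
# formula against NumPy's built-in np.cov
X = load_iris().data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

manual = (X_std.T @ X_std) / (X_std.shape[0] - 1)
builtin = np.cov(X_std, rowvar=False)  # rowvar=False: columns are variables
print(np.allclose(manual, builtin))
```

Note that `np.cov` uses the same `n - 1` (unbiased) denominator by default.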
# Compute covariance matrix manually
n_samples = X_standardized.shape[0]
covariance_matrix = (X_standardized.T @ X_standardized) / (n_samples - 1)
print("Covariance Matrix:")
print(covariance_matrix)
print("\nShape:", covariance_matrix.shape)
# Visualize the covariance matrix
fig = px.imshow(covariance_matrix,
                x=feature_names, y=feature_names,
                labels=dict(color="Covariance"),
                title="Covariance Matrix of Iris Features",
                color_continuous_scale='RdBu_r',
                aspect='auto', height=400)
fig.show()

Step 3: Compute Eigenvalues and Eigenvectors
The eigenvectors represent the principal components (directions of maximum variance), and the eigenvalues represent the amount of variance explained by each component:
# Compute eigenvalues and eigenvectors
# (np.linalg.eigh is preferred over eig for symmetric matrices such as a
# covariance matrix: it guarantees real-valued output)
eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)
print("Eigenvalues:")
print(eigenvalues)
print("\nEigenvectors (columns are eigenvectors):")
print(eigenvectors)
# Verify the eigenvalue equation: Σv = λv
print("\nVerification: Σv = λv for first eigenvector:")
v1 = eigenvectors[:, 0]
lambda1 = eigenvalues[0]
print("Σv =", covariance_matrix @ v1)
print("λv =", lambda1 * v1)
print("Are they equal?", np.allclose(covariance_matrix @ v1, lambda1 * v1))

Step 4: Sort Eigenvalues and Select Principal Components
We sort the eigenvalues in descending order and select the corresponding eigenvectors:
# Sort eigenvalues and eigenvectors in descending order
sorted_indices = np.argsort(eigenvalues)[::-1]
eigenvalues_sorted = eigenvalues[sorted_indices]
eigenvectors_sorted = eigenvectors[:, sorted_indices]
print("Sorted Eigenvalues:")
print(eigenvalues_sorted)
# Calculate variance explained by each component
total_variance = np.sum(eigenvalues_sorted)
variance_explained = eigenvalues_sorted / total_variance
cumulative_variance = np.cumsum(variance_explained)
print("\nVariance Explained by Each Component:")
for i, (var, cum_var) in enumerate(zip(variance_explained, cumulative_variance)):
    print(f"PC{i+1}: {var:.4f} ({var*100:.2f}%) - Cumulative: {cum_var:.4f} ({cum_var*100:.2f}%)")

Let's visualize the explained variance:
# Create a bar plot of variance explained
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Variance explained by each component
ax1.bar(range(1, len(variance_explained) + 1), variance_explained, alpha=0.7, color='steelblue')
ax1.set_xlabel('Principal Component')
ax1.set_ylabel('Variance Explained Ratio')
ax1.set_title('Variance Explained by Each Principal Component')
ax1.set_xticks(range(1, len(variance_explained) + 1))
# Cumulative variance explained
ax2.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='-', color='steelblue')
ax2.axhline(y=0.95, color='r', linestyle='--', label='95% Threshold')
ax2.set_xlabel('Number of Principal Components')
ax2.set_ylabel('Cumulative Variance Explained')
ax2.set_title('Cumulative Variance Explained')
ax2.set_xticks(range(1, len(cumulative_variance) + 1))
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\nNumber of components needed for 95% variance: {np.argmax(cumulative_variance >= 0.95) + 1}")

Step 5: Transform the Data
Now we project the original data onto the principal components. Let’s reduce to 2 dimensions for visualization:
# Select the first 2 principal components
n_components = 2
W = eigenvectors_sorted[:, :n_components] # Matrix of first 2 eigenvectors
print("Shape of W (projection matrix):", W.shape)
print("\nFirst 2 Principal Components (eigenvectors):")
print(W)
# Project the data onto the principal components
X_pca_manual = X_standardized @ W
print("\nShape of transformed data:", X_pca_manual.shape)
print("\nFirst few transformed samples:")
print(X_pca_manual[:5])

Visualize the PCA Results
Let’s visualize the data in the new 2D principal component space:
# Create a DataFrame for the PCA results
pca_df = pd.DataFrame(X_pca_manual, columns=['PC1', 'PC2'])
pca_df['species'] = pd.Categorical.from_codes(y, target_names)
# Scatter plot of PCA results
fig = px.scatter(pca_df, x='PC1', y='PC2', color='species',
                 title='Iris Dataset Projected onto First 2 Principal Components (Manual PCA)',
                 labels={'PC1': f'PC1 ({variance_explained[0]*100:.1f}% variance)',
                         'PC2': f'PC2 ({variance_explained[1]*100:.1f}% variance)'},
                 height=500)
fig.show()
print(f"\nTotal variance captured by 2 components: {cumulative_variance[1]*100:.2f}%")

Interpret the Principal Components
Let’s examine the loadings (coefficients) of each original feature in the principal components:
# Create a DataFrame showing the contribution of each feature to each PC
loadings_df = pd.DataFrame(
    W.T,
    columns=feature_names,
    index=['PC1', 'PC2']
)
print("Feature Loadings (contribution of each feature to PCs):")
print(loadings_df)
# Visualize loadings as a heatmap
fig = px.imshow(loadings_df,
                labels=dict(x="Feature", y="Principal Component", color="Loading"),
                title="Feature Loadings on Principal Components",
                color_continuous_scale='RdBu_r',
                aspect='auto', height=300)
fig.show()
# Interpretation
print("\nInterpretation:")
print("PC1: All features contribute positively, especially petal length and width.")
print(" This component captures the overall size of the flower.")
print("PC2: Sepal length and width contribute in opposite directions.")
print("     This component captures the shape variation.")

Part 2: PCA Using Scikit-learn
Now let's repeat the same analysis using scikit-learn's PCA implementation, which is faster and numerically more robust.
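Internally, scikit-learn computes PCA via the singular value decomposition (SVD) of the centered data rather than an explicit covariance eigendecomposition. A minimal sketch of why the two routes agree:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# SVD of the centered data: X = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Rows of Vt are the principal directions; the covariance eigenvalues
# are recovered as S**2 / (n - 1)
eigvals_from_svd = S**2 / (X.shape[0] - 1)
eigvals_direct = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print(np.allclose(eigvals_from_svd, eigvals_direct))
```

The SVD route never forms the covariance matrix explicitly, which is both cheaper and better conditioned for wide or large datasets.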
Apply PCA with Scikit-learn
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("Standardized data mean:", X_scaled.mean(axis=0))
print("Standardized data std:", X_scaled.std(axis=0))

# Apply PCA
pca_sklearn = PCA(n_components=2)
X_pca_sklearn = pca_sklearn.fit_transform(X_scaled)
print("Shape of transformed data:", X_pca_sklearn.shape)
print("\nVariance explained by each component:")
print(pca_sklearn.explained_variance_ratio_)
print("\nCumulative variance explained:")
print(np.cumsum(pca_sklearn.explained_variance_ratio_))
print("\nPrincipal Components (eigenvectors):")
print(pca_sklearn.components_)

Compare Results
Let’s verify that our manual implementation matches scikit-learn’s results:
print("Variance explained (manual vs sklearn):")
print("Manual:", variance_explained[:2])
print("Sklearn:", pca_sklearn.explained_variance_ratio_)
print("\nAre variance ratios close?", np.allclose(variance_explained[:2], pca_sklearn.explained_variance_ratio_))
# Note: The signs of eigenvectors might be flipped, but they represent the same direction
print("\nPrincipal components are the same (accounting for possible sign flips):")
for i in range(2):
    manual_pc = W[:, i]
    sklearn_pc = pca_sklearn.components_[i]
    # Check if they are the same or opposite (both are valid)
    same_or_opposite = np.allclose(manual_pc, sklearn_pc) or np.allclose(manual_pc, -sklearn_pc)
    print(f"PC{i+1}: {same_or_opposite}")

Visualize Scikit-learn Results
# Create a DataFrame for sklearn PCA results
pca_sklearn_df = pd.DataFrame(X_pca_sklearn, columns=['PC1', 'PC2'])
pca_sklearn_df['species'] = pd.Categorical.from_codes(y, target_names)
# Scatter plot
fig = px.scatter(pca_sklearn_df, x='PC1', y='PC2', color='species',
                 title='Iris Dataset Projected onto First 2 Principal Components (Scikit-learn PCA)',
                 labels={'PC1': f'PC1 ({pca_sklearn.explained_variance_ratio_[0]*100:.1f}% variance)',
                         'PC2': f'PC2 ({pca_sklearn.explained_variance_ratio_[1]*100:.1f}% variance)'},
                 height=500)
fig.show()

Using PCA for Different Numbers of Components
Let’s explore how many components we need to capture different amounts of variance:
# Apply PCA with all components
pca_full = PCA()
pca_full.fit(X_scaled)
# Find number of components for different variance thresholds
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
print("Number of components needed for:")
for threshold in [0.80, 0.90, 0.95, 0.99]:
    n_comp = np.argmax(cumvar >= threshold) + 1
    actual_var = cumvar[n_comp - 1]
    print(f"  {threshold*100:.0f}% variance: {n_comp} components (actual: {actual_var*100:.2f}%)")

Alternative: Automatically Select Components
Scikit-learn can automatically select the number of components:
# Use PCA to retain 95% of variance
pca_auto = PCA(n_components=0.95) # Retain 95% variance
X_pca_auto = pca_auto.fit_transform(X_scaled)
print(f"Number of components selected: {pca_auto.n_components_}")
print(f"Total variance explained: {np.sum(pca_auto.explained_variance_ratio_)*100:.2f}%")
print(f"\nVariance explained by each component:")
for i, var in enumerate(pca_auto.explained_variance_ratio_):
    print(f"  PC{i+1}: {var*100:.2f}%")

Inverse Transform: Reconstructing Original Data
One interesting property of PCA is that we can reconstruct an approximation of the original data from the principal components:
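This reconstruction is a single matrix product back into feature space (plus re-adding the per-feature mean). A quick standalone sketch of what `inverse_transform` does, assuming the default `whiten=False`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2)
Z = pca.fit_transform(X_scaled)

# With whiten=False, inverse_transform(Z) equals Z @ components_ + mean_
manual = Z @ pca.components_ + pca.mean_
print(np.allclose(manual, pca.inverse_transform(Z)))
```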
# Transform and then inverse transform
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_scaled)
X_reconstructed = pca_2d.inverse_transform(X_pca_2d)
# Calculate reconstruction error
reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"Reconstruction error (MSE): {reconstruction_error:.6f}")
# Compare original and reconstructed data for first sample
comparison_df = pd.DataFrame({
    'Feature': feature_names,
    'Original (scaled)': X_scaled[0],
    'Reconstructed': X_reconstructed[0],
    'Difference': X_scaled[0] - X_reconstructed[0]
})
print("\nComparison for first sample:")
print(comparison_df)

PCA for 3D Visualization
Let’s create a 3D visualization using the first three principal components:
# Apply PCA with 3 components
pca_3d = PCA(n_components=3)
X_pca_3d = pca_3d.fit_transform(X_scaled)
# Create a 3D scatter plot
pca_3d_df = pd.DataFrame(X_pca_3d, columns=['PC1', 'PC2', 'PC3'])
pca_3d_df['species'] = pd.Categorical.from_codes(y, target_names)
fig = px.scatter_3d(pca_3d_df, x='PC1', y='PC2', z='PC3', color='species',
                    title='Iris Dataset in 3D PCA Space',
                    labels={'PC1': f'PC1 ({pca_3d.explained_variance_ratio_[0]*100:.1f}%)',
                            'PC2': f'PC2 ({pca_3d.explained_variance_ratio_[1]*100:.1f}%)',
                            'PC3': f'PC3 ({pca_3d.explained_variance_ratio_[2]*100:.1f}%)'},
                    height=600)
fig.show()
print(f"Total variance explained by 3 components: {np.sum(pca_3d.explained_variance_ratio_)*100:.2f}%")

Summary
In this notebook, we:
Implemented PCA from scratch using NumPy:
Standardized the data
Computed the covariance matrix
Found eigenvalues and eigenvectors
Selected principal components based on explained variance
Transformed the data to the new coordinate system
Used scikit-learn’s PCA implementation:
Verified our manual implementation
Explored automatic component selection
Demonstrated inverse transformation
Created 2D and 3D visualizations
Key Takeaways:
PCA reduces dimensionality while preserving as much variance as possible
The first 2 principal components capture ~95% of the variance in the Iris dataset
Principal components are orthogonal linear combinations of original features
PCA makes it easier to visualize and understand high-dimensional data
Always standardize data before applying PCA to ensure all features contribute equally
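The orthogonality takeaway above can be checked directly: the rows of `components_` form an orthonormal basis, so their Gram matrix is the identity. A quick sketch:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
W = PCA(n_components=4).fit(X_scaled).components_

# W @ W.T should be the identity: unit-length, mutually orthogonal PCs
print(np.allclose(W @ W.T, np.eye(4)))
```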
References and Further Reading
The examples in this notebook demonstrate both manual and scikit-learn implementations of Principal Component Analysis. PCA was originally introduced by Karl Pearson (1901) and later developed independently by Harold Hotelling (1933).
For more detailed coverage of PCA theory and applications, we recommend:
Jolliffe's comprehensive textbook on Principal Component Analysis (Jolliffe, 2002)
The review article by Jolliffe and Cadima (2016)
The scikit-learn library (Pedregosa et al., 2011) provides the PCA class used in this notebook, which implements efficient algorithms for computing principal components.
- Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559–572. doi:10.1080/14786440109462720
- Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417–441. doi:10.1037/h0071325
- Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer. doi:10.1007/b98835
- Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 20150202. doi:10.1098/rsta.2015.0202
- Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., & others. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.