K-Nearest Neighbors (KNN) Overview - Foundations of Machine Learning I Book

This notebook provides a comprehensive overview of the K-Nearest Neighbors algorithm, with a focus on:

How the KNN algorithm works
How changing the k hyperparameter relates to overfitting and underfitting
The Bias-Variance Tradeoff and why it is core to machine learning

1. What is KNN?¶

K-Nearest Neighbors is one of the simplest and most intuitive supervised learning algorithms. It can be used for both classification and regression tasks.

Core Idea: To make a prediction for a new data point, KNN looks at the k closest data points (neighbors) in the training set and uses their labels to determine the prediction.

Classification: The predicted class is the majority vote among the k nearest neighbors.
Regression: The predicted value is the average (or weighted average) of the k nearest neighbors’ values.

Key Characteristics¶

Instance-based (lazy) learning: KNN stores the entire training dataset and only performs computation at prediction time.
Non-parametric: KNN makes no assumptions about the underlying data distribution.
Distance-based: Predictions depend on a distance metric (commonly Euclidean distance).

2. A Simple KNN Example¶

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Generate a simple 2D dataset
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit KNN with k=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(f"Training Accuracy (k=5): {knn.score(X_train, y_train):.3f}")
print(f"Test Accuracy (k=5): {knn.score(X_test, y_test):.3f}")

3. The Role of k: Overfitting vs. Underfitting¶

The hyperparameter k (the number of neighbors) is the most important tuning knob in KNN. Changing k directly controls the complexity of the model and determines whether the model overfits or underfits.

Small k (e.g., k=1)¶

The model is highly flexible — it captures every local variation in the training data.
Decision boundaries are jagged and complex.
The model fits the training data very closely, including noise.
This leads to overfitting: High training accuracy but poor generalization to unseen data.

Large k (e.g., k=n)¶

The model is very rigid — it smooths over local patterns and responds only to global trends.
Decision boundaries are overly smooth.
The model ignores important structure in the data.
This leads to underfitting: Both training and test accuracy are low.

The Sweet Spot¶

An intermediate value of k balances flexibility and rigidity.
The model captures the true patterns without fitting to noise.
This yields the best generalization performance on unseen data.

Visualizing the Effect of k¶

from matplotlib.colors import ListedColormap

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
k_values = [1, 7, 50]

for ax, k in zip(axes, k_values):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    
    # Create decision boundary mesh
    h = 0.2
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, alpha=0.3, cmap=ListedColormap(['#FFAAAA', '#AAAAFF']))
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=ListedColormap(['red', 'blue']),
               edgecolors='k', s=20)
    ax.set_title(f'k = {k}\nTrain: {knn.score(X_train, y_train):.2f}, Test: {knn.score(X_test, y_test):.2f}')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')

plt.tight_layout()
plt.suptitle('Effect of k on Decision Boundaries', y=1.02, fontsize=14)
plt.show()

Training vs. Test Accuracy Across k Values¶

k_range = range(1, 51)
train_scores = []
test_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

plt.figure(figsize=(10, 5))
plt.plot(k_range, train_scores, label='Training Accuracy', marker='o', markersize=3)
plt.plot(k_range, test_scores, label='Test Accuracy', marker='s', markersize=3)
plt.xlabel('k (Number of Neighbors)')
plt.ylabel('Accuracy')
plt.title('Training vs. Test Accuracy as a Function of k')
plt.legend()
plt.axvline(x=k_range[np.argmax(test_scores)], color='green', linestyle='--', 
            label=f'Best k={k_range[np.argmax(test_scores)]}')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Best k value: {k_range[np.argmax(test_scores)]}")
print(f"Best test accuracy: {max(test_scores):.3f}")

4. The Bias-Variance Tradeoff¶

The behavior we observe with different k values is a direct manifestation of one of the most fundamental concepts in machine learning: the Bias-Variance Tradeoff.

What is Bias?¶

Bias is the error introduced by approximating a complex real-world problem with a simplified model. It measures how far off the model’s average predictions are from the correct values.

High bias → The model is too simple and misses important patterns (underfitting).
Low bias → The model can capture complex relationships in the data.

In KNN:

Large k → High bias (the model averages over too many neighbors, smoothing away true patterns)

What is Variance?¶

Variance is the amount by which the model’s predictions would change if it were trained on a different dataset. It measures the model’s sensitivity to the specific training data.

High variance → The model is overly sensitive to training data (overfitting).
Low variance → The model produces consistent predictions across different training sets.

In KNN:

Small k → High variance (the model’s predictions change dramatically with small changes in training data)

The Tradeoff¶

Total prediction error can be decomposed as:

\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}

(1)

As model complexity increases (small k), bias decreases but variance increases.
As model complexity decreases (large k), variance decreases but bias increases.
The optimal model minimizes the total error by finding the right balance.

Visualizing the Bias-Variance Tradeoff in KNN¶

fig, ax = plt.subplots(figsize=(10, 6))

# Conceptual illustration of bias-variance tradeoff
k_vals = np.arange(1, 51)
# Simulated bias^2 and variance curves (conceptual)
bias_squared = 0.02 * (k_vals - 1) ** 0.8
variance = 0.5 / (k_vals ** 0.5)
total_error = bias_squared + variance + 0.05  # irreducible noise = 0.05

ax.plot(k_vals, bias_squared, 'b-', linewidth=2, label='Bias²')
ax.plot(k_vals, variance, 'r-', linewidth=2, label='Variance')
ax.plot(k_vals, total_error, 'k--', linewidth=2, label='Total Error')
ax.axhline(y=0.05, color='gray', linestyle=':', label='Irreducible Noise')

optimal_k = k_vals[np.argmin(total_error)]
ax.axvline(x=optimal_k, color='green', linestyle='--', alpha=0.7, label=f'Optimal k ≈ {optimal_k}')

ax.set_xlabel('k (Number of Neighbors)', fontsize=12)
ax.set_ylabel('Error', fontsize=12)
ax.set_title('Bias-Variance Tradeoff in KNN', fontsize=14)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)

# Add annotations
ax.annotate('Overfitting\n(Low k, High Variance)', xy=(3, 0.35), fontsize=10, color='red',
            ha='center')
ax.annotate('Underfitting\n(High k, High Bias)', xy=(45, 0.35), fontsize=10, color='blue',
            ha='center')

plt.tight_layout()
plt.show()

Summary Table: k, Bias, Variance, and Fit¶

k Value	Bias	Variance	Model Complexity	Risk
Small (k=1)	Low	High	High (complex boundaries)	Overfitting
Medium	Moderate	Moderate	Balanced	Best generalization
Large (k=n)	High	Low	Low (smooth boundaries)	Underfitting

5. Why the Bias-Variance Tradeoff is Core to Machine Learning¶

The bias-variance tradeoff is not just a KNN concept — it is a fundamental principle that underlies all of machine learning. Here is why every practitioner must deeply understand it:

5.1 It Explains Why Models Fail¶

Every time a model performs poorly, the root cause can be traced back to this tradeoff:

Model too simple? → High bias → Underfitting → The model cannot capture the patterns in the data.
Model too complex? → High variance → Overfitting → The model memorizes noise instead of learning generalizable patterns.

Understanding bias and variance gives you a diagnostic framework to identify and fix model issues.

5.2 It Guides Every Modeling Decision¶

Nearly every decision in building a machine learning model is a bias-variance decision in disguise:

Decision	Bias-Variance Implication
Choosing k in KNN	Directly controls bias vs. variance
Choosing tree depth in Decision Trees	Deeper = lower bias, higher variance
Adding regularization (L1/L2)	Increases bias to reduce variance
Adding more features	May reduce bias but increase variance
Ensemble methods (Random Forest, Boosting)	Reduce variance (bagging) or bias (boosting)
Collecting more training data	Reduces variance without increasing bias

5.3 It Defines the Goal of Machine Learning¶

The ultimate goal of supervised learning is generalization — performing well on unseen data. The bias-variance tradeoff tells us that:

Perfect training performance is not the goal. The goal is to find the sweet spot where total error (on unseen data) is minimized.

This insight is why we use:

Cross-validation — to estimate generalization performance
Regularization — to deliberately increase bias in exchange for lower variance
Train/validation/test splits — to detect overfitting before deployment

5.4 It Connects All Algorithms¶

Whether you are working with linear regression, neural networks, SVMs, or decision trees, the bias-variance tradeoff is always at play. Understanding it allows you to:

Transfer intuition across algorithms
Diagnose problems faster
Make better hyperparameter choices
Build more robust models

5.5 It’s the Foundation of Model Selection¶

Model selection — choosing between competing algorithms and configurations — is fundamentally about navigating the bias-variance tradeoff. Techniques like:

Grid search / Random search for hyperparameters
Learning curves (plotting error vs. training size)
Validation curves (plotting error vs. model complexity)

...are all tools for finding the optimal point on the bias-variance spectrum.

6. Key Takeaways¶

KNN is an intuitive, non-parametric algorithm that makes predictions based on the k nearest neighbors in the training data.
The k hyperparameter directly controls overfitting vs. underfitting:
- Small k → complex model → overfitting (high variance, low bias)
- Large k → simple model → underfitting (high bias, low variance)
The Bias-Variance Tradeoff decomposes prediction error into bias², variance, and irreducible noise. The goal is to minimize total error, not just one component.
The Bias-Variance Tradeoff is core to ML because:
- It explains why models fail
- It guides every modeling decision
- It defines the true goal (generalization, not memorization)
- It connects all ML algorithms under one unifying framework
- It is the foundation of model selection and evaluation
Mastering this tradeoff is what separates practitioners who blindly try algorithms from those who systematically build high-performing models.

7. Practice Questions¶

If your KNN model has 100% training accuracy but 60% test accuracy, is it overfitting or underfitting? What should you do with k?
If both training and test accuracy are around 55%, what is the likely problem? How would you adjust k?
Explain in your own words why collecting more training data helps with variance but not with bias.
A colleague says “I always use k=1 because it gives the best training accuracy.” What would you tell them?
How does the bias-variance tradeoff apply to choosing the depth of a decision tree?