
Logistic Regression Overview

Logistic Regression is a supervised learning algorithm used for binary classification problems, where the target variable can take on only two possible outcomes (e.g., “yes” or “no”, “pass” or “fail”, “spam” or “not spam”). Despite its name containing “regression,” logistic regression is actually a classification algorithm.

Real-World Applications

Logistic regression is widely used across many domains:

  • Medicine: Predicting whether a patient has a disease based on symptoms and test results

  • Finance: Determining credit risk (will a customer default on a loan?)

  • Marketing: Predicting customer churn (will a customer leave the service?)

  • Email Filtering: Classifying emails as spam or not spam

  • Admissions: Predicting whether a student will be admitted to a university

In these scenarios, we’re not predicting a continuous value like in linear regression, but rather the probability that an instance belongs to a particular class.

From Linear Regression to Logistic Regression

To understand logistic regression, let’s first recall linear regression:

y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n

Linear regression predicts continuous values and can output any real number. However, for classification, we need to predict probabilities that are bounded between 0 and 1.

The Sigmoid (Logistic) Function

The key to logistic regression is the sigmoid function (also called the logistic function), which transforms any real-valued number into a value between 0 and 1:

\sigma(z) = \frac{1}{1 + e^{-z}}

Where:

  • $z$ is the linear combination of input features: $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$

  • $e$ is Euler's number (approximately 2.71828)

  • $\sigma(z)$ represents the probability that $y = 1$ given the input features

The complete logistic regression model is:

P(y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}
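To make the formula concrete, here is a quick numeric sketch. The coefficients ($\beta_0 = -3$, $\beta_1 = 0.8$, $\beta_2 = 0.05$) and feature values are purely illustrative, not fitted to any data:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Purely illustrative coefficients and inputs
beta0, beta1, beta2 = -3.0, 0.8, 0.05
x1, x2 = 4.0, 70.0

z = beta0 + beta1 * x1 + beta2 * x2   # linear combination: z = 3.7
p = sigmoid(z)                        # squashed into (0, 1)
print(f"z = {z:.2f}, P(y=1|X) = {p:.3f}")
```

Any real-valued $z$, however large or small, maps to a valid probability.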

Properties of the Sigmoid Function

The sigmoid function has several important properties:

  • As $z \to \infty$, $\sigma(z) \to 1$

  • As $z \to -\infty$, $\sigma(z) \to 0$

  • When $z = 0$, $\sigma(z) = 0.5$

  • The function is smooth and differentiable everywhere (important for gradient-based optimization)

  • It has an S-shaped curve

Let’s visualize the sigmoid function:

import numpy as np
import plotly.graph_objects as go

# Define the sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Create z values
z = np.linspace(-10, 10, 200)
y = sigmoid(z)

# Create the plot
fig = go.Figure()
fig.add_trace(go.Scatter(x=z, y=y, mode='lines', name='Sigmoid Function',
                         line=dict(color='blue', width=2)))

# Add a horizontal line at y=0.5 for the decision boundary
fig.add_hline(y=0.5, line_dash="dash", line_color="red", 
              annotation_text="Decision Boundary (p=0.5)")

fig.update_layout(
    title="The Sigmoid (Logistic) Function",
    xaxis_title="z (Linear Combination of Features)",
    yaxis_title="σ(z) - Probability",
    height=400
)

fig.show()

Log-Odds and the Logit Function

To understand logistic regression from a probability perspective, we introduce the concept of odds and log-odds.

Odds

If the probability of an event is $p$, then the odds of that event are:

\text{Odds} = \frac{p}{1-p}

For example, if the probability of rain is 0.75, the odds are $\frac{0.75}{0.25} = 3$, meaning rain is 3 times as likely as no rain.
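The probability-to-odds conversion can be sketched as two tiny helper functions (the function names here are my own, for illustration):

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def prob_from_odds(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

print(odds(0.75))           # 3.0 (rain is 3 times as likely as no rain)
print(prob_from_odds(3.0))  # 0.75
```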

Log-Odds (Logit)

The log-odds (or logit) is the natural logarithm of the odds:

\text{logit}(p) = \log\left(\frac{p}{1-p}\right)

In logistic regression, we model the log-odds as a linear function of the input features:

\log\left(\frac{P(y=1|X)}{1-P(y=1|X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n

This is why logistic regression is also called a Generalized Linear Model (GLM): it models a transformation of the probability (the log-odds) as a linear combination of features.

By taking the exponential of both sides and rearranging, we arrive back at the sigmoid function we saw earlier.
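One way to see this relationship is that the logit and sigmoid are inverse functions. A short numerical sanity check (a sketch, not part of the lesson's code):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logit(p):
    return np.log(p / (1 - p))

# Applying logit after sigmoid recovers the original linear term z
for z in (-2.0, 0.0, 1.5):
    assert np.isclose(logit(sigmoid(z)), z)
print("logit(sigmoid(z)) == z for all tested z")
```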

Mathematical Connections

Logistic regression connects to several mathematical concepts from your previous coursework:

Probability Theory

  • The output is a probability: $P(y=1|X) \in [0,1]$

  • We use the Bernoulli distribution (binary outcomes)

  • Maximum Likelihood Estimation (MLE) is used to find optimal parameters

Calculus

  • The derivative of the sigmoid function has a convenient form:

\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z))

  • This makes gradient descent optimization efficient

  • The loss function involves logarithms and requires calculus to minimize
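The convenient derivative identity above can be verified numerically against a central finite difference; this is just a sanity-check sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-5, 5, 11)
analytic = sigmoid(z) * (1 - sigmoid(z))           # closed-form derivative

h = 1e-6                                           # central finite difference
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

assert np.allclose(analytic, numeric, atol=1e-6)
print("Derivative identity confirmed numerically")
```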

Linear Algebra

  • The linear combination of features can be written as a dot product:

z = \boldsymbol{\beta}^T \mathbf{X} = \beta_0 + \sum_{i=1}^{n} \beta_i X_i

  • Matrix operations are used for efficient computation with multiple samples

The Loss Function: Binary Cross-Entropy

To train a logistic regression model, we need to define a loss function that measures how well our predictions match the actual labels. The standard loss function for logistic regression is binary cross-entropy (also called log loss):

For a single observation:

L(y, \hat{p}) = -[y \log(\hat{p}) + (1-y) \log(1-\hat{p})]

Where:

  • $y$ is the actual label (0 or 1)

  • $\hat{p}$ is the predicted probability that $y = 1$

For the entire dataset, we take the average:

J(\boldsymbol{\beta}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{p}^{(i)}) + (1-y^{(i)}) \log(1-\hat{p}^{(i)}) \right]

Where $m$ is the number of training examples.

This loss function:

  • Heavily penalizes confident wrong predictions

  • Is derived from the principle of Maximum Likelihood Estimation

  • Is convex, which guarantees a global minimum
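To connect the formula to code, here is a small sketch that computes the average binary cross-entropy directly and checks it against scikit-learn's log_loss. The labels and probabilities below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # hypothetical predicted probabilities

# Average binary cross-entropy, straight from the formula
bce = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

assert np.isclose(bce, log_loss(y_true, p_hat))
print(f"Binary cross-entropy: {bce:.4f}")
```

Try replacing 0.9 with 0.99 for a wrong label to see how sharply a confident mistake is penalized.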

Model Training: Gradient Descent

Like linear regression, logistic regression uses gradient descent to find the optimal parameters $\boldsymbol{\beta}$ that minimize the loss function.

The update rule for each parameter is:

\beta_j := \beta_j - \alpha \frac{\partial J(\boldsymbol{\beta})}{\partial \beta_j}

Where:

  • $\alpha$ is the learning rate

  • The gradient (partial derivative) tells us the direction to adjust the parameter

For logistic regression, the gradient has the form:

\frac{\partial J(\boldsymbol{\beta})}{\partial \beta_j} = \frac{1}{m} \sum_{i=1}^{m} (\hat{p}^{(i)} - y^{(i)}) x_j^{(i)}

Notice this has the same form as the linear regression gradient, but $\hat{p}$ comes from the sigmoid function!
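The update rule and gradient above are enough to train a tiny logistic regression from scratch. The following is an illustrative sketch on synthetic data (the "true" coefficients are made up), not scikit-learn's actual implementation:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
m = 500
X = np.column_stack([np.ones(m), rng.normal(size=m)])   # bias column + one feature
true_beta = np.array([-1.0, 2.0])                       # made-up generating parameters
y = (rng.uniform(size=m) < sigmoid(X @ true_beta)).astype(float)

beta = np.zeros(2)
alpha = 0.5                                             # learning rate
for _ in range(2000):
    p_hat = sigmoid(X @ beta)
    grad = X.T @ (p_hat - y) / m                        # the gradient formula above
    beta -= alpha * grad                                # the update rule above

print("Estimated coefficients:", np.round(beta, 2))
```

With enough iterations, the estimates land close to the generating parameters.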

Making Predictions

Once trained, logistic regression makes predictions in two steps:

  1. Calculate the probability:

     \hat{p} = P(y=1|X) = \sigma(\boldsymbol{\beta}^T \mathbf{X})

  2. Apply a decision threshold (typically 0.5):

     \hat{y} = \begin{cases} 1 & \text{if } \hat{p} \geq 0.5 \\ 0 & \text{if } \hat{p} < 0.5 \end{cases}

The threshold can be adjusted based on the problem requirements:

  • Higher threshold: More conservative predictions of the positive class (fewer false positives)

  • Lower threshold: More aggressive predictions of the positive class (fewer false negatives)
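A quick sketch of how the threshold changes the predicted labels, using a few hypothetical predicted probabilities:

```python
import numpy as np

p_hat = np.array([0.15, 0.40, 0.55, 0.70, 0.92])   # hypothetical predicted probabilities

for threshold in (0.3, 0.5, 0.7):
    y_hat = (p_hat >= threshold).astype(int)
    print(f"threshold={threshold}: predictions={y_hat.tolist()}")
```

Raising the threshold flips borderline cases from positive to negative without retraining the model.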

Logistic Regression in Python

Let’s implement logistic regression using scikit-learn with a practical example. We’ll create a dataset to predict whether a student will pass an exam based on hours studied and previous test scores.

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Create a sample dataset
np.random.seed(42)
n_samples = 200

# Features: hours studied and previous test score
hours_studied = np.random.uniform(0, 10, n_samples)
previous_score = np.random.uniform(40, 100, n_samples)

# Target: pass (1) or fail (0)
# Higher hours and previous scores increase probability of passing
z = -8 + 0.7 * hours_studied + 0.1 * previous_score + np.random.normal(0, 1, n_samples)
probability = 1 / (1 + np.exp(-z))
passed = (probability > 0.5).astype(int)

# Create DataFrame
df = pd.DataFrame({
    'HoursStudied': hours_studied,
    'PreviousScore': previous_score,
    'Passed': passed
})

print("Dataset Overview:")
print(df.head(10))
print(f"\nPass Rate: {df['Passed'].mean():.2%}")

Data Visualization

Let’s visualize the relationship between our features and the target variable:

import plotly.express as px

fig = px.scatter(df, x='HoursStudied', y='PreviousScore',
                 color=df['Passed'].astype(str),  # cast to string so colors are discrete, not a continuous scale
                 color_discrete_map={'0': 'red', '1': 'green'},
                 labels={'color': 'Exam Result'},
                 title='Student Exam Results by Study Hours and Previous Score',
                 height=450)

fig.update_traces(marker=dict(size=8, line=dict(width=1, color='DarkSlateGrey')))
fig.show()

Data Splitting

We split our data into training and test sets:

# Prepare features and target
X = df[['HoursStudied', 'PreviousScore']]
y = df['Passed']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

Feature Scaling

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature scaling applied")
print(f"Original HoursStudied range: [{X_train['HoursStudied'].min():.2f}, {X_train['HoursStudied'].max():.2f}]")
print(f"Scaled HoursStudied range: [{X_train_scaled[:, 0].min():.2f}, {X_train_scaled[:, 0].max():.2f}]")

Model Training

Now we train the logistic regression model:

# Initialize and train the model
log_reg = LogisticRegression(random_state=42, max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

# Display the learned parameters
print("Model Coefficients:")
print(f"  Intercept (β₀): {log_reg.intercept_[0]:.4f}")
print(f"  HoursStudied (β₁): {log_reg.coef_[0][0]:.4f}")
print(f"  PreviousScore (β₂): {log_reg.coef_[0][1]:.4f}")

print("\nInterpretation:")
print(f"  A one standard deviation increase in HoursStudied increases")
print(f"  the log-odds of passing by {log_reg.coef_[0][0]:.4f}")
print(f"  A one standard deviation increase in PreviousScore increases")
print(f"  the log-odds of passing by {log_reg.coef_[0][1]:.4f}")

Making Predictions

Logistic regression provides both predicted probabilities and class labels:

# Get predicted probabilities
y_pred_proba = log_reg.predict_proba(X_test_scaled)

# Get predicted class labels
y_pred = log_reg.predict(X_test_scaled)

# Create a results DataFrame
results_df = pd.DataFrame({
    'HoursStudied': X_test['HoursStudied'].values,
    'PreviousScore': X_test['PreviousScore'].values,
    'Actual': y_test.values,
    'Prob_Fail': y_pred_proba[:, 0],
    'Prob_Pass': y_pred_proba[:, 1],
    'Predicted': y_pred
})

print("Sample Predictions:")
print(results_df.head(10))

Evaluation Metrics for Classification

Unlike regression, we use different metrics to evaluate classification models:

Confusion Matrix

The confusion matrix shows the counts of correct and incorrect predictions:

                     | Predicted Negative (0) | Predicted Positive (1)
Actual Negative (0)  | True Negative (TN)     | False Positive (FP)
Actual Positive (1)  | False Negative (FN)    | True Positive (TP)

Key Metrics

From the confusion matrix, we derive several important metrics:

Accuracy: Overall proportion of correct predictions

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

Precision: Of all positive predictions, what proportion was actually positive?

\text{Precision} = \frac{TP}{TP + FP}

Recall (Sensitivity): Of all actual positives, what proportion was correctly identified?

\text{Recall} = \frac{TP}{TP + FN}

F1-Score: Harmonic mean of precision and recall

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Let’s calculate these metrics:

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print(f"\nTrue Negatives: {cm[0,0]}")
print(f"False Positives: {cm[0,1]}")
print(f"False Negatives: {cm[1,0]}")
print(f"True Positives: {cm[1,1]}")

# Calculate metrics
print("\n" + "="*50)
print("MODEL PERFORMANCE METRICS")
print("="*50)
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred):.4f}")

# Detailed classification report
print("\n" + "="*50)
print("CLASSIFICATION REPORT")
print("="*50)
print(classification_report(y_test, y_pred, target_names=['Failed', 'Passed']))

Visualizing the Confusion Matrix

import plotly.figure_factory as ff

# Create confusion matrix heatmap
z = cm
x = ['Predicted: Failed (0)', 'Predicted: Passed (1)']
y = ['Actual: Failed (0)', 'Actual: Passed (1)']

# Create annotations for each cell
z_text = [[str(val) for val in row] for row in z]  # avoid reusing the x/y label names

fig = ff.create_annotated_heatmap(z, x=x, y=y, annotation_text=z_text,
                                  colorscale='Blues', showscale=True)

fig.update_layout(
    title='Confusion Matrix',
    xaxis_title='Predicted Label',
    yaxis_title='Actual Label',
    height=400
)

fig.show()

ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve is a graphical plot that shows the trade-off between the true positive rate (recall) and false positive rate at various threshold settings.

The Area Under the Curve (AUC) provides a single number summary of model performance:

  • AUC = 1.0: Perfect classifier

  • AUC = 0.5: Random classifier (no better than chance)

  • AUC < 0.5: Worse than random (predictions are inverted)

The ROC curve helps you choose the optimal threshold for your specific application.

from sklearn.metrics import roc_curve, roc_auc_score

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:, 1])
auc_score = roc_auc_score(y_test, y_pred_proba[:, 1])

# Create ROC curve plot
fig = go.Figure()

# ROC curve
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines',
                         name=f'ROC Curve (AUC = {auc_score:.3f})',
                         line=dict(color='blue', width=2)))

# Diagonal line (random classifier)
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines',
                         name='Random Classifier',
                         line=dict(color='red', width=2, dash='dash')))

fig.update_layout(
    title='ROC Curve',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate (Recall)',
    height=450,
    showlegend=True
)

fig.show()

print(f"\nAUC Score: {auc_score:.4f}")

Decision Boundary Visualization

For a two-feature model, we can visualize the decision boundary: the line where the model predicts a probability of exactly 0.5, separating the two classes:

# Create a mesh grid
h = 0.02  # step size in the mesh
x_min, x_max = X_train_scaled[:, 0].min() - 1, X_train_scaled[:, 0].max() + 1
y_min, y_max = X_train_scaled[:, 1].min() - 1, X_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Predict for each point in the mesh
Z = log_reg.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

# Create contour plot
fig = go.Figure()

# Add contour for probabilities
fig.add_trace(go.Contour(x=xx[0], y=yy[:, 0], z=Z,
                         colorscale='RdYlGn',
                         contours=dict(start=0, end=1, size=0.1),
                         name='Probability',
                         showscale=True,
                         colorbar=dict(title='P(Pass)')))

# Add training points
train_pass = y_train == 1
train_fail = y_train == 0

fig.add_trace(go.Scatter(x=X_train_scaled[train_pass, 0],
                         y=X_train_scaled[train_pass, 1],
                         mode='markers',
                         marker=dict(color='green', size=8, symbol='circle',
                                   line=dict(color='darkgreen', width=1)),
                         name='Passed (Train)'))

fig.add_trace(go.Scatter(x=X_train_scaled[train_fail, 0],
                         y=X_train_scaled[train_fail, 1],
                         mode='markers',
                         marker=dict(color='red', size=8, symbol='x',
                                   line=dict(color='darkred', width=1)),
                         name='Failed (Train)'))

fig.update_layout(
    title='Decision Boundary and Probability Contours',
    xaxis_title='Hours Studied (Scaled)',
    yaxis_title='Previous Score (Scaled)',
    height=500
)

fig.show()

Multiclass Logistic Regression

While we’ve focused on binary classification, logistic regression can be extended to multiclass problems using:

  • One-vs-Rest (OvR): Train one classifier per class, treating it as the positive class and all others as negative

  • Multinomial/Softmax: Generalize the sigmoid function to handle multiple classes simultaneously

Scikit-learn’s LogisticRegression handles multiclass classification automatically; in recent versions it fits a multinomial (softmax) model by default rather than one-vs-rest.
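As a sketch of the multiclass case, here is the same estimator fit on a synthetic three-class dataset (generated purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data, purely for illustration
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, n_classes=3, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

print("Coefficient matrix shape:", clf.coef_.shape)  # one row of coefficients per class
probs = clf.predict_proba(X[:3])
print("Class probabilities per sample:", probs)      # each row sums to 1
```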

Regularization in Logistic Regression

Like linear regression, logistic regression can benefit from regularization to prevent overfitting:

L2 Regularization (Ridge)

Adds a penalty term to the loss function:

J(\boldsymbol{\beta}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{p}^{(i)}) + (1-y^{(i)}) \log(1-\hat{p}^{(i)}) \right] + \lambda \sum_{j=1}^{n} \beta_j^2

L1 Regularization (Lasso)

Uses absolute values instead:

J(\boldsymbol{\beta}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{p}^{(i)}) + (1-y^{(i)}) \log(1-\hat{p}^{(i)}) \right] + \lambda \sum_{j=1}^{n} |\beta_j|

In scikit-learn, the regularization strength is controlled by the parameter C, which is the inverse of $\lambda$: smaller values of C specify stronger regularization.
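A small sketch of the effect of C on the learned coefficients, using synthetic data generated only for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Smaller C -> stronger regularization -> coefficients shrink toward zero
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    print(f"C={C:>6}: sum of |coefficients| = {np.abs(model.coef_).sum():.3f}")
```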

Assumptions and Limitations

Logistic regression makes several assumptions:

  1. Linear relationship: The log-odds should be linearly related to the features

  2. Independence: Observations should be independent

  3. No multicollinearity: Features should not be highly correlated

  4. Large sample size: More reliable with larger datasets

Limitations:

  • Cannot capture complex non-linear relationships (without feature engineering)

  • Sensitive to outliers

  • Assumes linear decision boundaries

Summary

Logistic Regression is a foundational classification algorithm that:

  • Uses the sigmoid function to model probabilities

  • Learns through gradient descent on the binary cross-entropy loss

  • Provides interpretable coefficients showing feature importance

  • Works well for linearly separable problems

  • Serves as a building block for more complex models

Key takeaways:

  • The output is a probability, not a continuous value

  • We model the log-odds as a linear function of features

  • Evaluation uses classification metrics (accuracy, precision, recall, F1, AUC)

  • The decision threshold can be adjusted based on application requirements

  • Regularization helps prevent overfitting

Despite its simplicity, logistic regression remains widely used in practice due to its efficiency, interpretability, and effectiveness on many real-world problems.