
Logistic Regression Overview

Logistic Regression is a supervised learning algorithm used for binary classification problems, where the target variable can take on only two possible outcomes (e.g., “yes” or “no”, “pass” or “fail”, “spam” or “not spam”). Despite its name containing “regression,” logistic regression is actually a classification algorithm.

Real-World Applications

Logistic regression is widely used across many domains:

  • Medicine: Predicting whether a patient has a disease based on symptoms and test results

  • Finance: Determining credit risk (will a customer default on a loan?)

  • Marketing: Predicting customer churn (will a customer leave the service?)

  • Email Filtering: Classifying emails as spam or not spam

  • Admissions: Predicting whether a student will be admitted to a university

In these scenarios, we’re not predicting a continuous value like in linear regression, but rather the probability that an instance belongs to a particular class.

From Linear Regression to Logistic Regression

To understand logistic regression, let’s first recall linear regression:

y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n

Linear regression predicts continuous values and can output any real number. However, for classification, we need to predict probabilities that are bounded between 0 and 1.

The Sigmoid (Logistic) Function

The key to logistic regression is the sigmoid function (also called the logistic function), which transforms any real-valued number into a value between 0 and 1:

\sigma(z) = \frac{1}{1 + e^{-z}}

Where:

  • $z$ is the linear combination of input features: $z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$

  • $e$ is Euler's number (approximately 2.71828)

  • $\sigma(z)$ represents the probability that $y = 1$ given the input features

The complete logistic regression model is:

P(y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}}
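To make the formula concrete, here is a quick numeric sketch. The coefficients ($\beta_0 = -3$, $\beta_1 = 0.8$, $\beta_2 = 0.05$) and feature values are purely illustrative, not fitted to any data:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Purely illustrative coefficients and inputs
beta0, beta1, beta2 = -3.0, 0.8, 0.05
x1, x2 = 4.0, 70.0

z = beta0 + beta1 * x1 + beta2 * x2   # linear combination: z = 3.7
p = sigmoid(z)                        # squashed into (0, 1)
print(f"z = {z:.2f}, P(y=1|X) = {p:.3f}")
```

Any real-valued $z$, however large or small, maps to a valid probability.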

Properties of the Sigmoid Function

The sigmoid function has several important properties:

  • As $z \to \infty$, $\sigma(z) \to 1$

  • As $z \to -\infty$, $\sigma(z) \to 0$

  • When $z = 0$, $\sigma(z) = 0.5$

  • The function is smooth and differentiable everywhere (important for gradient-based optimization)

  • It has an S-shaped curve

Let’s visualize the sigmoid function:

import numpy as np
import plotly.graph_objects as go

# Define the sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Create z values
z = np.linspace(-10, 10, 200)
y = sigmoid(z)

# Create the plot
fig = go.Figure()
fig.add_trace(go.Scatter(x=z, y=y, mode='lines', name='Sigmoid Function',
                         line=dict(color='blue', width=2)))

# Add a horizontal line at y=0.5 for the decision boundary
fig.add_hline(y=0.5, line_dash="dash", line_color="red", 
              annotation_text="Decision Boundary (p=0.5)")

fig.update_layout(
    title="The Sigmoid (Logistic) Function",
    xaxis_title="z (Linear Combination of Features)",
    yaxis_title="σ(z) - Probability",
    height=400
)

fig.show()

Log-Odds and the Logit Function

To understand logistic regression from a probability perspective, we introduce the concept of odds and log-odds.

Odds

If the probability of an event is $p$, then the odds of that event are:

\text{Odds} = \frac{p}{1-p}

For example, if the probability of rain is 0.75, the odds are $\frac{0.75}{0.25} = 3$, meaning rain is 3 times as likely as no rain.
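The probability-to-odds conversion can be sketched as two tiny helper functions (the function names here are my own, for illustration):

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def prob_from_odds(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

print(odds(0.75))           # 3.0 (rain is 3 times as likely as no rain)
print(prob_from_odds(3.0))  # 0.75
```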

Log-Odds (Logit)

The log-odds (or logit) is the natural logarithm of the odds:

\text{logit}(p) = \log\left(\frac{p}{1-p}\right)

In logistic regression, we model the log-odds as a linear function of the input features:

\log\left(\frac{P(y=1|X)}{1-P(y=1|X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n

This is why logistic regression is also called a Generalized Linear Model (GLM): it models a transformation of the probability (the log-odds) as a linear combination of features.

By taking the exponential of both sides and rearranging, we arrive back at the sigmoid function we saw earlier.
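One way to see this relationship is that the logit and sigmoid are inverse functions. A short numerical sanity check (a sketch, not part of the lesson's code):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logit(p):
    return np.log(p / (1 - p))

# Applying logit after sigmoid recovers the original linear term z
for z in (-2.0, 0.0, 1.5):
    assert np.isclose(logit(sigmoid(z)), z)
print("logit(sigmoid(z)) == z for all tested z")
```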

Mathematical Connections

Logistic regression connects to several mathematical concepts from your previous coursework:

Probability Theory

  • The output is a probability: $P(y=1|X) \in [0,1]$

  • We use the Bernoulli distribution (binary outcomes)

  • Maximum Likelihood Estimation (MLE) is used to find optimal parameters

Calculus

  • The derivative of the sigmoid function has a convenient form:

\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z))

  • This makes gradient descent optimization efficient

  • The loss function involves logarithms and requires calculus to minimize
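The convenient derivative identity above can be verified numerically against a central finite difference; this is just a sanity-check sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-5, 5, 11)
analytic = sigmoid(z) * (1 - sigmoid(z))           # closed-form derivative

h = 1e-6                                           # central finite difference
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

assert np.allclose(analytic, numeric, atol=1e-6)
print("Derivative identity confirmed numerically")
```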

Linear Algebra

  • The linear combination of features can be written as a dot product:

z = \boldsymbol{\beta}^T \mathbf{X} = \beta_0 + \sum_{i=1}^{n} \beta_i X_i

  • Matrix operations are used for efficient computation with multiple samples

The Loss Function: Binary Cross-Entropy

To train a logistic regression model, we need to define a loss function that measures how well our predictions match the actual labels. The standard loss function for logistic regression is binary cross-entropy (also called log loss):

For a single observation:

L(y, \hat{p}) = -[y \log(\hat{p}) + (1-y) \log(1-\hat{p})]

Where:

  • $y$ is the actual label (0 or 1)

  • $\hat{p}$ is the predicted probability that $y = 1$

For the entire dataset, we take the average:

J(\boldsymbol{\beta}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{p}^{(i)}) + (1-y^{(i)}) \log(1-\hat{p}^{(i)}) \right]

Where $m$ is the number of training examples.

This loss function:

  • Heavily penalizes confident wrong predictions

  • Is derived from the principle of Maximum Likelihood Estimation

  • Is convex, which guarantees a global minimum
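To connect the formula to code, here is a small sketch that computes the average binary cross-entropy directly and checks it against scikit-learn's log_loss. The labels and probabilities below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # hypothetical predicted probabilities

# Average binary cross-entropy, straight from the formula
bce = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))

assert np.isclose(bce, log_loss(y_true, p_hat))
print(f"Binary cross-entropy: {bce:.4f}")
```

Try replacing 0.9 with 0.99 for a wrong label to see how sharply a confident mistake is penalized.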

Model Training: Gradient Descent

Like linear regression, logistic regression uses gradient descent to find the optimal parameters $\boldsymbol{\beta}$ that minimize the loss function.

The update rule for each parameter is:

\beta_j := \beta_j - \alpha \frac{\partial J(\boldsymbol{\beta})}{\partial \beta_j}

Where:

  • $\alpha$ is the learning rate

  • The gradient (partial derivative) tells us the direction to adjust the parameter

For logistic regression, the gradient has the form:

\frac{\partial J(\boldsymbol{\beta})}{\partial \beta_j} = \frac{1}{m} \sum_{i=1}^{m} (\hat{p}^{(i)} - y^{(i)}) x_j^{(i)}

Notice this has the same form as the linear regression gradient, but $\hat{p}$ comes from the sigmoid function!
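The update rule and gradient above are enough to train a tiny logistic regression from scratch. The following is an illustrative sketch on synthetic data (the "true" coefficients are made up), not scikit-learn's actual implementation:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
m = 500
X = np.column_stack([np.ones(m), rng.normal(size=m)])   # bias column + one feature
true_beta = np.array([-1.0, 2.0])                       # made-up generating parameters
y = (rng.uniform(size=m) < sigmoid(X @ true_beta)).astype(float)

beta = np.zeros(2)
alpha = 0.5                                             # learning rate
for _ in range(2000):
    p_hat = sigmoid(X @ beta)
    grad = X.T @ (p_hat - y) / m                        # the gradient formula above
    beta -= alpha * grad                                # the update rule above

print("Estimated coefficients:", np.round(beta, 2))
```

With enough iterations, the estimates land close to the generating parameters.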

Making Predictions

Once trained, logistic regression makes predictions in two steps:

  1. Calculate the probability:

     \hat{p} = P(y=1|X) = \sigma(\boldsymbol{\beta}^T \mathbf{X})

  2. Apply a decision threshold (typically 0.5):

     \hat{y} = \begin{cases} 1 & \text{if } \hat{p} \geq 0.5 \\ 0 & \text{if } \hat{p} < 0.5 \end{cases}

The threshold can be adjusted based on the problem requirements:

  • Higher threshold: More conservative predictions of the positive class (fewer false positives)

  • Lower threshold: More aggressive predictions of the positive class (fewer false negatives)
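A quick sketch of how the threshold changes the predicted labels, using a few hypothetical predicted probabilities:

```python
import numpy as np

p_hat = np.array([0.15, 0.40, 0.55, 0.70, 0.92])   # hypothetical predicted probabilities

for threshold in (0.3, 0.5, 0.7):
    y_hat = (p_hat >= threshold).astype(int)
    print(f"threshold={threshold}: predictions={y_hat.tolist()}")
```

Raising the threshold flips borderline cases from positive to negative without retraining the model.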

Logistic Regression in Python

Let’s implement logistic regression using scikit-learn with a practical example. We’ll create a dataset to predict whether a student will pass an exam based on hours studied and previous test scores.

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Create a sample dataset
np.random.seed(42)
n_samples = 200

# Features: hours studied and previous test score
hours_studied = np.random.uniform(0, 10, n_samples)
previous_score = np.random.uniform(40, 100, n_samples)

# Target: pass (1) or fail (0)
# Higher hours and previous scores increase probability of passing
z = -8 + 0.7 * hours_studied + 0.1 * previous_score + np.random.normal(0, 1, n_samples)
probability = 1 / (1 + np.exp(-z))
passed = (probability > 0.5).astype(int)

# Create DataFrame
df = pd.DataFrame({
    'HoursStudied': hours_studied,
    'PreviousScore': previous_score,
    'Passed': passed
})

print("Dataset Overview:")
print(df.head(10))
print(f"\nPass Rate: {df['Passed'].mean():.2%}")

Data Visualization

Let’s visualize the relationship between our features and the target variable:

import plotly.express as px

fig = px.scatter(df, x='HoursStudied', y='PreviousScore',
                 color=df['Passed'].astype(str),  # cast to string so colors are discrete, not a continuous scale
                 color_discrete_map={'0': 'red', '1': 'green'},
                 labels={'color': 'Exam Result'},
                 title='Student Exam Results by Study Hours and Previous Score',
                 height=450)

fig.update_traces(marker=dict(size=8, line=dict(width=1, color='DarkSlateGrey')))
fig.show()

Data Splitting

We split our data into training and test sets:

# Prepare features and target
X = df[['HoursStudied', 'PreviousScore']]
y = df['Passed']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

Feature Scaling

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Feature scaling applied")
print(f"Original HoursStudied range: [{X_train['HoursStudied'].min():.2f}, {X_train['HoursStudied'].max():.2f}]")
print(f"Scaled HoursStudied range: [{X_train_scaled[:, 0].min():.2f}, {X_train_scaled[:, 0].max():.2f}]")

Model Training

Now we train the logistic regression model:

# Initialize and train the model
log_reg = LogisticRegression(random_state=42, max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

# Display the learned parameters
print("Model Coefficients:")
print(f"  Intercept (β₀): {log_reg.intercept_[0]:.4f}")
print(f"  HoursStudied (β₁): {log_reg.coef_[0][0]:.4f}")
print(f"  PreviousScore (β₂): {log_reg.coef_[0][1]:.4f}")

print("\nInterpretation:")
print(f"  A one standard deviation increase in HoursStudied increases")
print(f"  the log-odds of passing by {log_reg.coef_[0][0]:.4f}")
print(f"  A one standard deviation increase in PreviousScore increases")
print(f"  the log-odds of passing by {log_reg.coef_[0][1]:.4f}")

Making Predictions

Logistic regression provides both predicted probabilities and class labels:

# Get predicted probabilities
y_pred_proba = log_reg.predict_proba(X_test_scaled)

# Get predicted class labels
y_pred = log_reg.predict(X_test_scaled)

# Create a results DataFrame
results_df = pd.DataFrame({
    'HoursStudied': X_test['HoursStudied'].values,
    'PreviousScore': X_test['PreviousScore'].values,
    'Actual': y_test.values,
    'Prob_Fail': y_pred_proba[:, 0],
    'Prob_Pass': y_pred_proba[:, 1],
    'Predicted': y_pred
})

print("Sample Predictions:")
print(results_df.head(10))

Evaluation Metrics for Classification

Unlike regression, we use different metrics to evaluate classification models:

Confusion Matrix

The confusion matrix shows the counts of correct and incorrect predictions:

                     | Predicted Negative (0) | Predicted Positive (1)
Actual Negative (0)  | True Negative (TN)     | False Positive (FP)
Actual Positive (1)  | False Negative (FN)    | True Positive (TP)

Key Metrics

From the confusion matrix, we derive several important metrics:

Accuracy: Overall proportion of correct predictions

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

Precision: Of all positive predictions, what proportion was actually positive?

\text{Precision} = \frac{TP}{TP + FP}

Recall (Sensitivity): Of all actual positives, what proportion was correctly identified?

\text{Recall} = \frac{TP}{TP + FN}

F1-Score: Harmonic mean of precision and recall

F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

Let’s calculate these metrics:

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print(f"\nTrue Negatives: {cm[0,0]}")
print(f"False Positives: {cm[0,1]}")
print(f"False Negatives: {cm[1,0]}")
print(f"True Positives: {cm[1,1]}")

# Calculate metrics
print("\n" + "="*50)
print("MODEL PERFORMANCE METRICS")
print("="*50)
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred):.4f}")

# Detailed classification report
print("\n" + "="*50)
print("CLASSIFICATION REPORT")
print("="*50)
print(classification_report(y_test, y_pred, target_names=['Failed', 'Passed']))

Visualizing the Confusion Matrix

import plotly.figure_factory as ff

# Create confusion matrix heatmap
z = cm
x = ['Predicted: Failed (0)', 'Predicted: Passed (1)']
y = ['Actual: Failed (0)', 'Actual: Passed (1)']

# Create annotations for each cell
z_text = [[str(val) for val in row] for row in z]  # avoid reusing the x/y label names

fig = ff.create_annotated_heatmap(z, x=x, y=y, annotation_text=z_text,
                                  colorscale='Blues', showscale=True)

fig.update_layout(
    title='Confusion Matrix',
    xaxis_title='Predicted Label',
    yaxis_title='Actual Label',
    height=400
)

fig.show()

ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve is a graphical plot that shows the trade-off between the true positive rate (recall) and false positive rate at various threshold settings.

The Area Under the Curve (AUC) provides a single number summary of model performance:

  • AUC = 1.0: Perfect classifier

  • AUC = 0.5: Random classifier (no better than chance)

  • AUC < 0.5: Worse than random (predictions are inverted)

The ROC curve helps you choose the optimal threshold for your specific application.

from sklearn.metrics import roc_curve, roc_auc_score

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba[:, 1])
auc_score = roc_auc_score(y_test, y_pred_proba[:, 1])

# Create ROC curve plot
fig = go.Figure()

# ROC curve
fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines',
                         name=f'ROC Curve (AUC = {auc_score:.3f})',
                         line=dict(color='blue', width=2)))

# Diagonal line (random classifier)
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines',
                         name='Random Classifier',
                         line=dict(color='red', width=2, dash='dash')))

fig.update_layout(
    title='ROC Curve',
    xaxis_title='False Positive Rate',
    yaxis_title='True Positive Rate (Recall)',
    height=450,
    showlegend=True
)

fig.show()

print(f"\nAUC Score: {auc_score:.4f}")

Decision Boundary Visualization

For a two-feature model, we can visualize the decision boundary: the line where the model predicts a probability of exactly 0.5, separating the two classes:

# Create a mesh grid
h = 0.02  # step size in the mesh
x_min, x_max = X_train_scaled[:, 0].min() - 1, X_train_scaled[:, 0].max() + 1
y_min, y_max = X_train_scaled[:, 1].min() - 1, X_train_scaled[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Predict for each point in the mesh
Z = log_reg.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

# Create contour plot
fig = go.Figure()

# Add contour for probabilities
fig.add_trace(go.Contour(x=xx[0], y=yy[:, 0], z=Z,
                         colorscale='RdYlGn',
                         contours=dict(start=0, end=1, size=0.1),
                         name='Probability',
                         showscale=True,
                         colorbar=dict(title='P(Pass)')))

# Add training points
train_pass = y_train == 1
train_fail = y_train == 0

fig.add_trace(go.Scatter(x=X_train_scaled[train_pass, 0],
                         y=X_train_scaled[train_pass, 1],
                         mode='markers',
                         marker=dict(color='green', size=8, symbol='circle',
                                   line=dict(color='darkgreen', width=1)),
                         name='Passed (Train)'))

fig.add_trace(go.Scatter(x=X_train_scaled[train_fail, 0],
                         y=X_train_scaled[train_fail, 1],
                         mode='markers',
                         marker=dict(color='red', size=8, symbol='x',
                                   line=dict(color='darkred', width=1)),
                         name='Failed (Train)'))

fig.update_layout(
    title='Decision Boundary and Probability Contours',
    xaxis_title='Hours Studied (Scaled)',
    yaxis_title='Previous Score (Scaled)',
    height=500
)

fig.show()

Multiclass Logistic Regression

While we’ve focused on binary classification, logistic regression can be extended to multiclass problems using:

  • One-vs-Rest (OvR): Train one classifier per class, treating it as the positive class and all others as negative

  • Multinomial/Softmax: Generalize the sigmoid function to handle multiple classes simultaneously

Scikit-learn’s LogisticRegression handles multiclass classification automatically; in recent versions it fits a multinomial (softmax) model by default rather than one-vs-rest.
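As a sketch of the multiclass case, here is the same estimator fit on a synthetic three-class dataset (generated purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data, purely for illustration
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, n_classes=3, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

print("Coefficient matrix shape:", clf.coef_.shape)  # one row of coefficients per class
probs = clf.predict_proba(X[:3])
print("Class probabilities per sample:", probs)      # each row sums to 1
```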

Regularization in Logistic Regression

Like linear regression, logistic regression can benefit from regularization to prevent overfitting:

L2 Regularization (Ridge)

Adds a penalty term to the loss function:

J(\boldsymbol{\beta}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{p}^{(i)}) + (1-y^{(i)}) \log(1-\hat{p}^{(i)}) \right] + \lambda \sum_{j=1}^{n} \beta_j^2

L1 Regularization (Lasso)

Uses absolute values instead:

J(\boldsymbol{\beta}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{p}^{(i)}) + (1-y^{(i)}) \log(1-\hat{p}^{(i)}) \right] + \lambda \sum_{j=1}^{n} |\beta_j|

In scikit-learn, the regularization strength is controlled by the parameter C, which is the inverse of $\lambda$: smaller values of C specify stronger regularization.
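A small sketch of the effect of C on the learned coefficients, using synthetic data generated only for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Smaller C -> stronger regularization -> coefficients shrink toward zero
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    print(f"C={C:>6}: sum of |coefficients| = {np.abs(model.coef_).sum():.3f}")
```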

Assumptions and Limitations

Logistic regression makes several assumptions:

  1. Linear relationship: The log-odds should be linearly related to the features

  2. Independence: Observations should be independent

  3. No multicollinearity: Features should not be highly correlated

  4. Large sample size: More reliable with larger datasets

Limitations:

  • Cannot capture complex non-linear relationships (without feature engineering)

  • Sensitive to outliers

  • Assumes linear decision boundaries

Summary

Logistic Regression is a foundational classification algorithm that:

  • Uses the sigmoid function to model probabilities

  • Learns through gradient descent on the binary cross-entropy loss

  • Provides interpretable coefficients showing feature importance

  • Works well for linearly separable problems

  • Serves as a building block for more complex models

Key takeaways:

  • The output is a probability, not a continuous value

  • We model the log-odds as a linear function of features

  • Evaluation uses classification metrics (accuracy, precision, recall, F1, AUC)

  • The decision threshold can be adjusted based on application requirements

  • Regularization helps prevent overfitting

Despite its simplicity, logistic regression remains widely used in practice due to its efficiency, interpretability, and effectiveness on many real-world problems.