Skip to article frontmatterSkip to article content

Linear Regression with statsmodels

Let’s explore linear regression using a familiar example dataset of student grades. Our goal will be to train a model to predict a student’s grade given the number of hours they have studied.

In this implementation, we will use the statsmodels package to achieve this.

Data Loading

Loading the data:

from pandas import read_csv

repo_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance"
request_url = f"{repo_url}/main/docs/data/grades.csv"

df = read_csv(request_url)
df.head()

Data Exploration

Dropping null values:

df.dropna(inplace=True)
df.tail()

Exploring relationship between variables:

import plotly.express as px

px.scatter(df, x="StudyHours", y="Grade", height=350,
            title="Relationship between Study Hours and Grades",
            trendline="ols", trendline_color_override="red",
)

Data Splitting

X/Y Split

Identifying the dependent and independent variables:

x = df["StudyHours"]
print(x.shape)

y = df["Grade"]
print(y.shape)

Adding Constants

When using `statsmodels`, the [documentation](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html) instructs us to manually add a column of ones (to help the model perform calculations related to the y-intercept):
import statsmodels.api as sm

x = sm.add_constant(x) # adding in a column of constants, as per the OLS docs
x.head()

Train Test Split

Now we split the training and test sets:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)

Model Selection and Training

Selecting a linear regression (OLS) model, and training it on the training data to learn the ideal weights:

import statsmodels.api as sm

model = sm.OLS(y_train, x_train, missing="drop")
print(type(model))

results = model.fit()
print(type(results))

It is possible to access the training results, including some summary statistics:

print(results.summary())

In a regression analysis, each coefficient (the number associated with a feature in the model) has a corresponding t-statistic that tests whether the coefficient is meaningfully different from zero.

Interpreting the results:

  + **T-statistic**: Measures how many standard deviations the coefficient is away from zero. A larger t-statistic suggests that the coefficient is far from zero and potentially significant.

  + **P-value** (P>|t|): This tells you the probability that the observed t-statistic would occur if the coefficient were actually zero (the null hypothesis). If this probability is very small (typically < 0.05), it means it’s unlikely that the coefficient is zero, suggesting it is statistically significant.

Interpreting p-values:

  + A low p-value (typically less than 0.05) suggests that you can reject the null hypothesis, meaning the coefficient is statistically significant and likely has an impact on the dependent variable.
  + A high p-value (greater than 0.05) indicates that the coefficient is not statistically significant, implying that the feature may not contribute meaningfully to the model.

The training results contain an r-squared score, however this represents the error for the training data. To get the real results of how the model generalizes to the test data, we will calculate the r-squared score and other metrics on the test results later.

The part of the training results we care about are the the learned weights (i.e. coefficients), which we use to arrive at the line of best fit:

print(results.params)
print("-----------")
print(f"y =", f"{results.params['StudyHours'].round(3)}x ",
            f"+ {results.params['const'].round(3)}"
)

The training results also contain the fittedvalues (predictions), as well as the resid (residuals or errors). We can compare each of the predicted values against the actual known values, to verify the residuals for illustration purposes:

from pandas import DataFrame

# get all rows from the original dataset that wound up in the training set:
training_set = df.loc[x_train.index].copy()

# create a dataset for the predictions and the residuals:
training_preds = DataFrame({
    "Predictions": results.fittedvalues,
    "Residuals": results.resid
})
# merge the training set with the results:
training_set = training_set.merge(training_preds, how="inner",
                                  left_index=True, right_index=True
)

# calculate error for each datapoint:
training_set["My Error"] = training_set["Grade"] - training_set["Predictions"]
training_set

It is possible to calculate the training metrics ourselves, to verify the regression results summary we saw above:

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

r2 = r2_score(training_set["Grade"], training_set["Predictions"])
print("R^2:", round(r2, 3))

mae = mean_absolute_error(training_set["Grade"], training_set["Predictions"])
print("MAE:", round(mae, 3))

mse = mean_squared_error(training_set["Grade"], training_set["Predictions"])
print("MSE:", round(mse,3))

rmse = mse ** .5
print("RMSE:", rmse.round(3))

Remember, these metrics only tell us about the predictive accuracy on the training data, and we care more about evaluating metrics on the test set.

Model Predictions and Evaluation

Alright, we trained the model, but how well does it do in making predictions?

We use the trained model to make predictions on the unseen (test) data:

preds = results.get_prediction(x_test)
print(type(preds))

preds_df = preds.summary_frame(alpha=0.05)
preds_df

When we use the summary_frame method on prediction results, it returns a DataFrame with several columns. Here’s a breakdown of what they mean:

  • Prediction (mean): This is the predicted value for each observation based on the model. It’s essentially the point prediction (ŷ) for the corresponding input.

  • Standard Error (mean_se): This stands for the standard error of the predicted mean. It measures the uncertainty associated with the predicted value due to sampling variability. A smaller mean_se indicates higher confidence in the predicted mean.

  • Confidence Interval (mean_ci_lower and mean_ci_upper): Represents the range in which the true mean prediction is likely to lie. For predicting the average value (e.g. “the average apple weight is between 140 and 160 grams”).
  • Prediction Interval (obs_ci_lower and obs_ci_upper): Represents the range in which an individual new observation is likely to lie. For predicting the range where individual values could fall (e.g. “an individual apple might weigh between 120 and 180 grams”).

Merging the actual values in, so we can compare predicted values vs actual values:

# reconstruct test set:
test_set = df.loc[x_test.index].copy()

# merge in prediction results:
test_set = test_set.merge(preds_df, how="inner",
                         left_index=True, right_index=True
)
test_set.rename(columns={"mean":"Prediction", "mean_se": "Standard Error",
                        "mean_ci_upper": "CI Upper", "mean_ci_lower": "CI Lower",
                        "obs_ci_upper": "PI Upper", "obs_ci_lower": "PI Lower",
                        }, inplace=True
)
test_set["My Error"] = test_set["Grade"] - test_set["Prediction"]
test_set

Visualizing the predictions:

import plotly.express as px

chart_df = test_set.copy()

# Create the plot
fig = px.scatter(chart_df, x="StudyHours", y="Grade", height=350,
                 title="True vs Predicted Grade (with Pred Interval and Confidence Interval)"
                 )

fig.add_scatter(x=chart_df["StudyHours"], y=chart_df['Prediction'],
                mode='lines+markers',
                name='Prediction (with PI)',
                marker=dict(color='mediumpurple'), #size=10, symbol="x"
                error_y=dict(type='data', symmetric=False,
                             array=chart_df['PI Upper'],
                             arrayminus=chart_df['PI Lower']),
                legendrank=1)

fig.add_scatter(x=chart_df["StudyHours"], y=chart_df['Prediction'],
                mode='lines+markers',
                name='Prediction (with CI)',
                marker=dict(color='lightblue',# size=10, #symbol="x"
                ),
                error_y=dict(type='data', symmetric=False,
                             array=chart_df['CI Upper'],
                             arrayminus=chart_df['CI Lower']),
                legendrank=1)

# Now add the scatter for the true values again, placing this plot behind
# (hack to get the actual values in front)
fig.add_scatter(x=chart_df["StudyHours"], y=chart_df['Grade'],
                mode='markers',
                name='True Value',
                marker=dict(color='blue', size=10, symbol="x"),
                legendrank=2)

fig.show()

Evaluating the model using our evaluation metrics from sklearn:

from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

y_pred = test_set["Prediction"]

r2 = r2_score(y_test, y_pred)
print("R^2:", round(r2, 3))

mae = mean_absolute_error(y_test, y_pred)
print("MAE:", round(mae, 3))

mse = mean_squared_error(y_test, y_pred)
print("MSE:", round(mse,3))

rmse = mse ** .5
print("RMSE:", rmse.round(3))

We see the same results on the test set that we did with the sklearn regression implementation.