Skip to article frontmatterSkip to article content

Linear Regression with sklearn

Let’s explore linear regression using an example dataset of student grades. Our goal will be to train a model to predict a student’s grade given the number of hours they have studied.

Data Loading

Loading the data:

import pandas as pd
from pandas import read_csv, Series, DataFrame
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
import numpy as np
import statsmodels
import plotly.express as px
from pandas import read_csv

repo_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance"
request_url = f"{repo_url}/main/docs/data/grades.csv"

df = read_csv(request_url)
print(len(df))
df.head()
Loading...

Data Exploration

Checking for null values:

df["StudyHours"].isna().sum()
np.int64(1)
df.tail()
Loading...

For “Ali”, we don’t have a grade or number of study hours, so we should drop that row.

For “Juan”, since there is no label, we can’t use this record to train the model, but we could use the trained model to predict their grade later (given 8 study hours).

Dropping nulls:

df.dropna(inplace=True)
df.tail()
Loading...

Exploring relationship between variables:

Checking for normality and outliers:

px.violin(df, x="StudyHours", box=True, points="all", height=350,
    title="Distribution of Study Hours",
)
Loading...
df = df[df['StudyHours'] >= 2]
df.reset_index(drop=True, inplace=True)
df.head()
Loading...
px.violin(df, x="StudyHours", box=True, points="all", height=350,
    title="Distribution of Study Hours",
)
Loading...
px.violin(df, x="Grade", box=True, points="all", height=350,
            title="Distribution of Grade"
)
Loading...

Data Splitting

X/Y Split

Start Here for In-Class Assignment

Identifying the dependent and independent variables.

#x = df["StudyHours"] # ValueError: Expected 2D array, got 1D array instead
x = df[["StudyHours"]] # model wants x to be a matrix / DataFrame
print(x.shape)

y = df["Grade"]
print(y.shape)
(21, 1)
(21,)
When using `sklearn`, we must construct the features as a two-dimensional array (even if the data only contains one column).


### Train Test Split

Splitting the data randomly into test and training sets. We will train the model on the training set, and evaluate the model using the test set. This helps for generalizability, and to prevent overfitting.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=99)
print("TRAIN:", x_train.shape, y_train.shape)
print("TEST:", x_test.shape, y_test.shape)
TRAIN: (15, 1) (15,)
TEST: (6, 1) (6,)

Model Selection and Training

Selecting a linear regression (OLS), and training it on the training data to learn the ideal weights:

model = LinearRegression()

model.fit(x_train, y_train)
Loading...

After the model is trained, we have access to the ideal weights (i.e. “coefficients”). There is one coefficient for each feature (in this case only one: number of hours studied).

print("COEFS:", model.coef_.round(3)) # one for each feature
print("Y INTERCEPT:", model.intercept_.round(3))
COEFS: [6.269]
Y INTERCEPT: -16.095
coefs = Series(model.coef_, index=model.feature_names_in_)
print(coefs)
StudyHours    6.268758
dtype: float64

The coefficients and y-intercept tell us the line of best fit:

print("--------------")
print(f"EQUATION FOR LINE OF BEST FIT:")
print(f"y = ({round(model.coef_[0], 3)} * StudyHours) + {round(model.intercept_, 3)}")
--------------
EQUATION FOR LINE OF BEST FIT:
y = (6.269 * StudyHours) + -16.095

Model Predictions and Evaluation

Alright, we trained the model, but how well does it do in making predictions?

We use the trained model to make predictions on the unseen (test) data:

y_pred = model.predict(x_test)
print(y_pred)
[34.05538974 40.32414798 81.07107656 40.32414798 46.59290623 46.59290623]

We can then compare each of the predicted values against the actual known values:

# get all rows from the original dataset that wound up in the test set:
test_set = df.loc[x_test.index].copy()

# create a column for the predictions:
test_set["PredictedGrade"] = y_pred.round(1)

# calculate error for each datapoint:
test_set["Error"] = (y_pred - y_test).round(1)

test_set.sort_values(by="StudyHours", ascending=False)
Loading...

Plotting the errors on a graph:

px.scatter(test_set, x="StudyHours", y=["Grade", "PredictedGrade"],
           hover_data="Name", height=350,
           title=f"Prediction errors (test set)",
           labels={"value":""}
)
Loading...

To get a measure for how well the model did across the entire test dataset, we can use any number of desired regression metrics (r-squared score, mean squared error, mean absolute error, root mean sqared error), to see how well the model does.

It is possible for us to compute our own metrics:

my_mae = test_set["Error"].abs().mean()
print("MY MAE:", my_mae.round(3))
MY MAE: 6.833
my_mse = (test_set["Error"] ** 2).mean()
print("MY MSE:", my_mse.round(1))
MY MSE: 87.2

However more commonly we will use metric functions from the sklearn.metrics submodule:

r2 = r2_score(y_test, y_pred)
print("R^2:", round(r2, 3))

mae = mean_absolute_error(y_test, y_pred)
print("MAE:", round(mae, 3))

mse = mean_squared_error(y_test, y_pred)
print("MSE:", round(mse,3))
R^2: 0.682
MAE: 6.823
MSE: 86.795
rmse = mse ** 0.5
print("RMSE:", round(rmse, 3))
RMSE: 9.316
# Is there another to help us better understand if this is big or not, in terms of the target variable?

Inference

Now that the model has been trained and deemed to have a sufficient performance, we can use it to make predictions on unseen data (sometimes called “inference”):

x_new = DataFrame({"StudyHours": [0, 4, 8, 12, 16, 20]})

model.predict(x_new)
array([-16.0946762 , 8.98035677, 34.05538974, 59.13042271, 84.20545568, 109.28048865])

Alright, we have trained a model and used it to make predictions.

In Class

Using the same dataset but run the code below to add another feature, rerun the model as a multiple regression problem with the new feature. Keep grade as the target variable. What is the RSME of the new model? Did the coefficient value of study time change?

# Generate random noise
np.random.seed(42)
noise = np.random.normal(0, 1, len(df))

# Create the GPA feature
df['gpa'] = 2.0 + (df['Grade'] - df['Grade'].mean()) * 0.6 / df['Grade'].std() + noise * 0.1

# Ensure GPA is within the range [2.0, 4.0]
df['gpa'] = df['gpa'].clip(2.0, 4.0)

print(df[['Grade', 'gpa']].corr())
df.head()
Loading...