Analytics LABS#

Lab 1 - Mastermind entropy#

DS 1001 - Spring 2023 - Professors Wright and Alonzi

Due: End of lab period (or later that day)

Submission format: Word doc or PDF summarizing your findings

Individual Assignment

General Description: This lab is designed to introduce the basic concepts behind information gain and entropy. Mastermind is a game in which, through a series of guesses, one player tries to determine a “code” created by another player. In doing so, the value of asking questions that provide a high level of information becomes paramount.

Preparatory Assignments - None

Why am I doing this? To gain a better understanding of information gain and entropy through hands-on experience.

What am I going to do? Work with a partner or a group of three and alternate playing the game. At the onset of each game the code-maker should calculate the entropy of the pegs chosen for the “code”. The code-breaker should note the results of the game, including whether the code was broken and the number of rounds used. Try to be the code-maker and the code-breaker at least 3 times each. The code-maker should intentionally use very different types of “code”; this should impact how the game is played.

  • Note that the range of entropy for binary cases is 0 to 1; for more than 2 classes it is 0 to log2(k), where k is the number of classes. (A short sketch of how to compute entropy in Python follows below.)
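
As a point of reference only (this sketch is not part of the original lab, and the example codes are hypothetical), the code-maker could compute the entropy of a chosen code in Python from the counts of each peg color:

from collections import Counter
from math import log2

def code_entropy(pegs):
    # Shannon entropy (in bits) of the color distribution of a code
    counts = Counter(pegs)
    total = len(pegs)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

print(code_entropy(["red", "red", "red", "red"]))       # 0.0: one color, no uncertainty
print(code_entropy(["red", "blue", "red", "blue"]))     # 1.0: the binary 50/50 case
print(code_entropy(["red", "blue", "green", "white"]))  # 2.0: four equally likely colors, log2(4)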

Answer these questions:

  1. What combinations of pegs (code) seemed to be harder to break?

  2. Did your approach to asking questions change as you played?

  3. Where in the game is information gain presented? Describe it.

Tips for success:

  • Don’t worry about winning, instead think about what is happening during game play.

  • Work as a team with your group.

  • Try to document the process used right at the start of the lab; don’t wait until after you have played to start taking notes.

How will I know I have succeeded:

| Specs Category | Specs Details |
| --- | --- |
| Formatting | Submit via Canvas a PDF or Word document |
| Text | Answer the above questions; when you were the code-maker, submit the entropy; when you were the code-breaker, submit the results |

Acknowledgements: Special thanks to Jess Taggart from UVA CTE for coaching us. This structure is pulled directly from Steifer & Palmer (2020).

Lab 2 - Garbage In Garbage Out#

Open In Colab

# pip install statsmodels
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import statsmodels.api as sm

What we are going to do is create a linear model using the Old Faithful data. Then we are going to add Gaussian noise iteratively to the dataset and pass the new noisy dataset to the same linear regression model, tracking the ability of the model to learn the target as the noise increases.

  1. Read this brief wiki on garbage in garbage out: article

  2. Work through the below notebook, paying close attention to the comments and the outputs of the code.

  3. Answer the questions at the end as they relate to the process of adding noise to a dataset and then passing it to a linear regression model.

# The dataset we will be using is a well-known record of eruption and waiting times for Old Faithful. 
# Take a look at the documentation 
# here: https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/faithful


# This is our data source (.tsv == tab separated); here we are creating a variable that holds the url to the data.
# If you click the {x} in the ribbon to the left you will see the variables you have created.
url="https://gist.githubusercontent.com/curran/4b59d1046d9e66f2787780ad51a1cd87/raw/9ec906b78a98cf300947a37b56cfe70d01183200/data.tsv"


# Here we are reading in the data, using the tab character as a separator so the data will be loaded 
# in correctly.  pd stands for pandas and "pd.read_csv" is the function we are 
# using to read in the data from the url.
old_faith = pd.read_csv(url, sep='\t')

Let’s take a quick look at a scatter plot to get an idea of the distribution of the data.#

#plt (matplotlib.pyplot) is the library we are using to create our scatter plot.  We are using the "scatter" function, 
# and the alpha is the transparency of the dots.
plt.scatter(old_faith.waiting,old_faith.eruptions, alpha=0.5)
# here we are adding the labels to the x and y axis
plt.ylabel("Eruption Seconds")
plt.xlabel("Waiting Minutes")

#this is the function that renders the plot
plt.show()

# Note the patterns that you see, (you'll need this for your submission) 
[Output: scatter plot of Waiting Minutes (x-axis) versus Eruption Seconds (y-axis)]

Next let’s build a linear regression model relating waiting time and eruption length#

#Here we are using the statsmodels library to create a linear regression model. sm stands for statsmodels and OLS 
# stands for Ordinary Least Squares.  We are using the "fit" function to fit the model to the data.
# Note that sm.OLS takes the dependent (target) variable first and the predictor second, so this model uses 
# 'eruptions' to explain 'waiting'. No intercept is added, which is why the summary reports an uncentered R2.
lm = sm.OLS(old_faith['waiting'],old_faith['eruptions']).fit()

# Now we are printing out the summary of the model (lm). The summary function will give you the R2 measure, along with 
# other measures of model fit. 
lm.summary()

#Using the R2 measure, how well does the model fit the data (scale of 0 to 1,
#  where 1 is a perfect fit; again, needed for your submission)?
OLS Regression Results
Dep. Variable:      waiting            R-squared (uncentered):       0.973
Model:              OLS                Adj. R-squared (uncentered):  0.973
Method:             Least Squares      F-statistic:                  9621.
Date:               Thu, 30 May 2024   Prob (F-statistic):           9.97e-214
Time:               17:12:00           Log-Likelihood:               -1060.7
No. Observations:   272                AIC:                          2123.
Df Residuals:       271                BIC:                          2127.
Df Model:           1
Covariance Type:    nonrobust

                 coef    std err          t      P>|t|      [0.025      0.975]
eruptions     19.4019      0.198     98.086      0.000      19.013      19.791

Omnibus:             31.065   Durbin-Watson:       2.646
Prob(Omnibus):        0.000   Jarque-Bera (JB):   10.019
Skew:                 0.148   Prob(JB):            0.00667
Kurtosis:             2.108   Cond. No.            1.00


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
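
If you are curious where the “uncentered” R2 in the notes comes from, here is a small optional check (a sketch, not part of the original notebook) that reproduces it from the fitted model:

# Because the model has no intercept, statsmodels reports the uncentered R2:
# R2 = 1 - sum(residuals**2) / sum(y**2)   (the denominator is not mean-centered)
r2_uncentered = 1 - (lm.resid**2).sum() / (old_faith['waiting']**2).sum()
print(r2_uncentered)   # should match the R-squared (uncentered) value in the summary above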

Now we are creating “noise” that follows a normal (Gaussian) distribution, hence Gaussian noise, to add to the data.#

# mu = mean and sigma = standard deviation; these are the inputs you will be changing
mu, sigma = 0, .5 
# We need to create noise with the same dimensions as the dataset (272,2); this will allow us 
# to combine them easily
noise = np.random.normal(mu, sigma, [272,2])
# The print function will print the first 5 rows of the noise array 
print(noise[0:5])
[[ 0.55659442 -0.21980044]
 [-0.19208169 -0.02039839]
 [ 0.09807567  1.02917425]
 [-0.29852211 -0.24970532]
 [ 0.12535908 -0.21473562]]

Really a small change, let’s see what happens to model quality#

# Here we are adding the noise to the dataset
old_faith_1=old_faith + noise

#Create a new model with the noise added to the data
lm_1 = sm.OLS(old_faith_1['waiting'],old_faith_1['eruptions']).fit()

#Checking the R2 number, note what happens.
lm_1.summary()
OLS Regression Results
Dep. Variable:      waiting            R-squared (uncentered):       0.957
Model:              OLS                Adj. R-squared (uncentered):  0.957
Method:             Least Squares      F-statistic:                  6045.
Date:               Thu, 30 May 2024   Prob (F-statistic):           2.51e-187
Time:               17:12:00           Log-Likelihood:               -1121.7
No. Observations:   272                AIC:                          2245.
Df Residuals:       271                BIC:                          2249.
Df Model:           1
Covariance Type:    nonrobust

                 coef    std err          t      P>|t|      [0.025      0.975]
eruptions     19.0266      0.245     77.751      0.000      18.545      19.508

Omnibus:              0.516   Durbin-Watson:       2.356
Prob(Omnibus):        0.773   Jarque-Bera (JB):    0.588
Skew:                -0.101   Prob(JB):            0.745
Kurtosis:             2.895   Cond. No.            1.00


Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Now let’s take a look at that same scatter plot but with the new data#

plt.scatter(old_faith_1.waiting,old_faith_1.eruptions, alpha=0.5)
plt.ylabel("Eruption Seconds")
plt.xlabel("Waiting Minutes")

plt.show()
[Output: scatter plot of the noisy data, Waiting Minutes (x-axis) versus Eruption Seconds (y-axis)]

Task 1: Slowly increase the standard deviation (in increments of .25, stopping at 3.5), add the new noise array to the dataset at each step, re-run the model, and watch what happens to the R2 number; note the pace of change of the model (write it down).#

Task 2: Follow the same process as above, but instead adjust the mean in three increasing increments while leaving the standard deviation at .05; note how this changes the model fit.#
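
If you would like to double-check the numbers you record by hand, here is a minimal sketch of one way to automate the sweeps in Tasks 1 and 2. The helper function noisy_r2 is not part of the original notebook; it simply repeats the add-noise-then-fit steps above:

# Re-run the noise-then-fit steps above for a given mean (mu) and standard deviation (sigma)
def noisy_r2(mu, sigma):
    noise = np.random.normal(mu, sigma, old_faith.shape)
    noisy = old_faith + noise
    model = sm.OLS(noisy['waiting'], noisy['eruptions']).fit()
    return model.rsquared

# Task 1: increase the standard deviation in increments of .25, stopping at 3.5
for sigma in np.arange(0.25, 3.75, 0.25):
    print(round(sigma, 2), round(noisy_r2(0, sigma), 3))

# Task 2: increase the mean while holding the standard deviation at .05
for mu in [1, 5, 10]:   # hypothetical increments; pick your own three values
    print(mu, round(noisy_r2(mu, 0.05), 3))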

Questions for submission:#

  1. What patterns do you see in the original data (from the scatter plot)?

  2. How well did the first model fit the original data (from the model output)?

  3. What happens when you add the standard-deviation noise to the data? At what level does the model appear to degrade significantly?

  4. What happens when you add the mean-shifted noise to the data while holding the standard deviation constant? Does the model degrade at the same level as with the standard-deviation noise? Why or why not?

  5. Explain the concept of noise versus signal as it relates to the Old Faithful dataset. How does this relate to garbage in, garbage out?

  6. Explain what the linear regression model is doing.

Lab 3 - kNN#

Open In Colab

Overview#

The goal of this lab is to provide you with a better understanding of how machine learning models work. The model you will be building is a supervised classification model, k-nearest neighbors (kNN). This notebook will walk you through a shortened series of steps necessary to prepare the data, build the model, evaluate the output, and visualize the predictions. You’ll need to run all the cells to make the model work correctly. Do so one at a time, in order, by pressing the play button in the top left corner of each cell.

You will then be asked to interpret and manipulate the model features with the goal of better understanding how kNN generates its predictions. The lab will focus on the concepts of over/under fitting and how kNN generates probability estimates.

Dataset#

The data we will be using is the cereal dataset from the UCI Machine Learning Repository. The dataset contains 77 observations of 16 variables. The goal of this lab is to build a kNN model that can predict the target variable based on the 15 predictor variables.

Part 1: Data Preparation and Model Building#

Load the Libraries#

#First we import the libraries we need
import pandas as pd
import numpy as np 
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from matplotlib import pyplot as plt

Load the Dataset#

#read in the cereal dataset; we are reading the data in from a URL associated with the Foundations of 
#Machine Learning class

cereal = pd.read_csv("https://raw.githubusercontent.com/UVADS/DS-3001/main/data/cereal.csv")

cereal.info() # Let's check the structure of the dataset and see if we have any issues with variable classes
#usually the fix is converting things to category
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 16 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      77 non-null     object 
 1   mfr       77 non-null     object 
 2   type      77 non-null     object 
 3   calories  77 non-null     int64  
 4   protein   77 non-null     int64  
 5   fat       77 non-null     int64  
 6   sodium    77 non-null     int64  
 7   fiber     77 non-null     float64
 8   carbo     77 non-null     float64
 9   sugars    77 non-null     int64  
 10  potass    77 non-null     int64  
 11  vitamins  77 non-null     int64  
 12  shelf     77 non-null     int64  
 13  weight    77 non-null     float64
 14  cups      77 non-null     float64
 15  rating    77 non-null     float64
dtypes: float64(5), int64(8), object(3)
memory usage: 9.8+ KB

Scaling/Normalizing/Variable Classes (there is a lot more to data prep, but these are the focus of this lab)#

#Looks like columns 1,2,11 and 12 need to be converted to category

Column_index_list = [1,2,11,12]
cereal.iloc[:,Column_index_list]= cereal.iloc[:,Column_index_list].astype('category') 
cereal.type = cereal.type.astype('category')
#iloc selects columns by integer position, bypassing having to manually type in the names of each column

cereal.dtypes #another way of checking the structure of the dataset. Simpler, but does not give an index
name          object
mfr           object
type        category
calories       int64
protein        int64
fat            int64
sodium         int64
fiber        float64
carbo        float64
sugars         int64
potass         int64
vitamins    category
shelf       category
weight       float64
cups         float64
rating       float64
dtype: object
#Let's take a closer look at mfr
cereal.mfr.value_counts() #value_counts() simply displays variable counts as a vertical table.
mfr
K    23
G    22
P     9
Q     8
R     8
N     6
A     1
Name: count, dtype: int64

Example of how to collapse categories#

#Usually don't want more than 5ish groups, so we should collapse this factor  
#Keep the large groups (G, K) and group all the smaller categories as "Other"

top = ['K','G']
cereal.mfr = (cereal.mfr.apply(lambda x: x if x in top else "Other")).astype('category')
#lambda is a small anonymous function that can take any number of arguments but can only have one expression
#an example lambda function is lambda a: a+10; if we passed 5 to this we would get back 15
#lambda functions are best used inside of another function, like in this example where it is used inside the apply function
#to use an if expression in a lambda statement, the True return value comes first (x), then the if condition, then else, and then the False return value

cereal.mfr.value_counts() #This is a lot better
mfr
Other    32
K        23
G        22
Name: count, dtype: int64

Normalizing the numeric values#

# Here we are normalizing the numeric values and creating an index based on the numeric columns:
abc = list(cereal.select_dtypes('number')) #select function to find the numeric variables and create a list  

cereal[abc] = MinMaxScaler().fit_transform(cereal[abc])
cereal #notice the difference in the range of values for the numeric variables
name mfr type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
0 100% Bran Other C 0.181818 0.6 0.2 0.406250 0.714286 0.250000 0.4375 0.848943 25 3 0.5 0.064 0.665593
1 100% Natural Bran Other C 0.636364 0.4 1.0 0.046875 0.142857 0.375000 0.5625 0.410876 0 3 0.5 0.600 0.210685
2 All-Bran K C 0.181818 0.6 0.2 0.812500 0.642857 0.333333 0.3750 0.969789 25 3 0.5 0.064 0.546941
3 All-Bran with Extra Fiber K C 0.000000 0.6 0.0 0.437500 1.000000 0.375000 0.0625 1.000000 25 3 0.5 0.200 1.000000
4 Almond Delight Other C 0.545455 0.2 0.4 0.625000 0.071429 0.625000 0.5625 0.000000 25 3 0.5 0.400 0.215987
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
72 Triples G C 0.545455 0.2 0.2 0.781250 0.000000 0.916667 0.2500 0.184290 25 3 0.5 0.400 0.278387
73 Trix G C 0.545455 0.0 0.2 0.437500 0.000000 0.583333 0.8125 0.078550 25 2 0.5 0.600 0.128340
74 Wheat Chex Other C 0.454545 0.4 0.2 0.718750 0.214286 0.750000 0.2500 0.350453 25 1 0.5 0.336 0.419558
75 Wheaties G C 0.454545 0.4 0.2 0.625000 0.214286 0.750000 0.2500 0.335347 25 1 0.5 0.600 0.443410
76 Wheaties Honey Gold G C 0.545455 0.2 0.2 0.625000 0.071429 0.708333 0.5625 0.184290 25 1 0.5 0.400 0.239812

77 rows × 16 columns
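
For reference, MinMaxScaler rescales each numeric column to the 0-1 range using x' = (x - min) / (max - min). A minimal sketch with hypothetical values (not taken from the cereal data):

demo = pd.Series([70, 100, 120, 160])                    # e.g., made-up calorie counts
print((demo - demo.min()) / (demo.max() - demo.min()))   # 0.000, 0.333, 0.556, 1.000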

One-Hot Encoding the categorical values#

# Next let's one-hot encode those categorical variables

category_list = list(cereal.select_dtypes('category')) #select function to find the categorical variables and create a list  

cereal_1h = pd.get_dummies(cereal, columns = category_list) 
#get_dummies encodes categorical variables into binary by adding an indicator column for each group of a category 
#and assigning it 0 if false or 1 if true
cereal_1h #see the difference? This is one-hot encoding!
name calories protein fat sodium fiber carbo sugars potass weight ... mfr_K mfr_Other type_C type_H vitamins_0 vitamins_25 vitamins_100 shelf_1 shelf_2 shelf_3
0 100% Bran 0.181818 0.6 0.2 0.406250 0.714286 0.250000 0.4375 0.848943 0.5 ... False True True False False True False False False True
1 100% Natural Bran 0.636364 0.4 1.0 0.046875 0.142857 0.375000 0.5625 0.410876 0.5 ... False True True False True False False False False True
2 All-Bran 0.181818 0.6 0.2 0.812500 0.642857 0.333333 0.3750 0.969789 0.5 ... True False True False False True False False False True
3 All-Bran with Extra Fiber 0.000000 0.6 0.0 0.437500 1.000000 0.375000 0.0625 1.000000 0.5 ... True False True False False True False False False True
4 Almond Delight 0.545455 0.2 0.4 0.625000 0.071429 0.625000 0.5625 0.000000 0.5 ... False True True False False True False False False True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
72 Triples 0.545455 0.2 0.2 0.781250 0.000000 0.916667 0.2500 0.184290 0.5 ... False False True False False True False False False True
73 Trix 0.545455 0.0 0.2 0.437500 0.000000 0.583333 0.8125 0.078550 0.5 ... False False True False False True False False True False
74 Wheat Chex 0.454545 0.4 0.2 0.718750 0.214286 0.750000 0.2500 0.350453 0.5 ... False True True False False True False True False False
75 Wheaties 0.454545 0.4 0.2 0.625000 0.214286 0.750000 0.2500 0.335347 0.5 ... False False True False False True False True False False
76 Wheaties Honey Gold 0.545455 0.2 0.2 0.625000 0.071429 0.708333 0.5625 0.184290 0.5 ... False False True False False True False True False False

77 rows × 23 columns
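
For reference, here is a toy sketch of what get_dummies does (hypothetical values, not taken from the cereal data):

demo = pd.DataFrame({'mfr': ['K', 'G', 'Other', 'K']})
print(pd.get_dummies(demo, columns=['mfr']))
# each category becomes its own indicator column (mfr_G, mfr_K, mfr_Other) holding True/False (or 1/0, depending on your pandas version)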

Creating a binary target variable so we can run this model as a classification problem.#

#add this as a new column instead of replacing the numeric version, changing the target into a binary variable 
#to allow for classification 
cereal_1h['rating_f'] = pd.cut(cereal_1h.rating, bins = [-1,0.43,1], labels =[0,1])
#If we want two segments we input three numbers: the start, cut, and stop values

cereal_1h #notice the new column rating_f; it is binary based on whether the continuous value is above 0.43
name calories protein fat sodium fiber carbo sugars potass weight ... mfr_Other type_C type_H vitamins_0 vitamins_25 vitamins_100 shelf_1 shelf_2 shelf_3 rating_f
0 100% Bran 0.181818 0.6 0.2 0.406250 0.714286 0.250000 0.4375 0.848943 0.5 ... True True False False True False False False True 1
1 100% Natural Bran 0.636364 0.4 1.0 0.046875 0.142857 0.375000 0.5625 0.410876 0.5 ... True True False True False False False False True 0
2 All-Bran 0.181818 0.6 0.2 0.812500 0.642857 0.333333 0.3750 0.969789 0.5 ... False True False False True False False False True 1
3 All-Bran with Extra Fiber 0.000000 0.6 0.0 0.437500 1.000000 0.375000 0.0625 1.000000 0.5 ... False True False False True False False False True 1
4 Almond Delight 0.545455 0.2 0.4 0.625000 0.071429 0.625000 0.5625 0.000000 0.5 ... True True False False True False False False True 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
72 Triples 0.545455 0.2 0.2 0.781250 0.000000 0.916667 0.2500 0.184290 0.5 ... False True False False True False False False True 0
73 Trix 0.545455 0.0 0.2 0.437500 0.000000 0.583333 0.8125 0.078550 0.5 ... False True False False True False False True False 0
74 Wheat Chex 0.454545 0.4 0.2 0.718750 0.214286 0.750000 0.2500 0.350453 0.5 ... True True False False True False True False False 0
75 Wheaties 0.454545 0.4 0.2 0.625000 0.214286 0.750000 0.2500 0.335347 0.5 ... False True False False True False True False False 1
76 Wheaties Honey Gold 0.545455 0.2 0.2 0.625000 0.071429 0.708333 0.5625 0.184290 0.5 ... False True False False True False True False False 0

77 rows × 24 columns
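
For reference, a toy sketch of how pd.cut assigns the labels with these bins (hypothetical rating values):

demo = pd.Series([0.10, 0.43, 0.44, 0.90])
print(pd.cut(demo, bins=[-1, 0.43, 1], labels=[0, 1]))
# bins are right-inclusive, so values at or below 0.43 get label 0 and values above 0.43 get label 1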

Dropping Variables and Partitioning (Train/Test)#

#Divide up our data into two parts, Training and Test but first we need to...
#clean up our dataset a bit by dropping the original rating variable and the cereal name since we can't really use them

cereal_dt = cereal_1h.drop(['name','rating'],axis=1) #creating a new dataframe so we don't delete these columns from our working environment. 
cereal_dt
calories protein fat sodium fiber carbo sugars potass weight cups ... mfr_Other type_C type_H vitamins_0 vitamins_25 vitamins_100 shelf_1 shelf_2 shelf_3 rating_f
0 0.181818 0.6 0.2 0.406250 0.714286 0.250000 0.4375 0.848943 0.5 0.064 ... True True False False True False False False True 1
1 0.636364 0.4 1.0 0.046875 0.142857 0.375000 0.5625 0.410876 0.5 0.600 ... True True False True False False False False True 0
2 0.181818 0.6 0.2 0.812500 0.642857 0.333333 0.3750 0.969789 0.5 0.064 ... False True False False True False False False True 1
3 0.000000 0.6 0.0 0.437500 1.000000 0.375000 0.0625 1.000000 0.5 0.200 ... False True False False True False False False True 1
4 0.545455 0.2 0.4 0.625000 0.071429 0.625000 0.5625 0.000000 0.5 0.400 ... True True False False True False False False True 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
72 0.545455 0.2 0.2 0.781250 0.000000 0.916667 0.2500 0.184290 0.5 0.400 ... False True False False True False False False True 0
73 0.545455 0.0 0.2 0.437500 0.000000 0.583333 0.8125 0.078550 0.5 0.600 ... False True False False True False False True False 0
74 0.454545 0.4 0.2 0.718750 0.214286 0.750000 0.2500 0.350453 0.5 0.336 ... True True False False True False True False False 0
75 0.454545 0.4 0.2 0.625000 0.214286 0.750000 0.2500 0.335347 0.5 0.600 ... False True False False True False True False False 1
76 0.545455 0.2 0.2 0.625000 0.071429 0.708333 0.5625 0.184290 0.5 0.400 ... False True False False True False True False False 0

77 rows × 22 columns

# Now we partition
Train, Test = train_test_split(cereal_dt,  train_size = 55, stratify = cereal_dt.rating_f) 
#stratify preserves class proportions when splitting, reducing sampling error 
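
An optional sanity check (a sketch, not part of the original notebook): because we stratified, the class proportions in Train and Test should be close to each other.

print(Train.rating_f.value_counts(normalize=True))   # class proportions in the training split
print(Test.rating_f.value_counts(normalize=True))    # should be similar in the test split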

Model Building Cell#

import random
random.seed(1984)   # set a seed to help keep results repeatable across runs

X_train = Train.drop(['rating_f'], axis=1).values #features
y_train = Train['rating_f'].values #target

neigh = KNeighborsClassifier(n_neighbors=3) #this is the value to change
neigh.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=3)

Evaluation Cell#

X_test = Test.drop(['rating_f'], axis=1).values
y_test = Test['rating_f'].values

neigh.score(X_test, y_test)
0.7727272727272727
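
For reference, .score() on a classifier reports accuracy: the fraction of test rows whose predicted class matches the actual class. A minimal sketch of the equivalent manual calculation (not part of the original notebook):

preds = neigh.predict(X_test)                  # predicted classes for the test rows
print(np.mean(preds == np.asarray(y_test)))    # fraction correct; should match the score above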

Task 1#

  1. Go back two cells to the “Model Building Cell” and change the value of k. Then rerun the “Evaluation Cell”. What happens to the accuracy of the model? Track the output of the model for odd values of k between 3 and 21. Note the changes by drawing a graph, by hand, of accuracy versus k. (A sketch for collecting these numbers appears after this list.)

  2. Using what we learned in class about over- and under-fitting, explain what is happening to the model as we change the value of k, keeping in mind that the evaluation metric ranges from 0 to 1, with 1 being a perfect classification model.

  3. Lastly, recommend a value for k and explain why you recommend that value.
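
If you would like to double-check your hand-drawn graph, here is a minimal sketch of one way to collect the accuracy numbers. It simply repeats the Model Building and Evaluation cells for each k and is not required for the lab:

for k in range(3, 22, 2):   # odd values of k from 3 to 21
    model_k = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, round(model_k.score(X_test, y_test), 3))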

Part 2: Probabilities of the Model#

The model is predicting the probability of each observation belonging to each class. So for every row in the test dataset a probability is calculated for each class [0,1]. The class with the highest probability is the predicted class. So let’s take a closer look at the probabilities. The next five cells are just getting the data organized so the results are easier to view.
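
As a reference (a sketch, assuming k is still 3 from the Model Building Cell): with k = 3, each probability is just the number of the 3 nearest training neighbors belonging to a class divided by 3, so the only possible values are 0, 1/3, 2/3, and 1. You can peek at the neighbors behind one prediction like this:

distances, indices = neigh.kneighbors(X_test[:1])   # the k nearest training rows to the first test row
print(np.asarray(y_train)[indices])                 # their classes; the vote among them gives the probability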

# we want to make a table containing: the probability, predicted, and actual values

test_probs = neigh.predict_proba(X_test)
test_preds = neigh.predict(X_test)
# convert probabilities to pd df
test_probabilities = pd.DataFrame(test_probs, columns = ['bad_cereal_prob', 'good_cereal_prob'])
test_probabilities
bad_cereal_prob good_cereal_prob
0 0.666667 0.333333
1 0.666667 0.333333
2 0.666667 0.333333
3 0.666667 0.333333
4 0.666667 0.333333
5 0.000000 1.000000
6 1.000000 0.000000
7 1.000000 0.000000
8 1.000000 0.000000
9 1.000000 0.000000
10 0.666667 0.333333
11 1.000000 0.000000
12 1.000000 0.000000
13 1.000000 0.000000
14 1.000000 0.000000
15 1.000000 0.000000
16 1.000000 0.000000
17 0.666667 0.333333
18 0.666667 0.333333
19 1.000000 0.000000
20 0.333333 0.666667
21 0.333333 0.666667
final_model = pd.DataFrame({'actual_class': y_test.tolist(),
                           'pred_class': test_preds.tolist(),
                           'pred_prob': [test_probabilities['good_cereal_prob'][i] if test_preds[i]==1 else test_probabilities['bad_cereal_prob'][i] for i in range(len(test_preds))]})
# that last line is some list comprehension -- to understand that here in particular click the following link:
# https://stackoverflow.com/questions/4260280/if-else-in-a-list-comprehension

final_model.head()
actual_class pred_class pred_prob
0 0 0 0.666667
1 0 0 0.666667
2 0 0 0.666667
3 1 0 0.666667
4 0 0 0.666667

Now we are going to add the probability of the positive class (good cereal) to the dataframe.#

# add a column with the probability that the observation is in the positive class
final_model['pos_pred'] = [final_model.pred_prob[i] if final_model.pred_class[i]==1 else 1-final_model.pred_prob[i] for i in range(len(final_model.pred_class))]

final_model #look at this dataframe for the first question in Task 2
actual_class pred_class pred_prob pos_pred
0 0 0 0.666667 0.333333
1 0 0 0.666667 0.333333
2 0 0 0.666667 0.333333
3 1 0 0.666667 0.333333
4 0 0 0.666667 0.333333
5 1 1 1.000000 1.000000
6 0 0 1.000000 0.000000
7 0 0 1.000000 0.000000
8 0 0 1.000000 0.000000
9 0 0 1.000000 0.000000
10 0 0 0.666667 0.333333
11 0 0 1.000000 0.000000
12 0 0 1.000000 0.000000
13 0 0 1.000000 0.000000
14 1 0 1.000000 0.000000
15 0 0 1.000000 0.000000
16 1 0 1.000000 0.000000
17 0 0 0.666667 0.333333
18 0 0 0.666667 0.333333
19 1 0 1.000000 0.000000
20 0 1 0.666667 0.666667
21 1 1 0.666667 0.666667
# convert classes to categories
final_model.actual_class = final_model.actual_class.astype('category')
final_model.pred_class = final_model.pred_class.astype('category')

Graph the probabilities#

# create probability distribution graph
import seaborn as sns

sns.histplot(final_model, x="pos_pred")
<Axes: xlabel='pos_pred', ylabel='Count'>
[Output: histogram of pos_pred (predicted probability of the positive class) versus count]

Task 2: Determine how the probabilities are calculated#

  1. First, looking at the “final model” dataframe from three cells above, list the rows that have errors in the classification.

  2. Use the code below to again adjust the k value in the “Second Model Building Cell” and then run the “Graph Cell”. Use the same range of values as you did in Task 1. This time look at the graph and note the changes. Then, based on what you are seeing, determine how the probabilities are being generated.

  3. Lastly, the model will predict anything above 50% as the positive class (good cereal). Given this information, what are the risks of having too small of a k value?

Second Model Building Cell#

random.seed(1984)   # set a seed to help keep results repeatable across runs

X_train = Train.drop(['rating_f'], axis=1).values #features
y_train = Train['rating_f'].values #target

neigh = KNeighborsClassifier(n_neighbors=5) #this is the value to change
neigh.fit(X_train, y_train)
KNeighborsClassifier()

Graph Cell#

test_probs_1 = neigh.predict_proba(X_test)

#converting to a dataframe to make seaborn happy
test_probs_1 = pd.DataFrame(test_probs_1, columns = ['bad_cereal_prob', 'good_cereal_prob'])

#generates the plot
sns.histplot(test_probs_1, x= "good_cereal_prob")
<Axes: xlabel='good_cereal_prob', ylabel='Count'>
[Output: histogram of good_cereal_prob versus count]