First, import your libraries!
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from matplotlib import pyplot as plt
Data prep¶
Based on the following data summary, what questions and business metric should we use?
bank_data = pd.read_csv("https://raw.githubusercontent.com/UVADS/DS-3001/main/data/bank.csv")
bank_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43628 entries, 0 to 43627
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 43628 non-null int64
1 job 43628 non-null object
2 marital 43628 non-null object
3 education 43628 non-null object
4 default 43628 non-null object
5 balance 43628 non-null int64
6 housing 43628 non-null object
7 contact 43628 non-null object
8 duration 43628 non-null int64
9 campaign 43628 non-null int64
10 pdays 43628 non-null int64
11 previous 43628 non-null int64
12 poutcome 43628 non-null object
13 signed up 43628 non-null int64
dtypes: int64(7), object(7)
memory usage: 4.7+ MB
< your answer here >
Now, let’s check the composition of the data.
bank_data.marital.value_counts() # 3 levels
married 26241
single 12355
divorced 5032
Name: marital, dtype: int64
bank_data.education.value_counts() # 4 levels
secondary 22404
tertiary 12863
primary 6584
unknown 1777
Name: education, dtype: int64
bank_data.default.value_counts() # 2 levels
no 42844
yes 784
Name: default, dtype: int64
bank_data.job.value_counts() # 12 levels! What should we do?
blue-collar 9366
management 9142
technician 7321
admin. 5001
services 4010
retired 2184
self-employed 1530
entrepreneur 1433
unemployed 1259
housemaid 1199
student 907
unknown 276
Name: job, dtype: int64
bank_data.contact.value_counts() # 3 levels -- difference between cellular and telephone?
cellular 28295
unknown 12523
telephone 2810
Name: contact, dtype: int64
bank_data.housing.value_counts() # 2 levels
yes 24231
no 19397
Name: housing, dtype: int64
bank_data.poutcome.value_counts() # 4 levels
unknown 35684
failure 4723
other 1783
success 1438
Name: poutcome, dtype: int64
bank_data['signed up'].value_counts() # 2 levels
0 38554
1 5074
Name: signed up, dtype: int64
We should collapse the variable with 12 levels. In Python, this process is slightly different than it is in R.
employed = ['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management',
            'self-employed', 'services', 'technician']  # note the trailing period in 'admin.', matching the value_counts above
# everything else ('retired', 'student', 'unemployed', 'unknown') gets labeled Unemployed
bank_data.job = bank_data.job.apply(lambda x: "Employed" if x in employed else "Unemployed")
bank_data.job.value_counts()
Employed 39002
Unemployed 4626
Name: job, dtype: int64
Now, we convert the appropriate columns to categorical -- the pandas equivalent of R factors.
# bank_data.info() # check the variables
cat = ['job', 'marital', 'education', 'default', 'housing', 'contact',
'poutcome', 'signed up'] # select the columns to convert
bank_data[cat] = bank_data[cat].astype('category')
bank_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43628 entries, 0 to 43627
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 43628 non-null int64
1 job 43628 non-null category
2 marital 43628 non-null category
3 education 43628 non-null category
4 default 43628 non-null category
5 balance 43628 non-null int64
6 housing 43628 non-null category
7 contact 43628 non-null category
8 duration 43628 non-null int64
9 campaign 43628 non-null int64
10 pdays 43628 non-null int64
11 previous 43628 non-null int64
12 poutcome 43628 non-null category
13 signed up 43628 non-null category
dtypes: category(8), int64(6)
memory usage: 2.3 MB
Check for missing data¶
In R we used the mice package, which lets you visualize the NaNs in a dataset and later impute them. Python has no single direct equivalent, but we can still complete the same steps with pandas and seaborn.
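Before plotting, a quick numeric check also works; this minimal sketch just counts the NaNs in each column with plain pandas:
bank_data.isna().sum()  # per-column count of missing values (all zeros for this dataset)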
Using the seaborn package, we can see the distribution of missing values. Along the x-axis, you will see the proportion of the data missing for that variable.
import seaborn as sns
sns.displot(
data=bank_data.isna().melt(value_name="missing"),
y="variable",
hue="missing",
multiple="fill",
aspect=1.25
)
# plt.savefig("visualizing_missing_data_with_barplot_Seaborn_distplot.png", dpi=100)
# the above line will save the image to your computer!
<seaborn.axisgrid.FacetGrid at 0x7fd15aec46a0>

No missing data!!
Next, we normalize the numeric variables.
numeric_cols = bank_data.select_dtypes(include='int64').columns
print(numeric_cols)
Index(['age', 'balance', 'duration', 'campaign', 'pdays', 'previous'], dtype='object')
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
d = scaler.fit_transform(bank_data[numeric_cols]) # conduct data transformation
scaled_df = pd.DataFrame(d, columns=numeric_cols) # convert back to pd df; transformation converts to array
bank_data[numeric_cols] = scaled_df # put data back into the main df
bank_data.describe() # as we can see, the data is now normalized!
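For reference, MinMaxScaler rescales each column to the [0, 1] range using (x - column_min) / (column_max - column_min), which is why every minimum in the describe() output is now 0 and every maximum is now 1.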
Now, we one-hot encode the data -- for reference, one-hot encoding converts each categorical variable into a set of 0/1 indicator columns that a machine learning algorithm can use.
cat_cols = bank_data.select_dtypes(include='category').columns
print(cat_cols)
Index(['job', 'marital', 'education', 'default', 'housing', 'contact',
'poutcome', 'signed up'],
dtype='object')
encoded = pd.get_dummies(bank_data[cat_cols])
encoded.head() # note the new columns
bank_data = bank_data.drop(cat_cols, axis=1)
bank_data = bank_data.join(encoded)
bank_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43628 entries, 0 to 43627
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 43628 non-null float64
1 balance 43628 non-null float64
2 duration 43628 non-null float64
3 campaign 43628 non-null float64
4 pdays 43628 non-null float64
5 previous 43628 non-null float64
6 job_Employed 43628 non-null uint8
7 job_Unemployed 43628 non-null uint8
8 marital_divorced 43628 non-null uint8
9 marital_married 43628 non-null uint8
10 marital_single 43628 non-null uint8
11 education_primary 43628 non-null uint8
12 education_secondary 43628 non-null uint8
13 education_tertiary 43628 non-null uint8
14 education_unknown 43628 non-null uint8
15 default_no 43628 non-null uint8
16 default_yes 43628 non-null uint8
17 housing_no 43628 non-null uint8
18 housing_yes 43628 non-null uint8
19 contact_cellular 43628 non-null uint8
20 contact_telephone 43628 non-null uint8
21 contact_unknown 43628 non-null uint8
22 poutcome_failure 43628 non-null uint8
23 poutcome_other 43628 non-null uint8
24 poutcome_success 43628 non-null uint8
25 poutcome_unknown 43628 non-null uint8
26 signed up_0 43628 non-null uint8
27 signed up_1 43628 non-null uint8
dtypes: float64(6), uint8(22)
memory usage: 2.9 MB
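Note that get_dummies creates one indicator column for every level, so two-level variables like default, housing, and signed up end up as redundant pairs (e.g. default_no is just 1 - default_yes). If you'd rather keep a single column per binary variable, the encoding step above could have been written as follows (a sketch, not what we ran):
encoded = pd.get_dummies(bank_data[cat_cols], drop_first=True)  # keeps k-1 indicator columns for a k-level variable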
The data is ready! Now, let’s build our model.
Train model¶
We’ll run the kNN algorithm on the banking data. First, we’ll check the prevalence of the target class.
bank_data['signed up_1'].value_counts()[1] / bank_data['signed up_1'].count()
0.11630145777940772
This means that at random, we have an 11.6% chance of correctly picking a subscribed individual. Let’s see if kNN can do any better.
"""
X = bank_data.drop(['signed up_1'], axis=1).values # independent variables
y = bank_data['signed up_1'].values # dependent variable
"""
train, test = train_test_split(bank_data, test_size=0.4, stratify = bank_data['signed up_1'])
test, val = train_test_split(test, test_size=0.5, stratify=test['signed up_1'])
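Note that train_test_split shuffles at random, so the exact rows in each split (and the scores below) can vary between runs. For a repeatable split, pass random_state; a minimal sketch of the same 60/20/20 split with a fixed seed:
train, test = train_test_split(bank_data, test_size=0.4, stratify=bank_data['signed up_1'], random_state=1984)
test, val = train_test_split(test, test_size=0.5, stratify=test['signed up_1'], random_state=1984)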
Now, let’s train the classifier for k=9.
import random
random.seed(1984)  # note: kNN itself is deterministic; the randomness here comes from train_test_split, and random.seed() seeds Python's random module rather than NumPy, so pass random_state= to the split (as sketched above) if you need repeatable results
X_train = train.drop(['signed up_1'], axis=1).values
y_train = train['signed up_1'].values
neigh = KNeighborsClassifier(n_neighbors=9)
neigh.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=9)
# now, we check the model's accuracy on the test data:
X_test = test.drop(['signed up_1'], axis=1).values
y_test = test['signed up_1'].values
neigh.score(X_test, y_test)
0.9903735961494384
# now, we test the accuracy on our validation data.
X_val = val.drop(['signed up_1'], axis=1).values
y_val = val['signed up_1'].values
neigh.score(X_val, y_val)
0.9919779967911987
Evaluate model¶
A 99.0% accuracy rate is pretty good, but keep in mind the base rate is roughly 89/11, so we already have about a 90% chance of guessing right even if we know nothing about the customer. The negative outcomes aren't really what we care about; this model's value is its ability to identify sign-ups when they are actually sign-ups. That requires knowing our true positive rate, also called sensitivity or recall. So let's dig a little deeper.
# create a confusion matrix
from sklearn.metrics import plot_confusion_matrix  # note: removed in scikit-learn 1.2+; ConfusionMatrixDisplay.from_estimator is the modern equivalent
plot_confusion_matrix(neigh, X_val, y_val, cmap='Blues')
plt.show()

tip: to change the color scheme of your confusion matrix, pass any matplotlib colormap name to the cmap argument: https://
# create classification report
from sklearn.metrics import classification_report
y_val_pred = neigh.predict(X_val)
print(classification_report(y_val, y_val_pred))  # argument order is (y_true, y_pred)
precision recall f1-score support
0 1.00 0.99 1.00 7767
1 0.94 0.99 0.96 959
accuracy 0.99 8726
macro avg 0.97 0.99 0.98 8726
weighted avg 0.99 0.99 0.99 8726
# sensitivity and specificity aren't labeled as such in the report, so we'll calculate them ourselves from the confusion matrix above.
sensitivity = 943/(943+72) # = TP/(TP+FN)
specificity = 7707/(7707+4) # = TN/(TN+FP)
print(sensitivity, specificity)
0.929064039408867 0.9994812605368953
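scikit-learn can compute the same numbers for us; a quick cross-check (sensitivity is recall on the positive class, specificity is recall on the negative class):
from sklearn.metrics import recall_score
print(recall_score(y_val, y_val_pred))               # sensitivity / true positive rate
print(recall_score(y_val, y_val_pred, pos_label=0))  # specificity / true negative rate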
Selecting the correct ‘k’¶
How does “k” affect classification accuracy? Let’s create a function to calculate classification accuracy based on the number of “k.”
def chooseK(k, X_train, y_train, X_test, y_test):
    random.seed(1)
    print("calculating... ", k, "k")  # I'll include this so you can see the progress of the function as it runs
    class_knn = KNeighborsClassifier(n_neighbors=k)
    class_knn.fit(X_train, y_train)
    # calculate accuracy
    accu = class_knn.score(X_test, y_test)
    return accu
We’ll test odd k values from 1 to 21. We want to create a table of all the data, so we’ll use list comprehension to create the “accuracy” column.
remember: Python ranges are end-exclusive; we want values up to and including 21, so we'll have to extend the end bound.
test = pd.DataFrame({'k': list(range(1, 22, 2)),
                     'accu': [chooseK(x, X_train, y_train, X_test, y_test) for x in list(range(1, 22, 2))]})
# note: this reuses the name `test`, replacing the test-split DataFrame from earlier (X_test/y_test were already extracted, so nothing downstream breaks)
calculating... 1 k
calculating... 3 k
calculating... 5 k
calculating... 7 k
calculating... 9 k
calculating... 11 k
calculating... 13 k
calculating... 15 k
calculating... 17 k
calculating... 19 k
calculating... 21 k
test
test = test.sort_values(by=['accu'], ascending=False)
test
From here, we see that the best value is k=1!
Let’s go through the code we wrote in a bit more detail, specifically regarding the DataFrame construction.
For reference, here’s the line of code we wrote:
test = pd.DataFrame({'k':list(range(1,22,2)),
'accu':[chooseK(x, X_train, y_train, X_test, y_test) for x in list(range(1, 22, 2))]})
A pandas DataFrame can be built directly from a Python dictionary, identifiable by the curly brackets ({}) and key-value pairs. The keys become the column names (i.e. 'k' or 'accu'), while each value is a list of all the entries we want in that column.
For 'k', we made a list of the numbers from 1 to 22 (end exclusive), selecting only every other value. This uses the syntax range(start, stop, step); leaving out step defaults it to 1, which selects every value in that range.
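As a quick sanity check, evaluating the range directly shows the eleven odd k values we test:
list(range(1, 22, 2))  # [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]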
For ‘accu’, we used list comprehension, which boils down to being loop shorthand with the output being entered into a list. We could easily re-write the code as:
temp = []
for x in list(range(1, 22, 2)):
    temp.append(chooseK(x, X_train, y_train, X_test, y_test))
before adding the list to the DataFrame. The list comprehension is more concise (and usually a bit faster), which is why we used it earlier.
Now, let’s graph our results!¶
test = test.sort_values(by='k')  # re-sort by k so the line plots left to right
plt.plot(test['k'], test['accu'])
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.title('Model Accuracy')
plt.show()

Adjusting the threshold¶
# we want to make a table containing: probability, expected, and actual values
test_probs = neigh.predict_proba(X_test)
test_preds = neigh.predict(X_test)
# convert probabilities to pd df
test_probabilities = pd.DataFrame(test_probs, columns = ['not_signed_up_prob', 'signed_up_prob'])
test_probabilities
final_model = pd.DataFrame({'actual_class': y_test.tolist(),
'pred_class': test_preds.tolist(),
'pred_prob': [test_probabilities['signed_up_prob'][i] if test_preds[i]==1 else test_probabilities['not_signed_up_prob'][i] for i in range(len(test_preds))]})
# that last line is some list comprehension -- to understand that here in particular click the following link:
# https://stackoverflow.com/questions/4260280/if-else-in-a-list-comprehension
final_model.head()
# add a column about the probability the observation is in the positive class
final_model['pos_pred'] = [final_model.pred_prob[i] if final_model.pred_class[i]==1 else 1-final_model.pred_prob[i] for i in range(len(final_model.pred_class))]
final_model.head()
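As an aside, since the second column of predict_proba is already the probability of the positive class, I believe the same column can be built more directly:
final_model['pos_pred'] = test_probs[:, 1]  # probability of class 1 straight from predict_proba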
# convert classes to categories
final_model.actual_class = final_model.actual_class.astype('category')
final_model.pred_class = final_model.pred_class.astype('category')
# create probability distribution graph
import seaborn as sns
sns.displot(final_model, x="pos_pred", kind="kde")
<seaborn.axisgrid.FacetGrid at 0x7fd178cb5c70>

final_model.pos_pred.value_counts()
0.000000 7563
1.000000 799
0.111111 91
0.222222 49
0.555556 40
0.333333 40
0.444444 38
0.777778 37
0.666667 37
0.888889 32
Name: pos_pred, dtype: int64
Most of the predicted probabilities fall strictly between 0 and 1, so predictions come with some uncertainty. We have to choose a threshold above which we consider a prediction to belong to the positive class. Is 60% certainty positive? How about 40%? This is where you have more control over your model's classifications, and it is especially useful for reducing the kinds of incorrect classifications you may have noticed in your confusion matrix.
from sklearn.metrics import confusion_matrix
def adjust_thres(x, y, z):
    """
    x=pred_probabilities
    y=threshold
    z=tune_outcome
    """
    thres = pd.DataFrame({'new_preds': [1 if i > y else 0 for i in x]})
    thres.new_preds = thres.new_preds.astype('category')
    con_mat = confusion_matrix(z, thres.new_preds)
    print(con_mat)
confusion_matrix(final_model.actual_class, final_model.pred_class) # original model
array([[7704, 7],
[ 77, 938]])
adjust_thres(final_model.pos_pred, .90, final_model.actual_class) # raise threshold
[[7711 0]
[ 216 799]]
adjust_thres(final_model.pos_pred, .3, final_model.actual_class) # lower threshold
[[7664 47]
[ 39 976]]
More for next week: evaluation metrics¶
ROC/AUC curve¶
There are a few really cool graphing options, so I'll show you some. Several of these Python packages are interactive as well!
# basic graph
from sklearn import metrics
fpr, tpr, _ = metrics.roc_curve(y_test, final_model.pos_pred)
auc = metrics.roc_auc_score(y_test, final_model.pos_pred)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

# installs dependency for next graph
! pip install plot_metric
Requirement already satisfied: plot_metric in /opt/anaconda3/lib/python3.8/site-packages (0.0.6)
# a pretty cool one
from plot_metric.functions import BinaryClassification
# Visualisation with plot_metric
bc = BinaryClassification(y_test, final_model.pred_class, labels=["0", "1"])  # note: passing hard 0/1 predictions gives a very coarse curve; the probabilities in final_model.pos_pred would likely trace a smoother ROC
# Figures
plt.figure(figsize=(5,5))
bc.plot_roc_curve()
plt.show()

F1 score¶
metrics.f1_score(y_test, final_model.pred_class)
0.9571428571428572
LogLoss¶
metrics.log_loss(y_test, final_model.pred_class)
0.3324848515189683
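Log loss is usually computed on the predicted probabilities rather than the hard 0/1 labels, since it penalizes confident wrong answers more heavily; a sketch using the probabilities we already generated:
metrics.log_loss(y_test, test_probs)  # test_probs came from neigh.predict_proba(X_test)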
Another quick example¶
from pydataset import data
iris = data("iris")
iris.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 1 to 150
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal.Length 150 non-null float64
1 Sepal.Width 150 non-null float64
2 Petal.Length 150 non-null float64
3 Petal.Width 150 non-null float64
4 Species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 7.0+ KB
iris.describe()
from sklearn.preprocessing import scale
cols = list(iris.columns[:4])
scaledIris = pd.DataFrame(scale(iris.iloc[:, :4]), index=iris.index, columns=cols)
scaledIris.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 1 to 150
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Sepal.Length 150 non-null float64
1 Sepal.Width 150 non-null float64
2 Petal.Length 150 non-null float64
3 Petal.Width 150 non-null float64
dtypes: float64(4)
memory usage: 5.9 KB
scaledIris['Species'] = iris['Species']
# split datasets
irisTrain, irisTest = train_test_split(scaledIris, test_size=0.4, stratify = scaledIris['Species'])
irisTest, irisVal = train_test_split(irisTest, test_size=0.5, stratify = irisTest['Species'])
Xi_train = irisTrain.drop(['Species'], axis=1)
yi_train = irisTrain['Species']
Xi_test = irisTest.drop(['Species'], axis=1)
yi_test = irisTest['Species']
Xi_val = irisVal.drop(['Species'], axis=1)
yi_val = irisVal['Species']
iris_neigh = KNeighborsClassifier(n_neighbors=3)
iris_neigh.fit(Xi_train, yi_train)
KNeighborsClassifier(n_neighbors=3)
iris_neigh.score(Xi_test, yi_test)
0.9666666666666667
iris_neigh.score(Xi_val, yi_val)
0.9
plot_confusion_matrix(iris_neigh, Xi_val, yi_val, cmap='Blues')
plt.show()

Example using 10-fold cross-validation¶
from sklearn.model_selection import RepeatedKFold
rkf = RepeatedKFold(n_splits=10, n_repeats=3, random_state=12)
X_si = scaledIris.drop(['Species'], axis=1)
y_si = scaledIris['Species']
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
cv_neigh = KNeighborsClassifier(n_neighbors=3) # create classifier
scores = cross_val_score(cv_neigh, X_si, y_si, scoring='accuracy', cv=rkf, n_jobs=-1) # do repeated cv
print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
Accuracy: 0.949 (0.061)
plt.plot(scores)

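As an aside, GridSearchCV (imported above but not used) automates the choose-the-best-k loop from earlier by cross-validating every candidate value; a minimal sketch on the iris data:
param_grid = {'n_neighbors': list(range(1, 22, 2))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, scoring='accuracy', cv=rkf, n_jobs=-1)
grid.fit(X_si, y_si)
print(grid.best_params_, grid.best_score_)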
# more complex version so you can create a graph for testing and training accuracy (not built into the previous version)
#Split arrays or matrices into train and test subsets
Xsi_train, Xsi_test, ysi_train, ysi_test = train_test_split(X_si, y_si, test_size=0.20)
rcv_knn = KNeighborsClassifier(n_neighbors=6)
rcv_knn.fit(Xsi_train, ysi_train)
print("Preliminary model score:")
print(rcv_knn.score(Xsi_test, ysi_test))
no_neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(no_neighbors))
test_accuracy = np.empty(len(no_neighbors))
for i, k in enumerate(no_neighbors):
    # instantiate the classifier
    rcv_knn = KNeighborsClassifier(n_neighbors=k)
    # fit the classifier to the training data
    rcv_knn.fit(Xsi_train, ysi_train)
    # compute accuracy on the training set
    train_accuracy[i] = rcv_knn.score(Xsi_train, ysi_train)
    # compute accuracy on the testing set
    test_accuracy[i] = rcv_knn.score(Xsi_test, ysi_test)
# Visualization of k values vs accuracy
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(no_neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(no_neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()
Preliminary model score:
0.9666666666666667

Variable importance¶
scikit-learn's KNeighborsClassifier doesn't expose a feature_importances_ attribute the way tree-based models do, so we'll use a slightly hacked-together permutation approach.
Variable importance reflects how much a variable contributes to the model's predictions. If a variable is important, removing or permuting it has a larger effect on the model's output. So, if we measure the change such permutations cause, we can estimate the feature importance.
data = {'sepal_length': [0], 'sepal_width': [0], 'petal_length': [0], 'petal_width': [0]}
feat_imp = pd.DataFrame(data)
feat_imp.head()
# baseline
fin_knn = KNeighborsClassifier(n_neighbors=7)
fin_knn.fit(Xsi_train, ysi_train)
print(fin_knn.score(Xsi_test, ysi_test))
plot_confusion_matrix(fin_knn, Xsi_test, ysi_test, cmap='Blues')
0.9666666666666667
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fd15b156a30>

Change Sepal.Length¶
Xsi_test.head()
perm_SL = Xsi_test.copy() # copy the df; we don't want to alter the actual data
perm_SL['Sepal.Length'] = np.random.permutation(perm_SL['Sepal.Length']) # permute data
perm_SL.head()
fin_knn.score(perm_SL, ysi_test)
0.9
feat_imp['sepal_length'] = fin_knn.score(Xsi_test, ysi_test) - fin_knn.score(perm_SL, ysi_test)
feat_imp.head()
plot_confusion_matrix(fin_knn, perm_SL, ysi_test, cmap='Blues') # what got misclassified?
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fd1687a73a0>

Instead of making this repetitive, we can turn this into a function and loop.
def featureImportance(X, y, model):
    # create dataframe of variables
    var_imp = pd.DataFrame(columns=list(X.columns))
    var_imp.loc[0] = 0
    base_score = model.score(X, y)
    for col in list(X.columns):
        temp = X.copy()  # copy the df; we don't want to alter the actual data
        temp[col] = np.random.permutation(temp[col])  # permute data
        var_imp[col] = base_score - model.score(temp, y)
        # plot_confusion_matrix(model, temp, y, cmap='Blues')  # what got misclassified?
    print(var_imp)
featureImportance(Xsi_test, ysi_test, fin_knn)
Sepal.Length Sepal.Width Petal.Length Petal.Width
0 0.033333 0.0 0.4 0.266667
From this table, we can see which variables matter most: Petal.Length and Petal.Width cause the biggest drops in accuracy when permuted.
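For comparison, recent versions of scikit-learn (0.22+) ship this same idea as sklearn.inspection.permutation_importance, which shuffles each feature several times and averages the drop in score; a minimal sketch against our fitted model:
from sklearn.inspection import permutation_importance
result = permutation_importance(fin_knn, Xsi_test, ysi_test, n_repeats=10, random_state=1)
for name, imp in zip(Xsi_test.columns, result.importances_mean):
    print(name, round(imp, 3))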
General eval¶
plot_confusion_matrix(fin_knn, Xsi_test, ysi_test, cmap='Blues')
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7fe8f9c04b80>

Looks like we only misclassified one virginica as versicolor. Let’s see how certain our predictions were.
iris2_probs = fin_knn.predict_proba(Xsi_test)
iris2_probs
array([[0. , 0.14285714, 0.85714286],
[0. , 0.28571429, 0.71428571],
[0. , 1. , 0. ],
[0. , 0. , 1. ],
[1. , 0. , 0. ],
[0. , 0.85714286, 0.14285714],
[0. , 1. , 0. ],
[0. , 0.28571429, 0.71428571],
[0. , 0.14285714, 0.85714286],
[0. , 0.28571429, 0.71428571],
[0. , 1. , 0. ],
[1. , 0. , 0. ],
[1. , 0. , 0. ],
[1. , 0. , 0. ],
[0. , 0. , 1. ],
[0. , 0.42857143, 0.57142857],
[0. , 0.71428571, 0.28571429],
[0. , 1. , 0. ],
[1. , 0. , 0. ],
[0. , 0.85714286, 0.14285714],
[1. , 0. , 0. ],
[0. , 1. , 0. ],
[0. , 0.85714286, 0.14285714],
[1. , 0. , 0. ],
[0. , 0. , 1. ],
[1. , 0. , 0. ],
[0. , 1. , 0. ],
[1. , 0. , 0. ],
[1. , 0. , 0. ],
[0. , 0. , 1. ]])