
Week 6: kNN

First, import your libraries!

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from matplotlib import pyplot as plt

Data prep

Based on the following data summary, what questions and business metric should we use?

bank_data = pd.read_csv("https://raw.githubusercontent.com/UVADS/DS-3001/main/data/bank.csv")
bank_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43628 entries, 0 to 43627
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        43628 non-null  int64 
 1   job        43628 non-null  object
 2   marital    43628 non-null  object
 3   education  43628 non-null  object
 4   default    43628 non-null  object
 5   balance    43628 non-null  int64 
 6   housing    43628 non-null  object
 7   contact    43628 non-null  object
 8   duration   43628 non-null  int64 
 9   campaign   43628 non-null  int64 
 10  pdays      43628 non-null  int64 
 11  previous   43628 non-null  int64 
 12  poutcome   43628 non-null  object
 13  signed up  43628 non-null  int64 
dtypes: int64(7), object(7)
memory usage: 4.7+ MB

< your answer here >

Now, let’s check the composition of the data.

bank_data.marital.value_counts()   # 3 levels
married     26241
single      12355
divorced     5032
Name: marital, dtype: int64
bank_data.education.value_counts()   # 4 levels
secondary    22404
tertiary     12863
primary       6584
unknown       1777
Name: education, dtype: int64
bank_data.default.value_counts()   # 2 levels
no     42844
yes      784
Name: default, dtype: int64
bank_data.job.value_counts()   # 12 levels! What should we do?
blue-collar      9366
management       9142
technician       7321
admin.           5001
services         4010
retired          2184
self-employed    1530
entrepreneur     1433
unemployed       1259
housemaid        1199
student           907
unknown           276
Name: job, dtype: int64
bank_data.contact.value_counts()   # 3 levels -- difference between cellular and telephone?
cellular     28295
unknown      12523
telephone     2810
Name: contact, dtype: int64
bank_data.housing.value_counts()   # 2 levels
yes    24231
no     19397
Name: housing, dtype: int64
bank_data.poutcome.value_counts()   # 4 levels
unknown    35684
failure     4723
other       1783
success     1438
Name: poutcome, dtype: int64
bank_data['signed up'].value_counts()   # 2 levels
0    38554
1     5074
Name: signed up, dtype: int64

The job variable has 12 levels, so we should collapse it into fewer. In Python, this process is slightly different than it is in R.

employed = ['admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management',
           'self-employed', 'services', 'technician']   # note the trailing period in 'admin.' -- it must match the data exactly
# unemployed = ['retired', 'student', 'unemployed', 'unknown']
bank_data.job = bank_data.job.apply(lambda x: "Employed" if x in employed else "Unemployed")
bank_data.job.value_counts()
Employed      39002
Unemployed     4626
Name: job, dtype: int64
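An equivalent vectorized version, as a sketch using .isin together with numpy's where (both libraries are already imported above):

# vectorized alternative to the apply/lambda approach
bank_data.job = np.where(bank_data.job.isin(employed), "Employed", "Unemployed")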

Now, we convert the appropriate columns to categorical variables (pandas' category dtype, the analogue of R's factors).

# bank_data.info()   # check the variables
cat = ['job', 'marital', 'education', 'default', 'housing', 'contact',
      'poutcome', 'signed up']   # select the columns to convert
bank_data[cat] = bank_data[cat].astype('category')
bank_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43628 entries, 0 to 43627
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   age        43628 non-null  int64   
 1   job        43628 non-null  category
 2   marital    43628 non-null  category
 3   education  43628 non-null  category
 4   default    43628 non-null  category
 5   balance    43628 non-null  int64   
 6   housing    43628 non-null  category
 7   contact    43628 non-null  category
 8   duration   43628 non-null  int64   
 9   campaign   43628 non-null  int64   
 10  pdays      43628 non-null  int64   
 11  previous   43628 non-null  int64   
 12  poutcome   43628 non-null  category
 13  signed up  43628 non-null  category
dtypes: category(8), int64(6)
memory usage: 2.3 MB

Check for missing data

R uses the mice package, which lets you visualize the NaNs in a dataset and later impute them. There is no direct equivalent in Python, but we can still complete the same steps.
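Before plotting, a quick numeric check is often all you need; a minimal sketch using pandas alone:

# count and proportion of missing values per column
print(bank_data.isna().sum())    # raw NaN counts
print(bank_data.isna().mean())   # fraction missing per column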

Using the seaborn package, we can see the distribution of missing values. Along the x-axis, you will see the proportion of the data missing for that variable.

import seaborn as sns
sns.displot(
    data=bank_data.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    aspect=1.25
)
# plt.savefig("visualizing_missing_data_with_barplot_Seaborn_distplot.png", dpi=100)
# the above line will save the image to your computer!
[Figure: proportion of missing values per variable -- every bar shows 0% missing]

No missing data!!

Next, we normalize the numeric variables. We'll use min-max scaling, which maps each value to (x - min) / (max - min), so every numeric column ends up in the range [0, 1].

numeric_cols = bank_data.select_dtypes(include='int64').columns
print(numeric_cols)
Index(['age', 'balance', 'duration', 'campaign', 'pdays', 'previous'], dtype='object')
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
d = scaler.fit_transform(bank_data[numeric_cols])   # conduct data transformation
scaled_df = pd.DataFrame(d, columns=numeric_cols)   # convert back to pd df; transformation converts to array
bank_data[numeric_cols] = scaled_df   # put data back into the main df
bank_data.describe()   # as we can see, the data is now normalized!
[output: summary statistics -- all numeric columns now range from 0 to 1]

Now, we one-hot encode the data. For reference, this is the process of expanding each categorical variable into binary indicator columns (one per level) so that machine learning algorithms can use it.

cat_cols = bank_data.select_dtypes(include='category').columns
print(cat_cols)
Index(['job', 'marital', 'education', 'default', 'housing', 'contact',
       'poutcome', 'signed up'],
      dtype='object')
encoded = pd.get_dummies(bank_data[cat_cols])
encoded.head()   # note the new columns
[output: first five rows of the encoded DataFrame]
bank_data = bank_data.drop(cat_cols, axis=1)
bank_data = bank_data.join(encoded)
bank_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43628 entries, 0 to 43627
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  43628 non-null  float64
 1   balance              43628 non-null  float64
 2   duration             43628 non-null  float64
 3   campaign             43628 non-null  float64
 4   pdays                43628 non-null  float64
 5   previous             43628 non-null  float64
 6   job_Employed         43628 non-null  uint8  
 7   job_Unemployed       43628 non-null  uint8  
 8   marital_divorced     43628 non-null  uint8  
 9   marital_married      43628 non-null  uint8  
 10  marital_single       43628 non-null  uint8  
 11  education_primary    43628 non-null  uint8  
 12  education_secondary  43628 non-null  uint8  
 13  education_tertiary   43628 non-null  uint8  
 14  education_unknown    43628 non-null  uint8  
 15  default_no           43628 non-null  uint8  
 16  default_yes          43628 non-null  uint8  
 17  housing_no           43628 non-null  uint8  
 18  housing_yes          43628 non-null  uint8  
 19  contact_cellular     43628 non-null  uint8  
 20  contact_telephone    43628 non-null  uint8  
 21  contact_unknown      43628 non-null  uint8  
 22  poutcome_failure     43628 non-null  uint8  
 23  poutcome_other       43628 non-null  uint8  
 24  poutcome_success     43628 non-null  uint8  
 25  poutcome_unknown     43628 non-null  uint8  
 26  signed up_0          43628 non-null  uint8  
 27  signed up_1          43628 non-null  uint8  
dtypes: float64(6), uint8(22)
memory usage: 2.9 MB
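One caveat: get_dummies expanded the target into two complementary columns, signed up_0 and signed up_1, and every two-level predictor likewise produced a redundant pair (e.g. default_no/default_yes). If you want one dummy per binary variable, pandas supports a drop_first parameter; a sketch of what the encoding step could have looked like:

# keeps k-1 dummies per k-level variable, dropping e.g. 'default_no' and 'signed up_0'
encoded = pd.get_dummies(bank_data[cat_cols], drop_first=True)

Either way, remember that both target columns must be excluded from the features when we train below.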

The data is ready! Now, let’s build our model.

Train model

We’ll run the kNN algorithm on the banking data. First, we’ll check the prevalence of the target class.

bank_data['signed up_1'].value_counts()[1] / bank_data['signed up_1'].count()
0.11630145777940772
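Since the column is already 0/1, taking its mean gives the same prevalence in one step:

bank_data['signed up_1'].mean()   # proportion of positives -- same 0.116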

This means that picking at random, we have an 11.6% chance of correctly selecting an individual who signed up. Let's see if kNN can do any better.

"""
X = bank_data.drop(['signed up_1'], axis=1).values   # independent variables
y = bank_data['signed up_1'].values                  # dependent variable
"""

train, test = train_test_split(bank_data,  test_size=0.4, stratify = bank_data['signed up_1']) 
test, val = train_test_split(test, test_size=0.5, stratify=test['signed up_1'])

Now, let’s train the classifier for k=9.

import random
random.seed(1984)   # note: kNN itself is deterministic -- the randomness here comes from the
                    # train/test split, which `random_state` above already makes repeatable

X_train = train.drop(['signed up_1'], axis=1).values   # caution: 'signed up_0' (the complement of the
                                                       # target) is still in X and leaks the label --
                                                       # in practice, drop both dummy columns
y_train = train['signed up_1'].values

neigh = KNeighborsClassifier(n_neighbors=9)
neigh.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=9)
# now, we check the model's accuracy on the test data:

X_test = test.drop(['signed up_1'], axis=1).values
y_test = test['signed up_1'].values

neigh.score(X_test, y_test)
0.9903735961494384
# now, we test the accuracy on our validation data.

X_val = val.drop(['signed up_1'], axis=1).values
y_val = val['signed up_1'].values

neigh.score(X_val, y_val)
0.9919779967911987

Evaluate model

A 99.0% accuracy rate sounds pretty good, but keep in mind that the base rate is roughly 89/11: we have about a 90% chance of guessing right even if we know nothing about the customer. And the negative outcomes aren't really what we care about; this model's value lies in identifying sign-ups when they are actually sign-ups. That requires knowing our true positive rate, also called sensitivity or recall. So let's dig a little deeper.

# create a confusion matrix
from sklearn.metrics import plot_confusion_matrix
# note: plot_confusion_matrix was removed in scikit-learn 1.2;
# on newer versions, use ConfusionMatrixDisplay.from_estimator instead

plot_confusion_matrix(neigh, X_val, y_val, cmap='Blues')
plt.show()
[Figure: confusion matrix for the validation set]

tip: use this link to change the color scheme of your confusion matrix: https://matplotlib.org/stable/tutorials/colors/colormaps.html

# create classification report
from sklearn.metrics import classification_report

y_val_pred = neigh.predict(X_val)
print(classification_report(y_val, y_val_pred))   # the signature is (y_true, y_pred)
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      7711
           1       1.00      0.93      0.96      1015

    accuracy                           0.99      8726
   macro avg       0.99      0.96      0.98      8726
weighted avg       0.99      0.99      0.99      8726

# the report doesn't label sensitivity and specificity directly, so we'll calculate them
# ourselves from the validation confusion matrix counts.
sensitivity = 943/(943+72)    # = TP/(TP+FN)
specificity = 7707/(7707+4)   # = TN/(TN+FP)
print(sensitivity, specificity)
0.929064039408867 0.9994812605368953
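Rather than hardcoding the cell counts, you can pull them straight out of the confusion matrix; a sketch:

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_val, y_val_pred).ravel()   # the binary case unpacks cleanly
print(tp / (tp + fn), tn / (tn + fp))   # sensitivity, specificity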

Selecting the correct ‘k’

How does "k" affect classification accuracy? Let's create a function that calculates classification accuracy for a given value of k.

def chooseK(k, X_train, y_train, X_test, y_test):
    random.seed(1)   # not strictly needed: kNN is deterministic once the data is fixed
    print("calculating... ", k, "k")    # included so you can see the progress of the function as it runs
    class_knn = KNeighborsClassifier(n_neighbors=k)
    class_knn.fit(X_train, y_train)

    # calculate accuracy
    accu = class_knn.score(X_test, y_test)
    return accu

We’ll test odd k values from 1 to 21. We want to create a table of all the data, so we’ll use list comprehension to create the “accuracy” column.

remember: Python's range is end-exclusive; we want to go up to 21, so we extend the end bound to include it

# note: this reuses the name `test`, overwriting the test split DataFrame -- safe here only
# because X_test and y_test were already extracted above
test = pd.DataFrame({'k': list(range(1, 22, 2)),
                     'accu': [chooseK(x, X_train, y_train, X_test, y_test) for x in list(range(1, 22, 2))]})
calculating...  1 k
calculating...  3 k
calculating...  5 k
calculating...  7 k
calculating...  9 k
calculating...  11 k
calculating...  13 k
calculating...  15 k
calculating...  17 k
calculating...  19 k
calculating...  21 k
test
[output: table of k values and their accuracies]
test = test.sort_values(by=['accu'], ascending=False)
test
[output: the same table, sorted by accuracy in descending order]

From here, we see that the best value is k=1!

Let’s go through the code we wrote in a bit more detail, specifically regarding the DataFrame construction.

For reference, here’s the line of code we wrote:

test = pd.DataFrame({'k': list(range(1, 22, 2)),
                     'accu': [chooseK(x, X_train, y_train, X_test, y_test) for x in list(range(1, 22, 2))]})

pandas DataFrames can be constructed directly from the Python dictionary data type, which is identifiable by the use of curly brackets ({}) and key-value pairs. The keys correspond to the column names (i.e. 'k' or 'accu') while the values are lists of everything we want in those columns.

For 'k', we made a list of every other number in the range from 1 to 22 (end-exclusive). The syntax is range(start, stop, step); omitting step means every value in the range is selected.
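For instance:

list(range(1, 22, 2))   # [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]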

For ‘accu’, we used list comprehension, which boils down to being loop shorthand with the output being entered into a list. We could easily re-write the code as:

temp = []
for x in list(range(1, 22, 2)):
    temp.append(chooseK(x, X_train, y_train, X_test, y_test))

before adding the list to the DataFrame. The list comprehension is simply more concise, which is why we used it earlier.

Now, let’s graph our results!

test = test.sort_values(by=['k'])   # re-sort by k so the line plot connects the points in order
plt.plot(test['k'], test['accu'])
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.title('Model Accuracy')
plt.show()
[Figure: line plot of model accuracy vs. k]

Adjusting the threshold

# we want to make a table containing: probability, expected, and actual values

test_probs = neigh.predict_proba(X_test)
test_preds = neigh.predict(X_test)
# convert probabilities to pd df
test_probabilities = pd.DataFrame(test_probs, columns = ['not_signed_up_prob', 'signed_up_prob'])
test_probabilities
[output: DataFrame of predicted class probabilities]
final_model = pd.DataFrame({'actual_class': y_test.tolist(),
                           'pred_class': test_preds.tolist(),
                           'pred_prob': [test_probabilities['signed_up_prob'][i] if test_preds[i]==1 else test_probabilities['not_signed_up_prob'][i] for i in range(len(test_preds))]})
# that last line is some list comprehension -- to understand that here in particular click the following link:
# https://stackoverflow.com/questions/4260280/if-else-in-a-list-comprehension
final_model.head()
[output: first five rows of final_model]
# add a column about the probability the observation is in the positive class
final_model['pos_pred'] = [final_model.pred_prob[i] if final_model.pred_class[i]==1 else 1-final_model.pred_prob[i] for i in range(len(final_model.pred_class))]
final_model.head()
[output: first five rows, now with the pos_pred column]
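As an aside, the two comprehensions above cancel out: pred_prob is the probability of whichever class was predicted, and pos_pred converts that back into the positive-class probability, which is just the second column of predict_proba. A simpler equivalent:

final_model['pos_pred'] = test_probs[:, 1]   # probability of the positive class, directly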
# convert classes to categories
final_model.actual_class = final_model.actual_class.astype('category')
final_model.pred_class = final_model.pred_class.astype('category')
# create probability distribution graph
import seaborn as sns

sns.displot(final_model, x="pos_pred", kind="kde")
[Figure: kernel density estimate of the positive-class probabilities]
final_model.pos_pred.value_counts()
0.000000    7563
1.000000     799
0.111111      91
0.222222      49
0.555556      40
0.333333      40
0.444444      38
0.777778      37
0.666667      37
0.888889      32
Name: pos_pred, dtype: int64

In most datasets, many predictions fall somewhere between 0 and 1 rather than at the extremes, so they carry uncertainty. A threshold must be set for when you consider a prediction to actually belong to the positive class. Is 60% certainty positive? How about 40%? This is where you have more control over your model's classifications, and it is especially useful for reducing the kinds of incorrect classifications you may have noticed in your confusion matrix.

from sklearn.metrics import confusion_matrix

def adjust_thres(x, y, z):
    """
    x = predicted probabilities of the positive class
    y = threshold
    z = actual labels
    """
    thres = pd.DataFrame({'new_preds': [1 if i > y else 0 for i in x]})
    thres.new_preds = thres.new_preds.astype('category')
    con_mat = confusion_matrix(z, thres.new_preds)   # pass the column, not the whole DataFrame
    print(con_mat)
confusion_matrix(final_model.actual_class, final_model.pred_class)   # original model
array([[7704,    7],
       [  77,  938]])
adjust_thres(final_model.pos_pred, .90, final_model.actual_class)   # raise threshold 
[[7711    0]
 [ 216  799]]
adjust_thres(final_model.pos_pred, .3, final_model.actual_class)   # lower threshold
[[7664   47]
 [  39  976]]

More for next week: evaluation metrics

ROC/AUC curve

There are a few really cool graphing options, so I'll show you a couple. Several Python packages are interactive as well!

# basic graph
from sklearn import metrics

fpr, tpr, _ = metrics.roc_curve(y_test, final_model.pos_pred)
auc = metrics.roc_auc_score(y_test, final_model.pos_pred)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
[Figure: ROC curve, with the AUC shown in the legend]
# installs dependency for next graph
! pip install plot_metric
Requirement already satisfied: plot_metric in /opt/anaconda3/lib/python3.8/site-packages (0.0.6)
# a pretty cool one

from plot_metric.functions import BinaryClassification

# Visualization with plot_metric -- it expects scores/probabilities rather than hard class labels
bc = BinaryClassification(y_test, final_model.pos_pred, labels=["0", "1"])

# Figures
plt.figure(figsize=(5,5))
bc.plot_roc_curve()
plt.show()
[Figure: plot_metric ROC curve]

F1 score

metrics.f1_score(y_test, final_model.pred_class)
0.9571428571428572
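F1 is the harmonic mean of precision and recall, so we can sanity-check the value above; a quick sketch:

from sklearn.metrics import precision_score, recall_score

p = precision_score(y_test, final_model.pred_class)
r = recall_score(y_test, final_model.pred_class)
print(2 * p * r / (p + r))   # should match metrics.f1_score above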

LogLoss

metrics.log_loss(y_test, final_model.pred_class)
0.3324848515189683
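One caveat: log loss is designed to score predicted probabilities, not hard 0/1 labels -- passing pred_class maximally penalizes every misclassification. Scoring the probabilities is usually more informative:

metrics.log_loss(y_test, final_model.pos_pred)   # log loss on the positive-class probabilities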

Another quick example

from pydataset import data

iris = data("iris")
iris.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 1 to 150
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Sepal.Length  150 non-null    float64
 1   Sepal.Width   150 non-null    float64
 2   Petal.Length  150 non-null    float64
 3   Petal.Width   150 non-null    float64
 4   Species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 7.0+ KB
iris.describe()
[output: summary statistics for the iris data]
from sklearn.preprocessing import scale

cols = list(iris.columns[:4])

scaledIris = pd.DataFrame(scale(iris.iloc[:, :4]), index=iris.index, columns=cols)
scaledIris.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 1 to 150
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Sepal.Length  150 non-null    float64
 1   Sepal.Width   150 non-null    float64
 2   Petal.Length  150 non-null    float64
 3   Petal.Width   150 non-null    float64
dtypes: float64(4)
memory usage: 5.9 KB
scaledIris['Species'] = iris['Species']
# split datasets
irisTrain, irisTest = train_test_split(scaledIris,  test_size=0.4, stratify = scaledIris['Species']) 
irisTest, irisVal = train_test_split(irisTest, test_size=0.5, stratify = irisTest['Species'])
Xi_train = irisTrain.drop(['Species'], axis=1)
yi_train = irisTrain['Species']

Xi_test = irisTest.drop(['Species'], axis=1)
yi_test = irisTest['Species']

Xi_val = irisVal.drop(['Species'], axis=1)
yi_val = irisVal['Species']
iris_neigh = KNeighborsClassifier(n_neighbors=3)
iris_neigh.fit(Xi_train, yi_train)
KNeighborsClassifier(n_neighbors=3)
iris_neigh.score(Xi_test, yi_test)
0.9666666666666667
iris_neigh.score(Xi_val, yi_val)
0.9
plot_confusion_matrix(iris_neigh, Xi_val, yi_val, cmap='Blues')  
plt.show()
[Figure: confusion matrix for the iris validation set]

Example using repeated 10-fold cross-validation

from sklearn.model_selection import RepeatedKFold

rkf = RepeatedKFold(n_splits=10, n_repeats=3, random_state=12) 

X_si = scaledIris.drop(['Species'], axis=1)
y_si = scaledIris['Species']
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

cv_neigh = KNeighborsClassifier(n_neighbors=3)   # create classifier
scores = cross_val_score(cv_neigh, X_si, y_si, scoring='accuracy', cv=rkf, n_jobs=-1)   # do repeated cv

print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
Accuracy: 0.949 (0.061)
plt.plot(scores)
[Figure: accuracy across the 30 cross-validation folds]
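We imported GridSearchCV above but haven't used it yet; it automates the kind of k search we wrote by hand in chooseK. A minimal sketch, reusing the rkf splitter (the grid of odd k values is just for illustration):

param_grid = {'n_neighbors': list(range(1, 22, 2))}   # odd k values from 1 to 21
grid = GridSearchCV(KNeighborsClassifier(), param_grid, scoring='accuracy', cv=rkf, n_jobs=-1)
grid.fit(X_si, y_si)
print(grid.best_params_, grid.best_score_)   # best k and its mean cross-validated accuracy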
# more complex version so you can create a graph for testing and training accuracy (not built into the previous version)

#Split arrays or matrices into train and test subsets
Xsi_train, Xsi_test, ysi_train, ysi_test = train_test_split(X_si, y_si, test_size=0.20) 
rcv_knn = KNeighborsClassifier(n_neighbors=6)
rcv_knn.fit(Xsi_train, ysi_train)

print("Preliminary model score:")
print(rcv_knn.score(Xsi_test, ysi_test))

no_neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(no_neighbors))
test_accuracy = np.empty(len(no_neighbors))

for i, k in enumerate(no_neighbors):
    # We instantiate the classifier
    rcv_knn = KNeighborsClassifier(n_neighbors=k)
    # Fit the classifier to the training data
    rcv_knn.fit(Xsi_train, ysi_train)
    
    # Compute accuracy on the training set
    train_accuracy[i] = rcv_knn.score(Xsi_train, ysi_train)

    # Compute accuracy on the testing set
    test_accuracy[i] = rcv_knn.score(Xsi_test, ysi_test)

# Visualization of k values vs accuracy
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(no_neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(no_neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()
Preliminary model score:
0.9666666666666667
[Figure: training and testing accuracy vs. number of neighbors]

Variable importance

scikit-learn's kNN models don't expose variable importance directly, so we'll use a slightly hacked-together solution.

Variable importance reflects how much a single variable contributes to the model. If a variable is more important, removing or permuting it has a larger effect on the model's output. So, by checking how much such permutations change the score, we can extract a measure of feature importance.

data = {'sepal_length': [0], 'sepal_width': [0], 'petal_length': [0], 'petal_width': [0]}
feat_imp = pd.DataFrame(data)
feat_imp.head()
[output: one-row DataFrame of zeros, to be filled with importances]
# baseline

fin_knn = KNeighborsClassifier(n_neighbors=7)
fin_knn.fit(Xsi_train, ysi_train)

print(fin_knn.score(Xsi_test, ysi_test))
plot_confusion_matrix(fin_knn, Xsi_test, ysi_test, cmap='Blues')  
0.9666666666666667
[Figure: baseline confusion matrix on the iris test set]

Change Sepal.Length

Xsi_test.head()
[output: first five rows of the (unpermuted) test set]
perm_SL = Xsi_test.copy()   # copy the df; we don't want to alter the actual data
perm_SL['Sepal.Length'] = np.random.permutation(perm_SL['Sepal.Length'])   # permute the column
perm_SL.head()
[output: the test set with Sepal.Length shuffled]
fin_knn.score(perm_SL, ysi_test)
0.9
feat_imp['sepal_length'] = fin_knn.score(Xsi_test, ysi_test) - fin_knn.score(perm_SL, ysi_test)
feat_imp.head()
[output: feat_imp with the sepal_length importance filled in]
plot_confusion_matrix(fin_knn, perm_SL, ysi_test, cmap='Blues')   # what got misclassified?
[Figure: confusion matrix after permuting Sepal.Length]

Rather than repeating this for every column, we can wrap the process in a function and loop.

def featureImportance(X, y, model):
    # create a one-row dataframe to hold the importance of each variable
    var_imp = pd.DataFrame(columns=list(X.columns))
    var_imp.loc[0] = 0
    base_score = model.score(X, y)
    for col in list(X.columns):
        temp = X.copy()   # copy the df; we don't want to alter the actual data
        temp[col] = np.random.permutation(temp[col])   # permute one column
        var_imp[col] = base_score - model.score(temp, y)   # drop in accuracy = importance
        # plot_confusion_matrix(model, temp, y, cmap='Blues')   # what got misclassified?
    print(var_imp)
featureImportance(Xsi_test, ysi_test, fin_knn)
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
0      0.033333          0.0           0.4     0.266667

From here, we can read off the important variables: Petal.Length and Petal.Width matter most!
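For reference, recent scikit-learn versions (0.22+) ship this same idea as permutation_importance, which repeats the shuffle several times and averages the score drops; a minimal sketch:

from sklearn.inspection import permutation_importance

result = permutation_importance(fin_knn, Xsi_test, ysi_test, n_repeats=10, random_state=0)
print(dict(zip(Xsi_test.columns, result.importances_mean)))   # mean drop in accuracy per feature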

General eval

plot_confusion_matrix(fin_knn, Xsi_test, ysi_test, cmap='Blues')
[Figure: confusion matrix on the iris test set]

Looks like we only misclassified one virginica as versicolor. Let’s see how certain our predictions were.

iris2_probs = fin_knn.predict_proba(Xsi_test)
iris2_probs
array([[0.        , 0.14285714, 0.85714286],
       [0.        , 0.28571429, 0.71428571],
       [0.        , 1.        , 0.        ],
       [0.        , 0.        , 1.        ],
       [1.        , 0.        , 0.        ],
       [0.        , 0.85714286, 0.14285714],
       [0.        , 1.        , 0.        ],
       [0.        , 0.28571429, 0.71428571],
       [0.        , 0.14285714, 0.85714286],
       [0.        , 0.28571429, 0.71428571],
       [0.        , 1.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        ],
       [0.        , 0.42857143, 0.57142857],
       [0.        , 0.71428571, 0.28571429],
       [0.        , 1.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [0.        , 0.85714286, 0.14285714],
       [1.        , 0.        , 0.        ],
       [0.        , 1.        , 0.        ],
       [0.        , 0.85714286, 0.14285714],
       [1.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        ],
       [1.        , 0.        , 0.        ],
       [0.        , 1.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [0.        , 0.        , 1.        ]])
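Notice that every probability is a multiple of 1/7: with k=7 neighbors, a class's predicted probability is simply the fraction of the 7 nearest neighbors belonging to that class.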