Graduation Lab (Week 6) - Foundations of Machine Learning I Book

Instructions:

Let’s build a kNN model using the college completion data from last week. The data is messy and has a degrees-of-freedom problem: there are too many features.

You’ve done most of the hard work already, so you should be ready to move forward with building your model.

Use the question and target variable you submitted last week, and build a model that answers the question you created for this dataset.
Build and optimize a kNN model to predict your target variable. That is, use the tune set to select the best k value.
Experiment with the threshold function: what happens at higher and lower thresholds? Document what you see in comments.
Evaluate the results using the confusion matrix (at the default threshold). Then talk through your question and summarize the concerns or positive elements you see in the model.
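
The steps above can be sketched roughly as follows. This is a minimal illustration and not a solution to the lab: the data is a synthetic stand-in from `make_classification`, the 70/15/15 split, the candidate k values, and the helper names `choose_k` and `predict_at` are all assumptions made for the example. Swap in your own cleaned features and target.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the cleaned features (X) and a binary target (y).
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)

# Assumed split: 70% train, 15% tune, 15% test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.70, stratify=y, random_state=0)
X_tune, X_test, y_tune, y_test = train_test_split(
    X_rest, y_rest, train_size=0.50, stratify=y_rest, random_state=0)

# kNN is distance based, so put the features on a common scale (fit on train only).
scaler = MinMaxScaler().fit(X_train)
X_train, X_tune, X_test = (scaler.transform(part) for part in (X_train, X_tune, X_test))


def choose_k(k_values):
    """Hypothetical helper: return (best_k, scores) using accuracy on the tune set."""
    scores = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        scores[k] = accuracy_score(y_tune, model.predict(X_tune))
    return max(scores, key=scores.get), scores


# Odd k values avoid tied votes for a binary target.
best_k, tune_scores = choose_k(range(1, 22, 2))
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)

# Default threshold: with a binary target and odd k, predict() is a majority vote,
# which matches cutting predict_proba at 0.5.
default_cm = confusion_matrix(y_test, final_model.predict(X_test), labels=[0, 1])
print("best k:", best_k)
print(default_cm)


def predict_at(threshold):
    """Hypothetical helper: label a row positive when P(class 1) >= threshold."""
    prob_pos = final_model.predict_proba(X_test)[:, 1]
    return (prob_pos >= threshold).astype(int)


# Lower thresholds flag more rows as positive, higher thresholds flag fewer.
for threshold in (0.3, 0.5, 0.7):
    tn, fp, fn, tp = confusion_matrix(y_test, predict_at(threshold), labels=[0, 1]).ravel()
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

The test set is only touched after k has been chosen, so the confusion matrix gives a fair picture of the final model.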

Bonus: Adjust the function that selects k so that it optimizes specificity instead of accuracy.
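
One possible shape for the bonus, as a sketch only. Specificity is the true negative rate, TN / (TN + FP). The example assumes label 0 is the negative class, uses synthetic data, and the function names `specificity` and `choose_k_by_specificity` are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def specificity(y_true, y_pred):
    """True negative rate, TN / (TN + FP), treating label 0 as the negative class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tn / (tn + fp) if (tn + fp) else 0.0


def choose_k_by_specificity(X_train, y_train, X_tune, y_tune, k_values):
    """Hypothetical helper: score each k on the tune set by specificity."""
    scores = {}
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        scores[k] = specificity(y_tune, model.predict(X_tune))
    return max(scores, key=scores.get), scores


# Synthetic stand-in data; replace with your own train and tune sets.
X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X_train, X_tune, y_train, y_tune = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

best_k, scores = choose_k_by_specificity(X_train, y_train, X_tune, y_tune, range(1, 16, 2))
print("best k by specificity:", best_k)
```

Keep an eye on the other metrics as well, because a k chosen only for specificity can give up a lot of sensitivity.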

Example of how I cleaned the data

README for the dataset: https://data.world/databeats/college-completion/workspace/file?filename=README.txt

import pandas as pd

grad_data = pd.read_csv('https://query.data.world/s/qpi2ltkz23yp2fcaz4jmlrskjx5qnp', encoding="cp1252")
# the encoding part here is important to properly read the data! It doesn't apply to ALL csv files read from the web,
# but it was necessary here.
grad_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3798 entries, 0 to 3797
Data columns (total 62 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   unitid                                3798 non-null   int64  
 1   chronname                             3798 non-null   object 
 2   city                                  3798 non-null   object 
 3   state                                 3798 non-null   object 
 4   level                                 3798 non-null   object 
 5   control                               3798 non-null   object 
 6   basic                                 3798 non-null   object 
 7   hbcu                                  94 non-null     object 
 8   flagship                              50 non-null     object 
 9   long_x                                3798 non-null   float64
 10  lat_y                                 3798 non-null   float64
 11  site                                  3779 non-null   object 
 12  student_count                         3798 non-null   int64  
 13  awards_per_value                      3798 non-null   float64
 14  awards_per_state_value                3798 non-null   float64
 15  awards_per_natl_value                 3798 non-null   float64
 16  exp_award_value                       3798 non-null   int64  
 17  exp_award_state_value                 3798 non-null   int64  
 18  exp_award_natl_value                  3798 non-null   int64  
 19  exp_award_percentile                  3798 non-null   int64  
 20  ft_pct                                3794 non-null   float64
 21  fte_value                             3798 non-null   int64  
 22  fte_percentile                        3798 non-null   int64  
 23  med_sat_value                         1337 non-null   float64
 24  med_sat_percentile                    1337 non-null   float64
 25  aid_value                             3797 non-null   float64
 26  aid_percentile                        3797 non-null   float64
 27  endow_value                           2323 non-null   float64
 28  endow_percentile                      2323 non-null   float64
 29  grad_100_value                        3467 non-null   float64
 30  grad_100_percentile                   3467 non-null   float64
 31  grad_150_value                        3467 non-null   float64
 32  grad_150_percentile                   3467 non-null   float64
 33  pell_value                            3797 non-null   float64
 34  pell_percentile                       3797 non-null   float64
 35  retain_value                          3535 non-null   float64
 36  retain_percentile                     3535 non-null   float64
 37  ft_fac_value                          3785 non-null   float64
 38  ft_fac_percentile                     3785 non-null   float64
 39  vsa_year                              279 non-null    float64
 40  vsa_grad_after4_first                 279 non-null    float64
 41  vsa_grad_elsewhere_after4_first       279 non-null    float64
 42  vsa_enroll_after4_first               279 non-null    float64
 43  vsa_enroll_elsewhere_after4_first     279 non-null    float64
 44  vsa_grad_after6_first                 279 non-null    float64
 45  vsa_grad_elsewhere_after6_first       279 non-null    float64
 46  vsa_enroll_after6_first               279 non-null    float64
 47  vsa_enroll_elsewhere_after6_first     279 non-null    float64
 48  vsa_grad_after4_transfer              270 non-null    float64
 49  vsa_grad_elsewhere_after4_transfer    270 non-null    float64
 50  vsa_enroll_after4_transfer            270 non-null    float64
 51  vsa_enroll_elsewhere_after4_transfer  270 non-null    float64
 52  vsa_grad_after6_transfer              270 non-null    float64
 53  vsa_grad_elsewhere_after6_transfer    270 non-null    float64
 54  vsa_enroll_after6_transfer            270 non-null    float64
 55  vsa_enroll_elsewhere_after6_transfer  270 non-null    float64
 56  similar                               3579 non-null   object 
 57  state_sector_ct                       3798 non-null   int64  
 58  carnegie_ct                           3798 non-null   int64  
 59  counted_pct                           3372 non-null   object 
 60  nicknames                             310 non-null    object 
 61  cohort_size                           3467 non-null   float64
dtypes: float64(40), int64(10), object(12)
memory usage: 1.8+ MB
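
To see why the encoding argument on the `read_csv` call matters, here is a small self-contained sketch. The two-line CSV is a toy example invented for illustration (it is not taken from the dataset). It contains a curly apostrophe, which cp1252 stores as a single byte that is not valid UTF-8.

```python
import io

import pandas as pd

# Toy CSV, encoded the way a cp1252 file would be stored on disk.
raw = "chronname,city\nSt. Mary’s College,Moraga\n".encode("cp1252")

# Decoding those bytes as UTF-8 fails, because byte 0x92 is not a valid UTF-8 start byte.
# read_csv decodes as UTF-8 unless told otherwise, so it would trip over the same byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the matching encoding reads the text correctly.
toy = pd.read_csv(io.BytesIO(raw), encoding="cp1252")
print(utf8_ok)
print(toy.loc[0, "chronname"])
```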

grad_data.shape

(3798, 62)

# We have a lot of data! Many of these columns have lots of missing values or are otherwise not useful.

to_drop = list(range(39, 56))
to_drop.extend([27, 9, 10, 11, 28, 36, 60, 56])

grad_data1 = grad_data.drop(grad_data.columns[to_drop], axis=1)
grad_data1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3798 entries, 0 to 3797
Data columns (total 37 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   unitid                  3798 non-null   int64  
 1   chronname               3798 non-null   object 
 2   city                    3798 non-null   object 
 3   state                   3798 non-null   object 
 4   level                   3798 non-null   object 
 5   control                 3798 non-null   object 
 6   basic                   3798 non-null   object 
 7   hbcu                    94 non-null     object 
 8   flagship                50 non-null     object 
 9   student_count           3798 non-null   int64  
 10  awards_per_value        3798 non-null   float64
 11  awards_per_state_value  3798 non-null   float64
 12  awards_per_natl_value   3798 non-null   float64
 13  exp_award_value         3798 non-null   int64  
 14  exp_award_state_value   3798 non-null   int64  
 15  exp_award_natl_value    3798 non-null   int64  
 16  exp_award_percentile    3798 non-null   int64  
 17  ft_pct                  3794 non-null   float64
 18  fte_value               3798 non-null   int64  
 19  fte_percentile          3798 non-null   int64  
 20  med_sat_value           1337 non-null   float64
 21  med_sat_percentile      1337 non-null   float64
 22  aid_value               3797 non-null   float64
 23  aid_percentile          3797 non-null   float64
 24  grad_100_value          3467 non-null   float64
 25  grad_100_percentile     3467 non-null   float64
 26  grad_150_value          3467 non-null   float64
 27  grad_150_percentile     3467 non-null   float64
 28  pell_value              3797 non-null   float64
 29  pell_percentile         3797 non-null   float64
 30  retain_value            3535 non-null   float64
 31  ft_fac_value            3785 non-null   float64
 32  ft_fac_percentile       3785 non-null   float64
 33  state_sector_ct         3798 non-null   int64  
 34  carnegie_ct             3798 non-null   int64  
 35  counted_pct             3372 non-null   object 
 36  cohort_size             3467 non-null   float64
dtypes: float64(18), int64(10), object(9)
memory usage: 1.1+ MB
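
Dropping by position works here, but the index numbers shift whenever a column is added or removed. As an alternative sketch (not what was done above), columns can be dropped by name or by their share of missing values. The toy frame and the 0.5 cutoff are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame that borrows a few column names from the dataset; the values are made up.
toy = pd.DataFrame({
    "unitid": [1, 2, 3, 4],
    "student_count": [100, 250, 80, 900],
    "vsa_year": [np.nan, np.nan, np.nan, 2010.0],
    "vsa_grad_after4_first": [np.nan, np.nan, np.nan, 14.7],
    "nicknames": [np.nan, "UA", np.nan, np.nan],
})

# Option 1: drop every column whose name starts with "vsa_".
vsa_cols = [col for col in toy.columns if col.startswith("vsa_")]
by_name = toy.drop(columns=vsa_cols)

# Option 2: drop any column where more than half of the values are missing.
missing_share = toy.isna().mean()
sparse_cols = missing_share[missing_share > 0.5].index.tolist()
by_share = toy.drop(columns=sparse_cols)

print(list(by_name.columns))
print(list(by_share.columns))
```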

# drop even more data that doesn't look predictive
drop_more = [0,2,3,6,8,11,12,14,15,18,21,23,29,32,33,34,35]
grad_data2 = grad_data1.drop(grad_data1.columns[drop_more], axis=1)
grad_data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3798 entries, 0 to 3797
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   chronname             3798 non-null   object 
 1   level                 3798 non-null   object 
 2   control               3798 non-null   object 
 3   hbcu                  94 non-null     object 
 4   student_count         3798 non-null   int64  
 5   awards_per_value      3798 non-null   float64
 6   exp_award_value       3798 non-null   int64  
 7   exp_award_percentile  3798 non-null   int64  
 8   ft_pct                3794 non-null   float64
 9   fte_percentile        3798 non-null   int64  
 10  med_sat_value         1337 non-null   float64
 11  aid_value             3797 non-null   float64
 12  grad_100_value        3467 non-null   float64
 13  grad_100_percentile   3467 non-null   float64
 14  grad_150_value        3467 non-null   float64
 15  grad_150_percentile   3467 non-null   float64
 16  pell_value            3797 non-null   float64
 17  retain_value          3535 non-null   float64
 18  ft_fac_value          3785 non-null   float64
 19  cohort_size           3467 non-null   float64
dtypes: float64(12), int64(4), object(4)
memory usage: 593.6+ KB

grad_data2.head()

import numpy as np
grad_data2.replace('NULL', np.nan, inplace=True)

grad_data2['hbcu'] = (grad_data2['hbcu'] == 'X').astype(int)
grad_data2['hbcu'].value_counts()

0    3704
1      94
Name: hbcu, dtype: int64

grad_data2['hbcu'] = grad_data2.hbcu.astype('category')

# convert more variables to factors
grad_data2[['level', 'control']] = grad_data2[['level', 'control']].astype('category')

# In R, we would convert these values to numbers, but they were already read in as numeric here
grad_data2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3798 entries, 0 to 3797
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   chronname             3798 non-null   object  
 1   level                 3798 non-null   category
 2   control               3798 non-null   category
 3   hbcu                  3798 non-null   category
 4   student_count         3798 non-null   int64   
 5   awards_per_value      3798 non-null   float64 
 6   exp_award_value       3798 non-null   int64   
 7   exp_award_percentile  3798 non-null   int64   
 8   ft_pct                3794 non-null   float64 
 9   fte_percentile        3798 non-null   int64   
 10  med_sat_value         1337 non-null   float64 
 11  aid_value             3797 non-null   float64 
 12  grad_100_value        3467 non-null   float64 
 13  grad_100_percentile   3467 non-null   float64 
 14  grad_150_value        3467 non-null   float64 
 15  grad_150_percentile   3467 non-null   float64 
 16  pell_value            3797 non-null   float64 
 17  retain_value          3535 non-null   float64 
 18  ft_fac_value          3785 non-null   float64 
 19  cohort_size           3467 non-null   float64 
dtypes: category(3), float64(12), int64(4), object(1)
memory usage: 516.0+ KB
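
Before the data goes into a kNN model, the category columns need to become numbers and the numeric columns need a common scale, since kNN works on distances. A minimal sketch of one way to do that, using a toy frame with made-up values. The choice of `MinMaxScaler` and one-hot encoding is an assumption for the example, not a required approach.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy rows with two category columns and two numeric columns (values are made up).
toy = pd.DataFrame({
    "level": pd.Categorical(["4-year", "2-year", "4-year"]),
    "control": pd.Categorical(["Public", "Private", "Public"]),
    "student_count": [4051, 882, 11502],
    "pell_value": [71.2, 35.1, 40.0],
})

# Rescale each numeric column to the range 0 to 1 so that no single feature dominates the distance.
numeric_cols = ["student_count", "pell_value"]
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(toy[numeric_cols]),
    columns=numeric_cols,
    index=toy.index,
)

# One-hot encode the category columns into 0/1 indicator columns.
dummies = pd.get_dummies(toy[["level", "control"]], dtype=int)

encoded = pd.concat([scaled, dummies], axis=1)
print(encoded)
```

In a real workflow the scaler would be fit on the training split only and then applied to the tune and test splits.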

# check missing data
import seaborn as sns

sns.displot(
    data=grad_data2.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    aspect=1.25
)

<seaborn.axisgrid.FacetGrid at 0x7fc9593cd340>
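
If the plot is hard to read, the same information can be pulled out as numbers. The sketch below uses a toy frame (made-up values) to show what `isna().melt(...)` produces and how to get the share of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing values.
toy = pd.DataFrame({
    "ft_pct": [93.8, np.nan, 62.7, 90.5],
    "med_sat_value": [823.0, np.nan, np.nan, np.nan],
    "pell_value": [71.2, 35.1, 40.0, 55.3],
})

# isna() marks every cell True/False; melt() stacks that into one row per cell,
# with the column name in "variable" and the flag in "missing".
long_form = toy.isna().melt(value_name="missing")
print(long_form.head())

# The mean of a True/False column is the share of missing values in that column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```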

# let's drop med_sat_value, then delete the remaining rows that contain NAs
grad_data2 = grad_data2.drop(columns=['med_sat_value'])
grad_data2.dropna(axis = 0, how = 'any', inplace = True)

sns.displot(
    data=grad_data2.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    aspect=1.25
)

<seaborn.axisgrid.FacetGrid at 0x7fc9598f06d0>
