Foundations of Machine Learning I Book

Congrats! You just graduated UVA’s BSDS program and got a job working at a movie studio in Hollywood.

Your boss is the head of the studio and wants to know if they can gain a competitive advantage by predicting which new movies might get high IMDb scores (movie ratings).

You would like to be able to explain the model to mere mortals but need a fairly robust and flexible approach, so you’ve chosen to use decision trees to get started.

In doing so, like the great data scientists of the past, you remembered the excellent education provided to you at UVA in an undergrad data science course and have outlined 20ish steps that will need to be undertaken to complete this task. As always, you will need to make sure to #comment your work heavily.

Footnotes:

You can add or combine steps if needed.
Also, remember to try several methods during evaluation and always be mindful of how the model will be used in practice.
Make sure all your variables are the correct type (categorical, string, numeric, etc.).

import pandas as pd
import numpy as np

#1 Load the data
#Sometimes you need to set the working directory back out of a folder that a file was created in

#import os
#os.listdir()
#print(os.getcwd())
#os.chdir('c:\\Users\\Brian Wright\\Documents\\3001Python\\DS-3001')

movie_metadata=pd.read_csv("data/movie_metadata.csv")

#2 Ensure all the variables are classified correctly, including the target variable, and collapse factor variables as needed.
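
One way to approach step 2 is sketched below. It is only a sketch under stated assumptions: the column names `imdb_score` and `content_rating`, the cutoff of 7.0 for a "high" score, and the helper name `prepare_types` are illustrative choices, not part of the assignment, so check them against the actual file.

```python
import pandas as pd

def prepare_types(df, target_col="imdb_score", cutoff=7.0,
                  cat_cols=("content_rating",), top_n=3):
    """Return a copy with a binary target and collapsed categorical columns.

    All defaults (column names, cutoff, top_n) are assumptions for illustration.
    """
    out = df.copy()
    # Binary target: 1 when the score is at or above the assumed cutoff.
    # Rows with a missing score compare as False and end up as 0.
    out["high_score"] = (out[target_col] >= cutoff).astype(int)
    for col in cat_cols:
        # Keep the top_n most frequent levels; lump the rest (and NaN) into "Other"
        top_levels = out[col].value_counts().nlargest(top_n).index
        out[col] = out[col].where(out[col].isin(top_levels), "Other").astype("category")
    return out

# Possible usage: movies = prepare_types(movie_metadata)
```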

#3 Check for missing values and correct as needed.
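
A minimal sketch of one missing-data strategy follows. The 50% threshold, the median imputation, and the helper name `handle_missing` are assumptions for illustration. Other choices (dropping rows only, or imputing after the split to avoid leakage) may suit the data better.

```python
import pandas as pd

def handle_missing(df, max_missing_frac=0.5):
    """Drop mostly-empty columns, median-fill numeric ones, drop remaining NaN rows."""
    out = df.copy()
    # Fraction of missing values per column, useful to inspect before deciding
    missing_frac = out.isna().mean()
    # Drop columns that are missing more than the assumed threshold
    out = out.loc[:, missing_frac <= max_missing_frac].copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Fill numeric gaps with the column median
            out[col] = out[col].fillna(out[col].median())
    # Any rows still missing a non-numeric value are dropped
    out = out.dropna()
    return out, missing_frac

# Possible usage: movies, missing_report = handle_missing(movie_metadata)
```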

#4 Guess what, you don’t need to scale the data. DTs make local greedy decisions by thresholding one feature at a time, so rescaling a feature does not change the splits. It keeps getting easier, go to the next step.

#5 Determine the base rate (prevalence) for the classifier. What does this number mean?
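
The base rate can be computed directly from the target column. The sketch below assumes a binary target coded 0/1 with 1 as the positive class. The helper name `base_rate` is hypothetical.

```python
import pandas as pd

def base_rate(y, positive=1):
    """Return (prevalence of the positive class, accuracy of always guessing the majority)."""
    y = pd.Series(y)
    prevalence = float((y == positive).mean())
    # A model must beat this number to be better than guessing the majority class
    majority_accuracy = max(prevalence, 1 - prevalence)
    return prevalence, majority_accuracy
```

For example, if one movie in four is high scoring, the prevalence is 0.25 and a model that always predicts "not high" is already 75% accurate.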

#6 Split your data into train, tune, and test sets (80/10/10).
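
An 80/10/10 split can be sketched with two calls to `train_test_split`: first hold out 20%, then split that holdout in half. Stratifying on the target and the seed value are assumptions, and the helper name is hypothetical.

```python
from sklearn.model_selection import train_test_split

def split_train_tune_test(X, y, seed=42):
    """Split into 80% train, 10% tune, 10% test, stratified on y."""
    # First cut: 80% train, 20% held out
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, train_size=0.80, stratify=y, random_state=seed
    )
    # Second cut: split the held-out 20% in half (10% tune, 10% test)
    X_tune, X_test, y_tune, y_test = train_test_split(
        X_hold, y_hold, train_size=0.50, stratify=y_hold, random_state=seed
    )
    return X_train, X_tune, X_test, y_train, y_tune, y_test
```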

#7 Create the kfold object for cross validation.
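
One possible kfold object is a repeated, stratified k-fold. The numbers of splits and repeats and the seed below are assumptions, not requirements of the assignment.

```python
from sklearn.model_selection import RepeatedStratifiedKFold

# 10 folds repeated 5 times = 50 train/validation fits per hyperparameter value
kf = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=42)
```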

#8 Create the scoring metric you will use to evaluate your model and the grid of values for the max depth hyperparameter.
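
A sketch of the scoring and search-grid definitions is shown here. The three metric names are built-in scikit-learn scorer strings. Choosing these three and a depth range of 1 to 10 is an assumption.

```python
# Built-in scikit-learn scorer names; several are tracked so they can be compared later
scoring = ["roc_auc", "recall", "balanced_accuracy"]

# Candidate values for the max_depth hyperparameter (1 through 10)
param_grid = {"max_depth": list(range(1, 11))}
```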

#9 Build the classifier object
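
The classifier object itself is a one-liner. Fixing `random_state` is an assumption made here only so that repeated runs give the same tree.

```python
from sklearn.tree import DecisionTreeClassifier

# Unfitted decision tree; max_depth is left unset so the grid search can vary it
tree_clf = DecisionTreeClassifier(random_state=42)
```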

#10 Use the kfold object and the scoring metric to find the best hyperparameter value for max depth via the grid search method.

#11 Fit the model to the training data.

#12 What is the best depth value?
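
Steps 10 through 12 can be combined in one sketch: build the grid search, fit it on the training data, and read off the best depth. The helper name `tune_max_depth`, the 5-fold CV, the depth range, and refitting on `roc_auc` are all assumptions.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def tune_max_depth(X_train, y_train, depths=range(1, 11), seed=42):
    """Grid-search max_depth with stratified CV and return the fitted search object."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(
        estimator=DecisionTreeClassifier(random_state=seed),
        param_grid={"max_depth": list(depths)},
        scoring=["roc_auc", "recall", "balanced_accuracy"],
        refit="roc_auc",  # the metric used to pick and refit the final model
        cv=cv,
    )
    search.fit(X_train, y_train)
    return search

# Possible usage:
# search = tune_max_depth(X_train, y_train)
# print(search.best_params_["max_depth"])
```

The fitted object also exposes `search.best_estimator_` (the refit tree) and `search.cv_results_` (per-depth scores for every metric).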

#13 Print out the model
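
A plain-text view of the fitted tree is often enough to explain it. This sketch uses scikit-learn's `export_text`. The wrapper name `describe_tree` is hypothetical.

```python
from sklearn.tree import export_text

def describe_tree(model, feature_names):
    """Return the fitted tree's split rules as indented text."""
    return export_text(model, feature_names=list(feature_names))

# Possible usage: print(describe_tree(search.best_estimator_, X_train.columns))
```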

#14 View the results and comment on how the model performed using several evaluation metrics.
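
To compare several metrics at once, a small helper like the one below may help. It assumes a binary target coded 0/1 and a fitted model with `predict_proba`. The name `evaluate` is hypothetical, and the same helper could be reused on the tune and test sets.

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

def evaluate(model, X, y):
    """Return a dict of common classification metrics for a fitted binary model."""
    pred = model.predict(X)
    prob = model.predict_proba(X)[:, 1]  # probability of the positive class (1)
    return {
        "accuracy": accuracy_score(y, pred),
        "balanced_accuracy": balanced_accuracy_score(y, pred),
        "precision": precision_score(y, pred, zero_division=0),
        "recall": recall_score(y, pred, zero_division=0),
        "f1": f1_score(y, pred, zero_division=0),
        "roc_auc": roc_auc_score(y, prob),
    }
```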

#15 Which variables appear to be contributing the most (variable importance)?
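
Variable importance can be read from the fitted tree's `feature_importances_` attribute. The wrapper below is a hypothetical helper that only labels and sorts those values.

```python
import pandas as pd

def importance_table(model, feature_names):
    """Return impurity-based feature importances, largest first."""
    return pd.Series(
        model.feature_importances_, index=list(feature_names)
    ).sort_values(ascending=False)
```

These importances are impurity based, so they describe how the tree used each variable on the training data rather than a causal effect.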

#16 Use the predict method on the test data and print out the results.
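
A sketch for collecting test-set predictions in one table follows. It assumes `X_test` is a DataFrame, the target is coded 0/1, and the model is already fitted. The helper and column names are hypothetical.

```python
import pandas as pd

def prediction_table(model, X_test, y_test):
    """Return actual labels, predicted labels, and positive-class probabilities."""
    return pd.DataFrame(
        {
            "actual": list(y_test),
            "predicted": model.predict(X_test),
            "prob_high": model.predict_proba(X_test)[:, 1],
        },
        index=X_test.index,  # keep the original row labels for later lookups
    )
```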

#17 How does the model perform on the test data?

#18 Print out the confusion matrix for the test data. What does it tell you about the model?
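
A labeled confusion matrix is easier to read than the raw array. The sketch assumes a 0/1 target. The helper name and the row and column labels are illustrative.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

def confusion_table(y_true, y_pred):
    """Return the 2x2 confusion matrix with rows = actual class, columns = predicted class."""
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return pd.DataFrame(
        cm, index=["actual_0", "actual_1"], columns=["pred_0", "pred_1"]
    )
```

The top-right cell counts false positives (movies predicted high that were not) and the bottom-left cell counts false negatives (high-scoring movies the model missed).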

#19 What are the top 3 movies based on the test set? Which variables are most important in predicting the top 3 movies?
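
One way to rank test-set movies is by predicted probability of a high score. The sketch assumes `titles` is a Series of movie titles sharing the index of `X_test`. That Series and the helper name `top_movies` are hypothetical.

```python
import pandas as pd

def top_movies(model, X_test, titles, n=3):
    """Return the n test-set movies with the highest predicted probability of class 1."""
    prob = pd.Series(model.predict_proba(X_test)[:, 1], index=X_test.index)
    ranked = prob.sort_values(ascending=False).head(n)
    return pd.DataFrame(
        {
            "title": titles.loc[ranked.index].to_numpy(),
            "prob_high": ranked.to_numpy(),
        },
        index=ranked.index,
    )
```

Every movie that lands in the same leaf gets the same probability, so ties are common and the order among tied movies is arbitrary.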

#20 Summarize what you learned along the way and make recommendations on how this could be used moving forward, being careful not to overpromise.