Goal: Build two data prep pipelines using different datasets to get practice with data preparation and question building.
Step one: Review these two datasets and brainstorm problems that could be addressed with the dataset. Identify a question for each dataset.
College Completion Data Dictionary + Data
The dataset is in the data file in the DS 3021 repo, but it was added after class started, so you may need to “git pull” to get it.
Data Dictionary (kinda) for Job Placement - You’ll need to infer the variable meanings from the column names and from the comments on the site.
Step two: Work through the steps outlined in the examples to include the following elements:
What is an independent Business Metric for your problem?
Data preparation:
correct variable type/class as needed
collapse factor levels as needed
one-hot encode factor variables
normalize the continuous variables
drop unneeded variables
create target variable if needed
Calculate the prevalence of the target variable
Create the necessary data partitions (Train, Tune, Test)
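The data preparation steps above can be sketched on a toy table as follows. This is only an illustration, assuming pandas is available: the column names (`id`, `degree`, `score`, `salary`, `placed`), the "fewer than 3 rows" rule for collapsing levels, the definition of the target, and the 60/20/20 split are all made up for the example and are not taken from either dataset.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "id": range(1, 11),
    "degree": ["Sci", "Comm", "Arts", "Sci", "Comm",
               "Sci", "Other", "Comm", "Sci", "Arts"],
    "score": ["55", "62", "71", "48", "90", "66", "58", "77", "83", "69"],
    "salary": [0, 250, 300, 0, 400, 260, 0, 310, 350, 0],
})

# Correct variable types: the scores were read in as strings.
df["score"] = pd.to_numeric(df["score"])

# Collapse factor levels: levels with fewer than 3 rows become "Other"
# (the threshold is an assumption for this example).
counts = df["degree"].value_counts()
rare = list(counts[counts < 3].index)
df["degree"] = df["degree"].where(~df["degree"].isin(rare), "Other")
df["degree"] = df["degree"].astype("category")

# Create the target variable: here "placed" means a salary above zero
# (an assumed definition).
df["placed"] = (df["salary"] > 0).astype(int)

# Drop unneeded variables: the row id, and salary because it defines the target.
df = df.drop(columns=["id", "salary"])

# Normalize the continuous variable with min-max scaling to [0, 1].
df["score"] = (df["score"] - df["score"].min()) / (df["score"].max() - df["score"].min())

# One-hot encode the factor variable.
df = pd.get_dummies(df, columns=["degree"], dtype=int)

# Prevalence of the target: the share of rows in the positive class.
prevalence = df["placed"].mean()

# Partition into Train / Tune / Test (60/20/20, simple random split).
train = df.sample(frac=0.6, random_state=1)
rest = df.drop(train.index)
tune = rest.sample(frac=0.5, random_state=1)
test = rest.drop(tune.index)

print(f"prevalence = {prevalence:.2f}")
print(len(train), len(tune), len(test))
```

The split here is purely random. On a real dataset you would probably want to stratify on the target so each partition keeps roughly the same prevalence, and you may prefer to compute the scaling parameters from the training partition only.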
Step three: What do your instincts tell you about the data? Can it address your problem? What areas or items are you worried about?
Step four: Create functions for your two pipelines that produce the train and test datasets.
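One possible shape for such a function is sketched below, assuming pandas. The name `prep_pipeline`, its parameters, and the toy columns are hypothetical. It only covers dropping columns, scaling, one-hot encoding, and a split stratified on the target, so your own version would add whatever type fixes and level collapsing your dataset needs.

```python
import pandas as pd

def prep_pipeline(df, target, drop_cols=(), test_frac=0.2, seed=42):
    """Hypothetical helper: return (train, test) DataFrames for one dataset."""
    # Drop unneeded columns and reset the index so row labels are unique.
    out = df.drop(columns=list(drop_cols)).reset_index(drop=True)

    # Separate numeric and non-numeric features (the target is left untouched).
    features = out.columns.drop(target)
    num_cols = list(out[features].select_dtypes(include="number").columns)
    cat_cols = [c for c in features if c not in num_cols]

    # Min-max scale numeric features; a constant column would have a span of
    # zero, so its span is replaced with 1 to avoid division by zero.
    if num_cols:
        low = out[num_cols].min()
        span = (out[num_cols].max() - low).replace(0, 1)
        out[num_cols] = (out[num_cols] - low) / span

    # One-hot encode the remaining (categorical) features.
    if cat_cols:
        out = pd.get_dummies(out, columns=cat_cols, dtype=int)

    # Stratified split: sample the same fraction from each target class.
    test = out.groupby(target, group_keys=False).sample(
        frac=test_frac, random_state=seed
    )
    train = out.drop(test.index)
    return train, test

# Usage on a made-up table (column names are illustrative only).
toy = pd.DataFrame({
    "id": range(20),
    "gpa": [2.0 + 0.1 * i for i in range(20)],
    "major": ["A", "B", "C", "A"] * 5,
    "completed": [0, 1] * 10,
})
train, test = prep_pipeline(toy, target="completed", drop_cols=["id"])
print(train.shape, test.shape)
```

Keeping the column choices as arguments rather than hard-coding them means the same function can be called once per dataset with different `target` and `drop_cols` values.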