Goal: Build two data prep pipelines using different datasets to get practice with data preparation and question building.
Step one: Review these two datasets and brainstorm problems that each one could address. Identify one question for each dataset.
- College Completion Data Dictionary + Data
  - The dataset is in the data file in the DS 3021 repo, but it was added after class started, so you may need to run “git pull” to get it.
- Data Dictionary (kinda) for Job Placement: you’ll need to infer meanings from the column names and from the comments on the site.
Step two: Work through the steps outlined in the examples to include the following elements:
- What is an independent business metric for your problem?
- Data preparation:
  - correct variable type/class as needed
  - collapse factor levels as needed
  - one-hot encode factor variables
  - normalize the continuous variables
  - drop unneeded variables
  - create target variable if needed
  - calculate the prevalence of the target variable
  - create the necessary data partitions (Train, Tune, Test)
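The data preparation steps above can be sketched in pandas roughly as follows. This is only an illustration under stated assumptions: the toy data, the column names (`control`, `level`, `aid_value`, `grad_rate`), the top-quartile target rule, and the 60/20/20 split are all made up and are not taken from either assignment dataset.

```python
import pandas as pd

# Toy data: every column name and value here is hypothetical.
df = pd.DataFrame({
    "school_id": range(1, 21),
    "control": ["Public", "Private A", "Private B", "Public"] * 5,
    "level": ["4-year", "2-year"] * 10,
    "aid_value": [float(1000 + 500 * i) for i in range(20)],
    "grad_rate": [str(20 + 3 * i) for i in range(20)],  # numbers stored as text
})

# 1. Correct variable types: text that should be numeric.
df["grad_rate"] = pd.to_numeric(df["grad_rate"])

# 2. Collapse factor levels: keep "Public", lump everything else as "Private".
df["control"] = df["control"].where(df["control"] == "Public", "Private")
df["control"] = df["control"].astype("category")
df["level"] = df["level"].astype("category")

# 3. Create the target (assumed rule: top quartile of grad_rate).
cutoff = df["grad_rate"].quantile(0.75)
df["high_grad"] = (df["grad_rate"] > cutoff).astype(int)

# 4. Drop variables that are not needed (ID, and the source of the target).
model_df = df.drop(columns=["school_id", "grad_rate"])

# 5. Normalize the continuous variable with min-max scaling to [0, 1].
lo, hi = model_df["aid_value"].min(), model_df["aid_value"].max()
model_df["aid_value"] = (model_df["aid_value"] - lo) / (hi - lo)

# 6. One-hot encode the factor variables.
model_df = pd.get_dummies(model_df, columns=["control", "level"])

# 7. Prevalence: share of rows in the positive class.
prevalence = model_df["high_grad"].mean()

# 8. Stratified Train/Tune/Test partitions (assumed 60/20/20 split).
train = model_df.groupby("high_grad").sample(frac=0.6, random_state=42)
rest = model_df.drop(train.index)
tune = rest.groupby("high_grad").sample(frac=0.5, random_state=42)
test = rest.drop(tune.index)

print(prevalence, len(train), len(tune), len(test))
```

Sampling within each target class keeps the prevalence roughly the same across the three partitions, which matters when the positive class is small.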
Step three: What do your instincts tell you about the data? Can it address your problem, and what areas/items are you worried about?
Step four: Create functions for your two pipelines that produce the train and test datasets.
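One possible shape for such a function is sketched below. The name `prep_and_split`, its parameters, and the toy columns are hypothetical; a real pipeline would also need the dataset-specific steps (type fixes, level collapsing, target creation) before or inside it.

```python
import pandas as pd

def prep_and_split(df, target, cat_cols, num_cols, drop_cols=(),
                   test_frac=0.2, seed=42):
    """Hypothetical helper: prepare a DataFrame and return (train, test)."""
    out = df.drop(columns=list(drop_cols)).copy()
    # Min-max scale each continuous column to [0, 1].
    # Note: a constant column (max == min) would produce NaN here.
    for col in num_cols:
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo)
    # One-hot encode the categorical columns.
    out = pd.get_dummies(out, columns=list(cat_cols))
    # Stratified hold-out: sample test rows within each target class.
    test = out.groupby(target).sample(frac=test_frac, random_state=seed)
    train = out.drop(test.index)
    return train, test

# Toy usage: all column names and values are made up.
toy = pd.DataFrame({
    "id": range(20),
    "degree": ["Sci", "Comm", "Other", "Sci"] * 5,
    "score": [50.0 + 2 * i for i in range(20)],
    "placed": [1, 0] * 10,
})
train, test = prep_and_split(toy, target="placed", cat_cols=["degree"],
                             num_cols=["score"], drop_cols=["id"])
print(train.shape, test.shape)
```

For simplicity, this sketch scales on the full data before splitting. Computing the min and max on the training rows only, then applying them to the test rows, avoids leaking test information into training.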