
Goal: Build two data prep pipelines using different datasets to get practice with data preparation and question building.

Step one: Review these two datasets and brainstorm problems that each one could address. Identify one question for each dataset.

College Completion Data Dictionary + Data

  • The dataset is in the data file in the DS 3021 repo. It was added after class started, so you may need to run “git pull” to get it.


Step two: Work through the steps outlined in the examples to include the following elements:

  • What is an independent Business Metric for your problem?
  • Data preparation:
    • correct variable type/class as needed
    • collapse factor levels as needed
    • one-hot encoding factor variables
    • normalize the continuous variables
    • drop unneeded variables
    • create target variable if needed
  • Calculate the prevalence of the target variable
  • Create the necessary data partitions (Train, Tune, Test)
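
The checklist above can be sketched in pandas as follows. This is a minimal illustration on a made-up table, not the course's reference solution: the column names (`school_id`, `control`, `level`, `student_count`, `grad_rate`), the cutoff of 60 used to build the target, and the 60/20/20 split are all assumptions chosen for the example.

```python
import pandas as pd

# Toy stand-in data. Column names and values are hypothetical,
# not taken from the College Completion dataset.
df = pd.DataFrame({
    "school_id": range(1, 11),
    "control": ["Public", "Private not-for-profit", "Private for-profit",
                "Public", "Public", "Private not-for-profit", "Public",
                "Private for-profit", "Public", "Public"],
    "level": ["4-year", "2-year", "4-year", "4-year", "2-year",
              "4-year", "2-year", "2-year", "4-year", "4-year"],
    "student_count": ["1200", "300", "4500", "800", "650",
                      "2200", "150", "900", "3100", "5000"],
    "grad_rate": [62.0, 31.5, 48.0, 71.2, 25.0, 55.5, 18.9, 40.1, 66.3, 80.4],
})

# Correct variable types: counts were read as strings.
df["student_count"] = pd.to_numeric(df["student_count"])

# Collapse factor levels: merge both private categories into one,
# then store the columns as categoricals.
df["control"] = df["control"].replace({
    "Private not-for-profit": "Private",
    "Private for-profit": "Private",
}).astype("category")
df["level"] = df["level"].astype("category")

# Create the target variable (assumed cutoff of 60, for illustration only).
df["high_grad"] = (df["grad_rate"] >= 60).astype(int)

# Drop unneeded variables: the ID, and the column the target was built from.
df = df.drop(columns=["school_id", "grad_rate"])

# Normalize the continuous variable with min-max scaling to [0, 1].
col = df["student_count"]
df["student_count"] = (col - col.min()) / (col.max() - col.min())

# One-hot encode the factor variables.
df = pd.get_dummies(df, columns=["control", "level"], dtype=int)

# Prevalence: share of rows in the positive class.
prevalence = df["high_grad"].mean()

# Partition: shuffle once, then slice into Train / Tune / Test (60/20/20).
shuffled = df.sample(frac=1, random_state=42)
n = len(shuffled)
train_end, tune_end = n * 6 // 10, n * 8 // 10
train = shuffled.iloc[:train_end]
tune = shuffled.iloc[train_end:tune_end]
test = shuffled.iloc[tune_end:]

print(f"prevalence: {prevalence:.2f}")
print(len(train), len(tune), len(test))
```

The split here is a plain random shuffle. When the positive class is rare, a stratified split that keeps the prevalence similar across partitions is usually worth considering.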

Step three: What do your instincts tell you about the data? Can it address your problem? What areas or items are you worried about?

Step four: Create functions for your two pipelines that produce the train and test datasets.
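
One possible shape for such a function is sketched below. The name `prep_pipeline`, its parameters, and the toy table are hypothetical; a real pipeline for either dataset would also need its own type fixes, level collapsing, and target construction before this point.

```python
import pandas as pd


def prep_pipeline(df, target, categorical, continuous,
                  drop=(), test_frac=0.2, seed=42):
    """Hypothetical helper: prepare df and return (train, test) DataFrames."""
    if target not in df.columns:
        raise KeyError(f"target column {target!r} not found")

    # Drop unneeded variables; drop() returns a new frame, so df is untouched.
    out = df.drop(columns=list(drop))

    # Store factor variables as categoricals before encoding.
    for col in categorical:
        out[col] = out[col].astype("category")

    # Min-max normalize each continuous variable to [0, 1].
    # A constant column would divide by zero and yield NaN.
    for col in continuous:
        col_min, col_max = out[col].min(), out[col].max()
        out[col] = (out[col] - col_min) / (col_max - col_min)

    # One-hot encode the factor variables.
    out = pd.get_dummies(out, columns=list(categorical), dtype=int)

    # Shuffle, then hold out the first test_frac of rows as the test set.
    shuffled = out.sample(frac=1, random_state=seed)
    n_test = int(round(len(shuffled) * test_frac))
    return shuffled.iloc[n_test:], shuffled.iloc[:n_test]


# Usage on a made-up table.
toy = pd.DataFrame({
    "id": range(10),
    "group": ["a", "b"] * 5,
    "amount": [float(i) for i in range(10)],
    "target": [0, 1] * 5,
})
train, test = prep_pipeline(
    toy, target="target", categorical=["group"],
    continuous=["amount"], drop=["id"],
)
print(train.shape, test.shape)
```

Keeping the column lists as arguments lets the same function serve both datasets, with only the call changing between them.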