Goal: Build two data preparation pipelines, each using a different dataset, to get practice with data preparation and question building. In doing so, create a new GitHub repository for your work. Think of this as a standalone project that requires creating a workspace and a repository. Best practice is likely to keep three files in the repo: one for the assignment details (this file), a second (a Python file) to answer questions 1-3, and a third (a Python file) for question four.

Step one: Review these two datasets and brainstorm problems that could be addressed with the dataset. Identify a question for each dataset.

College Completion Data Dictionary + Data

Job_Placement

Step two: Work through the steps outlined in the examples to include the following elements:

Step three: What do your instincts tell you about the data? Can it address your problem? Which areas or items are you worried about?
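One way to back up those instincts with numbers is a quick audit of missing values and target balance. The following is only a minimal sketch, assuming pandas is available; the function name `quick_audit`, the column names, and the toy values are hypothetical and do not come from either assignment dataset.

```python
import pandas as pd


def quick_audit(df, target):
    """Summarize issues that commonly undermine a modelling question."""
    return {
        # How many observations are there to work with?
        "n_rows": len(df),
        # Share of missing values per column (0.0 means complete).
        "missing_share": df.isna().mean().round(2).to_dict(),
        # Proportion of each target class, to spot heavy imbalance.
        "target_balance": df[target].value_counts(normalize=True).round(2).to_dict(),
    }


# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "score": [55.0, None, 71.0, 48.0, 90.0, 66.0, None, 83.0],
    "placed": [0, 0, 1, 0, 1, 1, 0, 1],
})

audit = quick_audit(toy, "placed")
print(audit)
```

A high missing share or a badly skewed target are the kinds of worries worth writing down in your answer to this step.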

Step four: Create functions for your two pipelines that produce the train and test datasets. The end result should be a series of functions that can be called to produce the train and test datasets for each of your two problems, covering all the data prep steps you took. This is essentially creating a DAG (directed acyclic graph) for your data prep steps. Imagine you will need to do this for multiple problems in the future, so creating functions that can be reused is important. You don’t need to create one full pipeline function that does everything, but rather a series of smaller functions that can be called in sequence to produce the final datasets. Use your judgement on how to break up the functions.
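The idea of small, reusable functions called in sequence can be sketched as follows. This is a minimal illustration rather than the expected solution: it assumes pandas and scikit-learn are installed, and every function name, column name, threshold, and toy value is hypothetical and not taken from either assignment dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def add_binary_target(df, source, threshold, target="target"):
    """Create a 0/1 target that flags rows where source >= threshold."""
    out = df.copy()
    out[target] = (out[source] >= threshold).astype(int)
    return out


def drop_columns(df, columns):
    """Remove identifier or leakage columns (silently skips absent ones)."""
    return df.drop(columns=columns, errors="ignore")


def collapse_rare_levels(df, column, top_n=2, other="Other"):
    """Keep the top_n most frequent levels; relabel the rest as `other`."""
    out = df.copy()
    keep = out[column].value_counts().nlargest(top_n).index
    out[column] = out[column].where(out[column].isin(keep), other)
    return out


def one_hot_encode(df, columns):
    """Expand categorical columns into 0/1 indicator columns."""
    return pd.get_dummies(df, columns=columns, dtype=int)


def run_steps(df, steps):
    """Apply (function, kwargs) pairs in order, like walking a linear DAG."""
    for func, kwargs in steps:
        df = func(df, **kwargs)
    return df


def split_train_test(df, target, test_size=0.25, seed=42):
    """Stratified split so both sets keep the target's class balance."""
    train, test = train_test_split(
        df, test_size=test_size, stratify=df[target], random_state=seed
    )
    return train, test


# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "id": range(20),
    "field": ["sci"] * 9 + ["arts"] * 6 + ["comm"] * 3 + ["law"] * 2,
    "hours": [h % 7 + 1 for h in range(20)],
    "score": [55, 62, 71, 48, 90, 66, 59, 83, 77, 50,
              68, 72, 45, 88, 61, 79, 53, 70, 64, 81],
})

steps = [
    (add_binary_target, {"source": "score", "threshold": 65, "target": "high_score"}),
    (drop_columns, {"columns": ["id", "score"]}),  # score would leak the target
    (collapse_rare_levels, {"column": "field", "top_n": 2}),
    (one_hot_encode, {"columns": ["field"]}),
]

prepared = run_steps(toy, steps)
train, test = split_train_test(prepared, target="high_score")
print(train.shape, test.shape)
```

Because each step is its own function, the second dataset can reuse the same building blocks with a different `steps` list instead of a second copy of the code.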