Key Points to Cover
- Developing data science (DS) questions
- Developing metrics associated with the question
- Target variables
- Type of model
- Data brainstorming
- Data to Concept
Read through the case study, discuss the problem as a group, and then address the questions below:
- What problems are facing the organization?
- What is the target variable, and what type of solution (regression or classification) are they developing?
- How would they know the solution was working (business metric)?
- What other issues might the team be facing?
Work through the following Data to Concept and Learning questions:
- How difficult is the data to gather?
- Is the target difficult to measure or break into smaller parts?
- What level of uncertainty (risk) are you willing to accept?
Take 20 to 30 minutes to gather your answers, and then we will discuss. I do want you to submit a written summary of the results, so it might be a good idea to assign a note taker.
Case Study:
A non-profit organization is undertaking an ambitious project to develop a sophisticated recommender engine tailored to the needs of K-12 school districts. The engine’s primary purpose is to guide districts in making informed investments in Mathematics software, helping to address a widespread issue: billions of dollars are wasted annually in the U.S. education system on software that fails to meet the unique needs, skills, and contexts of individual schools within districts.
The non-profit has initiated this effort by collecting data from several school districts across the Midwest, focusing on those with diverse characteristics, including urban, suburban, and rural settings. The data, obtained through a detailed survey, features responses on a Likert scale (1-5) across 60 questions. These questions delve into a range of variables, including school demographics, technology infrastructure and proficiency, teacher preparedness, administrative support, and the perceived success or failure of previous software implementations. With 1,500 responses spanning 30 districts—and no district contributing more than 100 responses—the dataset provides a snapshot of the variability and complexity that the recommender system must accommodate.
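To make the dataset's shape concrete, here is a minimal Python sketch that builds a synthetic stand-in matching the description above; the column names (district_id, q01 through q60) and district labels are invented for illustration, not taken from the actual survey instrument.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the survey: 30 districts, no more than 100
# responses each, 60 Likert items scored 1-5, roughly 1,500 rows total.
likert_cols = [f"q{i:02d}" for i in range(1, 61)]
n_per_district = rng.integers(20, 81, size=30)  # mean ~50 responses per district
districts = np.repeat([f"district_{d:02d}" for d in range(1, 31)], n_per_district)
responses = pd.DataFrame(rng.integers(1, 6, size=(len(districts), 60)),
                         columns=likert_cols)
responses.insert(0, "district_id", districts)
print(responses.shape)  # roughly (1500, 61)
```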
The immediate goal is to develop a Minimum Viable Product (MVP) of the recommender engine. The MVP will allow the non-profit to showcase its potential to stakeholders and secure funding for broader data collection and system refinement. However, the challenges are numerous: the current dataset is relatively small, and response biases and inconsistencies in how the survey questions are interpreted could introduce noise into the data, complicating the initial modeling process.
To address these challenges, the non-profit envisions a multi-step strategy. First, exploratory data analysis (EDA) will help uncover patterns, correlations, and potential outliers. For instance, certain districts may have unique combinations of low technology proficiency and high administrative enthusiasm, which could impact the types of recommendations generated. The next step involves feature engineering to transform survey responses into meaningful inputs for a machine learning model. This might include clustering districts by similar characteristics, such as budget constraints or student-to-teacher ratios, or calculating composite indices to quantify factors like “readiness for tech adoption.”
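As one illustration of these steps, the sketch below (continuing from the synthetic responses frame above) aggregates responses to the district level, computes a hypothetical "readiness for tech adoption" index from placeholder question IDs, and clusters districts with similar profiles. Which items feed the index and how many clusters to use are assumptions that would need to be settled with domain experts and the EDA itself.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per district: mean Likert score on each of the 60 questions.
district_profiles = responses.groupby("district_id")[likert_cols].mean()

# Hypothetical composite index: average of items assumed to probe technology
# infrastructure, proficiency, and administrative support.
readiness_items = ["q05", "q12", "q18", "q23"]  # placeholder question IDs
district_profiles["tech_readiness"] = district_profiles[readiness_items].mean(axis=1)

# Group districts with similar survey profiles; k=4 is an arbitrary starting
# point that would need tuning (e.g., silhouette scores) on real data.
scaled = StandardScaler().fit_transform(district_profiles[likert_cols])
district_profiles["cluster"] = KMeans(n_clusters=4, random_state=0).fit_predict(scaled)
```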
The MVP itself will likely employ a combination of machine learning techniques, such as collaborative filtering or hybrid recommenders, to match districts with software options. Initially, this process may require heavy reliance on human expertise, with educators and software specialists annotating or validating model outputs to ensure recommendations align with real-world constraints. Over time, as the model learns from feedback loops, it could grow more autonomous, leveraging new survey data from districts to refine its suggestions.
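A memory-based collaborative filter is one plausible shape for the MVP. The sketch below scores software a district has not used by averaging the ratings of its most similar districts; the district-by-software ratings matrix and all software names are hypothetical stand-ins for ratings that might be derived from the survey's perceived-success items.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Tiny hypothetical district x software ratings matrix (1-5, NaN = never used).
ratings = pd.DataFrame(
    {"MathApp A": [4, 5, np.nan, 2],
     "MathApp B": [np.nan, 4, 4, 1],
     "MathApp C": [2, np.nan, 5, np.nan]},
    index=["district_01", "district_02", "district_03", "district_04"],
)

def recommend(district, ratings, k=2, n_recs=2):
    """Memory-based collaborative filtering: score software a district has
    not used by averaging ratings from its k most similar districts."""
    filled = ratings.apply(lambda col: col.fillna(col.mean()))  # naive imputation
    sim = pd.DataFrame(cosine_similarity(filled),
                       index=ratings.index, columns=ratings.index)
    neighbors = sim[district].drop(district).nlargest(k).index
    scores = ratings.loc[neighbors].mean()  # average neighbor rating per title
    return scores[ratings.loc[district].isna()].nlargest(n_recs)

print(recommend("district_01", ratings))
```

In practice, the human-in-the-loop step described above could sit between this scoring function and the district: experts review the ranked list and veto options that violate real-world constraints before anything is shown.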
Another complication lies in the validation of the system’s effectiveness. Success cannot simply be measured by the accuracy of recommendations but must also account for outcomes such as improved student performance, teacher satisfaction, and long-term cost savings. The non-profit must develop a robust mechanism for gathering post-implementation feedback from districts and incorporate this into the system’s iterative improvement.
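One simple offline check of this kind is precision@k against post-implementation feedback: of the top-k recommendations shown to a district, how many did it adopt and later report as successful? A minimal sketch, with entirely hypothetical feedback data:

```python
import numpy as np

def precision_at_k(recommended, adopted_and_successful, k=3):
    """Fraction of the top-k recommendations the district actually adopted
    and later reported as successful in post-implementation feedback."""
    top_k = list(recommended)[:k]
    return sum(item in adopted_and_successful for item in top_k) / k

# Hypothetical feedback from two districts after deployment:
# (recommendations shown, software adopted and judged successful).
feedback = {
    "district_01": (["MathApp B", "MathApp C", "MathApp D"], {"MathApp B"}),
    "district_02": (["MathApp A", "MathApp D", "MathApp B"], {"MathApp A", "MathApp B"}),
}

scores = [precision_at_k(recs, successes) for recs, successes in feedback.values()]
print(f"mean precision@3: {np.mean(scores):.2f}")
```

Recommendation accuracy of this sort is only a proxy; the outcome measures the case study names (student performance, teacher satisfaction, cost savings) would still need their own instruments.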
Long-term, the non-profit envisions scaling the system to serve districts nationwide. However, this raises additional technical and ethical questions. How can the system account for the nuances of districts in other regions, where educational priorities, budgets, and cultural factors may differ significantly? How can biases in the training data—such as overrepresentation of certain district types—be mitigated to ensure equitable recommendations?
Despite these challenges, the potential impact of the recommender engine is enormous. By empowering districts to make smarter, data-driven decisions, the system could help schools maximize the value of their investments, improve math education outcomes for students, and ultimately reduce the staggering waste that currently plagues the sector. The success of this project could also pave the way for similar initiatives in other subject areas or aspects of educational technology, further transforming the landscape of K-12 education.