Guided Decision-Making in Machine Learning via Automated Reporting
Background
In applied machine learning, practitioners often experiment with different models, features, and preprocessing strategies to improve predictive performance. Yet when starting from a raw dataset, it is difficult to know in advance which approaches are most promising. Achieving state-of-the-art performance is time-consuming and typically takes years of experience, since only seasoned data scientists know the “tricks of the trade” that lead to reliable success.
To democratize these skills, automated tools that analyze datasets, run baseline experiments, and generate systematic reports can help guide practitioners toward better decisions. However, existing tools are largely limited to descriptive statistics and visualizations and offer little actionable guidance, especially for more advanced users. Similarly, recent LLM-based approaches tend to underperform, as they are trained primarily on the simple educational material that dominates the internet rather than on expert practice.
This project addresses that gap by transforming the intuition and expertise of senior data scientists into an actionable open-source package.
Project Description
The goal of this project is to design and implement a Python library that takes raw datasets as input and generates automated reports to guide practitioners in defining promising modeling pipelines.
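As a purely illustrative sketch of what such a library might look like from the user's side (all names here, including `generate_report` and `Report`, are hypothetical placeholders rather than a fixed API):

```python
# Hypothetical end-to-end usage; every name below is an illustrative
# placeholder, not the project's final interface.
from dataclasses import dataclass
import pandas as pd

@dataclass
class Report:
    body_markdown: str

    def to_markdown(self, path: str) -> None:
        with open(path, "w") as f:
            f.write(self.body_markdown)

def generate_report(df: pd.DataFrame, target: str) -> Report:
    # A real implementation would profile the data, train baselines,
    # and assemble tables and figures; this stub only shows the shape.
    return Report(body_markdown=f"# Report\n\nRows: {len(df)}, target: {target}\n")

# Example usage on a toy dataset.
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [0, 1, 0, 1]})
generate_report(df, target="y").to_markdown("report.md")
```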
You will begin by exploring and summarizing key dataset characteristics (e.g., size, missing values, class balance, feature types), automatically detecting common challenges such as high-cardinality categorical features and outliers. You will then train baseline predictive models and compare their performance. Reports will be structured narratives that integrate tables and visualizations (e.g., confusion matrices, ROC curves, feature importances) and will support export to Markdown, PDF, and HTML.
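For illustration, a minimal sketch of the first two steps, assuming pandas and scikit-learn; the function names and the high-cardinality threshold are placeholder choices, not the project's prescribed design:

```python
# Sketch of dataset profiling and baseline comparison, assuming pandas and
# scikit-learn; names and the cardinality threshold are illustrative.
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def summarize(df: pd.DataFrame, target: str) -> dict:
    """Collect key dataset characteristics and flag common challenges."""
    cat_cols = df.select_dtypes(include=["object", "category"]).columns
    return {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        "missing_ratio": df.isna().mean().round(3).to_dict(),
        "class_balance": df[target].value_counts(normalize=True).to_dict(),
        # Illustrative threshold for flagging high-cardinality categoricals.
        "high_cardinality": [c for c in cat_cols if df[c].nunique() > 50],
    }

def baseline_scores(X, y) -> dict:
    """Cross-validated accuracy of simple baselines (X must be numeric)."""
    models = {
        "majority_class": DummyClassifier(strategy="most_frequent"),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(random_state=0),
    }
    return {name: cross_val_score(model, X, y, cv=5).mean()
            for name, model in models.items()}
```

A dummy baseline such as majority-class prediction gives the floor against which the other models, and later the full reports, can be judged.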
Building on this foundation, you will study successful solutions from Kaggle competitions and extract generalizable modeling decisions. These insights will be incorporated into the system so that it can automatically provide recommendations for next steps (e.g., handling imbalance, applying specific feature engineering techniques, adjusting model complexity). The project will also experiment with how open-ended LLM-based approaches can complement predefined rules to enrich the system’s recommendations.
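Such distilled heuristics could, for example, take the form of simple rules over the profiling summary sketched above; the rule contents and thresholds below are illustrative assumptions, not knowledge actually extracted from Kaggle solutions:

```python
# Sketch of rule-based recommendations over a profiling summary;
# all rules and thresholds here are illustrative assumptions.
def recommend(summary: dict) -> list[str]:
    """Map dataset characteristics to suggested next steps."""
    tips = []
    balance = summary.get("class_balance", {})
    if balance and min(balance.values()) < 0.10:  # illustrative threshold
        tips.append("Severe class imbalance: consider class weights, "
                    "resampling, or decision-threshold tuning.")
    if summary.get("high_cardinality"):
        tips.append("High-cardinality categoricals: consider target or "
                    "frequency encoding instead of one-hot encoding.")
    missing = summary.get("missing_ratio", {})
    if missing and max(missing.values()) > 0.3:  # illustrative threshold
        tips.append("Columns with >30% missing values: consider dropping "
                    "them or adding missingness-indicator features.")
    return tips
```

An LLM-based component could then rephrase, prioritize, or extend such rule-based tips with more open-ended suggestions.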
Finally, you will set up a well-structured repository with documentation, examples, and contribution guidelines; write unit tests and maintain high code quality; and collaborate via GitHub issues, pull requests, and code reviews. You will apply the library to multiple datasets (classification and regression) and design a framework to evaluate report quality with respect to clarity, usefulness, and correctness, documenting limitations and proposing future improvements.
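As a taste of the expected testing style, a minimal pytest sketch against the `summarize` helper sketched above (the `profiling` module path is hypothetical):

```python
# Minimal pytest sketch; the `profiling` module path is a hypothetical
# home for the `summarize` helper sketched earlier.
import pandas as pd
from profiling import summarize

def test_summarize_flags_imbalance_and_high_cardinality():
    df = pd.DataFrame({
        "cat": [f"id_{i}" for i in range(100)],  # 100 unique categories
        "label": [0] * 95 + [1] * 5,             # 5% minority class
    })
    summary = summarize(df, target="label")
    assert summary["n_rows"] == 100
    assert "cat" in summary["high_cardinality"]
    assert min(summary["class_balance"].values()) == 0.05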
Through this work, you will learn which techniques senior data scientists (e.g., Kaggle Grandmasters) use to achieve state-of-the-art predictive performance, while gaining hands-on experience in best practices for developing and contributing to open-source packages, following the standards of widely used libraries such as scikit-learn or AutoGluon.
Requirements
- Solid Python programming skills
- Basic knowledge of machine learning (e.g., IE500 Data Mining 1, IE675b Machine Learning)
- Familiarity with Git and GitHub
- Ability to work both independently and in a team, with strong analytical thinking skills
We look forward to receiving applications from talented and motivated students who are eager to sharpen their data science expertise, contribute to impactful open-source software, and learn how to write code that lasts beyond the scope of a single project.