Meeting time & location: TR 9:45 – 11:15 am at WH–100E.
Office hours: W 2:00 – 3:00 pm.
This is a 4-credit course, which means that in addition to the scheduled lectures/discussions,
students are expected to do at least 9.5 hours of course-related work each week during the
semester. This includes completing assigned readings, participating in lab sessions,
studying for tests and examinations, preparing written assignments, completing internship or
clinical placement requirements, and other tasks that must be completed to earn credit in the
course.
Prerequisites
Scientific programming in a language such as R, Matlab, or Python. R and Python are strongly recommended.
Previous coursework in statistics and data science.
Objectives
Understand the major principles of practical data analysis.
Gain hands-on experience with practical aspects of data analysis.
Cultivate the critical thinking skills needed to assess the trustworthiness of data analyses.
A Data Science and Statistics counterpart to the famous Missing Semester of Your CS Education (https://missing.csail.mit.edu/) course.
The Missing Semester for Data Science and Statistics
The hands-on competencies that every data scientist or statistician needs to thrive in industry, research, or applied work.
Core Tooling & Environment – Use Bash/Zsh for data handling, reproducible environments, and Git/GitHub (incl. LFS) for version control.
Data Acquisition & Engineering – Query databases (SQLite/MySQL), handle diverse formats, automate cleaning with provenance checks, and validate representativeness.
Applied Modeling – Build pipelines (scikit-learn/tidymodels), engineer features without leakage, evaluate rigorously, and align models to practical needs.
Statistical Inference & Uncertainty – Quantify and communicate uncertainty, use resampling for stability, and link assumptions to real-world limits.
Visualization & Communication – Create advanced visualizations/dashboards, communicate clearly to varied audiences, and produce reproducible reports in Quarto.
Project Organization – Structure projects for clarity, track metadata, and optimize pipelines for performance.
Ethics & Responsibility – Address bias/fairness, frame problems ethically, and ensure analyses reflect reality and promote trust.
Recommended Texts
Veridical Data Science: The Practice of Responsible Data Analysis and Decision Making, by Bin Yu and Rebecca L. Barter.
We will use GitHub to host all the coursework and Gradescope to submit and grade homework. This will allow the instructor to grade all work efficiently and give feedback in a timely manner.
Brightspace
Brightspace will be used for recording grades on assignments and exams and for distributing solutions. The code and lecture notes can also be found there.
Grading
Homework (55%): homework is assigned weekly.
Project (35%)
Quizzes and in-class participation (10%)
Homework Policy
Late homework loses 15% of its grade for each day it is late, applied multiplicatively: a homework submitted N days late receives 0.85^N times the grade it would otherwise have earned. Homework may be discussed with classmates but must be written and submitted individually.
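The late penalty compounds daily rather than subtracting a flat 15 points per day. A minimal sketch of the arithmetic follows; the function name `late_homework_grade` is hypothetical, and the sketch assumes lateness is counted in whole days, which the policy does not specify.

```python
def late_homework_grade(earned_grade: float, days_late: int) -> float:
    """Apply the late penalty: earned_grade * 0.85 ** days_late.

    Assumption: days_late is a whole number of days; how partial
    days are rounded is not stated in the syllabus.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    return earned_grade * 0.85 ** days_late


if __name__ == "__main__":
    # Show how a grade of 100 decays over the first few late days.
    for n in range(4):
        print(f"{n} day(s) late: {late_homework_grade(100.0, n):.2f}")
```

For a homework that would have earned 100, one day late gives 85 and two days late gives 72.25, not 70, because each day's 15% is taken from the already reduced grade.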
Project
Each student works on two projects: one as a Project Manager (PM), defining goals, managing scope, and reviewing work, and one as a Data Scientist (DS), executing analysis for another student’s project. The
process includes public project pitches; observed consultation meetings; ongoing feedback, adjustments, and clarifications via professional Slack communication; a DS-led final
presentation; and a PM follow-up email with assessment and feedback.
Software
RStudio and Google Colab will be used to complete the homework assignments.
Tentative Schedule and Topics
Date | Topic | Homework
Tuesday, August 19, 2025 | Introduction |
Thursday, August 21, 2025 | Introduction | Homework 1
Tuesday, August 26, 2025 | The Data Science Life Cycle |
Thursday, August 28, 2025 | The Data Science Life Cycle | Homework 2
Tuesday, September 2, 2025 | NO CLASS - Monday Classes Meet |
Thursday, September 4, 2025 | Set Up |
Tuesday, September 9, 2025 | Set Up |
Thursday, September 11, 2025 | SQL | Homework 3
Tuesday, September 16, 2025 | SQL |
Thursday, September 18, 2025 | SQL | Homework 4
Tuesday, September 23, 2025 | NO CLASS (Rosh Hashanah) |
Thursday, September 25, 2025 | Data Preparation |
Tuesday, September 30, 2025 | Data Preparation | Homework 5
Thursday, October 2, 2025 | NO CLASS (Yom Kippur) |
Tuesday, October 7, 2025 | Project Pitching |
Thursday, October 9, 2025 | Exploratory Data Analysis | Homework 6
Tuesday, October 14, 2025 | Model Deployment |
Thursday, October 16, 2025 | Model Deployment | Homework 7
Tuesday, October 21, 2025 | Project Consultation Meetings I |
Thursday, October 23, 2025 | Project Consultation Meetings II | Homework 8
Tuesday, October 28, 2025 | Case Study – The Nutrition Project |
Thursday, October 30, 2025 | Case Study – The Nutrition Project | Homework 9
Tuesday, November 4, 2025 | Case Study – The Ames House Price Prediction Project |
Thursday, November 6, 2025 | Case Study – The Ames House Price Prediction Project | Homework 10
Tuesday, November 11, 2025 | Case Study – The Online Shopping Purchase Prediction |
Thursday, November 13, 2025 | Case Study – The Online Shopping Purchase Prediction |