Math 534 Practical Data Analysis, Fall 2025

  • Instructor: Xingye Qiao

  • Office: WH-122

  • Meeting time & location: TR 9:45 – 11:15 am in WH-100E.

  • Office hours: W 2:00 – 3:00 pm.

This is a 4-credit course, which means that in addition to the scheduled lectures/discussions, students are expected to do at least 9.5 hours of course-related work each week during the semester. This includes completing assigned readings, participating in lab sessions, studying for tests and examinations, preparing written assignments, completing internship or clinical placement requirements, and other tasks that must be completed to earn credit in the course.

Prerequisites

  • Scientific programming in a language such as R, Matlab, or Python. R and Python are strongly recommended.

  • Previous coursework in statistics and data science.

Objectives

  • Understand the major principles of practical data analysis.

  • Gain hands-on experience with practical aspects of data analysis.

  • Cultivate the critical thinking skills needed to assess the trustworthiness of data analyses.

  • Provide a Data Science and Statistics counterpart to the famous Missing Semester of Your CS Education course (https://missing.csail.mit.edu/).

The Missing Semester for Data Science and Statistics

The hands-on competencies that every data scientist or statistician needs to thrive in industry, research, or applied work.

  • Core Tooling & Environment – Use Bash/Zsh for data handling, set up reproducible environments, and use Git/GitHub (incl. LFS) for version control.

  • Data Acquisition & Engineering – Query databases (SQLite/MySQL), handle diverse formats, automate cleaning with provenance checks, and validate representativeness.

  • Applied Modeling – Build pipelines (scikit-learn/tidymodels), engineer features without leakage, evaluate rigorously, and align models to practical needs.

  • Experimental Design & Causal Inference – Design bias-controlled A/B tests, handle statistical pitfalls, apply robust methods, and critically assess causal claims.

  • Statistical Inference & Uncertainty – Quantify and communicate uncertainty, use resampling for stability, and link assumptions to real-world limits.

  • Visualization & Communication – Create advanced visualizations/dashboards, communicate clearly to varied audiences, and produce reproducible reports in Quarto.

  • Project Organization – Structure projects for clarity, track metadata, and optimize pipelines for performance.

  • Ethics & Responsibility – Address bias/fairness, frame problems ethically, and ensure analyses reflect reality and promote trust.

Recommended Texts

Slack

Course-related communication will take place on Slack. Click here for access.

Gradescope and GitHub

We will use GitHub to distribute all the coursework and Gradescope to submit and grade homework. This allows the instructor to grade all work efficiently and give timely feedback.

Brightspace

Brightspace will only be used for recording grades on assignments and exams and for distributing solutions. The code and lecture notes can also be found on Brightspace.

Grading

  • Homework (55%): homework is assigned weekly.

  • Project (35%)

  • Quizzes and in-class participation (10%)
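
The weights above combine into a course score as a weighted average. The following is a minimal Python sketch, assuming each component is scored on a 0 to 100 scale (the syllabus does not state the scale); the names `WEIGHTS` and `course_score` and the sample scores are hypothetical, chosen only for illustration.

```python
# Component weights taken from the Grading section.
WEIGHTS = {
    "homework": 0.55,
    "project": 0.35,
    "quizzes_participation": 0.10,
}


def course_score(scores: dict[str, float]) -> float:
    """Return the weighted average of the three component scores.

    Assumes every component is on the same 0-100 scale (an assumption,
    not something the syllabus specifies).
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError(f"expected exactly these components: {sorted(WEIGHTS)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical component scores for one student.
    example = {"homework": 80.0, "project": 90.0, "quizzes_participation": 100.0}
    print(course_score(example))
```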

Homework Policy

Late homework loses 15% of its grade for each day it is late: a homework that is N days late receives 0.85^N times the grade it would otherwise have earned. Homework may be discussed with classmates but must be written and submitted individually.
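
As a worked illustration of the 0.85^N late-penalty rule, here is a minimal Python sketch. The function name `late_adjusted_grade` and the sample grade are hypothetical. The sketch assumes N is a whole number of days, since the policy does not say how partial days are counted.

```python
def late_adjusted_grade(raw_grade: float, days_late: int) -> float:
    """Apply the late multiplier 0.85**N to a raw homework grade.

    days_late is assumed to be a non-negative whole number of days.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    return raw_grade * 0.85 ** days_late


if __name__ == "__main__":
    # Hypothetical raw grade of 100, submitted 0 to 3 days late.
    for n in range(4):
        print(n, round(late_adjusted_grade(100.0, n), 2))
```

Because the penalty compounds, each extra day removes 15% of what remains, not 15 points of the original grade: one day late keeps 85% and two days late keeps 72.25%.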

Project

Each student works on two projects: one as a Project Manager (PM), who defines goals, manages scope, and reviews work, and one as a Data Scientist (DS), who executes the analysis for another student’s project. The process includes public project pitches, observed consultation meetings, ongoing feedback, adjustments, and clarifications via professional Slack communication, a DS-led final presentation, and a PM follow-up email with assessment and feedback.

Software

RStudio and Google Colab will be used to complete the homework assignments.

Tentative Schedule and Topics

Date | Topic | Homework
Tuesday, August 19, 2025 | Introduction |
Thursday, August 21, 2025 | Introduction | Homework 1
Tuesday, August 26, 2025 | The Data Science Life Cycle |
Thursday, August 28, 2025 | The Data Science Life Cycle | Homework 2
Tuesday, September 2, 2025 | NO CLASS - Monday Classes Meet |
Thursday, September 4, 2025 | Set Up |
Tuesday, September 9, 2025 | Set Up |
Thursday, September 11, 2025 | SQL | Homework 3
Tuesday, September 16, 2025 | SQL |
Thursday, September 18, 2025 | SQL | Homework 4
Tuesday, September 23, 2025 | NO CLASS (Rosh Hashanah) |
Thursday, September 25, 2025 | Data Preparation |
Tuesday, September 30, 2025 | Data Preparation | Homework 5
Thursday, October 2, 2025 | NO CLASS (Yom Kippur) |
Tuesday, October 7, 2025 | Project Pitching |
Thursday, October 9, 2025 | Exploratory Data Analysis | Homework 6
Tuesday, October 14, 2025 | Model Deployment |
Thursday, October 16, 2025 | Model Deployment | Homework 7
Tuesday, October 21, 2025 | Project Consultation Meetings I |
Thursday, October 23, 2025 | Project Consultation Meetings II | Homework 8
Tuesday, October 28, 2025 | Case Study – The Nutrition Project |
Thursday, October 30, 2025 | Case Study – The Nutrition Project | Homework 9
Tuesday, November 4, 2025 | Case Study – The Ames House Price Prediction Project |
Thursday, November 6, 2025 | Case Study – The Ames House Price Prediction Project | Homework 10
Tuesday, November 11, 2025 | Case Study – The Online Shopping Purchase Prediction |
Thursday, November 13, 2025 | Case Study – The Online Shopping Purchase Prediction | Homework 11
Tuesday, November 18, 2025 | Classical Statistical Data Analysis |
Thursday, November 20, 2025 | Classical Statistical Data Analysis | Homework 12
Tuesday, November 25, 2025 | Online Controlled Experiments (A/B Tests) |
Thursday, November 27, 2025 | NO CLASS (Thanksgiving break) |
Tuesday, December 2, 2025 | Final Presentations |
Thursday, December 4, 2025 | Final Presentations |