Math 457 Introduction to Statistical Learning. Spring 2020.

Binghamton University, State University of New York

Instructor: Xingye Qiao
Phone number: (607) 777-2593
Office: WH-134
Meeting time & location: Tuesday and Thursday 8:20–9:55 am in WH-G02.
Office hours: Thursday 10–11:30 am

This course is a 4-credit course, which means that in addition to the scheduled lectures/discussions, students are expected to do at least 9.5 hours of course-related work each week during the semester. This includes things like: completing assigned readings, participating in lab sessions, studying for tests and examinations, preparing written assignments, completing internship or clinical placement requirements, and other tasks that must be completed to earn credit in the course.

Prerequisites

Scientific programming in a language such as R, Matlab, or Python.
Linear regression and its inference.
Matrix algebra, preferably including orthogonality, eigenvalues and eigenvectors, and singular value decomposition.

Description

This course is a survey of statistical learning and data mining methods. It covers major statistical learning methods and concepts for both supervised and unsupervised learning. Topics include regression methods with sparsity or other regularization; model selection; graphical models; the statistical learning pipeline and best practices; an introduction to classification, including discriminant analysis, logistic regression, support vector machines, and kernel methods; nonlinear methods; dimension reduction, including matrix factorization-based approaches (principal component analysis and non-negative matrix factorization), multidimensional scaling, and independent component analysis; clustering; decision trees; random forests; boosting; and ensemble learning.

Learning Outcomes

Students will learn how and when to apply statistical learning techniques, their comparative strengths and weaknesses, and how to critically evaluate the performance of learning algorithms. Students completing this course should be able to

process and visualize different data types,
apply basic statistical learning methods to build predictive models or perform exploratory analysis,
demonstrate a basic understanding of the underlying mechanisms of predictive models, and evaluate and interpret such models,
properly tune, select and validate statistical learning models,
use analytical tools and software widely used in practice,
work both independently and in a team to solve problems, and
present and communicate findings effectively.

Recommended Texts

James, Witten, Hastie and Tibshirani, 2014. An Introduction to Statistical Learning with Applications in R. Book Home Page. The PDF file of the book can be downloaded for free. There is also an R library for this book.
Hastie, Trevor, Tibshirani, Robert, and Friedman, J. H. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY: Springer New York.
Hastie, Trevor, Tibshirani, Robert, and Wainwright, Martin. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 1st edition (May 7, 2015). The PDF file of the book can be downloaded for free.
Bühlmann, Peter, and van de Geer, Sara. Statistics for High-Dimensional Data. Springer-Verlag Berlin Heidelberg.
Boyd, Stephen, and Vandenberghe, Lieven. Convex Optimization. Cambridge University Press. The PDF file of the book can be downloaded for free.

Piazza

Please use Piazza (www.piazza.com) for all communications with me rather than email. Piazza is a question-and-answer platform. It supports LaTeX, code formatting, embedded images, and file attachments. You are encouraged to ask questions when you have difficulty understanding a concept or getting a piece of code to work; you can even ask questions anonymously. You can also answer questions from your classmates. I regularly monitor the answers and endorse the good ones.

Announcements will be sent to the class through Piazza.

Gradescope

We will use Gradescope to submit and grade homework. This will allow the instructor to grade all the work efficiently and give feedback in a timely manner.

Mycourses (Blackboard)

Mycourses (Blackboard) will only be used for recording grades on assignments and exams and for distributing solutions. The code and lecture notes can also be found on Blackboard.

Grading

Homework (30%): homework is assigned biweekly.
Midterm exam (30%): a midterm exam focusing on the theoretical part of the course will be administered.
Contest & Reports (35%): each student will take part in a group project. Successful completion of the project includes an initial report, a presentation and a final report.
Lecture attendance and participation (5%): attendance will be taken. Active participation, such as asking meaningful questions and answering others’ questions, will also count.

Homework Policy

There will be a deduction of 25% of the grade for homework that is not typeset using LaTeX. There will be a deduction of 15% of the grade for each day homework is late (the final grade for a homework that is N days late will be 0.85^N times the original grade). Homework may be discussed with classmates but must be written and submitted individually.
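
As an illustration of the late-penalty formula above, here is a minimal Python sketch. The function name late_adjusted_grade and the example scores are hypothetical, and the sketch covers only the 0.85^N lateness rule; how the LaTeX deduction combines with lateness is not specified here, so it is left out.

```python
def late_adjusted_grade(original_grade: float, days_late: int) -> float:
    """Apply the compounding late penalty: 0.85^N times the original grade.

    `original_grade` is the score before any lateness penalty and
    `days_late` is N, the number of days the homework is late.
    Both names are illustrative and not part of the course policy text.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    return original_grade * 0.85 ** days_late


if __name__ == "__main__":
    # Hypothetical example: a homework that originally earned 90 points.
    for n in range(4):
        print(f"{n} day(s) late: {late_adjusted_grade(90, n):.2f}")
```

Because the penalty compounds, each extra day removes 15% of the already reduced grade rather than 15% of the original grade, so a homework two days late keeps 72.25% of its original score, not 70%.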

Data Analysis Contest

Students will compete against each other in a Data Analysis Contest. The competition will begin on Tuesday, February 20 and can be completed in teams of 2 to 4 members. Grades will be based upon a progress report and a final report (one per team) as well as the contest results. Further details about the contest along with specific grading criteria will be given in a separate document and discussed in class.

Software

Jupyter Notebook will be used in class to illustrate both R and Python code. I recommend installing Jupyter Notebook via Anaconda.

Students are expected to write homework in LaTeX. For users with no experience with LaTeX, I suggest using the cloud LaTeX editing service at overleaf.com.

Tentative schedule

Lecture	Week	Date	Module	Tentative Topic	Assigned	Due
1	1	Jan-21, Tuesday	I. Regression with Sparsity	Introduction	HW0
2		Jan-23, Thursday		MSE & Least Squares
3	2	Jan-28, Tuesday		Ridge Regression	HW1	HW0
4		Jan-30, Thursday		Sparse Regression I
5	3	Feb-4, Tuesday		Sparse Regression II
6		Feb-6, Thursday		Graphical Models & Compressed Sensing
7	4	Feb-11, Tuesday	II. Pipeline for Statistical Learning	Model Selection and Assessment	HW2	HW1
8		Feb-13, Thursday		Model Validation
9	5	Feb-18, Tuesday		Case Studies & Logistic Regression
10		Feb-20, Thursday	III. Classification Methods	LR Computing & Other GLMs
11	6	Feb-25, Tuesday		Sparse GLM & Bayes Classifier	HW3	HW2
12		Feb-27, Thursday		LDA
13	7	Mar-3, Tuesday		SVM I: linear SVM
NA		Mar-5, Thursday		NO CLASS
14	8	Mar-10, Tuesday		SVM II: dual solution, kernel SVM
15		Mar-12, Thursday	IV. Nonlinear Methods	Nonlinear I: RKHS, KRR, Kernel PCA	HW4	HW3
16	9	Mar-17, Tuesday		Nonlinear II: Polynomial reg., smoothing, GAM, etc.
17		Mar-19, Thursday		Neural Network
18	10	Mar-24, Tuesday	V. Dimension Reduction	Dimension Reduction I: PCA-1
19		Mar-26, Thursday		Dimension Reduction I: PCA-2	HW5	HW4
20	11	Mar-31, Tuesday		Dimension Reduction II: Extensions, NMF
21		Apr-2, Thursday		Dimension Reduction II: ICA, MDS
NA	12	Apr-7, Tuesday		NO CLASS
NA		Apr-9, Thursday		NO CLASS
22	13	Apr-14, Tuesday	VI. Clustering	Clustering I: k-means
23		Apr-16, Thursday		Clustering II: k-means, EM, HC	HW6	HW5
24	14	Apr-21, Tuesday		Clustering III, HC; Trees
25		Apr-23, Thursday		Midterm Exam
26	15	Apr-28, Tuesday	VII. Ensemble Methods	Bagging; Random Forests
27		Apr-30, Thursday		Ensembles & Boosting
28		May-5, Tuesday		Gradient Boosting & XGBoost		HW6