Meeting time & location: Tuesday and Thursday 8:30–9:55 am at WH-100E.
Office hours: Tuesday 2–3 pm & Thursday 10–11 am
This course is a 4-credit course, which means that in addition to the scheduled lectures/discussions,
students are expected to do at least 9.5 hours of course-related work each week during the
semester. This includes things like: completing assigned readings, participating in lab sessions,
studying for tests and examinations, preparing written assignments, completing internship or
clinical placement requirements, and other tasks that must be completed to earn credit in the
course.
Prerequisite
I assume that you have knowledge of Advanced Linear Algebra and Statistical Inference (or Mathematical Statistics).
Topics
The theoretical aspects of Multivariate Statistical Analysis, including: multivariate normal distributions, the multivariate Central Limit Theorem, quadratic forms, Wishart distributions, Hotelling's T-squared, and inference about multivariate normal distributions.
Modern applied multivariate statistical methods, including: Principal Component Analysis, Canonical Correlation Analysis, classification (Bayes rule, linear and quadratic discriminant analysis, cross-validation, logistic regression, etc.), factor analysis and Independent Component Analysis, clustering and multidimensional scaling.
Other commonly used machine learning approaches, if time permits, including Classification and Regression Trees, Support Vector Machines and other large-margin classifiers, kernel methods, LASSO and sparsity methods, additive models, etc.
Learning Outcomes
Process and visualize different data types.
Identify and evaluate appropriate data analytics techniques to be used.
Understand the underlying mechanism of predictive models and evaluate and interpret such models.
Use analytical tools and software widely used in practice.
Work both independently and in a team to solve problems.
Learn to present and communicate the findings effectively.
Recommended Texts
The required texts are Härdle & Simar 2012 and Izenman 2013 (see below for details).
Elementary
Johnson, Richard A & Wichern, Dean W. 2007. Applied multivariate statistical analysis. Upper Saddle River, N.J: Pearson Prentice Hall.
Härdle, Wolfgang & Simar, Léopold. 2012. Applied multivariate statistical analysis. Berlin: Springer (also visit http://www.quantlet.de for sample code; search “MVA”). There is a newer (4th) edition, which should work as well.
Advanced and applied
Izenman, Alan Julian. 2013. Modern multivariate statistical techniques: Regression, classification, and manifold learning. New York: Springer New York. Book Home Page (including R, S-plus and MATLAB code and data sets)
Hastie, Trevor, Tibshirani, Robert, and Friedman, J. H. 2009. The elements of statistical learning: Data mining, inference, and prediction. New York, NY: Springer New York.
James, Witten, Hastie and Tibshirani, 2014. An Introduction to Statistical Learning with Applications in R. Book Home Page. The PDF file of the book can be downloaded for free. There is also an R library for this book.
Theoretical
Anderson, T. W. 2003. An introduction to multivariate statistical analysis. Hoboken, N.J: Wiley-Interscience.
Muirhead, Robb J. 1982. Aspects of multivariate statistical theory. New York: Wiley.
Working with R or SAS
Everitt, Brian, and Hothorn, Torsten. 2011. An introduction to applied multivariate analysis with R. New York: Springer.
Khattree, Ravindra, and Naik, Dayanand N. 1999. Applied multivariate statistics with SAS software. Cary, NC: SAS Institute.
Khattree, Ravindra, and Naik, Dayanand N. 2000. Multivariate data reduction and discrimination with SAS software. Cary, NC: SAS Institute.
Please use Piazza (www.piazza.com) for all electronic communications with me rather than email. Piazza is a question-and-answer platform. It supports LaTeX, code formatting, embedded images, and file attachments. You are encouraged to ask questions when you have difficulty understanding a concept or getting a piece of code to work; you can even ask questions anonymously. You can also answer questions from your classmates. I monitor the answers regularly and endorse those that make the most sense to me.
Announcements will be sent to the class using Piazza. All enrolled students should create a Piazza account at www.piazza.com. Click “enroll now” and select “Binghamton University,” then search for “Math 570.” Alternatively, use this link.
This is my first time using Piazza, so some adjustments may be needed along the way.
Blackboard
Blackboard will be used only for recording grades on assignments and exams and for distributing solutions. The code and lecture notes can also be found on Blackboard.
Grading
Homework (30%): homework is assigned every one to two weeks.
Midterm exam (30%): a midterm exam focusing on the theoretical part of the course will be administered.
Course project (35%): a group project will be assigned to each student. Successful completion of the project includes an initial report, a presentation and a final report.
Lecture attendance and participation (5%): meaningful activities (asking and answering questions) on Piazza also count.
Course project
Use an existing data set or create your own, apply inferential and analytic techniques learned in the class, write reports and give a presentation.
You may choose to work with 1–2 other people and submit the work products as a team. If you cannot find a team member, I may assign one to you. If you feel that the teammates you initially chose or were assigned to are uncooperative, you can choose to leave the team and work alone. In that case you may continue with the same project you have been working on, but you must write your own reports and give your own presentation. Leaving a team must be voluntary: nobody can force a team member out. However, you are not allowed to switch to another team. If you decide to leave your team and work alone, you must do so by March 14, 2017.
Members of the same team will receive the same grade for the course project. The project is worth 100 points in total, divided into three parts:
Initial report (10 pts): due March 14, 2017.
Presentation (30 pts): each team will give a 30-minute presentation about the outcome of the project.
Final report (60 pts): due in the final exam week.
The initial report should describe the team formation, the data, potential research questions and possible methods to use. It should not exceed one page. The leader of each team should send it to me via an individual note on Piazza, cc-ing the other team members.
Tips on writing the final report
The final report should include
Description of the research questions / issues (either scientific or statistical questions) and the significance of the problems.
Description of the data.
Preliminary studies: data visualization, dimension reduction, feature extraction, feature selection, statistical inference, model assumption checking (normality? transformation needed?), etc.
Statistical analysis
Methods: what analyses were done and why. If there is any challenge in analysis,
describe your approach to tackle the problem.
Results: A small number of well-designed and tailored tables and graphics may
be appropriate. No copy-paste of large chunks of software outputs!
Conclusion: Convey your findings to a broader audience. Discuss any broader impact.
Send all your computer code to me in an individual message on Piazza; do not include it in the final report.
Typos and grammatical errors will be harshly penalized. If you are not yet a master of writing, read The Elements of Style. There are a few copies in the library.
The final report should be written with the assumption that its audience consists of college-educated persons who have taken only elementary statistics. You are NOT writing a report for your professor to read.
The final report should not exceed 6 pages, including figures and tables, and must begin with an
appropriate title highlighting your choice of topic and analysis.
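As an illustration of the “transformation needed?” question under preliminary studies, here is a minimal sketch in Python (standard library only). The data values and the function name sample_skewness are made up for this example; it is one simple check among many, not a required procedure. It compares the moment-based skewness of a variable before and after a log transformation.

```python
import math


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (close to 0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical right-skewed measurements (all positive, so the log is defined).
raw = [12, 15, 18, 20, 22, 25, 30, 45, 80, 250]
logged = [math.log(v) for v in raw]

print(f"skewness before log: {sample_skewness(raw):.2f}")
print(f"skewness after log:  {sample_skewness(logged):.2f}")
```

A skewness much closer to zero after the transformation suggests that the transformed variable is a better fit for methods that assume normality. A plot (histogram or Q-Q plot) remains the more convincing evidence in a report.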
Methods to use
Statistical methods: we will have seen hypothesis testing, dimension reduction, classification, and factor analysis and possibly cluster analysis. Choose several from these methods. The goal is not to use as many different methods as possible; rather, choose the ones most appropriate for the data. Alternatively, some of you may want to tackle a project involving a multivariate method that we will not cover in this course, or to compare the performance of several statistical methods on datasets. These are all attractive options.
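For teams that choose to compare several methods, one common approach is k-fold cross-validation. The following is a minimal sketch in Python (standard library only), assuming simulated two-class data and two deliberately simple classifiers; all function names and data here are hypothetical and chosen only for illustration.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nearest_centroid(train_x, train_y, x):
    """Predict the class whose mean vector is closest to x."""
    sums, counts = {}, {}
    for xi, yi in zip(train_x, train_y):
        total = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            total[j] += v
        counts[yi] = counts.get(yi, 0) + 1

    def dist2(label):
        return sum((t / counts[label] - v) ** 2 for t, v in zip(sums[label], x))

    return min(sums, key=dist2)


def one_nn(train_x, train_y, x):
    """Predict the class of the single nearest training point."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[best]


def cv_accuracy(classifier, xs, ys, k=5, seed=0):
    """Average accuracy over k folds; each fold is held out exactly once."""
    correct = 0
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train_x = [xs[i] for i in range(len(xs)) if i not in held_out]
        train_y = [ys[i] for i in range(len(xs)) if i not in held_out]
        correct += sum(classifier(train_x, train_y, xs[i]) == ys[i] for i in fold)
    return correct / len(xs)


# Simulated data: two bivariate Gaussian clouds centered at (0, 0) and (3, 3).
rng = random.Random(1)
xs = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
      + [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(50)])
ys = [0] * 50 + [1] * 50

for name, clf in [("nearest centroid", nearest_centroid), ("1-NN", one_nn)]:
    print(f"{name}: CV accuracy = {cv_accuracy(clf, xs, ys):.2f}")
```

Using the same folds (same seed) for every method keeps the comparison fair. In a real project, the classifiers would be the methods under study, and the accuracy would typically be reported with some measure of variability across folds.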
Data Sources
Find your own data set online (e.g., google “predictive analytics data set”); you will find plenty. Below are some data repositories.
Both the homework assignments and the midterm exam include sets of extra problems. Ph.D. students in the Department of Mathematical Sciences, and those who are interested in pursuing a Ph.D. in the department, must complete the extra problems. Completing the extra problems will not earn bonus points, but unsatisfactory performance on them may have a negative impact on your continuation to the Ph.D.
Software
There is no designated software for this course. You may use the software that makes the most sense for you. However, many students find R and MATLAB relatively easy to use.