Course Summary

Course Goal

This course aims to get students ready to study statistics as quickly as possible, with a minimum of mathematical prerequisites. The text (Wackerly et al.) was chosen with this aim in mind.

More specifically, this course covers the material in Chapters 2-7 of Wackerly. This page is an overview of that material. The idea is that by reading this page, you'll see the broad outline of the material. This creates a mental structure, so when you learn a fact about, for example, the Gamma distribution, you'll have a place in your mind to put it. This increases the chances that you retain this material, whether for the final exam, or two semesters later.

The rest of this page is organized by the chapter of the book corresponding to the relevant material.

Chapter 2: Axioms for Probability and Consequences

The notion of probability has been around for at least as long as people have been gambling, which is a very long time. The word has been in use so long that there is a subfield of philosophy devoted to its interpretation. In this course we study only a mathematical notion of probability. The basic idea is that a probability is an assignment of numbers (probabilities) to sets of outcomes ("events") satisfying some intuitively meaningful axioms. (For example, probabilities have to be between 0 and 1.) We define various "probability models", or toy versions of the world, which we understand, and talk of probability only makes sense with reference to a specific model.

Because we understand these simplified models well, we can say what aggregate features (e.g. mean, standard deviation) the data produced by such models should have. Most of this course will be about the situation where we have defined a model and we do some work to establish its features. Understanding whether there is a good correspondence between such models and the real world generally demands that we be able to do this work. Evaluating the quality of such a correspondence, and finding a good one, is mostly in the following course, Math 448, Statistics.

In Chapter 2 the axioms for probability are introduced along with some related concepts, like conditional probability and independence. Consequences are deduced, like the Law of Total Probability and Bayes' Rule. One useful example for developing your intuition: if we take the "sample space" to be the unit square, events to be subsets of the unit square, and substitute area for probability, all of the axioms of probability are satisfied. Two basic types of probability computation are introduced: the "Event-Composition Method" and the "Sample-Point Method". (This terminology is overblown, and I have never seen it used outside of the context of the Wackerly text.)

The Event-Composition Method assumes that you know some facts about your probability model, for example that the probability of an event E is 0.3. You then deduce the probability of another event by using consequences of the laws of probability (which are the lemmas and theorems of Chapter 2). For example, you might deduce that the probability of E not happening is 0.7.

The Sample-Point Method assumes that you have enumerated all possible outcomes and know for some reason that they are all equally likely. You can then deduce the probability of an event by counting the number of ways it can happen and dividing by the number of all possible outcomes. For example, if we roll a die, there are six possible outcomes, all of which are equally likely (in our model) and there are three possible even numbers. So we can conclude that the probability of rolling an even number is 3/6, or 0.5.

Both these methods can get much more complicated. For example, counting the number of possible 5-card hands drawn from a 52-card deck that are full houses is not as simple as the counting in the previous paragraph.
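This kind of count is easy to check with a few lines of code. The following sketch uses Python's standard-library `math.comb` and the usual card-counting argument for full houses:

```python
from math import comb

# A full house: pick the rank of the three-of-a-kind (13 ways) and 3 of its
# 4 suits, then a different rank for the pair (12 ways) and 2 of its 4 suits.
full_houses = 13 * comb(4, 3) * 12 * comb(4, 2)
total_hands = comb(52, 5)

print(full_houses, total_hands)   # 3744 2598960
print(full_houses / total_hands)  # about 0.00144
```

Dividing by the total number of 5-card hands then gives the probability of a full house, exactly as in the Sample-Point Method.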

Bayes' Rule is best understood through a series of examples. The most frequent example is that of a patient who tests positive for a rare disease. The test is very accurate, but the disease is very rare. What is the probability that the patient actually has the disease? There are two ways the patient could test positive: a false positive, or actually having the disease. The key question is what the relative probabilities of these two events are. Bayes' Rule formalizes and quantifies this.
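The arithmetic behind this example can be sketched in a few lines. The numbers below (0.1% prevalence, 99% sensitivity, 99% specificity) are hypothetical, chosen only to illustrate the effect:

```python
# Hypothetical numbers, chosen for illustration only.
prior = 0.001   # P(disease): the disease is rare
sens = 0.99     # P(test positive | disease): the test is accurate
fpr = 0.01      # P(test positive | no disease) = 1 - specificity

# Law of Total Probability for the denominator, then Bayes' Rule.
p_positive = prior * sens + (1 - prior) * fpr
posterior = prior * sens / p_positive

print(round(posterior, 3))  # 0.09: even after a positive test, disease is unlikely
```

Because the false positives (the healthy 99.9% of the population times a 1% error rate) outnumber the true positives, the posterior probability stays small.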

The Monty Hall problem is important for the type of reasoning involved and the use of the Law of Total Probability, not for the results. I will keep changing the problem around so you cannot develop a formula and must do the reasoning every time.

Chapter 3: Discrete Random Variables

A random variable is defined in Chapter 2 as "a real-valued function on a sample space". The fancy terminology is easily understood in examples, so you don't need to worry about it. For example, the "experiment" might be, we flip 2 coins, and the "random variable" might be that we count the number of heads. From this we can describe the random variable as a list of possible values (here 0, 1, 2) and the probabilities with which they are observed (here 0.25, 0.5, and 0.25, respectively). This list of possible values and their probabilities is collected into a probability function, or probability mass function \(p(y) = P(Y = y) \). (There is a convention, used throughout the text, that random variables are capitalized, and real numbers are lower-case.)
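This coin-flip example is small enough to enumerate directly. A quick sketch in Python (assuming fair, independent coins, as in the text):

```python
from itertools import product
from collections import Counter

# Sample space: all 4 equally likely sequences of two fair-coin flips.
outcomes = list(product("HT", repeat=2))

# Random variable Y = number of heads in the sequence.
counts = Counter(seq.count("H") for seq in outcomes)
p = {y: counts[y] / len(outcomes) for y in sorted(counts)}

print(p)  # {0: 0.25, 1: 0.5, 2: 0.25}
```

Each sequence is a sample point; the random variable maps it to a real number, and grouping equal values produces the probability function.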

A random variable is roughly as general as a function, and there are only a few results in the chapter for general random variables. Among these are Markov's Inequality and Chebyshev's Inequality. The bulk of this chapter and the next is devoted to the study of specific random variables, which are useful in various simple probability models.

The random variables studied in this chapter are discrete, which means that they take on a specific list of values, like 0, 1, 2 in the example above, rather than a continuous range of values (like any real number between 0 and 1). All of the random variables of Chapter 3 are in some sense based on repeated independent trials, which result in either "success" or "failure". Here is the list: the binomial, geometric, negative binomial, hypergeometric, and Poisson distributions (the Poisson arising as a limiting case of the binomial).

The probability functions, means, variances, and moment generating functions of these random variables are computed, using various tricks from calculus.

A few auxiliary concepts that are frequently useful are also introduced. The (cumulative) distribution function \( F(y) = P(Y \leq y) \) is closely related to the probability function: \( F(y) = \sum_{x \leq y} p(x) \). The moment generating function is another way of encoding the information in a probability distribution. While the definition \( m_Y(t) = E[e^{tY}] \) is not an obvious step, it turns out to be useful, and an expansion in power series shows its relationship to the moments \(E[Y^k]\), namely that \( m_Y^{(k)}(0) = E[Y^k] \). The moment generating function usually carries all of the information we need to know about a random variable. There is a theorem (Uniqueness of the Moment Generating Function) that if two random variables have the same moment generating function, then they have the same distribution. While there are some exceptions, like the log-normal distribution, we can get away with ignoring them in this course.
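To see where the moments come from, expand the exponential in a power series (this is the standard derivation, sketched here for reference): $$ m_Y(t) = E[e^{tY}] = E\left[ \sum_{k=0}^{\infty} \frac{(tY)^k}{k!} \right] = \sum_{k=0}^{\infty} \frac{t^k}{k!} E[Y^k], $$ so differentiating \( k \) times and setting \( t = 0 \) kills every term except \( E[Y^k] \).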

Chapter 4: Continuous Random Variables

In this chapter the concept of a continuous random variable, which can assume any real value in a range, is introduced. In place of the probability function, which gives the probability of assuming a specific value, we have a probability density function, whose integral over a range gives the probability of the variable being in that range. That is, $$ P( a \leq Y \leq b ) = \int_a^b f(y) \, dy. $$ This means that for any particular value \( a \), the probability that a continuous random variable \( Y \) is equal to \( a \) is zero. That is, \( P(Y = a) = 0 \). This is similar to the fact that a point has area zero, but the unit square, which is composed of many points, has area 1. The mean and variance of certain standard random variables are computed from their defining probability density functions using tricks from calculus. The random variables studied are the uniform, normal, gamma (with its important special cases, the exponential and chi-square), and beta distributions.

All the same concepts from discrete random variables apply in the continuous case, using integrals instead of sums. This includes the moment generating function and the cumulative distribution function.
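As a sanity check on the integral interpretation, here is a sketch using the exponential density \( f(y) = e^{-y} \) (chosen only as a convenient example) and a midpoint Riemann sum:

```python
from math import exp

# Exponential density with mean 1: f(y) = e^{-y} for y >= 0.
def f(y):
    return exp(-y)

# Midpoint Riemann sum approximating P(a <= Y <= b) = integral of f over [a, b].
a, b, n = 0.5, 2.0, 100_000
dy = (b - a) / n
approx = sum(f(a + (i + 0.5) * dy) for i in range(n)) * dy

# Closed form from the antiderivative -e^{-y}.
exact = exp(-a) - exp(-b)
print(abs(approx - exact) < 1e-6)  # True
```

Shrinking the interval \([a, b]\) to a single point drives the integral, and hence the probability, to zero, matching \( P(Y = a) = 0 \).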

Chapter 5: Joint Distributions of Random Variables

In practice we often have situations where two or more random variables are generated by the same experiment. There may be some connection between their values, and our model for this is a joint distribution of the two random variables. In the discrete case, this takes the form of a joint probability function \(p(x,y)\) of the two random variables \(X\) and \( Y \) that gives the probability that simultaneously we have \( X=x \) and \( Y=y \). In the continuous case, the interpretation of the probability density is again via an integral.

A joint distribution gives us the information about the distribution of the individual variables as well, and in this chapter the marginal distribution and conditional distribution are defined. These definitions can give rise to subtle issues like the Borel–Kolmogorov paradox, but we don't need to know about such things for Math 448, and the text sets things up very carefully so we don't have to worry about them.

From the joint distribution of two (or more) random variables, the expectation of any function of them can be computed by an integral, and this has the properties (linearity, etc.) that one would expect. One can also determine from the joint distribution whether the variables are independent. Independence requires the density or probability function \( f(x,y) \) be a product \( g(x)h(y) \).
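For a concrete discrete example (two fair dice, an assumption of this sketch), the marginals can be read off by summing the joint probability function, and independence is exactly the product check:

```python
from itertools import product

# Joint probability function of two fair dice: p(x, y) = 1/36 for each pair.
p = {(x, y): 1 / 36 for x, y in product(range(1, 7), repeat=2)}

# Marginal distributions: sum the joint probabilities over the other variable.
px = {x: sum(p[(x, y)] for y in range(1, 7)) for x in range(1, 7)}
py = {y: sum(p[(x, y)] for x in range(1, 7)) for y in range(1, 7)}

# Independence: the joint function factors as the product of the marginals.
independent = all(abs(p[(x, y)] - px[x] * py[y]) < 1e-12
                  for x, y in product(range(1, 7), repeat=2))
print(independent)  # True
```

Replacing the joint function with, say, the joint distribution of one die and the sum of both would make the product check fail.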

In addition, from the joint distribution, one can compute the covariance and correlation of two random variables. Covariance is a measure of the extent to which the variables vary together, and correlation normalizes this for the variances of the two variables, so that correlation is always between \(-1\) and \(1\). In analogy with linear algebra, you can think of covariance as an inner product, and correlation as the cosine of the angle between the random variables (in the space of all possible random variables). The definitions of covariance and correlation must be memorized.
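A small worked example (two fair dice, with \( U \) the first die and \( V \) the sum of both, chosen for illustration): the covariance of \( U \) and \( V \) works out to \( V(U) \), and the correlation to \( 1/\sqrt{2} \):

```python
from itertools import product
from math import sqrt

# All 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

# Expectation of any function of the outcome (sample-point method).
def E(g):
    return sum(g(a, b) for a, b in outcomes) / len(outcomes)

# U = first die, V = sum of both dice.
mu_u = E(lambda a, b: a)
mu_v = E(lambda a, b: a + b)
cov = E(lambda a, b: (a - mu_u) * (a + b - mu_v))
var_u = E(lambda a, b: (a - mu_u) ** 2)
var_v = E(lambda a, b: (a + b - mu_v) ** 2)

corr = cov / sqrt(var_u * var_v)
print(round(corr, 4))  # 0.7071
```

The correlation of about 0.71 reflects that the first die explains half of the variance of the sum, in keeping with the inner-product analogy.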

In addition, the Multinomial Probability Distribution and the Bivariate Normal Distribution are introduced. The multinomial distribution can be thought of as a generalization of the Binomial to the case where more than just two outcomes of each trial are possible. The Bivariate Normal Distribution, which is a basic tool for statistics, is a construction of two normal random variables with specified variances and correlation.

Chapter 6: Functions of Random Variables

If a random variable is a function of another random variable, say \( U = f(X) \), or of several, say \( U = f(X,Y) \), we may wish to find the distribution of \( U \) from the function \( f \) and the (joint) distribution of \( X \) and \( Y \). This chapter is about various methods for doing this. Three methods are studied.

The method of distributions works directly from the definition of the cumulative distribution function of \( U \). We write \(F(u) = P(U \leq u) = P(f(X,Y) \leq u) \) and attempt to compute this probability from the known joint distribution of \( X \) and \( Y \).
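A standard worked example (not tied to any particular exercise): if \( X \) is uniform on \( (0,1) \) and \( U = X^2 \), then for \( 0 \leq u \leq 1 \) $$ F_U(u) = P(X^2 \leq u) = P(X \leq \sqrt{u}) = \sqrt{u}, $$ and differentiating gives the density \( f_U(u) = \frac{1}{2\sqrt{u}} \).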

The method of transformations is an offshoot of this method in which the relationship between \( U \) and \( X \) is given by a one-to-one function \( h \). When \( h \) is increasing, we can derive a formula for the distribution function of \( U \) from the distribution function of \( X \) by using the inverse of \( h \) to rewrite \( P(h(X) \leq u) \) as \( P(X \leq h^{-1}(u)) \); a decreasing \( h \) reverses the inequality, and in either case the resulting density formula involves the derivative of \( h^{-1} \).

The method of moment generating functions is to compute the moment generating function of \( U \), and hope to recognize it as the moment generating function of a distribution we know.
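For example (a standard computation, stated here for illustration): if \( X \) and \( Y \) are independent Poisson random variables with means \( \lambda_1 \) and \( \lambda_2 \), then $$ m_{X+Y}(t) = m_X(t)\, m_Y(t) = e^{\lambda_1 (e^t - 1)} e^{\lambda_2 (e^t - 1)} = e^{(\lambda_1 + \lambda_2)(e^t - 1)}, $$ which we recognize as the moment generating function of a Poisson random variable with mean \( \lambda_1 + \lambda_2 \).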

There is a multivariable generalization of all this, the Jacobian method, which gives us a multiplicative factor by which to change the probability density in the form of the determinant of a matrix of partial derivatives. This determinant is the factor by which the relevant function changes volumes locally at the given point.
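A minimal sketch of this method: if \( U = X + Y \) and \( V = X - Y \), the inverse transformation is \( X = (U+V)/2 \), \( Y = (U-V)/2 \), with Jacobian determinant $$ \left| \det \begin{pmatrix} 1/2 & 1/2 \\ 1/2 & -1/2 \end{pmatrix} \right| = \frac{1}{2}, $$ so \( f_{U,V}(u,v) = \frac{1}{2} f_{X,Y}\!\left( \frac{u+v}{2}, \frac{u-v}{2} \right) \).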

There is also a brief introduction to order statistics.

Chapter 7: Sampling Distributions and the Central Limit Theorem

In the situation where we take repeated independent samples from the same distribution, we may combine these samples into various computed measurements, for example the mean, maximum, or standard deviation. These computed measurements are combinations of random variables, as discussed in the previous chapter, and therefore have distributions we can study. The main result of this chapter is the Central Limit Theorem, which says (roughly speaking) that the sum of many (in practice this means \( n > 30 \)) independent and identically distributed random variables, when properly scaled, has approximately the standard normal distribution. The proof of the Central Limit Theorem in the text is a computation showing that the limit of the moment generating functions is the moment generating function of the standard normal distribution.
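A quick simulation makes the theorem tangible. In this sketch the choices (n = 40 draws from Uniform(0, 1), 20,000 repetitions) are arbitrary; it standardizes sample means and checks that they have roughly mean 0 and variance 1:

```python
import random
from math import sqrt

random.seed(0)

# Uniform(0, 1) has mean 1/2 and variance 1/12.
n, reps = 40, 20_000
mu, sigma = 0.5, sqrt(1 / 12)

# Each replicate: the standardized mean of n iid Uniform(0, 1) draws.
zs = []
for _ in range(reps):
    ybar = sum(random.random() for _ in range(n)) / n
    zs.append((ybar - mu) / (sigma / sqrt(n)))

# By the Central Limit Theorem these should be approximately standard normal.
m = sum(zs) / reps
v = sum((z - m) ** 2 for z in zs) / reps
print(abs(m) < 0.05, abs(v - 1) < 0.05)  # True True
```

A histogram of `zs` would show the familiar bell shape, even though each individual draw is uniform rather than normal.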

A special case of the Central Limit Theorem is studied in detail: the normal approximation to the binomial distribution. This makes sense, since a binomial random variable is a sum of independent and identically distributed Bernoulli random variables.
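The approximation is easy to check numerically. This sketch (n = 100, p = 0.5, and the cutoff 45 are arbitrary choices for the example) compares the exact binomial probability with the normal approximation, using the usual continuity correction:

```python
from math import comb, erf, sqrt

# Exact P(X <= 45) for X ~ Binomial(n = 100, p = 0.5).
n, p, k = 100, 0.5, 45
exact = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

# Normal approximation: X is roughly Normal(np, np(1-p)); the continuity
# correction replaces 45 by 45.5 before standardizing.
mu, sigma = n * p, sqrt(n * p * (1 - p))

def phi(z):
    # Standard normal cumulative distribution function via erf.
    return 0.5 * (1 + erf(z / sqrt(2)))

approx = phi((k + 0.5 - mu) / sigma)
print(abs(exact - approx) < 0.005)  # True
```

The two probabilities agree to about three decimal places, which is typical for n this large when p is not too close to 0 or 1.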

In addition, the t- and F-distributions, which are frequently used in statistics, are introduced. The t-distribution can be defined conceptually as follows: if \( Y_1, \ldots , Y_n \) are iid samples from a normal distribution with mean \( \mu \) and variance \( \sigma^2 \), then we define \( \bar{Y} = \frac{1}{n}(Y_1 + \cdots + Y_n) \). A fundamental computation, repeated many times in the course, shows that \( E[\bar{Y}] = \mu \) and \( V[\bar{Y}] = \sigma^2/n \). Using, for example, the method of moment generating functions, we can see that \( Z = \frac{\bar{Y}-\mu}{\sqrt{\sigma^2/n}} \) has the standard normal distribution (mean \( 0 \), variance \( 1 \)). If we replace \( \sigma \) with the estimate \( S \) of the standard deviation derived from the samples, where $$ S^2 = \frac{1}{n-1} \sum (Y_i - \bar{Y})^2, $$ then the resulting statistic \( T = \frac{\bar{Y}-\mu}{S/\sqrt{n}} \) has the \( t \)-distribution with \( n-1 \) degrees of freedom. The conceptual definition of the \( F \)-distribution is that it is obtained by comparing sample variances from two samples of a standard normal distribution.
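In symbols (this is the standard construction, sketched here for orientation): with \( Z \) standard normal and \( W = (n-1)S^2/\sigma^2 \) having a \( \chi^2 \) distribution with \( n-1 \) degrees of freedom, independent of \( Z \), $$ T = \frac{\bar{Y} - \mu}{S/\sqrt{n}} = \frac{Z}{\sqrt{W/(n-1)}}, $$ and a ratio of this form is precisely the definition of a \( t \) random variable with \( n-1 \) degrees of freedom.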