Course Summary

Course Goal

This course begins the study of mathematical statistics, with a minimum of mathematical prerequisites. The text (Wackerly et al.) was chosen with this aim in mind.

More specifically, this course covers the material in Chapters 8-10 of Wackerly. This page is an overview of that material. The idea is that by reading this page, you'll see the broad outline of the material. This creates a mental structure, so when you learn a fact about, for example, the method of moments, you'll have a place in your mind to put it. This increases the chances that you retain this material, whether for the final exam or two semesters later.

The prerequisite for this course is Math 447. You need a strong grasp of that material, particularly Chapter 6 of Wackerly, to attempt this course. In Math 447 we studied the situation where we understood completely the data generating process (e.g. the random variable has the exponential distribution with parameter \( \lambda = 2 \)) and we worked towards an understanding of the data generated by this process (e.g. what are its aggregate features, like mean and variance).

In 448 we turn this around and try to deduce facts about the data generating process from the data set. For example, we might try to estimate the variance of a random variable from many samples of the variable. Knowing that a random variable has the Gamma distribution, for some \( \alpha \) and \( \beta \), we might try to estimate \( \alpha \) and \( \beta \) from many samples of the variable. Or, knowing that a variable has the normal distribution with unknown mean, and known variance \( \sigma^2 \), we might want to form a belief as to whether the mean is nonzero. Both estimation (Chapters 8 and 9) and hypothesis testing (Chapter 10) fall into the category of deducing facts about the data generating process from the data set.

Generally in this course we will be assuming that we have a probability model for our data (e.g. it is normal with mean \( \mu \) and variance \( \sigma^2 \) ) and trying to infer something about the parameters of the model from the data. If the model is not reasonable, then most of these procedures will yield nonsense.

There is a prior step, in which we investigate the data to see what model might be reasonable, by, for example, plotting a histogram. For example, if we did this for household net worth in the United States, we would immediately see that the normal model mentioned above is not reasonable. However, it would be much more reasonable to apply a normal model if the variable of interest is "height of adult males". Learning to do this prior step (investigating the data) is very important, but not really the subject of this course. I suggest Math 329, or at least learning something about R and ggplot2. Knowing how to use R is an employable skill!

The rest of this page is organized by the chapter of the book corresponding to the relevant material.

Chapter 8: Estimation

Whatever our data generating process is, it has some parameters of interest, like the mean, the variance, or the parameters defining any of the distributions studied in Math 447. Estimation is about procedures for guessing these parameters based on a sample from the data generating process. There are two types of estimate for a target parameter: a single "best guess" at what the parameter is, called a point estimate, and a guess that the parameter falls into a range, called an interval estimate.

Some estimators are better than others, and we want to say why. To this end, the bias, error of estimation, mean-square error, and variance of an estimator \( \hat{\theta} \) of a parameter \( \theta \) are defined. For convenience, we repeat the definitions below.

There is an important relationship here: the bias-variance tradeoff, which says that \( \textrm{MSE}( \hat{\theta} ) = V[\hat{\theta}] + B(\hat{\theta})^2\). This relationship applies not just to classical statistical estimators, but all sorts of schemes for machine learning.
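As a quick numerical check of this identity, here is a sketch in Python (not from the text; the normal model, sample size, and seed are all invented). It estimates \( \sigma^2 \) with the biased estimator that divides by \(n\) rather than \(n-1\), and verifies that the simulated mean-square error equals the simulated variance plus the squared bias:

```python
# Monte Carlo check of MSE(theta_hat) = V[theta_hat] + B(theta_hat)^2,
# using the (biased) variance estimator that divides by n.
# A sketch, not from the text; model, sample size, and seed are arbitrary.
import random
import statistics

random.seed(1)
n, reps, true_var = 10, 20000, 4.0   # N(0, sigma^2) samples with sigma^2 = 4

estimates = []
for _ in range(reps):
    ys = [random.gauss(0.0, 2.0) for _ in range(n)]
    ybar = sum(ys) / n
    estimates.append(sum((y - ybar) ** 2 for y in ys) / n)  # divides by n: biased

mean_est = statistics.fmean(estimates)
bias = mean_est - true_var                 # theory: -sigma^2/n = -0.4
var_est = statistics.pvariance(estimates)
mse = statistics.fmean((e - true_var) ** 2 for e in estimates)

# The decomposition should hold (the cross term vanishes algebraically):
print(mse, var_est + bias ** 2)
```

The two printed numbers agree to floating-point accuracy, because the cross term in the expansion of the mean-square error vanishes identically.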

Section 8.3 gives a table with some common unbiased estimators and their variances. The four cases considered are the population mean \( \mu \), a population proportion \( p \), and the differences \( \mu_1 - \mu_2 \) and \( p_1 - p_2 \) between the means or proportions of two populations.

In section 8.4 the error of an estimator \( \hat{\theta} \) is defined as \( |\hat{\theta} - \theta| \). We then consider the question of how likely it is that the error in estimation is less than certain bounds. Technically, we cannot generally compute this, as the probability distribution of \( \hat{\theta} \) depends on the unknown parameter \( \theta \). However, using general results like Tchebysheff's Theorem and empirical results like the 95% rule leads us to compute things like 2-standard deviation bounds on the error of estimation, using the standard deviations of the estimators.
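For instance, here is a sketch of a 2-standard-deviation bound on the error of estimation for a sample proportion (the poll counts below are invented, and the standard deviation of \( \hat{p} \) is approximated by plugging \( \hat{p} \) into \( \sqrt{p(1-p)/n} \)):

```python
# Hypothetical poll: 540 of 1000 respondents favor a proposal.
# Two-standard-deviation bound on the error of estimation for p-hat,
# approximating sigma_{p-hat} by sqrt(p-hat (1 - p-hat)/n). Counts invented.
import math

n, y = 1000, 540
p_hat = y / n
se = math.sqrt(p_hat * (1 - p_hat) / n)
bound = 2 * se
print(p_hat, round(bound, 4))  # error of estimation below ~0.0315 with high probability
```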

We go on in 8.5 to formally consider interval estimators, which enable us to make statements of the form, "a 90% confidence interval for the mean is \((47,55)\)". We are able to obtain these interval estimators in some cases of interest through the pivotal method. This involves finding a pivotal quantity \( Q \), which by definition is a quantity such that

  1. \( Q \) is a function only of the samples \(Y_1, \ldots, Y_n\) and the unknown parameter \( \theta \).
  2. The distribution of \(Q \) does not depend on \( \theta \).
It is by no means clear that such a \(Q\) always exists, but it does in many cases of interest.

In 8.6 we see that in the large-sample case, it follows from the central limit theorem that, for the four unbiased estimators of Section 8.3, \[ Z = \frac{ \hat{\theta} - \theta } {\sigma_{\hat{\theta}}} \] has approximately the standard normal distribution and is therefore a pivotal quantity. Confidence intervals can therefore be derived from this observation.
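A sketch of such a large-sample interval in Python (the data, seed, and confidence level are invented; \(S\) is substituted for the unknown \( \sigma \), as the large-sample theory permits):

```python
# Large-sample 95% CI for a mean via the pivot Z = (ybar - mu)/(s/sqrt(n)).
# The data are simulated exponential samples (an invented example); with
# n = 200 the CLT approximation is good even though the data are skewed.
import math
import random
from statistics import NormalDist, fmean, stdev

random.seed(2)
n = 200
ys = [random.expovariate(0.5) for _ in range(n)]   # true mean = 2

ybar, s = fmean(ys), stdev(ys)
z = NormalDist().inv_cdf(0.975)                    # ~1.96 for 95% confidence
half = z * s / math.sqrt(n)
ci = (ybar - half, ybar + half)
print(ci)
```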

In Section 8.7, using the dependence of the confidence interval on the sample size \(n\), we address the problem of selecting the sample size to meet some criterion or restriction. This involves writing down an expression for the confidence interval involving \(n\) and doing some algebra.
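For example, requiring a 95% confidence interval for a proportion to have half-width at most \(B = 0.03\) (an arbitrary target), and using the conservative worst case \( p = 0.5 \), the algebra gives \( n = (z_{\alpha/2}/B)^2\, p(1-p) \):

```python
# Sample size so that a 95% CI for a proportion has half-width at most 0.03,
# using the conservative worst case p = 0.5. A standard back-of-the-envelope
# calculation; the target half-width 0.03 is an arbitrary example.
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)        # ~1.96
B = 0.03                               # desired bound on the error of estimation
p = 0.5                                # worst case: maximizes p(1-p)
n = math.ceil((z / B) ** 2 * p * (1 - p))
print(n)   # 1068 respondents needed
```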

Section 8.8 addresses the small-sample case. Here we cannot use the central limit theorem and we need extra hypotheses. We assume that we have samples from a normal distribution with \( \mu \) and \(\sigma \) unknown, or from two normal distributions with parameters unknown but equal variances. Under these hypotheses, we can use the theorems of Chapter 7 and the definition of the t-distribution to show that \[ T = \frac{ \bar{Y}- \mu} {S/\sqrt{n}}, \quad \text{where} \quad S^2 = \frac{1}{n-1} \sum_{i=1}^n(Y_i - \bar{Y})^2 \] has the t-distribution with \(n-1\) degrees of freedom. In the difference-in-means case, we use a similar fraction which also has the t-distribution.
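A sketch of the resulting small-sample interval (the ten measurements are invented; the critical value \( t_{0.025} \) with 9 degrees of freedom is taken from a standard t table):

```python
# Small-sample 95% t-interval for the mean of a normal population.
# The ten measurements are invented for illustration; t_{.025} with
# n - 1 = 9 degrees of freedom is 2.262 from a standard table.
import math
from statistics import fmean, stdev

ys = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 9.7, 10.3, 10.1]
n = len(ys)
ybar, s = fmean(ys), stdev(ys)          # stdev divides by n - 1, matching S
t_crit = 2.262                          # t_{0.025}, 9 degrees of freedom
half = t_crit * s / math.sqrt(n)
ci = (ybar - half, ybar + half)
print(ci)
```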

Section 8.9 solves the problem of finding confidence intervals for \( \sigma^2 \), when we have samples from a normal distribution with \( \mu \) and \( \sigma \) unknown. It turns out that \[ Q = \frac{ (n-1) S^2 } {\sigma^2}, \quad (\text{see above for definition of } S^2) \] has the \( \chi^2 \)-distribution with \(n-1\) degrees of freedom, by a result proved in Chapter 7. Thus, it is a pivotal quantity and a confidence interval for \( \sigma^2 \) can be derived.
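A sketch of this interval on an invented sample (the chi-square table values for 9 degrees of freedom are \( \chi^2_{0.025} = 19.023 \) and \( \chi^2_{0.975} = 2.700 \)):

```python
# 95% CI for sigma^2 from Q = (n-1)S^2 / sigma^2 ~ chi-square(n-1).
# The ten measurements are invented; the chi-square values for 9 df
# come from a standard table.
from statistics import variance

ys = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 9.7, 10.3, 10.1]
n = len(ys)
s2 = variance(ys)                       # S^2, divides by n - 1
chi2_upper, chi2_lower = 19.023, 2.700  # upper- and lower-tail values, 9 df
ci = ((n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower)
print(ci)
```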

Chapter 9: Properties of Estimators

In this chapter we study efficiency, which means "low variance", consistency, which means "correct in the long run", and sufficiency, which means "carries all the information about the target parameter". We repeat here the definitions for reference.

Section 9.2: The relative efficiency of two unbiased estimators \( \hat{\theta_1} \) and \( \hat{\theta_2} \) is \( \textrm{eff}(\hat{\theta_1},\hat{\theta_2}) = V[\hat{\theta_2}]/V[\hat{\theta_1}] \). Note that \( V[\hat{\theta_1}] \) is in the denominator of the fraction, that is, "\(\hat{\theta_1} \) is more efficient than \(\hat{\theta_2} \)" means "\(\hat{\theta_1} \) has lower variance than \(\hat{\theta_2} \)". We will see in the next section that this means something like "\(\hat{\theta_1} \) converges to its target parameter more quickly than \(\hat{\theta_2} \)". The adjective unbiased in the first sentence is important: because of the bias-variance tradeoff, we can always reduce the variance by increasing the bias, so the efficiency comparison only makes sense for unbiased estimators.
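A simulation can make this concrete. The sketch below (not from the text; sample size, repetitions, and seed invented) compares the sample mean and sample median as estimators of a normal mean; both are unbiased here by symmetry, and asymptotically \( \textrm{eff}(\textrm{median}, \textrm{mean}) = 2/\pi \approx 0.64 \):

```python
# Simulated relative efficiency of the sample median vs. the sample mean
# as estimators of a normal mean. Asymptotically eff(median, mean) = 2/pi.
# An invented example, not from the text.
import random
import statistics

random.seed(3)
n, reps = 25, 20000
means, medians = [], []
for _ in range(reps):
    ys = [random.gauss(0.0, 1.0) for _ in range(n)]
    means.append(statistics.fmean(ys))
    medians.append(statistics.median(ys))

eff = statistics.pvariance(means) / statistics.pvariance(medians)
print(round(eff, 2))   # noticeably below 1: the mean is the more efficient estimator
```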

Section 9.3: An estimator \( \hat{\theta} \) with target parameter \( \theta \) is said to be consistent if \( \hat{\theta}_n \) converges to \( \theta \) in probability. Note that we use \(\hat{\theta} \) to denote the general estimation rule and \(\hat{\theta}_n\) for the estimator based on \( n\) samples. The definition of convergence in probability is more complex than most of the definitions in this book, and must be studied (see link). In practice, the exercises usually have you showing convergence in probability through Theorem 9.1: an unbiased estimator is consistent if \( \lim_{n \rightarrow \infty} V[\hat{\theta}_n] = 0 \).
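Theorem 9.1 can be watched in action: \( \bar{Y}_n \) is unbiased with \( V[\bar{Y}_n] = \sigma^2/n \rightarrow 0 \), so it is consistent. A small simulation sketch (parameters and seed invented, not from the text):

```python
# The sample mean is unbiased with V[ybar_n] = sigma^2/n -> 0, hence
# consistent by Theorem 9.1. Watch the spread of ybar_n shrink as n grows.
# Parameters and seed are invented.
import random
import statistics

random.seed(4)
mu, sigma = 2.0, 3.0
for n in [10, 100, 1000]:
    ybars = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
             for _ in range(2000)]
    spread = statistics.pstdev(ybars)     # should be close to sigma/sqrt(n)
    print(n, round(spread, 3))
```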

This result allows the book to prove a version of the weak law of large numbers: the sample mean converges in probability to the distribution mean. Note that there really is something to prove: we defined the distribution mean to be a certain integral, and the sample means are random variables that have some distribution, which you may not always be able to compute!

Convergence in probability shares a number of properties with ordinary convergence of sequences of real numbers, which you studied in calculus, and these are listed in Theorem 9.2. There are results beyond this, notably Theorem 9.3, which says that if the distribution function of \(U_n\) converges to the standard normal distribution function, and \(W_n\) converges in probability to \(1\), then the distribution function of the quotient \( U_n/W_n \) also converges to the standard normal distribution function. This is a special case of Slutsky's Theorem. Theorem 9.3 can be used to show that the distribution function of \[ \frac{\bar{Y}_n - \mu} {S_n / \sqrt{n}}, \quad \text{where} \quad S_n^2 = \frac{1}{n-1} \sum_{i = 1}^n (Y_i - \bar{Y}_n)^2 \] converges to the standard normal distribution function. Notice that this extends the CLT as proved in the book by replacing the variance of the distribution with the sample variance.

Section 9.4: A statistic \( U = g(Y_1, \ldots, Y_n) \) is sufficient for the parameter \( \theta \) if the conditional distribution of \( Y_1, \ldots, Y_n \) given \( U \) does not depend on \( \theta \). This definition is meant to capture the "all the information" idea: if the conditional distribution does not depend on \( \theta \), there is no information about \( \theta \) left.

It might seem that applying this definition requires heavy manual labor along the lines of Chapter 6 to compute the conditional distribution, but the book gives a factorization criterion for the likelihood function that is equivalent to the definition above. In the exercises, the definition is rarely applied directly and instead the factorization criterion is used. The likelihood function is by definition \[ L(y_1, \ldots, y_n \mid \theta) = f_\theta(y_1, \ldots, y_n), \] where \(f_\theta\) is the density function. The notation emphasizes the fact that we are now allowing \(\theta\) to vary. The factorization criterion says that the statistic \(U \) is sufficient for estimation of the parameter \( \theta \) if and only if \[ L(y_1, \ldots, y_n \mid \theta) = g(u, \theta) \cdot h(y_1, \ldots, y_n), \] where \(g \) is a function only of \( u \) and \( \theta \) and \( h \) is a function only of the samples.
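To see the factorization criterion in action, consider the exponential case, along the lines of the book's examples: let \(Y_1, \ldots, Y_n\) be independent exponential samples with mean \( \theta \). Then \[ L(y_1, \ldots, y_n \mid \theta) = \prod_{i=1}^n \frac{1}{\theta} e^{-y_i/\theta} = \theta^{-n} e^{-u/\theta} \cdot 1, \quad \text{where} \quad u = \sum_{i=1}^n y_i, \] so taking \( g(u, \theta) = \theta^{-n} e^{-u/\theta} \) and \( h \equiv 1 \) shows that \( U = \sum_{i=1}^n Y_i \) is sufficient for \( \theta \).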

The chapter then goes on to discuss three methods of obtaining estimators: improving an unbiased estimator by conditioning on a sufficient statistic (the Rao-Blackwell Theorem), the method of moments, and the method of maximum likelihood.

Regarding the Rao-Blackwell Theorem, one might ask whether repeated application of the theorem would yield better estimators than a single application. The answer is basically "no". If we naively reapply the theorem using the same sufficient statistic \( U \), or some function of it, we get the same estimator as we had before. More generally, the sufficient statistics that we tend to obtain with the methods and examples in this book are usually minimal sufficient statistics, and using these as the \( U \) in the Rao-Blackwell theorem leads to a result which is in a sense optimal: a minimum-variance unbiased estimator (MVUE).

The method of moments tends to produce consistent estimators, but these estimators need not be functions of sufficient statistics. Further, the estimators produced may be biased and/or inefficient. However, the method is easy to use, and can be generalized to the situation where we have more sample moments than parameters: in this case we select parameters by minimizing (some measure of) the aggregate error between the sample and distribution moments. Although this idea may seem simple, this situation arises often in econometrics, and Hansen's work on this Generalized Method of Moments was recognized with a Nobel Prize.
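As a sketch of the Gamma case mentioned in the introduction (the true values \( \alpha = 2 \), \( \beta = 3 \), the sample size, and the seed are invented): matching the sample mean and second central sample moment to \( \alpha\beta \) and \( \alpha\beta^2 \) and solving gives the method-of-moments estimates:

```python
# Method-of-moments estimates for Gamma(alpha, beta), where the mean is
# alpha*beta and the variance is alpha*beta^2: match the sample mean and
# second central moment, then solve. True values alpha = 2, beta = 3.
import random
import statistics

random.seed(5)
ys = [random.gammavariate(2.0, 3.0) for _ in range(5000)]

ybar = statistics.fmean(ys)
m2 = statistics.fmean((y - ybar) ** 2 for y in ys)  # second central sample moment

beta_hat = m2 / ybar            # solves alpha*beta = ybar, alpha*beta^2 = m2
alpha_hat = ybar / beta_hat
print(round(alpha_hat, 2), round(beta_hat, 2))
```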

Maximum likelihood estimators have a few good properties that it helps to know when doing some of the problems. While MLEs need not be unbiased, they will (under weak hypotheses) be consistent. Further, there is an invariance property: if \(t\) is a function of the parameter \(\theta\), then the MLE \( \widehat{t(\theta)} \) of \(t(\theta)\) is \( \widehat{t(\theta)} = t(\hat{\theta}) \). It also follows from the factorization criterion that an MLE is always a function of a sufficient statistic. A final section discusses large-sample properties of maximum likelihood estimators.

Chapter 10: Hypothesis Testing

Suppose we have a drug that is meant to lower blood pressure, and we wish to demonstrate to the FDA that it is effective. What we do is assemble a large number of test subjects, divide them into a control group (patients who get a placebo) and a treatment group (patients who get our drug). The way we demonstrate efficacy is to define a null hypothesis, that the drug is ineffective, and that any difference in blood pressure between the groups is purely a coincidence. The alternative hypothesis is that the null hypothesis is false. Efficacy is demonstrated (in a probabilistic sense) by showing that, with the given probability model, the results of the experiment are very unlikely if the null hypothesis is true.

Generally speaking, when we do an experiment, we don't believe the null hypothesis. Instead, we believe the alternative, or research hypothesis. However, in analyzing the results of the experiment, we put on our statistician hats, and assume the null hypothesis when analyzing the data. If the data are very unlikely to arise from the probability model in which the null hypothesis is true, then we can reject the null hypothesis.

This assumes that we have a decent probability model of the data, which lacks only the correct selection of some parameters to tune it. If our model of the data is very bad, we might wind up rejecting the null hypothesis for that reason, rather than because the parameter values associated with the alternative hypothesis create a better model for the data. These remarks underscore the importance of finding good probability models; as remarked earlier, this course largely assumes that we have such models in hand.

The way we formalize "the null hypothesis makes the data unlikely" is to compute a test statistic and see whether it falls into a rejection region. In the example of our drug, the test statistic might be the difference in mean blood pressures between the two groups. The rejection region will depend on how unlikely we need the null hypothesis to be and the sample size (number of patients) we have. In doing this we are vulnerable, through bad luck or bad experimental procedure, to two types of errors: rejecting a true null hypothesis (a type I error) or failing to reject a false one (a type II error).

You should know that the four elements of a statistical test are
  1. Null hypothesis, \(H_0\)
  2. Alternative hypothesis, \(H_a\)
  3. A test statistic
  4. Rejection region

Section 10.3: Common Large-Sample Tests A very common situation is that we want to test a hypothesis about a parameter \(\theta\) using an estimator \( \hat{\theta} \) which has an approximately normal sampling distribution with mean \( \theta \) and standard deviation \( \sigma_{\hat{\theta}} \). Generally speaking, this will be so for the standard estimators \( \bar{Y} \) and \( \hat{p} \). In this section the four elements of this common statistical test are described. We have a null hypothesis, \( \theta = \theta_0 \), and an alternative hypothesis, say \( \theta > \theta_0 \). Our test statistic will be \[ Z = \frac{ \hat{\theta} - \theta_0 }{\sigma_{\hat{\theta}} } \] and the rejection region will be \( \{ z > z_\alpha \} \), which is equivalent to \( \{ \hat{\theta} > k \} \) for some value of \(k\). Here \( \alpha \) is the probability of a type I error. Small modifications to this scheme for lower-tailed or two-tailed tests are also described.
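A sketch of the upper-tailed version for a proportion (the counts and significance level below are invented):

```python
# Upper-tailed large-sample Z test for a proportion: H0: p = 0.5 vs
# Ha: p > 0.5 at level alpha = 0.05. The counts (560 successes in
# 1000 trials) are invented for illustration.
import math
from statistics import NormalDist

n, y = 1000, 560
p0, alpha = 0.5, 0.05
p_hat = y / n
sigma = math.sqrt(p0 * (1 - p0) / n)       # std. dev. of p-hat under H0
z = (p_hat - p0) / sigma
z_alpha = NormalDist().inv_cdf(1 - alpha)  # ~1.645
reject = z > z_alpha
print(round(z, 3), reject)
```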

Section 10.4: Type II error probabilities and sample size for \(Z\) tests Generally one cannot compute the probability of a type II error without assuming that the parameter(s) have some particular value(s). This means that one must specialize an alternative hypothesis like \( \mu > \mu_0 \) to \( \mu = \mu_a \). After doing this, it is possible to compute the probability of a type II error.

If one insists that particular probabilities of type I and II errors are to be achieved and specializes the alternative hypothesis as in the previous paragraph, it is possible to compute the sample size required to achieve this.
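Both calculations, the type II error probability for a specialized alternative and the sample size achieving target error probabilities, can be sketched numerically (all parameter values below are invented):

```python
# Type II error probability beta for an upper-tailed Z test of H0: mu = mu0,
# specialized to mu = mu_a, plus the sample size achieving given alpha and
# beta via n = ((z_alpha + z_beta) * sigma / (mu_a - mu0))^2. Values invented.
import math
from statistics import NormalDist

nd = NormalDist()
mu0, mu_a, sigma = 100.0, 103.0, 10.0
alpha, n = 0.05, 36

# Reject when ybar > k, where k = mu0 + z_alpha * sigma/sqrt(n).
z_alpha = nd.inv_cdf(1 - alpha)
k = mu0 + z_alpha * sigma / math.sqrt(n)
beta = nd.cdf((k - mu_a) / (sigma / math.sqrt(n)))   # P(accept H0 | mu = mu_a)

# Sample size for alpha = beta = 0.05:
z_beta = nd.inv_cdf(1 - 0.05)
n_req = math.ceil(((z_alpha + z_beta) * sigma / (mu_a - mu0)) ** 2)
print(round(beta, 3), n_req)
```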

Section 10.5: Relationships between hypothesis-testing and confidence intervals There is a very close relationship between hypothesis testing and confidence intervals. You should see this immediately in the similarity of the formula for the rejection region of the test and for the endpoints of the confidence interval.

Section 10.6: \(p\)-values Another way to report the result of a test is to give the \(p\)-value associated to the test statistic. Intuitively, it is a way of saying how unlikely the result is, assuming the null hypothesis. There are a number of ways to define \(p\)-value, and Definition 10.2 must be memorized.
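For an upper-tailed \(Z\) test, the \(p\)-value is the area under the standard normal curve beyond the observed statistic. A sketch with an invented observed value:

```python
# p-value for an upper-tailed Z test: the probability, computed under H0,
# of a test statistic at least as extreme as the one observed.
# The observed value z = 2.1 is invented for illustration.
from statistics import NormalDist

z_observed = 2.1
p_value = 1 - NormalDist().cdf(z_observed)   # upper-tail area
print(round(p_value, 4))   # 0.0179: reject H0 at alpha = 0.05 but not at 0.01
```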

Sections 10.7-9: Standard Hypothesis Tests Because of the relationship between hypothesis testing and confidence intervals, Sections 10.7-9 are almost a repeat of Sections 8.6, 8.8, and 8.9.

Section 10.10: Power of Tests and the Neyman-Pearson Lemma The power of a test is (almost by definition) \(1 - \) the probability of a type II error. This of course depends on the value of the parameter. High power is a desirable characteristic of a statistical test, but of course there is a trade-off against other considerations. Recall that we typically test a null hypothesis \(\theta = \theta_0\) against an alternative hypothesis which includes many possible values of \( \theta \), say \( \theta > \theta_0 \). A hypothesis that uniquely specifies all parameter values is called simple, while one which includes many possible values is called composite. The Neyman-Pearson lemma says, roughly speaking, that the most powerful test of simple hypotheses is a likelihood ratio test. Sometimes a test is most powerful for all values of the parameter in a certain range; if so, the test is called uniformly most powerful. In the examples it is sometimes possible to construct a uniformly most powerful test in the one-tailed cases. Example 10.23 is worth studying in this regard.

Section 10.11: Likelihood Ratio Tests Sometimes we want to perform a test of composite hypotheses. One way of doing this is a likelihood ratio test, where we compute a ratio of likelihoods by maximizing (separately) the numerator and denominator over the regions in parameter space defined by the hypotheses. The definition of this likelihood ratio is in the grey box on page 550. (It gets a grey box, so you can tell it is important.) This likelihood ratio \( \lambda \) has the property that in the large-sample case \( -2 \log ( \lambda ) \) has approximately the \( \chi^2 \) distribution. (This is stated in more detail in Theorem 10.2.)