Chapter 1.

Introduction to Nonparametric Statistics.


The classical statistical methods (the t test, least squares regression, etc.) assume that the data consist of i.i.d. random variables from a normal distribution. If the data are far from normal, these methods can be very inaccurate. For example, t confidence intervals do not attain the nominal 1-alpha coverage level when the data do not come from a normal distribution.

These classical methods rely on knowing the type of distribution that generated the data. When we know that the data come from a finite-dimensional family of distributions, indexed by an unknown parameter, we can use this knowledge to find statistical methods that are optimal under the imposed conditions. Usually, we are interested in estimating a function of the unknown parameter. For example, if the data come from a normal distribution with unknown mean mu and unknown standard deviation sigma, the sample mean is the natural estimator of mu. These methods are called parametric methods.

The problem with parametric statistical methods is that they impose rigid assumptions on the underlying distribution of the data. If these assumptions are not satisfied, the inferences can be very inaccurate.

In contrast, nonparametric methods apply to a large class of distributions (they make almost no assumptions on the model), so they are more flexible. In many cases, these methods are distribution free: the distribution of the statistic used does not depend on the distribution of the data. This makes it possible to tabulate these distributions once and for all.
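As an illustration of the distribution-free idea, consider the sign statistic S, the number of observations above the null median. For any continuous distribution with that median, S has a Binomial(n, 1/2) distribution, so its null distribution can be tabulated once and used for any data. The following sketch (not one of the handout's programs, written in current R syntax) checks this by simulation for normal and Cauchy data:
***************
# Sketch (not part of the handout's programs): the sign statistic
# S = number of observations above the null median has a Binomial(n, 1/2)
# distribution for ANY continuous distribution with that median.
set.seed(1)                                     # seed chosen only for reproducibility
n <- 10
N <- 10000
S.normal <- replicate(N, sum(rnorm(n) > 0))     # N(0,1) data, median 0
S.cauchy <- replicate(N, sum(rcauchy(n) > 0))   # Cauchy data, median 0
# Both empirical distributions agree with the Binomial(n, 1/2) probabilities:
round(table(factor(S.normal, levels = 0:n)) / N, 3)
round(table(factor(S.cauchy, levels = 0:n)) / N, 3)
round(dbinom(0:n, n, 0.5), 3)
***************
The two simulated distributions agree (up to simulation error) with the Binomial(10, 1/2) probabilities, even though the Cauchy distribution does not even have a mean.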

Ideally, we would like methods that, for normal data, do not perform quite as well as the optimal classical methods but come close; and that, for data far from normal, perform much better than the classical methods.

In this handout, we see how t confidence intervals fail to attain the required level for non-normal distributions. To do so, we run simulations and find the proportion of confidence intervals containing the true mean; this proportion should be roughly 1-alpha. The following program generates N=10,000 confidence intervals for the mean, each of the form xbar +/- talpha*s/sqrt(n), where talpha is the .975 quantile of the t distribution with n-1 degrees of freedom and s is the sample standard deviation. Each confidence interval is obtained by taking a random sample of size n=10 from a normal distribution with mean 1 and standard deviation 3. The proportion of confidence intervals containing the true mean is stored in the object "prop", which should be close to 1-alpha.

************1a************
# The following program finds N confidence intervals using the 
# t-distribution. Each sample is taken from a normal distribution
# with mean mu and standard deviation sigma. The sample size is n=10.
# prop= proportion of intervals containing the true mean should be
# close to .95
rm (c1,c2,c3,con1,con2,con,prop)  
n_10
N_10000
mu_1
sigma_3
val_0
talpha_qt(.975,df=n-1)
con1_1
con2_1
for(i in 1:N) 
{
x1_rnorm(n,mean=mu,sd=sigma)
con1[i]_mean(x1)-talpha*sqrt(var(x1))/sqrt(n)
con2[i]_mean(x1)+talpha*sqrt(var(x1))/sqrt(n)
}
con_cbind(con1,con2)
c1_(sign(mu-con1)+1)/2
c2_(sign(con2-mu)+1)/2
c3_c1*c2
prop_sum(c3)/length(c3)
*********************************
If we run this program, we get that approximately 95% of the confidence intervals contain the true mean:
> prop
[1] 0.949 
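For reference, the same coverage experiment can be written more compactly in current R syntax using the built-in t.test function (the programs in this handout use the older S-PLUS underscore assignment; this compact version is not part of the handout):
***************
# Compact R sketch of program 1a: coverage of the nominal 95% t interval
# for samples of size n=10 from a normal distribution with mean 1 and sd 3.
set.seed(1)                      # seed chosen only for reproducibility
n <- 10; N <- 10000; mu <- 1; sigma <- 3
covered <- replicate(N, {
  ci <- t.test(rnorm(n, mean = mu, sd = sigma), conf.level = 0.95)$conf.int
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)                    # should be close to .95
***************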
Next, we find 10,000 t confidence intervals for an exponential distribution with mean one:
***********1b**********************
# The following program finds N confidence intervals using the 
# t-distribution. Each sample is taken from an exponential distribution
# with mean one. The sample size is n=10.
# prop= proportion of intervals containing the true mean should be
# close to .95, but it is around .90
rm (c1,c2,c3,con1,con2,con,prop)
n_10
N_10000
talpha_qt(.975,df=n-1)
con1_1
con2_1
for(i in 1:N) 
{
x1_rexp(n)
con1[i]_mean(x1)-talpha*sqrt(var(x1))/sqrt(n)
con2[i]_mean(x1)+talpha*sqrt(var(x1))/sqrt(n)
}
con_cbind(con1,con2)
c1_(sign(1-con1)+1)/2
c2_(sign(con2-1)+1)/2
c3_c1*c2
prop_sum(c3)/length(c3)    
***************
In this case, we get a smaller proportion of confidence intervals containing the true mean: about .90 instead of the nominal .95.
> prop
[1] 0.8999
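The undercoverage reflects the skewness of the exponential distribution: with samples of size 10, the sampling distribution of the t statistic is noticeably asymmetric, so the symmetric t quantiles are not the right cutoffs. A short check of this (not part of the handout, written in current R syntax):
***************
# Compare the simulated t statistic for exponential samples of size 10
# with the symmetric quantiles of the t distribution with 9 degrees of freedom.
set.seed(1)                                # seed chosen only for reproducibility
n <- 10; N <- 10000
tstat <- replicate(N, {
  x <- rexp(n)                             # true mean is 1
  (mean(x) - 1) / (sd(x) / sqrt(n))
})
quantile(tstat, c(.025, .975))             # clearly asymmetric
qt(c(.025, .975), df = n - 1)              # -2.26 and 2.26
***************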
Next, we find 10,000 t confidence intervals for a symmetric stable distribution with index=1.1, whose mean is 0:
************1c*********************
# The following program finds N confidence intervals using the 
# t-distribution. Each sample is taken from a symmetric stable distribution
# with index=1.1, whose true mean is 0. The sample size is n=10.
# prop= proportion of intervals containing the true mean should be
# close to .95. But it is not: it is about 0.98
rm (c1,c2,c3,con1,con2,con,prop)
n_10
N_10000
talpha_qt(.975,df=n-1)
con1_1
con2_1
for(i in 1:N) 
{
x1_rstab(n,index=1.1,skew=0)
con1[i]_mean(x1)-talpha*sqrt(var(x1))/sqrt(n)
con2[i]_mean(x1)+talpha*sqrt(var(x1))/sqrt(n)
}
con_cbind(con1,con2)
c1_(sign(-con1)+1)/2
c2_(sign(con2)+1)/2
c3_c1*c2
prop_sum(c3)/length(c3)
***************
In this case, we get a larger proportion of confidence intervals containing the true mean: about .98 instead of the nominal .95. Again, the t intervals do not have the stated level.
> prop
[1] 0.9767
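A note on rstab: it is an S-PLUS function and is not available in base R. One possible way to reproduce this experiment in R (an assumption on our part, not part of the handout) is the rstable generator from the stabledist package:
***************
# Possible R replacement for the rstab call in program 1c.
# Assumes the stabledist package is installed (install.packages("stabledist")).
library(stabledist)
x1 <- rstable(10, alpha = 1.1, beta = 0)   # beta = 0 gives a symmetric stable law
***************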

Comments to: Miguel A. Arcones