Chapter 1.

Descriptive Statistics.


We will analyze both the Florida data file and the cars data file.


Florida data file

First, you need to have the Florida data file in your directory. To simplify the notation, do:

> florida <- read.table("/home/arcones/MySwork/florida.txt",header=T,sep=",")
> county <- florida[,1]
> bush <- florida[,2]
> gore <- florida[,3]
> browne <- florida[,4]
> nader <- florida[,5]
> harris <- florida[,6]
> hagelin <- florida[,7]
To get the total for each candidate do:
> bush.tot <- sum(bush)
> gore.tot <- sum(gore)
> browne.tot <- sum(browne)
> nader.tot <- sum(nader)
> harris.tot <- sum(harris)
> hagelin.tot <- sum(hagelin)
> print(c(bush.tot, gore.tot,browne.tot,nader.tot,harris.tot,hagelin.tot))
[1] 2912790 2912253   16415   97488     562    2281

Bush is the candidate with the most votes, narrowly ahead of Gore.
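
To see how close the two leading totals are, here is a small sketch in current R syntax (assignment written as <-). The numbers are copied from the printed totals above; the names tot, share and margin are illustrative and are not part of the original session.

```r
# Statewide totals copied from the printed output above.
tot <- c(bush = 2912790, gore = 2912253, browne = 16415,
         nader = 97488, harris = 562, hagelin = 2281)

grand.total <- sum(tot)                  # votes for these six candidates
share <- tot / grand.total               # statewide proportion per candidate
margin <- tot[["bush"]] - tot[["gore"]]  # Bush's lead over Gore, in votes

print(round(share, 5))
print(margin)
print(names(tot)[which.max(tot)])        # candidate with the largest total
```

The margin is tiny compared with the grand total, which is why the statewide shares of the two leading candidates are nearly equal.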

To get each candidate's share of the vote in each county (as a proportion of the county total for the six candidates), do:

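# stack the six vote vectors and reshape them into a county-by-candidate matrix
flor <- c(bush,gore,browne,nader,harris,hagelin)
florida2 <- matrix(flor,ncol=6,byrow=F)
# column sums: the statewide total for each candidate
apply(florida2,2,sum)
# divide each row by its row sum to get the proportions within each county
flor.perc <- florida2/apply(florida2,1,sum)
bush.perc <- flor.perc[,1]
gore.perc <- flor.perc[,2]
browne.perc <- flor.perc[,3]
nader.perc <- flor.perc[,4]
harris.perc <- flor.perc[,5]
hagelin.perc <- flor.perc[,6]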
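
The division of a matrix by its vector of row sums works because the vector is recycled down the columns. The following sketch in current R syntax illustrates this on a small matrix. The counts are hypothetical and chosen only for illustration; sweep() and prop.table() are standard R functions shown here as alternatives, not commands used in the original session.

```r
# Hypothetical vote counts for three counties and three candidates
# (illustrative numbers only, not taken from the Florida file).
votes <- matrix(c(50, 30, 20,
                  10, 60, 10,
                   5,  5, 10),
                ncol = 3, byrow = TRUE,
                dimnames = list(c("A", "B", "C"), c("x", "y", "z")))

# Dividing a matrix by a vector recycles the vector down the columns,
# so element [i, j] is divided by the i-th row total.
perc1 <- votes / apply(votes, 1, sum)

# The same result with helpers that make the intent explicit.
perc2 <- sweep(votes, 1, rowSums(votes), "/")
perc3 <- prop.table(votes, 1)

print(perc1)
print(rowSums(perc1))  # each row sums to 1
```

All three forms give the same matrix of within-county proportions.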

To obtain the main numerical measures that describe the data, do:

> summary(bush.perc)
   Min. 1st Qu. Median   Mean 3rd Qu.   Max. 
 0.3099  0.5014  0.549 0.5533  0.6156 0.7404

> apply(cbind(bush.perc,gore.perc,browne.perc,nader.perc,harris.perc,hagelin.perc),2,summary)
        bush.perc gore.perc browne.perc nader.perc harris.perc hagelin.perc 
   Min.    0.3099    0.2409   0.0005908   0.006563  0.00000000    0.0000000
1st Qu.    0.5014    0.3679   0.0022170   0.012220  0.00002718    0.0002116
 Median    0.5490    0.4297   0.0029320   0.014540  0.00007643    0.0003131
   Mean    0.5533    0.4271   0.0032780   0.015840  0.00011430    0.0003586
3rd Qu.    0.6156    0.4759   0.0036290   0.018900  0.00013740    0.0004339
   Max.    0.7404    0.6753   0.0097010   0.037770  0.00102000    0.0012670
> var(cbind(bush.perc,gore.perc,browne.perc,nader.perc,harris.perc,hagelin.perc))
                  bush.perc      gore.perc    browne.perc     nader.perc    harris.perc   hagelin.perc 
   bush.perc  8.539950e-003 -8.400036e-003  1.290400e-005 -1.517705e-004 -3.925380e-007 -6.553950e-007
   gore.perc -8.400036e-003  8.303500e-003 -1.847346e-005  1.141864e-004  5.515140e-007  2.716214e-007
 browne.perc  1.290400e-005 -1.847346e-005  2.678327e-006  2.888717e-006 -3.244831e-008  3.486386e-008
  nader.perc -1.517705e-004  1.141864e-004  2.888717e-006  3.454893e-005 -1.491483e-007  2.955648e-007
 harris.perc -3.925380e-007  5.515140e-007 -3.244831e-008 -1.491483e-007  2.746393e-008 -4.843283e-009
hagelin.perc -6.553950e-007  2.716214e-007  3.486386e-008  2.955648e-007 -4.843283e-009  5.818825e-008
> cor(cbind(bush.perc,gore.perc,browne.perc,nader.perc,harris.perc,hagelin.perc))
               bush.perc   gore.perc browne.perc nader.perc harris.perc hagelin.perc 
   bush.perc  1.00000000 -0.99752291  0.08532274 -0.2794102 -0.02563142  -0.02940071
   gore.perc -0.99752291  1.00000000 -0.12387561  0.2131898  0.03652117   0.01235708
 browne.perc  0.08532274 -0.12387561  1.00000000  0.3003004 -0.11964071   0.08831329
  nader.perc -0.27941021  0.21318982  0.30030041  1.0000000 -0.15311572   0.20845761
 harris.perc -0.02563142  0.03652117 -0.11964071 -0.1531157  1.00000000  -0.12115489
hagelin.perc -0.02940071  0.01235708  0.08831329  0.2084576 -0.12115489   1.00000000
> bush.perc+gore.perc
 [1] 0.9539692 0.9909598 0.9825326 0.9867488 0.9762802 0.9851772 0.9901536 0.9758552 0.9719790 0.9863340
[11] 0.9823255 0.9786459 0.9767922 0.9764680 0.9851897 0.9823037 0.9814883 0.9769865 0.9884133 0.9720149
[21] 0.9788312 0.9813531 0.9862315 0.9848338 0.9853195 0.9745769 0.9819735 0.9752782 0.9835796 0.9779964
[31] 0.9888793 0.9834284 0.9863618 0.9806989 0.9769704 0.9776336 0.9700807 0.9856419 0.9879203 0.9747684
[41] 0.9754682 0.9794698 0.9898825 0.9621833 0.9863176 0.9812814 0.9839829 0.9826600 0.9806902 0.9848641
[51] 0.9724893 0.9704549 0.9851655 0.9808054 0.9826009 0.9713188 0.9815126 0.9761613 0.9800933 0.9835947
[61] 0.9806803 0.9902525 0.9870644 0.9814377 0.9784392 0.9809036 0.9839626 0.9713940
We see that the distributions of the vote shares of the four minor candidates are skewed to the right: for each of them, the distance from the median to the maximum is bigger than the distance from the median to the minimum, and the mean is larger than the median.

There is an extremely strong negative correlation (about -0.998) between bush.perc and gore.perc: in counties where Bush does better, Gore does worse, and vice versa. This is because the combined share of these two candidates stays roughly constant, between about 0.95 and 0.99 in every county. The correlation between gore.perc and nader.perc is positive but small (about 0.21), and the correlation between bush.perc and nader.perc is negative but small (about -0.28). Scatter plots of some pairs of variables follow:

> plot(gore.perc,bush.perc)
> plot(gore.perc,nader.perc)
> plot(bush.perc,nader.perc)
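
The link between the near-constant sum bush.perc+gore.perc and the correlation close to -1 can be checked from the printed covariance matrix alone. The sketch below, in current R syntax, uses three entries copied from that matrix; the variable names are illustrative and not from the original session.

```r
# Entries copied from the covariance matrix printed above.
var.bush      <- 8.539950e-03
var.gore      <- 8.303500e-03
cov.bush.gore <- -8.400036e-03

# Correlation is the covariance scaled by both standard deviations.
r <- cov.bush.gore / sqrt(var.bush * var.gore)

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y): a strongly negative
# covariance almost cancels the two variances.
var.sum <- var.bush + var.gore + 2 * cov.bush.gore
sd.sum  <- sqrt(var.sum)

print(r)
print(var.sum)
print(sd.sum)
```

The standard deviation of the sum comes out far smaller than the standard deviation of either share (each about 0.09), which is another way of saying that the sum hardly varies from county to county.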


Cars data file

First, you need to have the cars data file in your directory and read it into an object named cars, as was done for the Florida file. To simplify the notation, do:

> weig <- cars[,1]
> disp <- cars[,2]
> mile <- cars[,3]
To obtain the main numerical measures that describe the data, do:
> summary(mile)
 Min. 1st Qu. Median  Mean 3rd Qu. Max. 
   18      21     23 24.58      27   37
> summary(weig)
 Min. 1st Qu. Median Mean 3rd Qu. Max. 
 1845    2571   2885 2901    3231 3855
> summary(disp)
 Min. 1st Qu. Median Mean 3rd Qu. Max. 
   73   113.8  144.5  152     180  305

> var(cbind(weig,disp,mile))
            weig        disp        mile 
weig  245883.192  21573.3475 -2014.47740
disp   21573.347   2933.4042  -179.89407
mile   -2014.477   -179.8941    22.95904
> cor(cbind(weig,disp,mile))
           weig       disp       mile 
weig  1.0000000  0.8032804 -0.8478541
disp  0.8032804  1.0000000 -0.6931928
mile -0.8478541 -0.6931928  1.0000000
We see that the distribution of the mileage is skewed to the right: the mean is above the median, and there are some cars with very high mileage. The distribution of the weights of the cars is roughly symmetric. The displacement of the cars also has a distribution skewed to the right.
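
One rough way to quantify these remarks is to compare the two tails of each distribution using only the printed summaries. The sketch below, in current R syntax, copies the minimum, median, mean and maximum from the summary() output above; the tail ratio is an informal device introduced here for illustration, not a statistic used in the original session.

```r
# Summary entries copied from the summary() output above.
s <- rbind(mile = c(min = 18,   median = 23,    mean = 24.58, max = 37),
           weig = c(min = 1845, median = 2885,  mean = 2901,  max = 3855),
           disp = c(min = 73,   median = 144.5, mean = 152,   max = 305))

# Ratio of the upper tail length to the lower tail length.
# Values well above 1 suggest right skew; values near 1 suggest symmetry.
tail.ratio <- (s[, "max"] - s[, "median"]) / (s[, "median"] - s[, "min"])

# In a right-skewed sample the mean usually exceeds the median.
mean.minus.median <- s[, "mean"] - s[, "median"]

print(round(tail.ratio, 2))
print(mean.minus.median)
```

The ratio is well above 1 for mile and disp and close to 1 for weig, in line with the description above.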

There is a high positive correlation between car weight and engine displacement (about 0.80) and a high negative correlation between weight and mileage (about -0.85). These relations are natural: heavier cars tend to have larger engines and to use more fuel.

Running the following program, we get tables:

***********c1****
table(mile)
table(weig)
bre.m <- 10+5*c(1:6)
table(cut(mile,breaks=bre.m))
table(cut(weig,breaks=5))/length(weig)
table(cut(disp,breaks=pretty(disp)))
hist(mile,plot=F)
hist(weig,nclass=7,plot=F,probability=T)
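bre.d <- pretty(disp)   # assumed definition: bre.d is not set in this listing, but the breaks printed in the output match pretty(disp)
hist(disp,breaks=bre.d,plot=F)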
table(cut(mile,breaks=pretty(mile)),cut(weig,breaks=pretty(weig)))
table(cut(mile,breaks=pretty(mile)),cut(disp,breaks=pretty(disp)))
table(cut(disp,breaks=pretty(disp)),cut(weig,breaks=pretty(weig)))
hist2d(mile,weig)
hist2d(mile,disp,xbreaks=bre.m,ybreaks=pretty(disp))
hist2d(disp,weig,nxbins=4,nybins=5)
*******************
This is the output of the program:
 
> table(mile)
 18 19 20 21 22 23 24 25 26 27 28 29 30 32 33 34 35 37 
  4  3  5  6  5  8  4  3  5  4  2  1  1  2  4  1  1  1
> table(weig)
 1845 1900 2075 2170 2260 2275 2285 2295 2330 2345 2350 2390 2440 2485 2560 2575 2640 2645 2655 2670 2695 
    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
 2710 2745 2750 2765 2775 2780 2840 2880 2885 2920 2935 2975 2985 3065 3110 3145 3185 3190 3195 3200 3220 
    1    1    1    1    1    1    1    1    2    3    1    1    1    1    1    1    1    1    1    1    1
 3265 3310 3320 3325 3415 3450 3480 3610 3665 3690 3735 3850 3855 
    1    1    1    1    1    1    2    1    1    1    2    1    1
> 
> bre.m <- 10+5*c(1:6)
> table(cut(mile,breaks=bre.m))
 15+ thru 20 20+ thru 25 25+ thru 30 30+ thru 35 35+ thru 40 
          12          26          13           8           1
> table(cut(weig,breaks=5))/length(weig)
 1824.90+ thru 2234.94 2234.94+ thru 2644.98 2644.98+ thru 3055.02 3055.02+ thru 3465.06 
            0.06666667             0.2166667             0.3333333             0.2333333
 3465.06+ thru 3875.10 
                  0.15
> table(cut(disp,breaks=pretty(disp)))
  50+ thru 100 100+ thru 150 150+ thru 200 200+ thru 250 250+ thru 300 300+ thru 350 
            10            24            18             4             0             4
> 
> hist(mile,plot=F)
$breaks:
 [1] 18 20 22 24 26 28 30 32 34 36 38

$counts:
 [1] 12 11 12  8  6  2  2  5  1  1

> hist(weig,nclass=7,plot=F,probability=T)
$breaks:
 [1] 1800 2000 2200 2400 2600 2800 3000 3200 3400 3600 3800 4000

$counts:
 [1] 0.0001666667 0.0001666667 0.0006666667 0.0003333333 0.0009166667 0.0008333333 0.0005833333 0.0004166667
 [9] 0.0003333333 0.0004166667 0.0001666667

> hist(disp,breaks=bre.d,plot=F)
$breaks:
[1]  50 100 150 200 250 300 350

$counts:
[1] 10 24 18  4  0  4

> 
> table(cut(mile,breaks=pretty(mile)),cut(weig,breaks=pretty(weig)))
            1500+ thru 2000 2000+ thru 2500 2500+ thru 3000 3000+ thru 3500 3500+ thru 4000 
15+ thru 20               0               0               0               6               6
20+ thru 25               0               1              14              10               1
25+ thru 30               0               5               8               0               0
30+ thru 35               1               6               1               0               0
35+ thru 40               1               0               0               0               0
> table(cut(mile,breaks=pretty(mile)),cut(disp,breaks=pretty(disp)))
             50+ thru 100 100+ thru 150 150+ thru 200 200+ thru 250 250+ thru 300 300+ thru 350 
15+ thru 20             0             3             4             1             0             4
20+ thru 25             0            10            13             3             0             0
25+ thru 30             3             9             1             0             0             0
30+ thru 35             6             2             0             0             0             0
35+ thru 40             1             0             0             0             0             0
> table(cut(disp,breaks=pretty(disp)),cut(weig,breaks=pretty(weig)))
              1500+ thru 2000 2000+ thru 2500 2500+ thru 3000 3000+ thru 3500 3500+ thru 4000 
 50+ thru 100               2               7               1               0               0
100+ thru 150               0               5              16               2               1
150+ thru 200               0               0               6              10               2
200+ thru 250               0               0               0               2               2
250+ thru 300               0               0               0               0               0
300+ thru 350               0               0               0               2               2
> hist2d(mile,weig)
$x:
[1] 17.5 22.5 27.5 32.5 37.5

$y:
[1] 1750 2250 2750 3250 3750

$z:
         1500 to 2000 2000 to 2500 2500 to 3000 3000 to 3500 3500 to 4000 
15 to 20            0            0            0            2            5
20 to 25            0            0           13           13            2
25 to 30            0            6            8            1            0
30 to 35            1            5            2            0            0
35 to 40            1            1            0            0            0

$xbreaks:
[1] 15 20 25 30 35 40

$ybreaks:
[1] 1500 2000 2500 3000 3500 4000

> hist2d(mile,disp,xbreaks=bre.m,ybreaks=pretty(disp))
$x:
[1] 17.5 22.5 27.5 32.5 37.5

$y:
[1]  75 125 175 225 275 325

$z:
          50 to 100 100 to 150 150 to 200 200 to 250 250 to 300 300 to 350 
15 to 20          0          1          3          1          0          2
20 to 25          0         10         13          3          0          2
25 to 30          3         10          2          0          0          0
30 to 35          5          3          0          0          0          0
35 to 40          2          0          0          0          0          0

$xbreaks:
[1] 15 20 25 30 35 40

$ybreaks:
[1]  50 100 150 200 250 300 350

> hist2d(disp,weig,nxbins=4,nybins=5)
$x:
[1]  75 125 175 225 275 325

$y:
[1] 1750 2250 2750 3250 3750

$z:
           1500 to 2000 2000 to 2500 2500 to 3000 3000 to 3500 3500 to 4000 
 50 to 100            2            7            1            0            0
100 to 150            0            5           16            2            1
150 to 200            0            0            6           10            2
200 to 250            0            0            0            2            2
250 to 300            0            0            0            0            0
300 to 350            0            0            0            2            2

$xbreaks:
[1]  50 100 150 200 250 300 350

$ybreaks:
[1] 1500 2000 2500 3000 3500 4000
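
The binned table for mile can be reproduced without the data file, because the 60 mileage values can be rebuilt from the table(mile) output above. The sketch below is in current R syntax; R labels the intervals differently from S-PLUS, for example (15,20] instead of "15+ thru 20". The names vals, counts and binned are illustrative.

```r
# Rebuild the 60 mileage values from the table(mile) output above.
vals   <- c(18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 33, 34, 35, 37)
counts <- c( 4,  3,  5,  6,  5,  8,  4,  3,  5,  4,  2,  1,  1,  2,  4,  1,  1,  1)
mile <- rep(vals, times = counts)

bre.m <- 10 + 5 * (1:6)   # break points 15, 20, 25, 30, 35, 40

# By default cut() uses intervals closed on the right, e.g. (15,20],
# which corresponds to the "15+ thru 20" labels in the S-PLUS output.
binned <- table(cut(mile, breaks = bre.m))
print(binned)

# Relative frequencies, as in table(...)/length(...).
print(binned / length(mile))
```

The counts agree with the table(cut(mile, ...)) output shown above, and the rebuilt vector has the same median and mean as reported by summary(mile).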

A more detailed description of the distributions can be obtained with graphs. For example, to do a dotplot:
> dotplot(mile)

To do a boxplot:
> boxplot(mile)

To do a stem-and-leaf display:
> stem(mile,scale=-1)

N = 60   Median = 23
Quartiles = 21, 27

Decimal point is 1 place to the right of the colon

   1 : 8888999
   2 : 00000111111
   2 : 2222233333333
   2 : 4444555
   2 : 666667777
   2 : 889
   3 : 0
   3 : 223333
   3 : 45
   3 : 7
To do a histogram:
> hist(mile)

The general command is hist(x, nclass = , breaks = , plot = T, probability = F)

For example,

> hist(mile,breaks=bre.m, plot = F, probability = F)
$breaks:
[1] 15 20 25 30 35 40

$counts:
[1] 12 26 13  8  1
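
With probability = T the heights are no longer counts but densities, that is, counts divided by the sample size times the bin width. This explains the earlier output for weig, where the first value 0.0001666667 equals 2/(60*200). The sketch below, in current R syntax, checks the relation on the mileage data rebuilt from the table(mile) output; in R the returned list has separate counts and density components. The name dens is illustrative.

```r
# Mileage values rebuilt from the table(mile) output earlier in this chapter.
vals   <- c(18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 33, 34, 35, 37)
counts <- c( 4,  3,  5,  6,  5,  8,  4,  3,  5,  4,  2,  1,  1,  2,  4,  1,  1,  1)
mile <- rep(vals, times = counts)
bre.m <- 10 + 5 * (1:6)

# Compute the histogram without drawing it.
h <- hist(mile, breaks = bre.m, plot = FALSE)
print(h$counts)

# Density = count / (n * bin width), so the bar areas add up to 1.
dens <- h$counts / (length(mile) * diff(h$breaks))
print(dens)
print(sum(dens * diff(h$breaks)))
```

The hand-computed densities agree with the density component that R stores in the histogram object.
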
To do a barplot:
> barplot(mile)

With two or more variables, it is possible to draw scatter plots:
> plot(weig,disp)
> plot(weig,mile)
> plot(disp,mile)

Comments to: Miguel A. Arcones