Functions of categorical variables


[cut | table | tapply | sapply | split]


A categorical variable is one that takes on only a small number of values, each representing a different level of some measurement.

CUT

The function cut creates a categorical variable from a continuous variable by assigning the number i to each value that falls in the i-th interval. You can specify either the number of equal-width intervals or the cut points. To divide x into m intervals of equal width, use cut(x,m). To use the intervals determined by a1, a2, ..., am, use cut(x, c(a1, a2, ..., am)). Here a1 must be smaller than the minimum of x and am must be at least as large as the maximum of x; values that fall outside the intervals become NA. To get intervals that contain approximately the same number of points, take the cut points from the quantile function: cut(x,c(0,quantile(x,(1:m)/m))). This form assumes that all values of x are greater than 0.

By default the intervals are closed on the right, so a value equal to a cut point falls in the interval below it. To use intervals closed on the left instead, use cut(x, c(....), left.include=T, inc=T).

You can choose the labels yourself: cut(x, breaks=c(...),labels=c(".",....,".")). You can also rename the levels afterwards: levels(x3) <- c(".",....,".").

The function pretty creates convenient break points for a categorical variable.

For example:

> x <- c(6,3,1,5,2,4,3,4,5,2)
> x1 <- cut(x,3)
> x1
 [1] 3 2 1 3 1 2 2 2 3 1
attr(, "levels"):
[1] "0.95+ thru 2.65" "2.65+ thru 4.35" "4.35+ thru 6.05"
attr(, "class"):
[1] "category"  
> x2 <- cut(x, c(0,4,7))
> x2
 [1] 2 1 1 2 1 1 1 1 2 1
attr(, "levels"):
[1] "0+ thru 4" "4+ thru 7"
attr(, "class"):
[1] "category"  
> x3 <- cut(x, c(0,quantile(x,(1:3)/3)))
> levels(x3) <- c("group 1","group 2","group 3")
> x3
 [1] 3 1 1 3 1 2 1 2 3 1
attr(, "levels"):
[1] "group 1" "group 2" "group 3"
attr(, "class"):
[1] "category" 
> x4 <- cut(x, c(1,5,10), left.include=T, inc=T)
> x4
 [1] 2 1 1 2 1 1 1 1 2 1
attr(, "levels"):
[1] "1 thru  5-" "5 thru 10-"
attr(, "class"):
[1] "category"  
> x5 <- cut(age, breaks=c(0,2,4,7),labels=c("small","medium","large"))
> x5
[1] 1 1 1 1 1 2
attr(, "levels"):
[1] "small"  "medium" "large"
attr(, "class"):
[1] "category"
> pretty(x)
[1] 1 2 3 4 5 6
> cut(x,pretty(x))
 [1]  5  2 NA  4  1  3  2  3  4  1
attr(, "levels"):
[1] "1+ thru 2" "2+ thru 3" "3+ thru 4" "4+ thru 5" "5+ thru 6"
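
The session above is from S-PLUS. As a rough sketch only, the same calls might look as follows in current R, assuming that cut() there returns a factor, that left-closed intervals are requested with right = FALSE, and that the sample minimum (rather than 0) is used as the first quantile break. The variable names follow the examples above.

```r
x <- c(6, 3, 1, 5, 2, 4, 3, 4, 5, 2)

# Three intervals of equal width; as.integer() recovers the interval numbers.
x1 <- cut(x, 3)
as.integer(x1)    # 3 2 1 3 1 2 2 2 3 1

# Explicit cut points: intervals (0,4] and (4,7].
x2 <- cut(x, c(0, 4, 7))

# Left-closed intervals [1,5) and [5,10] (assumed R spelling of left.include).
x4 <- cut(x, c(1, 5, 10), right = FALSE, include.lowest = TRUE)

# Roughly equal counts per interval; include.lowest keeps min(x) in group 1.
x3 <- cut(x, quantile(x, (0:3) / 3), include.lowest = TRUE,
          labels = c("group 1", "group 2", "group 3"))
table(x3)
```

The interval numbers agree with x1, x2, x3 and x4 above; only the printed form of the result differs.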

TABLE

The table() function creates a contingency table: it counts the number of observations cross-classified by the categories of one or more variables.

> sex <- c("male","female","male","female","male","female","female")
> age <- c(23,35,40,24,60,20,35)
> table(sex)
 female male
      4    3
> table(sex,age)
       20 23 24 35 40 60
female  1  0  1  2  0  0
  male  0  1  0  0  1  1    
> sex <- factor(c(1,2,1,2,1,2,2), labels=c("Female","Male"))
> table(sex,age)
       20 23 24 35 40 60
Female  0  1  0  0  1  1
  Male  1  0  1  2  0  0
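
With one column per distinct age the table becomes wide quickly, so it may help to group the ages with cut first. The following is a minimal sketch in current R; age_group, its break points and its labels are hypothetical choices made for illustration, not part of the original example.

```r
sex <- c("male", "female", "male", "female", "male", "female", "female")
age <- c(23, 35, 40, 24, 60, 20, 35)

# Hypothetical age bands: (0,30], (30,50] and (50,100].
age_group <- cut(age, breaks = c(0, 30, 50, 100),
                 labels = c("up to 30", "31-50", "over 50"))

# Cross-classify sex by age band instead of by exact age.
counts <- table(sex, age_group)
counts

# Individual cells can be read by row and column name.
counts["female", "up to 30"]    # 2
```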

TAPPLY

The function tapply() applies a function to each cell of a table. Its first argument is the data, its second argument gives the categorical variables (indices) that define the cells, and its third argument is the function to apply. Cells that contain no observations are reported as NA. Suppose we wish to report the mean systolic blood pressure for persons in each of the age/sex groups:

> systol <- c(118, 125, 128, 127, 110, 140, 130)
> tapply(systol, list(sex, age), mean)
        20  23  24    35  40  60
Female  NA 118  NA    NA 128 110
  Male 140  NA 127 127.5  NA  NA  
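
The same data can also be summarized over a single categorical variable. This is a small sketch in current R that reuses the vectors defined above; the names by_sex and by_both are introduced here only for illustration.

```r
sex <- factor(c(1, 2, 1, 2, 1, 2, 2), labels = c("Female", "Male"))
age <- c(23, 35, 40, 24, 60, 20, 35)
systol <- c(118, 125, 128, 127, 110, 140, 130)

# One index: mean systolic pressure for each sex.
by_sex <- tapply(systol, sex, mean)
by_sex

# Two indices: one cell per sex/age combination; empty cells are NA.
by_both <- tapply(systol, list(sex, age), mean)
by_both["Male", "35"]    # 127.5
```

With a single index the result is a named vector; with a list of two indices it is a matrix whose rows and columns are named by the levels.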

SAPPLY

The function sapply() applies a function to each element of a list and returns the results as a vector, matrix, or list.
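
As an illustration, here is a minimal sketch in current R. The list groups is a hypothetical example that holds the systolic pressures from the TAPPLY section, already separated by sex.

```r
# Hypothetical list: one numeric vector per group.
groups <- list(Female = c(118, 128, 110),
               Male   = c(125, 127, 140, 130))

# The function returns one number per element, so the result is a named vector.
sizes <- sapply(groups, length)
sizes

# The function returns two numbers per element, so the result is a matrix
# with one column per list element (rows: minimum, maximum).
ranges <- sapply(groups, range)
ranges
```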

SPLIT

The function split() breaks up a vector according to the values of a categorical variable and returns a list with one component per category.

> split(c("Martin", "Mary", "Matt"), c("M", "F", "M"))
$F:
[1] "Mary"

$M:
[1] "Martin" "Matt"
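
Because the result of split() is a list, it combines naturally with sapply(). The following is a short sketch in current R; the name by_sex is introduced here only for illustration.

```r
# Split the names into one list component per category.
by_sex <- split(c("Martin", "Mary", "Matt"), c("M", "F", "M"))

# Components are accessed by category name.
by_sex$M

# Count the number of names in each category.
sapply(by_sex, length)
```
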
Comments to: Miguel A. Arcones