
Contingency Tables in R

In this tutorial, you'll learn how to create contingency tables and how to test and quantify relationships visible in them.

How to make a table

First, let's get some data. The MASS package contains data about 93 cars on sale in the USA in 1993. The data are stored in the Cars93 object and include 27 features for each car, some of which are categorical. So let's load the MASS package and look at the types of vehicles included in Cars93:

library(MASS)
Cars93$Type

##  [1] Small   Midsize Compact Midsize Midsize Midsize Large   Large  
##  [9] Midsize Large   Midsize Compact Compact Sporty  Midsize Van    
## [17] Van     Large   Sporty  Large   Compact Large   Small   Small  
## [25] Compact Van     Midsize Sporty  Small   Large   Small   Small  
## [33] Compact Sporty  Sporty  Van     Midsize Large   Small   Sporty
## [41] Sporty  Small   Compact Small   Small   Sporty  Midsize Midsize
## [49] Midsize Midsize Midsize Large   Small   Small   Compact Van    
## [57] Sporty  Compact Midsize Sporty  Midsize Small   Midsize Small  
## [65] Compact Van     Midsize Compact Midsize Van     Large   Sporty
## [73] Small   Compact Sporty  Midsize Large   Compact Small   Small  
## [81] Small   Compact Small   Small   Sporty  Midsize Van     Small  
## [89] Van     Compact Sporty  Compact Midsize
## Levels: Compact Large Midsize Small Sporty Van

We have 6 types of cars here. The table function tells us how many cars of each type we have:

table(Cars93$Type)

##
## Compact   Large Midsize   Small  Sporty     Van
##      16      11      22      21      14       9

prop.table converts the counts into fractions:

prop.table(table(Cars93$Type))

##
##    Compact      Large    Midsize      Small     Sporty        Van
## 0.17204301 0.11827957 0.23655914 0.22580645 0.15053763 0.09677419

We can do the same with the origin of the cars:

table(Cars93$Origin)

##
##     USA non-USA
##      48      45

prop.table(table(Cars93$Origin))

##
##      USA  non-USA
## 0.516129 0.483871

How to make a contingency table

Great, we saw that our dataset contains similar numbers of US and non-US cars and that the most prevalent types are Midsize and Small. However, maybe US and non-US cars differ in type?

Let's look at the types of cars with respect to their origin. We can use table again, but with two arguments now. The first will become the row variable and the second will become the column variable:

table(Cars93$Type, Cars93$Origin)

##          
##           USA non-USA
##   Compact   7       9
##   Large    11       0
##   Midsize  10      12
##   Small     7      14
##   Sporty    8       6
##   Van       5       4

Now we see what everybody knows: Americans love large vehicles! The table above shows the joint distribution of two categorical variables (Type and Origin). Such tables are called contingency tables.

How to get marginals from a contingency table

The rowSums and colSums functions are really self-explanatory:

(tab1<-table(Cars93$Type, Cars93$Origin))

##          
##           USA non-USA
##   Compact   7       9
##   Large    11       0
##   Midsize  10      12
##   Small     7      14
##   Sporty    8       6
##   Van       5       4

rowSums(tab1)

## Compact   Large Midsize   Small  Sporty     Van
##      16      11      22      21      14       9

colSums(tab1)

##     USA non-USA
##      48      45
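As a side note, base R's addmargins function attaches both marginal sums to the table in one call; a quick sketch:

```r
library(MASS)  # for the Cars93 data

tab1 <- table(Cars93$Type, Cars93$Origin)

# addmargins() appends the row and column sums
# as an extra "Sum" row and "Sum" column
addmargins(tab1)
```

The bottom-right cell of the result is the grand total, 93.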

How to get percents from a contingency table

prop.table nested with table gives relative frequencies:

prop.table(table(Cars93$Type, Cars93$Origin))

##          
##                  USA    non-USA
##   Compact 0.07526882 0.09677419
##   Large   0.11827957 0.00000000
##   Midsize 0.10752688 0.12903226
##   Small   0.07526882 0.15053763
##   Sporty  0.08602151 0.06451613
##   Van     0.05376344 0.04301075

Conversion to percents is just multiplication by 100:

prop.table(table(Cars93$Type, Cars93$Origin))*100

##          
##                 USA   non-USA
##   Compact  7.526882  9.677419
##   Large   11.827957  0.000000
##   Midsize 10.752688 12.903226
##   Small    7.526882 15.053763
##   Sporty   8.602151  6.451613
##   Van      5.376344  4.301075

Notice that this is a joint probability distribution, from which we can see, e.g., that about 7.5% of the cars are small and of American origin.

More often, we are interested in the distribution of one variable within groups created by another. Here, the distribution of car types among US and (separately) non-US cars seems interesting. To get this, we use the margin argument of the prop.table function. It tells whether the grouping variable is in rows (margin=1) or in columns (margin=2):

prop.table(table(Cars93$Type, Cars93$Origin), margin=2)*100

##          
##                 USA   non-USA
##   Compact 14.583333 20.000000
##   Large   22.916667  0.000000
##   Midsize 20.833333 26.666667
##   Small   14.583333 31.111111
##   Sporty  16.666667 13.333333
##   Van     10.416667  8.888889

Now we can easily see that small cars are twice as frequent in the non-US part of our dataset as in the US part.

Also notice that the percents add up to 100 in each column, while in the joint distribution table (the one without the margin argument), 100 was the sum of the whole table.

(tab2<-prop.table(table(Cars93$Type, Cars93$Origin), margin=2)*100)

##          
##                 USA   non-USA
##   Compact 14.583333 20.000000
##   Large   22.916667  0.000000
##   Midsize 20.833333 26.666667
##   Small   14.583333 31.111111
##   Sporty  16.666667 13.333333
##   Van     10.416667  8.888889

colSums(tab2)

##     USA non-USA
##     100     100
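Analogously, margin=1 conditions on the row variable instead; a short sketch showing that each row then sums to 100:

```r
library(MASS)  # for the Cars93 data

# margin = 1: distribution of Origin within each car Type;
# now every *row* sums to 100
tab_row <- prop.table(table(Cars93$Type, Cars93$Origin), margin = 1) * 100
round(tab_row, 1)
rowSums(tab_row)
```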

Chi-squared test

The most common question that arises from contingency tables is whether the row and column variables are independent. The most basic way to answer it is to run a chi-squared test. It is covered in great detail in this tutorial. Let's check if Type and Origin are independent:

chisq.test(Cars93$Type, Cars93$Origin)

## Warning in chisq.test(Cars93$Type, Cars93$Origin): Chi-squared
## approximation may be incorrect

##
##  Pearson's Chi-squared test
##
## data:  Cars93$Type and Cars93$Origin
## X-squared = 14.08, df = 5, p-value = 0.01511

Apparently, they're not, but we also got the Chi-squared approximation may be incorrect warning. This is because the chi-squared statistic follows the chi-squared distribution only approximately. The more observations we have, the better the approximation is. The chisq.test function throws the above warning whenever one of the expected counts is lower than 5 (for what an "expected count" is, see the tutorial linked above).
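To see which cells trigger the warning, we can inspect the expected counts stored in the test result; a quick sketch:

```r
library(MASS)  # for the Cars93 data

# The expected counts under independence are stored in the
# $expected element of the htest object returned by chisq.test
res <- suppressWarnings(chisq.test(Cars93$Type, Cars93$Origin))
res$expected
min(res$expected)  # below 5, hence the warning
```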

Fisher's exact test

Fisher's exact test is an alternative to the chi-squared test, used mainly when the chi-squared approximation is not satisfactory. Let's run it:

fisher.test(Cars93$Type, Cars93$Origin)

##
##  Fisher's Exact Test for Count Data
##
## data:  Cars93$Type and Cars93$Origin
## p-value = 0.007248
## alternative hypothesis: two.sided

The results are quite similar to those from the chi-squared test, but this is not always the case. One major drawback of Fisher's exact test is that for large tables (or large samples) it becomes computationally expensive.

G-test

Another alternative is the so-called G-test. Its statistic is also approximately chi-squared distributed, but for small samples, this approximation is closer than the one the chi-squared test uses. For the G-test we can use the GTest function from the DescTools package. The results are again quite similar to those of the two previous tests: Type and Origin are not independent.

library(DescTools)
GTest(Cars93$Type, Cars93$Origin)

##
##  Log likelihood ratio (G-test) test of independence without
##  correction
##
## data:  Cars93$Type and Cars93$Origin
## G = 18.362, X-squared df = 5, p-value = 0.002526

Yates' correction

In 2x2 contingency tables, the chi-squared test may be enhanced with Yates' continuity correction. It simply subtracts 0.5 from each |Observed - Expected| term in the chi-squared statistic. Again, refer to this tutorial if you feel lost now. What's more, R applies Yates' correction automatically whenever it's necessary. Let's look at the availability of manual transmission versions of US and non-US cars:

(tab3<-table(Cars93$Man.trans.avail, Cars93$Origin))

##      
##       USA non-USA
##   No   26       6
##   Yes  22      39

chisq.test(tab3)

##
##  Pearson's Chi-squared test with Yates' continuity correction
##
## data:  tab3
## X-squared = 15.397, df = 1, p-value = 8.712e-05

Yates' correction was applied automatically, and R told us about it.
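If you want to compare, the correction can be switched off with the correct argument of chisq.test; a quick sketch:

```r
library(MASS)  # for the Cars93 data

tab3 <- table(Cars93$Man.trans.avail, Cars93$Origin)

# correct = FALSE disables Yates' continuity correction;
# the uncorrected statistic is somewhat larger
chisq.test(tab3, correct = FALSE)
```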

3 (or more) dimensional tables

Let's look at Man.trans.avail, Origin, and Type at once. table splits the dataset according to the variable provided as the third argument:

table(Cars93$Man.trans.avail, Cars93$Origin, Cars93$Type)

## , ,  = Compact
##
##      
##       USA non-USA
##   No    2       0
##   Yes   5       9
##
## , ,  = Large
##
##      
##       USA non-USA
##   No   11       0
##   Yes   0       0
##
## , ,  = Midsize
##
##      
##       USA non-USA
##   No    9       4
##   Yes   1       8
##
## , ,  = Small
##
##      
##       USA non-USA
##   No    0       0
##   Yes   7      14
##
## , ,  = Sporty
##
##      
##       USA non-USA
##   No    0       0
##   Yes   8       6
##
## , ,  = Van
##
##      
##       USA non-USA
##   No    4       2
##   Yes   1       2

ftable gives a more compact view:

ftable(Cars93$Man.trans.avail, Cars93$Origin, Cars93$Type)

##              Compact Large Midsize Small Sporty Van
##                                                    
## No  USA            2    11       9     0      0   4
##     non-USA        0     0       4     0      0   2
## Yes USA            5     0       1     7      8   1
##     non-USA        9     0       8    14      6   2

Cochran-Mantel-Haenszel test

From the table results above, it should be clear that the relationship between Origin and Man.trans.avail differs with Type. For small and sporty cars, for example, there is no association: every US, as well as every non-US, small or sporty car comes in a manual version. On the other hand, most midsize US cars don't have a manual version, while most non-US cars of this type do. To account for possibly different relationships in strata, the Cochran-Mantel-Haenszel test can be used:

mantelhaen.test(Cars93$Man.trans.avail, Cars93$Origin, Cars93$Type)

##
##  Mantel-Haenszel chi-squared test with continuity correction
##
## data:  Cars93$Man.trans.avail and Cars93$Origin and Cars93$Type
## Mantel-Haenszel X-squared = 8.0153, df = 1, p-value = 0.004638
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
##   2.226531 76.891307
## sample estimates:
## common odds ratio
##          13.08438

The third argument passed to mantelhaen.test identifies the strata. Compare the above results to those without stratification (the Yates' correction example above). The association is still present, but the evidence for it is much weaker.

Association measures

Once we have discovered some associations between variables, it's time to measure their strength. There is a plethora of possible measures. Many of them are described here. I'll focus on the two most widely used.

Cramer's V

Cramer's V is based on the chi-squared statistic:

$$ V = \sqrt{\frac{\chi^2/N}{\min(C-1, R-1)}}, $$ where:

  • N is the grand total of the contingency table (the sum of all its cells),
  • C is the number of columns,
  • R is the number of rows.

V ∈ [0, 1]. The larger V is, the stronger the relationship between the variables. V = 0 can be interpreted as independence (since V = 0 if and only if χ² = 0). The main drawback of V is its lack of precise interpretation: is V = 0.6 a strong, moderate, or weak association?

The CramerV function from DescTools can calculate it for us:

CramerV(Cars93$Type, Cars93$Origin)

## [1] 0.3890967

Once again: is this a strong, moderate, or weak association?
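As a sanity check on the formula, V can also be computed by hand from the chisq.test output (a sketch; note that chisq.test applies no continuity correction to tables larger than 2x2):

```r
library(MASS)  # for the Cars93 data

tab1 <- table(Cars93$Type, Cars93$Origin)

# V by hand: chi-squared statistic, grand total N,
# and min(C - 1, R - 1) from the table dimensions
cs <- suppressWarnings(chisq.test(tab1))$statistic
V  <- sqrt((cs / sum(tab1)) / min(dim(tab1) - 1))
unname(V)
```

The result agrees with the CramerV output above.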

Goodman and Kruskal lambda

Goodman and Kruskal's lambda is an example of a measure based on proportional reduction in variation. These kinds of measures aim to mimic R², the coefficient of determination from linear regression:

  • they take values from [0, 1],
  • they express the fraction of variation explained by the independent variable.

Let's use the column variable as the independent one. Goodman and Kruskal's lambda formula is then: $$\lambda = \frac{L - \sum_j L_j}{L},$$ where:

  • Lj is the sum of non-modal frequencies in the j-th column,
  • L is the sum of non-modal frequencies in the "total" column.

This is best explained with an example. Let's look at Type and Origin, the latter being the independent variable:

table(Cars93$Type, Cars93$Origin)

##          
##           USA non-USA
##   Compact   7       9
##   Large    11       0
##   Midsize  10      12
##   Small     7      14
##   Sporty    8       6
##   Van       5       4

The mode of the USA column is 11, so L1 = 7 + 10 + 7 + 8 + 5 = 37. The mode of the non-USA column is 14, so L2 = 9 + 0 + 12 + 6 + 4 = 31. The "total" column is

rowSums(table(Cars93$Type, Cars93$Origin))

## Compact   Large Midsize   Small  Sporty     Van
##      16      11      22      21      14       9

The mode is 22, so L = 16 + 11 + 21 + 14 + 9 = 71 and $$\lambda = \frac{71 - (37+31)}{71} \approx 0.042$$
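The arithmetic above translates directly into a few lines of base R; a sketch mirroring the hand calculation:

```r
library(MASS)  # for the Cars93 data

tab <- table(Cars93$Type, Cars93$Origin)

# Non-modal frequencies per column: column total minus the column's mode
Lj <- colSums(tab) - apply(tab, 2, max)   # 37 and 31

# Non-modal frequencies in the "total" column
L <- sum(tab) - max(rowSums(tab))         # 93 - 22 = 71

(L - sum(Lj)) / L
```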

We can use the Lambda function from the DescTools package instead:

Lambda(Cars93$Type, Cars93$Origin, direction='row')

## [1] 0.04225352

The direction parameter determines whether the dependent variable is in rows or in columns.

Origin explains only about 4% of the variation in Type. Notice from the formula that lambda defines variation as a dichotomy between belonging and not belonging to the largest group.

It's worth mentioning that lambda is zero whenever the modal category of each column is the same. Take a look at this table:

(lambdaTab<-cbind(c(0,100), c(49,51)))

##      [,1] [,2]
## [1,]    0   49
## [2,]  100   51

chisq.test(lambdaTab)

##
##  Pearson's Chi-squared test with Yates' continuity correction
##
## data:  lambdaTab
## X-squared = 62.279, df = 1, p-value = 2.981e-15

The modal category is the same for each column (it's the second row), so lambda must be zero despite a significant and visible association:

Lambda(lambdaTab, direction='row')

## [1] 0

If you would like to learn more about statistics in R, take DataCamp's Statistical Modeling in R (Part 1) course.
