
Contingency Tables in R

In this tutorial, you'll learn how to create contingency tables and how to test and quantify relationships visible in them.

How to make a table

First, let's get some data. The MASS package contains data about 93 cars on sale in the USA in 1993. They are stored in the Cars93 object and include 27 features for each car, some of which are categorical. So let's load the MASS package and look at the types of vehicles included in Cars93:

library(MASS)
Cars93$Type

## [1] Small   Midsize Compact Midsize Midsize Midsize Large   Large
## [9] Midsize Large   Midsize Compact Compact Sporty  Midsize Van
## [17] Van    Large   Sporty  Large   Compact Large   Small   Small
## [25] Compact Van    Midsize Sporty  Small   Large   Small   Small
## [33] Compact Sporty Sporty  Van     Midsize Large   Small   Sporty
## [41] Sporty Small   Compact Small   Small   Sporty  Midsize Midsize
## [49] Midsize Midsize Midsize Large  Small   Small   Compact Van
## [57] Sporty Compact Midsize Sporty  Midsize Small   Midsize Small
## [65] Compact Van    Midsize Compact Midsize Van     Large   Sporty
## [73] Small  Compact Sporty  Midsize Large   Compact Small   Small
## [81] Small  Compact Small   Small   Sporty  Midsize Van     Small
## [89] Van    Compact Sporty  Compact Midsize
## Levels: Compact Large Midsize Small Sporty Van

We have six types of cars here. The table function tells us how many of each type we have:

table(Cars93$Type)

##
## Compact   Large Midsize   Small  Sporty     Van
##      16      11      22      21      14       9
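If you also want the types ranked by frequency, sorting the table works too (a small aside, not part of the original output):

```r
# Load MASS for the Cars93 data set
library(MASS)

# Rank vehicle types from the most to the least common
sort(table(Cars93$Type), decreasing = TRUE)
```

As expected from the counts above, Midsize and Small come out on top.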


The prop.table function converts the counts into fractions:

prop.table(table(Cars93$Type))

##
##    Compact      Large    Midsize      Small     Sporty        Van
## 0.17204301 0.11827957 0.23655914 0.22580645 0.15053763 0.09677419

The same works with the origin of the cars:

table(Cars93$Origin)

##
##     USA non-USA
##      48      45

prop.table(table(Cars93$Origin))

##
##      USA  non-USA
## 0.516129 0.483871

How to make a contingency table

Great, we saw that our dataset contains a similar number of US and non-US cars and that the most prevalent types are Midsize and Small. But maybe the US and non-US cars differ in type? Let's look at the types of cars with respect to their origin. We can use table again, now with two arguments: the first becomes the row variable and the second becomes the column variable:

table(Cars93$Type, Cars93$Origin)

##
##           USA non-USA
##   Compact   7       9
##   Large    11       0
##   Midsize  10      12
##   Small     7      14
##   Sporty    8       6
##   Van       5       4

Now we see what everybody knows: Americans love large vehicles! The table above shows the joint distribution of two categorical variables (Type and Origin). Such tables are called contingency tables.

How to get marginals from a contingency table

The rowSums and colSums functions are really self-explanatory (wrapping the assignment in parentheses stores the table and prints it at the same time):

(tab1 <- table(Cars93$Type, Cars93$Origin))

##
##           USA non-USA
##   Compact   7       9
##   Large    11       0
##   Midsize  10      12
##   Small     7      14
##   Sporty    8       6
##   Van       5       4

rowSums(tab1)

## Compact   Large Midsize   Small  Sporty     Van
##      16      11      22      21      14       9

colSums(tab1)

##     USA non-USA
##      48      45

How to get percents from a contingency table

Nesting table inside prop.table gives relative frequencies:

prop.table(table(Cars93$Type, Cars93$Origin))

##
##                  USA    non-USA
##   Compact 0.07526882 0.09677419
##   Large   0.11827957 0.00000000
##   Midsize 0.10752688 0.12903226
##   Small   0.07526882 0.15053763
##   Sporty  0.08602151 0.06451613
##   Van     0.05376344 0.04301075

Converting to percents is just multiplication by 100:

prop.table(table(Cars93$Type, Cars93$Origin))*100

##
##                 USA   non-USA
##   Compact  7.526882  9.677419
##   Large   11.827957  0.000000
##   Midsize 10.752688 12.903226
##   Small    7.526882 15.053763
##   Sporty   8.602151  6.451613
##   Van      5.376344  4.301075

Notice that this is a joint probability distribution, from which we can see, e.g., that about 7.5% of the cars are small and of American origin.
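As a side note, base R's addmargins function appends both marginal totals to the table in one call, instead of computing rowSums and colSums separately (a sketch):

```r
library(MASS)

tab1 <- table(Cars93$Type, Cars93$Origin)

# Add a "Sum" row and a "Sum" column holding the marginal totals
addmargins(tab1)
```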
More often, we are interested in the distribution of one variable within groups created by another. Here, the distribution of car types among the US and (separately) the non-US cars seems interesting. To get this, we use the margin argument of the prop.table function. It tells where the grouping variable is: in the rows (margin=1) or in the columns (margin=2):

prop.table(table(Cars93$Type, Cars93$Origin), margin=2)*100

##
##                 USA   non-USA
##   Compact 14.583333 20.000000
##   Large   22.916667  0.000000
##   Midsize 20.833333 26.666667
##   Small   14.583333 31.111111
##   Sporty  16.666667 13.333333
##   Van     10.416667  8.888889

Now we can easily see that small cars are twice as frequent in the non-US part of our dataset as in the US part. Also notice that the percents now add up to 100 within each column, while in the joint distribution table (the one without the margin argument) 100 was the sum of the whole table:

(tab2 <- prop.table(table(Cars93$Type, Cars93$Origin), margin=2)*100)

##
##                 USA   non-USA
##   Compact 14.583333 20.000000
##   Large   22.916667  0.000000
##   Midsize 20.833333 26.666667
##   Small   14.583333 31.111111
##   Sporty  16.666667 13.333333
##   Van     10.416667  8.888889

colSums(tab2)

##     USA non-USA
##     100     100

Chi-squared test

The most common question that arises from a contingency table is whether the row and column variables are independent. The most basic way to answer it is to run a chi-squared test. It is covered in great detail in this tutorial. Let's check whether Type and Origin are independent:

chisq.test(Cars93$Type, Cars93$Origin)

## Warning in chisq.test(Cars93$Type, Cars93$Origin): Chi-squared
## approximation may be incorrect

##
##  Pearson's Chi-squared test
##
## data:  Cars93$Type and Cars93$Origin
## X-squared = 14.08, df = 5, p-value = 0.01511

Apparently they're not, but we also got the "Chi-squared approximation may be incorrect" warning. This is because the chi-squared statistic follows a chi-squared distribution only approximately; the more observations we have, the better the approximation.
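To see exactly which cells cause the warning, we can inspect the expected counts stored in the test object; one possible remedy, a Monte Carlo p-value, is also sketched below (the simulated p-value varies slightly from run to run):

```r
library(MASS)

# Store the test result (suppressWarnings hides the approximation warning)
res <- suppressWarnings(chisq.test(Cars93$Type, Cars93$Origin))

# Expected counts under independence; the warning fires when some are below 5
res$expected

# How many expected counts fall below 5? (here: the two Van cells)
sum(res$expected < 5)

# One remedy: estimate the p-value by Monte Carlo simulation instead of
# relying on the chi-squared approximation
chisq.test(Cars93$Type, Cars93$Origin, simulate.p.value = TRUE, B = 10000)
```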
The chisq.test function throws the above warning whenever one of the expected counts is lower than 5 (for what an "expected count" is, see the tutorial linked above).

Fisher's exact test

Fisher's exact test is an alternative to the chi-squared test, used mainly when the chi-squared approximation is not satisfactory. Let's run it:

fisher.test(Cars93$Type, Cars93$Origin)

##
##  Fisher's Exact Test for Count Data
##
## data:  Cars93$Type and Cars93$Origin
## p-value = 0.007248
## alternative hypothesis: two.sided

The results are quite similar to those of the chi-squared test, but this is not always the case. One major drawback of Fisher's exact test is that it becomes computationally inefficient for large tables (or large samples).

G-test

Another alternative is the so-called G-test. Its statistic is also approximately chi-squared distributed, but for small samples this approximation is closer than the one the chi-squared test uses. For the G-test we can use the GTest function from the DescTools package. The results are again quite similar to those of the two previous tests: Type and Origin are not independent.

library(DescTools)
GTest(Cars93$Type, Cars93$Origin)

##
##  Log likelihood ratio (G-test) test of independence without
##  correction
##
## data:  Cars93$Type and Cars93$Origin
## G = 18.362, X-squared df = 5, p-value = 0.002526

Yates' correction

In 2x2 contingency tables, the chi-squared test may be enhanced with Yates' continuity correction, which simply subtracts 0.5 from each |Observed - Expected| term in the chi-squared statistic. Again, refer to this tutorial if you feel lost now. What is more, R applies Yates' correction automatically whenever it's necessary.
Let's look at the availability of manual transmission versions of US and non-US cars:

(tab3 <- table(Cars93$Man.trans.avail, Cars93$Origin))

##
##       USA non-USA
##   No   26       6
##   Yes  22      39

chisq.test(tab3)

##
##  Pearson's Chi-squared test with Yates' continuity correction
##
## data:  tab3
## X-squared = 15.397, df = 1, p-value = 8.712e-05

Yates' correction was applied automatically, and R told us about it.

A 3 (or more) dimensional table

Let's look at Man.trans.avail, Origin, and Type at once. table splits the dataset according to the variable provided third:

table(Cars93$Man.trans.avail, Cars93$Origin, Cars93$Type)

## , ,  = Compact
##
##
##       USA non-USA
##   No    2       0
##   Yes   5       9
##
## , ,  = Large
##
##
##       USA non-USA
##   No   11       0
##   Yes   0       0
##
## , ,  = Midsize
##
##
##       USA non-USA
##   No    9       4
##   Yes   1       8
##
## , ,  = Small
##
##
##       USA non-USA
##   No    0       0
##   Yes   7      14
##
## , ,  = Sporty
##
##
##       USA non-USA
##   No    0       0
##   Yes   8       6
##
## , ,  = Van
##
##
##       USA non-USA
##   No    4       2
##   Yes   1       2
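As an equivalent alternative (an aside, not used elsewhere in this tutorial), the same three-way table can be built with xtabs and its formula interface:

```r
library(MASS)

# Cross-tabulate the three variables at once via a formula
xtabs(~ Man.trans.avail + Origin + Type, data = Cars93)
```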


The ftable function gives a more compact view:

ftable(Cars93$Man.trans.avail, Cars93$Origin, Cars93$Type)

##              Compact Large Midsize Small Sporty Van
##
## No  USA            2    11       9     0      0   4
##     non-USA        0     0       4     0      0   2
## Yes USA            5     0       1     7      8   1
##     non-USA        9     0       8    14      6   2

Cochran-Mantel-Haenszel test

From the tables above, it should be clear that the relationship between Origin and Man.trans.avail differs with Type. For small and sporty cars, for example, there is no association: every US as well as every non-US small or sporty car comes in a manual version. On the other hand, most midsize US cars don't have a manual version, while most non-US cars of this type do. To account for possibly different relationships in the strata, the Cochran-Mantel-Haenszel test can be used:

mantelhaen.test(Cars93$Man.trans.avail, Cars93$Origin, Cars93$Type)

##
##  Mantel-Haenszel chi-squared test with continuity correction
##
## data:  Cars93$Man.trans.avail and Cars93$Origin and Cars93$Type
## Mantel-Haenszel X-squared = 8.0153, df = 1, p-value = 0.004638
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
##   2.226531 76.891307
## sample estimates:
## common odds ratio
##          13.08438

The third argument passed to mantelhaen.test identifies the strata. Compare the above results to those without stratification (the Yates' correction example above): the association is still present, but the evidence for it is much weaker.

Association measures

Once we have discovered some associations between variables, it's time to measure their strength. There is a plethora of possible measures; many of them are described here. I'll focus on the two most widely used.

Cramer's V

V is based on the chi-squared statistic:

$$V = \sqrt{\frac{\chi^2/N}{\min(C-1, R-1)}},$$

where:

• N is the grand total of the contingency table (the sum of all its cells),
• C is the number of columns,
• R is the number of rows.

V ∈ [0, 1]. The larger V is, the stronger the relationship between the variables. V = 0 can be interpreted as independence (since V = 0 if and only if χ² = 0). The main drawback of V is the lack of a precise interpretation: is V = 0.6 a strong, moderate, or weak association? The CramerV function from DescTools can calculate it for us:

CramerV(Cars93$Type, Cars93$Origin)

## [1] 0.3890967

Once again: is this a strong, moderate, or weak association?

Goodman and Kruskal's lambda

Goodman and Kruskal's lambda is an example of a measure based on the proportional reduction in variation. Measures of this kind aim to mimic R², the coefficient of determination from linear regression:

• they take values in [0, 1],
• they express the fraction of variation explained by the independent variable.

Let's use the column variable as the independent one.
Goodman and Kruskal's lambda formula is then:

$$\lambda = \frac{L - \sum_j L_j}{L},$$

where:

• Lj is the sum of the non-modal frequencies in the j-th column,
• L is the sum of the non-modal frequencies in the "total" column.

This is best explained with an example. Let's look at Type and Origin, the latter being the independent variable:

table(Cars93$Type, Cars93$Origin)

##
##           USA non-USA
##   Compact   7       9
##   Large    11       0
##   Midsize  10      12
##   Small     7      14
##   Sporty    8       6
##   Van       5       4

The mode of the USA column is 11, so L1 = 7 + 10 + 7 + 8 + 5 = 37. The mode of the non-USA column is 14, so L2 = 9 + 0 + 12 + 6 + 4 = 31. The "total" column is:

rowSums(table(Cars93$Type, Cars93$Origin))

## Compact   Large Midsize   Small  Sporty     Van
##      16      11      22      21      14       9

Its mode is 22, so L = 16 + 11 + 21 + 14 + 9 = 71 and

$$\lambda = \frac{71 - (37+31)}{71} = 0.042.$$

We can use the Lambda function from the DescTools package instead:

Lambda(Cars93$Type, Cars93$Origin, direction='row')

## [1] 0.04225352


The direction parameter determines where the dependent variable is: in the rows or in the columns.

Origin explains only about 4% of the variation in Type. Notice from the formula that lambda defines variation as a dichotomy between belonging and not belonging to the largest group.
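As a sanity check on both formulas, the two measures can be computed by hand in base R, without DescTools (a sketch):

```r
library(MASS)

tab <- table(Cars93$Type, Cars93$Origin)

# Cramer's V: chi-squared statistic scaled by the table's grand total
# (suppressWarnings hides the small-expected-counts warning discussed above)
chi2 <- suppressWarnings(chisq.test(tab)$statistic)
V <- sqrt((chi2 / sum(tab)) / min(ncol(tab) - 1, nrow(tab) - 1))
unname(V)  # about 0.389, matching CramerV

# Goodman and Kruskal's lambda with the column variable as independent
L_j <- colSums(tab) - apply(tab, 2, max)  # non-modal frequencies per column
L   <- sum(tab) - max(rowSums(tab))       # non-modal frequencies in the total column
(L - sum(L_j)) / L                        # about 0.042, matching Lambda
```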

It's worth mentioning that lambda is zero whenever the modal category of each column is the same. Take a look at this table:

(lambdaTab<-cbind(c(0,100), c(49,51)))

##      [,1] [,2]
## [1,]    0   49
## [2,]  100   51

chisq.test(lambdaTab)

##
##  Pearson's Chi-squared test with Yates' continuity correction
##
## data:  lambdaTab
## X-squared = 62.279, df = 1, p-value = 2.981e-15


The modal category is the same for each column (it's the second row), so lambda must be zero despite a significant and clearly visible association:

Lambda(lambdaTab, direction='row')

## [1] 0
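For contrast, Cramer's V computed by hand on the same table from the uncorrected chi-squared statistic is far from zero, so the two measures can disagree substantially:

```r
lambdaTab <- cbind(c(0, 100), c(49, 51))

# Cramer's V from the uncorrected chi-squared statistic
chi2 <- chisq.test(lambdaTab, correct = FALSE)$statistic
V <- sqrt((chi2 / sum(lambdaTab)) / min(ncol(lambdaTab) - 1, nrow(lambdaTab) - 1))
unname(V)  # about 0.57: a clear association that lambda misses entirely
```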


If you would like to learn more about statistics in R, take DataCamp's Statistical Modeling in R (Part 1) course.