R Contingency Tables Tutorial
How to Make a Table
First, let's get some data. The MASS package contains data about 93 cars on sale in the USA in 1993. They're stored in the Cars93 object and include 27 features for each car, some of which are categorical. So let's load the MASS package and look at the types of vehicles included in Cars93:
library(MASS)
Cars93$Type
## [1] Small Midsize Compact Midsize Midsize Midsize Large Large
## [9] Midsize Large Midsize Compact Compact Sporty Midsize Van
## [17] Van Large Sporty Large Compact Large Small Small
## [25] Compact Van Midsize Sporty Small Large Small Small
## [33] Compact Sporty Sporty Van Midsize Large Small Sporty
## [41] Sporty Small Compact Small Small Sporty Midsize Midsize
## [49] Midsize Midsize Midsize Large Small Small Compact Van
## [57] Sporty Compact Midsize Sporty Midsize Small Midsize Small
## [65] Compact Van Midsize Compact Midsize Van Large Sporty
## [73] Small Compact Sporty Midsize Large Compact Small Small
## [81] Small Compact Small Small Sporty Midsize Van Small
## [89] Van Compact Sporty Compact Midsize
## Levels: Compact Large Midsize Small Sporty Van
We have 6 types of cars here. The table function tells us how many cars of each type we have:
table(Cars93$Type)
##
## Compact Large Midsize Small Sporty Van
## 16 11 22 21 14 9
prop.table converts the counts into fractions:
prop.table(table(Cars93$Type))
##
## Compact Large Midsize Small Sporty Van
## 0.17204301 0.11827957 0.23655914 0.22580645 0.15053763 0.09677419
The same works for the origin of the cars:
table(Cars93$Origin)
##
## USA non-USA
## 48 45
prop.table(table(Cars93$Origin))
##
## USA non-USA
## 0.516129 0.483871
How to Make a Contingency Table
Great, we saw that our dataset contains a similar number of US and non-US cars and that the most prevalent types are Midsize and Small. However, maybe US and non-US cars differ in type?
Let's look at the types of cars with respect to their origin. We can use table again, but now with two arguments: the first becomes the row variable and the second becomes the column variable:
table(Cars93$Type, Cars93$Origin)
##
## USA non-USA
## Compact 7 9
## Large 11 0
## Midsize 10 12
## Small 7 14
## Sporty 8 6
## Van 5 4
Now we see what everybody knows: Americans love large vehicles! The table above shows the joint distribution of two categorical variables (Type and Origin). Such tables are called contingency tables.
How to Get Marginals from a Contingency Table
The rowSums and colSums functions are really self-explanatory:
(tab1<-table(Cars93$Type, Cars93$Origin))
##
## USA non-USA
## Compact 7 9
## Large 11 0
## Midsize 10 12
## Small 7 14
## Sporty 8 6
## Van 5 4
rowSums(tab1)
## Compact Large Midsize Small Sporty Van
## 16 11 22 21 14 9
colSums(tab1)
## USA non-USA
## 48 45
How to Get Percents from a Contingency Table
prop.table nested with table gives relative frequencies:
prop.table(table(Cars93$Type, Cars93$Origin))
##
## USA non-USA
## Compact 0.07526882 0.09677419
## Large 0.11827957 0.00000000
## Midsize 0.10752688 0.12903226
## Small 0.07526882 0.15053763
## Sporty 0.08602151 0.06451613
## Van 0.05376344 0.04301075
Conversion to percents is just multiplication by 100:
prop.table(table(Cars93$Type, Cars93$Origin))*100
##
## USA non-USA
## Compact 7.526882 9.677419
## Large 11.827957 0.000000
## Midsize 10.752688 12.903226
## Small 7.526882 15.053763
## Sporty 8.602151 6.451613
## Van 5.376344 4.301075
Notice that this is a joint probability distribution, from which we can see that, e.g., about 7.5% of the cars are small and of American origin.
More often, we are interested in the distribution of one variable within groups created by another. Here, the distribution of car types among US and (separately) non-US cars seems interesting. To get this, we use the margin argument of the prop.table function. It tells whether the grouping variable is in rows (margin=1) or in columns (margin=2):
prop.table(table(Cars93$Type, Cars93$Origin), margin=2)*100
##
## USA non-USA
## Compact 14.583333 20.000000
## Large 22.916667 0.000000
## Midsize 20.833333 26.666667
## Small 14.583333 31.111111
## Sporty 16.666667 13.333333
## Van 10.416667 8.888889
Now we can easily see that small cars are twice as frequent in the non-US part of our dataset as in the US part.
Also notice that the percents add up to 100 within each column, while in the joint distribution table (the one without the margin argument), 100 was the sum of the whole table.
(tab2<-prop.table(table(Cars93$Type, Cars93$Origin), margin=2)*100)
##
## USA non-USA
## Compact 14.583333 20.000000
## Large 22.916667 0.000000
## Midsize 20.833333 26.666667
## Small 14.583333 31.111111
## Sporty 16.666667 13.333333
## Van 10.416667 8.888889
colSums(tab2)
## USA non-USA
## 100 100
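For comparison, here is a quick sketch (using only base R and the MASS data) of what margin=1 does: it normalizes within rows instead, so each car type sums to 100 across origins:

```r
library(MASS)  # provides the Cars93 data set

# margin=1 conditions on the row variable (Type): each row sums to 100
tab <- table(Cars93$Type, Cars93$Origin)
round(prop.table(tab, margin=1) * 100, 1)

# sanity check: every row sum is 100
rowSums(prop.table(tab, margin=1) * 100)
```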
Chi-squared Test
The most common question that arises from contingency tables is whether the row and column variables are independent. The most basic way to answer it is to run a chi-squared test. It is covered in great detail in this tutorial. Let's check if Type and Origin are independent:
chisq.test(Cars93$Type, Cars93$Origin)
## Warning in chisq.test(Cars93$Type, Cars93$Origin): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: Cars93$Type and Cars93$Origin
## X-squared = 14.08, df = 5, p-value = 0.01511
Apparently, they're not, but we also got the Chi-squared approximation may be incorrect warning. This is because the chi-squared statistic follows the chi-squared distribution only approximately. The more observations we have, the better the approximation is. The chisq.test function throws the above warning whenever one of the expected counts is lower than 5 (for what an "expected count" is, see the tutorial linked above).
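We can inspect the offending expected counts ourselves: chisq.test stores them in the expected element of its result. A short sketch in base R (suppressWarnings only hides the approximation warning discussed above):

```r
library(MASS)

# run the test once and keep the result object
test <- suppressWarnings(chisq.test(Cars93$Type, Cars93$Origin))
test$expected   # (row total * column total) / grand total for each cell

# which cells fall below 5? (both Van cells do)
test$expected < 5
```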
Fisher's Exact Test
Fisher's exact test is an alternative to the chi-squared test, used mainly when the chi-squared approximation is not satisfactory. Let's run it:
fisher.test(Cars93$Type, Cars93$Origin)
##
## Fisher's Exact Test for Count Data
##
## data: Cars93$Type and Cars93$Origin
## p-value = 0.007248
## alternative hypothesis: two.sided
The results are quite similar to those from the chi-squared test, but this is not always the case. One major drawback of Fisher's exact test is that for large tables (or large samples) it becomes computationally expensive.
G-test
Another alternative is the so-called G-test. Its statistic is also approximately chi-squared distributed, but for small samples this approximation is closer than the one the chi-squared test uses. For the G-test we can use the GTest function from the DescTools package. The results are again quite similar to the two previous tests: Type and Origin are not independent.
library(DescTools)
GTest(Cars93$Type, Cars93$Origin)
##
## Log likelihood ratio (G-test) test of independence without
## correction
##
## data: Cars93$Type and Cars93$Origin
## G = 18.362, X-squared df = 5, p-value = 0.002526
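To demystify the statistic, here is a hand computation of G = 2 Σ O·log(O/E), a sketch in base R (cells with an observed count of 0 contribute nothing to the sum):

```r
library(MASS)

tab <- table(Cars93$Type, Cars93$Origin)
# expected counts under independence
E <- outer(rowSums(tab), colSums(tab)) / sum(tab)
# the limit of x*log(x) as x -> 0 is 0, so zero cells contribute 0
G <- 2 * sum(ifelse(tab > 0, tab * log(tab / E), 0))
G                                        # about 18.362, as GTest reported
pchisq(G, df = (nrow(tab) - 1) * (ncol(tab) - 1), lower.tail = FALSE)
```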
Yates' Correction
In 2x2 contingency tables, the chi-squared test may be enhanced with Yates' continuity correction. It simply subtracts 0.5 from each |Observed - Expected| term in the chi-squared statistic. Again, refer to this tutorial if you feel lost now. What is more, R applies Yates' correction automatically whenever it's necessary. Let's look at the availability of manual transmission versions of US and non-US cars:
(tab3<-table(Cars93$Man.trans.avail, Cars93$Origin))
##
## USA non-USA
## No 26 6
## Yes 22 39
chisq.test(tab3)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab3
## X-squared = 15.397, df = 1, p-value = 8.712e-05
Yates' correction was applied automatically, and R told us about it.
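To confirm what the correction does, we can rebuild the corrected statistic by hand. This is a base-R sketch of the formula, not how chisq.test is implemented internally:

```r
library(MASS)

tab3 <- table(Cars93$Man.trans.avail, Cars93$Origin)
# expected counts under independence
E <- outer(rowSums(tab3), colSums(tab3)) / sum(tab3)
# Yates: subtract 0.5 from each |observed - expected| before squaring
X2 <- sum((abs(tab3 - E) - 0.5)^2 / E)
X2                                       # about 15.397, as reported above
pchisq(X2, df = 1, lower.tail = FALSE)   # about 8.7e-05
```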
3 (or More) Dimensional Tables
Let's look at Man.trans.avail, Origin, and Type at once. table splits the dataset according to the variable provided third:
table(Cars93$Man.trans.avail, Cars93$Origin, Cars93$Type)
## , , = Compact
##
##
## USA non-USA
## No 2 0
## Yes 5 9
##
## , , = Large
##
##
## USA non-USA
## No 11 0
## Yes 0 0
##
## , , = Midsize
##
##
## USA non-USA
## No 9 4
## Yes 1 8
##
## , , = Small
##
##
## USA non-USA
## No 0 0
## Yes 7 14
##
## , , = Sporty
##
##
## USA non-USA
## No 0 0
## Yes 8 6
##
## , , = Van
##
##
## USA non-USA
## No 4 2
## Yes 1 2
ftable gives a more compact view:
ftable(Cars93$Man.trans.avail, Cars93$Origin, Cars93$Type)
## Compact Large Midsize Small Sporty Van
##
## No USA 2 11 9 0 0 4
## non-USA 0 0 4 0 0 2
## Yes USA 5 0 1 7 8 1
## non-USA 9 0 8 14 6 2
Cochran-Mantel-Haenszel Test
From the table results above, it should be clear that the relationship between Origin and Man.trans.avail differs with Type. For small and sporty cars, for example, there is no association: every US as well as every non-US small or sporty car comes in a manual version. On the other hand, most midsize US cars don't have a manual version, while most non-US cars of this type do. To account for possibly different relationships within strata, the Cochran-Mantel-Haenszel test can be used:
mantelhaen.test(Cars93$Man.trans.avail, Cars93$Origin, Cars93$Type)
##
## Mantel-Haenszel chi-squared test with continuity correction
##
## data: Cars93$Man.trans.avail and Cars93$Origin and Cars93$Type
## Mantel-Haenszel X-squared = 8.0153, df = 1, p-value = 0.004638
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
## 2.226531 76.891307
## sample estimates:
## common odds ratio
## 13.08438
The third argument passed to mantelhaen.test identifies the strata. Compare the above results to those without stratification (the Yates' correction example above). The association is still present, but the evidence for it is much weaker.
Association Measures
Once we have discovered some associations between variables, it's time to measure their strength. There is a plethora of possible measures; many of them are described here. For now, I'll focus on the two most widely used.
Cramer's V
V is based on the chi-squared statistic:
$$ V = \sqrt{\frac{\chi^2/N}{\min(C-1, R-1)}}, $$ where:
- N is the grand total of the contingency table (the sum of all its cells),
- C is the number of columns,
- R is the number of rows.
V ∈ [0, 1]. The larger V is, the stronger the relationship between the variables. V = 0 can be interpreted as independence (since V = 0 if and only if χ2 = 0). The main drawback of V is its lack of a precise interpretation: is V = 0.6 a strong, moderate, or weak association?
The CramerV function from DescTools can calculate it for us:
CramerV(Cars93$Type, Cars93$Origin)
## [1] 0.3890967
Once again: is this a strong, moderate, or weak association?
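As a sanity check, V can be reproduced straight from the formula above, using the chi-squared statistic computed earlier (base R only):

```r
library(MASS)

tab <- table(Cars93$Type, Cars93$Origin)
chi2 <- suppressWarnings(chisq.test(tab))$statistic  # 14.08, as before
# min(C-1, R-1) = min(1, 5) = 1 for this 6x2 table
V <- sqrt((chi2 / sum(tab)) / min(ncol(tab) - 1, nrow(tab) - 1))
unname(V)                                # about 0.389, same as CramerV
```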
Goodman and Kruskal lambda
Goodman and Kruskal's lambda is an example of a measure based on the proportional reduction in variation. These kinds of measures aim to mimic R2, the coefficient of determination from linear regression:
- they take values in [0, 1],
- they express the fraction of variation explained by the independent variable.
Let's use the column variable as the independent one. Goodman and Kruskal's lambda formula is then: $$\lambda = \frac{L - \sum_j L_j}{L},$$ where:
- Lj is the sum of non-modal frequencies in the j-th column,
- L is the sum of non-modal frequencies in the "total" column.
This is best explained with an example. Let's look at Type and Origin, the latter being the independent variable:
table(Cars93$Type, Cars93$Origin)
##
## USA non-USA
## Compact 7 9
## Large 11 0
## Midsize 10 12
## Small 7 14
## Sporty 8 6
## Van 5 4
The mode of the USA column is 11, so L1 = 7 + 10 + 7 + 8 + 5 = 37. The mode of the non-USA column is 14, so L2 = 9 + 0 + 12 + 6 + 4 = 31. The "total" column is:
rowSums(table(Cars93$Type, Cars93$Origin))
## Compact Large Midsize Small Sporty Van
## 16 11 22 21 14 9
The mode is 22, so L = 16 + 11 + 21 + 14 + 9 = 71 and $$\lambda = \frac{71 - (37+31)}{71} \approx 0.042$$
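The same hand calculation can be written out in base R (a sketch of the formula above, not the DescTools implementation):

```r
library(MASS)

tab <- table(Cars93$Type, Cars93$Origin)
# non-modal frequencies within each column: column total minus column mode
Lj <- colSums(tab) - apply(tab, 2, max)   # 37 and 31
# non-modal frequencies of the row totals (the "total" column)
L <- sum(tab) - max(rowSums(tab))         # 93 - 22 = 71
(L - sum(Lj)) / L                         # about 0.042
```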
We can use the Lambda function from the DescTools package instead:
Lambda(Cars93$Type, Cars93$Origin, direction='row')
## [1] 0.04225352
The direction parameter determines where (in rows or in columns) the dependent variable is.
Origin explains only about 4% of the variation in Type. Notice from the formula that lambda defines variation as a dichotomy between belonging and not belonging to the largest group.
It's worth mentioning that lambda is zero whenever the modal category of each column is the same. Take a look at this table:
(lambdaTab<-cbind(c(0,100), c(49,51)))
## [,1] [,2]
## [1,] 0 49
## [2,] 100 51
chisq.test(lambdaTab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: lambdaTab
## X-squared = 62.279, df = 1, p-value = 2.981e-15
The modal category is the same for each column (it's the second row). So lambda must be zero despite the significant and visible association:
Lambda(lambdaTab, direction='row')
## [1] 0
If you would like to learn more about statistics in R, take DataCamp's Statistical Modeling in R (Part 1) course.