# Contingency Tables in R

# How to make a table

First, let's get some data. The `MASS` package contains data about 93 cars
on sale in the USA in 1993. They're stored in the `Cars93` object and
include 27 features for each car, some of which are categorical. So
let's load the `MASS` package and look at the types of vehicles included in `Cars93`:

```
library(MASS)
Cars93$Type
## [1] Small Midsize Compact Midsize Midsize Midsize Large Large
## [9] Midsize Large Midsize Compact Compact Sporty Midsize Van
## [17] Van Large Sporty Large Compact Large Small Small
## [25] Compact Van Midsize Sporty Small Large Small Small
## [33] Compact Sporty Sporty Van Midsize Large Small Sporty
## [41] Sporty Small Compact Small Small Sporty Midsize Midsize
## [49] Midsize Midsize Midsize Large Small Small Compact Van
## [57] Sporty Compact Midsize Sporty Midsize Small Midsize Small
## [65] Compact Van Midsize Compact Midsize Van Large Sporty
## [73] Small Compact Sporty Midsize Large Compact Small Small
## [81] Small Compact Small Small Sporty Midsize Van Small
## [89] Van Compact Sporty Compact Midsize
## Levels: Compact Large Midsize Small Sporty Van
```

We have 6 types of cars there. The `table` function tells us how many of each
type we have:

```
table(Cars93$Type)
##
## Compact Large Midsize Small Sporty Van
## 16 11 22 21 14 9
```

`prop.table` converts it into fractions:

```
prop.table(table(Cars93$Type))
##
## Compact Large Midsize Small Sporty Van
## 0.17204301 0.11827957 0.23655914 0.22580645 0.15053763 0.09677419
```

The same with the origin of cars:

```
table(Cars93$Origin)
##
## USA non-USA
## 48 45
prop.table(table(Cars93$Origin))
##
## USA non-USA
## 0.516129 0.483871
```

# How to make a contingency table

Great, we saw that our dataset contains a similar number of US and non-US cars and that the most prevalent types are Midsize and Small. However, maybe the US and non-US differ in type?

Let's look at the types of cars with respect to their origin. We can use
`table` again, but now with two arguments. The first becomes the row
variable and the second becomes the column variable:

```
table(Cars93$Type, Cars93$Origin)
##
## USA non-USA
## Compact 7 9
## Large 11 0
## Midsize 10 12
## Small 7 14
## Sporty 8 6
## Van 5 4
```

Now we see what everybody knows: Americans love large vehicles! The table
above shows the joint distribution of two categorical variables (`Type` and
`Origin`). Such tables are called *contingency tables*.

# How to get marginals from a contingency table

The `rowSums` and `colSums` functions are really self-explanatory:

```
(tab1<-table(Cars93$Type, Cars93$Origin))
##
## USA non-USA
## Compact 7 9
## Large 11 0
## Midsize 10 12
## Small 7 14
## Sporty 8 6
## Van 5 4
rowSums(tab1)
## Compact Large Midsize Small Sporty Van
## 16 11 22 21 14 9
colSums(tab1)
## USA non-USA
## 48 45
```
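Base R can also attach both marginals at once with the `addmargins` function. A quick sketch:

```
library(MASS)
tab1 <- table(Cars93$Type, Cars93$Origin)
# adds a "Sum" row and a "Sum" column holding rowSums, colSums,
# and the grand total (93) in the corner
addmargins(tab1)
```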

# How to get percentages from a contingency table

`prop.table` nested with `table` gives relative frequencies:

```
prop.table(table(Cars93$Type, Cars93$Origin))
##
## USA non-USA
## Compact 0.07526882 0.09677419
## Large 0.11827957 0.00000000
## Midsize 0.10752688 0.12903226
## Small 0.07526882 0.15053763
## Sporty 0.08602151 0.06451613
## Van 0.05376344 0.04301075
```

Conversion to percentages is just multiplication by 100:

```
prop.table(table(Cars93$Type, Cars93$Origin))*100
##
## USA non-USA
## Compact 7.526882 9.677419
## Large 11.827957 0.000000
## Midsize 10.752688 12.903226
## Small 7.526882 15.053763
## Sporty 8.602151 6.451613
## Van 5.376344 4.301075
```

Notice that this is a joint probability distribution, from which we can see that, e.g., about 7.5% of cars are small and of American origin.

More often, we are interested in the distribution of one variable within
groups created by another. Here, the distribution of car types among the US and
(separately) non-US cars seems interesting. To get this, we use the `margin`
argument of the `prop.table` function. It tells whether the grouping variable
is in the rows (`margin=1`) or in the columns (`margin=2`):

```
prop.table(table(Cars93$Type, Cars93$Origin), margin=2)*100
##
## USA non-USA
## Compact 14.583333 20.000000
## Large 22.916667 0.000000
## Midsize 20.833333 26.666667
## Small 14.583333 31.111111
## Sporty 16.666667 13.333333
## Van 10.416667 8.888889
```
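With `margin=1` we get row percentages instead, i.e., the distribution of `Origin` within each car type; each row then sums to 100:

```
library(MASS)
# e.g., 7 of the 16 compact cars, i.e. 43.75%, are of US origin
prop.table(table(Cars93$Type, Cars93$Origin), margin=1)*100
```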

Now we can easily see that small cars are twice as frequent in the non-US part of our dataset as in the US part.

Also notice that the percentages add up to 100 in each column, while in the
joint distribution table (the one without the `margin` argument), 100 was the
sum of the whole table.

```
(tab2<-prop.table(table(Cars93$Type, Cars93$Origin), margin=2)*100)
##
## USA non-USA
## Compact 14.583333 20.000000
## Large 22.916667 0.000000
## Midsize 20.833333 26.666667
## Small 14.583333 31.111111
## Sporty 16.666667 13.333333
## Van 10.416667 8.888889
colSums(tab2)
## USA non-USA
## 100 100
```

# Chi-squared test

The most common question that arises from contingency tables is whether the row
and column variables are independent. The most basic way to answer it is
to run a chi-squared test. It is covered in great detail in this
tutorial.
Let's check if `Type` and `Origin` are independent:

```
chisq.test(Cars93$Type, Cars93$Origin)
## Warning in chisq.test(Cars93$Type, Cars93$Origin): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: Cars93$Type and Cars93$Origin
## X-squared = 14.08, df = 5, p-value = 0.01511
```

Apparently, they're not, but we also got the
`Chi-squared approximation may be incorrect` warning. This is because the
chi-squared statistic follows the chi-squared distribution only
approximately. The more observations we have, the better the approximation
is. The `chisq.test` function throws the above warning whenever one of the expected
counts is lower than 5 (for what an "expected count" is, see the tutorial
linked above).
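The expected counts are stored in the object that `chisq.test` returns, so we can check which cells cause the trouble ourselves:

```
library(MASS)
# suppressWarnings only hides the warning discussed above
res <- suppressWarnings(chisq.test(Cars93$Type, Cars93$Origin))
# expected count = row total * column total / grand total;
# both cells of the Van row (9*48/93 and 9*45/93) fall below 5
res$expected
```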

# Fisher's exact test

Fisher's exact test is an alternative to the chi-squared test, used mainly when the chi-squared approximation is not satisfactory. Let's run it:

```
fisher.test(Cars93$Type, Cars93$Origin)
##
## Fisher's Exact Test for Count Data
##
## data: Cars93$Type and Cars93$Origin
## p-value = 0.007248
## alternative hypothesis: two.sided
```

The results are quite similar to those from the chi-squared test, but this is not always the case. One major drawback of Fisher's exact test is that for large tables (or large samples) it becomes computationally expensive.

# G-test

Another alternative is the so-called G-test. Its statistic is also
approximately chi-squared distributed, but for small samples this
approximation is closer than the one the chi-squared test uses. For the G-test
we can use the `GTest` function from the `DescTools` package. The results are again
quite similar to the two previous tests: `Type` and `Origin` are not
independent.

```
library(DescTools)
GTest(Cars93$Type, Cars93$Origin)
##
## Log likelihood ratio (G-test) test of independence without
## correction
##
## data: Cars93$Type and Cars93$Origin
## G = 18.362, X-squared df = 5, p-value = 0.002526
```
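As a sanity check, the G statistic can be computed directly from its definition, G = 2 Σ O·ln(O/E). This is a sketch of the formula, not of how `GTest` is implemented:

```
library(MASS)
obs <- table(Cars93$Type, Cars93$Origin)
# expected counts under independence
exp <- outer(rowSums(obs), colSums(obs)) / sum(obs)
# skip zero cells: in R, 0*log(0) gives NaN, but the limit is 0
2 * sum(ifelse(obs > 0, obs * log(obs / exp), 0))
```

The result agrees with the `G = 18.362` reported above.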

# Yates' correction

In 2x2 contingency tables, the chi-squared test may be enhanced with Yates'
continuity correction. It simply subtracts 0.5 from each | *Observed -
Expected* | term in the chi-squared statistic. Again, refer to this
tutorial
if you feel lost now. What is more, `R` applies Yates' correction
automatically whenever a 2x2 table is tested. Let's look at the availability of
manual transmission versions of US and non-US cars:

```
(tab3<-table(Cars93$Man.trans.avail, Cars93$Origin))
##
## USA non-USA
## No 26 6
## Yes 22 39
chisq.test(tab3)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab3
## X-squared = 15.397, df = 1, p-value = 8.712e-05
```

Yates' correction was applied automatically, and `R` told us about it.
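To see how much the correction changes things, you can switch it off with `correct=FALSE` and compare:

```
library(MASS)
tab3 <- table(Cars93$Man.trans.avail, Cars93$Origin)
# plain Pearson statistic, about 17.16 here,
# somewhat larger than the Yates-corrected 15.397
chisq.test(tab3, correct=FALSE)
```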

# 3 (or more) dimensional table

Let's look at `Man.trans.avail`, `Origin`, and `Type` at once. `table`
splits the dataset according to the variable provided as the third argument:

```
table(Cars93$Man.trans.avail, Cars93$Origin, Cars93$Type)
## , , = Compact
##
##
## USA non-USA
## No 2 0
## Yes 5 9
##
## , , = Large
##
##
## USA non-USA
## No 11 0
## Yes 0 0
##
## , , = Midsize
##
##
## USA non-USA
## No 9 4
## Yes 1 8
##
## , , = Small
##
##
## USA non-USA
## No 0 0
## Yes 7 14
##
## , , = Sporty
##
##
## USA non-USA
## No 0 0
## Yes 8 6
##
## , , = Van
##
##
## USA non-USA
## No 4 2
## Yes 1 2
```

`ftable` gives a more compact view:

```
ftable(Cars93$Man.trans.avail, Cars93$Origin, Cars93$Type)
## Compact Large Midsize Small Sporty Van
##
## No USA 2 11 9 0 0 4
## non-USA 0 0 4 0 0 2
## Yes USA 5 0 1 7 8 1
## non-USA 9 0 8 14 6 2
```

# Cochran-Mantel-Haenszel test

From the `table` results above, it should be clear that the relationship between
`Origin` and `Man.trans.avail` differs with `Type`. For small and sporty cars, for example, there is no association: every US, as well as every non-US, small or sporty car comes in a manual version. On the other hand,
most midsize US cars don't have a manual version, while most non-US cars
of this type do. To account for possibly different relationships in the
strata, the Cochran-Mantel-Haenszel test can be used:

```
mantelhaen.test(Cars93$Man.trans.avail, Cars93$Origin, Cars93$Type)
##
## Mantel-Haenszel chi-squared test with continuity correction
##
## data: Cars93$Man.trans.avail and Cars93$Origin and Cars93$Type
## Mantel-Haenszel X-squared = 8.0153, df = 1, p-value = 0.004638
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
## 2.226531 76.891307
## sample estimates:
## common odds ratio
## 13.08438
```

The third argument passed to `mantelhaen.test` identifies the strata.
Compare the above results to those without stratification (the Yates' correction
example above). The association is still present, but the evidence for it is much
weaker.

# Association measures

Once we have discovered some associations between variables, it's time to measure their strength. There is a plethora of possible measures. Many of them are described here. Now I'll focus on the two most widely used.

# Cramer's V

*V* is based on the chi-squared statistic:

$$ V = \sqrt{\frac{\chi^2/N}{\min(C-1, R-1)}}, $$ where:

- *N* is the grand total of the contingency table (the sum of all its cells),
- *C* is the number of columns,
- *R* is the number of rows.

*V* ∈ [0; 1]. The larger *V* is, the stronger the relationship between the
variables. *V* = 0 can be interpreted as independence (since *V* = 0 if
and only if *χ*^{2} = 0). The main drawback of *V* is its lack of a
precise interpretation. Is *V* = 0.6 a strong, moderate or weak
association?

The `CramerV` function from `DescTools` can calculate it for us:

```
CramerV(Cars93$Type, Cars93$Origin)
## [1] 0.3890967
```

Once again: is this a strong, moderate or weak association?
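We can verify the value by plugging the chi-squared statistic into the formula by hand:

```
library(MASS)
tab <- table(Cars93$Type, Cars93$Origin)
chi2 <- unname(suppressWarnings(chisq.test(tab))$statistic)  # 14.08
N <- sum(tab)                                                # grand total, 93
# min(C-1, R-1) = min(2-1, 6-1) = 1
sqrt((chi2 / N) / min(ncol(tab) - 1, nrow(tab) - 1))
```

This reproduces the 0.389 reported by `CramerV`.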

# Goodman and Kruskal lambda

Goodman and Kruskal's lambda is an example of a measure based on proportional reduction in variation. These kinds of measures aim to mimic
*R*^{2}, the coefficient of determination from linear regression:

- they take values from [0, 1]
- they express a fraction of variation explained by the independent variable.

Let's use the column variable as the independent one. Goodman and Kruskal's lambda formula is then: $$\lambda = \frac{L - \sum_j L_j}{L},$$ where:

- *L*_{j} is the sum of non-modal frequencies in the *j*-th column,
- *L* is the sum of non-modal frequencies in the "total" column.

This is best explained with an example. Let's look at `Type` and `Origin`,
the latter being the independent one:

```
table(Cars93$Type, Cars93$Origin)
##
## USA non-USA
## Compact 7 9
## Large 11 0
## Midsize 10 12
## Small 7 14
## Sporty 8 6
## Van 5 4
```

The mode of the US column is 11, so *L*_{1} = 7 + 10 + 7 + 8 + 5 = 37.
The mode of the non-US column is 14, so
*L*_{2} = 9 + 0 + 12 + 6 + 4 = 31. The "total" column is

```
rowSums(table(Cars93$Type, Cars93$Origin))
## Compact Large Midsize Small Sporty Van
## 16 11 22 21 14 9
```

The mode is 22, so *L* = 16 + 11 + 21 + 14 + 9 = 71 and
$$\lambda = \frac{71 - (37+31)}{71}=0.042$$
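The same arithmetic in R (`apply(tab, 2, max)` picks each column's modal frequency):

```
library(MASS)
tab <- table(Cars93$Type, Cars93$Origin)
Lj <- colSums(tab) - apply(tab, 2, max)  # non-modal frequencies per column: 37, 31
L <- sum(tab) - max(rowSums(tab))        # non-modal frequencies in "total" column: 71
(L - sum(Lj)) / L
## [1] 0.04225352
```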

We can use the `Lambda` function from the `DescTools` package instead:

```
Lambda(Cars93$Type, Cars93$Origin, direction='row')
## [1] 0.04225352
```

The `direction` parameter determines where the dependent variable is (in the
`row` or in the `column`).

`Origin` explains only about 4% of the variation in `Type`. Notice from the
formula that lambda defines variation as a dichotomy between belonging
and not belonging to the largest group.

It's worth mentioning that lambda is zero whenever the modal category is the same in every column. Take a look at this table:

```
(lambdaTab<-cbind(c(0,100), c(49,51)))
## [,1] [,2]
## [1,] 0 49
## [2,] 100 51
chisq.test(lambdaTab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: lambdaTab
## X-squared = 62.279, df = 1, p-value = 2.981e-15
```

The modal category is the same for each column (it's the second row), so lambda must be zero despite a significant and visible association:

```
Lambda(lambdaTab, direction='row')
## [1] 0
```

If you would like to learn more about statistics in R, take DataCamp's Statistical Modeling in R (Part 1) course.