# Contingency Analysis using R

In this tutorial, you'll learn with the help of an example how "Contingency Analysis" or "Chi-square test of independence" works and also how efficiently we can perform it using R.
Oct 2018  · 6 min read

Contingency analysis is a hypothesis test that is used to check whether two categorical variables are independent or not. In simple words, we are asking the question "Can we predict the value of one variable if we know the value of the other variable?". If the answer is yes, we can say that the variables under consideration are not independent. If the answer is no, then we can say that the variables under consideration are independent. The test makes use of contingency tables as a result of which it is known as 'Contingency Analysis'. It is also known as 'Chi-square test of independence' because the test statistic follows a chi-square distribution and the test is used to check whether two categorical variables are independent or not.

The null hypothesis of the test is that the two variables are independent and the alternative hypothesis is that the two variables are not independent.

Let us try to understand 'Contingency Analysis' or 'Chi-square test of independence' with the help of an example.

Suppose we want to know whether the choice of sport is independent of gender or not. So, we asked one hundred men and one hundred women which sport they prefer to play among archery, boxing, and cycling and summarizes the data obtained in the following two-way table.

The above table is known as the observed table as it contains the observed counts.

The Chi-square test of independence works by comparing the observed counts to the expected counts. Therefore, our next task is to derive the expected table containing the expected counts from the observed table. The expected table is what we expect the two-way table to look like if the two categorical variables are independent. From probability theory, we know that two events are said to be independent if their joint probability is equal to the product of their marginal probabilities. We will use this concept to calculate the expected counts for each of the six cells. Let us compute the expected count for the first cell. First, we will calculate the joint probability by multiplying the probability of being female (100/200) with the probability of preferring archery (45/200). Once we have the joint probability (100/200 * 45/200), if we multiply it by the sample size (200) we will get the expected count for the first cell which is 22.5. Similarly, we will compute the expected counts for the remaining five cells. The following table is the table that we want to see if gender and sports preference are independent.

Now that we have the expected as well as the observed counts, our next task is to check how different the observed counts are from the expected counts. For that purpose, we have to compute a test statistic known as the chi-square test static as it follows the chi-square distribution. Following is the formula for computing the value of the chi-square test statistic.

We can see from the above formula that the value of the chi-square test statistic can be 0 (when there is absolutely no difference between the observed and the expected counts) but can never be negative. This makes the Chi-square test of independence a one-tailed test.

Using the above formula let us calculate the value of the chi-square test statistic for our example. It is known as the observed value of the test statistic.

Now it is time to decide whether to reject the null hypothesis or not. We make the decision either by comparing the observed value of the test statistic to its critical value or by looking at the p-value. If the observed value of the test statistic exceeds its critical value or if the p-value is less than or equal to the significance level then we can reject the null hypothesis and conclude that there is a statistically significant relationship between the two categorical variables that is they are not independent. If we know the significance level (usually 0.05) and the degrees of freedom we can get the critical value from the chi-square table. The significance level is the probability of rejecting a true null hypothesis. For a table with r rows and c columns, degrees of freedom can be calculated by the following formula.

Therefore, for our example, we have 2 degrees of freedom.

From the below table we can see that the critical value of the test statistic is 5.99 for a significance level of 0.05 and 2 degrees of freedom.

Since the observed value of the test statistic is greater than its critical value (19.798 > 5.99), we can reject the null hypothesis and conclude that choice of sport is not independent of gender.

## Let us now see how to perform the Chi-square test of independence using R programming language

It is very easy to perform the Chi-square test of independence using the built-in function `chisq.test()`.

Following is the observed table.

``````observed_table <- matrix(c(35, 15, 50, 10, 30, 60), nrow = 2, ncol = 3, byrow = T)
rownames(observed_table) <- c('Female', 'Male')
colnames(observed_table) <- c('Archery', 'Boxing', 'Cycling')
observed_table
``````
``````##        Archery Boxing Cycling
## Female      35     15      50
## Male        10     30      60``````

In order to perform the test, we need to apply the `chisq.test()` function to the observed table.

``````X <- chisq.test(observed_table)
X
``````
``````##
##  Pearson's Chi-squared test
##
## data:  observed_table
## X-squared = 19.798, df = 2, p-value = 5.023e-05``````

From the above result, we can see that p-value is less than the significance level (0.05). Therefore, we can reject the null hypothesis and conclude that the two variables (gender & sport preference) are not independent.

If we want to see the expected table, we can also do that.

``````X\$expected
``````
``````##        Archery Boxing Cycling
## Female    22.5   22.5      55
## Male      22.5   22.5      55``````

I hope you enjoyed this article. If you would like to learn more in R, take DataCamp's Statistical Modeling in R (Part 1) course.

### Introduction to R

Beginner
4 hours
2,395,635
Master the basics of data analysis in R, including vectors, lists, and data frames, and practice R with real data sets.
See Details

### Intermediate R

Beginner
6 hours
532,094
Continue your journey to becoming an R ninja by learning about conditional statements, loops, and vector functions.

### Introduction to Statistical Modeling in R

Beginner
4 hours
28,526
This course is designed to get you up to speed with the most important and powerful methodologies in statistics.
See all courses
Related

### How to Become a Data Scientist in 8 Steps

Find out everything you need to know about becoming a data scientist, and find out whether it’s the right career for you!

12 min

### How to Become a Data Analyst in 2023: 5 Steps to Start Your Career

Learn how to become a data analyst and discover everything you need to know about launching your career, including the skills you need and how to learn them.

Elena Kosourova

18 min

### Predicting FIFA World Cup Qatar 2022 Winners

Learn to use Elo ratings to quantify national soccer team performance, and see how the model can be used to predict the winner of FIFA World Cup Qatar 2022.

Arne Warnke

### How Data Science is Changing Soccer

With the Fifa 2022 World Cup upon us, learn about the most widely used data science use-cases in soccer.

Richie Cotton

### Sports Analytics: How Different Sports Use Data Analytics

Discover how sports analytics works and how different sports use data to provide meaningful insights. Plus, discover what it takes to become a sports data analyst.

Kurtis Pykes

13 min

### How to Write a Bash Script: A Simple Bash Scripting Tutorial

Discover the basics of bash scripting and learn how to write a bash script.

Kurtis Pykes

5 min

See MoreSee More