
Contingency Analysis using R

In this tutorial, you'll learn through a worked example how "Contingency Analysis", also called the "Chi-square test of independence", works and how to perform it efficiently in R.

Updated Oct 2018 · 6 min read

Contingency analysis is a hypothesis test used to check whether two categorical variables are independent. In simple words, we are asking: "Does knowing the value of one variable help us predict the value of the other?" If the answer is yes, the variables under consideration are not independent. If the answer is no, they are independent. The test uses contingency tables, which is why it is known as 'Contingency Analysis'. It is also known as the 'Chi-square test of independence' because the test statistic follows a chi-square distribution and the test checks whether two categorical variables are independent.

The null hypothesis of the test is that the two variables are independent and the alternative hypothesis is that the two variables are not independent.

Let us try to understand 'Contingency Analysis' or 'Chi-square test of independence' with the help of an example.

Suppose we want to know whether the choice of sport is independent of gender. We asked one hundred men and one hundred women which sport they prefer to play among archery, boxing, and cycling, and summarized the data in the following two-way table.

          Archery  Boxing  Cycling  Total
Female         35      15       50    100
Male           10      30       60    100
Total          45      45      110    200

The above table is known as the observed table as it contains the observed counts.

The Chi-square test of independence works by comparing the observed counts to the expected counts. Therefore, our next task is to derive the expected table, containing the expected counts, from the observed table. The expected table is what we expect the two-way table to look like if the two categorical variables are independent.

From probability theory, we know that two events are independent if their joint probability equals the product of their marginal probabilities. We will use this concept to calculate the expected count for each of the six cells.

Let us compute the expected count for the first cell. First, we calculate the joint probability by multiplying the probability of being female (100/200) by the probability of preferring archery (45/200). Multiplying this joint probability (100/200 * 45/200) by the sample size (200) gives the expected count for the first cell, which is 22.5. The expected counts for the remaining five cells are computed in the same way. The following table is what we would expect to see if gender and sport preference were independent.

          Archery  Boxing  Cycling
Female       22.5    22.5       55
Male         22.5    22.5       55
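The expected-count calculation described above can be sketched in R as follows. This is only an illustration of the hand calculation; the variable names (`observed`, `row_prob`, `col_prob`, `expected`) are chosen here for the example and are not part of the original walkthrough.

```r
# Observed counts from the example (rows: gender, columns: sport)
observed <- matrix(c(35, 15, 50,
                     10, 30, 60),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Female", "Male"),
                                   c("Archery", "Boxing", "Cycling")))

n <- sum(observed)                 # sample size (200)
row_prob <- rowSums(observed) / n  # marginal probabilities of gender
col_prob <- colSums(observed) / n  # marginal probabilities of sport

# Under independence, joint probability = product of the marginals;
# multiplying by the sample size gives the expected counts
expected <- outer(row_prob, col_prob) * n
expected
```

The first cell works out to 100/200 * 45/200 * 200 = 22.5, matching the value computed by hand.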

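Now that we have the expected as well as the observed counts, our next task is to check how different the observed counts are from the expected counts. For that purpose, we compute a test statistic known as the chi-square test statistic because, under the null hypothesis, it approximately follows a chi-square distribution. The formula for the chi-square test statistic is

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_{ij}$ and $E_{ij}$ are the observed and expected counts in row $i$ and column $j$ of a table with $r$ rows and $c$ columns.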

We can see from the above formula that the value of the chi-square test statistic can be 0 (when there is absolutely no difference between the observed and the expected counts) but can never be negative. This makes the Chi-square test of independence a one-tailed test.

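Using the above formula, let us calculate the value of the chi-square test statistic for our example. The result is known as the observed value of the test statistic.

$$\chi^2 = \frac{(35 - 22.5)^2}{22.5} + \frac{(15 - 22.5)^2}{22.5} + \frac{(50 - 55)^2}{55} + \frac{(10 - 22.5)^2}{22.5} + \frac{(30 - 22.5)^2}{22.5} + \frac{(60 - 55)^2}{55} \approx 19.798$$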
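As a rough check on the hand calculation, the statistic can also be computed cell by cell in R. The sketch below assumes the observed and expected counts of the example; the names `contributions` and `chi_sq` are illustrative.

```r
# Observed and expected counts from the example
observed <- matrix(c(35, 15, 50, 10, 30, 60), nrow = 2, byrow = TRUE)
expected <- matrix(c(22.5, 22.5, 55, 22.5, 22.5, 55), nrow = 2, byrow = TRUE)

# Contribution of each cell: (observed - expected)^2 / expected
contributions <- (observed - expected)^2 / expected

# The test statistic is the sum of the six contributions
chi_sq <- sum(contributions)
round(chi_sq, 3)
## [1] 19.798
```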

Now it is time to decide whether to reject the null hypothesis or not. We make the decision either by comparing the observed value of the test statistic to its critical value or by looking at the p-value. If the observed value of the test statistic exceeds its critical value, or if the p-value is less than or equal to the significance level, we can reject the null hypothesis and conclude that there is a statistically significant relationship between the two categorical variables, that is, they are not independent.

If we know the significance level (usually 0.05) and the degrees of freedom, we can get the critical value from the chi-square table. The significance level is the probability of rejecting a true null hypothesis. For a table with r rows and c columns, the degrees of freedom are given by

$$df = (r - 1)(c - 1)$$

Therefore, for our example with 2 rows and 3 columns, we have (2 - 1)(3 - 1) = 2 degrees of freedom.

From the chi-square table, we can see that the critical value of the test statistic is 5.99 for a significance level of 0.05 and 2 degrees of freedom.

Since the observed value of the test statistic is greater than its critical value (19.798 > 5.99), we can reject the null hypothesis and conclude that choice of sport is not independent of gender.
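Instead of reading a printed chi-square table, the critical value and the p-value can be obtained from R's built-in distribution functions `qchisq()` and `pchisq()`. The following is a minimal sketch of the decision step; the variable names are illustrative and the observed statistic is entered by hand.

```r
alpha <- 0.05              # significance level
df <- (2 - 1) * (3 - 1)    # (rows - 1) * (columns - 1)
chi_sq <- 19.798           # observed value of the test statistic

# Critical value: the point with probability alpha in the upper tail
critical_value <- qchisq(1 - alpha, df = df)
round(critical_value, 2)
## [1] 5.99

# p-value: upper-tail probability at or beyond the observed statistic
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Both decision rules lead to rejecting the null hypothesis here
chi_sq > critical_value
p_value <= alpha
```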

Let us now see how to perform the Chi-square test of independence using the R programming language.

It is very easy to perform the Chi-square test of independence using the built-in function chisq.test().

First, we enter the observed table as a matrix, filling it row by row.

observed_table <- matrix(c(35, 15, 50, 10, 30, 60), nrow = 2, ncol = 3, byrow = TRUE)
rownames(observed_table) <- c('Female', 'Male')
colnames(observed_table) <- c('Archery', 'Boxing', 'Cycling')
observed_table

##        Archery Boxing Cycling
## Female      35     15      50
## Male        10     30      60
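The row and column totals used in the hand calculation can be recovered from the matrix with base R's `addmargins()`. This is an optional sanity check rather than a required step of the test; the name `totals` is illustrative.

```r
# Rebuild the observed table so this snippet runs on its own
observed_table <- matrix(c(35, 15, 50, 10, 30, 60), nrow = 2, ncol = 3, byrow = TRUE)
rownames(observed_table) <- c('Female', 'Male')
colnames(observed_table) <- c('Archery', 'Boxing', 'Cycling')

# Append a "Sum" row and a "Sum" column holding the marginal totals
totals <- addmargins(observed_table)
totals
```

The bottom-right cell holds the sample size (200), and the margins give the 100/200 and 45/200 proportions used for the first expected count.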

In order to perform the test, we need to apply the chisq.test() function to the observed table.

X <- chisq.test(observed_table)
X

##
##  Pearson's Chi-squared test
##
## data:  observed_table
## X-squared = 19.798, df = 2, p-value = 5.023e-05

From the above result, we can see that the p-value (5.023e-05) is less than the significance level (0.05). Therefore, we can reject the null hypothesis and conclude that the two variables (gender and sport preference) are not independent.
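The object returned by `chisq.test()` is a list, so the individual pieces of the result can be pulled out and compared with the significance level in code. The sketch below assumes the same observed table; `alpha` is an illustrative name.

```r
observed_table <- matrix(c(35, 15, 50, 10, 30, 60), nrow = 2, ncol = 3, byrow = TRUE)
X <- chisq.test(observed_table)

X$statistic   # observed value of the test statistic (X-squared)
X$parameter   # degrees of freedom
X$p.value     # p-value

# Decision rule: reject the null hypothesis if p-value <= alpha
alpha <- 0.05
X$p.value <= alpha
## [1] TRUE
```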

If we want to see the expected table, we can also do that.

X$expected

##        Archery Boxing Cycling
## Female    22.5   22.5      55
## Male      22.5   22.5      55
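The result also stores the Pearson residuals, (observed - expected) / sqrt(expected), in `X$residuals`. Squaring and summing them reproduces the test statistic, which ties the R output back to the formula used in the hand calculation. A small sketch, again assuming the same observed table:

```r
observed_table <- matrix(c(35, 15, 50, 10, 30, 60), nrow = 2, ncol = 3, byrow = TRUE)
rownames(observed_table) <- c('Female', 'Male')
colnames(observed_table) <- c('Archery', 'Boxing', 'Cycling')
X <- chisq.test(observed_table)

# Pearson residuals: positive where observed exceeds expected
X$residuals

# Squared residuals are the per-cell contributions to the statistic
cell_contributions <- X$residuals^2
sum(cell_contributions)   # equals X$statistic
```

Cells with large squared residuals are the ones where the observed counts depart most from what independence would predict.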

I hope you enjoyed this article. If you would like to learn more in R, take DataCamp's Statistical Modeling in R (Part 1) course.
