Chi-Square Test in R: A Complete Guide
Understanding how to analyze and interpret data is an invaluable skill for data professionals. Different statistical tests serve different purposes, and the chi-square test is used in a specific context: determining whether categorical variables are associated. Researchers ask this question often, which is why the chi-square test is one of the most widely used statistical tests.
This tutorial introduces the chi-square test, its different types, and the steps to perform it using the R programming language. By the end of this guide, you’ll be equipped with the knowledge and skills to confidently apply the chi-square test to your own data and interpret the results.
If you’re new to the R programming language, you may want to check out the beginner-friendly Data Analyst with R career track to familiarize yourself with the language through hands-on data analysis examples.
Main Steps: How to Perform a Chi-Square Test in R
To perform a chi-square test in R, follow these steps:

Step 1: Prepare your data in a contingency table format.

Step 2: Use the chisq.test() function to apply the chi-square test.
Here is a quick example demonstrating it using sample data:
# Step 1: Creating a contingency table
data <- matrix(c(10, 20, 30, 40), nrow = 2)
# Step 2: Applying the chi-square test function
result <- chisq.test(data)
# Viewing the result
print(result)
This code snippet creates a 2x2 contingency table and performs the chi-square test. The result will show the test statistic, degrees of freedom, and p-value. Note that for 2x2 tables, chisq.test() applies Yates' continuity correction by default.
What is a Chi-Square Test?
A chi-square test is a statistical test used to determine if there is a significant association between categorical variables. It compares the observed frequencies in each category with the frequencies that would be expected if there were no association between the variables.
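To make the comparison of observed and expected frequencies concrete, here is a minimal sketch that computes the expected counts and the test statistic by hand and checks them against chisq.test(). The counts are made up for illustration, and correct = FALSE is passed because R applies Yates' continuity correction to 2x2 tables by default.

```r
# Illustrative 2x2 table of counts (made-up numbers)
observed <- matrix(c(10, 20, 30, 40), nrow = 2)

# Expected count for each cell if the variables were independent:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
manual_statistic <- sum((observed - expected)^2 / expected)

# chisq.test() without the continuity correction should agree with the manual value
builtin <- chisq.test(observed, correct = FALSE)

print(expected)
print(manual_statistic)
print(unname(builtin$statistic))
```

The larger the gap between observed and expected counts, the larger the statistic and the smaller the p-value.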
Types of chi-square tests
There are two main types of chi-square tests:
 Chi-Square Test of Independence: This test helps determine whether two categorical variables are independent or whether there’s a relationship between them. For example, you might want to know if gender is associated with voting preference.
 Chi-Square Test of Goodness of Fit: This test checks if the sample data fits a hypothesized population distribution. For example, you might want to see if a die is fair by comparing the observed frequency of each face with the expected frequency if the die were fair.
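Both types can be run with the same chisq.test() function; what changes is the input. The sketch below uses made-up counts: a vector of counts plus hypothesized probabilities (the p argument) for goodness of fit, and a two-way table of counts for independence. The group and preference labels are hypothetical.

```r
# Hypothetical counts of each face from 60 rolls of a die
rolls <- c(8, 12, 9, 11, 10, 10)

# Goodness of fit: compare the counts with equal probabilities of 1/6 per face
fit <- chisq.test(rolls, p = rep(1 / 6, 6))
print(fit)

# Independence: pass a two-way table of counts instead (made-up numbers)
votes <- matrix(c(30, 20, 25, 25), nrow = 2,
                dimnames = list(group = c("A", "B"),
                                preference = c("X", "Y")))
indep <- chisq.test(votes)
print(indep)
```

For the goodness-of-fit test, the degrees of freedom are the number of categories minus one; for the test of independence, they depend on the table dimensions.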
Assumptions of the chi-square test
To ensure the validity of the chi-square test, certain assumptions must be met:
 The data must be in the form of frequencies or counts of cases.
 The categories should be mutually exclusive.
 For the chi-square test of independence, the expected frequency in each cell should be at least 5.
 For the goodness-of-fit test, every expected frequency should be at least 1, with no more than 20% of expected frequencies being less than 5.
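The expected-frequency assumptions can be checked directly from the object that chisq.test() returns. The following is a rough sketch on a made-up table with small counts; the variable names are illustrative.

```r
# Made-up 2x2 table with fairly small counts
small_table <- matrix(c(3, 7, 12, 18), nrow = 2)

# chisq.test() warns when expected counts are small; the warning is
# suppressed here because we only want to inspect the expected counts
expected <- suppressWarnings(chisq.test(small_table))$expected

# Smallest expected count and the share of cells with expected count below 5
min_expected <- min(expected)
share_below_5 <- mean(expected < 5)

print(expected)
print(min_expected)
print(share_below_5)
```

If the smallest expected count falls below the thresholds listed above, the chi-square approximation may be unreliable for that table.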
Practical applications of chi-square tests
Chi-square tests are widely used in academia and industry, especially for testing hypotheses about the independence of categorical variables. Some of these practical applications are:
 Market Research: Some applications include analyzing if customer preferences for a product vary across different age groups and income levels or determining if different marketing campaigns are equally effective across demographic segments.
 Medical Research: Common use cases are studying the association between lifestyle factors (e.g., smoking, exercise) and the incidence of diseases (e.g., lung cancer, heart disease) or evaluating whether different treatment groups show different recovery rates in clinical trials.
 Quality Control: Commonly used to examine whether product defects are independent of the manufacturing process or specific production lines and to compare the quality of products from different suppliers to determine if there is a significant difference in defect rates.
 Education: The tests are often used to determine if there is a significant difference in pass rates between students from different schools or teaching methods and to evaluate if introducing a new curriculum improves student performance across subjects.
These are only some of the many applications in academia and industry; the same approach extends to other domains and fields.
Performing a Chi-Square Test in R: An Example
The best way to learn to perform chi-square tests is through an example where we apply the test to a dataset. We’ll use the Anemia Levels in Nigeria dataset, which can be downloaded from Kaggle. The dataset comes from the 2018 Nigeria Demographic and Health Surveys (NDHS). It explores the impact of mothers’ age and socioeconomic factors on anemia levels among children aged 0–59 months across Nigeria’s 36 states and the Federal Capital Territory.
Let’s load the dataset in R and examine a sample to understand the data better. To read the CSV file, we’ll use the read_csv() function from the readr package, which needs to be installed first.
# Install and load the readr package
install.packages('readr')
library(readr)
# Load the dataset from the CSV file
dataset <- read_csv("children anemia.csv")
# Display the first few rows of the dataset
head(dataset)
# Rename a column
colnames(dataset)[colnames(dataset) == "Anemia level...8"] <- "Anemia level"
# Display the column names
colnames(dataset)
In addition to a sample of the dataset, we’ll see the columns of the dataset below:
Columns in the dataset. Image by Author.
Among them, we’ll pick these two columns to evaluate if there is a relationship between them.

Highest educational level: This column categorizes the mother’s education into “No education,” “Primary,” “Secondary,” and “Higher” levels.

Anemia level: This column indicates the anemia level of the child, such as “Moderate,” “Severe,” or “Not anemic.”
Step 1: Creating a contingency table
A contingency table, also known as a cross-tabulation or crosstab, shows how the values of two or more categorical variables are distributed across their respective categories.
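As a quick illustration of what a contingency table looks like, the sketch below builds one from a tiny made-up data frame (the survey object and its values are hypothetical, not taken from the dataset) using the base R functions table() and xtabs().

```r
# Small made-up data frame with two categorical variables
survey <- data.frame(
  education = c("Primary", "Primary", "Secondary", "Higher", "Secondary", "Higher"),
  anemia    = c("Mild", "Not anemic", "Mild", "Not anemic", "Not anemic", "Not anemic")
)

# Cross-tabulate with table(): rows are education levels, columns are anemia levels
tab <- table(survey$education, survey$anemia)
print(tab)

# The same counts using the formula interface of xtabs()
tab_formula <- xtabs(~ education + anemia, data = survey)
print(tab_formula)
```

Each cell holds the number of rows in the data frame that share that combination of categories.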
We’ll select the two columns of interest from the dataset and convert them to the required contingency table format. We’ll use the widely used dplyr package for these operations.
# Install and load the package
install.packages('dplyr')
library(dplyr)
# Select the columns of interest (backticks are required because the names contain spaces)
selected_data <- dataset %>% select(`Highest educational level`, `Anemia level`)
# Create a contingency table for Highest educational level and Anemia level
contingency_table <- table(selected_data$`Highest educational level`, selected_data$`Anemia level`)
# View the contingency table
print(contingency_table)
The resulting contingency table looks like this:
Contingency table. Image by Author.
Step 2: Applying the chi-square test function
Since we have the dataset in the contingency table format we want, we can simply apply the chisq.test() function. No libraries need to be loaded to call this function, as it’s part of the stats package, which ships with R and is loaded by default.
# Perform chi-square test
chi_square_test <- chisq.test(contingency_table)
# View the results
print(chi_square_test)
The output will look like:
Pearson’s chi-square test results. Image by Author.
That’s it! We have performed the chi-square test in two simple steps. Next, how do we interpret the results?
Formulating Hypotheses & Interpreting the Results
Hypotheses clearly state what we are testing and establish a framework for interpreting the results. In simpler terms, the hypotheses we formulate give us a clear question to answer, and the chi-square test helps us determine whether the observed data are consistent with the null hypothesis or provide evidence against it.
Hypotheses for the chi-square test
When performing a chi-square test, we typically establish two hypotheses:
 Null Hypothesis (H0): The null hypothesis states that there is no association between the two categorical variables being tested. It assumes that any observed differences in the data are due to random chance rather than a true relationship.
 Alternative Hypothesis (H1): The alternative hypothesis states that there is an association between the two variables. It suggests that the observed differences are not due to random chance alone and that there is a relationship between the variables.
Applying the concepts of null and alternative hypotheses to the variables we have performed the chi-square test on, we can formulate the hypotheses as:
 Null Hypothesis (H0): The null hypothesis is that there is no association between the mother’s highest educational level and the child’s anemia level. This means we assume that the likelihood of a child having anemia is independent of the mother’s education level.
 Alternative Hypothesis (H1): The alternative hypothesis is that there is an association between the mother’s highest educational level and the child’s anemia level. This implies that the likelihood of the child having anemia differs across the mother’s education levels. Note that the test detects association, not causation.
Interpreting the output of the chi-square test
Now that we’ve formed our hypotheses, we can interpret the results in their context:

Chi-Square Statistic (X-squared): The chi-square test statistic is 142.86. This value measures the discrepancy between the observed frequencies in the contingency table and the frequencies we would expect if there were no association between the variables.
Degrees of Freedom (df): The degrees of freedom for this test is 9. This is calculated as (number of rows - 1) * (number of columns - 1); with four education levels, a df of 9 corresponds to a 4x4 table, i.e., (4 - 1) * (4 - 1) = 9.
P-Value: The p-value is less than 2.2e-16, which is extremely small. The p-value is the probability of observing a chi-square statistic as extreme as, or more extreme than, 142.86 if the null hypothesis were true.
We reject the null hypothesis since the p-value is much smaller than common significance levels (e.g., 0.05, 0.01, or even 0.001). This provides strong evidence of a significant association between the mother’s education level and the child’s anemia status. In other words, the chi-square test results indicate that the likelihood of a child having anemia is significantly associated with the mother’s level of education.
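The same interpretation steps can be scripted by reading the values from the test object rather than from the printed output. The sketch below uses a made-up 2x3 table (not the anemia data) and an assumed significance level of 0.05.

```r
# Made-up 2x3 table of counts, used only to illustrate the workflow
counts <- matrix(c(30, 10, 20, 20, 10, 30), nrow = 2)
result <- chisq.test(counts)

statistic   <- unname(result$statistic)  # X-squared
deg_freedom <- unname(result$parameter)  # degrees of freedom
p_value     <- result$p.value

# Degrees of freedom should equal (rows - 1) * (columns - 1)
df_manual <- (nrow(counts) - 1) * (ncol(counts) - 1)

# Compare the p-value with the chosen significance level
alpha <- 0.05
if (p_value < alpha) {
  message("Reject H0: evidence of an association")
} else {
  message("Fail to reject H0: no evidence of an association")
}
```

Choosing alpha before looking at the data keeps the decision rule from being tuned to the result.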
Additional Analysis: Accessing Values from chisq.test()
Beyond hypothesis testing, we can retrieve certain values from the object returned by the chisq.test() function:
Observed counts
These represent the actual counts of children with different anemia levels across each mother’s education level. The observed counts can be retrieved with the following code:
# Observed counts
observed_counts <- chi_square_test$observed
print(observed_counts)
The output is as follows:
Observed counts. Image by Author.
Expected counts
These counts are calculated under the assumption that there is no association between the mother’s education level and the child’s anemia status. The expected counts can be retrieved with the following code:
# Expected counts
expected_counts <- chi_square_test$expected
print(round(expected_counts, 2))
The output is as follows:
Expected counts. Image by Author.
Pearson residuals
These residuals help identify the largest discrepancies between observed and expected counts, indicating which cells contribute most to the chi-square statistic. The Pearson residuals can be retrieved with the following code:
# Pearson residuals
pearson_residuals <- chi_square_test$residuals
print(round(pearson_residuals, 2))
The output is as follows:
Residuals output. Image by Author.
Let us try to understand what these residual numbers mean:

Positive Residuals: Positive residuals indicate that the observed count is higher than expected. For example, a residual of 5.96 for "Not anemic" in the "Higher" education group means that there are substantially more children who are not anemic than expected among mothers with higher education.
Negative Residuals: Negative residuals indicate that the observed count is lower than expected. For instance, a residual of -5.74 for "Moderate" anemia in the "Higher" education group suggests that there are substantially fewer moderately anemic children than expected among mothers with higher education.
Large Residuals: Large positive or negative residuals suggest a significant deviation from what was expected. These cells contribute most to the chi-square statistic. For example, the large positive residual for “Not anemic” in the “Higher” education group and the large negative residual for “Moderate” anemia in the same group indicate strong deviations in the anemia levels of children based on the mother’s education level.

Small Residuals: Small residuals (close to 0) suggest that the observed counts are close to the expected counts, indicating a weaker deviation. For example, the residuals for “Primary” education across the anemia levels are relatively smaller, indicating that the observed and expected counts are closer for this group.
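To see where these numbers come from, the sketch below recomputes Pearson residuals by hand as (observed - expected) / sqrt(expected) on a made-up table and compares them with the residuals component. It also prints stdres, the standardized residuals that chisq.test() returns; flagging cells beyond 2 in absolute value is a common rule of thumb, used here as an assumption rather than a formal test.

```r
# Made-up 2x3 table of counts (not the anemia data)
counts <- matrix(c(30, 10, 20, 20, 10, 30), nrow = 2)
result <- chisq.test(counts)

# Pearson residual for each cell: (observed - expected) / sqrt(expected)
manual_residuals <- (result$observed - result$expected) / sqrt(result$expected)
print(round(manual_residuals, 2))

# Squared Pearson residuals add up to the chi-square statistic
residual_sum_sq <- sum(manual_residuals^2)

# Standardized residuals also account for the row and column totals
print(round(result$stdres, 2))

# Rule-of-thumb flag for cells that deviate strongly from expectation
flagged <- abs(result$stdres) > 2
print(flagged)
```

The sign of a residual gives the direction of the deviation, and its magnitude gives the strength.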
Contribution diagram
Based on the values extracted above, the contribution of each cell to the chi-square statistic can be calculated by the code below and converted into a percentage:
# Calculate contribution to chi-square statistic
contributions <- (observed_counts - expected_counts)^2 / expected_counts
# Calculate percentage contributions
total_chi_square <- chi_square_test$statistic
percentage_contributions <- 100 * contributions / total_chi_square
# Print percentage contributions
print("Percentage Contributions:")
print(round(percentage_contributions, 2))
The output we’ll see is as follows:
Percentage contributions. Image by Author.
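As a sanity check on this kind of calculation, each cell's contribution equals its squared Pearson residual, so the percentages should add up to 100. The sketch below verifies this on a made-up table with hypothetical labels and ranks the cells by contribution.

```r
# Made-up 2x3 table with hypothetical labels (not the anemia data)
counts <- matrix(c(30, 10, 20, 20, 10, 30), nrow = 2,
                 dimnames = list(group = c("g1", "g2"),
                                 level = c("low", "mid", "high")))
result <- chisq.test(counts)

# Each cell's contribution is its squared Pearson residual
contributions <- result$residuals^2
percentage <- 100 * contributions / unname(result$statistic)

# The percentages should add up to 100
total_percentage <- sum(percentage)
print(total_percentage)

# Rank the cells from largest to smallest contribution
ranked <- as.data.frame(as.table(percentage))
ranked <- ranked[order(-ranked$Freq), ]
print(ranked)
```

Ranking the cells this way points to the category combinations that drive the overall test result.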
The calculated contributions can be visualized as a heatmap. We will use a package called pheatmap to do so, after installing and loading the package.
# Install and load the pheatmap package
install.packages("pheatmap")
library(pheatmap)
# Create heatmap for percentage contributions
pheatmap(percentage_contributions,
display_numbers = TRUE,
cluster_rows = FALSE,
cluster_cols = FALSE,
main = "Percentage Contribution to Chi-Square Statistic")
The resulting output is as follows:
Percentage contribution to chi-square statistic heatmap. Image by Author.
Once the chi-square test has shown that an association exists, a contribution heatmap like the one above is a useful starting point for further analysis of which category combinations drive that association.
Conclusion
This tutorial introduced you to the chi-square test, its different types, and the underlying assumptions. We then worked through an example in R, performing the test, interpreting the results, and visualizing each cell's contribution.
Chi-square tests are commonly used in hypothesis testing and in statistics more generally. Consider taking one of these courses to solidify your understanding of data analytics and statistics using R:
 Introduction to Statistics in R course
 Exploratory Data Analysis in R course
 Hypothesis Testing in R course
As a senior data scientist, I design, develop, and deploy large-scale machine-learning solutions to help businesses make better data-driven decisions. As a data science writer, I share learnings, career advice, and in-depth hands-on tutorials.
Frequently Asked Questions
What is the purpose of a chi-square test?
The chi-square test is used to determine if there is a significant association between two categorical variables.
Can the chi-square test be used with small sample sizes?
It's generally not recommended because the test requires an expected frequency of at least 5 in each cell to produce reliable results.
What do Pearson residuals indicate in a chi-square test?
Pearson residuals show how much each cell in the contingency table contributes to the overall chi-square statistic. Positive values indicate higher observed counts than expected, and negative values indicate lower.
How do I create a contingency table in R for the chi-square test?
Use the table() or xtabs() functions to create a contingency table from your categorical variables.
What if my data doesn’t meet the assumptions for a chi-square test?
Consider using Fisher's Exact Test, which is more appropriate for small sample sizes or when expected frequencies are low.
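As a brief sketch of the Fisher's Exact Test alternative mentioned in the last answer, base R provides fisher.test(), which takes the same kind of contingency table. The counts below are made up and deliberately small.

```r
# Made-up 2x2 table with counts too small for the chi-square approximation
small_table <- matrix(c(3, 1, 1, 3), nrow = 2)

# Fisher's exact test does not rely on the large-sample approximation
fisher_result <- fisher.test(small_table)
fisher_p <- fisher_result$p.value
print(fisher_p)
```

The p-value is interpreted the same way as in the chi-square test: a small value is evidence against independence.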