Skip to main content

Chi-square Test in Spreadsheets

In this tutorial, you'll learn how to perform the chi-square test in spreadsheets.
Nov 2019  · 10 min read

If you are interested in learning more about statistics in spreadsheets, take DataCamp's Statistics in Spreadsheets course.

In 1900 Karl Pearson developed published a paper on the χ2 test, which is considered to be one of the foundations of modern statistics. In this paper, Pearson investigated the test of goodness of fit (Source). Chi-square test is a non-parametric test (a non-parametric statistical test is a test whose model does not specify conditions about the parameter of the population from which the sample is drawn.). It is used for identifying the relationship between a categorical variable and denoted by χ2.

The chi-squared test is often constructed from a sum of squared errors or through the sample variance. This is a statistical hypothesis test where the sample distribution of test statistics is a chi-squared when the null hypothesis is true. It arises from the assumption of independent, normally distributed data.

In this tutorial, you are going to learn about the chi-square test, and the following topics will be covered:

Chi-square test

The chi-square test is a statistical test that can be used to determine what observed frequencies are significantly different from expected frequencies or not in one or more categories (Source). In the mathematical expression, it is the ratio of experimentally observed result/frequencies (O) and the theoretically expected results (E) based on certain hypotheses, or it is calculated by dividing the overall deviation from the observed and expected frequencies by the expected frequencies.

Chi-square Test in Spreadsheets

If there is no difference in observed and expected frequencies, then the chi-square value would be zero. If there is a difference, then the value of chi-square would be more than zero.

While comparing the calculated value with the table values, you have to calculate the degree of freedom. Then you will be able to compare and draw a conclusion.

Chi-square Test in Spreadsheets

Chi-square probability distribution graph: Image Source:

There are three types of chi-square tests:

  • Goodness of fit
  • Test of independence
  • Test of homogeneity

Terminology

  • Contingency table: This is a cross table or two-way table. You use to show the one variable in a row and another in a column with their frequency count. It is a type of frequency distribution table of the categorical variables.

  • Observed frequencies: Are counts made from experimental data. In other words, you observe the data happening and take measurements. (Source)

  • Expected frequencies: Are counts calculated using probability theory. Expected frequencies are calculated for each cell in the contingency table.

Chi-square Test in Spreadsheets tutorial

Where,

  • Eij: Expected frequency for ith row and jth column
  • Ti: Total in the ith row
  • Tj: Total in the jth row
  • N: Grand Total

Or you can think of this as (row total * column total) / grand total

  • Null Hypothesis (H0): It states that no association exists between the two cross-tabulated variables in the population. Hence, the variables are statistically independent. For example, if you compare two methods A and B for its goodness or which one works better, and if the assumption is that both methods are equally good, then this assumption is known as the Null Hypothesis.
  • Alternate Hypothesis (HA): It proposes that the two variables are related to the population. If you assume that from two methods, method A is superior to method B or method B is superior to method A, then this assumption is known as Alternative Hypothesis.

  • Degree of Freedom: The number of independent variates that make up the statistic is known as the degree of freedom of that statistic.

Chi-square Test in Spreadsheets tutorial

Where,

  • r=numbers of rows
  • c=number of columns

This will be used in the test of independence and test of homogeneity, not in the goodness of fit.

  • Chi-square test Statistics: A chi-squared statistic is a single number that tells you how much difference exists on your observed counts and the counts you would expect if there were no relationship at all in the population.
  • Chi-Square p-value: Chi-square P-value will tell you if your test results are significant or not.

Types of Chi-square test

  1. Goodness of fit: Chi-Square goodness of fit test is a non-parametric test that is used to find out how the observed value of a given phenomenon is significantly different from the expected value. In this test, you only have one variable from a single population (Source).

    • Null hypothesis (H0): In the Chi-Square goodness of fit test, the null hypothesis assumes that there is no significant difference between the observed and the expected value (Source).

    • Alternative hypothesis (Ha): In the Chi-Square goodness of fit test, the alternative hypothesis assumes that there is a significant difference between the observed and the expected value (Source).

      e.g., Let’s take a straightforward example, you rolled a fair 6-sided die 120 times and got the observed frequencies.

Chi-square Test in Spreadsheets tutorial

Hence,

  • H0 = Chances are equal to getting all numbers in the same frequency, or data is consistent with the expected one.
            p1 = p2 = p3 = p4 = p5 = p6 = 1/6
    
  • Ha = At least one p is not equal to 1/6, or data is not consistent with the expected one.

    • Criteria and Decision Rule: The rejection region is always right-tailed using χ2 distribution with (k-1) degree of freedom. (k =number of the categories) Reject H0 if χ2calulated > χ2tabulated DOF = k-1
  • Test of independence you use this to test two categorical variables are independent or not. e.g., gender vs. opinion independence.

H0: The row variable is independent of the column variable, or there is no significant relationship between variables Ha: The Relationship is significant.

  • Criteria and Decision Rule: Rejection region is always right tailed using χ2 distribution with (r-1)(c-1) degree of freedom. (r = number of the rows, c = number of the columns)

    Reject H0 if χ2calulated > χ2tabulated

    DOF = (r-1)(c-1)

  1. Test of homogeneity: Whenever you want to test whether those frequencies of different populations are distributed identically or not. In such cases, you perform the test of homogeneity. Let’s consider an example to grasp it more practically. In a survey, you asked about income as low, medium, or high. In this survey, both populations are diverse such as males and females. In such cases, you perform a chi-square test for homogeneity to determine whether male and female income differs significantly or not.

H0: Frequency count across the population is the same. Ha: Frequency count across the population is different.

  • Criteria and Decision Rule: Rejection region is always right tailed using χ2 distribution with (r-1)(c-1) degree of freedom. (r = number of the rows, c = number of the columns)

    Reject H0 if χ2calulated > χ2tabulated

    DOF = (r-1)(c-1)

Example of the Chi-square test

Suppose you wish to classify defects in the furniture produced by a manufacturing plant based on the type of defects and the production shift. A total of 390 furniture defects were recorded, and the defects were classified as one of four types A, B, C, and D. At the same time, each piece of defected furniture was identified according to the production shift.

Chi-square Test in Spreadsheets tutorial

Source: Engineering Statistics book

Solution: you need to look at whether the defect types are dependent on the production shift or not. So, let’s solve this using excel.

Solving the example using the Chi-square test in Spreadsheets

Let’s first put this data into the Spreadsheet

Chi-square Test in Spreadsheets tutorial

Defining the null hypothesis and the alternative hypothesis

To define the null and alternate hypothesis in the above section. The main objective is to check whether the furniture defects are independent of the production shift or not:

  • H0 = Defect type and manufacturing shift are independent
  • Ha = Defect type and manufacturing shift are dependent

Calculated expected frequencies

  • Before calculating the expected frequencies. First, calculate the row-wise sum of items for each row and the column-wise sum of items for each column using the SUM() function, which is known as row total and column total, respectively. Also, calculate the total of row total and column total. Row and column total both will be the same.
Chi-square Test in Spreadsheets tutorial
  • As you know expected frequency = (row total * column total) / total
Chi-square Test in Spreadsheets tutorial

Don’t forget to make cells absolute while applying the formula, so that you can copy & paste the formula for all of the expected values.

Calculate Chi-statistic value

Now before you calculate Chi – statistic value or p-value, lets first assume the significance level. This means at what significance level you want to know the answer. Let’s assume significance level α = 0.05. Also, the degree of freedom would be = (r-1)(c-1) = (3-1)(4-1) = 6.

Now there are two ways to calculate chi-statistic value one by the formula χ^2= ∑(O-E)^2/E or use the excel function to get the chi-square statistic value.

Let’s first calculate using the formula. For this, you need to calculate ∑(O-E)^2/E using excel. This can be done by using the below step –

Chi-square Test in Spreadsheets tutorial

You can get all the values by copy and paste this formula to all the cells.

To get the χ^2 values to take the sum of all the values, this would give us the chi-square statistic calculated value.

Chi-square Test in Spreadsheets tutorial

Based on the tabulated and calculated value, you can conclude that the defect types and shift times are dependent.

Chi-square Test in Spreadsheets tutorial

Now let’s calculate using excel function. CHISQ.TEST() function will give the p-value, which can directly be compared with the significance level to conclude the results.

Chi-square Test in Spreadsheets tutorial

Based on the p-value, you can conclude that the defect is dependent on manufacturing shift time.

Chi-square Test in Spreadsheets tutorial

Pros and Cons

Pros:

  • It is easier to compute.
  • It can also be used with nominal data.
  • It does not assume anything about the data distribution.

Cons:

  • The number of observations should be more than 20.
  • Data must be frequency data.
  • It assumes random sampling. It means the sample should be selected randomly.
  • It is sensitive to small frequencies, which leads to erroneous conclusions.
  • It is also sensitive to sample size.

Conclusion

Congratulations, you have made it to the end of this tutorial!

In this tutorial, you have covered a lot of details of the Chi-square test. You have learned what Chi-square is, terminologies used in the Chi-square test, types of Chi-square tests, examples of Chi-square tests, and an example on how to solve a Chi-square test in spreadsheets. Also, you looked over its pros and cons.

Hopefully, you can now utilize the Chi-Square concepts to test the hypothesis. Thanks for reading this tutorial!

Check out our Getting Started with Spreadsheets tutorial.

If you are interested in learning more about statistics in spreadsheets, take DataCamp's Statistics in Spreadsheets course.

Intermediate Spreadsheets

Beginner
4 hours
38,262
Expand your spreadsheets vocabulary by diving deeper into data types, including numeric data, logical data, and missing data.
See DetailsRight Arrow
Start Course

Introduction to Spreadsheets

Beginner
2 hours
65,689
Learn the basics of spreadsheets by working with rows, columns, addresses, and ranges.

Data Analysis in Spreadsheets

Beginner
3 hours
83,845
Learn how to analyze data with spreadsheets using functions such as SUM(), AVERAGE(), and VLOOKUP().
See all coursesRight Arrow
Related

YOLO Object Detection Explained

Understand YOLO object detection, its benefits, how it has evolved over the last couple of years and some real-life applications.
Zoumana Keita 's photo

Zoumana Keita

5 Ways to Use Data Science in Marketing

Discover five ways you can use data science in marketing. Get ahead of the game, improve your data skills, and work on a data science marketing project.
Natassha Selvaraj's photo

Natassha Selvaraj

_Quote.png

The Deep Learning Revolution in Space Science

Justin Fletcher joins the show to talk about how the US Space Force is using deep learning with telescope data to monitor satellites, potentially lethal space debris, and identify and prevent catastrophic collisions. 

Richie Cotton's photo

Richie Cotton

53 min

Regular Expressions Cheat Sheet

Regular expressions (regex or regexp) are a pattern of characters that describe an amount of text. Regular expressions are one of the most widely used tools in natural language processing and allow you to supercharge common text data manipulation tasks. Use this cheat sheet as a handy reminder when working with regular expressions.
DataCamp Team's photo

DataCamp Team