Project: Hypothesis Testing with Men's and Women's Soccer Matches

You're working as a sports journalist at a major online sports media company, specializing in soccer analysis and reporting. You've been watching both men's and women's international soccer matches for a number of years, and your gut instinct tells you that more goals are scored in women's international football matches than men's. This would make an interesting investigative article that your subscribers are bound to love, but you'll need to perform a valid statistical hypothesis test to be sure!

While scoping this project, you acknowledge that the sport has changed a lot over the years, and performances likely vary a lot depending on the tournament, so you decide to limit the data used in the analysis to only official FIFA World Cup matches (not including qualifiers) since 2002-01-01.

You create two datasets containing the results of every official men's and women's international football match since the 19th century, which you scraped from a reliable online source. This data is stored in two CSV files: women_results.csv and men_results.csv.

The question you are trying to determine the answer to is:

Are more goals scored in women's international soccer matches than men's?

You assume a 10% significance level, and use the following null and alternative hypotheses:

$H_0$: The mean number of goals scored in women's international soccer matches is the same as men's.

$H_A$: The mean number of goals scored in women's international soccer matches is greater than men's.

# Start your code here!
library(tidyverse)
library(infer)

We start by importing the two datasets.

# Importing datasets
women <- read_csv("women_results.csv", show_col_types = FALSE, col_select = -1)
men <- read_csv("men_results.csv", show_col_types = FALSE, col_select = -1)

# Checking the first rows
glimpse(women)
glimpse(men)

Now, let's filter each dataset down to the target group and create the variable we need, goals_scored, which is the sum of the goals scored by the home and away teams.

# Let's filter the target group and calculate the goals scored
women.target <- women %>%
	# only FIFA World Cup tournaments since 2002-01-01
	filter(tournament == "FIFA World Cup" & date > "2002-01-01") %>%
	# Calculating the goals scored in each match
	mutate(goals_scored = home_score + away_score)
glimpse(women.target)

men.target <- men %>%
	# only FIFA World Cup tournaments since 2002-01-01
	filter(tournament == "FIFA World Cup" & date > "2002-01-01") %>%
	# Calculating the goals scored in each match
	mutate(goals_scored = home_score + away_score)
glimpse(men.target)

Since we assume that matches and teams are independent, there is no need to sort or pair matches by date: in the end we only compare a summary of goals_scored for each group, so the order of the rows doesn't matter.
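To make this concrete, one option is to stack both groups into a single long table with a group label and summarise each group. The sketch below uses small made-up tibbles (toy_women and toy_men are hypothetical stand-ins, not the real data); the same pipeline should work on women.target and men.target.

```r
library(dplyr)

# Hypothetical toy data standing in for women.target and men.target
toy_women <- tibble(goals_scored = c(3, 1, 4, 2, 5))
toy_men <- tibble(goals_scored = c(2, 0, 3, 1))

# Stack the groups with a label; row order plays no role in these summaries
toy_summary <- bind_rows(women = toy_women, men = toy_men, .id = "group") %>%
	group_by(group) %>%
	summarise(
		n_matches = n(),
		mean_goals = mean(goals_scored),
		median_goals = median(goals_scored)
	)
toy_summary
```

Shuffling the rows of either toy tibble before summarising gives the same table, which is why no pairing or sorting is required here.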

Before we get to the test, let's visualize the distribution of goals scored in each group:

ggplot() +
	# Density plot for goals scored by men
	geom_density(aes(x = goals_scored, color = "Men", fill = "Men"), data = men.target, alpha = 0.2) +
	# Density plot for goals scored by women
	geom_density(aes(x = goals_scored, color = "Women", fill = "Women"), data = women.target, alpha = 0.2) +
	# Adding manual legends to identify the gender
	scale_color_manual(name = "Teams gender", values = c("Women" = "red", "Men" = "blue")) +
	scale_fill_manual(name = "Teams gender", values = c("Women" = "red", "Men" = "blue")) +
	# Adding the mean of goals scored by men as a vertical line
	geom_vline(xintercept = mean(men.target$goals_scored), color = "blue", linetype = "dashed") +
	# Adding the mean of goals scored by women as a vertical line
	geom_vline(xintercept = mean(women.target$goals_scored), color = "red", linetype = "dashed")

According to the plot, the distribution of goals scored is right-skewed for both genders: most matches have few goals, with a long tail of high-scoring matches. The two distributions also look quite similar, although women's matches have slightly more goals on average (red dashed vertical line) than men's matches (blue dashed vertical line) in FIFA World Cup matches since 2002.

To test our hypothesis we need a test that compares a numerical variable between two independent groups. We could use a t-test, but the skewed distributions make the mean a less reliable summary of the center, and the median is a more robust estimator. That's why we prefer the Wilcoxon (Mann-Whitney) test over the t-test: the t-test compares means, whereas the Wilcoxon test works on ranks and is commonly read as a comparison of medians when the two distributions have a similar shape.
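As a rough illustration of why the Wilcoxon test is less sensitive to skew, the sketch below computes the rank-sum statistic by hand on made-up samples (x and y are toy values without ties, not the World Cup data) and compares it with the statistic reported by wilcox.test(). Only the ranks enter the calculation, so an extreme score moves the statistic no more than any other value of the same rank.

```r
# Toy samples without ties (made-up values, not the World Cup data)
x <- c(3, 5, 6, 7)
y <- c(1, 2, 4)

# Rank all observations together, then sum the ranks that belong to x
ranks <- rank(c(x, y))
n_x <- length(x)
w_manual <- sum(ranks[seq_len(n_x)]) - n_x * (n_x + 1) / 2

# The W statistic reported by wilcox.test() for the same samples
w_builtin <- unname(wilcox.test(x, y, alternative = "greater")$statistic)

c(manual = w_manual, builtin = w_builtin)
```

With no ties, W equals the number of (x, y) pairs in which the x value is larger; replacing 7 with 70 in x would leave it unchanged.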

However, it's a good exercise to check whether the two tests lead to different conclusions. Let's start with the t-test:

# t-test from base-R
t.test(women.target$goals_scored, men.target$goals_scored, alternative = "greater")

According to the t-test, the null hypothesis should be rejected: with about 3 goals per match against about 2.5, women's matches have significantly more goals on average than men's in FIFA World Cup matches since 2002. The p-value is below 0.01, so the conclusion would hold even at a 1% significance level.
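For reference, t.test() with its default var.equal = FALSE runs Welch's unequal-variance t-test. The sketch below reproduces the statistic and the degrees of freedom by hand on made-up samples (x and y are toy values, not the World Cup data), which may help when reading the printed output.

```r
# Toy samples (made-up values, not the World Cup data)
x <- c(3, 1, 4, 2, 5)
y <- c(2, 0, 3, 1)

# Per-group variance of the sample mean
v_x <- var(x) / length(x)
v_y <- var(y) / length(y)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(v_x + v_y)

# Welch-Satterthwaite approximation of the degrees of freedom
df_manual <- (v_x + v_y)^2 /
	(v_x^2 / (length(x) - 1) + v_y^2 / (length(y) - 1))

# Built-in result for comparison
t_builtin <- t.test(x, y, alternative = "greater")
c(t_manual = t_manual, t_builtin = unname(t_builtin$statistic))
c(df_manual = df_manual, df_builtin = unname(t_builtin$parameter))
```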

Now, let's see what the decision is when we use the Wilcoxon test.

wtest <- wilcox.test(x = women.target$goals_scored, y = men.target$goals_scored, alternative = 'greater')
wtest
median(women.target$goals_scored)
median(men.target$goals_scored)

According to the Wilcoxon test, there is statistical evidence to reject the null hypothesis because the p-value is lower than the 10% significance level. We can also see a difference of one goal between the two medians, in favour of women. Thus, we reject the null hypothesis, and both tests lead to the same decision. The large sample size is probably a key reason why the two tests agree.
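The decision rule itself can be written down in a few lines. This is a minimal sketch: decide() is a hypothetical helper introduced only for illustration, and the p-values passed to it are made up. In the analysis, the p-value would come from wtest$p.value.

```r
# Hypothetical helper: compare a p-value with the significance level
decide <- function(p_value, alpha = 0.10) {
	if (p_value <= alpha) "reject" else "fail to reject"
}

# Made-up p-values to show both outcomes at the 10% level
decide(0.03)
decide(0.25)

# In the analysis this would be: decide(wtest$p.value, alpha = 0.10)
```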
