Hypothesis Testing with Men's and Women's Soccer Matches in R

Introduction

For a sports journalist specializing in international soccer, the question of whether women’s international matches see more goals than men’s is an interesting and timely one. The game has changed considerably over the years, and match outcomes likely vary with factors such as competition type and tournament year. To keep the comparison consistent, this analysis is restricted to official FIFA World Cup matches played after January 1, 2002.

The primary goal of this project is to test the hypothesis that more goals are scored in women's international soccer matches than in men's. Match results from men's and women's FIFA World Cup matches are loaded, the total goals per match are computed, and a statistical hypothesis test is used to decide whether the data support the claim. The analysis is written in R and uses the tidyverse (which includes dplyr and ggplot2) together with lubridate for cleaning, exploring, and visualizing the data.

# Import necessary libraries
library(tidyverse)
library(readr)
library(dplyr)
library(lubridate)

women_results  <- read.csv("women_results.csv")
men_results <- read.csv("men_results.csv")

# Women's matches: parse dates, compute total goals per match,
# then keep FIFA World Cup matches played after 2002-01-01
womens_data <- women_results %>%
  mutate(date = ymd(date),
         total_goals_scored = home_score + away_score,
         group = "women") %>%
  filter(tournament == "FIFA World Cup", date > ymd("2002-01-01"))

# Visualize to check the shape of the distribution
# (binwidth = 1 because goal totals are whole numbers)
ggplot(womens_data, aes(total_goals_scored)) +
  geom_histogram(binwidth = 1)

# Men's matches: same preparation and filter as for the women's data
mens_data <- men_results %>%
  mutate(date = ymd(date),
         total_goals_scored = home_score + away_score,
         group = "men") %>%
  filter(tournament == "FIFA World Cup", date > ymd("2002-01-01"))

# Visualize to check the shape of the distribution
# (binwidth = 1 because goal totals are whole numbers)
ggplot(mens_data, aes(total_goals_scored)) +
  geom_histogram(binwidth = 1)


# Combine both groups and compute summary statistics per group
combined_data <- womens_data %>%
  bind_rows(mens_data)

summary_stats <- combined_data %>%
  group_by(group) %>%
  summarize(median = median(total_goals_scored),
            iqr = IQR(total_goals_scored))

# Checklist for choosing the statistical test:
# Outcome variable: total goals per match (discrete count)
# Grouping variable: dichotomous (women vs. men)
# Type of analysis: comparison between groups
# Groups: 2, one data set each
# Study design: independent/unpaired samples
# Shape of distribution: both skewed to the right
# Appropriate test: Mann-Whitney U test (Wilcoxon rank-sum test)
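
The "skewed to the right" entry in the checklist can be checked numerically as well as by eye. The following is a minimal sketch, assuming a hypothetical vector `goals` of made-up match totals (not the World Cup data); it compares the mean with the median and runs a Shapiro-Wilk normality test from base R.

```r
# Hypothetical goal totals, for illustration only (NOT the World Cup data)
goals <- c(0, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 7)

# In a right-skewed sample the mean is typically pulled above the median
mean_goals <- mean(goals)
median_goals <- median(goals)
right_skewed <- mean_goals > median_goals

# Shapiro-Wilk test: a small p-value suggests a departure from normality,
# which supports choosing a rank-based test over a t-test
normality <- shapiro.test(goals)

cat("mean:", mean_goals, "median:", median_goals, "\n")
cat("mean above median:", right_skewed, "\n")
cat("Shapiro-Wilk p-value:", normality$p.value, "\n")
```

On the real data, the same calls could be applied to `womens_data$total_goals_scored` and `mens_data$total_goals_scored` separately.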


# One-sided test: are women's match totals greater than men's?
wilcox.test(x = womens_data$total_goals_scored,
            y = mens_data$total_goals_scored,
            alternative = "greater")



# Record the p-value reported by the test and the decision at the 10% level
result_df <- data.frame(p_val = 0.005107, result = "reject")
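
Typing the p-value by hand works, but it can drift out of sync if the data change. A minimal sketch of an alternative, assuming hypothetical toy vectors `women_goals` and `men_goals` and an `alpha` of 0.10: store the test object, read the p-value from it, and derive the decision from the comparison.

```r
# Hypothetical toy data standing in for total goals per match;
# illustrative only, NOT the World Cup results
women_goals <- c(2, 3, 4, 1, 5, 3, 6, 2, 4, 7)
men_goals   <- c(1, 2, 0, 3, 2, 1, 2, 4, 1, 3)

alpha <- 0.10  # significance level (assumed here to match the write-up)

# One-sided Mann-Whitney U (Wilcoxon rank-sum) test.
# exact = FALSE requests the normal approximation, which avoids the
# warning R raises when an exact p-value is requested on tied data.
test_result <- wilcox.test(x = women_goals,
                           y = men_goals,
                           alternative = "greater",
                           exact = FALSE)

# Derive the p-value and decision from the test object
p_val <- test_result$p.value
decision <- ifelse(p_val <= alpha, "reject", "fail to reject")

result_df <- data.frame(p_val = p_val, result = decision)
print(result_df)
```

With the real data, `women_goals` and `men_goals` would be replaced by `womens_data$total_goals_scored` and `mens_data$total_goals_scored`.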

Conclusion

Based on the results of the statistical hypothesis test, the p-value was 0.005107, which is well below the significance level of 0.10. This provides strong evidence to reject the null hypothesis that women’s international soccer matches do not tend to have more goals than men’s. Because the Mann-Whitney U test is rank-based, the result describes a shift in the distribution of goals per match rather than a difference in means.

With this result, we can conclude that, in FIFA World Cup matches played since 2002, women's matches tend to produce more goals than men's. This finding challenges traditional assumptions and opens up an interesting avenue for further analysis of the dynamics of women’s football.

This investigation, carried out using R programming and tools like tidyverse, lubridate, dplyr, and ggplot2, demonstrates the power of statistical analysis in revealing insights that go beyond intuition and contribute meaningfully to the discourse surrounding gender differences in sports.