Akinkunmi_project 3
You're working as a sports journalist at a major online sports media company, specializing in soccer analysis and reporting. You've been watching both men's and women's international soccer matches for a number of years, and your gut instinct tells you that more goals are scored in women's international football matches than men's. This would make an interesting investigative article that your subscribers are bound to love, but you'll need to perform a valid statistical hypothesis test to be sure!
While scoping this project, you acknowledge that the sport has changed considerably over the years and that performances likely vary by tournament, so you decide to limit the data used in the analysis to official FIFA World Cup matches (not including qualifiers) played after 2002-01-01.
You create two datasets containing the results of every official men's and women's international football match since the 19th century, which you scraped from a reliable online source. This data is stored in two CSV files: women_results.csv and men_results.csv.
The question you are trying to determine the answer to is:
Are more goals scored in women's international soccer matches than men's?
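You assume a 10% significance level, and use the following null and alternative hypotheses:
H0: The mean number of goals scored in women's international soccer matches is the same as men's.
HA: The mean number of goals scored in women's international soccer matches is greater than men's.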
# Load necessary libraries
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
# Load the data from men_results.csv and women_results.csv
men_results <- read_csv("men_results.csv")
women_results <- read_csv("women_results.csv")
# Display the structure of the datasets to understand their contents
str(men_results)
str(women_results)
# Determining the column names, data types, and values
glimpse(men_results)
glimpse(women_results)
# I want to find the unique values in a categorical column
unique(men_results$tournament)
unique(women_results$tournament)
# Print the datasets
print(men_results)
print(women_results)
# Filter the data to FIFA World Cup matches played after 2002-01-01
women_wc_data <- women_results %>%
  filter(tournament == "FIFA World Cup" & date > as.Date("2002-01-01"))
men_wc_data <- men_results %>%
  filter(tournament == "FIFA World Cup" & date > as.Date("2002-01-01"))
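# A minimal sketch of how the filter above behaves, run on a made-up data frame.
# toy_results and toy_wc are hypothetical names and values, not part of the datasets.
# Only rows that are World Cup matches AND dated after 2002-01-01 should survive.
library(dplyr)
toy_results <- data.frame(
  date = as.Date(c("1998-07-12", "2002-06-30", "2004-03-31", "2018-07-15")),
  tournament = c("FIFA World Cup", "FIFA World Cup", "Friendly", "FIFA World Cup"),
  home_score = c(3, 0, 1, 4),
  away_score = c(0, 2, 1, 2)
)
toy_wc <- toy_results %>%
  filter(tournament == "FIFA World Cup" & date > as.Date("2002-01-01"))
# The 1998 match fails the date condition and the friendly fails the tournament condition
toy_wc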
# Prepare the values to test
# Compute the total goals per match (home + away) for women's and men's matches
women_wc_data_goals <- women_wc_data %>%
  mutate(goals = home_score + away_score)
men_wc_data_goals <- men_wc_data %>%
  mutate(goals = home_score + away_score)
# Print the dataset
print(men_wc_data_goals)
print(women_wc_data_goals)
# Extract the goals scored per match in women's and men's matches
# (despite their names, these vectors hold the per-match goal counts, not the means)
mean_goals_women <- women_wc_data_goals$goals
mean_goals_men <- men_wc_data_goals$goals
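# A small sketch for summarising a vector of per-match goal counts before testing.
# summarise_goals is a hypothetical helper introduced here for illustration; it is
# shown on made-up numbers rather than on the real data.
summarise_goals <- function(goals) {
  data.frame(
    n = length(goals),        # number of matches
    mean = mean(goals),       # average goals per match
    median = median(goals),   # middle value, less sensitive to high-scoring outliers
    sd = sd(goals)            # spread of goals per match
  )
}
summarise_goals(c(2, 3, 1, 4))
# The same helper could be applied to the vectors above, for example:
# summarise_goals(mean_goals_women)
# summarise_goals(mean_goals_men)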
# Determine the test.
# The two groups (women's and men's matches) are independent, so an unpaired two-sample test is needed.
# Whether a t-test or a rank-based test fits depends on the sample sizes and the shape of the goal distributions, both checked below.
n_men <- nrow(men_wc_data_goals)
n_women <- nrow(women_wc_data_goals)
n_men
n_women
# Inspect the distribution of goals using histograms
ggplot(women_wc_data_goals, aes(x = goals)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.7) +
  labs(title = "Distribution of Goals Scored in Women's Matches")
ggplot(men_wc_data_goals, aes(x = goals)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.7) +
  labs(title = "Distribution of Goals Scored in Men's Matches")
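# The histograms above are a visual check. As an optional numeric complement, a
# Shapiro-Wilk test can be run on a vector of goal counts. This is a sketch on
# simulated data: toy_goals and the Poisson rate of 2.7 are assumptions for
# illustration, not values from the datasets.
set.seed(1)
toy_goals <- rpois(100, lambda = 2.7)
toy_normality <- shapiro.test(toy_goals)
# A small p-value would suggest the counts are not normally distributed,
# which would favour a rank-based test over a t-test
toy_normality$p.value
# On the real data this could be, for example: shapiro.test(women_wc_data_goals$goals)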
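# Perform a one-sided Wilcoxon-Mann-Whitney test
# (rank-based, so it does not assume normally distributed goal counts;
# var.equal is a t.test() argument and is not used by wilcox.test(), so it is dropped)
test_stat <- wilcox.test(mean_goals_women, mean_goals_men,
                         alternative = "greater")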
test_stat
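# A minimal sketch of what the W statistic printed above measures, using made-up
# goal counts (toy_x and toy_y are hypothetical, not taken from the datasets).
# W counts the (x, y) pairs in which the x value is larger, with ties counted as one half.
toy_x <- c(3, 5, 2, 4)
toy_y <- c(1, 2, 3)
w_by_hand <- sum(outer(toy_x, toy_y, ">")) + 0.5 * sum(outer(toy_x, toy_y, "=="))
# suppressWarnings() hides the note that an exact p-value is unavailable when there are ties
w_from_test <- suppressWarnings(
  wilcox.test(toy_x, toy_y, alternative = "greater")
)$statistic
w_by_hand            # 9 pairs with x > y plus 2 ties at one half each: 10
unname(w_from_test)  # the same value as reported by wilcox.test()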
# Extract the p-value and compare it with the 10% significance level
p_val <- test_stat$p.value
result <- ifelse(p_val < 0.10, "reject", "fail to reject")
# Store the p-value and the decision on the null hypothesis in a one-row data frame
result_df <- data.frame(p_val = p_val, result = result)
result_df