You're working as a sports journalist at a major online sports media company, specializing in soccer analysis and reporting. You've been watching both men's and women's international soccer matches for a number of years, and your gut instinct tells you that more goals are scored in women's international football matches than men's. This would make an interesting investigative article that your subscribers are bound to love, but you'll need to perform a valid statistical hypothesis test to be sure!
While scoping this project, you acknowledge that the sport has changed a lot over the years, and performances likely vary a lot depending on the tournament, so you decide to limit the data used in the analysis to only official FIFA World Cup matches (not including qualifiers) since 2002-01-01.
You create two datasets containing the results of every official men's and women's international football match since the 19th century, which you scraped from a reliable online source. This data is stored in two CSV files: women_results.csv and men_results.csv.
The question you are trying to determine the answer to is:
Are more goals scored in women's international soccer matches than men's?
You assume a 10% significance level, and use the following null and alternative hypotheses:
import pandas as pd
# Load Data -> Adjust index_col to avoid duplicate index column "Unnamed: 0"
women_df = pd.read_csv("women_results.csv", index_col=0)
men_df = pd.read_csv("men_results.csv", index_col=0)
# Inspect Women's Data
women_df.head()
# Info on women_df
women_df.info()
# Checking for nulls in women_df
women_df.isna().sum()
# Inspect Men's Data
men_df.head()
# Info on men_df
men_df.info()
# Checking for nulls:
men_df.isna().sum()
Given the overall absence of nulls, we are ready to do statistical hypothesis tests. The hypothesis described in this problem is that women score more goals than men.
H_0: goals scored per women's match <= goals scored per men's match
H_a: goals scored per women's match > goals scored per men's match
I will first filter both datasets so they include only matches whose tournament is "FIFA World Cup". Next, I will keep only matches played on or after 2002-01-01. Finally, I will use the goals scored in men's matches as the baseline distribution, compare the goals scored in women's matches against it, and assess whether the difference is statistically significant at the 10% level or whether we fail to reject the null hypothesis.
# Evaluating match counts by tournament in the raw men's data (top 15 tournaments)
ax = men_df["tournament"].value_counts(normalize=False)[:15].plot(kind="bar", rot=90, title="Number of Matches in the Raw Men's Data by Tournament", ylabel="Number of Matches", xlabel="Tournament")
# Label each bar with its match count
for bars in ax.containers:
    ax.bar_label(bars)
# Evaluating match counts by tournament in the raw women's data (top 15 tournaments)
ax = women_df["tournament"].value_counts(normalize=False)[:15].plot(kind="bar", rot=90, title="Number of Matches in the Raw Women's Data by Tournament", ylabel="Number of Matches", xlabel="Tournament")
# Label each bar with its match count
for bars in ax.containers:
    ax.bar_label(bars)
# Men have 964 FIFA World Cup Games, Women 284, now we will filter the dfs
men = men_df.loc[men_df["tournament"] == "FIFA World Cup"].copy()
women = women_df.loc[women_df["tournament"] == "FIFA World Cup"].copy()
# New dfs after filtering by tournament
print("Men before/after Tournament Filter:", men_df.shape, men.shape, "\nWomen before/after Tournament Filter:", women_df.shape, women.shape)
# Next, we need to convert the "date" column in both dataframes to a datetime and keep dates on or after 2002-01-01
# Converting from object to datetime
men["date"] = pd.to_datetime(men["date"])
women["date"] = pd.to_datetime(women["date"])
# Filtering by the minimum date
men = men.loc[men["date"] >= "2002-01-01"]
women = women.loc[women["date"] >= "2002-01-01"]
# New dfs after filtering by tournament AND date, a considerable reduction in overall data
print("Men before/after Tournament and Date Filters:", men_df.shape, men.shape, "\nWomen before/after Tournament and Date Filters:", women_df.shape, women.shape)
# Creating a total_goals column so that we can compare total goals for hypothesis testing
men["total_goals"] = men["home_score"] + men["away_score"]
women["total_goals"] = women["home_score"] + women["away_score"]
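With total_goals in place, the comparison described above can be run as a one-sided test. The notebook does not say which test to use, so the sketch below is only one possible option: a permutation test on the difference in mean goals per match, which makes no normality assumption. The function name one_sided_permutation_test, its parameters, and the toy goal totals are all hypothetical and are not taken from the data.

```python
import random


def one_sided_permutation_test(sample_a, sample_b, n_permutations=10_000, seed=0):
    """Estimate the p-value for H_a: mean(sample_a) > mean(sample_b).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    rng = random.Random(seed)
    a, b = list(sample_a), list(sample_b)
    n_a = len(a)

    # Observed difference in mean goals per match
    observed = sum(a) / n_a - sum(b) / len(b)

    # Under H_0 the group labels are exchangeable, so shuffle the pooled
    # values and count how often a difference this large arises by chance
    pooled = a + b
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up goal totals, used only to show the call pattern
toy_women = [3, 4, 2, 5, 3, 4, 6, 2]
toy_men = [2, 1, 3, 2, 0, 2, 3, 1]

observed_diff, p_value = one_sided_permutation_test(toy_women, toy_men)
alpha = 0.10  # the 10% significance level stated in the brief
reject_null = p_value < alpha
print("Observed difference:", observed_diff, "p-value:", p_value, "reject H_0:", reject_null)
```

On the real data, the same call would take women["total_goals"].tolist() and men["total_goals"].tolist() as its two samples. A rank-based test such as Wilcoxon-Mann-Whitney would be another reasonable choice for count data like goals.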