You're working as a sports journalist at a major online sports media company, specializing in soccer analysis and reporting. You've been watching both men's and women's international soccer matches for a number of years, and your gut instinct tells you that more goals are scored in women's international football matches than men's. This would make an interesting investigative article that your subscribers are bound to love, but you'll need to perform a valid statistical hypothesis test to be sure!
While scoping this project, you acknowledge that the sport has changed a lot over the years, and performances likely vary a lot depending on the tournament, so you decide to limit the data used in the analysis to only official FIFA World Cup matches (not including qualifiers) since 2002-01-01.
You create two datasets containing the results of every official men's and women's international football match since the 19th century, which you scraped from a reliable online source. This data is stored in two CSV files: women_results.csv and men_results.csv.
The question you are trying to answer is:
Are more goals scored in women's international soccer matches than men's?
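You assume a 10% significance level and use the following null and alternative hypotheses:
H0: The mean number of goals scored in women's international soccer matches is the same as men's.
HA: The mean number of goals scored in women's international soccer matches is greater than men's.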
# Start your code here!
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import mannwhitneyu
#Set the significance level
alpha = 0.1

#Import the women's CSV file
df_women = pd.read_csv('women_results.csv')
#Change the date column to datetime and filter by the noted date
df_women['date'] = pd.to_datetime(df_women['date'])
df_women = df_women[df_women['date'] > '2002-01-01']
#Filter the dataframe by only FIFA World Cup matches
#.copy() avoids a SettingWithCopyWarning when a column is added below
df_women = df_women[df_women['tournament'] == 'FIFA World Cup'].copy()
#Get the total score from the games
df_women['total_score'] = df_women['home_score'] + df_women['away_score']
df_women.head()

#Import the men's CSV file
df_men = pd.read_csv('men_results.csv')
#Change the date column to datetime and filter by the noted date
df_men['date'] = pd.to_datetime(df_men['date'])
df_men = df_men[df_men['date'] > '2002-01-01']
#Filter the dataframe by only FIFA World Cup matches
#.copy() avoids a SettingWithCopyWarning when a column is added below
df_men = df_men[df_men['tournament'] == 'FIFA World Cup'].copy()
#Get the total score from the games
df_men['total_score'] = df_men['home_score'] + df_men['away_score']
df_men.head()

#Combine the total_score columns from both dataframes and add a new column for sex
df_combine = pd.concat(
    [df_women[['total_score']].assign(sex='women'),
     df_men[['total_score']].assign(sex='men')],
    ignore_index=True,
)
df_combine.head()

#Plot the distribution of the two total scores
sns.displot(data=df_combine, x="total_score", kde=True, hue='sex')
plt.show()
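The histogram gives a visual impression of whether the scores look normal; a formal check such as the Shapiro-Wilk test can back that impression up. The sketch below is not part of the original analysis. It uses made-up Poisson-distributed goal counts instead of the real CSV data, and the names `rng` and `fake_goals`, the Poisson rate of 2.7, and the sample size of 200 are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import shapiro

# Hypothetical stand-in for a total_score column: goal counts are discrete and right-skewed
rng = np.random.default_rng(seed=0)
fake_goals = rng.poisson(lam=2.7, size=200)

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution
shapiro_stat, shapiro_p = shapiro(fake_goals)
print(f"Shapiro-Wilk statistic: {shapiro_stat:.3f}, p-value: {shapiro_p:.4g}")

# A small p-value is evidence against normality, which supports a non-parametric test
looks_normal = shapiro_p >= 0.1
print(f"Consistent with normality at the 10% level: {looks_normal}")
```

With the real data, the same call could presumably be applied to `df_women['total_score']` and `df_men['total_score']` separately.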
#The distribution is not normal, so a non-parametric test is needed rather than a parametric one.
#Therefore use a Mann-Whitney U test.
#The alternative hypothesis is one-sided (women > men), so run it as a right-tailed test.

#Split them into two groups
group_women = df_combine[df_combine['sex'] == 'women']['total_score']
group_men = df_combine[df_combine['sex'] == 'men']['total_score']
#Pass the two groups to the Mann Whitney U test, specified for a right tailed test
stat, p_value = mannwhitneyu(group_women, group_men, alternative='greater')
#Print the results
print(f"U-statistic: {stat}, P-value: {p_value}")
#Get the result
if p_value < alpha:
    result = 'reject'
else:
    result = 'fail to reject'
result

#Put the results into a dictionary
result_dict = {"p_val": p_value, "result": result}
result_dict
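The direction of the one-sided test depends on argument order: with `alternative='greater'`, `mannwhitneyu` asks whether the first sample tends to be larger than the second. The following is a minimal sketch on synthetic data, not the match results. The samples `higher` and `lower`, their Poisson rates, and their sizes are hypothetical values chosen only to make the effect visible.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(seed=1)

# Hypothetical samples: the first group is drawn with a higher average than the second
higher = rng.poisson(lam=3.5, size=150)
lower = rng.poisson(lam=2.5, size=150)

# alternative='greater' tests whether the first sample tends to be larger than the second
_, p_first_greater = mannwhitneyu(higher, lower, alternative='greater')

# Swapping the arguments tests the opposite direction
_, p_swapped = mannwhitneyu(lower, higher, alternative='greater')

print(f"higher vs lower: p = {p_first_greater:.4g}")
print(f"lower vs higher: p = {p_swapped:.4g}")
```

This is why the women's group is passed first in the analysis above: the alternative hypothesis is that women's matches have more goals.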