I am pretending to work as a sports journalist for a major online sports media company, specializing in soccer analysis and reporting. I have been watching both men's and women's international soccer matches for a number of years, and I believe that more goals are scored in women's international soccer matches than in men's. This would make an interesting investigative article that my subscribers are bound to love, but I will need to perform a valid statistical hypothesis test to be sure.
While scoping this project, I acknowledge that the sport has changed a lot over the years and that performances likely vary a lot depending on the tournament, so I have decided to limit the data used in the analysis to official FIFA World Cup matches (not including qualifiers) played since 2002-01-01.
I will work with two datasets containing the results of every official men's and women's international football match since the 19th century, which were scraped from a reliable online source. This data is stored in two CSV files: women_results.csv and men_results.csv.
The question I am trying to answer is:
Are more goals scored in women's international soccer matches than men's?
I assume a 1% significance level, and use the following null and alternative hypotheses:
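The hypotheses are not listed at this point in the text. A natural formalization of the question above, stated here as an assumption, uses the mean number of goals per match in each group. The symbols $\mu_{\text{women}}$ and $\mu_{\text{men}}$ are introduced only for this illustration, and the alternative is one-sided because the question asks whether women's matches have more goals:

```latex
\begin{aligned}
H_0 &: \mu_{\text{women}} = \mu_{\text{men}} \\
H_A &: \mu_{\text{women}} > \mu_{\text{men}}
\end{aligned}
```

At the 1% significance level, $H_0$ would be rejected only if the test's p-value is at most $0.01$.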
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pingouin as pg
from scipy.stats import ttest_ind
df_men = pd.read_csv('men_results.csv')
df_women = pd.read_csv('women_results.csv')

print(df_men.shape)
print(df_women.shape)

print(df_men[['tournament']].value_counts())

df_men.info()

df_women.info()

# Parse the date column so it can be compared against the cutoff date
df_men['date'] = pd.to_datetime(df_men['date'])
df_women['date'] = pd.to_datetime(df_women['date'])

# Keep only FIFA World Cup matches played on or after 2002-01-01
# (.copy() avoids a SettingWithCopyWarning when columns are added below)
df_men = df_men[(df_men['date'] >= '2002-01-01') & (df_men['tournament'] == 'FIFA World Cup')].copy()
df_women = df_women[(df_women['date'] >= '2002-01-01') & (df_women['tournament'] == 'FIFA World Cup')].copy()

# Total goals scored in each match
df_men['total_score'] = df_men['home_score'] + df_men['away_score']
df_women['total_score'] = df_women['home_score'] + df_women['away_score']

df_men.head()

df_women.head()

print(len(df_men))
print(len(df_women))

# Label each dataset, then stack them into a single frame
df_men['group'] = 'Male'
df_women['group'] = 'Female'

df = pd.concat([df_men, df_women])

df = df[['date', 'home_team', 'away_team', 'group', 'total_score']]
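With the data in this shape, a one-sided two-sample test can be run on total_score by group. The sketch below is only an illustration under stated assumptions: the toy numbers are made up and are not real match results, and the Mann-Whitney U test from SciPy is one reasonable choice for goal counts (it does not assume normality), not necessarily the test this analysis goes on to use.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical toy data, NOT real match results: it only mimics the layout
# of the combined frame (a `group` label and a `total_score` per match).
toy = pd.DataFrame({
    "group": ["Female"] * 6 + ["Male"] * 6,
    "total_score": [3, 4, 2, 5, 3, 6, 2, 1, 3, 2, 0, 4],
})

ALPHA = 0.01  # the 1% significance level stated in the text

women_goals = toy.loc[toy["group"] == "Female", "total_score"]
men_goals = toy.loc[toy["group"] == "Male", "total_score"]

# One-sided test: alternative="greater" asks whether the first sample
# (women) tends to take larger values than the second sample (men).
stat, p_val = mannwhitneyu(women_goals, men_goals, alternative="greater")

# Reject the null hypothesis only if the p-value is at most ALPHA
result = "reject" if p_val <= ALPHA else "fail to reject"
print(f"U = {stat}, p = {p_val:.4f}, decision: {result} H0")
```

Swapping `toy` for the combined `df` built above would apply the same comparison to the filtered World Cup matches.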