You're working as a sports journalist at a major online sports media company, specializing in soccer analysis and reporting. You've been watching both men's and women's international soccer matches for a number of years, and your gut instinct tells you that more goals are scored in women's international football matches than men's. This would make an interesting investigative article that your subscribers are bound to love, but you'll need to perform a valid statistical hypothesis test to be sure!
While scoping this project, you acknowledge that the sport has changed a lot over the years, and performances likely vary a lot depending on the tournament, so you decide to limit the data used in the analysis to only official FIFA World Cup matches (not including qualifiers) since 2002-01-01.
You create two datasets containing the results of every official men's and women's international football match since the 19th century, which you scraped from a reliable online source. This data is stored in two CSV files: women_results.csv and men_results.csv.
The question you are trying to determine the answer to is:
Are more goals scored in women's international soccer matches than men's?
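You assume a 10% significance level, and use the following null and alternative hypotheses:
H0: The mean number of goals scored in women's international soccer matches is the same as men's.
HA: The mean number of goals scored in women's international soccer matches is greater than men's.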
Exploratory data analysis
Determining the column names, data types, and values
import pandas as pd
import pingouin as pg
from scipy.stats import shapiro
import scipy.stats as stats
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
women = pd.read_csv('women_results.csv', index_col=0)
men = pd.read_csv('men_results.csv', index_col=0)
print(women.head(), '\n')
print(men.head())

def check_dataframe(dataframe):
    """Print a quick structural and statistical overview of a DataFrame."""
    print('_HEAD_'.center(50, '*'))
    print(dataframe.head(), '\n')
    print('_TAIL_'.center(50, '*'))
    print(dataframe.tail(), '\n')
    print('_SHAPE_'.center(50, '*'))
    print(dataframe.shape, '\n')
    print('_DATAFRAME INFO_'.center(50, '*'))
    dataframe.info()  # info() prints directly and returns None, so it is not wrapped in print()
    print()
    print('_COLUMNS_'.center(50, '*'))
    print(dataframe.columns, '\n')
    print('_ANY NULL VALUE_'.center(50, '*'))
    print(dataframe.isna().values.any(), '\n')
    print('_TOTAL NULL VALUES_'.center(50, '*'))
    print(dataframe.isna().sum(), '\n')
    print('_DESCRIBING DATAFRAME_'.center(50, '*'))
    print(dataframe.describe(percentiles=[0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99]).T)

check_dataframe(women)
check_dataframe(men)

# categorical columns
cat_col_women = [col for col in women.columns if women[col].dtypes == 'O']
cat_col_men = [col for col in men.columns if men[col].dtypes == 'O']
# removing "date" variable from categorical columns
# (!= is used because "col not in 'date'" would be a substring test against the string 'date')
cat_col_women = [col for col in cat_col_women if col != 'date']
cat_col_men = [col for col in cat_col_men if col != 'date']
print(cat_col_women, cat_col_men)

for col in cat_col_women:
    print(women[col].value_counts())

for col in cat_col_men:
    print(men[col].value_counts())

# Converting the data type of the "date" variable to datetime
women['date'] = pd.to_datetime(women['date'])
men['date'] = pd.to_datetime(men['date'])
print(women['date'].dtypes, men['date'].dtypes)

Filtering the data
Filter the data to only include official FIFA World Cup matches that took place after 2002-01-01.
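# .copy() gives independent DataFrames, so adding a column later does not write to a slice of the originals
df_women = women.loc[(women['tournament'] == 'FIFA World Cup') & (women['date'] > '2002-01-01')].copy()
df_men = men.loc[(men['tournament'] == 'FIFA World Cup') & (men['date'] > '2002-01-01')].copy()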
print(df_women.head(), '\n')
print(df_men.head())

# Total goals per match
df_women['goals_scored'] = df_women['home_score'] + df_women['away_score']
df_men['goals_scored'] = df_men['home_score'] + df_men['away_score']

# Choosing the correct hypothesis test
## Use EDA to determine the appropriate hypothesis test for this dataset and scenario.

df_women['goals_scored'].plot(kind='hist');
df_men['goals_scored'].plot(kind='hist');

# Calculation of mean goal scores for women and determination of normality
print(f"The mean of goals scored for women is {df_women['goals_scored'].mean()}.")
statistic, p_value = shapiro(df_women['goals_scored'])
print(f"Statistic: {statistic} and P value: {p_value}")
alpha = 0.01
# H0 here is the Shapiro-Wilk null: the sample comes from a normal distribution
if p_value > alpha:
    print("The data looks normally distributed (fail to reject H0)", '\n')
else:
    print("The data does not look normally distributed (reject H0)", '\n')

# Calculation of mean goal scores for men and determination of normality
print(f"The mean of goals scored for men is {df_men['goals_scored'].mean()}.")
statistic, p_value = shapiro(df_men['goals_scored'])
print(f"Statistic: {statistic} and P value: {p_value}")
alpha = 0.01
if p_value > alpha:
    print("The data looks normally distributed (fail to reject H0)", '\n')
else:
    print("The data does not look normally distributed (reject H0)", '\n')
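
Goal counts are small non-negative integers, so a normality check like the one above will often reject. If it does for either group, one reasonable candidate is a one-sided Wilcoxon-Mann-Whitney test, which does not assume normality. The sketch below is only an illustration of that option, not necessarily the test this analysis ends up using. It uses SciPy's mannwhitneyu rather than pingouin, and the two arrays are synthetic stand-ins (assumed Poisson counts) for df_women['goals_scored'] and df_men['goals_scored'], not real World Cup results.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic stand-ins for the two goals_scored columns.
# These values are made up for illustration and are NOT World Cup data.
rng = np.random.default_rng(0)
women_goals = rng.poisson(lam=3.0, size=200)
men_goals = rng.poisson(lam=2.5, size=400)

# One-sided test: do women's matches tend to have more goals than men's?
# alternative='greater' refers to the first sample relative to the second.
result = mannwhitneyu(women_goals, men_goals, alternative='greater')

alpha = 0.10  # the 10% significance level stated for the study
decision = 'reject' if result.pvalue <= alpha else 'fail to reject'

print(f"U statistic: {result.statistic}, p-value: {result.pvalue:.4f}")
print(f"Decision at alpha={alpha}: {decision} the null hypothesis")
```

With the real data, the two synthetic arrays would be replaced by the filtered goals_scored columns, keeping the women's sample as the first argument so that 'greater' matches the direction of the alternative hypothesis.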