Project: Hypothesis Testing with Men's and Women's Soccer Matches

You're working as a sports journalist at a major online sports media company, specializing in soccer analysis and reporting. You've been watching both men's and women's international soccer matches for a number of years, and your gut instinct tells you that more goals are scored in women's international football matches than men's. This would make an interesting investigative article that your subscribers are bound to love, but you'll need to perform a valid statistical hypothesis test to be sure!

While scoping this project, you acknowledge that the sport has changed a lot over the years, and performances likely vary a lot depending on the tournament, so you decide to limit the data used in the analysis to only official FIFA World Cup matches (not including qualifiers) since 2002-01-01.

You create two datasets containing the results of every official men's and women's international football match since the 19th century, which you scraped from a reliable online source. This data is stored in two CSV files: women_results.csv and men_results.csv.

The question you are trying to determine the answer to is:

Are more goals scored in women's international soccer matches than men's?

You assume a 10% significance level, and use the following null and alternative hypotheses:

H_0: The mean number of goals scored in women's international soccer matches is the same as men's.

H_A: The mean number of goals scored in women's international soccer matches is greater than men's.

#importing and loading
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
women = pd.read_csv('women_results.csv')
men = pd.read_csv('men_results.csv')

#First, some EDA to get an idea of the women's and men's data.
#Below is an example of what I did with the women's data (I did the same with the men's data, not shown)

# women.shape   -> (4884, 7): 4884 rows, 7 columns
# women.describe(), women.info()
# women['home_team'].value_counts()   -> 198 distinct women's home teams; the same call on 'away_team' gives 196

#keep only FIFA World Cup matches: 284 rows
w_fifa = women[women['tournament']=='FIFA World Cup']

#total goals (home + away) across all women's matches: 18091
women['total_goals']= women['home_score'] + women['away_score']
women['total_goals'].sum()

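#noticed the date column was an object (string) dtype, so parsed it with pd.to_datetime and kept only matches on or after Jan. 1, 2002, as per instructions, to create the subset dataframe 'w_fifa_sub'
#pd.to_datetime returns a new Series rather than modifying the column, so its result is used directly in the filter
#.copy() makes the subset an independent dataframe, so columns can be added to it without a SettingWithCopyWarning
w_fifa_sub=w_fifa[pd.to_datetime(w_fifa['date'])>= '2002-01-01'].copy()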
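
The date step above depends on how pd.to_datetime behaves: it returns a new Series and leaves the original column untouched unless the result is assigned back. The sketch below shows this on a tiny, made-up results table (the `matches` frame and its values are hypothetical, not taken from the CSV files).

```python
import pandas as pd

# Hypothetical miniature results table; the real data comes from the CSV files.
matches = pd.DataFrame({
    "date": ["1998-07-12", "2002-06-30", "2019-07-07"],
    "tournament": ["FIFA World Cup"] * 3,
    "home_score": [3, 0, 2],
    "away_score": [0, 2, 0],
})

# Calling pd.to_datetime without assignment leaves the column unchanged.
pd.to_datetime(matches["date"])
dtype_without_assignment = matches["date"].dtype

# Assigning the result back actually converts the column.
matches["date"] = pd.to_datetime(matches["date"])
dtype_with_assignment = matches["date"].dtype

# .copy() makes the filtered subset independent of the original frame,
# so adding a column does not trigger a SettingWithCopyWarning.
recent = matches[matches["date"] >= "2002-01-01"].copy()
recent["total_goals"] = recent["home_score"] + recent["away_score"]

print(dtype_without_assignment, "->", dtype_with_assignment)
print(recent)
```

Whether to convert the column in place or only parse it inside the filter is a design choice; both give the same subset of rows.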

#women's FIFA World Cup games since Jan. 1, 2002 (w_fifa_sub): 200 games, mean of total_goals was 2.98
w_fifa_sub['w_home_score']=w_fifa_sub['home_score']
w_fifa_sub['w_away_score']=w_fifa_sub['away_score']
w_fifa_sub['total_goals']=w_fifa_sub['w_home_score']+w_fifa_sub['w_away_score']
#string names avoid the deprecation warning that newer pandas raises for np.min, np.max, etc. inside agg
w_fifa_sub['total_goals'].agg(['min', 'max', 'sum', 'mean'])

#men's FIFA World Cup games since Jan. 1, 2002 (m_fifa_sub): 384 games, mean of total_goals was 2.51
m_fifa = men[men['tournament']=='FIFA World Cup']
#parse the dates inside the filter (pd.to_datetime does not modify the column in place) and copy the subset
m_fifa_sub=m_fifa[pd.to_datetime(m_fifa['date'])>= '2002-01-01'].copy()
m_fifa_sub['total_goals']=m_fifa_sub['home_score']+m_fifa_sub['away_score']
m_fifa_sub['total_goals'].agg(['min', 'max', 'sum', 'mean'])

#compared the distribution of women's and men's total goals
#discrete=True gives each integer goal count its own bin, so both histograms share the same bin edges
fig, ax = plt.subplots(figsize=(10, 6))
sns.histplot(data=m_fifa_sub, x='total_goals', discrete=True, alpha=0.7, label="Men's", ax=ax)
sns.histplot(data=w_fifa_sub, x='total_goals', discrete=True, alpha=0.7, label="Women's", ax=ax)
ax.set_title("Distribution of Men's and Women's Total Goals in FIFA World Cup Matches Since 2002")
ax.set(xlabel='Total Goals in a Match', ylabel='Count of Matches')
ax.legend()
plt.show()
#both distributions are right-skewed

#The question I'm trying to answer is:
#Are more goals scored in women's international soccer matches than men's?
#I assume a 10% significance level and use the following null and alternative hypotheses:
    #H_0: The mean number of goals scored in women's international soccer matches is the same as men's.
    #H_A: The mean number of goals scored in women's international soccer matches is greater than men's.
#we are comparing 2 means, so I first think of a t-test
#are they paired? no, the two samples are independent of each other, so it would be an unpaired (two-sample) test
#however, the histograms above show that the data is right-skewed, so the t-test's normality assumption is doubtful
#let's check normality with the Shapiro-Wilk test, whose null hypothesis is that the sample comes from a normal distribution
#the significance level was set to 10%
#if the p-value is < 0.1, the data deviates significantly from a normal distribution

from scipy import stats
w_shapiro= stats.shapiro(w_fifa_sub['total_goals'])
print(w_shapiro)
m_shapiro= stats.shapiro(m_fifa_sub['total_goals'])
print(m_shapiro)
#both p-values are far below the 0.1 significance level, so we reject normality for both samples
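
To illustrate the normality check above, here is a minimal sketch that applies the Shapiro-Wilk test to two synthetic samples: one drawn from a normal distribution and one made of Poisson counts, which resemble goal totals in being small non-negative integers. The samples, the seed, and the variable names are assumptions for illustration only, not the World Cup data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical samples for illustration (not the World Cup data).
normal_sample = rng.normal(loc=2.5, scale=1.0, size=300)
count_sample = rng.poisson(lam=2.5, size=300)  # discrete, right-skewed counts

alpha = 0.10  # same significance level as in the analysis

for name, sample in [("normal", normal_sample), ("poisson counts", count_sample)]:
    statistic, p_value = stats.shapiro(sample)
    # A small p-value is evidence against the null hypothesis of normality.
    verdict = "reject normality" if p_value < alpha else "no evidence against normality"
    print(f"{name}: W={statistic:.3f}, p={p_value:.4g} -> {verdict}")
```

Count data of this kind is discrete and skewed, which is the same pattern that makes a rank-based test attractive for the goal totals.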

#since the data is not normally distributed and the samples are unpaired, we will use a Wilcoxon-Mann-Whitney rank test
#the following steps prepare the data so that only the two columns to compare, 'women' and 'men', remain

#keep just the total_goals column of each subset and label every row with its group
w_goals = w_fifa_sub[['total_goals']].assign(gender='women')
m_goals = m_fifa_sub[['total_goals']].assign(gender='men')

#stack the two subsets into one long dataframe with a fresh, unique index
w_m = pd.concat([w_goals, m_goals], ignore_index=True)

#pivot to wide format: one column per group, with NaN where a row belongs to the other group
w_m_wide = w_m.pivot(columns='gender', values='total_goals')
w_m_wide

#import pingouin to use the Wilcoxon-Mann-Whitney rank test
#alternative='greater' makes the test one-sided, matching the alternative hypothesis that women's matches have more goals
import pingouin
result_df=pingouin.mwu(x=w_m_wide['women'].dropna(), y=w_m_wide['men'].dropna(), alternative='greater')
#np.float was removed from NumPy, so the built-in float is used to extract the p-value
p_val = float(result_df['p-val'].iloc[0])
#compare the p-value with the 10% significance level
result = "reject" if p_val < 0.1 else "fail to reject"
result_dict={"p_val":p_val, "result": result}
result_dict
#since the p-value is < 0.1, we reject the null hypothesis: the data support the alternative hypothesis that more goals are scored in women's international soccer matches than in men's.
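
To make the rank test above less of a black box, the following sketch computes the Mann-Whitney U statistic by hand in plain Python. It illustrates the idea behind the test and is not a description of how pingouin implements it; the function name `mann_whitney_u` and the two short goal lists are hypothetical.

```python
def mann_whitney_u(x, y):
    """Return the U statistic for sample x against sample y.

    U equals the number of (x_i, y_j) pairs with x_i > y_j, counting ties
    as one half. It is computed here from the rank sum of x in the pooled
    sample, using average ranks (midranks) for tied values.
    """
    pooled = sorted(list(x) + list(y))

    # Assign each distinct value the average of the 1-based ranks it occupies.
    midrank = {}
    start = 0
    while start < len(pooled):
        end = start
        while end < len(pooled) and pooled[end] == pooled[start]:
            end += 1
        # Positions start..end-1 hold ranks start+1..end; their average:
        midrank[pooled[start]] = (start + 1 + end) / 2
        start = end

    rank_sum_x = sum(midrank[value] for value in x)
    n_x = len(x)
    return rank_sum_x - n_x * (n_x + 1) / 2


# Hypothetical goal totals, for illustration only.
women_goals = [3, 4, 2, 5, 3]
men_goals = [1, 2, 3, 2, 0]

u_women = mann_whitney_u(women_goals, men_goals)
u_men = mann_whitney_u(men_goals, women_goals)
n_pairs = len(women_goals) * len(men_goals)

# U / n_pairs estimates the probability that a randomly chosen women's match
# has more goals than a randomly chosen men's match (ties count one half).
print(u_women, u_men, u_women / n_pairs)
```

The two U values always sum to the number of pairs, so a U for the women's sample close to that maximum indicates that women's totals tend to rank above men's, which is what the one-sided test assesses.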