Project: Hypothesis Testing with Men's and Women's Soccer Matches

You're working as a sports journalist at a major online sports media company, specializing in soccer analysis and reporting. You've been watching both men's and women's international soccer matches for a number of years, and your gut instinct tells you that more goals are scored in women's international football matches than men's. This would make an interesting investigative article that your subscribers are bound to love, but you'll need to perform a valid statistical hypothesis test to be sure!

While scoping this project, you acknowledge that the sport has changed a lot over the years, and performances likely vary a lot depending on the tournament, so you decide to limit the data used in the analysis to only official FIFA World Cup matches (not including qualifiers) since 2002-01-01.

You create two datasets containing the results of every official men's and women's international football match since the 19th century, which you scraped from a reliable online source. This data is stored in two CSV files: women_results.csv and men_results.csv.

The question you are trying to determine the answer to is:

Are more goals scored in women's international soccer matches than men's?

You assume a 10% significance level, and use the following null and alternative hypotheses:

$H_0$: The mean number of goals scored in women's international soccer matches is the same as men's.

$H_A$: The mean number of goals scored in women's international soccer matches is greater than men's.


# Importing necessary modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pingouin
from scipy.stats import shapiro

# Reading the data

men_df = pd.read_csv('men_results.csv')
women_df = pd.read_csv('women_results.csv')

# Examining the data
men_df.head()

# Creating a function that filters by tournament and date, removes unwanted columns, and computes total goals per match

def clean_data(*dfs):
    """
    Cleans multiple DataFrames by:
    1. Filtering for FIFA World Cup matches.
    2. Filtering on the date column (keeping data from 2002-01-01 onwards).
    3. Dropping unwanted columns.
    4. Adding a total_score column (home_score + away_score) for each match.
    
    Parameters:
    *dfs (pd.DataFrame): Multiple DataFrames to clean.

    Returns:
    tuple: Cleaned DataFrames.
    """
    cleaned_dfs = []
    
    for df in dfs:
        # Work on a copy so the caller's DataFrame is left unchanged
        df = df.copy()

        # Ensure the date column is in datetime format
        df['date'] = pd.to_datetime(df['date'])

        # Filter data for FIFA World Cup matches
        df = df[df['tournament'] == 'FIFA World Cup']

        # Filter data from 2002-01-01 onwards (copy to avoid SettingWithCopyWarning below)
        df = df[df['date'] >= '2002-01-01'].copy()

        # Drop the leftover CSV index column if it is present
        df = df.drop(columns='Unnamed: 0', errors='ignore')

        # Total goals per match: home goals plus away goals
        df['total_score'] = df['home_score'] + df['away_score']

        cleaned_dfs.append(df)
    
    return tuple(cleaned_dfs)  # Return cleaned DataFrames as a tuple

clean_men, clean_women = clean_data(men_df, women_df)

# Before concatenating the datasets into one, add a column identifying which dataset each row came from
clean_men['gender'] = 'male'
clean_women['gender'] = 'female'

df = pd.concat([clean_men, clean_women], ignore_index=True)
df.head()

# Determining what type of hypothesis test to do: plot the distribution of total goals by gender
g = sns.displot(data=df, x='total_score', kind='kde', col='gender')
g.set_axis_labels('Total Score', 'Density')  # label every facet, not only the last one
plt.suptitle('Distribution of Total Scores by Gender', y=1.05)
plt.show()

# Perform Shapiro-Wilk test for normality on total scores for both genders
shapiro_test_male = shapiro(df[df['gender'] == 'male']['total_score'])
shapiro_test_female = shapiro(df[df['gender'] == 'female']['total_score'])

print(f"The p-values for the male and female Shapiro–Wilk tests are {shapiro_test_male[1]} and {shapiro_test_female[1]}, respectively")

for gender, p_val in [('male', shapiro_test_male[1]), ('female', shapiro_test_female[1])]:
    if p_val <= 0.05:
        print(f"{gender}: reject the null hypothesis of normality; the data do not appear to be normally distributed, so a non-parametric hypothesis test is needed")
    else:
        print(f"{gender}: fail to reject the null hypothesis of normality; the data are consistent with a normal distribution, so a parametric hypothesis test could be used")

The KDE plots show that the data have a long tail and appear skewed. The Shapiro–Wilk tests confirm that the data are not normally distributed, which is why a non-parametric hypothesis test is needed.

The Wilcoxon-Mann-Whitney test is the non-parametric test used here, since the men's and women's matches form two independent (unpaired) groups.
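As a rough illustration of how a one-sided Wilcoxon-Mann-Whitney test behaves, the sketch below runs SciPy's `mannwhitneyu` on small made-up goal counts rather than the World Cup data. The helper name `one_sided_mwu` and the sample values are assumptions introduced only for this example; the analysis itself uses `pingouin.mwu`.

```python
from scipy.stats import mannwhitneyu


def one_sided_mwu(x, y, alpha=0.10):
    """Test whether values in x tend to be greater than values in y.

    Hypothetical helper for illustration; not part of the project code.
    """
    # alternative='greater' asks whether x is stochastically greater than y
    res = mannwhitneyu(x, y, alternative='greater')
    decision = "reject" if res.pvalue <= alpha else "fail to reject"
    return {"p_val": float(res.pvalue), "result": decision}


# Made-up total goals per match (assumed values, not real results)
toy_women = [3, 4, 5, 6, 7, 8]
toy_men = [0, 1, 1, 2, 2, 3]

print(one_sided_mwu(toy_women, toy_men))  # clearly shifted samples
print(one_sided_mwu(toy_men, toy_men))    # identical samples
```

The order of the arguments matters for a one-sided test: the group expected to be larger goes first when `alternative='greater'` is used.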

# Performing Wilcoxon-Mann-Whitney test
# alternative='greater' tests whether x (women) tends to be greater than y (men)
mwu_result = pingouin.mwu(x=df[df['gender'] == 'female']['total_score'],
                          y=df[df['gender'] == 'male']['total_score'],
                          alternative='greater')

p_val = mwu_result['p-val'].values[0]
print(f"Wilcoxon-Mann-Whitney test p-value is {p_val}\n")


null_hypothesis = "The mean number of goals scored in women's international soccer matches is the same as men's."
alternate_hypothesis = "The mean number of goals scored in women's international soccer matches is greater than men's."

alpha = 0.10
if p_val <= alpha:
    result = "reject"
    print(f"Reject the null hypothesis at the {alpha:.0%} significance level. The data support the alternative: {alternate_hypothesis}")
else:
    result = "fail to reject"
    print(f"Fail to reject the null hypothesis at the {alpha:.0%} significance level. There is not enough evidence against: {null_hypothesis}")

# Result
result_dict = {"p_val": p_val, "result": result}