Project: Hypothesis Testing with Men's and Women's Soccer Matches

You're working as a sports journalist at a major online sports media company, specializing in soccer analysis and reporting. You've been watching both men's and women's international soccer matches for a number of years, and your gut instinct tells you that more goals are scored in women's international football matches than men's. This would make an interesting investigative article that your subscribers are bound to love, but you'll need to perform a valid statistical hypothesis test to be sure!

While scoping this project, you acknowledge that the sport has changed a lot over the years, and performances likely vary a lot depending on the tournament, so you decide to limit the data used in the analysis to only official FIFA World Cup matches (not including qualifiers) since 2002-01-01.

You create two datasets containing the results of every official men's and women's international football match since the 19th century, which you scraped from a reliable online source. This data is stored in two CSV files: women_results.csv and men_results.csv.

The question you are trying to determine the answer to is:

Are more goals scored in women's international soccer matches than men's?

You assume a 10% significance level, and use the following null and alternative hypotheses:

$H_0$: The mean number of goals scored in women's international soccer matches is the same as men's.

$H_A$: The mean number of goals scored in women's international soccer matches is greater than men's.
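The same pair of hypotheses can be written symbolically. The symbols $\mu_w$ and $\mu_m$ (mean goals per match in women's and men's matches) and $\alpha$ for the significance level are introduced here for illustration only and do not appear in the original project text. This is a one-sided (right-tailed) test:

```latex
\begin{aligned}
H_0 &: \mu_w = \mu_m \\
H_A &: \mu_w > \mu_m \\
\text{Reject } H_0 &\text{ if } p < \alpha, \quad \alpha = 0.10
\end{aligned}
```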

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind, shapiro, mannwhitneyu

# Load the full datasets of women's and men's international match results
women_df = pd.read_csv('women_results.csv', index_col=0, parse_dates=['date'])
men_df = pd.read_csv('men_results.csv', index_col=0, parse_dates=['date'])

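def subset_df(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy containing only FIFA World Cup matches played on or after 2002-01-01."""
    # .copy() avoids a SettingWithCopyWarning when columns are added to the subset later
    return df.query("tournament == 'FIFA World Cup' and date >= '2002-01-01'").copy()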

# Subset the dataframes to include only FIFA World Cup matches from 2002 onwards
men_samp = subset_df(men_df)
women_samp = subset_df(women_df)

# Calculate the total score for each match by summing the home and away scores
men_samp['total_score'] = men_samp[["home_score", "away_score"]].sum(axis=1)
women_samp['total_score'] = women_samp[["home_score", "away_score"]].sum(axis=1)

# Plot histograms of total scores for men's and women's FIFA World Cup matches
# to help choose which statistical test to use

# Use one bin per goal count, wide enough to include the highest-scoring match
# in either sample (fixed edges of np.arange(9) would drop matches with more than 8 goals)
max_goals = int(max(men_samp['total_score'].max(), women_samp['total_score'].max()))
bins = np.arange(max_goals + 2)  # Bin edges 0, 1, ..., max_goals + 1

# Plot histogram for men's total scores
ax = men_samp['total_score'].hist(
    bins=bins,          # Shared bin edges for the histogram
    color='darkblue',   # Set the color of the bars for men's data
    alpha=0.6,          # Set the transparency level of the bars
    label='Men'         # Label for the legend
)

# Plot histogram for women's total scores on the same axes
ax = women_samp['total_score'].hist(
    bins=bins,          # Use the same bin edges for consistency
    color='orange',     # Set the color of the bars for women's data
    alpha=0.5,          # Set the transparency level of the bars
    label='Women',      # Label for the legend
    ax=ax               # Plot on the same axes as the men's histogram
)

# Set the title and labels for the axes
ax.set(
    title="All goals in FIFA World Cup matches",
    xlabel="Total number of goals per match",
    ylabel="Number of matches"
)

plt.legend()
plt.show()

# Set the significance level for statistical tests
alpha = 0.1

# Calculate the number of samples for men and women
n_men, n_women = len(men_samp), len(women_samp)

# Calculate the mean total score for men and women
men_mean, women_mean = men_samp['total_score'].mean(), women_samp['total_score'].mean()

# Calculate the variance of total scores for men and women
men_var, women_var = men_samp['total_score'].var(), women_samp['total_score'].var()

# Print the sample statistics
print(f"Men's FIFA World Cup Matches: {n_men} samples, Mean Goals: {men_mean:.2f}, Variance: {men_var:.2f}")
print(f"Women's FIFA World Cup Matches: {n_women} samples, Mean Goals: {women_mean:.2f}, Variance: {women_var:.2f}")

# Perform the Shapiro-Wilk test for normality on total scores
men_shapiro = shapiro(men_samp['total_score'])
women_shapiro = shapiro(women_samp['total_score'])

print(f"Shapiro-Wilk Test for Men's Total Scores: Statistic={men_shapiro.statistic:.4f}, p-value={men_shapiro.pvalue:.4f}")
print(f"Shapiro-Wilk Test for Women's Total Scores: Statistic={women_shapiro.statistic:.4f}, p-value={women_shapiro.pvalue:.4f}")

We use the non-parametric Mann-Whitney U test because the Shapiro-Wilk results indicate that the goal totals are not normally distributed, which agrees with the histogram. The Mann-Whitney U test does not assume normality: it ranks the observations from two independent samples and tests whether values in one group tend to be larger than values in the other. Strictly speaking, it tests for a shift between the two distributions rather than a difference in means, so it addresses the hypotheses above only indirectly.
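To make the rank-based idea concrete, here is a minimal sketch of the U statistic computed directly from its pairwise definition. The function name `mann_whitney_u` and the goal totals are hypothetical and chosen only for illustration; they are not taken from the CSV files. Recent SciPy versions report this same quantity for the first sample passed to `mannwhitneyu`, though that is worth checking against the version in use.

```python
from itertools import product

def mann_whitney_u(x, y):
    """U statistic for sample x: each pair with x_i > y_j counts 1, each tie counts 0.5."""
    return sum(
        1.0 if xi > yj else 0.5 if xi == yj else 0.0
        for xi, yj in product(x, y)
    )

# Hypothetical goal totals per match (illustrative only, not real data)
women_goals = [3, 4, 2, 5, 1]
men_goals = [2, 1, 3, 0]

n_pairs = len(women_goals) * len(men_goals)
u_women = mann_whitney_u(women_goals, men_goals)
u_men = mann_whitney_u(men_goals, women_goals)

# Share of (women, men) pairs in which the women's match had more goals (ties count half)
prob_superiority = u_women / n_pairs

print(f"U (women) = {u_women}, U (men) = {u_men}, share = {prob_superiority:.3f}")
```

The two U statistics always sum to the number of pairs, so a U for the women's sample well above half of `n_pairs` points in the direction of the alternative hypothesis. The p-value then measures how surprising that imbalance would be if both samples came from the same distribution.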

For comparison, we also run parametric t-tests, which assume normality, to see how the results differ.
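As a rough sketch of what the unequal-variance (Welch) version of the t-test computes, the statistic can be built from the same ingredients as the summary statistics printed earlier: sample sizes, means, and variances. The helper `welch_t` and the goal totals below are hypothetical and purely illustrative. The sketch stops at the statistic and degrees of freedom, since the p-value needs the t distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n_x, n_y = len(x), len(y)
    # Squared standard error contributed by each sample (sample variance / n)
    se2_x = variance(x) / n_x
    se2_y = variance(y) / n_y
    t_stat = (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
    dof = (se2_x + se2_y) ** 2 / (se2_x ** 2 / (n_x - 1) + se2_y ** 2 / (n_y - 1))
    return t_stat, dof

# Hypothetical goal totals per match (illustrative only, not real data)
women_goals = [3, 4, 2, 5, 1]
men_goals = [2, 1, 3, 0]

t_stat, dof = welch_t(women_goals, men_goals)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

A positive statistic means the first sample has the larger mean, which is why the women's sample is passed first when the alternative is `'greater'`.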

# Perform the parametric T-tests
# Standard independent 2-sample T-test (assumes equal variances)
t_2samp = ttest_ind(women_samp["total_score"], men_samp["total_score"], alternative='greater')

# Welch’s T-test (does not assume equal variances)
t_welch = ttest_ind(women_samp["total_score"], men_samp["total_score"], alternative='greater', equal_var=False)

# Display the results of the T-tests
print(f"Standard 2-sample T-test (equal variances): statistic={t_2samp.statistic:.4f}, p-value={t_2samp.pvalue:.4f}")
print(f"Welch’s T-test (unequal variances): statistic={t_welch.statistic:.4f}, p-value={t_welch.pvalue:.4f}")

As the results of both parametric tests (the standard two-sample t-test and Welch’s t-test) show, the p-values are low enough to suggest rejecting the null hypothesis. Under the assumption of normality, this indicates that significantly more goals are scored, on average, in the women's matches than in the men's.

However, since our data does not follow a normal distribution, we rely on a non-parametric test for a more reliable conclusion.

We therefore base the final conclusion on the Mann-Whitney U test, which does not assume normality and is more appropriate for our data.

# Perform the Mann-Whitney U-test
mwu_res = mannwhitneyu(women_samp["total_score"], men_samp["total_score"], alternative='greater')

# Extract the p-value from the test result
p_val = mwu_res.pvalue

# Determine the result using the 10% significance level stated with the hypotheses
alpha = 0.10
result = "reject" if p_val < alpha else "fail to reject"

# Minimal result format required for project submission
# result_dict = {'p_val': p_val, 'result': result}

# Create a dictionary to store the detailed results
result_dict = {
    'test_statistic': mwu_res.statistic,
    'p_value': p_val,
    'alpha': alpha,
    'result': result,
    'interpretation': (
        "There is statistically significant evidence that more goals are scored in women's "
        "international soccer matches than in men's."
        if result == "reject" else
        "There is not enough evidence to conclude that more goals are scored in women's "
        "international soccer matches than in men's."
    )
}

result_dict