Hypothesis Testing - World Cup

Scenario

  • I am working as a sports journalist at a major online sports media company, specialising in football/soccer analysis and reporting.
  • I've been watching both men's and women's international soccer matches for a number of years.
  • My gut instinct tells me that women's international football matches have more goals scored than men's.
  • This would make an interesting investigative article that subscribers are bound to love, but I'll need to perform a valid statistical hypothesis test to be sure!

While scoping this project, I acknowledge that the sport has changed a lot over the years and that performances likely vary a lot depending on the tournament, so I decide to limit the data used in the analysis to official FIFA World Cup matches (not including qualifiers) played since 2002-01-01.
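As a minimal sketch of that scoping rule, the snippet below filters a tiny hand-made table down to FIFA World Cup matches on or after 2002-01-01. The column names `date`, `tournament`, `home_score` and `away_score` follow the ones used in this workbook; the rows themselves are toy values for illustration, not taken from the CSV files.

```python
import pandas as pd

# toy rows for illustration only (not the real data)
toy = pd.DataFrame({
    "date": ["1998-07-12", "2002-06-30", "2015-07-05", "2016-03-29"],
    "tournament": ["FIFA World Cup", "FIFA World Cup", "FIFA World Cup", "Friendly"],
    "home_score": [3, 0, 5, 1],
    "away_score": [0, 2, 2, 1],
})
toy["date"] = pd.to_datetime(toy["date"])

# keep only World Cup matches played on or after the cut-off date
cutoff = pd.Timestamp("2002-01-01")
in_scope = toy[(toy["tournament"] == "FIFA World Cup") & (toy["date"] >= cutoff)]
print(in_scope)
```

Both conditions are combined in one boolean mask, so the tournament and date restrictions are applied in a single step.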

I create two datasets containing the results of every official men's and women's international football match since the 19th century, scraped from a reliable online source. This data is stored in two CSV files: women_results.csv and men_results.csv.

The question: Are more goals scored in women's international soccer matches than men's?

I assume a 10% significance level, and use the following null and alternative hypotheses:

  • H0: The mean number of goals scored in women's international soccer matches is the same as men's.
  • HA: The mean number of goals scored in women's international soccer matches is greater than men's.
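To make the decision rule concrete, here is a rough sketch of a one-sided test of these hypotheses at the 10% level. It uses a simple permutation test on the difference in means, which is a technique chosen for this illustration and not necessarily the test applied to the World Cup data; the function name, the number of resamples and the goal totals are all made up for the example.

```python
import random

def one_sided_permutation_test(women, men, n_resamples=5000, seed=0):
    """Estimate P(mean difference >= observed) under H0 by shuffling group labels."""
    rng = random.Random(seed)
    observed = sum(women) / len(women) - sum(men) / len(men)
    pooled = list(women) + list(men)
    n_women = len(women)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        first, rest = pooled[:n_women], pooled[n_women:]
        diff = sum(first) / len(first) - sum(rest) / len(rest)
        if diff >= observed:
            extreme += 1
    # add-one correction keeps the estimated p-value strictly above zero
    return observed, (extreme + 1) / (n_resamples + 1)

# made-up goal totals per match, for illustration only
women_goals = [3, 4, 2, 5, 1, 6, 3, 4, 2, 7]
men_goals = [2, 1, 3, 2, 0, 4, 1, 2, 3, 2]

alpha = 0.10  # the 10% significance level stated above
diff, p_value = one_sided_permutation_test(women_goals, men_goals)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"difference in means = {diff:.2f}, p = {p_value:.4f}: {decision}")
```

Only differences at least as large as the observed one count as extreme, which is what makes the test one-sided in the direction of the alternative hypothesis.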
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl # for default style
import seaborn as sns
import scipy.stats as st

# standardise appearance of visualisations
sns.set_palette("tab10")
mpl.rcParams['axes.titleweight'] = 'bold'
mpl.rcParams['figure.titleweight'] = 'bold'
mpl.rcParams['font.weight'] = 'regular'
mpl.rcParams['axes.labelweight'] = 'regular'
mpl.rcParams['axes.titlesize'] = 12
mpl.rcParams['axes.labelsize'] = 10
sns.set_style("ticks")

# read the data
women_results = pd.read_csv('women_results.csv', index_col=[0])
men_results = pd.read_csv('men_results.csv', index_col=[0])

# create gender column
women_results['gender'] = 'women'
men_results['gender'] = 'men'

# combine into a single data frame
results = pd.concat([women_results,
                     men_results],
                    axis = 0,
                    ignore_index = True)

# calculate total score
results['total_score'] = results['home_score'] + results['away_score']

# preview the combined df
results.head()
# convert date format
results['date'] = pd.to_datetime(results['date'])

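# convert text (object) columns to the category dtype
for col in results.select_dtypes(include='object').columns:
    results[col] = results[col].astype('category')

# list the FIFA tournaments that actually occur in the data
# (value_counts on a categorical column also reports zero-count categories)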
FIFA_tournaments = results[results['tournament'].str.contains('FIFA')]
value_counts = FIFA_tournaments['tournament'].value_counts()
filtered_counts = value_counts[value_counts > 0].reset_index()
filtered_counts.columns = ['tournament', 'counts']
filtered_counts
# subset to FIFA World Cup matches only
worldcup = results[results['tournament'] == 'FIFA World Cup']

# drop the 'tournament' column
worldcup = worldcup.drop(columns='tournament')

# preview df
worldcup.head()
# check for nulls, size, data types
worldcup.info()

# summarise the numerical fields
worldcup.groupby('gender')['total_score'].describe().round(2).T

Observations for matches

  • 10x more men's matches
  • same median and minimum, but the women's distribution is more skewed, i.e. the men's mean and median are closer together, as they would be in a normal distribution
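The link between skew and the gap between mean and median can be checked on small examples. The sketch below uses made-up numbers (not the World Cup data) and a hand-written moment-based skewness helper, which is an assumption of this illustration and not a function from the workbook.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Population (moment-based) skewness: the mean of the cubed z-scores."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

# made-up samples for illustration only
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 11]

for name, sample in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(f"{name}: mean={mean(sample):.2f}, "
          f"median={median(sample)}, skewness={skewness(sample):.2f}")
```

In the symmetric sample the mean and median coincide and the skewness is zero, while a single high-scoring value pulls the mean above the median and makes the skewness positive.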
## The summary shows a higher std and 
# get the year from the date
worldcup['year'] = worldcup['date'].dt.year

# plot the number of World Cup matches per year, split by gender
sns.histplot(data=worldcup, x='year', binwidth=1, alpha=0.6, hue='gender')
plt.title('Number of Games per Year')
plt.xlabel('Year')
plt.ylabel('Number of Games')
plt.show()
# earliest year with a women's World Cup match in the data
print(worldcup[worldcup['gender'] == 'women']['year'].min())

1. Exploratory data analysis

  • Load the data from men_results.csv and women_results.csv to understand its contents.
  • Determine the column names, data types, and values.
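The two steps above could look roughly like the following sketch. It reads a small in-memory stand-in for one of the CSV files, so the rows are illustrative only, and the file layout (an unnamed index column, plus `home_team` and `away_team` fields) is an assumption; only `date`, `home_score`, `away_score` and `tournament` are confirmed by the code in this workbook.

```python
import io
import pandas as pd

# illustrative rows standing in for one of the CSV files (layout is assumed)
csv_text = """,date,home_team,away_team,home_score,away_score,tournament
0,2019-06-07,France,South Korea,4,0,FIFA World Cup
1,2019-06-08,Germany,China PR,1,0,FIFA World Cup
2,2019-08-29,Wales,Faroe Islands,6,0,UEFA Euro qualification
"""

sample = pd.read_csv(io.StringIO(csv_text), index_col=[0])

print(sample.columns.tolist())          # column names
print(sample.dtypes)                    # data type of each column
print(sample.nunique())                 # number of distinct values per column
print(sample['tournament'].unique())    # the distinct tournament labels
```

Swapping `io.StringIO(csv_text)` for a file path such as `'women_results.csv'` would run the same checks on the real file.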
worldcup.groupby('gender')['total_score'].describe().round(2).T

Despite filterg

# plot a density-normalised histogram of total scores for each gender
sns.histplot(data=worldcup, x='total_score', hue='gender', stat='density', common_norm=False, binwidth=1)
plt.title('distribution of total scores by gender')
plt.xlabel('total score')
plt.ylabel('density')
plt.show()
# calculate the standard deviation of total scores for each year by gender
std_devs = worldcup.groupby(['year', 'gender'])['total_score'].std().reset_index()

# pivot the data for easier plotting
std_devs_pivot = std_devs.pivot(index='year', columns='gender', values='total_score')

# plot the standard deviations using a line graph with markers
ax = std_devs_pivot.plot(kind='line', marker='o', figsize=(12, 6))
ax.set_title('standard deviations of total scores by year and gender')
ax.set_xlabel('year')
ax.set_ylabel('standard deviation of total scores')
ax.set_ylim(bottom=0)  # start y-axis at 0
plt.legend(title='gender')
plt.grid(True)
plt.show()