
You're working as a sports journalist at a major online sports media company, specializing in soccer analysis and reporting. You've been watching both men's and women's international soccer matches for a number of years, and your gut instinct tells you that more goals are scored in women's international football matches than men's. This would make an interesting investigative article that your subscribers are bound to love, but you'll need to perform a valid statistical hypothesis test to be sure!

While scoping this project, you acknowledge that the sport has changed a lot over the years, and performances likely vary a lot depending on the tournament, so you decide to limit the data used in the analysis to only official FIFA World Cup matches (not including qualifiers) since 2002-01-01.

You create two datasets containing the results of every official men's and women's international football match since the 19th century, which you scraped from a reliable online source. This data is stored in two CSV files: women_results.csv and men_results.csv.

The question you are trying to determine the answer to is:

Are more goals scored in women's international soccer matches than men's?

You assume a 10% significance level, and use the following null and alternative hypotheses:

H0: The mean number of goals scored in women's international soccer matches is the same as men's.

HA: The mean number of goals scored in women's international soccer matches is greater than men's.

Importing libraries and reading the data

# Libs
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import qqplot
from scipy.stats import norm, shapiro, mannwhitneyu
# Reading the data

df_women = pd.read_csv('women_results.csv') # Women's Football Matches

df_men = pd.read_csv('men_results.csv') # Men's Football Matches

Getting to know the data

First, we need to verify if both datasets are clean. To do this, I'll be looking at the first rows and the data types of each column.

# First look at Women's matches
print(df_women.head())
print(df_women.info())

We have a total of 4884 women's football matches across all tournaments. However, the question concerns only World Cup matches since 2002, so we need to filter the data. The date column is also stored as a string, so to filter by date we first have to convert it to a datetime type.

# First look at Men's matches
print("First 5 rows of the data frame with the men's football matches: \n")
print(df_men.head())

print("Overall Information men's football matches: \n")
print(df_men.info())

The men's data frame has more than 9 times as many rows as the women's. Its date column is also a string, so we'll have to convert it before filtering.
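To illustrate why the conversion matters, here is a minimal sketch on a made-up three-row table (the rows are invented for illustration, not taken from the CSV files). It checks that the column really becomes a datetime type and that date comparisons work afterwards.

```python
import pandas as pd

# Toy stand-in for one of the results files; the values are made up
toy = pd.DataFrame({
    "date": ["1999-07-10", "2003-09-20", "2019-07-07"],
    "tournament": ["FIFA World Cup", "FIFA World Cup", "Friendly"],
})

# Before conversion the column holds plain strings
was_datetime = pd.api.types.is_datetime64_any_dtype(toy["date"])

# Convert the strings to real datetime values
toy["date"] = pd.to_datetime(toy["date"])
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["date"])

# Comparisons against a date now behave chronologically
recent = toy[toy["date"] >= "2002-01-01"]

print(was_datetime, is_datetime)
print(recent)
```

Checking the dtype after conversion is a cheap way to catch a column that failed to parse before it silently affects the filter.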

As a reminder, I'll filter both data frames using the following restrictions:

  1. Only FIFA World Cup;
  2. Only matches from 2002 onwards.

Filtering the data

To keep only matches from 2002 onwards, I'll convert the date column to datetime with pd.to_datetime and then filter on it. After that, I'll keep only FIFA World Cup matches.

# Convert the date strings to datetime, then keep matches since 2002-01-01 (inclusive)

df_women['date'] = pd.to_datetime(df_women['date'])
df_women = df_women[df_women['date'] >= '2002-01-01']

df_men['date'] = pd.to_datetime(df_men['date'])
df_men = df_men[df_men['date'] >= '2002-01-01']

# Keep only FIFA World Cup matches; .copy() avoids a SettingWithCopyWarning
# when new columns are added to the filtered frames later

df_women = df_women[df_women['tournament'] == 'FIFA World Cup'].copy()
df_men = df_men[df_men['tournament'] == 'FIFA World Cup'].copy()
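The same two conditions can also be written as a single combined boolean mask. The sketch below uses a small made-up table (the rows are hypothetical) and only shows the pattern; it is not meant to replace the filtering above.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "date": pd.to_datetime(["1998-07-12", "2002-06-30", "2010-03-03", "2014-07-13"]),
    "tournament": ["FIFA World Cup", "FIFA World Cup", "Friendly", "FIFA World Cup"],
    "home_score": [3, 0, 1, 1],
    "away_score": [0, 2, 1, 0],
})

# One boolean Series per restriction
is_recent = toy["date"] >= "2002-01-01"
is_world_cup = toy["tournament"] == "FIFA World Cup"

# Combine with & and take an independent copy of the result
filtered = toy[is_recent & is_world_cup].copy()

print(filtered)
```

Naming each mask keeps the two restrictions readable and makes it easy to count how many rows each one removes on its own.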

We want to know whether the mean number of goals scored in women's international soccer matches is greater than in men's. So let's create a new column with the total goals in each match, from which we can compute the mean of each sample.

# Summing up the home and away goals

df_women['total_goals'] = df_women['home_score'] + df_women['away_score']

df_men['total_goals'] = df_men['home_score'] + df_men['away_score']
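As a quick illustration of what this column gives us, here is a sketch on two tiny made-up samples (the scores are invented, not real match results) that computes the total goals per match and the mean of each sample.

```python
import pandas as pd

# Made-up scores, for illustration only
women = pd.DataFrame({"home_score": [3, 1, 2], "away_score": [0, 1, 2]})
men = pd.DataFrame({"home_score": [1, 0, 2], "away_score": [0, 0, 1]})

# Total goals per match = home goals + away goals
for df in (women, men):
    df["total_goals"] = df["home_score"] + df["away_score"]

# Sample means that the hypothesis test will compare
women_mean = women["total_goals"].mean()
men_mean = men["total_goals"].mean()

print(f"Women: {women_mean:.2f} goals per match")
print(f"Men:   {men_mean:.2f} goals per match")
```

A difference in sample means like this is only descriptive; whether it is statistically significant is what the test has to decide.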

Choosing the Statistical Test

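Usually, to compare the means of two groups, we use a t-test. However, the t-test assumes that each group is approximately normally distributed, or that the samples are large enough for the sample means to be approximately normal. The classic Student's t-test also assumes equal variances in both groups.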

# Verifying the sample size
print(len(df_women))
print(len(df_men))
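One way to check the normality assumption is the Shapiro-Wilk test, and if it fails, a rank-based alternative such as the Mann-Whitney U test can be used instead (both are imported above). The sketch below is an assumption about how that check could look, run on synthetic Poisson data with arbitrary sample sizes and rates rather than on the real match data.

```python
import numpy as np
from scipy.stats import shapiro, mannwhitneyu

# Synthetic goal counts; sizes and rates are arbitrary assumptions
rng = np.random.default_rng(0)
women_goals = rng.poisson(lam=3.0, size=100)
men_goals = rng.poisson(lam=2.5, size=100)

alpha = 0.10  # 10% significance level, as stated in the project brief

# Shapiro-Wilk: a small p-value suggests the sample is not normal
_, p_norm_women = shapiro(women_goals)
_, p_norm_men = shapiro(men_goals)
looks_normal = (p_norm_women > alpha) and (p_norm_men > alpha)

# One-sided Mann-Whitney U: are women's totals stochastically greater?
u_stat, p_value = mannwhitneyu(women_goals, men_goals, alternative="greater")

print(f"Normality plausible: {looks_normal}")
print(f"U = {u_stat:.1f}, p-value = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

Note that the Mann-Whitney U test compares the two distributions through ranks rather than comparing means directly, so its conclusion should be worded accordingly.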