You're working as a sports journalist at a major online sports media company, specializing in soccer analysis and reporting. You've been watching both men's and women's international soccer matches for a number of years, and your gut instinct tells you that more goals are scored in women's international football matches than men's. This would make an interesting investigative article that your subscribers are bound to love, but you'll need to perform a valid statistical hypothesis test to be sure!

While scoping this project, you acknowledge that the sport has changed a lot over the years, and performances likely vary a lot depending on the tournament, so you decide to limit the data used in the analysis to only official FIFA World Cup matches (not including qualifiers) since 2002-01-01.

You create two datasets containing the results of every official men's and women's international football match since the 19th century, which you scraped from a reliable online source. This data is stored in two CSV files: women_results.csv and men_results.csv.

The question you are trying to determine the answer to is:

Are more goals scored in women's international soccer matches than men's?

You assume a 10% significance level, and use the following null and alternative hypotheses:

H0: The mean number of goals scored in women's international soccer matches is the same as men's.

HA: The mean number of goals scored in women's international soccer matches is greater than men's.

# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import pingouin as pg
from scipy.stats import shapiro
import scipy.stats as stats

# loading and checking both datasets
men_results = pd.read_csv("men_results.csv", index_col=0)
women_results = pd.read_csv("women_results.csv",index_col=0)
print(women_results.head())
print(women_results.info())
print("Count of null values in women's DF")
print(women_results.isnull().sum())
print(men_results.head())
print(men_results.info())
print("Count of null values in mens DF")
print(men_results.isnull().sum())

Both datasets have the same structure and contain no null values. The only issue is the data type of the "date" column, which should be converted to datetime.

# converting column "date" to datetime
men_results["date"] = pd.to_datetime(men_results["date"]) 
women_results["date"] = pd.to_datetime(women_results["date"])
print(women_results.info())
print(men_results.info())
# creating a new column for all goals
women_results["total_goals"] = women_results["home_score"] + women_results["away_score"]
men_results["total_goals"] = men_results["home_score"] + men_results["away_score"]
print(women_results.info())
print(men_results.info())

Filtering the data

Filter the data to only include official FIFA World Cup matches that took place after 2002-01-01.

# Filtering the data

FWP_women_results = women_results[women_results["tournament"]=='FIFA World Cup']
filtered_women_results = FWP_women_results[FWP_women_results["date"]> "2002-01-01"]


FWP_men_results = men_results[men_results["tournament"]=='FIFA World Cup']
filtered_men_results = FWP_men_results[FWP_men_results["date"]> "2002-01-01"]
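
The two-step filter above can also be written as a single boolean mask. The following is a minimal sketch on toy data, not the original notebook's code: the helper name `filter_world_cup` and the toy rows are hypothetical, and the made-up values only stand in for the real CSV contents.

```python
import pandas as pd

# Toy rows standing in for the real CSV contents (all values are made up).
toy = pd.DataFrame({
    "date": pd.to_datetime(["1998-07-01", "2002-06-15", "2014-07-01", "2019-06-20"]),
    "tournament": ["FIFA World Cup", "FIFA World Cup", "Friendly", "FIFA World Cup"],
    "home_score": [1, 2, 3, 4],
    "away_score": [0, 1, 1, 2],
})

def filter_world_cup(df, start="2002-01-01"):
    """Keep FIFA World Cup rows dated after `start` (hypothetical helper)."""
    mask = (df["tournament"] == "FIFA World Cup") & (df["date"] > pd.Timestamp(start))
    # .copy() returns an independent DataFrame, so later column
    # assignments do not trigger SettingWithCopyWarning.
    return df[mask].copy()

toy_filtered = filter_world_cup(toy)
print(toy_filtered)
```

Combining both conditions with `&` avoids the intermediate DataFrame, and the same helper could be applied to both the men's and the women's data.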

Choosing the correct hypothesis test

# Using EDA to help choose the right hypothesis test

filtered_men_results["total_goals"].hist(alpha=0.4, color="red", label="Men Subset")
filtered_women_results["total_goals"].hist(alpha=0.6, label="Women Subset")
plt.title("Goals distribution by Gender")
plt.xlabel("Total Goals")
plt.ylabel("Frequency")
plt.legend()  # display the labels passed to hist()
plt.show()

Interpretation:

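  • The men's and women's distributions have a very similar shape. The men's bars are taller because the men's subset contains more matches, not because more goals are scored per match.
  • Both distributions are skewed to the right, which suggests the data are not normally distributed. The right-tailed test follows from the alternative hypothesis (women's mean greater than men's), not from the skew.

# Determining if the data is normally distributed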

women_goals = filtered_women_results["total_goals"]
men_goals = filtered_men_results["total_goals"]

alpha = 0.10  # 10% significance level assumed above

women_statistic, pvalue_women = shapiro(women_goals)

if pvalue_women > alpha:
    print(" Women Data is normally distributed")
else:
    print("Women Data does not look normally distributed")
    
    
men_statistic, pvalue_men = shapiro(men_goals)

if pvalue_men > alpha:
    print(" Men Data is normally distributed")
else:
    print("Men Data does not look normally distributed")

Since neither sample is normally distributed, a t-test is not appropriate. Instead, we use a non-parametric test for two independent samples: the Wilcoxon-Mann-Whitney test.
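
To make the idea behind the test concrete, here is a minimal pure-Python sketch of the U statistic it is built on. It is an illustration under stated assumptions rather than what pingouin does internally: the function names and the toy samples are hypothetical, and no p-value is computed.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """U for sample x: count of (x, y) pairs with x > y, ties counting 0.5."""
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[: len(x)])
    return rank_sum_x - len(x) * (len(x) + 1) / 2

# Made-up goal totals, purely for illustration.
toy_women = [3, 4, 2, 5]
toy_men = [1, 2, 2, 3]

u_women = mann_whitney_u(toy_women, toy_men)
print(u_women)  # 13.5 of the 4 * 4 = 16 pairs favour the first sample
```

Because the statistic depends only on ranks, it makes no normality assumption, which is why it suits skewed goal counts.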

# performing Wilcoxon-Mann-Whitney test

results_wmu = pg.mwu(women_goals,men_goals, alternative="greater")
p_val = results_wmu["p-val"].values[0]

alpha = 0.10  # 10% significance level stated in the hypotheses

if p_val <= alpha:
    result = "reject"
    print("Reject H0: women's matches have significantly more goals than men's (p <= 0.1)")
else:
    result = "fail to reject"
    print("Fail to reject H0: no significant evidence that women's matches have more goals than men's (p > 0.1)")
    
result_dict = {"p_val": p_val, "result": result}
result_dict
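
As a sanity check, the same one-sided test is also available as `scipy.stats.mannwhitneyu`. The sketch below runs it on made-up toy samples (an assumption for illustration, not the World Cup data) and applies the same 10% decision rule.

```python
from scipy.stats import mannwhitneyu

# Made-up goal totals, purely for illustration.
toy_women = [3, 4, 2, 5, 6, 4]
toy_men = [1, 2, 2, 3, 1, 0]

# alternative="greater" tests whether the first sample tends to be larger.
res = mannwhitneyu(toy_women, toy_men, alternative="greater")

alpha = 0.10  # 10% significance level
decision = "reject" if res.pvalue <= alpha else "fail to reject"
print({"p_val": res.pvalue, "result": decision})
```

On the real data, the p-value from this call can be compared with the one reported by `pg.mwu` to confirm the two libraries agree.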

Conclusion

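Because the p-value is smaller than the chosen alpha of 0.10, we have sufficient evidence to reject the null hypothesis in favor of the alternative: the mean number of goals scored in women's international soccer matches is greater than men's at the 10% significance level. This conclusion applies only to the data analyzed, namely official FIFA World Cup matches played since 2002-01-01.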