You're working as a sports journalist at a major online sports media company, specializing in soccer analysis and reporting. You've been watching both men's and women's international soccer matches for a number of years, and your gut instinct tells you that more goals are scored in women's international football matches than men's. This would make an interesting investigative article that your subscribers are bound to love, but you'll need to perform a valid statistical hypothesis test to be sure!
While scoping this project, you acknowledge that the sport has changed a lot over the years, and performances likely vary a lot depending on the tournament, so you decide to limit the data used in the analysis to only official FIFA World Cup matches (not including qualifiers) since 2002-01-01.
You create two datasets containing the results of every official men's and women's international football match since the 19th century, which you scraped from a reliable online source. This data is stored in two CSV files: women_results.csv and men_results.csv.
The question you are trying to determine the answer to is:
Are more goals scored in women's international soccer matches than men's?
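You assume a 10% significance level, and use the following null and alternative hypotheses:
H0: The mean number of goals scored in women's international soccer matches is the same as men's.
HA: The mean number of goals scored in women's international soccer matches is greater than men's.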
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pingouin as pg
import scipy.stats as stats
# Load the women's and men's match results datasets
wresults = pd.read_csv("women_results.csv", parse_dates=["date"])
mresults = pd.read_csv("men_results.csv", parse_dates=["date"])
# Filter the datasets for matches that are part of the FIFA World Cup
wresults = wresults[wresults["tournament"] == "FIFA World Cup"]
mresults = mresults[mresults["tournament"] == "FIFA World Cup"]
# Further filter the datasets for matches that occurred on or after January 1, 2002
# .copy() makes each filtered frame independent, so adding a column below does not raise SettingWithCopyWarning
wresults = wresults[wresults["date"] >= "2002-01-01"].copy()
mresults = mresults[mresults["date"] >= "2002-01-01"].copy()
# Calculate the total score for each match by adding the home and away scores
wresults["total_score"] = wresults["home_score"] + wresults["away_score"]
mresults["total_score"] = mresults["home_score"] + mresults["away_score"]
# Display the first few rows of the filtered datasets
display(wresults.head(), mresults.head())
# Print the mean and standard deviation of the total scores for both datasets
print("Means:","Women:",wresults["total_score"].mean().round(2),"Men:", mresults["total_score"].mean().round(2))
print("Std  :","Women:",wresults["total_score"].std().round(2),"Men:", mresults["total_score"].std().round(2))
# Plotting both histograms side by side for visual comparison
plt.figure(figsize=(12, 6))
# Histogram for women's results
plt.subplot(1, 2, 1)  # 1 row, 2 columns, 1st subplot
plt.hist(wresults["total_score"], color='blue', alpha=0.7, label='Women')
plt.title('Women Total Scores')  # Title for the histogram
plt.xlabel('Total Score')  # X-axis label
plt.ylabel('Frequency')  # Y-axis label
plt.legend()  # Show legend
# Histogram for men's results
plt.subplot(1, 2, 2)  # 1 row, 2 columns, 2nd subplot
plt.hist(mresults["total_score"], color='green', alpha=0.7, label='Men')
plt.title('Men Total Scores')  # Title for the histogram
plt.xlabel('Total Score')  # X-axis label
plt.ylabel('Frequency')  # Y-axis label
plt.legend()  # Show legend
plt.tight_layout()  # Adjust the layout to prevent overlap
plt.show()  # Display the histograms
# Check for normality of the total scores for both women's and men's results using the Shapiro-Wilk test
shapiro_test_women = stats.shapiro(wresults['total_score'])
shapiro_test_men = stats.shapiro(mresults['total_score'])
# Display the Shapiro-Wilk test results
print("Shapiro-Wilk Test for Women's Total Scores:", shapiro_test_women)
print("Shapiro-Wilk Test for Men's Total Scores:", shapiro_test_men)
# Visual inspection of the distribution using Q-Q plots
plt.figure(figsize=(12, 6))
# Plot for women's total scores
plt.subplot(1, 2, 1)
stats.probplot(wresults['total_score'], dist="norm", plot=plt)
plt.title('Q-Q Plot for Women\'s Total Scores')
# Plot for men's total scores
plt.subplot(1, 2, 2)
stats.probplot(mresults['total_score'], dist="norm", plot=plt)
plt.title('Q-Q Plot for Men\'s Total Scores')
plt.tight_layout()
plt.show()
Women's Total Scores:
Statistic: 0.8491
- The W statistic measures how closely the data follow a normal distribution; values closer to 1 indicate a better fit. At 0.8491, the women's total scores depart noticeably from normality.
P-value: 3.89 × 10⁻¹³
- This very small p-value indicates strong evidence against the null hypothesis that the data are normally distributed. Thus, we reject the null hypothesis for women's total scores, suggesting that they are not normally distributed.
Men's Total Scores:
Statistic: 0.9266
- As above, values closer to 1 indicate a better fit. The men's total scores are closer to a normal distribution than the women's, but the fit is still imperfect.
P-value: 8.89 × 10⁻¹³
- Like with the women's scores, this very small p-value leads us to reject the null hypothesis that the data are normally distributed, indicating non-normality in men's total scores.
Since neither sample is normally distributed, we use the Mann-Whitney U test, a nonparametric alternative to the unpaired t-test that does not assume normality, to calculate the p-value.
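To make concrete what the Mann-Whitney U statistic measures, here is a minimal standard-library sketch of the one-sided test. It is an illustration only: the helper name `mann_whitney_u_greater` and the toy goal totals are hypothetical and not part of the project data, and it uses the normal approximation with tie and continuity corrections, so its p-value may differ slightly from what a statistics library reports for small samples.

```python
from collections import Counter
from math import erfc, sqrt


def mann_whitney_u_greater(x, y):
    """One-sided Mann-Whitney U test (alternative: x tends to exceed y).

    Returns (U, p) using the normal approximation with tie and
    continuity corrections. Hypothetical helper for illustration only.
    """
    n1, n2 = len(x), len(y)
    # U counts the (x_i, y_j) pairs where x_i wins; ties count as half a win.
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    # Under the null hypothesis, U is centered on n1 * n2 / 2.
    mean_u = n1 * n2 / 2
    # Variance of U, reduced to account for tied values.
    n = n1 + n2
    tie_term = sum(t**3 - t for t in Counter(list(x) + list(y)).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    # Continuity-corrected z-score and upper-tail normal probability.
    z = (u - mean_u - 0.5) / sqrt(var_u)
    p = 0.5 * erfc(z / sqrt(2))
    return u, p


# Made-up goal totals, not taken from the match data.
toy_women = [3, 4, 2, 5, 3, 6]
toy_men = [1, 2, 2, 3, 1, 0]
u_stat, p_value = mann_whitney_u_greater(toy_women, toy_men)
print(u_stat, round(p_value, 4))
```

Because the test works on pairwise comparisons rather than raw values, it is not distorted by the skewed, discrete shape of goal counts. A large U relative to n1 * n2 / 2 means the first sample tends to have higher totals.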
# Perform a one-sided Mann-Whitney U test (unpaired) comparing total scores between women's and men's results
mwu_test = pg.mwu(x=wresults["total_score"], y=mresults["total_score"], alternative="greater")
# Display the test results
display(mwu_test)
# Extract the p-value from the test results
p_val = mwu_test["p-val"].values[0]
# Print the p-value for inspection
print(p_val)
# Determine the result based on the p-value with a significance level of 0.1
if p_val <= 0.1:
    result = "reject"  # Reject the null hypothesis
else:
    result = "fail to reject"  # Fail to reject the null hypothesis
# Create a dictionary to store the p-value and the test conclusion
result_dict = {"p_val": p_val, "result": result}
# Display the result dictionary
result_dict