Project Report: Are More Goals Scored in Women's FIFA World Cup Matches Than Men's?
1. 📌 Project Overview
As a sports journalist with a passion for soccer analysis, I wanted to investigate a data-driven question that has often been debated:
Are more goals scored in women's FIFA World Cup matches compared to men's?
To answer this, I conducted a statistical analysis of official FIFA World Cup matches (excluding qualifiers) from January 1, 2002 onward. The findings could provide insights into differences in game dynamics between the men's and women's tournaments and help enrich future coverage and articles.
2. 🎯 Objectives
Compare the average number of goals scored in men's and women's FIFA World Cup matches.
Perform a statistical hypothesis test to determine whether the difference is significant.
Draw a valid conclusion using a 10% significance level.
3. Data Sources
men_results.csv: official match results for men's international matches.
women_results.csv: official match results for women's international matches.
Each dataset included:
date (match date)
home_team, away_team
home_score, away_score
tournament
location
4. 🔍 Data Preparation
Filtered the data to include only matches where:
tournament == "FIFA World Cup"
date >= 2002-01-01
Created a new feature for each match: total_goals = home_score + away_score (named goals_scored in the hypothesis-test code).
Checked for missing or incorrect data (none found after filtering).
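The preparation steps above can be sketched on a few made-up rows. This is a minimal illustration, not the project's actual pipeline: the helper name prepare_world_cup and the toy matches are hypothetical, and only the column names come from the datasets described earlier.

```python
import pandas as pd

# Hypothetical toy rows; team names and scores are invented for illustration
toy = pd.DataFrame({
    "date": ["1998-06-10", "2002-06-01", "2010-03-03", "2014-06-12"],
    "home_team": ["Team A", "Team B", "Team C", "Team D"],
    "away_team": ["Team E", "Team F", "Team G", "Team H"],
    "home_score": [0, 1, 0, 1],
    "away_score": [3, 2, 1, 7],
    "tournament": ["FIFA World Cup", "FIFA World Cup", "Friendly", "FIFA World Cup"],
})

def prepare_world_cup(df, start="2002-01-01"):
    """Keep FIFA World Cup matches on or after `start` and add total_goals."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    mask = (out["tournament"] == "FIFA World Cup") & (out["date"] >= pd.Timestamp(start))
    out = out.loc[mask].copy()
    out["total_goals"] = out["home_score"] + out["away_score"]
    return out

prepared = prepare_world_cup(toy)

# Count missing values per column as a basic data-quality check
missing_per_column = prepared.isna().sum()

print(prepared[["date", "tournament", "total_goals"]])
print(missing_per_column)
```

Only the two World Cup rows dated 2002 or later survive the filter; the 1998 match and the friendly are dropped.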
5. 🧪 Hypothesis Testing
We defined the following hypotheses:
Null Hypothesis (H₀): The mean number of goals in women's matches is the same as in men's matches.
Alternative Hypothesis (H₁): The mean number of goals in women's matches is greater than in men's matches.
Significance Level (α): 10% (0.10)
➡️ Statistical Test Used: One-tailed (right-tailed) Wilcoxon-Mann-Whitney test for two independent samples. Goals per match are not normally distributed, so this rank-based test is used in the code rather than a t-test; it compares the two groups by rank rather than comparing the means directly.
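To give a feel for what a rank-based comparison of two independent samples measures, here is a small sketch of the Mann-Whitney U statistic computed by hand. The function name mann_whitney_u and the goal counts are hypothetical; the p-value in the project comes from the library calls in the analysis code, not from this sketch.

```python
def mann_whitney_u(x, y):
    """U statistic for sample x: count pairs (x_i, y_j) with x_i > y_j; a tie counts 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Hypothetical goals-per-match samples, invented for illustration
women_goals = [4, 3, 5, 2]
men_goals = [2, 1, 3, 2]

u_women = mann_whitney_u(women_goals, men_goals)
u_men = mann_whitney_u(men_goals, women_goals)

# The two U values always sum to the number of pairs, len(x) * len(y)
print(u_women, u_men, len(women_goals) * len(men_goals))
```

A U value for the women's sample well above half the number of pairs points in the direction of H₁; whether it is significant is what the one-tailed test decides.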
6. 📊 Results
P-value obtained: computed by the analysis code and stored in p_val (the numeric value is not recorded in this report).
Decision Rule:
If p-value < 0.10 → Reject the Null Hypothesis
If p-value ≥ 0.10 → Fail to Reject the Null Hypothesis
Conclusion: Since the p-value was less than 0.10, we reject the null hypothesis.
✅ There is statistically significant evidence that more goals are scored in women's FIFA World Cup matches compared to men's.
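The decision rule can be written as a small helper. This is a sketch only: the function name decide is hypothetical, and the p-values passed to it below are illustrative inputs, not results from the analysis.

```python
def decide(p_val, alpha=0.10):
    """Apply the decision rule: reject H0 only when p_val is strictly below alpha."""
    return "reject" if p_val < alpha else "fail to reject"

# Illustrative p-values (not outputs of the actual test)
for p in (0.05, 0.10, 0.25):
    print(p, decide(p))
```

A p-value exactly equal to 0.10 falls on the "fail to reject" side, matching the rule as stated above.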
7. 📊 Key Insights
Higher Scoring Games: Women's World Cup matches tend to have more goals per match than Men's World Cup matches.
Competitive Gap: This may partly reflect greater disparities between top and bottom teams in women's international soccer historically.
Trend Over Time: Scoring trends may evolve as competition levels rise globally, especially as women's soccer continues growing.
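A per-group summary is a simple way to put numbers next to the "higher scoring" insight. The sketch below assumes a long-format table with group and goals_scored columns, as built in the analysis code; the rows here are hypothetical and do not represent the real results.

```python
import pandas as pd

# Hypothetical long-format data: one row per match, invented for illustration
toy = pd.DataFrame({
    "group": ["women", "women", "women", "men", "men", "men"],
    "goals_scored": [4, 2, 3, 1, 3, 2],
})

# Match count, mean and median goals per match for each group
summary = toy.groupby("group")["goals_scored"].agg(["count", "mean", "median"])
print(summary)
```

Applied to the combined World Cup data, the same groupby would report the actual average goals per match for each tournament alongside the test result.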
8. 📜 Conclusion
This project found statistically significant evidence, at the 10% significance level, that women's FIFA World Cup matches feature more goals than men's matches.
It highlights differences in gameplay dynamics and offers storytelling points for future sports journalism work. Using official match results and a hypothesis test suited to non-normally distributed goal counts makes these findings more credible.
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import pingouin
from scipy.stats import mannwhitneyu
# Load men's and women's datasets
men = pd.read_csv("men_results.csv")
women = pd.read_csv("women_results.csv")
# Filter the data for the time range (2002-01-01 onward) and tournament
# .copy() avoids SettingWithCopyWarning when columns are added later
men["date"] = pd.to_datetime(men["date"])
men_subset = men[(men["date"] >= "2002-01-01") & (men["tournament"].isin(["FIFA World Cup"]))].copy()
women["date"] = pd.to_datetime(women["date"])
women_subset = women[(women["date"] >= "2002-01-01") & (women["tournament"].isin(["FIFA World Cup"]))].copy()
# Create group and goals_scored columns
men_subset["group"] = "men"
women_subset["group"] = "women"
men_subset["goals_scored"] = men_subset["home_score"] + men_subset["away_score"]
women_subset["goals_scored"] = women_subset["home_score"] + women_subset["away_score"]
# Determine normality using histograms (one per group)
men_subset["goals_scored"].hist()
plt.show()
plt.clf()
women_subset["goals_scored"].hist()
plt.show()
plt.clf()
# Goals scored is not normally distributed, so use Wilcoxon-Mann-Whitney test of two groups
# Combine women's and men's data into one DataFrame (goals_scored was calculated above)
both = pd.concat([women_subset, men_subset], axis=0, ignore_index=True)
# Reshape to wide format (one column per group) for the pingouin Wilcoxon-Mann-Whitney test
both_subset = both[["goals_scored", "group"]]
both_subset_wide = both_subset.pivot(columns="group", values="goals_scored")
# Perform right-tailed Wilcoxon-Mann-Whitney test with pingouin
results_pg = pingouin.mwu(x=both_subset_wide["women"],
                          y=both_subset_wide["men"],
                          alternative="greater")
# Alternative SciPy solution: Perform right-tailed Wilcoxon-Mann-Whitney test with scipy
results_scipy = mannwhitneyu(x=women_subset["goals_scored"],
                             y=men_subset["goals_scored"],
                             alternative="greater")
# Extract p-value as a float
p_val = results_pg["p-val"].values[0]
# Determine hypothesis test result using the 10% significance level
if p_val < 0.10:
    result = "reject"
else:
    result = "fail to reject"
result_dict = {"p_val": p_val, "result": result}
📈 Trend: Has average goals per match changed over time?
import pandas as pd
import matplotlib.pyplot as plt
# Load datasets
women_results = pd.read_csv('women_results.csv')
men_results = pd.read_csv('men_results.csv')
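# Parse dates so the cutoff is a true date comparison rather than a string comparison
women_results['date'] = pd.to_datetime(women_results['date'])
men_results['date'] = pd.to_datetime(men_results['date'])
# Filter for FIFA World Cup matches from 2002-01-01 onward (.copy() avoids SettingWithCopyWarning)
women_wc = women_results[(women_results['tournament'] == 'FIFA World Cup') & (women_results['date'] >= '2002-01-01')].copy()
men_wc = men_results[(men_results['tournament'] == 'FIFA World Cup') & (men_results['date'] >= '2002-01-01')].copy()
# Add a year column
women_wc['year'] = women_wc['date'].dt.year
men_wc['year'] = men_wc['date'].dt.year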
# Calculate total goals
women_wc['total_goals'] = women_wc['home_score'] + women_wc['away_score']
men_wc['total_goals'] = men_wc['home_score'] + men_wc['away_score']
# Group by year and get the mean
women_yearly_avg = women_wc.groupby('year')['total_goals'].mean()
men_yearly_avg = men_wc.groupby('year')['total_goals'].mean()
# Plot
plt.figure(figsize=(10,6))
plt.plot(women_yearly_avg, marker='o', label="Women's FIFA World Cup")
plt.plot(men_yearly_avg, marker='o', label="Men's FIFA World Cup")
plt.title('Average Goals per Match Over Time')
plt.xlabel('Year')
plt.ylabel('Average Goals')
plt.legend()
plt.grid(True)
plt.show()
Biggest Goal Difference Games
# Add goal difference columns
women_wc['goal_difference'] = abs(women_wc['home_score'] - women_wc['away_score'])
men_wc['goal_difference'] = abs(men_wc['home_score'] - men_wc['away_score'])
# Top 5 biggest wins
print("Top 5 Women's World Cup blowouts:")
print(women_wc.sort_values(by='goal_difference', ascending=False)[['date', 'home_team', 'away_team', 'home_score', 'away_score']].head())
print("\nTop 5 Men's World Cup blowouts:")
print(men_wc.sort_values(by='goal_difference', ascending=False)[['date', 'home_team', 'away_team', 'home_score', 'away_score']].head())