Hypothesis Testing with Men’s and Women’s Soccer Matches ⚽📊

Executive Summary

This project investigates whether more goals are scored in women’s international soccer matches than in men’s. Using FIFA World Cup match results since 2002, a one-sided Mann-Whitney U test was conducted at a 10% significance level. The result gives journalists and analysts statistical evidence to draw on when writing about differences in scoring between women’s and men’s matches.

Business Problem

From a sports journalist’s perspective, the goal is to test the intuition that women’s matches see more goals than men’s. That calls for a formal statistical test, so that any reported difference is unlikely to be the product of random variation alone.
Key questions:

  • Do women’s international matches consistently feature higher scoring than men’s?
  • Can these insights support compelling sports journalism content?

Methodology

Data Source
Two files, women_results.csv and men_results.csv, contain historical international match results.
Only FIFA World Cup matches played on or after 2002-01-01 are kept, which excludes qualifiers.

Data Preparation
Calculated total goals per match (home_score + away_score).
Labeled matches by gender.
Combined datasets into a single analysis table.

Hypothesis Testing

  • Null Hypothesis (H0): Mean goals in women’s matches = mean goals in men’s matches.
  • Alternative Hypothesis (HA): Mean goals in women’s matches > mean goals in men’s matches.
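  • Significance Level: α = 0.10
  • Test: one-sided Mann-Whitney U test (pingouin.mwu with alternative='greater'), chosen because goal counts are not normally distributed. The test works on ranks, so strictly it assesses whether women’s matches tend to have more goals rather than comparing the means directly.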
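
To make the test statistic concrete, here is a minimal pure-Python sketch of how a Mann-Whitney U statistic can be computed from pooled ranks. The function name mann_whitney_u and the goal counts are made up for illustration; the analysis itself relies on pingouin.mwu, which also supplies the p-value.

```python
from itertools import chain


def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two samples, using midranks for tied values."""
    pooled = sorted(chain(x, y))

    # Map each distinct value to the average of the 1-based ranks it occupies.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    n_x, n_y = len(x), len(y)
    rank_sum_x = sum(rank_of[v] for v in x)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    u_y = n_x * n_y - u_x
    return u_x, u_y


# Hypothetical goal totals, not taken from the real datasets.
women_goals = [3, 4, 2, 5]
men_goals = [1, 2, 2, 3]

u_women, u_men = mann_whitney_u(women_goals, men_goals)
print(u_women, u_men)
```

U for the women’s sample counts the (women, men) pairs in which the women’s match has more goals, with each tie counted as one half. A value close to the total number of pairs points toward higher scores in the women’s sample.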

Skills

Python
Libraries: pandas, pingouin
Key Functions: pd.read_csv(), pd.concat(), DataFrame.pivot(), pingouin.mwu()
Techniques: data cleaning, combining datasets, and non-parametric hypothesis testing.

Results

Mann-Whitney U test outcome:
U-value: 43,273
p-value: 0.0051
Decision: Reject H0 at α = 0.10

Interpretation:
Since the p-value (0.0051) is much less than the significance level (0.10), we reject the null hypothesis.
There is statistically significant evidence that women’s FIFA World Cup matches have higher goal counts than men’s matches.

Recommendations:
Sports media can report that women’s FIFA World Cup matches since 2002 tend to be higher-scoring than men’s, a difference that is statistically significant at the 10% level, and can use this to support data-driven articles comparing scoring trends.
Additional analysis could examine goal patterns by tournament stage, team ranking, or match location for deeper insights.

Next Steps

Visualize goal distributions for men’s vs women’s matches (boxplots or histograms).
Expand the dataset to include other competitions (e.g., continental tournaments) for a broader analysis.
Apply regression or machine learning models to explore factors influencing goal counts.

🔍📈📊 Analysis

# pandas handles the data; pingouin provides the statistical test
import pandas as pd
import pingouin

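# Load the match results, using the first CSV column as the index
women = pd.read_csv('women_results.csv', index_col=0)
men = pd.read_csv('men_results.csv', index_col=0)

# Total goals per match
women['score'] = women['away_score'] + women['home_score']
men['score'] = men['away_score'] + men['home_score']

# Keep FIFA World Cup matches from 2002 onward.
# The string comparison assumes 'date' holds ISO-formatted (YYYY-MM-DD) strings.
# .copy() avoids pandas' SettingWithCopyWarning when a column is added below.
women_02_F = women[(women['date'] >= '2002-01-01') & (women['tournament'] == 'FIFA World Cup')].copy()
men_02_F = men[(men['date'] >= '2002-01-01') & (men['tournament'] == 'FIFA World Cup')].copy()

# Label each match by group ('team' here means the women's or men's competition)
women_02_F['team'] = 'women'
men_02_F['team'] = 'men'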


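# Stack both groups into one long table with a fresh 0..n-1 index
wom_men_scores = pd.concat([women_02_F[['score', 'team']], men_02_F[['score', 'team']]], axis=0, ignore_index=True)

# Reshape to one column per group; in each row the other group's column is NaN
wom_men_scores_wide = wom_men_scores.pivot(columns='team', values='score')

# One-sided Mann-Whitney U test: do women's matches tend to have more goals than men's?
alpha = 0.1
test_result = pingouin.mwu(x=wom_men_scores_wide['women'], y=wom_men_scores_wide['men'], alternative='greater')
print(test_result)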

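# Extract the p-value from the one-row result table
p_value = test_result['p-val'].values[0]
print(p_value)

# Compare the p-value against the significance level
if p_value <= alpha:
    result = 'reject'
else:
    result = 'fail to reject'

result_dict = {'p_val': p_value, 'result': result}
print(result_dict)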