Introduction
This project aimed to find out whether more goals are scored in women's international football matches than in men's. This would make an interesting investigative article that soccer fans are bound to love, but to be sure of the claim, it must be tested with a valid statistical hypothesis test!
This analysis was limited to the data of official FIFA World Cup matches (not including qualifiers) since 2002-01-01.
Two datasets containing the results of every official men's and women's international football match since the 19th century were provided by DataCamp.
This data is stored in two CSV files: women_results.csv and men_results.csv.
The question I set out to answer is:
Are more goals scored in women's international soccer matches than men's?
I assumed a 10% significance level, and used the following null and alternative hypotheses:
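Read against the question above, the one-sided hypotheses can be sketched as follows. This assumes the comparison is on the mean number of goals per match; the symbols mu_W and mu_M (mean total goals in women's and men's matches) are introduced here for illustration only.

```latex
H_0:\ \mu_W = \mu_M
\qquad \text{vs.} \qquad
H_A:\ \mu_W > \mu_M,
\qquad \alpha = 0.10
```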
Data Preparation
# Importing data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
men_results = pd.read_csv('men_results.csv')
women_results = pd.read_csv('women_results.csv')
men_results.head()
women_results.head()
# Inspecting dataframes
men_results.shape
women_results.shape
The sample sets are sufficiently large, with 44353 entries in men_results and 4884 entries in women_results, but they contain data from many different tournaments going back to the 19th century. They need to be subset to FIFA World Cup matches later, and the characteristics of the subsetted dataframes examined.
# converting date columns into appropriate data type and extracting year
men_results['date'] = pd.to_datetime(men_results['date'])
men_results['year'] = men_results['date'].dt.year.astype(int)
women_results['date'] = pd.to_datetime(women_results['date'])
women_results['year'] = women_results['date'].dt.year.astype(int)
# Creating a total goals column in each dataframe for testing the hypothesis
men_results['total_goals'] = men_results['home_score'] + men_results['away_score']
women_results['total_goals'] = women_results['home_score'] + women_results['away_score']
men_results.info()
men_results.describe()
women_results.info()
women_results.describe()
The data types of the columns are now fixed. Next, I subset the data to FIFA World Cup matches played since 2002.
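As a minimal sketch of that subsetting step, the filter can be tried on a small toy dataframe first. The `tournament` column name and its 'FIFA World Cup' label are assumptions here (they are not shown in the cells above), and `toy_results` is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Toy stand-in for men_results / women_results.
# ASSUMPTION: the real data has a `tournament` column that labels
# World Cup matches as 'FIFA World Cup'.
toy_results = pd.DataFrame({
    'date': ['1998-07-12', '2002-06-30', '2010-07-11', '2011-03-29'],
    'home_score': [3, 0, 0, 1],
    'away_score': [0, 2, 1, 1],
    'tournament': ['FIFA World Cup', 'FIFA World Cup',
                   'FIFA World Cup', 'Friendly'],
})

# Same preparation as above: parse dates and add total goals
toy_results['date'] = pd.to_datetime(toy_results['date'])
toy_results['total_goals'] = toy_results['home_score'] + toy_results['away_score']

# Keep only World Cup matches played on or after 2002-01-01
is_world_cup = toy_results['tournament'] == 'FIFA World Cup'
is_recent = toy_results['date'] >= pd.Timestamp('2002-01-01')
toy_subset = toy_results[is_world_cup & is_recent]

print(toy_subset[['date', 'tournament', 'total_goals']])
```

Building the two boolean masks separately keeps each condition easy to check on its own before they are combined with `&`.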