Soccer Data
This dataset contains data on every game of the 2018-2019 English Premier League season.
Not sure where to begin? Scroll to the bottom to find challenges!
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("soccer18-19.csv")
Let's understand what each variable means.
variables = pd.DataFrame(columns=['Variable','Number of unique values','Values'])
for i, var in enumerate(df.columns):
    variables.loc[i] = [var, df[var].nunique(), df[var].unique().tolist()]
# There is also a csv file consisting of an explanation of the different variables.
# We will join this with the variables dataframe.
var_dict = pd.read_csv('variable_explanation.csv', index_col=0)
variables.set_index('Variable').join(var_dict)
What defines a great team is its ability to win soccer games even when it is not playing at its best. The ability to overcome obstacles is what separates the good from the great.
Let's check whether our dataset is clean and ready to analyze.
df.describe()
df.isna().sum()
Source of dataset.
I'll be trying to find which teams are the best at learning from their first-half mistakes and correcting them by full time to complete a comeback. A tie will also be taken into consideration, but with less weight. What makes these teams better in the second half? More chances? Fewer fouls?
How many comebacks did each Premier League team have in the 2018-2019 season?
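One way to approach the comeback question is sketched below. It assumes the dataset has the columns HomeTeam, AwayTeam, HTR (half-time result) and FTR (full-time result), with results coded as 'H', 'D' or 'A'; these names are not confirmed above, so check them against the variable explanation first. The 0.5 weight for a tie is also just an illustrative choice to reflect that ties count for less than wins. The toy data and team names are made up for demonstration.

```python
import pandas as pd

def count_comebacks(matches: pd.DataFrame) -> pd.DataFrame:
    """Count comeback wins and comeback ties per team.

    Assumes columns HomeTeam, AwayTeam, HTR, FTR with 'H'/'D'/'A' codes
    (an assumption about the dataset layout).
    """
    # Build one row per team per match, seen from that team's perspective.
    home = pd.DataFrame({
        "team": matches["HomeTeam"],
        "trailed_at_half": matches["HTR"] == "A",
        "won": matches["FTR"] == "H",
        "drew": matches["FTR"] == "D",
    })
    away = pd.DataFrame({
        "team": matches["AwayTeam"],
        "trailed_at_half": matches["HTR"] == "H",
        "won": matches["FTR"] == "A",
        "drew": matches["FTR"] == "D",
    })
    long = pd.concat([home, away], ignore_index=True)

    # A comeback requires trailing at half time, then winning or drawing.
    long["comeback_win"] = long["trailed_at_half"] & long["won"]
    long["comeback_draw"] = long["trailed_at_half"] & long["drew"]

    # Summing boolean columns gives counts per team.
    cols = ["trailed_at_half", "comeback_win", "comeback_draw"]
    summary = long.groupby("team")[cols].sum()

    # Hypothetical weighting: a tie counts half as much as a win.
    summary["comeback_score"] = (
        summary["comeback_win"] + 0.5 * summary["comeback_draw"]
    )
    return summary.sort_values("comeback_score", ascending=False)

# Toy example with made-up teams and results.
toy = pd.DataFrame({
    "HomeTeam": ["A", "B", "C", "A"],
    "AwayTeam": ["B", "C", "A", "C"],
    "HTR": ["A", "H", "H", "D"],
    "FTR": ["H", "D", "A", "H"],
})
summary = count_comebacks(toy)
print(summary)
```

On the real data, the call would be count_comebacks(df), provided the column names match. Dividing comeback_score by trailed_at_half would give a rate instead of a raw count, which may be fairer to teams that rarely fall behind.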