Soccer Data

This dataset contains data on every game from the 2018-2019 season of the English Premier League.

Discover the possible outcomes of matches and how those outcomes change between half time and full time. Analyse the results to see what conclusions can be drawn.

Data Dictionary

Column     Explanation
Div        Division the game was played in
Date       The date the game was played
HomeTeam   The home team
AwayTeam   The away team
FTHG       Full time home goals
FTAG       Full time away goals
FTR        Full time result
HTHG       Half time home goals
HTAG       Half time away goals
HTR        Half time result
Referee    The referee of the game
HS         Number of shots taken by home team
AS         Number of shots taken by away team
HST        Number of shots taken by home team on target
AST        Number of shots taken by away team on target
HF         Number of fouls made by home team
AF         Number of fouls made by away team
HC         Number of corners taken by home team
AC         Number of corners taken by away team
HY         Number of yellow cards received by home team
AY         Number of yellow cards received by away team
HR         Number of red cards received by home team
AR         Number of red cards received by away team
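
The result columns should follow from the goal columns, which gives a cheap consistency check before any analysis. The sketch below assumes the result columns use the codes H (home win), A (away win) and D (draw), which the dictionary does not state. The helper `derive_result` and the three-row `sample` frame are hypothetical and are not part of the dataset.

```python
import numpy as np
import pandas as pd


def derive_result(home_goals: pd.Series, away_goals: pd.Series) -> pd.Series:
    """Return 'H', 'A' or 'D' per match (assumed result codes)."""
    diff = home_goals - away_goals
    codes = np.where(diff > 0, "H", np.where(diff < 0, "A", "D"))
    return pd.Series(codes, index=home_goals.index)


# Made-up rows for illustration only, not taken from the real file
sample = pd.DataFrame({
    "FTHG": [2, 0, 1], "FTAG": [1, 3, 1], "FTR": ["H", "A", "D"],
    "HTHG": [0, 0, 1], "HTAG": [1, 1, 0], "HTR": ["A", "A", "H"],
})

# Rows where the stored result disagrees with the goals
ft_mismatch = sample[derive_result(sample["FTHG"], sample["FTAG"]) != sample["FTR"]]
ht_mismatch = sample[derive_result(sample["HTHG"], sample["HTAG"]) != sample["HTR"]]
print(len(ft_mismatch), len(ht_mismatch))
```

Running the same two comparisons on the loaded data would list any rows whose stored result does not match the goals.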

Part 1 - Exploratory Data Analysis

# Import the data and display it
import pandas as pd

match_data = pd.read_csv("soccer18-19.csv", parse_dates=['Date'])

display(match_data.head())
display(match_data.tail())
# info() shows each column's dtype, to check the columns were imported correctly
# info() also shows non-null counts, which highlights any NULLs
match_data.info()
# Look for NULLs
print(match_data.isna().sum().sort_values())
# Using describe to look for any odd numeric data points.
match_data.describe()
# Look at the population of Home and Away team names to identify any typos in the names.
Home_Teams = match_data.groupby('HomeTeam').size()
Away_Teams = match_data.groupby('AwayTeam').size()

All_Teams = pd.concat([Home_Teams, Away_Teams], axis=1)
All_Teams.columns = ['HomeMatchCount', 'AwayMatchCount']

print(All_Teams)
# Team names repeat on many rows, so convert them to categorical. This reduces memory use and can speed up grouping
match_data['HomeTeam'] = match_data['HomeTeam'].astype('category')
match_data['AwayTeam'] = match_data['AwayTeam'].astype('category')
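# Look again at the Home and Away team counts to confirm the conversion left them unchanged.
# observed=True counts only categories that appear in the data and avoids a pandas FutureWarning
Home_TeamsCategorical = match_data.groupby('HomeTeam', observed=True).size()
Away_TeamsCategorical = match_data.groupby('AwayTeam', observed=True).size()

All_TeamsCategorical = pd.concat([Home_TeamsCategorical, Away_TeamsCategorical], axis=1)
All_TeamsCategorical.columns = ['HomeMatchCount', 'AwayMatchCount']

# Print the recomputed table (the original cell printed All_Teams again by mistake)
print(All_TeamsCategorical)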

Raw data review

The data loads correctly, and the Date column is parsed as a date.

  • The info display shows that the data is fully populated, with no NULL data points that need to be dropped.
  • The describe display shows that the data is populated sensibly, with no obvious outliers or invalid values such as negative goals.
  • We reviewed the home and away team names for typos and found none. Both columns have been converted to categorical data to improve performance.

It is safe to move on to transforming the data for analysis.
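
The review was done by reading the output. As a hedged alternative, the same checks can be collected in code so they can be re-run whenever the file changes. This is a minimal sketch, not the method used in the cells above: `review_raw_data` is a hypothetical helper, the `sample` frame is made up, and the home-count check assumes a double round robin in which every team hosts every other team exactly once.

```python
import pandas as pd


def review_raw_data(df: pd.DataFrame, count_cols: list) -> dict:
    """Collect simple data-quality findings for a match table."""
    home = df["HomeTeam"].value_counts()
    away = df["AwayTeam"].value_counts()
    return {
        "null_cells": int(df.isna().sum().sum()),
        "negative_counts": int((df[count_cols] < 0).sum().sum()),
        # A name that appears on only one side often signals a typo
        "unpaired_teams": sorted(set(home.index) ^ set(away.index)),
        # Assumption: each team plays (number of teams - 1) home games
        "uneven_home_counts": sorted(home[home != home.size - 1].index),
    }


# Made-up three-team double round robin, for illustration only
sample = pd.DataFrame({
    "HomeTeam": ["A", "A", "B", "B", "C", "C"],
    "AwayTeam": ["B", "C", "A", "C", "A", "B"],
    "FTHG": [1, 0, 2, 1, 0, 3],
    "FTAG": [0, 0, 2, 1, 1, 0],
})

report = review_raw_data(sample, ["FTHG", "FTAG"])
print(report)
```

An empty list or zero for every key corresponds to the clean result described in the bullets above.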

Part 2 - Analysing Match Results

Using the half time and full time results of matches

# Find the number of matches played in the season
MatchCount = len(match_data)
print(MatchCount)
# Show the different results and how many games ended with each, at half time and at full time
HalfTimeResults = match_data.groupby('HTR').size().sort_values(ascending=False)
FullTimeResults = match_data.groupby('FTR').size().sort_values(ascending=False)

Results_df = pd.concat([HalfTimeResults, FullTimeResults], axis=1)
Results_df.columns = ['HTR', 'FTR']
display(Results_df)
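
The side-by-side counts show how often each result occurs, but not which half-time results turn into which full-time results. One way to see that, sketched here under assumptions, is a cross-tabulation of HTR against FTR. The `sample` frame is made up and the codes H, A and D are assumed; on the real data the same calls would take `match_data['HTR']` and `match_data['FTR']`.

```python
import pandas as pd

# Made-up results for illustration only, not taken from the real file
sample = pd.DataFrame({
    "HTR": ["H", "H", "D", "A", "D", "H"],
    "FTR": ["H", "D", "D", "A", "H", "H"],
})

# Rows are half-time results, columns are full-time results
transitions = pd.crosstab(sample["HTR"], sample["FTR"])

# Same table as row shares: of the matches with a given half-time
# result, what fraction finished with each full-time result
shares = pd.crosstab(sample["HTR"], sample["FTR"], normalize="index")

# Number of matches whose result changed after half time
changed = int((sample["HTR"] != sample["FTR"]).sum())

print(transitions)
print(shares.round(2))
print(changed)
```

The diagonal of `transitions` holds matches whose result did not change; everything off the diagonal is a turnaround or a lead that was lost.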