2018-19 Premier League Season Analysis
Soccer Data
This dataset contains data of every game from the 2018-2019 season in the English Premier League.
Explore the possible match outcomes and how those outcomes change between half time and full time, then analyse the results to see what conclusions can be drawn.
Data Dictionary
| Column | Explanation |
|---|---|
| Div | Division the game was played in |
| Date | The date the game was played |
| HomeTeam | The home team |
| AwayTeam | The away team |
| FTHG | Full time home goals |
| FTAG | Full time away goals |
| FTR | Full time result |
| HTHG | Half time home goals |
| HTAG | Half time away goals |
| HTR | Half time result |
| Referee | The referee of the game |
| HS | Number of shots taken by home team |
| AS | Number of shots taken by away team |
| HST | Number of shots taken by home team on target |
| AST | Number of shots taken by away team on target |
| HF | Number of fouls made by home team |
| AF | Number of fouls made by away team |
| HC | Number of corners taken by home team |
| AC | Number of corners taken by away team |
| HY | Number of yellow cards received by home team |
| AY | Number of yellow cards received by away team |
| HR | Number of red cards received by home team |
| AR | Number of red cards received by away team |
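The result columns can be cross-checked against the goal columns they summarise. The sketch below is illustrative only: it assumes the results are coded `H` (home win), `D` (draw) and `A` (away win), which the data dictionary does not state, and it uses a hypothetical helper `result_from_goals` and made-up toy rows rather than the real file.

```python
import numpy as np
import pandas as pd


def result_from_goals(home_goals: pd.Series, away_goals: pd.Series) -> pd.Series:
    """Derive a result code from goal counts (assumed coding: H, D, A)."""
    # The sign of the goal difference is 1 (home win), 0 (draw) or -1 (away win)
    goal_diff_sign = np.sign(home_goals - away_goals)
    return goal_diff_sign.map({1: "H", 0: "D", -1: "A"})


# Toy rows for illustration only, not taken from the real dataset
sample = pd.DataFrame({
    "FTHG": [2, 0, 1],
    "FTAG": [1, 0, 3],
    "FTR": ["H", "D", "A"],
    "HTHG": [0, 0, 1],
    "HTAG": [1, 0, 1],
    "HTR": ["A", "D", "D"],
})

# Rows where the recorded result disagrees with the goals scored
ft_mismatch = sample[result_from_goals(sample["FTHG"], sample["FTAG"]) != sample["FTR"]]
ht_mismatch = sample[result_from_goals(sample["HTHG"], sample["HTAG"]) != sample["HTR"]]

print(len(ft_mismatch), len(ht_mismatch))
```

If the coding assumption holds, running the same comparison on `match_data` should return no mismatched rows; any that do appear are worth inspecting before analysis.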
Part 1 - Exploratory Data Analysis
# Import the data and display it
import pandas as pd
match_data = pd.read_csv("soccer18-19.csv", parse_dates=['Date'])
display(match_data.head())
display(match_data.tail())

# info() displays the column dtypes so we can check they were imported correctly
# info() also shows non-null counts, which highlights any NULLs
match_data.info()

# Look for NULLs
print(match_data.isna().sum().sort_values())

# Use describe() to look for any odd numeric data points
match_data.describe()

# Look at the population of Home and Away team names to identify any typos in the names.
Home_Teams = match_data.groupby('HomeTeam').size()
Away_Teams = match_data.groupby('AwayTeam').size()
All_Teams = pd.concat([Home_Teams, Away_Teams], axis=1)
All_Teams.columns = ['HomeMatchCount', 'AwayMatchCount']
print(All_Teams)

# Team names are repeated many times, so it is sensible to convert them to categorical. This reduces memory use and can speed up operations such as groupby
match_data['HomeTeam'] = match_data['HomeTeam'].astype('category')
match_data['AwayTeam'] = match_data['AwayTeam'].astype('category')

# Look again at the population of Home and Away team names to confirm the conversion did not change the counts.
Home_TeamsCategorical = match_data.groupby('HomeTeam').size()
Away_TeamsCategorical = match_data.groupby('AwayTeam').size()
All_TeamsCategorical = pd.concat([Home_TeamsCategorical, Away_TeamsCategorical], axis=1)
All_TeamsCategorical.columns = ['HomeMatchCount', 'AwayMatchCount']
print(All_TeamsCategorical)
Raw data review
The data loads correctly, with the Date column parsed as a date.
- On review of the info display, the data is fully populated, with no NULL data points that need to be dropped.
- On review of the describe display, the data appears to be populated sensibly, with no obvious outliers or incorrect values such as negative goals.
- We have reviewed the team names in the HomeTeam and AwayTeam columns to check for typos. There are none. Both columns have been converted to categorical data to reduce memory use.
It is safe to move on to transforming the data for analysis.
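The visual check of team counts above can also be expressed as explicit assertions. This is a minimal sketch, assuming a double round-robin format in which every team hosts every other team once (20 teams and 380 matches for the Premier League). The function `season_checks` and the four-team toy fixture list are hypothetical and only illustrate the idea.

```python
from itertools import permutations

import pandas as pd


def season_checks(df: pd.DataFrame, n_teams: int = 20) -> dict:
    """Return pass/fail checks for a double round-robin season."""
    home_counts = df["HomeTeam"].value_counts()
    away_counts = df["AwayTeam"].value_counts()
    expected = n_teams - 1  # each team hosts every other team exactly once
    return {
        "team_count_ok": df["HomeTeam"].nunique() == n_teams
        and df["AwayTeam"].nunique() == n_teams,
        "home_counts_ok": bool((home_counts == expected).all()),
        "away_counts_ok": bool((away_counts == expected).all()),
        "match_count_ok": len(df) == n_teams * (n_teams - 1),
    }


# Toy fixture list: every ordered pair of four made-up teams
teams = ["A", "B", "C", "D"]
toy = pd.DataFrame(list(permutations(teams, 2)), columns=["HomeTeam", "AwayTeam"])

checks = season_checks(toy, n_teams=4)
print(checks)
```

On the real data the call would be `season_checks(match_data)`. A misspelt team name would show up as a failed `team_count_ok` and a failed count check, which is easier to spot than scanning a printed table.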
Part 2 - Analysing Match Results
Using the half time and full time results of matches
# Find the number of matches played in the season
MatchCount = len(match_data)
print(MatchCount)

# Show the different results and how many games ended with each, at half time and at full time
HalfTimeResults = match_data.groupby('HTR').size().sort_values(ascending=False)
FullTimeResults = match_data.groupby('FTR').size().sort_values(ascending=False)
Results_df = pd.concat([HalfTimeResults, FullTimeResults], axis=1)
Results_df.columns = ['HTR', 'FTR']
display(Results_df)
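The counts above show each result in isolation. To see how outcomes change between half time and full time, one option is a cross-tabulation of `HTR` against `FTR`. The sketch below uses made-up toy rows and assumes the `H`/`D`/`A` result coding; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy rows for illustration only, not taken from the real dataset
sample = pd.DataFrame({
    "HTR": ["H", "H", "D", "D", "A", "A", "D", "H"],
    "FTR": ["H", "D", "D", "H", "A", "H", "A", "H"],
})

# Rows are half-time results, columns are full-time results, cells are match counts
transitions = pd.crosstab(sample["HTR"], sample["FTR"])

# Same table as row proportions: given a half-time result, how often each full-time result follows
transition_rates = pd.crosstab(sample["HTR"], sample["FTR"], normalize="index")

# Share of matches whose result changed after half time
changed_share = (sample["HTR"] != sample["FTR"]).mean()

print(transitions)
print(transition_rates.round(2))
print(changed_share)
```

Replacing `sample` with `match_data` would give the same tables for the full season. The diagonal of `transitions` holds the matches whose half-time result stood at full time.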