2018-19 English Premier League: An Exploratory Data Analysis
- This dataset contains data of every game from the 2018-2019 season in the English Premier League.
- In this project, I aim to explore the data and communicate some interesting findings.
- The last section of this project shows the correlation between various columns of the data.
Source of dataset.
Data Dictionary
| Column | Explanation | 
|---|---|
| Div | Division the game was played in | 
| Date | The date the game was played | 
| HomeTeam | The home team | 
| AwayTeam | The away team | 
| FTHG | Full time home goals | 
| FTAG | Full time away goals | 
| FTR | Full time result | 
| HTHG | Half time home goals | 
| HTAG | Half time away goals | 
| HTR | Half time result | 
| Referee | The referee of the game | 
| HS | Number of shots taken by home team | 
| AS | Number of shots taken by away team | 
| HST | Number of shots on target by home team | 
| AST | Number of shots on target by away team | 
| HF | Number of fouls made by home team | 
| AF | Number of fouls made by away team | 
| HC | Number of corners taken by home team | 
| AC | Number of corners taken by away team | 
| HY | Number of yellow cards received by home team | 
| AY | Number of yellow cards received by away team | 
| HR | Number of red cards received by home team | 
| AR | Number of red cards received by away team | 
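The result columns in the dictionary are tied to the goal columns, so one quick sanity check is to recompute the full time result from FTHG and FTAG and compare it with FTR. The sketch below runs on a small made-up frame and assumes FTR is coded as 'H' (home win), 'D' (draw) and 'A' (away win); that coding is an assumption to confirm against the actual file.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "FTHG": [2, 1, 0],
    "FTAG": [1, 1, 3],
    "FTR": ["H", "D", "A"],  # assumed coding: H = home win, D = draw, A = away win
})

# Recompute the result from the goal difference.
goal_diff = toy["FTHG"] - toy["FTAG"]
derived = goal_diff.apply(lambda d: "H" if d > 0 else ("A" if d < 0 else "D"))

# True only if every recorded result matches the recomputed one.
consistent = bool((derived == toy["FTR"]).all())
print("FTR consistent with goals:", consistent)
```

The same comparison could be run with HTHG, HTAG and HTR for the half time result.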
# Importing necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Loading the dataset into a dataframe
df = pd.read_csv("soccer18-19.csv")
# Printing the number of rows and columns
print('Number of rows and columns:', df.shape)
# Printing out the first five rows
df.head()
Understanding Columns & Values
- The info() function is a useful tool to summarize the data.
- Here, I'm going to look at each column's name, datatype and number of non-null values.
- This is important to see if there are any missing values and to get familiar with the overall dataset.
df.info()
- Now, let's use the isna() function and aggregate the result with sum() to get the count of missing values in each column.
df.isna().sum()
- The data is complete as there are no null values.
- This means that I don't have to drop or fill in any rows before analyzing the dataframe.
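To show what the check would report if something were missing, here is a minimal sketch on a made-up three-match frame (the team names and scores are hypothetical, and one away score is deliberately left empty). It also collapses the per-column counts into a single yes/no answer.

```python
import pandas as pd

# Hypothetical three-match frame; the None in FTAG stands in for a missing score.
toy = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Chelsea", "Everton"],
    "FTHG": [2, 1, 0],
    "FTAG": [1, None, 3],
})

# Per-column count of missing values, the same call used on df above.
missing_per_column = toy.isna().sum()

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isna().any().any())

print(missing_per_column)
print("Any missing values:", has_missing)
```

On the real dataframe every count is zero, so the yes/no answer would be False.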
Useful Statistics
- Here, we'll be using the describe() function.
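- This gives us helpful descriptive statistics (count, mean, standard deviation, minimum, quartiles and maximum) for the numeric columns.
- Null values are excluded from these statistics. In our case, however, there aren't any.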
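As a small illustration of the point about null values, the sketch below calls describe() on a made-up column of shot counts with one value missing. The numbers are hypothetical and only meant to show that the missing row is left out of the count and the mean.

```python
import pandas as pd

# Made-up shot counts; one value is deliberately missing.
toy = pd.DataFrame({"HS": [10.0, 14.0, None, 12.0]})

stats = toy.describe()

# 'count' reports only the non-null rows, and the mean is taken over those rows.
print(stats.loc["count", "HS"])
print(stats.loc["mean", "HS"])
```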
df.describe()
- Using the unique() function to print the distinct values of the 'HomeTeam' column.
- This will show us all the teams that participated in the season.
df['HomeTeam'].unique()
- Using the value_counts() function to print out the number of rows for each unique team.
- This shows how many matches each team played as Home Team.
- Note: Every team plays 19 matches at home and another 19 away.
df['HomeTeam'].value_counts(dropna=True)
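The 19 home and 19 away note can be turned into an explicit check. The sketch below builds a stand-in fixture list with 20 hypothetical team names, where every team hosts every other team once, and then verifies the counts. On the real data, the same checks could be run on df instead of the made-up fixtures frame.

```python
import itertools
import pandas as pd

# Stand-in for the real data: 20 hypothetical teams, each hosting every other team once.
teams = [f"Team{i:02d}" for i in range(1, 21)]
fixtures = pd.DataFrame(
    list(itertools.permutations(teams, 2)),
    columns=["HomeTeam", "AwayTeam"],
)

home_counts = fixtures["HomeTeam"].value_counts()
away_counts = fixtures["AwayTeam"].value_counts()

# Align the two counts by team name and add them to get total matches per team.
total_counts = home_counts.add(away_counts, fill_value=0).astype(int)

print("Matches in the season:", len(fixtures))
print("19 home games each:", bool(home_counts.eq(19).all()))
print("19 away games each:", bool(away_counts.eq(19).all()))
print("38 games each in total:", bool(total_counts.eq(38).all()))
```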