2018-19 English Premier League: An Exploratory Data Analysis

    • This dataset contains data of every game from the 2018-2019 season in the English Premier League.
    • In this project, I aim to explore the data and communicate some interesting findings.
    • The last section of this project shows the correlation between various columns of the data.

    Source of dataset.

    Data Dictionary

    Column    Explanation
    Div       Division the game was played in
    Date      The date the game was played
    HomeTeam  The home team
    AwayTeam  The away team
    FTHG      Full time home goals
    FTAG      Full time away goals
    FTR       Full time result
    HTHG      Half time home goals
    HTAG      Half time away goals
    HTR       Half time result
    Referee   The referee of the game
    HS        Number of shots taken by home team
    AS        Number of shots taken by away team
    HST       Number of shots taken by home team on target
    AST       Number of shots taken by away team on target
    HF        Number of fouls made by home team
    AF        Number of fouls made by away team
    HC        Number of corners taken by home team
    AC        Number of corners taken by away team
    HY        Number of yellow cards received by home team
    AY        Number of yellow cards received by away team
    HR        Number of red cards received by home team
    AR        Number of red cards received by away team

    # Import the necessary libraries
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Load the dataset into a dataframe
    df = pd.read_csv("soccer18-19.csv")

    # Print the number of rows and columns
    print('Number of rows and columns:', df.shape)

    # Display the first five rows
    df.head()
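
A quick sanity check on the columns described in the data dictionary is to confirm that the full time result agrees with the full time goals. The sketch below is illustrative only: the rows are made up rather than taken from the real file, and the H/D/A encoding of FTR (home win, draw, away win) is an assumption about this dataset. The helper name result_from_goals is hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from the real dataset.
toy = pd.DataFrame({
    "HomeTeam": ["Team A", "Team B", "Team C"],
    "AwayTeam": ["Team B", "Team C", "Team A"],
    "FTHG": [2, 0, 1],
    "FTAG": [1, 0, 3],
    "FTR": ["H", "D", "A"],  # assumed encoding: H = home win, D = draw, A = away win
})


def result_from_goals(home_goals: int, away_goals: int) -> str:
    """Return the result code implied by a scoreline (assumed H/D/A encoding)."""
    if home_goals > away_goals:
        return "H"
    if home_goals < away_goals:
        return "A"
    return "D"


# Recompute the result from the goals and compare it with the stored FTR value.
expected = pd.Series(
    [result_from_goals(h, a) for h, a in zip(toy["FTHG"], toy["FTAG"])],
    index=toy.index,
)
mismatches = toy[toy["FTR"] != expected]
print("Rows where FTR disagrees with the scoreline:", len(mismatches))
```

The same comparison could be run on df after loading, provided the FTR column really uses this encoding.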
    

    Understanding Columns & Values

    • The info() function is a useful tool for summarizing the data.
    • Here, I look at each column's name, data type, and number of non-null values.
    • This reveals any missing values and helps me get familiar with the overall dataset.
    df.info()
    • Now, let's call isna() and aggregate the result with sum() to count the missing values in each column.
    df.isna().sum()
    • The data is complete as there are no null values.
    • This means that I don't have to alter the dataframe in any way.
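
Because the real data has no gaps, it can help to see what the same check reports when values are missing. This is a minimal sketch on a made-up frame (the team and referee names are placeholders, not real data) that deliberately contains two missing entries.

```python
import numpy as np
import pandas as pd

# Made-up frame with two deliberate gaps, for illustration only.
toy = pd.DataFrame({
    "HomeTeam": ["Team A", "Team B", "Team C"],
    "Referee": ["Ref X", None, "Ref Y"],
    "HS": [12, 9, np.nan],
})

# isna() marks each missing cell as True; sum() then counts them per column.
missing_per_column = toy.isna().sum()

# Summing once more gives a single number for the whole frame.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

A total of zero on the real dataframe is what justifies skipping any cleaning step.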

    Useful Statistics

    • Here, we'll use the describe() function.
    • It returns descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column.
    • Null values are excluded from these calculations. In our case, however, there aren't any.
    df.describe()
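
By default, describe() on a mixed dataframe summarizes only the numeric columns. As a small sketch on made-up rows (not the real data), the example below shows the numeric summary next to the summary that describe() gives for a single text column such as HomeTeam.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "HomeTeam": ["Team A", "Team B", "Team A", "Team C"],
    "FTHG": [2, 0, 1, 3],
    "FTAG": [1, 0, 3, 1],
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = toy.describe()

# A text column, described on its own: count, unique, top (most frequent), freq.
text_summary = toy["HomeTeam"].describe()

print(numeric_summary)
print(text_summary)
```

The text summary is a quick way to see how many distinct teams appear in a column and which one occurs most often.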
    • Using the unique() function to print the distinct values of the 'HomeTeam' column.
    • This shows us all the teams that participated in the season.
    df['HomeTeam'].unique()
    • Using the value_counts() function to print the number of rows for each unique team.
    • This shows how many matches each team played as the home team.
    • Note: Every team plays 19 matches as the home team and another 19 as the away team.
    df['HomeTeam'].value_counts(dropna=True)
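
The home and away balance noted above follows from the double round-robin format: each team hosts every other team once. The sketch below illustrates this with a hypothetical four-team league (placeholder names, generated fixtures, no real data), where each team should appear 3 times at home and 3 times away. With 20 teams, the same reasoning gives the 19 and 19 split.

```python
from itertools import permutations

import pandas as pd

# Hypothetical four-team league; every ordered pair is one fixture (home, away).
teams = ["Team A", "Team B", "Team C", "Team D"]
fixtures = pd.DataFrame(list(permutations(teams, 2)), columns=["HomeTeam", "AwayTeam"])

# Count appearances as home team and as away team.
home_counts = fixtures["HomeTeam"].value_counts()
away_counts = fixtures["AwayTeam"].value_counts()

# Add the two counts, aligned on team name, to get total matches per team.
total_counts = home_counts.add(away_counts, fill_value=0).astype(int)

n_teams = fixtures["HomeTeam"].nunique()
print("Teams:", n_teams)
print("Home matches per team:", home_counts.to_dict())
print("Total matches per team:", total_counts.to_dict())
```

Running the same three counts on df is a simple way to confirm that no fixture is missing or duplicated.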