Skip to content

Soccer Data Analysis

This dataset contains data of every game from the 2018-2019 season in the English Premier League.

{Premier League Banner. (n.d.). https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.tvinsider.com%2Fshow%2Fpremier-league-soccer%2F&psig=AOvVaw32-CxpRtcUP-dLOPnk1iuf&ust=1721973135848000&source=images&cd=vfe&opi=89978449&ved=0CBQQjRxqFwoTCOCBhfq_wYcDFQAAAAAdAAAAABAU}

# original data set; no modifications.
import pandas as pd
soccer = pd.read_csv("soccer18-19.csv")
print(soccer.shape)
soccer.head(100)

Data Dictionary

ColumnExplanation
DivDivision the game was played in
DateThe date the game was played
HomeTeamThe home team
AwayTeamThe away team
FTHGFull time home goals
FTAGFull time away goals
FTRFull time result
HTHGHalf time home goals
HTAGHalf time away goals
HTRHalf time result
RefereeThe referee of the game
HSNumber of shots taken by home team
ASNumber of shots taken by away team
HSTNumber of shots taken by home team on target
ASTNumber of shots taken by away team on target
HFNumber of fouls made by home team
AFNumber of fouls made by away team
HCNumber of corners taken by home team
ACNumber of corners taken by away team
HYNumber of yellow cards received by home team
AYNumber of yellow cards received by away team
HRNumber of red cards received by home team
ARNumber of red cards received by away team

Source of dataset.

Fixing DataSet

Before we making SQL analyis, let's make the following changes to the DB:

  1. Add a matchID index column
  2. Drop 'Div' column as it's unnecessary
  3. Replace the 'AS' column name with 'ASS' because 'AS' is an SQL keyword and might cause errors
# create match_id column numbering it 1 - 380
soccer.insert(0, 'MatchID', range(1, len(soccer) + 1))

# drop 'Div' column
soccer = soccer.drop(columns='Div')

# replace column AS with ASS because AS is an SQL keyword and would interrupt future calculations
soccer['ASS'] = soccer.pop('AS')
soccer.head(5)

2018/19 Season Premier League Table

  • Summary of main end-of-season
  • SORT: by points (highest on top!)
  • To see code, toggle 'Hidden code' ABOVE table or the eye icon to the TOP LEFT of table 👇
Spinner
DataFrameas
df1
variable
-- Count matches won and lost for each team
SELECT 
    team,
	COUNT(MatchID) AS MP,
    COUNT(CASE WHEN result = 'win' THEN 1 END) AS W,
	COUNT(CASE WHEN result = 'draw' THEN 1 END) AS D,
    COUNT(CASE WHEN result = 'loss' THEN 1 END) AS L,
	SUM(goals_for) AS GF,
	SUM(goals_against) AS GA,
	SUM(goal_difference) AS GD,
	SUM(point) AS points
FROM (
    SELECT 
        HomeTeam AS team,
		MatchID,
        CASE 
            WHEN FTHG > FTAG THEN 'win'
            WHEN FTHG < FTAG THEN 'loss'
            ELSE 'draw'
        END AS result,
		FTHG as goals_for,
		FTAG as goals_against,
		(FTHG - FTAG) AS goal_difference,
		CASE 
			WHEN FTHG > FTAG THEN 3 
	   		WHEN FTHG = FTAG THEN 1 
			ELSE 0 
		END AS point
    FROM 
        soccer

    UNION ALL

    SELECT 
        AwayTeam AS team,
		MatchID,
        CASE 
            WHEN FTAG > FTHG THEN 'win'
            WHEN FTAG < FTHG THEN 'loss'
            ELSE 'draw'
        END AS result,
		FTAG as goals_for,
		FTHG as goals_against,
		(FTAG - FTHG) AS goal_difference,
		CASE 
			WHEN FTHG < FTAG THEN 3 
	   		WHEN FTHG = FTAG THEN 1 
			ELSE 0 
		END AS point
    FROM 
        soccer
) AS matches
GROUP BY 
	team
ORDER BY 
	points DESC;

Shooting Statistics

  • Summary of major shooting⚽️ statistics like shots taken, faced, cleansheets 🧤, goaless games, etc
  • SORT: Alphabetical
Hidden code df