Soccer Data

    This dataset contains data on every game from the 2018-2019 English Premier League season.

    Not sure where to begin? Scroll to the bottom to find challenges!

    import pandas as pd
    
    pd.read_csv("soccer18-19.csv")

    Data Dictionary

    Column    Explanation
    Div       Division the game was played in
    Date      The date the game was played
    HomeTeam  The home team
    AwayTeam  The away team
    FTHG      Full time home goals
    FTAG      Full time away goals
    FTR       Full time result
    HTHG      Half time home goals
    HTAG      Half time away goals
    HTR       Half time result
    Referee   The referee of the game
    HS        Number of shots taken by home team
    AS        Number of shots taken by away team
    HST       Number of shots taken by home team on target
    AST       Number of shots taken by away team on target
    HF        Number of fouls made by home team
    AF        Number of fouls made by away team
    HC        Number of corners taken by home team
    AC        Number of corners taken by away team
    HY        Number of yellow cards received by home team
    AY        Number of yellow cards received by away team
    HR        Number of red cards received by home team
    AR        Number of red cards received by away team

    Source of dataset.

    Don't know where to start?

    Challenges are brief tasks designed to help you practice specific skills:

    • 🗺️ Explore: What team commits the most fouls?
    • 📊 Visualize: Plot the percentage of games that ended in a draw over time.
    • 🔎 Analyze: Does the number of red cards a team receives have an effect on its probability of winning a game?

    Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

    You have just been hired as a data analyst for a local soccer team. The team has recently signed on some junior players and wants to give them as much experience as possible without losing games. If the head coach could be confident in the outcome of a game by halftime, they would be more likely to give the junior players time on the field.

    The coach has asked you whether you can predict the outcome of a game by the results at halftime and how confident you would be in the prediction.

    You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.
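    One possible starting point for the halftime scenario is a table of how often each half-time result (HTR) turns into each full-time result (FTR). The sketch below is only an illustration: the helper name halftime_to_fulltime and the toy rows are made up for this example, and the real file would be passed in instead.

```python
import pandas as pd


def halftime_to_fulltime(games: pd.DataFrame) -> pd.DataFrame:
    """Return P(full-time result | half-time result) as a row-normalised table."""
    # normalize="index" makes every HTR row sum to 1
    return pd.crosstab(games["HTR"], games["FTR"], normalize="index")


# Toy rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "HTR": ["H", "H", "H", "D", "D", "A", "A", "H"],
    "FTR": ["H", "H", "D", "D", "A", "A", "A", "H"],
})
table = halftime_to_fulltime(toy)
print(table.round(2))

# With the real file:
# table = halftime_to_fulltime(pd.read_csv("soccer18-19.csv"))
```

    Each row can be read as a rough confidence level: the larger the share on the diagonal, the more often the half-time result held until full time. A single season is a small sample, so the proportions should be reported with that caveat.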

    🗺️ Explore: What team commits the most fouls?

    # Load the dataset
    df = pd.read_csv("soccer18-19.csv")
    
    # Sum the fouls each team committed at home (HF) and away (AF);
    # fill_value=0 keeps a team that is missing from either side
    home_fouls = df.groupby('HomeTeam')['HF'].sum()
    away_fouls = df.groupby('AwayTeam')['AF'].sum()
    fouls_by_team = home_fouls.add(away_fouls, fill_value=0)
    
    # Get the team with the most fouls
    team_with_most_fouls = fouls_by_team.idxmax()
    
    # Print the team with the most fouls
    print("The team that committed the most fouls was " + team_with_most_fouls)

    🎯 Answer: The team that committed the most fouls was Brighton.

    📊 Visualize: Plot the percentage of games that ended in a draw over time.

    import matplotlib.pyplot as plt
    
    # Calculate the overall percentage of games that ended in a draw
    # FTR	Full time result ('D' marks a draw)
    draw_percentage = (df['FTR'] == 'D').mean() * 100
    
    labels = 'Draw', 'No draw'
    sizes = [draw_percentage, 100 - draw_percentage]
    
    fig, ax = plt.subplots()
    ax.pie(sizes, labels=labels, autopct='%1.1f%%')
    ax.set_title('Share of games ending in a draw (whole season)')
    plt.show()
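    The pie chart summarises the whole season in one number. To look at the draw rate over time, as the challenge asks, one option is to group games by calendar month. The sketch below assumes the Date column holds day-first strings such as "10/08/2018" (not confirmed by the data dictionary); the helper names and the toy rows are hypothetical.

```python
import pandas as pd


def draw_pct_by_month(games: pd.DataFrame) -> pd.Series:
    """Percentage of games drawn in each calendar month."""
    # Assumption: dates are day-first strings such as "10/08/2018"
    dates = pd.to_datetime(games["Date"], dayfirst=True)
    is_draw = games["FTR"].eq("D")
    # The mean of a boolean series is the share of True values
    return is_draw.groupby(dates.dt.to_period("M")).mean().mul(100)


def plot_draw_pct(monthly: pd.Series):
    """Line chart of the monthly draw percentage."""
    import matplotlib.pyplot as plt  # imported here so the helper above needs only pandas

    fig, ax = plt.subplots()
    ax.plot(monthly.index.astype(str), monthly.values, marker="o")
    ax.set_xlabel("Month")
    ax.set_ylabel("Games drawn (%)")
    ax.set_title("Share of games ending in a draw, by month")
    ax.tick_params(axis="x", rotation=45)
    return ax


# Toy rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "Date": ["10/08/2018", "11/08/2018", "01/09/2018",
             "02/09/2018", "15/09/2018", "22/09/2018"],
    "FTR": ["D", "H", "D", "D", "D", "A"],
})
monthly = draw_pct_by_month(toy)
print(monthly)

# With the real file:
# plot_draw_pct(draw_pct_by_month(pd.read_csv("soccer18-19.csv")))
```

    Monthly buckets are a compromise: weekly buckets would contain too few games for a stable percentage, while a single season-wide figure hides any trend.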

    🔎 Analyze: Does the number of red cards a team receives have an effect on its probability of winning a game?

    # Create a new column with the difference in red cards between home and away teams
    # HR	Number of red cards received by home team
    # AR	Number of red cards received by away team
    df['RedDiff'] = df['HR'] - df['AR']
    
    # Create a new column indicating whether the home team won or not
    # FTR	Full time result
    df['HomeWin'] = (df['FTR'] == 'H').astype(int)
    
    # Group the data by the difference in red cards and calculate the mean of home wins
    red_diff_home_win = df.groupby('RedDiff')['HomeWin'].mean()
    
    # Plot the results
    plt.plot(red_diff_home_win.index, red_diff_home_win.values)
    plt.xlabel('Difference in Red Cards (Home - Away)')
    plt.ylabel('Probability of Home Win')
    plt.title('Effect of Red Cards on Probability of Home Win')
    plt.show()
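    The plot above looks at the question from the home team's side only. A complementary view, sketched below, reshapes the data to one row per team per game and compares win rates by the number of red cards that team received. It assumes an away win is coded 'A' in FTR (the code above only confirms 'H' and 'D'); the helper name and the toy rows are hypothetical.

```python
import pandas as pd


def win_rate_by_red_cards(games: pd.DataFrame) -> pd.DataFrame:
    """Win rate and game count, grouped by red cards a team received in a game."""
    # One row per home side: its red cards and whether it won
    home = pd.DataFrame({"reds": games["HR"], "won": games["FTR"].eq("H")})
    # One row per away side (assumption: 'A' marks an away win)
    away = pd.DataFrame({"reds": games["AR"], "won": games["FTR"].eq("A")})
    team_games = pd.concat([home, away], ignore_index=True)
    return team_games.groupby("reds")["won"].agg(win_rate="mean", games="size")


# Toy rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "HR": [0, 1, 0, 0],
    "AR": [0, 0, 1, 0],
    "FTR": ["H", "A", "H", "D"],
})
summary = win_rate_by_red_cards(toy)
print(summary)

# With the real file:
# summary = win_rate_by_red_cards(pd.read_csv("soccer18-19.csv"))
```

    The games column matters when reading the result: red cards are rare, so the win rate for one or more red cards rests on few games, and a lower win rate there shows an association rather than proof of cause.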