Serie A 2023–2024: Calculating Match Odds Using PCA, KMeans and Support Vector Machines
The goal of this project is to emulate the way bookies assign odds to match results.
For this purpose, Serie A data is used.
Team stats and match results are taken from the https://fbref.com website.
Notes: The data is scraped directly from the page. If you want to replicate the code and get exactly the same results, you need to use the xlsx files included in the repo (merged_df_serieA, games_serieA, baseprueba2_serieA). If you run the code without those files, you will get different results, because at the time of writing the season is only halfway through, and the stats and match results will keep being updated as games are played. Likewise, the model's accuracy should be higher in the later stages of the season and lower at the beginning, so it is best to run the calculations after roughly the first 8 or more matchweeks. Last but not least, if you want to know the meaning of each variable in the fbref data, you should visit the website.
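For the reproducibility route, the snippet below is a minimal sketch of how those saved snapshots could be loaded; the exact file names (with the .xlsx extension) and the variable names are assumptions, since the repo files are only listed by name above.
#Load the saved snapshots instead of scraping (file and variable names assumed; requires openpyxl)
import pandas as pd
merged_df = pd.read_excel("merged_df_serieA.xlsx")
games = pd.read_excel("games_serieA.xlsx")
baseprueba2 = pd.read_excel("baseprueba2_serieA.xlsx")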
First, load the Python modules that are going to be needed:
#Install scrapy first if it is not already available (run in a terminal, or prefix with ! in a notebook)
#pip install scrapy
import requests
from scrapy import Selector
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
The next cells contain custom helper functions that are used later in the code, including conexion_fbref_page, tableScraper, colNames, and resultPoints (12 hidden cells).
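Since those cells are collapsed, the sketch below shows one way the main helpers could work, assuming requests plus a scrapy Selector for fetching and pandas for parsing (the modules imported in the cell above); the real implementations in the repo may differ in the details.
#Sketch only: assumed behaviour of the hidden helpers, not necessarily the repo's exact code
def conexion_fbref_page(url):
    #Download the page and wrap its HTML in a scrapy Selector
    response = requests.get(url)
    return Selector(text=response.text)

def tableScraper(selector, table_id):
    #Locate the table with the given id and parse it into a DataFrame
    table_html = selector.xpath(f'//table[@id="{table_id}"]').get()
    return pd.read_html(table_html)[0]

def colNames(df):
    #Flatten fbref's two-level header, keeping only the lower level
    df.columns = [c[1] for c in df.columns]
    return df

def resultPoints(last_five):
    #Turn a string such as "W D L W W" into the points earned in those games
    points = {'W': 3, 'D': 1, 'L': 0}
    return sum(points[r] for r in last_five.split())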
#URLs and table IDs
urlStats = "https://fbref.com/en/comps/11/Serie-A-Stats"
resultTableID = 'results2023-2024111_overall'
urlGames = "https://fbref.com/en/comps/11/schedule/Serie-A-Scores-and-Fixtures"
gamesTableID = 'sched_2023-2024_11_1'
#get html
selector = conexion_fbref_page(urlStats)
#Read all the tables from which we want to extract data
season = tableScraper(selector, resultTableID)
standard = tableScraper(selector, 'stats_squads_standard_for')
goalkeeping = tableScraper(selector, 'stats_squads_keeper_for')
shooting = tableScraper(selector, 'stats_squads_shooting_for')
passing = tableScraper(selector, 'stats_squads_passing_for')
defensive = tableScraper(selector, 'stats_squads_defense_for')
#Extract only relevant data from season table
season = season.iloc[:, [1,8, 10, 14,15]]
#Convert the column with the results of the last five matches into points
season['Last 5'] = season['Last 5'].apply(resultPoints)
#Print table
season
#Extract only relevant data from season standard
standard = standard.iloc[:, [0] + list(range(20, len(standard.columns)))]
#rename columns
standard = colNames(standard)
#Print
standard
#Extract only relevant data from season goalkeeping
goalkeeping = goalkeeping.iloc[:, [0] + list(range(7, 16))]
#rename
goalkeeping = colNames(goalkeeping)
#print
goalkeeping
#Extract only relevant data from season shooting
shooting = shooting.iloc[:, [0] + list(range(4, 11))+ list(range(17, 20))]
#rename
shooting = colNames(shooting)
#print
shooting
#Extract only relevant data from season passing
passing = passing.iloc[:, [0] + list(range(3, 17))]
#We rename in a different way because of duplicate column names (see the small illustration below)
columnas = [c[0] + c[1] for c in passing.columns]
columnas[0] = 'Squad'
passing.columns = columnas
#print
passing
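To see why the different renaming is needed: the passing table comes with a two-level header in which the same lower-level name (e.g. Cmp) appears under several groups, so keeping only the lower level would produce duplicate columns. A small illustration with made-up column tuples:
#Illustration only: hypothetical two-level columns similar to the passing table
cols = pd.MultiIndex.from_tuples([('Unnamed', 'Squad'), ('Short', 'Cmp'), ('Long', 'Cmp')])
demo = pd.DataFrame([['Inter', 300, 60]], columns=cols)
#Concatenating both levels keeps the repeated lower-level names distinct
demo.columns = [c[0] + c[1] for c in demo.columns]
print(demo.columns.tolist())   #['UnnamedSquad', 'ShortCmp', 'LongCmp']
This is also why the first column name is reset to 'Squad' above: its concatenated label would otherwise include the unnamed top level.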
#Extract only relevant data from season defensive actions
defensive = defensive.iloc[:, [0] + list(range(3, 8)) + list(range(12, 19))]
#rename
defensive = colNames(defensive)
#print
defensive