## Soccer Data

This dataset contains data for every game of the 2018-2019 English Premier League season.

Not sure where to begin? Scroll to the bottom to find challenges!

```
#These are a few questions to answer:
#Are shots related to goals?
#Are shots on target related to goals?
#Are shots on target related to wins?
#Does winning at half time (HT) increase wins?
#Are corners related to goals?
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
soccer = pd.read_csv("soccer18-19.csv")
soccer
```
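One thing worth checking right after loading is how `Date` gets parsed. The sketch below assumes the file stores dates day-first as `dd/mm/yyyy`; that is an assumption about this CSV, not something stated above, and the two rows are made-up values for illustration only.

```python
import io
import pandas as pd

# Made-up rows standing in for soccer18-19.csv; the dd/mm/yyyy layout is an assumption.
sample_csv = io.StringIO(
    "Date,HomeTeam,AwayTeam\n"
    "10/08/2018,Team A,Team B\n"
    "11/08/2018,Team C,Team D\n"
)
sample = pd.read_csv(sample_csv)

# An explicit format keeps 10/08 from being read as October 8th.
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y")
print(sample["Date"].dt.month.tolist())
```

If the real file turns out to use another layout, only the `format` string would need to change.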

### Data Dictionary

Column | Explanation |
---|---|
Div | Division the game was played in |
Date | The date the game was played |
HomeTeam | The home team |
AwayTeam | The away team |
FTHG | Full time home goals |
FTAG | Full time away goals |
FTR | Full time result |
HTHG | Half time home goals |
HTAG | Half time away goals |
HTR | Half time result |
Referee | The referee of the game |
HS | Number of shots taken by home team |
AS | Number of shots taken by away team |
HST | Number of shots taken by home team on target |
AST | Number of shots taken by away team on target |
HF | Number of fouls made by home team |
AF | Number of fouls made by away team |
HC | Number of corners taken by home team |
AC | Number of corners taken by away team |
HY | Number of yellow cards received by home team |
AY | Number of yellow cards received by away team |
HR | Number of red cards received by home team |
AR | Number of red cards received by away team |

Source of dataset.

### Don't know where to start?

**Challenges are brief tasks designed to help you practice specific skills:**

- **Explore**: What team commits the most fouls?
- **Visualize**: Plot the percentage of games that ended in a draw over time.
- **Analyze**: Does the number of red cards a team receives have an effect on its probability of winning a game?
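For the draw-over-time challenge, one possible starting point is to group games by calendar month and take the share of draws in each group. This is a minimal sketch, not the only reasonable definition of "over time"; the helper name `draw_percentage_by_month` and the toy rows are hypothetical, and `Date` is assumed to already be a datetime column.

```python
import pandas as pd

def draw_percentage_by_month(games: pd.DataFrame) -> pd.Series:
    """Return the percentage of games drawn in each calendar month."""
    is_draw = games["FTHG"] == games["FTAG"]   # level score at full time
    month = games["Date"].dt.to_period("M")    # e.g. 2018-08
    return is_draw.groupby(month).mean() * 100

# Toy games for illustration only, not taken from the dataset.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2018-08-10", "2018-08-11", "2018-09-01", "2018-09-02"]),
    "FTHG": [1, 2, 0, 3],
    "FTAG": [1, 0, 0, 3],
})
monthly = draw_percentage_by_month(toy)
print(monthly)
```

On the real data, `draw_percentage_by_month(soccer).plot()` would give a simple line chart once `Date` has been converted.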

**Scenarios are broader questions to help you develop an end-to-end project for your portfolio:**

You have just been hired as a data analyst for a local soccer team. The team has recently signed on some junior players and wants to give them as much experience as possible without losing games. If the head coach could be confident in the outcome of a game by halftime, they would be more likely to give the junior players time on the field.

The coach has asked you whether you can predict the outcome of a game by the results at halftime and how confident you would be in the prediction.

You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.
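A simple way to approach the coach's question is a table of full-time outcomes conditional on the half-time result. The sketch below uses `pd.crosstab` with row-wise normalisation; it assumes `HTR` and `FTR` are coded as `H`, `D`, `A` (home win, draw, away win), which the data dictionary does not spell out, and the toy rows and the function name `halftime_to_fulltime` are hypothetical.

```python
import pandas as pd

def halftime_to_fulltime(games: pd.DataFrame) -> pd.DataFrame:
    """Rows: half-time result. Columns: full-time result. Values: row-wise shares."""
    return pd.crosstab(games["HTR"], games["FTR"], normalize="index")

# Toy results for illustration only; H/D/A coding is an assumption.
toy = pd.DataFrame({
    "HTR": ["H", "H", "H", "H", "D", "D", "A", "A"],
    "FTR": ["H", "H", "H", "D", "D", "A", "A", "H"],
})
table = halftime_to_fulltime(toy)
print(table)
```

Each row sums to 1, so the cell in row `H`, column `H` reads as the share of games a team leading at home at half time went on to win. With a single season of data these shares are estimates and deserve some caution in the report.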

```
#Exploratory Analysis
#Converting Date column to datetime object
soccer['Date'] = pd.to_datetime(soccer['Date'])
print(soccer.describe())
#Full time result to binary
#Team with most fouls
Hfouls = soccer.groupby('HomeTeam')['HF'].sum()
Hfouls = pd.DataFrame(Hfouls,columns = ['HF'])
Afouls = soccer.groupby('AwayTeam')['AF'].sum()
Afouls = pd.DataFrame(Afouls,columns = ['AF'])
Total_fouls = Hfouls.merge(Afouls, how = 'left', left_index = True, right_index = True)
Total_fouls['Total'] = Total_fouls['HF'] + Total_fouls['AF']
print(Total_fouls)
print(Total_fouls.loc[Total_fouls['Total'] == max(Total_fouls['Total'])])
#Number of drawn games (draw = 1 when the full-time score is level)
soccer['draw'] = (soccer['FTHG'] == soccer['FTAG']).astype(int)
draws = soccer['draw'].sum()
print(draws)
# Defining wins
soccer['Home_win'] = (soccer['FTHG'] > soccer['FTAG']).astype(int)
soccer['Away_win'] = (soccer['FTHG'] < soccer['FTAG']).astype(int)
# Calculating Total Wins
Hwins = soccer.groupby('HomeTeam')['Home_win'].sum()
Hwins = pd.DataFrame(Hwins,columns = ['Home_win'])
Awins = soccer.groupby('AwayTeam')['Away_win'].sum()
Awins = pd.DataFrame(Awins, columns = ['Away_win'] )
Total_wins = Hwins.merge(Awins, how = 'left', left_index = True, right_index = True)
Total_wins.index.names = ['Team']
#Home_and_total = Total_wins.merge(Hwins, how = 'left', on = 'HomeTeam')
print(Total_wins)
#Calculating Total Corners
Hc = soccer.groupby('HomeTeam')['HC'].sum()
Hc = pd.DataFrame(Hc,columns = ['HC'])
Ac = soccer.groupby('AwayTeam')['AC'].sum()
Ac = pd.DataFrame(Ac,columns = ['AC'])
Total_corners = Hc.merge(Ac, how = 'left', left_index = True, right_index = True)
Total_corners.index.names = ['Team']
print(Total_corners)
#Calculating Total Red Cards
Hred = soccer.groupby('HomeTeam')['HR'].sum()
Hred = pd.DataFrame(Hred,columns = ['HR'])
Ared = soccer.groupby('AwayTeam')['AR'].sum()
Ared = pd.DataFrame(Ared,columns = ['AR'])
Total_reds = Hred.merge(Ared, how = 'left', left_index = True, right_index = True)
Total_reds.index.names = ['Team']
print(Total_reds)
#Calculating Total Shots
Hshots = soccer.groupby('HomeTeam')['HS'].sum()
Hshots = pd.DataFrame(Hshots,columns = ['HS'])
Ashots = soccer.groupby('AwayTeam')['AS'].sum()
Ashots = pd.DataFrame(Ashots,columns = ['AS'])
Total_shots = Hshots.merge(Ashots, how = 'left', left_index = True, right_index = True)
Total_shots.index.names = ['Team']
print(Total_shots)
#Merging all stats
Total_stats = Total_wins.merge(Total_shots, how = 'left', left_index = True, right_index = True)
Total_stats.columns = ['Home_win', 'Away_win', 'Home_shots', 'Away_shots']
Total_stats = Total_stats.merge(Total_reds, how = 'left', left_index = True, right_index = True)
Total_stats['Total_wins'] = Total_stats['Home_win'] + Total_stats['Away_win']
Total_stats['Total_shots'] = Total_stats['Home_shots'] + Total_stats['Away_shots']
Total_stats['Total_reds'] = Total_stats['HR'] + Total_stats['AR']
#Total_stats['Total_corners'] = Total_corners['HC'] = Total_corners['AC']
print(Total_stats)
#Percentage of games that ended in a draw over the whole season
soccer['draw'] = (soccer['FTHG'] == soccer['FTAG']).astype(int)
draws = soccer['draw'].sum()
Wins = Total_stats['Total_wins'].sum()
#Divide by all games played, not by the number of wins
Percentage = draws / len(soccer) * 100
print('The percentage of games that ended in a draw is', Percentage)
#Total Wins vs Shots
plt.figure()
sns.scatterplot(data = Total_stats, x = 'Total_wins', y = 'Total_shots')
sns.regplot(data = Total_stats, x = 'Total_wins', y = 'Total_shots')
plt.title('Relation between shots and wins')
plt.xlabel('Wins')
plt.ylabel('Shots')
#Total wins vs red cards
plt.figure()
sns.scatterplot(data = Total_stats, x = 'Total_wins', y = 'Total_reds')
sns.regplot(data = Total_stats, x = 'Total_wins', y = 'Total_reds')
plt.title('Relation between red cards and wins')
plt.xlabel('Wins')
plt.ylabel('Red cards')
#Linear Regression more shots -> more wins
shots_vs_wins = ols('Total_wins ~ Total_shots', data = Total_stats)
shots_vs_wins = shots_vs_wins.fit()
print(shots_vs_wins.summary())
#Linear Regression red cards -> wins
reds_vs_wins = ols('Total_wins ~ Total_reds', data = Total_stats)
reds_vs_wins = reds_vs_wins.fit()
print(reds_vs_wins.summary())
#Convert to tidy data
#soccer_tidy = pd.melt(soccer[['HomeTeam', 'Home_win']], ['HomeTeam'])
#soccer_tidy = soccer_tidy.sort_values(by = 'HomeTeam')
#print(soccer_tidy)
```
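The cell above repeats the same groupby-and-merge pattern for fouls, wins, corners, red cards and shots. As a sketch of how that could be condensed, the hypothetical helper `team_totals` below sums a home column and an away column per team; the toy rows are made up for illustration.

```python
import pandas as pd

def team_totals(games: pd.DataFrame, home_col: str, away_col: str, name: str) -> pd.Series:
    """Sum a home-side and an away-side column per team into one Series."""
    home = games.groupby("HomeTeam")[home_col].sum()
    away = games.groupby("AwayTeam")[away_col].sum()
    # fill_value=0 keeps teams that appear only at home or only away
    return home.add(away, fill_value=0).rename(name).rename_axis("Team")

# Toy games for illustration only.
toy = pd.DataFrame({
    "HomeTeam": ["A", "B", "A"],
    "AwayTeam": ["B", "A", "C"],
    "HF": [10, 8, 12],
    "AF": [9, 11, 7],
})
fouls = team_totals(toy, "HF", "AF", "Total_fouls")
print(fouls)
print("Most fouls:", fouls.idxmax())
```

On the real data, calls such as `team_totals(soccer, 'HF', 'AF', 'Total_fouls')` or `team_totals(soccer, 'HS', 'AS', 'Total_shots')` could then be combined with `pd.concat(..., axis=1)`.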

```
#Same steps as before using shots on target
Hshots_target = soccer.groupby('HomeTeam')['HST'].sum()
Hshots_target = pd.DataFrame(Hshots_target,columns = ['HST'])
Ashots_target = soccer.groupby('AwayTeam')['AST'].sum()
Ashots_target = pd.DataFrame(Ashots_target,columns = ['AST'])
Total_shots_target = Hshots_target.merge(Ashots_target, how = 'left', left_index = True, right_index = True)
Total_shots_target.index.names = ['Team']
print(Total_shots_target)
Total_stats_target = Total_wins.merge(Total_shots_target, how = 'left', left_index = True, right_index = True)
Total_stats_target.columns = ['Home_win', 'Away_win', 'Home_shots', 'Away_shots']
Total_stats_target['Total_wins'] = Total_stats_target['Home_win'] + Total_stats_target['Away_win']
Total_stats_target['Total_shots_on_target'] = Total_stats_target['Home_shots'] + Total_stats_target['Away_shots']
Total_stats_target['Total_corners'] = Total_corners['HC'] + Total_corners['AC']
Total_stats_target['Total_reds'] = Total_reds['HR'] + Total_reds['AR']
Total_stats_target['Total_fouls'] = Total_fouls['Total']
print(Total_stats_target)
#Scatterplot with regression line Shots on target vs wins
sns.scatterplot(data = Total_stats_target, x = 'Total_wins', y = 'Total_shots_on_target')
sns.regplot(data = Total_stats_target, x = 'Total_wins', y = 'Total_shots_on_target')
plt.title('Relation between shots on target and wins')
plt.xlabel('Wins')
plt.ylabel('Shots on target')
plt.show()
#Scatterplot with regression line fouls vs wins
sns.scatterplot(data = Total_stats_target, x = 'Total_wins', y = 'Total_fouls')
sns.regplot(data = Total_stats_target, x = 'Total_wins', y = 'Total_fouls')
plt.title('Relation between fouls and wins')
plt.xlabel('Wins')
plt.ylabel('Fouls')
plt.show()
#Linear Regression more shots on target -> more wins
shots_vs_wins = ols('Total_wins ~ Total_shots_on_target', data = Total_stats_target)
shots_vs_wins = shots_vs_wins.fit()
print(shots_vs_wins.summary())
```
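Season totals give only one data point per team. To look at the game-level questions (for example, whether shots on target are related to goals), one option is to reshape the data so that each game contributes one row per team. This is a rough sketch of that tidy layout; the function name `to_team_games`, the output column names and the toy rows are all hypothetical.

```python
import pandas as pd

def to_team_games(games: pd.DataFrame) -> pd.DataFrame:
    """Return one row per team per game, stacking home and away perspectives."""
    home = pd.DataFrame({
        "Team": games["HomeTeam"], "Venue": "Home",
        "Goals": games["FTHG"], "ShotsOnTarget": games["HST"],
        "Win": (games["FTHG"] > games["FTAG"]).astype(int),
    })
    away = pd.DataFrame({
        "Team": games["AwayTeam"], "Venue": "Away",
        "Goals": games["FTAG"], "ShotsOnTarget": games["AST"],
        "Win": (games["FTAG"] > games["FTHG"]).astype(int),
    })
    return pd.concat([home, away], ignore_index=True)

# Toy games for illustration only.
toy = pd.DataFrame({
    "HomeTeam": ["A", "B", "C"], "AwayTeam": ["B", "C", "A"],
    "FTHG": [2, 0, 1], "FTAG": [1, 0, 3],
    "HST": [6, 2, 3], "AST": [3, 1, 7],
})
long_games = to_team_games(toy)
corr = long_games["Goals"].corr(long_games["ShotsOnTarget"])  # Pearson correlation
print(long_games)
print("Correlation between shots on target and goals:", corr)
```

The same long table could feed `ols('Goals ~ ShotsOnTarget', data=long_games)`, with many more observations than the per-team totals provide.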

```
#Home teams
total_home_wins = soccer.groupby('HomeTeam')['Home_win'].sum()
total_away_wins = soccer.groupby('AwayTeam')['Away_win'].sum()
#Lineplot of home and away wins
#palette only applies together with hue, so it is dropped; each line takes the next default colour
ax = sns.lineplot(data = total_home_wins, label="Home wins")
ax = sns.lineplot(data = total_away_wins, label="Away wins")
ax.set_xlabel('Team')
ax.set_ylabel('Wins')
ax.set_title('Comparison between home and away wins')
ax.tick_params(axis = 'x', rotation = 90)
print(soccer.groupby('HomeTeam')['Home_win'].mean())
```

```
#most_goals = soccer.groupby('HomeTeam')['FTHG'].sum()
#print(most_goals)
#print(soccer)
#print(soccer_tidy)
#ax = sns.lineplot(data = foul_sum, x = 'HomeTeam', y = 'HF')
#ax.tick_params(axis='x', rotation=90)
```

```
#Machine Learning
#from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
Total_stats_full = Total_stats_target[['Total_wins', 'Total_shots_on_target', 'Total_reds', 'Total_corners']]
print(Total_stats_full)
X = Total_stats_full.drop(['Total_wins', 'Total_reds'], axis = 1).values
print(X)
#X = X.reshape(-1,1)
y = Total_stats_full['Total_wins'].values
print(y)
print(X.shape, y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
reg = LinearRegression()
reg.fit(X_train, y_train)
y_predict = reg.predict(X_test)
reg_score = reg.score(X_test, y_test)
reg_mse = mean_squared_error(y_test, y_predict)
print(reg_score)
print(reg_mse)
#Model performance
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=4, shuffle=True, random_state=42)
cv_results = cross_val_score(reg, X, y,cv= kf)
print(cv_results)
print(np.mean(cv_results))
#from sklearn.metrics import classification_report, confusion_matrix
#print(confusion_matrix(y_test, y_predict))
#print(y_test)
#print(y_predict)
#GridSearchCV
from sklearn.model_selection import GridSearchCV
kf = KFold(n_splits=5, shuffle=True, random_state=42)
#Ten evenly spaced alpha values; each solver must be its own string
param_grid = {'alpha' : np.linspace(0.0001, 1, 10), 'solver':['sag', 'lsqr']}
ridge = Rid
```