e4 vs d4 in Chess
In chess, the first move can significantly influence the course of the game. Two of the most common opening moves are e4, known for favoring open, tactical play, and d4, known for leaning toward a more closed, positional game. I will be using a Chess.com dataset of over 60,000 games to try to answer the question: is e4 really as exciting as they say? I'll be approaching the problem from a few different angles.
# import modules
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
# establish dataframe
df = pd.read_csv('club_games_data.csv')
display(df.head())
Data Pre-processing
I'm analyzing a Chess.com dataset on Kaggle uploaded by user ADITYAJHA1504 that was sourced from Chess.com's API and contains over 60,000 games.
Feature Extraction From PGN
I am using some code from Kaggle user ADITYAJHA1504 that extracts features from a PGN. Please find his workbook here.
# Code in this cell was borrowed from Kaggle user ADITYAJHA1504
feature_names = ['Event', 'Site', 'Start_Date', 'End_Date', 'Start_Time',
                 'End_Time', 'Eco', 'EcoName', 'Round', 'Result']
feature_positions = [0, 1, 2, -6, -7, -5, -15, -14, 3, 6]
#Takes in the name you want to give the feature, and the position of the feature in
#the pgn.split('\n') and creates the feature with feature name in the dataframe
for feature_name, position in zip(feature_names, feature_positions):
    df[feature_name] = df['pgn'].apply(
        lambda x: x.split('\n')[position].split('"')[1])
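#Extract the moves
def extract_move(pgn):
    # The move text is the second-to-last line of the PGN
    tokens = pgn.split("\n")[-2].split()
    if '{[' not in pgn:
        # No {[...]} annotations: every third token is a move number, so drop those
        toberemoved_list = tokens[::3]
        return [x for x in tokens if x not in toberemoved_list]
    else:
        # With {[...]} annotations, tokens come in groups of four
        # and the move is the second token of each group
        return tokens[1::4]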
df['Moves'] = df['pgn'].apply(extract_move)
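The position-based lookup above relies on every PGN having its header lines in the same order. As a minimal sketch of an alternative, assuming the headers follow the usual `[Tag "value"]` layout, the tags could be read by name instead. The names `TAG_PAIR` and `parse_pgn_headers` and the shortened sample PGN are hypothetical and used only for illustration; this is not the borrowed code's method.

```python
import re

# Matches PGN tag pairs such as: [Event "Live Chess"]
TAG_PAIR = re.compile(r'^\[(\w+)\s+"(.*)"\]', re.MULTILINE)

def parse_pgn_headers(pgn):
    """Return the PGN tag pairs as a {tag_name: value} dict."""
    return dict(TAG_PAIR.findall(pgn))

# Hypothetical, shortened PGN used only for illustration
sample_pgn = (
    '[Event "Live Chess"]\n'
    '[Site "Chess.com"]\n'
    '[Result "1-0"]\n'
    '[ECO "C20"]\n'
    '\n'
    '1. e4 e5 2. Qh5 Nc6 1-0\n'
)

headers = parse_pgn_headers(sample_pgn)
print(headers['Result'])
print(headers.get('ECO'))
```

Calling `df['pgn'].apply(parse_pgn_headers)` would give one dictionary per game, and `.get()` returns `None` for a tag that is missing rather than silently picking up the wrong line.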
Split Games into e4 and d4 Games
There are many opening moves white can choose, but for the purpose of this analysis, I am only comparing e4 and d4.
# Isolate first move and remove square brackets and apostrophes
df['Moves'] = (df['Moves'].astype(str)
               .str.replace('[', '', regex=False)
               .str.replace(']', '', regex=False)
               .str.replace(',', '', regex=False)
               .str.replace("'", '', regex=False)
               .str.strip())
df['first_move'] = df['Moves'].str.split().str[0]
# Create 0/1 indicator columns for e4 and d4 games
df['e4'] = (df['first_move'] == 'e4').astype(int)
df['d4'] = (df['first_move'] == 'd4').astype(int)
display(df)
Player Levels
Using the Elo rating for both white and black, I would like to rank players as:
- Beginner - Under 1200
- Intermediate - 1200 to 1599
- Advanced - 1600 to 1999
- Expert - 2000 and over
There is some deeper analysis that could be done by treating Elo rating as a continuous variable, but for the purpose of this analysis, breaking it into simpler categories will more than suffice.
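The bands listed above can also be written as a single binning step. This is only a sketch of an alternative using `pd.cut`, not the approach this notebook takes; the ratings are toy values chosen to sit on the band edges.

```python
import pandas as pd

# Toy ratings for illustration only; not taken from the dataset
ratings = pd.Series([850, 1199, 1200, 1599, 1600, 1999, 2000, 2450])

# right=False makes each bin include its lower edge and exclude its upper edge,
# so 1200 falls in "Intermediate" rather than "Beginner"
levels = pd.cut(
    ratings,
    bins=[-float('inf'), 1200, 1600, 2000, float('inf')],
    labels=['Beginner', 'Intermediate', 'Advanced', 'Expert'],
    right=False,
)
print(levels.tolist())
```

The result is a categorical column whose categories keep their order, which can be convenient when sorting or plotting by level.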
# Define a function to categorize the ratings
def categorize_rating(rating):
    if rating < 1200:
        return "Beginner"
    elif rating < 1600:
        return "Intermediate"
    elif rating < 2000:
        return "Advanced"
    else:
        return "Expert"
# Create new columns for the categorized ratings
df['white_level'] = df['white_rating'].apply(categorize_rating)
df['black_level'] = df['black_rating'].apply(categorize_rating)
Identify Types of Results
- Checkmates
- Resignations
- Timeouts
- Draws
# Create a function to identify checkmates
def is_checkmate(result):
    if result == 'checkmated':
        return 1
    else:
        return 0
# Create a new column to indicate if the game ended in a checkmate
df['checkmated'] = df['white_result'].apply(is_checkmate) + df['black_result'].apply(is_checkmate)
# Create a function to identify resignations
def is_resignation(result):
    if result == 'resigned':
        return 1
    else:
        return 0
# Create a new column to indicate if the game ended in a resignation
df['resigned'] = df['white_result'].apply(is_resignation) + df['black_result'].apply(is_resignation)
# Create a function to identify timeouts
def is_timeout(result):
    if result == 'timeout':
        return 1
    else:
        return 0
# Create a new column to indicate if the game ended in a timeout
df['timeout'] = df['white_result'].apply(is_timeout) + df['black_result'].apply(is_timeout)
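The three flags above follow the same pattern, so they could be produced by one vectorized helper. The sketch below assumes the `white_result` and `black_result` columns hold strings like the ones tested above; the helper name `ended_by` and the toy rows are hypothetical.

```python
import pandas as pd

def ended_by(games, outcome):
    """Return 1 where either side's result equals `outcome`, else 0."""
    either = games['white_result'].eq(outcome) | games['black_result'].eq(outcome)
    return either.astype(int)

# Toy rows for illustration only
toy = pd.DataFrame({
    'white_result': ['win', 'checkmated', 'timeout', 'win', 'agreed'],
    'black_result': ['checkmated', 'win', 'win', 'resigned', 'agreed'],
})
for outcome in ['checkmated', 'resigned', 'timeout']:
    toy[outcome] = ended_by(toy, outcome)
print(toy)
```

One difference from the summed version: the `|` caps the flag at 1 even if both sides carried the same result value, whereas adding the two columns would give 2 in that case.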
Identify Result
- Label the result by side
- Label the overall result
# Create a function to identify the result of the game
def game_result(result):
    if result == '1-0':
        return 'white_win'
    elif result == '0-1':
        return 'black_win'
    elif result == '1/2-1/2':
        return 'draw'
# Create new columns for the game result
df['game_result'] = df['Result'].apply(game_result)
df['white_win'] = (df['game_result'] == 'white_win').astype(int)
df['black_win'] = (df['game_result'] == 'black_win').astype(int)
df['draw'] = (df['game_result'] == 'draw').astype(int)
# Create a function to identify whether the game was a win or a tie
def overall_result(result):
    if result == '1-0':
        return 'win'
    elif result == '0-1':
        return 'win'
    elif result == '1/2-1/2':
        return 'draw'
df['overall_result'] = df['Result'].apply(overall_result)
Length of the Game
Since more exciting games might end more quickly, I am calculating the length of each game as the number of entries in the Moves column, where each entry is one move by either side.
# Create a new column to calculate the length of the Moves column
df['moves_length'] = df['Moves'].str.split().apply(len)
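With the columns built above, one possible way to put e4 and d4 side by side is a grouped summary. This is a rough sketch on toy rows with made-up values, not a result from the dataset; the column names match the ones created earlier, while `toy`, `subset` and `summary` are hypothetical names.

```python
import pandas as pd

# Toy rows with made-up values, shaped like the columns built above
toy = pd.DataFrame({
    'first_move':   ['e4', 'e4', 'd4', 'd4', 'c4', 'e4'],
    'moves_length': [40, 60, 80, 100, 70, 50],
    'checkmated':   [1, 0, 0, 1, 0, 1],
})

# Keep only the two openings being compared
subset = toy[toy['first_move'].isin(['e4', 'd4'])]

# One row per opening: game count, average length, share ending in checkmate
summary = subset.groupby('first_move').agg(
    games=('moves_length', 'count'),
    mean_length=('moves_length', 'mean'),
    checkmate_rate=('checkmated', 'mean'),
)
print(summary)
```

Because `checkmated` is a 0/1 column, its mean is the share of games that ended in checkmate, and the same trick would work for the `resigned`, `timeout` and `draw` flags.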