End to End Sports Data Analysis Project 1 - NFL Passing Through The Millenium

DataFrameas

df

variable

-- NFL passing statistics since 2001. Contains record of every player who attempted a pass within the time period. Tracked metrics include passing yards, passing touchdowns, pass attempts, completions, interceptions, and touchdown/interception/completion percentages. More advanced metrics like yards per attempt, adjusted net yards per attempt, and other similar metrics are also included.
SELECT * FROM read_csv_auto('NFL_QB_Stats.csv');

Questions to Ask:

Which decade was the most efficient in terms of passing stats?
Determine the best statistical passer every year.
Are gunslingers more likely to throw touchdowns than check-down specialists? Are they generally better quarterbacks statistically?
Create a predictive model for the average passing stats for the rest of the decade.

# Which decade was more efficient, the 1970s, 1980s, 1990s, 2000s, or 2010s?
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Check the columns of the dataframe
print(df.columns)

# Filter data for the years 1971 to 2020 and players who attempted at least 100 passes in a single season
df_filtered = df[(df['Year'] >= 1971) & (df['Year'] <= 2020) & (df['Att'] >= 100)]

# Group by year and calculate the mean rate for each year
yearly_rate = df_filtered.groupby('Year')['Rate'].mean().reset_index()

# Separate the data into five groups: 1971-1980, 1981-1990, 1991-2000, 2001-2010, and 2011-2020
group1 = yearly_rate[(yearly_rate['Year'] >= 1971) & (yearly_rate['Year'] <= 1980)]
group2 = yearly_rate[(yearly_rate['Year'] >= 1981) & (yearly_rate['Year'] <= 1990)]
group3 = yearly_rate[(yearly_rate['Year'] >= 1991) & (yearly_rate['Year'] <= 2000)]
group4 = yearly_rate[(yearly_rate['Year'] >= 2001) & (yearly_rate['Year'] <= 2010)]
group5 = yearly_rate[(yearly_rate['Year'] >= 2011) & (yearly_rate['Year'] <= 2020)]

# Plot the data
plt.figure(figsize=(12, 6))
plt.plot(group1['Year'], group1['Rate'], label='1971-1980', color='green')
plt.plot(group2['Year'], group2['Rate'], label='1981-1990', color='red')
plt.plot(group3['Year'], group3['Rate'], label='1991-2000', color='purple')
plt.plot(group4['Year'], group4['Rate'], label='2001-2010', color='blue')
plt.plot(group5['Year'], group5['Rate'], label='2011-2020', color='orange')
plt.xlabel('Year')
plt.ylabel('Rate')
plt.xticks(range(1971, 2021, 1))
plt.title('Average Passing Rate by Year (1971-2020) min. 100 Att.')
plt.legend()
plt.grid(True)
plt.show()

# Plot the relationship between TD and Int, grouped by the same parameters
# Group by year and calculate the mean TD and Int for each year
yearly_td_int = df_filtered.groupby('Year')[['TD', 'INT']].mean().reset_index()

# Separate the data into five groups: 1971-1980, 1981-1990, 1991-2000, 2001-2010, and 2011-2020
group1_td_int = yearly_td_int[(yearly_td_int['Year'] >= 1971) & (yearly_td_int['Year'] <= 1980)]
group2_td_int = yearly_td_int[(yearly_td_int['Year'] >= 1981) & (yearly_td_int['Year'] <= 1990)]
group3_td_int = yearly_td_int[(yearly_td_int['Year'] >= 1991) & (yearly_td_int['Year'] <= 2000)]
group4_td_int = yearly_td_int[(yearly_td_int['Year'] >= 2001) & (yearly_td_int['Year'] <= 2010)]
group5_td_int = yearly_td_int[(yearly_td_int['Year'] >= 2011) & (yearly_td_int['Year'] <= 2020)]

# Plot the data
plt.figure(figsize=(12, 6))
plt.scatter(group1_td_int['TD'], group1_td_int['INT'], label='1971-1980', color='green')
plt.scatter(group2_td_int['TD'], group2_td_int['INT'], label='1981-1990', color='red')
plt.scatter(group3_td_int['TD'], group3_td_int['INT'], label='1991-2000', color='purple')
plt.scatter(group4_td_int['TD'], group4_td_int['INT'], label='2001-2010', color='blue')
plt.scatter(group5_td_int['TD'], group5_td_int['INT'], label='2011-2020', color='orange')
plt.xlabel('TD')
plt.ylabel('Int')
plt.title('Average TD vs Int By Year (1971-2020) min. 100 Att.')
plt.legend()
plt.grid(True)
plt.show()

# Plot the relationship between Cmp% and Yds/Att, grouped by the same parameters
# Group by year and calculate the mean Cmp% and Yds/Att for each year
yearly_td_int = df_filtered.groupby('Year')[['Cmp %', 'Yds/Att']].mean().reset_index()

# Separate the data into five groups: 1971-1980, 1981-1990, 1991-2000, 2001-2010, and 2011-2020
group1_td_int = yearly_td_int[(yearly_td_int['Year'] >= 1971) & (yearly_td_int['Year'] <= 1980)]
group2_td_int = yearly_td_int[(yearly_td_int['Year'] >= 1981) & (yearly_td_int['Year'] <= 1990)]
group3_td_int = yearly_td_int[(yearly_td_int['Year'] >= 1991) & (yearly_td_int['Year'] <= 2000)]
group4_td_int = yearly_td_int[(yearly_td_int['Year'] >= 2001) & (yearly_td_int['Year'] <= 2010)]
group5_td_int = yearly_td_int[(yearly_td_int['Year'] >= 2011) & (yearly_td_int['Year'] <= 2020)]

# Plot the data
plt.figure(figsize=(12, 6))
plt.scatter(group1_td_int['Cmp %'], group1_td_int['Yds/Att'], label='1971-1980', color='green')
plt.scatter(group2_td_int['Cmp %'], group2_td_int['Yds/Att'], label='1981-1990', color='red')
plt.scatter(group3_td_int['Cmp %'], group3_td_int['Yds/Att'], label='1991-2000', color='purple')
plt.scatter(group4_td_int['Cmp %'], group4_td_int['Yds/Att'], label='2001-2010', color='blue')
plt.scatter(group5_td_int['Cmp %'], group5_td_int['Yds/Att'], label='2011-2020', color='orange')
plt.xlabel('Cmp%')
plt.ylabel('Yds/Att')
plt.title('Average Comp% vs Yards/Attempt By Year (1971-2020) min. 100 Att')
plt.legend()
plt.grid(True)
plt.show()

Conclusion

Each decade gets progressively more efficient, with the 2010s being the most efficient based on the average of every quarterback who passed for at least 100 attempts during that season. It should also be stated that over the decades, the NFL has shifted towards a higher emphasis on passing offense.
The average quarterback passing rate, created by Don Smith in 1971 for the NFL, is a standardized measure used to evaluate a quarterback's performance and efficiency. It considers several key factors: completion percentage, yards per attempt, touchdown percentage, and interception percentage. By integrating these elements, the passer rating provides a comprehensive assessment of a quarterback's effectiveness on the field, making it a crucial statistic for comparing players across games and seasons. The rating scale ranges from 0 to 158.3, with higher values indicating better performance. The upward trend throughout the decades highlights this increase.
The relationship between touchdowns thrown and interceptions thrown highlights the frequency with which the quarterback either turns the ball over or scores, both of which are potential game-changing plays. Displaying the relationship helps to identify risk-adverse quarterbacks who may not throw many interceptions but don't take risks, as well as high-risk players who may throw more touchdowns at the risk of higher interceptions.
The relationship between completion percentage and average yards per attempt also helps to identify the relationship between check-down quarterbacks having a higher completion percentage due to a lower yards per attempt, as well as the inverse for gunslingers.

# Which quarterback has had the best statistical season in each year from 2001 to 2023?
import pandas as pd
import matplotlib.pyplot as plt

# Filter data (assuming 'df' has your data)
filtered_df = df[df['Att'] > 250]

# Rank each column separately and combine ranks
rank_cols = ['TD', 'Cmp %', 'Yds/Att', 'INT', 'Sck', 'Rate']
ranks = filtered_df[rank_cols].rank(axis=0, ascending=[False, False, False, True, True, False])

# Sum of ranks to determine the best QB
filtered_df['Rank'] = ranks.sum(axis=1)

# Find best and runner-up QB per year using groupby and nlargest
best_qb_each_year = filtered_df.loc[filtered_df.groupby('Year')['Rank'].idxmax()]
runner_up_each_year = filtered_df.groupby('Year').apply(lambda x: x.nlargest(2, 'Rank').iloc[1]).reset_index(drop=True)

# Merge best and runner-up dataframes
result = best_qb_each_year[['Year', 'Player']].merge(
    runner_up_each_year[['Year', 'Player']], on='Year', suffixes=('', ' Runner-Up')
)

# Rename columns for clarity
result.columns = ['Year', 'Best Player', 'Runner-Up']

# Display result
result

Conclusion

A collective rank of touchdowns thrown, completion percentage, average yards per attempt, interceptions thrown, sacks taken, and quarterback rating was used to objectively quantify the best statistical quarterbacks who attempted at least 250 passes in each individual season.

# Add a 'Decade' column to the dataframe
filtered_df['Decade'] = (filtered_df['Year'] // 10) * 10

# Define colors for each decade
decade_colors = {
    1970: 'green',
    1980: 'red',
    1990: 'purple',
    2000: 'blue',
    2010: 'orange'
}

# Plot the relationship between TD and Yds/Att, separated by decade
for decade, group in filtered_df.groupby('Decade'):
    plt.figure(figsize=(12, 8))
    color = decade_colors.get(decade, 'black')  # Default to black if decade not in dictionary
    plt.scatter(group['TD'], group['Yds/Att'], label=f'{decade}s', color=color)
    plt.xlabel('Touchdowns (TD)')
    plt.ylabel('Yards per Attempt (Yds/Att)')
    plt.title(f'Relationship between Touchdowns and Yards per Attempt in {decade}s')
    plt.legend(title='Decade')
    plt.grid(True)
    plt.show()

# Plot the relationship between Rate and Yds/Att, separated by decade
for decade, group in filtered_df.groupby('Decade'):
    plt.figure(figsize=(12, 8))
    color = decade_colors.get(decade, 'black')  # Default to black if decade not in dictionary
    plt.scatter(group['Rate'], group['Yds/Att'], label=f'{decade}s', color=color)
    plt.xlabel('Quarterback Rating')
    plt.ylabel('Yards per Attempt (Yds/Att)')
    plt.title(f'Relationship between QBR and Yards per Attempt in {decade}s')
    plt.legend(title='Decade')
    plt.grid(True)
    plt.show()

Conclusion

The consistent upwards trend in the relationship between touchdowns thrown and average yards per attempt throughout the decades show that gunslingers are more likely to throw touchdowns than check-down specialists.
The consistent upwards trend in the relationship between quarterback rating and average yards per attempt throughout the decades show that gunslingers are more likely to be more efficient and have a better overall statistical year than check-down specialists.

# Filter data for quarterbacks with at least 150 pass attempts
filtered_df_150 = df[df['Att'] > 150]

# Group by year and calculate the mean for the specified columns, rounding to the nearest 10th
average_qb_stats = filtered_df_150.groupby('Year').agg({
    'Pass Yds': lambda x: round(x.mean(), 1),
    'Yds/Att': lambda x: round(x.mean(), 1),
    'Cmp %': lambda x: round(x.mean(), 1),
    'TD': lambda x: round(x.mean(), 1),
    'INT': lambda x: round(x.mean(), 1),
    'Rate': lambda x: round(x.mean(), 1)
}).reset_index()

# Display the results in a table

fig, ax = plt.subplots(figsize=(12, 8))
ax.axis('tight')
ax.axis('off')
table = ax.table(cellText=average_qb_stats.values, colLabels=average_qb_stats.columns, cellLoc='center', loc='center')
table.auto_set_font_size(False)
table.set_fontsize(12)
table.scale(1.2, 1.2)

plt.show()

from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Prepare the data for the predictive model
X = average_qb_stats[['Year']]
y = average_qb_stats.drop(columns=['Year'])

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Predict the average quarterback stats for the years 2023 to 2030
years = np.arange(2023, 2031).reshape(-1, 1)
predicted_stats = model.predict(years)

# Round the predicted stats to the nearest 10th
predicted_stats = np.round(predicted_stats, 1)

# Create a DataFrame to display the predicted stats
predicted_stats_df = pd.DataFrame(predicted_stats, columns=y.columns)
predicted_stats_df.insert(0, 'Year', years.flatten())

# Display the predicted stats in a table
fig, ax = plt.subplots(figsize=(12, 8))
ax.axis('tight')
ax.axis('off')
table = ax.table(cellText=predicted_stats_df.values, colLabels=predicted_stats_df.columns, cellLoc='center', loc='center')
table.auto_set_font_size(False)
table.set_fontsize(12)
table.scale(1.2, 1.2)

plt.show()

Conclusion

Based on the averages of players who passed for at least 150 attempts in a single year, data fom 1970 to 2022 was used to predict the average stats of an NFL quarterback in the years 2023 to 2030.

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assuming average_qb_stats is already defined
# Split the data into training and testing sets
X = average_qb_stats[['Year']]
y = average_qb_stats.drop(columns=['Year'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

mse, r2

# Create scatter plots for actual vs. predicted values
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 15))
axes = axes.flatten()

for i, column in enumerate(y.columns):
    axes[i].scatter(y_test[column], y_pred[:, i], alpha=0.5)
    axes[i].plot([y_test[column].min(), y_test[column].max()], [y_test[column].min(), y_test[column].max()], 'r--')
    axes[i].set_xlabel('Actual')
    axes[i].set_ylabel('Predicted')
    axes[i].set_title(f'Actual vs. Predicted for {column}')

plt.tight_layout()
plt.show()

Conclusion

This model serves as a foundational tool to understand trends in quarterback performance over time and predict future performance. It highlights the potential and limitations of using historical data for predictive analytics, emphasizing the need for continuous model improvement and exploration of more advanced techniques.

End to End Sports Data Analysis Project 1 - NFL Passing Through The Millenium

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Conclusion

Conclusion

Conclusion

Conclusion

Conclusion

Conclusion