## How Can We Win More Games?

*Analysis of 15 games from the 2022-2023 season for the Newbury Park High School varsity lacrosse team.*

### What is the problem and what insights are we attempting to find?

As the Head Coach of the Newbury Park lacrosse team, I am trying to answer the question of how the team can win more games. I will use my data analysis skills to better understand the team's strengths and weaknesses in order to focus practice and game plans. Analytics are hardly used in the sport of lacrosse and I hope to eventually use similar regession methods as used in the book, Mathletics, to help predict the outcomes of games and the season based off of current data!

### Data Gathering

The first step in analyzing the season for the Panthers was to gather all available data on the players and the team. The game stats were tracked in person by a volunteer and then uploaded into a Google spreadsheet. The stats were not originally tracked with the intent of running data analysis and thus the format of the data and some of its contents needed to be altered.

After receiving the spreadsheet, I began the data cleaning process.

### Data Cleaning

The workbook was comprised of tabs that provided the stats for each game. As a game (tab) was added, the stats would roll into the SEASON tab for totals on the season.

While this served its purpose for a high-level overview on the season, there were issues in the format and layout that would cause problems while performing data analysis:

- Data was not presented in a table format
- All non-essential columns were removed and three tabs (tables) were created (see picture below).

- There were empty cells for some of the rows.
- I added a 0 to all empty cells.

- Percentages were formatted as strings
- I reformatted the data as a decimal to simplify the coding adjustments that would be needed.

- Column names were inconsistent and included spaces
- I replaced the spaces with an underscore and I ensured there was more consistency in the naming convention of the columns.

Below is a screenshot of a revised tab in the workbook:

### Data Loading

After gathering and cleaning the data, I upload each tab of the NPHS BOYS LAX workbook as a CSV file. I chose to read in the CSV files and create DataFrames using Python so that it was easier to reference in my SQL code.

```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
#Read in CSV files and create DF
game_stats = pd.read_csv('NPHS BOYS LAX 2023 SEASON STATS - SEASON_Game.csv')
player_stats = pd.read_csv('NPHS BOYS LAX 2023 SEASON STATS - SEASON_Player.csv')
shooting_stats = pd.read_csv('NPHS BOYS LAX 2023 SEASON STATS - SEASON_Shots.csv')
```

### Data Analysis

In the analysis portion of this project I wanted to propose and answer guestions that would help give me insights into the strengths and weaknesses of our team.

**Who were the Top 5 point scorers for the team?**

To better understand the offensive personel who contributed most to the success of the team I queried the top five point scorers for the team. As expected, our three starting Attackmen (6, 9, 38) are in the top 5 point scorers as they never leave the field.

```
--Who were the Top 5 point scorers for the team?
SELECT
Player,
position,
SUM(Points) AS Total_Points
FROM player_stats
GROUP BY Player, position
HAVING Total_Points IS NOT NULL
ORDER BY Total_Points DESC
LIMIT 5;
```

**What was the team's average number of unforced turnovers?**

Unforced turnovers (UTO) can be described as an turnover which is not caused by a defender (i.e. bad pass, bad catch, self imposed infraction). Lacrosse is a fast paced game that favors the team with the most possessions on offense. An UTO is similar to throwing an interception to the defense with no one pressuring the Quarterback.

Our team averaged an unsatisfactory 11.79 UTOs, per game, for the season. Later in my analysis I dive into how each individual player faired when it came to UTOs.

For next year's analysis I would like to track how many UTO's led to goals or how many goals a team averages per possession.

```
--What was the team's average number of unforced turnovers (UTO)?
SELECT
ROUND(AVG(Total_UTO),2) AS Avg_TO_Game
FROM
(SELECT
Game_ID,
SUM(UTO) AS Total_UTO
FROM player_stats
GROUP BY Game_ID
ORDER BY Total_UTO DESC);
```

**What was the team's shots on goal percentage for each game?**

After aggregating our shots on goal (SOG) percentage and totals shots for each player, for each game, I wanted to analyze the relationship between SOG percentage, total shots, and the result of the game.

As we can see from the query below, all games that resulted in a win had a total number of shots greater than 16. The two games that meet this requirement, but resulted in losses, had a SOG percentage that was less than 60%.

This information is only one part of the story as the result of the game is predicated on the stats of both teams as we will see in later queries.

```
--What was the team's shots on goal percentage for each game?
SELECT
p.Game_ID,
ROUND(AVG(p.sog_pct),2) AS SOG_pct,
SUM(p.shots) AS tot_shots,
s.result
FROM player_stats p
JOIN shooting_stats s
ON p.game_id = s.game_id
WHERE p.sog_pct > 0
GROUP BY p.Game_ID, s.result
ORDER BY s.result DESC;
```

**How do our stats compare in our wins versus losses?**

In the dataset that is provided, it appears 10 goals is a great target for a team in order to obtain a win. It is interesting that there is little difference in the average number of successful clears from the Wins versus Losses, but the average number of failed clears is nearly double in Losses versus Wins.

```
--How do our stats compare in our wins versus losses?
WITH outcome AS(
SELECT
game_id,
result
FROM shooting_stats),
stats AS(
SELECT
game_id,
ROUND(AVG(CAST(REPLACE(opp_sog_pct,'%','') AS DECIMAL)),0) AS avg_opp_sog_pct,
SUM(shots_allowed) AS opp_shots,
SUM(saves) AS saves,
SUM(goals_against) AS goals_against,
SUM(goals_for) AS goals_for,
SUM(success_clears) AS success_clears,
SUM(fail_clears) AS failed_clears
FROM game_stats
GROUP BY game_id)
SELECT
w.result,
ROUND(AVG(s.avg_opp_sog_pct),2) AS avg_opp_sog_pct,
ROUND(AVG(s.opp_shots),2) AS avg_opp_shots,
ROUND(AVG(s.saves),2) AS avg_saves,
ROUND(AVG(s.goals_against),2) AS avg_goals_against,
ROUND(AVG(s.goals_for),2) AS avg_goals_for,
ROUND(AVG(s.success_clears),2) AS avg_success_clears,
ROUND(AVG(s.failed_clears),2) AS avg_failed_clears
FROM outcome w
JOIN stats s
ON w.game_id = s.game_id
GROUP BY w.result;
```

**What insights can be revealed through an analysis of our shot locations in a win versus loss?**

This will be a data point that needs further investigation or clarity for next season. While it is great to understand where our shots are coming from during the game, when analysis is performed over the course of the season we need further clarity on if the shot resulted in a goal or not. In general, this data looks conflicting since the closer a player is to the goal (i.e. 0-5 yards) the more likely he is to score, and thus, the losses should have less shots in those locations.

```
--What insights can be revealed through an analysis of our shot locations in a win versus loss?
SELECT
CASE WHEN result = 'Win' THEN 'NP shots in Win'
ELSE 'NP shots in Loss' END shots,
SUM(np_loc1) AS "0-5 yards left",
SUM(np_loc2) AS "0-5 yards middle",
SUM(np_loc3) AS "0-5 yards right",
SUM(np_loc4) AS "6-12 yards left",
SUM(np_loc5) AS "6-12 yards middle",
SUM(np_loc6) AS "6-12 yards right",
SUM(np_loc7) AS "13+ yards left",
SUM(np_loc8) AS "13+ yards middle",
SUM(np_loc9) AS "13+ yards right"
FROM shooting_stats
GROUP BY result;
```