Skip to content
Lacrosse Analysis
  • AI Chat
  • Code
  • Report
  • How Can We Win More Games?

    Analysis of 15 games from the 2022-2023 season for the Newbury Park High School varsity lacrosse team.

    What is the problem and what insights are we attempting to find?

    As the Head Coach of the Newbury Park lacrosse team, I am trying to answer the question of how the team can win more games. I will use my data analysis skills to better understand the team's strengths and weaknesses in order to focus practice and game plans. Analytics are hardly used in the sport of lacrosse and I hope to eventually use similar regession methods as used in the book, Mathletics, to help predict the outcomes of games and the season based off of current data!

    Data Gathering

    The first step in analyzing the season for the Panthers was to gather all available data on the players and the team. The game stats were tracked in person by a volunteer and then uploaded into a Google spreadsheet. The stats were not originally tracked with the intent of running data analysis and thus the format of the data and some of its contents needed to be altered.

    After receiving the spreadsheet, I began the data cleaning process.

    Data Cleaning

    The workbook was comprised of tabs that provided the stats for each game. As a game (tab) was added, the stats would roll into the SEASON tab for totals on the season.

    While this served its purpose for a high-level overview on the season, there were issues in the format and layout that would cause problems while performing data analysis:

    • Data was not presented in a table format
      • All non-essential columns were removed and three tabs (tables) were created (see picture below).
    • There were empty cells for some of the rows.
      • I added a 0 to all empty cells.
    • Percentages were formatted as strings
      • I reformatted the data as a decimal to simplify the coding adjustments that would be needed.
    • Column names were inconsistent and included spaces
      • I replaced the spaces with an underscore and I ensured there was more consistency in the naming convention of the columns.

    Below is a screenshot of a revised tab in the workbook:

    Data Loading

    After gathering and cleaning the data, I upload each tab of the NPHS BOYS LAX workbook as a CSV file. I chose to read in the CSV files and create DataFrames using Python so that it was easier to reference in my SQL code.

    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    #Read in CSV files and create DF
    game_stats = pd.read_csv('NPHS BOYS LAX 2023 SEASON STATS - SEASON_Game.csv')
    player_stats = pd.read_csv('NPHS BOYS LAX 2023 SEASON STATS - SEASON_Player.csv')
    shooting_stats = pd.read_csv('NPHS BOYS LAX 2023 SEASON STATS - SEASON_Shots.csv')
    
    Hidden output

    Data Analysis

    In the analysis portion of this project I wanted to propose and answer guestions that would help give me insights into the strengths and weaknesses of our team.

    Who were the Top 5 point scorers for the team?

    To better understand the offensive personel who contributed most to the success of the team I queried the top five point scorers for the team. As expected, our three starting Attackmen (6, 9, 38) are in the top 5 point scorers as they never leave the field.

    Unknown integration
    DataFrameavailable as
    total_pts
    variable
    --Who were the Top 5 point scorers for the team?
    SELECT 
    	Player,
    	position,
    	SUM(Points) AS Total_Points
    FROM player_stats
    GROUP BY Player, position
    HAVING Total_Points IS NOT NULL
    ORDER BY Total_Points DESC
    LIMIT 5;
    
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

    What was the team's average number of unforced turnovers?

    Unforced turnovers (UTO) can be described as an turnover which is not caused by a defender (i.e. bad pass, bad catch, self imposed infraction). Lacrosse is a fast paced game that favors the team with the most possessions on offense. An UTO is similar to throwing an interception to the defense with no one pressuring the Quarterback.

    Our team averaged an unsatisfactory 11.79 UTOs, per game, for the season. Later in my analysis I dive into how each individual player faired when it came to UTOs.

    For next year's analysis I would like to track how many UTO's led to goals or how many goals a team averages per possession.

    Unknown integration
    DataFrameavailable as
    uto
    variable
    --What was the team's average number of unforced turnovers (UTO)?
    SELECT 
    	ROUND(AVG(Total_UTO),2) AS Avg_TO_Game
    FROM 
    	(SELECT 
    		Game_ID, 
    		SUM(UTO) AS Total_UTO
    	 FROM player_stats
    	 GROUP BY Game_ID
    	 ORDER BY Total_UTO DESC);
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

    What was the team's shots on goal percentage for each game?

    After aggregating our shots on goal (SOG) percentage and totals shots for each player, for each game, I wanted to analyze the relationship between SOG percentage, total shots, and the result of the game.

    As we can see from the query below, all games that resulted in a win had a total number of shots greater than 16. The two games that meet this requirement, but resulted in losses, had a SOG percentage that was less than 60%.

    This information is only one part of the story as the result of the game is predicated on the stats of both teams as we will see in later queries.

    Unknown integration
    DataFrameavailable as
    sog
    variable
    --What was the team's shots on goal percentage for each game?
    SELECT 
    	p.Game_ID, 
    	ROUND(AVG(p.sog_pct),2) AS SOG_pct, 
    	SUM(p.shots) AS tot_shots,
    	s.result
    FROM player_stats p
    JOIN shooting_stats s
    ON p.game_id = s.game_id
    WHERE p.sog_pct > 0
    GROUP BY p.Game_ID, s.result
    ORDER BY s.result DESC;
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

    How do our stats compare in our wins versus losses?

    In the dataset that is provided, it appears 10 goals is a great target for a team in order to obtain a win. It is interesting that there is little difference in the average number of successful clears from the Wins versus Losses, but the average number of failed clears is nearly double in Losses versus Wins.

    Unknown integration
    DataFrameavailable as
    wvl
    variable
    --How do our stats compare in our wins versus losses?
    WITH outcome AS(
    SELECT 
    	game_id,
    	result
    FROM shooting_stats),
    
    stats AS(
    SELECT
    	game_id,
    	ROUND(AVG(CAST(REPLACE(opp_sog_pct,'%','') AS DECIMAL)),0)  AS avg_opp_sog_pct,
    	SUM(shots_allowed) AS opp_shots,
    	SUM(saves) AS saves,
    	SUM(goals_against) AS goals_against,
    	SUM(goals_for) AS goals_for,
    	SUM(success_clears) AS success_clears,
    	SUM(fail_clears) AS failed_clears
    FROM game_stats
    GROUP BY game_id)
    
    SELECT
    	w.result,
    	ROUND(AVG(s.avg_opp_sog_pct),2) AS avg_opp_sog_pct,
    	ROUND(AVG(s.opp_shots),2) AS avg_opp_shots,
    	ROUND(AVG(s.saves),2) AS avg_saves,
    	ROUND(AVG(s.goals_against),2) AS avg_goals_against,
    	ROUND(AVG(s.goals_for),2) AS avg_goals_for,
    	ROUND(AVG(s.success_clears),2) AS avg_success_clears,
    	ROUND(AVG(s.failed_clears),2) AS avg_failed_clears
    FROM outcome w
    JOIN stats s
    ON w.game_id = s.game_id
    GROUP BY w.result;
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

    What insights can be revealed through an analysis of our shot locations in a win versus loss?

    This will be a data point that needs further investigation or clarity for next season. While it is great to understand where our shots are coming from during the game, when analysis is performed over the course of the season we need further clarity on if the shot resulted in a goal or not. In general, this data looks conflicting since the closer a player is to the goal (i.e. 0-5 yards) the more likely he is to score, and thus, the losses should have less shots in those locations.

    Unknown integration
    DataFrameavailable as
    np_shot
    variable
    --What insights can be revealed through an analysis of our shot locations in a win versus loss?
    SELECT
    	CASE WHEN result = 'Win' THEN 'NP shots in Win'
    	ELSE 'NP shots in Loss' END shots,
    	SUM(np_loc1) AS "0-5 yards left",
    	SUM(np_loc2) AS "0-5 yards middle",
    	SUM(np_loc3) AS "0-5 yards right",
    	SUM(np_loc4) AS "6-12 yards left",
    	SUM(np_loc5) AS "6-12 yards middle",
    	SUM(np_loc6) AS "6-12 yards right",
    	SUM(np_loc7) AS "13+ yards left",
    	SUM(np_loc8) AS "13+ yards middle",
    	SUM(np_loc9) AS "13+ yards right"
    FROM shooting_stats
    GROUP BY result;
    
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.