Skip to content

๐Ÿง  Exploratory Data Analysis โ€” Athlete Dataset

The athlete dataset provides a rich view of historical Olympic data, containing records for thousands of athletes across multiple sports, events, countries, and years. The analysis aims to uncover trends in athlete demographics, medal achievements, and performance characteristics over time.

๐Ÿ“Š Dataset Overview Columns:

ID, name, sex, age, height, team, NOC, Games, year, season, city, sport, event, medal

Objective: To derive meaningful insights regarding:

Age and physical characteristics of medal winners

Consistency and longevity of athlete performance

Diversity in participation and performance across sports and countries

๐Ÿ” Key Insights Derived

  1. Age Distribution of Medal Winners by Sport Some sports (e.g., gymnastics, swimming) tend to be dominated by younger athletes.

Other sports (e.g., equestrian, shooting) show higher average ages for medalists.

Filtering for sports with at least 10 medal winners helped ensure statistical significance.

  1. Consistent Olympic Medalists A select group of elite athletes have medaled in three or more Olympic Games, indicating not only skill but also exceptional career longevity.

These athletes represent a variety of sports, including track & field, swimming, and wrestling.

  1. Height and Medal Type Correlation Average height varies slightly by medal type, and standard deviation provides additional context on variability.

This kind of analysis is useful for identifying physical benchmarks or ideal profiles for certain sports or disciplines.

๐Ÿ›  Technical Approach The analysis leveraged:

Aggregate functions (AVG, COUNT, STDDEV)

Conditional filtering (WHERE medal IS NOT NULL)

Grouping (GROUP BY sport, medal, Games)

Window functions (in extended analysis, e.g., medal streaks or first/last appearances)

Spinner
DataFrameas
df
variable
SELECT * FROM 'athlete_events.csv'
LIMIT 100;
Spinner
DataFrameas
df1
variable
-- ๐Ÿ“Œ Scenario 1: Age and Medal Trends Across Sports
-- Business Problem:
-- Olympic committee officials want to understand which sports are won by younger athletes and which favor more experienced participants.

-- Your Task:For each sport, calculate the average age of athletes who won medals. Rank the sports from youngest to oldest based on this average age. Also, return the count of medals won per sport to filter out low-sample sports (e.g., only include sports with 10+ medal winners).

-- Skills Involved: GROUP BY, HAVING, AVG, conditional WHERE, ORDER BY
SELECT 
  sport,
  ROUND(AVG(CAST(age AS DOUBLE)), 2) AS avg_age,
  COUNT(*) AS medal_count
FROM 'athlete_events.csv'
WHERE medal IS NOT NULL AND age IS NOT NULL AND age != 'NA'
GROUP BY sport
HAVING COUNT(*) >= 10
ORDER BY avg_age ASC;
Spinner
DataFrameas
df2
variable
-- Scenario 2: Detecting Consistent Medalists
--Business Problem:
--The IOC is interested in identifying athletes who have won medals in 3 or more different Olympic Games.

--Your Task:
--List athletes (by name and team) who have won medals in 3 or more distinct Games. Show the total medals theyโ€™ve won and how many games they medaled in. Sort by the number of distinct games first, then by total medals.

--Skills Involved: COUNT(DISTINCT), GROUP BY, HAVING, aggregate filtering, multi-column sorting

SELECT 
  name,
  team,
  COUNT(DISTINCT Games) AS distinct_games_with_medals,
  COUNT(*) AS total_medals
FROM 'athlete_events.csv'
WHERE medal IS NOT NULL
GROUP BY name, team
HAVING COUNT(DISTINCT Games) >= 3
ORDER BY distinct_games_with_medals DESC, total_medals DESC;
Spinner
DataFrameas
df3
variable
-- ๐Ÿ“Œ Scenario 3: Height Distribution by Medal Type
--Business Problem:
--A sports physiology team is researching how body height varies among medal types (Gold, Silver, Bronze) across all events.

--Your Task:Create a breakdown that shows the average height and standard deviation of athletes for each medal type. Filter out null heights and exclude non-medal entries.

--Skills Involved: AVG, STDDEV, filtering NULL, grouping by category
SELECT 
  medal,
  ROUND(AVG(CAST(height AS DOUBLE)), 2) AS avg_height,
  ROUND(STDDEV(CAST(height AS DOUBLE)), 2) AS stddev_height,
  COUNT(*) AS total_athletes
FROM 'athlete_events.csv'
WHERE height IS NOT NULL AND height != 'NA' AND medal IS NOT NULL
GROUP BY medal
ORDER BY avg_height DESC;