How Likely is an Athlete to Win a Medal in Judo?
Welcome to your workspace! Here, you can write code and text, run analyses, and share your insights. In this walkthrough, you will learn the basics of Workspace as you analyze an Olympics dataset and prepare a shareable Workspace report!
To get you started, we have prepared the following scenario for you:
You are working as a data analyst for an international judo club. The club owner is looking for new ways to leverage data. One idea they have had is to use past competition data to estimate the threat of future opponents. They have provided you with a dataset of past Olympic data. They want to know whether you can use information such as the height, weight, age, and national origin of a judo competitor to estimate their probability of earning a medal.
You will need to prepare a report accessible to a broad audience. It should outline your steps, findings, and conclusions.
🏆 Load in the Olympics Data
The primary data is available in your directory in the path data/athlete_events.csv.gz. You can add other resources by opening the sidebar and clicking the + symbol to add files.
The cell handles imports of the required packages and data. Be sure to import other packages as needed!
To run a cell, click inside it and click "Run" or the ► icon. You can also use Shift-Enter to run a selected cell and automatically switch to the next cell.
# Import libraries
import numpy as np
import pandas as pd
import plotly.express as px
# Import the data
olympics_data = pd.read_csv("data/athlete_events.csv.gz")
# Preview the DataFrame
olympics_dataThe team column is messy and sometimes contains countries separated by forward slashes or hyphens. Let's clean this by using .str.extract() to extract the first country mentioned in the cases of slashes or hyphens (e.g., "Denmark/Sweden" becomes "Denmark").
# Split the team column on hyphens and slashes
olympics_data["team_clean"] = olympics_data["team"].str.split(r"[\/\-]").str[0]
# Preview the new column
olympics_data["team_clean"].unique()[:10]🌐 Bring in additional data
Depending on your analysis, you may find using additional world data helpful. In Workspace, you can add and run SQL cells to access relational databases. The query below retrieves data from a MariaDB database containing information on nations of the world. It stores the query results as a pandas DataFrame with a name of your choosing.
You can click the "Browse tables" button to view the available tables and columns or continue with the existing query.
-- Select country data based on most recent year
WITH world_countries AS (
SELECT
countries.name AS country,
continents.name AS continent,
year,
population,
gdp,
ROW_NUMBER() OVER(PARTITION BY country ORDER BY year DESC) AS year_index
FROM countries
INNER JOIN country_stats USING(country_id)
INNER JOIN regions USING(region_id)
INNER JOIN continents USING(continent_id)
GROUP BY 1, 2, 3, 4, 5)
SELECT
country,
continent,
year,
population,
gdp
FROM world_countries
WHERE year_index = 1We now have country data that we can combine with the Olympics data. We will use the .merge() method to combine the two DataFrames using the team_clean and country columns.
A "left" join matches on rows in the olympics_data DataFrame, as some teams will not be present in the countries_data DataFrame. We then filter for rows where the sport is "Judo".
# Perform a left join between the two DataFrames using the country columns
olympics_country_data = olympics_data.merge(
countries_data, left_on="team_clean", right_on="country", how="left"
)
# Filter for Judo
judo_data = olympics_country_data[olympics_country_data["sport"] == "Judo"]
# Preview the data
judo_data🎨 Visualize the data
An essential part of exploratory analysis is visualizing the data. We will begin with a boxplot to identify differences in the medal counts between continents.
First, we will need to create a column named medal_count. This column will represent the number of non-null values by country (i.e., the number of medals).
The visualization also includes hover data for outliers, which means that you can learn more about each outlier. Be sure to hover over each box and outlier to gain additional insights from the plot!
💪 You can add a title to Plotly Express plots by using the title argument (e.g., title="My Fantastic Plot"). Try adding a title to this plot to help readers interpret your visualization!
# Group by the team and calculate the number of medals earned
judo_data["medal_count"] = judo_data.groupby("team_clean")["medal"].transform(
lambda x: x.notnull().sum()
)
# Create a boxplot of the medal count by continent
fig = px.box(judo_data, x="continent", y="medal_count", hover_data=["team_clean"])
fig.show()✏️ Markdown cells such as this one contain text, and can be edited to add your own notes, observations, and conclusions.
To edit the text, simply click inside the cell and click "Edit", or double-click the cell. You can then add text using Markdown and finalize it by clicking the "View" button, or running the cell using Shift + Enter.
💪 Try replacing the text in this cell to make observations on the plot above, introduce your analysis plan, and whatever else you want to note at this point in your analysis. You can also continue to use the "Add Markdown" button throughout this workspace to add more cells to share the insights from your analysis!
🔬 Go forth and analyze!
Well done! You have successfully used Python and SQL to load data from multiple sources, process it, and begin to perform an analysis. It's now up to you to further explore the data, create new features, and estimate the probability that a judo athlete will earn a medal.
Not sure where to go from here? Here are a few potential next steps:
- Some countries lack country data because they have different names in the
olympics_dataand thecountries_data(e.g., "Great Britain" and "United Kingdom"). Can you resolve these discrepancies? - How do athlete statistics, such as height and weight, relate to judo performance?
- What type of model(s) will you use to predict whether an athlete wins a medal? Consider checking out Machine Learning with scikit-learn if you want a refresher on machine learning in Python!
When you're finished, be sure to share your analysis. Click the Share button to build a sleek publication that includes the rendered text, code cells, and code output. When you publish your work, it appears automatically on your DataCamp profile so you can build out your portfolio. If you do not want to share your work with the world, you can adjust the sharing settings.