The Nobel Prize has been among the most prestigious international awards since 1901. Each year, awards are bestowed in chemistry, literature, physics, physiology or medicine, economics, and peace. In addition to the honor, prestige, and substantial prize money, the recipient also gets a gold medal with an image of Alfred Nobel (1833 - 1896), who established the prize.
The Nobel Foundation has made available a dataset of all prize winners from the first awards in 1901 through 2023. The dataset used in this project comes from the Nobel Prize API and is stored in the nobel.csv file in the data folder.
In this project, you'll get a chance to explore and answer several questions related to this prizewinning data. We then encourage you to explore any further questions that interest you!
# Loading in required libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Start coding here!
First, we will preview the head of the dataframe to understand what data we have available to us.
df = pd.read_csv("data/nobel.csv")
df.head(10)
Let's start by addressing the first question...
We want to understand which gender and which birth country are most commonly awarded the Nobel Prize.
We can extract this information by taking the value counts of the 'birth_country' and 'sex' columns.
This will return a pandas series object, with our target variable set as the index.
Since the counts are ordered in descending order by default, we can take the first element of the index and this will be the country/sex with the highest count.
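As a quick illustration of that behaviour, here is a minimal sketch on a made-up Series (the `toy` data is hypothetical and not taken from nobel.csv). It shows that `value_counts()` sorts the counts in descending order and drops missing values by default, so the first index label is the most frequent value.

```python
import pandas as pd

# Hypothetical toy data, not taken from nobel.csv
toy = pd.Series(["Male", "Female", "Male", "Male", None])

counts = toy.value_counts()    # missing values are excluded by default
most_common = counts.index[0]  # counts are sorted in descending order

print(counts)
print(most_common)
```

Note that if two values tie for the top count, `index[0]` simply returns whichever one pandas lists first.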
# Get the first index of value counts
top_gender = df['sex'].value_counts().index[0]
top_country = df['birth_country'].value_counts().index[0]
print(f"The most commonly awarded gender is: {top_gender}")
print(f"The most commonly awarded birth country is: {top_country}")
Next, we want to identify which decade had the highest ratio of US-born Nobel Prize winners to total winners in all categories.
To achieve this, we first need to bin the 'year' column into decades.
Then, for each decade, we will divide the count of 'United States of America' in the 'birth_country' column by the total count to get the ratio.
We can order the results by the ratio and then extract the top result.
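The binning and ratio steps described above can be sketched on a small made-up frame (the `toy` rows are hypothetical, not real laureates). Integer division by 10 followed by multiplication by 10 maps each year to the start of its decade, and the mean of a boolean column equals the fraction of `True` values.

```python
import pandas as pd

# Hypothetical toy data, not taken from nobel.csv
toy = pd.DataFrame({
    "year": [1901, 1905, 1909, 1911, 1912, 1918],
    "birth_country": [
        "Germany", "United States of America", "France",
        "United States of America", "United States of America", "Sweden",
    ],
})

toy["decade"] = (toy["year"] // 10) * 10  # e.g. 1905 -> 1900, 1912 -> 1910
toy["us_born"] = toy["birth_country"] == "United States of America"

# Mean of a boolean column is the share of True values in each group
ratio = toy.groupby("decade")["us_born"].mean()
print(ratio)
print(ratio.idxmax())
```

Using `idxmax()` on the grouped result gives the same answer as sorting by ratio and taking the top row, without the extra sort.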
# Create new columns for decade and US-born
df['decade'] = (df['year'] // 10) * 10
df['us_born'] = df['birth_country'] == "United States of America"
# Group our data by decade and calculate the US-born ratio for each decade
grouped = df.groupby('decade')['us_born']
ratio_by_decade = grouped.mean() # mean of True/False == ratio of True
# Extract the decade with the highest proportion of US-born prize winners
max_decade_usa = ratio_by_decade.idxmax()
max_ratio = ratio_by_decade.max()
print(f"Decade with highest US-born ratio: {max_decade_usa} ({max_ratio:.2%})")
Next, we want to identify which decade and Nobel Prize category combination had the highest proportion of female laureates.
We already have the decade column from our previous analysis; however, this time we are interested in grouping the data by both decade and prize category.
We can then perform a similar operation as we did with the birth country: we create a boolean female column and take its mean to get the proportion of female laureates within each category for each decade.
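To make the two-key grouping concrete, here is a minimal sketch on made-up rows (the `toy` frame is hypothetical). When a Series is grouped by two columns, its index is a MultiIndex, so `idxmax()` returns a tuple of (decade, category) rather than a single label.

```python
import pandas as pd

# Hypothetical toy data, not taken from nobel.csv
toy = pd.DataFrame({
    "decade":   [1900, 1900, 1900, 1910, 1910, 1910],
    "category": ["Physics", "Physics", "Peace", "Physics", "Peace", "Peace"],
    "sex":      ["Male", "Female", "Male", "Male", "Female", "Female"],
})
toy["is_female"] = toy["sex"] == "Female"

# Share of female laureates per (decade, category) pair
share = toy.groupby(["decade", "category"])["is_female"].mean()

best = share.idxmax()  # a (decade, category) tuple
print(share)
print(best)
```

Keep in mind that small groups can produce extreme ratios, so the top combination may rest on only a handful of laureates.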
# Create boolean column for Female sex
df['is_female'] = df['sex'] == "Female"
# Group the data by both decade and category, and extract the True/False ratio
grouped = df.groupby(['decade', 'category'])['is_female'].mean()
# Extract the combination of decade and category with the highest proportion of Female prize winners
max_combo = grouped.idxmax()
max_ratio = grouped.max()
# Create a dictionary mapping the decade to the category
max_female_dict = {
    max_combo[0]: max_combo[1]
}
print(f"Highest female ratio: {max_ratio:.2%} in decade {max_combo[0]} for category '{max_combo[1]}'")
Next, we want to identify the name of the first woman to receive a Nobel Prize, and in which category.
To do this, we can order our data by year, look for the first occurrence of Female in the sex column, and use that row's index to extract the name and category.
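As an alternative sketch of the same idea (not the approach used in this notebook), the earliest female laureate can also be found without sorting, by filtering first and then calling `idxmin()` on the year. The `toy` rows and laureate names below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from nobel.csv
toy = pd.DataFrame({
    "year": [1905, 1903, 1901, 1911],
    "sex": ["Female", "Female", "Male", "Female"],
    "category": ["Peace", "Physics", "Chemistry", "Chemistry"],
    "full_name": ["Laureate A", "Laureate B", "Laureate C", "Laureate D"],
})

women = toy[toy["sex"] == "Female"]
first = women.loc[women["year"].idxmin()]  # row with the earliest year

print(first["full_name"], first["year"], first["category"])
```

If several women share the earliest year, `idxmin()` returns the first such row in the original order.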
# Sort by year (ascending)
df_sorted = df.sort_values(by='year')
# Find the index of the first female laureate
first_female_idx = df_sorted[df_sorted['sex'] == "Female"].index[0]
# Extract the relevant row
first_female = df_sorted.loc[first_female_idx]
# Get the name (if available), year, and category
first_woman_name = first_female.get('full_name', 'Unknown') # Adjust column name as needed
first_woman_year = first_female['year']
first_woman_category = first_female['category']
print(f"The first woman to win a Nobel Prize was {first_woman_name} in {first_woman_year} for {first_woman_category}.")
Finally, we want to identify which individuals or organizations have won more than one Nobel Prize throughout the years.
To do this, we can simply look for duplicates within the full_name column where the laureate_type is either 'Individual' or 'Organization'.
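The duplicate check can be sketched on a few made-up rows (the names in `toy` are hypothetical). Counting each name and keeping those with a count above one yields the repeat winners.

```python
import pandas as pd

# Hypothetical toy data, not taken from nobel.csv
toy = pd.DataFrame({
    "full_name": ["Laureate A", "Org X", "Laureate A",
                  "Laureate B", "Org X", "Org X"],
    "laureate_type": ["Individual", "Organization", "Individual",
                      "Individual", "Organization", "Organization"],
})

counts = toy["full_name"].value_counts()
repeats = sorted(counts[counts > 1].index)  # names appearing more than once

print(repeats)
```

This relies on each laureate's name being spelled identically across rows; differently spelled names for the same laureate would be counted separately.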
# Filter for Individuals or Organizations
filtered_df = df[df['laureate_type'].isin(['Individual', 'Organization'])]
# Count occurrences of each name
name_counts = filtered_df['full_name'].value_counts()
# Get names with more than one award
repeat_list = name_counts[name_counts > 1].index.tolist()
print(repeat_list)