The Nobel Prize has been among the most prestigious international awards since 1901. Each year, awards are bestowed in chemistry, literature, physics, physiology or medicine, economics, and peace. In addition to the honor, prestige, and substantial prize money, the recipient also gets a gold medal with an image of Alfred Nobel (1833 - 1896), who established the prize.

The Nobel Foundation has made available a dataset of all prize winners, from the first awards in 1901 through 2023. The dataset used in this project comes from the Nobel Prize API and is stored as nobel.csv in the data folder.

In this project, you'll explore and answer several questions about this prizewinning data. We also encourage you to go on and explore any further questions that interest you!

# Loading in required libraries
import pandas as pd
import seaborn as sns
import numpy as np
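# 1. Load the dataset and find the most common gender and birth country
# The project description places nobel.csv in the data folder; adjust the path if your copy lives elsewhere
winners_df = pd.read_csv("data/nobel.csv")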

# Store as string variables: top_gender, top_country
top_gender = winners_df['sex'].value_counts().index[0]
top_country = winners_df['birth_country'].value_counts().index[0]
# 2. Identify the decade with the highest ratio of US-born winners

# Create a flag column for US-born winners
winners_df['US_born'] = (winners_df['birth_country'] == 'United States of America') | (winners_df['birth_country'] == 'USA')

# Create the decade column: floor year / 10, then multiply by 10 (e.g. 1987 -> 1980)
winners_df['decade'] = np.floor(winners_df['year'] / 10) * 10
winners_df['decade'] = winners_df['decade'].astype(int)

# Find the ratio of US-born winners per decade (the mean of a 0/1 flag is the share of 1s)
winners_df['US_born'] = winners_df['US_born'].astype(int)
avg_us_winners_per_decade = winners_df.groupby('decade', as_index=False)['US_born'].mean()

# Decade with the highest ratio of US-born winners
highest_us_winners_ratio_row = avg_us_winners_per_decade[avg_us_winners_per_decade['US_born'] == avg_us_winners_per_decade['US_born'].max()]
max_decade_usa = highest_us_winners_ratio_row['decade'].values[0]

# Create a relational line plot
sns.set_style("whitegrid")
sns.relplot(data = avg_us_winners_per_decade, x = 'decade', y = 'US_born', kind="line", height = 5, aspect = 2)
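
The decade binning and the "mean of a flag" trick used in step 2 can be checked on a tiny example. The sketch below uses invented toy rows (not taken from nobel.csv) and hypothetical names `toy` and `us_share`. It uses integer floor division in place of `np.floor()`, which should give the same decades for integer years.

```python
import pandas as pd

# Toy data, invented purely for illustration (not from nobel.csv)
toy = pd.DataFrame({
    "year": [1901, 1909, 1910, 1995, 1999, 2000],
    "birth_country": [
        "Germany", "United States of America", "France",
        "United States of America", "United States of America", "Japan",
    ],
})

# Floor division maps each year to the first year of its decade
toy["decade"] = (toy["year"] // 10) * 10

# The mean of a boolean flag is the share of rows where the flag is True
toy["us_born"] = toy["birth_country"] == "United States of America"
us_share = toy.groupby("decade")["us_born"].mean()

# idxmax() returns the index label (here, the decade) of the largest share
print(us_share)
print("Decade with the highest US-born share:", us_share.idxmax())
```

In this toy frame both 1990s rows are US-born, so the 1990s decade has the highest share.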
# 3. Find the decade and category with the highest proportion of female laureates

# Create a flag column for female winners
winners_df['female_winners'] = (winners_df['sex'] == 'Female').astype(int)

# Find the ratio of female winners per decade and category
avg_female_winners_per_decade = winners_df.groupby(['decade', 'category'], as_index=False)['female_winners'].mean()

# Find the decade and category pair with the highest ratio of female winners
max_female_row = avg_female_winners_per_decade.loc[avg_female_winners_per_decade['female_winners'].idxmax()]

# Create a dictionary mapping that decade to its category
max_female_dict = {max_female_row['decade']: max_female_row['category']}

# Create a relational line plot with one line per category
sns.set_style("whitegrid")
sns.relplot(data = avg_female_winners_per_decade, x = 'decade', y = 'female_winners', hue = 'category', kind="line", height = 5, aspect = 2)
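
Step 3 groups by two keys and then selects the single best row with `idxmax()`. A minimal sketch of that pattern on invented toy rows follows; `toy`, `share`, `best`, and `best_pair` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy data, invented purely for illustration (not from nobel.csv)
toy = pd.DataFrame({
    "decade": [1900, 1900, 1900, 2000, 2000, 2000],
    "category": ["Physics", "Physics", "Peace", "Peace", "Peace", "Physics"],
    "sex": ["Male", "Female", "Male", "Female", "Female", "Male"],
})

# Share of female rows within each (decade, category) group
toy["is_female"] = toy["sex"] == "Female"
share = toy.groupby(["decade", "category"], as_index=False)["is_female"].mean()

# idxmax() gives the row label of the largest share; .loc pulls that row
best = share.loc[share["is_female"].idxmax()]

# Cast the decade to a plain int so the dictionary key is not a NumPy scalar
best_pair = {int(best["decade"]): best["category"]}
print(best_pair)
```

Note that `idxmax()` returns only the first maximum, so ties between groups are resolved by row order.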
# 4. Find first woman to win a Nobel Prize

# Keep only the rows for female winners
female_winners_df = winners_df[winners_df['sex'] == 'Female']

# Find earliest year and corresponding category in female_winners_df
min_year_female_winners_row = female_winners_df[female_winners_df['year'] == female_winners_df['year'].min()]

# First woman to win a Nobel Prize and the category
first_woman_name = min_year_female_winners_row['full_name'].values[0]
first_woman_category = min_year_female_winners_row['category'].values[0]
# 5. Determine repeat winners

# Count number of times each winner has won
winner_counts = winners_df['full_name'].value_counts()

# Subset the counts to keep only those with >= 2, extract only the names, and save as a list
repeat_list = list(winner_counts[winner_counts >= 2].index)
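
The repeat-winner logic in step 5 can be sanity-checked on a handful of made-up names. The sketch below assumes nothing about nobel.csv; `names`, `counts`, and `repeats` are hypothetical and the names themselves are placeholders.

```python
import pandas as pd

# Placeholder names, invented purely for illustration
names = pd.Series([
    "Ada Example", "Ben Sample", "Ada Example",
    "Cy Placeholder", "Ben Sample", "Ada Example",
])

# value_counts() tallies how often each name appears
counts = names.value_counts()

# Keep names that appear at least twice; sort for a stable order
repeats = sorted(counts[counts >= 2].index)
print(repeats)
```

Counting by name treats two different laureates with the same name as one repeat winner, and treats one laureate whose name is spelled differently across rows as two people. If the dataset provides a unique laureate identifier, counting on that column may be more reliable.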