Project: Visualizing the History of Nobel Prize Winners

The Nobel Prize has been among the most prestigious international awards since 1901. Each year, awards are bestowed in chemistry, literature, physics, physiology or medicine, economics, and peace. In addition to the honor, prestige, and substantial prize money, the recipient also gets a gold medal with an image of Alfred Nobel (1833 - 1896), who established the prize.

The Nobel Foundation has made available a dataset of all prize winners from the first awards in 1901 through 2023. The dataset used in this project comes from the Nobel Prize API and is stored in the nobel.csv file in the data folder.

In this project, you'll explore and answer several questions about this prizewinning data. We also encourage you to go on and explore any further questions that interest you!

# Import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
file_path = 'data/nobel.csv'
df = pd.read_csv(file_path)

# 1. Most commonly awarded gender and birth country

# Calculate most common gender and birth country
top_gender = df['sex'].value_counts().index[0]
top_country = df['birth_country'].value_counts().index[0]

# 2. Decade with the highest ratio of US-born Nobel Prize winners to total winners
# Add a decade column
df['decade'] = (df['year'] // 10 * 10).astype(int)
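
# A minimal, self-contained sketch of how the decade flooring behaves on a few
# sample years. `to_decade` and `sample_decades` are hypothetical names used
# only for illustration; they are not part of the project's required output.
# Integer floor division is used here, which gives the same decades as
# flooring year / 10 and multiplying by 10 for these values.
import pandas as pd


def to_decade(years: pd.Series) -> pd.Series:
    """Floor each year to the first year of its decade."""
    return (years // 10) * 10


# Years at both edges of a decade land in the same bucket (1901 and 1909),
# while 1910 starts the next one.
sample_decades = to_decade(pd.Series([1901, 1909, 1910, 2023]))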

# Add a column for US-born winners
df['us_winner'] = df['birth_country'] == 'United States of America'

# Calculate the proportion of US winners by decade
prop_usa_winners = df.groupby('decade', as_index=False)['us_winner'].mean()

# Find the decade with the highest proportion of US-born winners
max_decade_usa = prop_usa_winners.loc[prop_usa_winners['us_winner'].idxmax(), 'decade']

# 3. Decade and category combination with the highest proportion of female laureates
# Normalize the 'sex' column first: lowercase the labels and mark missing values as 'unknown'
df['sex'] = df['sex'].str.lower().fillna('unknown')

# Add a column flagging female laureates
df['female_winner'] = df['sex'] == 'female'

# Group by decade and category, and calculate the proportion of female winners
prop_female_winners = df.groupby(['decade', 'category'], as_index=False)['female_winner'].mean()

# Find the row with the highest proportion of female winners
max_female_row = prop_female_winners.loc[prop_female_winners['female_winner'].idxmax()]

# Create a dictionary mapping the decade to the category with the highest female ratio
# (cast the decade to a plain int so the key is not a NumPy scalar)
max_female_dict = {int(max_female_row['decade']): max_female_row['category']}
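
# A minimal sketch, on made-up toy data, of why the mean of a boolean flag per
# group is the proportion of True values in that group, and how idxmax then
# selects the row with the largest share. All `toy_*` names are hypothetical
# and exist only for this illustration.
import pandas as pd

toy_flags = pd.DataFrame({
    'decade': [1900, 1900, 1910, 1910, 1910],
    'category': ['Physics', 'Physics', 'Peace', 'Peace', 'Physics'],
    'female_winner': [False, True, True, True, False],
})

# One row per (decade, category) pair; True counts as 1 and False as 0.
toy_prop = toy_flags.groupby(['decade', 'category'], as_index=False)['female_winner'].mean()

# Row with the highest proportion, looked up by its index label.
toy_best = toy_prop.loc[toy_prop['female_winner'].idxmax()]
toy_best_dict = {int(toy_best['decade']): toy_best['category']}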

# 4. First woman to receive a Nobel Prize, and in what category
female_winners = df[df['female_winner']]

if not female_winners.empty:
    # Find the first female laureate
    first_female_winner = female_winners.sort_values(by='year').iloc[0]
    first_woman_name = first_female_winner['full_name']
    first_woman_category = first_female_winner['category']
else:
    first_woman_name = None
    first_woman_category = None

# 5. Individuals or organizations that have won more than one Nobel Prize
repeat_winners = df['full_name'].value_counts()
repeat_list = list(repeat_winners[repeat_winners > 1].index)
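
# A minimal sketch, on made-up placeholder names, of the repeat-winner filter:
# value_counts tallies how often each name appears, and the boolean mask keeps
# only names seen more than once. All `toy_*` names are hypothetical.
import pandas as pd

toy_names = pd.Series(['Laureate A', 'Laureate B', 'Laureate A', 'Laureate C', 'Laureate B'])
toy_counts = toy_names.value_counts()

# Sorted so the result does not depend on how ties in the counts are ordered.
toy_repeats = sorted(toy_counts[toy_counts > 1].index)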

# 6. Visualization of US-born winners by decade
sns.lineplot(x='decade', y='us_winner', data=prop_usa_winners)
plt.xlabel("Decade")
plt.ylabel("Proportion of US Winners")
plt.title("Proportion of US-born Nobel Winners by Decade")
plt.show()

# Results
print("\nMost Common Gender:", top_gender)
print("Most Common Birth Country:", top_country)
print("Decade with highest ratio of US winners:", max_decade_usa)
print("Decade and category with highest female ratio:", max_female_dict)
print("First woman to receive a Nobel Prize:", first_woman_name)
print("First woman category:", first_woman_category)
print("Individuals/Organizations with multiple Nobel Prizes:", repeat_list)