Project: Visualizing the History of Nobel Prize Winners in Python

The Nobel Prize has been among the most prestigious international awards since 1901. Each year, awards are bestowed in chemistry, literature, physics, physiology or medicine, economics, and peace. In addition to the honor, prestige, and substantial prize money, the recipient also gets a gold medal with an image of Alfred Nobel (1833 - 1896), who established the prize.

The Nobel Foundation has made available a dataset of all prize winners from the first awards in 1901 through 2023. The dataset used in this project comes from the Nobel Prize API and is available in the nobel.csv file in the data folder.

In this project, you'll get a chance to explore and answer several questions about this prizewinning data. We then encourage you to explore further questions that interest you!

# Loading in required libraries
import pandas as pd
import seaborn as sns
import numpy as np

# Start coding here!
import matplotlib.pyplot as plt

1. Load the dataset and find the most common gender and birth country

Load the dataset into a DataFrame using pandas, then extract the most frequent values from the sex and birth_country columns. Store your answers as string variables top_gender and top_country.

# Step 1: Load the dataset
file_path = './data/nobel.csv'
nobel_data = pd.read_csv(file_path)

# Step 1a: Find the most common gender and birth country
top_gender = nobel_data['sex'].value_counts().index[0]
top_country = nobel_data['birth_country'].value_counts().index[0]

print(f"Most common gender: {top_gender}")
print(f"Most common birth country: {top_country}")
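The "most common value" lookup can be checked on a small table where the answer is obvious. This is a minimal sketch: the `toy` DataFrame and its rows are hypothetical stand-ins, not records from nobel.csv. It shows that `value_counts()` ignores missing values by default and that `mode()` is an equivalent way to get the most frequent value.

```python
import pandas as pd

# Hypothetical toy data standing in for nobel.csv (not real laureate records)
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", None, "Male"],
    "birth_country": ["France", "Poland", "France", "Germany", None],
})

# value_counts() drops missing values by default and sorts counts in
# descending order, so the first index label is the most frequent value
toy_top_gender = toy["sex"].value_counts().index[0]

# mode() returns every value tied for most frequent; [0] takes the first one
toy_top_country = toy["birth_country"].mode()[0]

print(toy_top_gender, toy_top_country)  # Male France
```

If two values tie for first place, both approaches return a single value, so it can be worth inspecting the full `value_counts()` output before trusting the result.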

2. Identify the decade with the highest ratio of US-born winners

To calculate the ratio, first create a Boolean column that flags winners whose birth country is "United States of America", then create a decade column, and use both to compute the proportion of US-born winners in each decade. Store the decade with the highest proportion as an integer called max_decade_usa.

# Step 2: Identify the decade with the highest ratio of US-born winners
# Step 2a: Create a flag column for US-born winners
nobel_data['us_born_winner'] = nobel_data['birth_country'] == 'United States of America'

# Step 2b: Create a decade column
nobel_data['decade'] = (np.floor(nobel_data['year'] / 10) * 10).astype(int)

# Step 2c: Calculate the ratio of US-born winners by decade
us_winners_ratio = (
    nobel_data.groupby('decade', as_index=False)
    .agg({'us_born_winner': 'mean'})
    .rename(columns={'us_born_winner': 'us_winner_ratio'})
)

# Step 2d: Identify the decade with the highest ratio (cast to a plain Python int)
max_decade_usa = int(
    us_winners_ratio.loc[us_winners_ratio['us_winner_ratio'].idxmax(), 'decade']
)

print(f"Decade with the highest US-born winners ratio: {max_decade_usa}")

# Step 2e: Create a line plot for US-born winners' ratio over decades
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))
sns.lineplot(data=us_winners_ratio, x='decade', y='us_winner_ratio', marker='o')
plt.title('Proportion of US-born Nobel Laureates by Decade')
plt.xlabel('Decade')
plt.ylabel('Proportion of US-born Winners')
plt.show()
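The ratio calculation relies on the fact that the mean of a Boolean column equals the proportion of True values. The sketch below works through this on a hypothetical six-row table (the years and countries are made up for illustration) and uses integer floor division as an alternative way to derive the decade.

```python
import pandas as pd

# Hypothetical toy data (not real laureate records)
toy = pd.DataFrame({
    "year": [1901, 1905, 1912, 1918, 1919, 1923],
    "birth_country": [
        "Germany", "United States of America",
        "United States of America", "France", "United States of America",
        "Sweden",
    ],
})

# Boolean flag: True for US-born rows
toy["us_born"] = toy["birth_country"] == "United States of America"

# Integer floor division maps 1912 -> 1910, 1923 -> 1920, and so on
toy["decade"] = toy["year"] // 10 * 10

# The mean of a Boolean column is the share of True values in each group:
# 1900 -> 1/2, 1910 -> 2/3, 1920 -> 0/1
ratio_by_decade = toy.groupby("decade")["us_born"].mean()

# idxmax() returns the index label (here, the decade) of the largest value
best_decade = int(ratio_by_decade.idxmax())
print(best_decade)  # 1910
```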

3. Find the decade and category with the highest proportion of female laureates

You can copy and modify your code from the previous task to create a DataFrame with the proportion of female winners for each decade and category pair. Store the pair with the highest proportion in a dictionary called max_female_dict, where the decade is the key and the category is the value. There should only be one key:value pair.

# Step 3: Find the decade and category with the highest proportion of female laureates
# Step 3a: Create a flag column for female winners
nobel_data['female_winner'] = nobel_data['sex'] == 'Female'

# Step 3b: Group by decade and category, calculate the proportion of female laureates
female_winner_ratio = (
    nobel_data.groupby(['decade', 'category'], as_index=False)
    .agg({'female_winner': 'mean'})
    .rename(columns={'female_winner': 'female_ratio'})
)

# Step 3c: Identify the decade and category with the highest proportion
max_female_row = female_winner_ratio[female_winner_ratio['female_ratio'] == female_winner_ratio['female_ratio'].max()]
max_female_decade = int(max_female_row['decade'].values[0])  # plain int so the dict key prints cleanly
max_female_category = max_female_row['category'].values[0]

# Step 3d: Create a dictionary to store result
max_female_dict = {max_female_decade: max_female_category}

print(f"Decade and category with the highest proportion of female laureates: {max_female_dict}")

# Step 3e: Create a line plot for female winners' proportions by decade and category
plt.figure(figsize=(14, 8))
sns.lineplot(data=female_winner_ratio, x='decade', y='female_ratio', hue='category', marker='o')
plt.title('Proportion of Female Nobel Laureates by Decade and Category')
plt.xlabel('Decade')
plt.ylabel('Proportion of Female Laureates')
plt.legend(title='Category')
plt.show()
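When grouping by two columns, the result can be kept as a Series with a two-level index, in which case `idxmax()` returns the (decade, category) pair directly as a tuple. The following is a minimal sketch on hypothetical data, offered as one possible alternative to the Boolean-mask filtering used above.

```python
import pandas as pd

# Hypothetical toy data (not real laureate records)
toy = pd.DataFrame({
    "decade": [1900, 1900, 1900, 2000, 2000, 2000],
    "category": ["Physics", "Physics", "Peace", "Peace", "Literature", "Literature"],
    "sex": ["Male", "Female", "Male", "Female", "Female", "Male"],
})
toy["female"] = toy["sex"] == "Female"

# Series indexed by (decade, category); values are the female share per group
share = toy.groupby(["decade", "category"])["female"].mean()

# On a two-level index, idxmax() returns a (decade, category) tuple
best_decade, best_category = share.idxmax()

# Cast the key to a plain int so the dictionary has a clean integer key
toy_max_female = {int(best_decade): best_category}
print(toy_max_female)  # {2000: 'Peace'}
```

Note that `idxmax()` returns only the first maximum, so if several pairs tie for the highest share, only one of them ends up in the dictionary.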

4. Find the first woman to win a Nobel Prize

Filter the DataFrame for the rows with Female winners, find the earliest year in this subset, and read off the corresponding name and category. Save your string answers as first_woman_name and first_woman_category.

# Step 4: Find the first woman to win a Nobel Prize
# Step 4a: Filter the dataframe for female laureates
female_winners = nobel_data[nobel_data['female_winner']]

# Step 4b: Determine the name, category, and year of the first female laureate
first_female_winner = female_winners[female_winners['year'] == female_winners['year'].min()]
first_woman_name = first_female_winner['full_name'].values[0]
first_woman_category = first_female_winner['category'].values[0]
first_woman_year = first_female_winner['year'].values[0]

print(f"First woman to win a Nobel Prize: {first_woman_name} for {first_woman_category} in {first_woman_year}.")
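The same "earliest row" lookup can be written with `idxmin()`, which returns the index label of the smallest year so the whole row can be fetched in one step. This is a sketch on hypothetical data; the names "Laureate A" through "Laureate D" are placeholders, not real winners.

```python
import pandas as pd

# Hypothetical toy data with placeholder names (not real laureate records)
toy = pd.DataFrame({
    "year": [1901, 1910, 1905, 1930],
    "category": ["Chemistry", "Peace", "Physics", "Literature"],
    "full_name": ["Laureate A", "Laureate B", "Laureate C", "Laureate D"],
    "sex": ["Male", "Female", "Female", "Female"],
})

# Keep only the female rows; the original index labels are preserved
women = toy[toy["sex"] == "Female"]

# idxmin() gives the index label of the earliest year; .loc fetches that row
first_row = women.loc[women["year"].idxmin()]

toy_first_name = first_row["full_name"]
toy_first_category = first_row["category"]
print(toy_first_name, toy_first_category)  # Laureate C Physics
```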

5. Determine repeat winners

Count the number of times each winner has won, then select those with counts of two or more. Store their full names in a list named repeat_list.

# Step 5: Identify repeat winners
# Step 5a: Count the number of times each laureate has won
winner_counts = nobel_data['full_name'].value_counts()

# Step 5b: Keep laureates with two or more wins and store their names in a list
repeat_list = winner_counts[winner_counts >= 2].index.tolist()

print(f"Individuals or organizations with multiple Nobel Prizes: {repeat_list}")
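The counting-and-filtering step can be verified on a short list of names where the repeats are known in advance. The sketch below uses placeholder names (hypothetical, not real laureates) and shows that filtering the `value_counts()` result keeps its descending-count order.

```python
import pandas as pd

# Hypothetical toy data with placeholder names (not real laureate records)
toy = pd.DataFrame({
    "full_name": [
        "Laureate A", "Organization X", "Laureate B",
        "Laureate A", "Organization X", "Organization X",
    ]
})

# Count how many rows each name appears in (sorted by count, descending)
counts = toy["full_name"].value_counts()

# Boolean-mask the counts, then pull the surviving names out of the index
toy_repeats = counts[counts >= 2].index.tolist()
print(toy_repeats)  # ['Organization X', 'Laureate A']
```

Counting by name assumes that each laureate's name is spelled identically in every row; a laureate recorded under two spellings would be counted as two different winners.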