
The Nobel Prize has been among the most prestigious international awards since 1901. Each year, awards are bestowed in chemistry, literature, physics, physiology or medicine, economics, and peace. In addition to the honor, prestige, and substantial prize money, the recipient also gets a gold medal with an image of Alfred Nobel (1833 - 1896), who established the prize.

The Nobel Foundation has made available a dataset of all prize winners from the first awards in 1901 through 2023. The dataset used in this project comes from the Nobel Prize API and is stored in the nobel.csv file in the data folder.

In this project, you'll get a chance to explore and answer several questions related to this prizewinning data. We then encourage you to explore further questions that interest you!

# Loading in required libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Importing the data
nobel = pd.read_csv("data/nobel.csv")

# Printing the first 10 rows to check the structure of the dataset
nobel.head(10)

# What is the most commonly awarded gender?
    # Note. Gender is coded as 'sex'

# Creating a figure for practice
sns.set_style("white")
plot1 = sns.countplot(x="sex",
                      data=nobel,
                      hue="sex")
plot1.set_title("Distribution of Nobel Prize Winners by Gender")
plot1.set(xlabel="Gender",
          ylabel="Count")
plot1.legend(title="Sex")  # Changing the legend title from sex to Sex

# Adding counts to the top of each bar
# Note. The labels are added before the figure is shown so they appear in the rendered plot
for container in plot1.containers:
    plot1.bar_label(container, fmt='N=%d', label_type='edge', padding=0.1)

plt.show()

# What is the most commonly awarded gender?
    # Note. Gender is coded as 'sex'
gender_count= nobel['sex'].value_counts()
top_gender = gender_count.idxmax()
print(top_gender)

# What is the most common birth country?
    # Note. Country is coded as 'birth_country'
bc_count = nobel['birth_country'].value_counts()
top_country = bc_count.idxmax()
print(top_country)
# Which decade had the highest ratio of US-born Nobel Prize winners to total winners in all categories?
# First I will work out both the number of US-born winners by decade and the total number of winners by decade.
        #Note. use the variable year
        # 1. what is the range of years
print("The range of years included in this dataset: " + str(nobel['year'].min()) + " to " + str(nobel['year'].max()))
            # Note. The range is 1901 to 2023

        # 2. Creating decades as an integer
# Checking for missing values for birth country or year
print("The number of missing by year is: " + str(nobel['year'].isna().sum()))
print("The number of missing by country is: " + str(nobel['birth_country'].isna().sum()))
#Note. there are 31 missing birth country values and no missing year values

# Making a column for decade
nobel['decade'] = ((nobel['year'] // 10) * 10)
#checking this: 
nobel['decade'].dtype
    # Note. This is an integer
nobel.head()
    #Note. This looks good so far

# Creating an identifier for us born
nobel['usa'] = np.where(
    nobel['birth_country'].isna(), # accounts for missing
    np.nan,
    np.where(nobel['birth_country'] == 'United States of America', 1, 0))

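# numeric_only=True restricts the mean to numeric columns; text columns such as names cannot be averaged
ratio_by_decade = nobel.groupby('decade').mean(numeric_only=True)
# Checking to make sure this is doing what I want
print(ratio_by_decade)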

# Calculating the maximum decade
ratio_by_decade_usa = nobel.groupby('decade')['usa'].mean()

# Decade with the highest ratio (as a plain Python int)
max_decade_usa = int(ratio_by_decade_usa.idxmax())
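
The decade calculation above can be checked on a small example. The sketch below uses a hypothetical six-row frame (`toy`, not the real nobel.csv) and builds the US-born flag with a boolean comparison plus `Series.where` rather than nested `np.where`; this is an alternative way to write the same idea, assuming rows with a missing birth country should be left out of the ratio.

```python
import pandas as pd

# Hypothetical toy data standing in for data/nobel.csv
toy = pd.DataFrame({
    "year": [1901, 1905, 1912, 1918, 1919, 1923],
    "birth_country": [
        "Germany", "United States of America", "United States of America",
        None, "United States of America", "France",
    ],
})

# Floor-divide by 10 and multiply back to map each year to its decade
toy["decade"] = (toy["year"] // 10) * 10

# 1.0 for US-born, 0.0 otherwise; rows with a missing birth country become NaN
is_usa = (toy["birth_country"] == "United States of America").astype(float)
toy["usa"] = is_usa.where(toy["birth_country"].notna())

# mean() skips NaN, so each ratio uses only winners with a known birth country
toy_ratio = toy.groupby("decade")["usa"].mean()
toy_max_decade = int(toy_ratio.idxmax())

print(toy_ratio)
print(toy_max_decade)
```

Because `mean()` ignores NaN, the denominator in each decade is the number of winners with a known birth country, not the total number of winners. If missing birth countries should count as non-US, the flag would need to be 0 instead of NaN for those rows.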
# Which decade and Nobel Prize category combination had the highest proportion of female laureates?
    # Store this as a dictionary called max_female_dict where the decade is the key and the category is the value. 
    # Note. There should only be one key:value pair.

# Checking if any 'sex' categories are missing
nobel['sex'].isna().sum()
    # Note. There are 30 that are missing by sex. 

# Creating an identifier for female
nobel['female'] = np.where(
    nobel['sex'].isna(), # accounts for missing
    np.nan,
    np.where(nobel['sex'] == 'Female', 1, 0))

# Checking this to ensure that females < males (logic check) and that the nas were coded as NA
nobel['female'].value_counts(dropna=False)
    # Note. This looks correct

ratio_by_sex = nobel.groupby(['decade', 'category'])['female'].mean()
# Checking to make sure this is doing what I want
print(ratio_by_sex)

# Calculating the maximum decade and category
# Sorting the ratio by sex in decreasing order
# The first row now holds the highest female ratio with the latest decade in case of ties
max_row = ratio_by_sex.reset_index(name='female_ratio') \
    .sort_values(['female_ratio','decade','category'], ascending=[False, False, True]) \
    .iloc[0]
max_female_dict = {int(max_row['decade']): max_row['category']}

# Printing the resulting dictionary
print(max_female_dict)   
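
As a cross-check on the sort-based approach, the same lookup can be sketched with `idxmax` on the grouped Series, which returns the (decade, category) index pair of the largest value. The frame below is hypothetical toy data, and note one difference in behavior: `idxmax` returns the first occurrence when there are ties, whereas the sort above prefers the latest decade.

```python
import pandas as pd

# Hypothetical toy data; decade and category are given directly for brevity
toy = pd.DataFrame({
    "decade": [1900, 1900, 1900, 2000, 2000, 2000],
    "category": ["Physics", "Physics", "Peace", "Peace", "Peace", "Peace"],
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
})

# 1.0 for female laureates, 0.0 otherwise (no missing values in this toy frame)
toy["female"] = (toy["sex"] == "Female").astype(float)

# Proportion of female laureates for every decade/category combination
toy_ratio_by_sex = toy.groupby(["decade", "category"])["female"].mean()

# On a MultiIndex Series, idxmax returns a (decade, category) tuple
top_decade, top_category = toy_ratio_by_sex.idxmax()
toy_max_female_dict = {int(top_decade): top_category}

print(toy_max_female_dict)
```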
# Who was the first woman to receive a Nobel Prize, and in what category?
    #Note. Save your string answers as first_woman_name and first_woman_category.
first_women_data = nobel[nobel['sex'] == "Female"].sort_values('year').iloc[0]
# print(first_women_data)
first_woman_name = str(first_women_data["full_name"])
first_woman_category = str(first_women_data["category"])
# Which individuals or organizations have won more than one Nobel Prize throughout the years?
# Store the full names in a list named repeat_list.
# Counting each name once and reusing the counts
name_counts = nobel['full_name'].value_counts()
org_counts = nobel['organization_name'].value_counts()
repeat_name = name_counts[name_counts > 1]
repeat_org = org_counts[org_counts > 1]
repeat_list = repeat_name.index.tolist()
print(repeat_list)
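
The repeat-winner filter can be illustrated on a small example. The sketch below uses a hypothetical frame with placeholder names (not real laureates) and assumes, as the code above does, that each prize is one row keyed by `full_name`.

```python
import pandas as pd

# Hypothetical placeholder names, one row per prize awarded
toy = pd.DataFrame({
    "full_name": [
        "Laureate A", "Organization X", "Laureate B",
        "Laureate A", "Organization X", "Organization X",
    ],
})

# value_counts gives the number of prizes per name
toy_counts = toy["full_name"].value_counts()

# Keep names that appear at least twice; sorted only to make the output order stable
toy_repeat_list = sorted(toy_counts[toy_counts >= 2].index.tolist())

print(toy_repeat_list)
```

This only catches repeats whose names are spelled identically across rows; a name recorded in two different forms would be counted as two separate winners.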