
The Nobel Prize has been among the most prestigious international awards since 1901. Each year, awards are bestowed in chemistry, literature, physics, physiology or medicine, economics, and peace. In addition to the honor, prestige, and substantial prize money, the recipient also gets a gold medal with an image of Alfred Nobel (1833 - 1896), who established the prize.

The Nobel Foundation has made available a dataset of all prize winners from the first awards in 1901 through 2023. The dataset used in this project comes from the Nobel Prize API and is stored in the nobel.csv file in the data folder.

In this project, you'll explore and answer several questions about this prizewinning data. We encourage you to then explore further questions that interest you!

# Importing necessary libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Defining the path to the dataset
data_path = 'data/nobel.csv'

# Loading the dataset into a pandas DataFrame
df = pd.read_csv(data_path)

# Displaying the first few rows of the DataFrame to check its structure
df.head()

# Getting the most common gender
top_gender = df['sex'].value_counts().index[0]

# Getting the most common country
top_country = df['birth_country'].value_counts().index[0]

# Displaying the results
print(top_gender)
print(top_country)
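
The lookup above relies on `value_counts()` returning counts sorted from most to least frequent, so the first index label is the most common value. A minimal sketch of that behavior on a small hypothetical DataFrame (the rows below are invented for illustration and are not taken from nobel.csv):

```python
import pandas as pd

# Hypothetical toy data, NOT taken from nobel.csv
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male"],
    "birth_country": ["France", "Poland", "France", "Germany"],
})

# value_counts() sorts by frequency in descending order,
# so the first index label is the most common value
top_gender_toy = toy["sex"].value_counts().index[0]
top_country_toy = toy["birth_country"].value_counts().index[0]

# normalize=True returns shares of the total instead of raw counts
gender_share = toy["sex"].value_counts(normalize=True)

print(top_gender_toy)
print(top_country_toy)
print(gender_share["Male"])
```

Passing `normalize=True` can be handy when the question is about how dominant the top value is, rather than only which value it is.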
# Create a new column 'us_born' that is True if 'birth_country' is 'United States of America'
df['us_born'] = df['birth_country'] == 'United States of America'

# Create a new column 'decade' by integer-dividing 'year' by 10, then multiplying by 10
# This groups the years into decades
df['decade'] = (df['year'] // 10 * 10)

# Group the DataFrame by decade and calculate the mean of the 'us_born' column for each decade
# This gives the proportion of US-born laureates for each decade
# Then, find the index of the decade with the highest proportion of US-born laureates
max_decade_usa = df.groupby('decade')['us_born'].mean().idxmax()

# Displaying the results
max_decade_usa
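
The decade calculation works because integer division by 10 drops the last digit of the year, and the mean of a boolean column equals the share of `True` values. A small sketch on hypothetical rows (invented for illustration, not real laureate data) shows both steps:

```python
import pandas as pd

# Hypothetical toy data, NOT taken from nobel.csv
toy = pd.DataFrame({
    "year": [1901, 1909, 1910, 1995, 1999, 2000],
    "birth_country": [
        "Germany", "United States of America", "France",
        "United States of America", "United States of America", "Japan",
    ],
})

# Boolean flag: True where the laureate was born in the USA
toy["us_born"] = toy["birth_country"] == "United States of America"

# 1909 // 10 -> 190, then 190 * 10 -> 1900
toy["decade"] = toy["year"] // 10 * 10

# The mean of a boolean column is the proportion of True values per group
us_share_by_decade = toy.groupby("decade")["us_born"].mean()

# idxmax() returns the index label (here, the decade) of the largest value
best_decade = us_share_by_decade.idxmax()

print(us_share_by_decade)
print(best_decade)
```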
# Creating a line plot using seaborn to display the ratio of US-born Nobel Prize winners per decade
# errorbar=None replaces ci=None, which is deprecated since seaborn 0.12
sns.relplot(x='decade', y='us_born',
            data=df,
            kind='line',
            errorbar=None)

# Displaying the plot
plt.show()

# Adding a new column 'female_winner' to the DataFrame to indicate if the laureate is female
df['female_winner'] = df['sex'] == 'Female'

# Grouping the DataFrame by 'decade' and 'category' and calculating the mean of 'female_winner'
# This gives the proportion of female winners in each category for each decade
female_proportion = df.groupby(['decade', 'category'], as_index=False)['female_winner'].mean()

# Finding the row with the maximum proportion of female winners
max_female_row = female_proportion.loc[female_proportion['female_winner'].idxmax()]

# Creating a dictionary with the decade as the key and the category with the highest proportion of female winners as the value
max_female_dict = {max_female_row['decade']: max_female_row['category']}

# Displaying the dictionary
max_female_dict
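
A proportion computed from a very small group can be misleading, since a single laureate in a decade and category yields a share of 0 or 1. The sketch below, on hypothetical rows invented for illustration, repeats the two-key grouping and also reports the group size next to the share. The column names `female_share` and `n_laureates` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from nobel.csv
toy = pd.DataFrame({
    "decade": [2000, 2000, 2000, 2010, 2010, 2010],
    "category": ["Peace", "Peace", "Physics", "Literature", "Literature", "Physics"],
    "sex": ["Female", "Male", "Male", "Female", "Female", "Male"],
})
toy["female_winner"] = toy["sex"] == "Female"

# Named aggregation: share of female laureates and group size per (decade, category)
summary = toy.groupby(["decade", "category"], as_index=False).agg(
    female_share=("female_winner", "mean"),
    n_laureates=("female_winner", "size"),
)

# Row with the highest female share
best = summary.loc[summary["female_share"].idxmax()]

# int() converts the decade to a plain Python integer for the dictionary key
result = {int(best["decade"]): best["category"]}

print(summary)
print(result)
```

Reading `n_laureates` alongside the share helps judge whether the top pair reflects many laureates or only a few.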
# Creating a line plot with seaborn to visualize the proportion of female winners by decade and category
# errorbar=None replaces ci=None, which is deprecated since seaborn 0.12
sns.relplot(x='decade', y='female_winner', data=df,
            kind='line',
            hue='category',
            errorbar=None)

# Displaying the plot
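plt.show()

# Filtering the DataFrame to include only female winners and selecting the earliest one
# Sorting by year first so the result does not depend on the row order of the CSV
first_w_female = df[df['sex'] == 'Female'].sort_values('year').iloc[0]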

# Extracting the name and category of the first Nobel Prize female winner
first_woman_name = first_w_female['full_name'] 
first_woman_category = first_w_female['category']

# Displaying the results
print(first_woman_name)
print(first_woman_category)

# Counting occurrences of each full name in the DataFrame
names_count = df['full_name'].value_counts()

# Creating a list of names that appear two or more times
repeat_list = names_count[names_count >= 2].index.tolist()

# Displaying the result
repeat_list
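
The repeat-winner lookup is a count followed by a boolean filter on the counts. A minimal sketch on hypothetical names (placeholders invented for illustration, not real laureates) shows the intermediate objects:

```python
import pandas as pd

# Hypothetical toy data, NOT taken from nobel.csv
toy = pd.DataFrame({
    "full_name": [
        "Laureate A", "Laureate B", "Laureate A",
        "Organization X", "Organization X", "Organization X",
        "Laureate C",
    ],
})

# Series indexed by name, with the number of rows per name as values
counts = toy["full_name"].value_counts()

# Boolean mask keeps only names that appear at least twice
repeat_toy = counts[counts >= 2].index.tolist()

print(counts)
print(repeat_toy)
```

Because the filter works on names alone, any entry in `full_name` that appears more than once is included, whether it refers to a person or an organization.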