The Nobel Prize has been among the most prestigious international awards since 1901. Each year, awards are bestowed in chemistry, literature, physics, physiology or medicine, economics, and peace. In addition to the honor, prestige, and substantial prize money, the recipient also gets a gold medal with an image of Alfred Nobel (1833 - 1896), who established the prize.
The Nobel Foundation has made available a dataset of all prize winners, from the first awards in 1901 through 2023. The dataset used in this project comes from the Nobel Prize API and is stored in the nobel.csv file in the data folder.
In this project, you'll get a chance to explore and answer several questions about this prizewinning data. We then encourage you to explore any further questions that interest you!
# Loading in required libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Start coding here!
df = pd.read_csv("data/nobel.csv")
df.head()
What is the most commonly awarded gender and birth country?
top_gender = df["sex"].value_counts().index[0]
top_country = df["birth_country"].value_counts().index[0]
print(f"top gender is {top_gender} and top country is {top_country}")
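`value_counts()` sorts labels by frequency in descending order and drops missing values by default, so `.index[0]` is the most frequent non-null label. The sketch below illustrates this on a made-up frame (the rows are hypothetical, not taken from nobel.csv) and shows `idxmax()` as an equivalent spelling when there is no tie.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row has no recorded sex or birth country
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", np.nan, "Male"],
    "birth_country": ["France", "Poland", "France", np.nan, "Sweden"],
})

counts = toy["sex"].value_counts()   # NaN rows are excluded by default
top_gender_toy = counts.index[0]     # first label is the most frequent one
same_answer = counts.idxmax()        # same result when there is no tie

print(top_gender_toy, same_answer, counts.sum())
```

Because missing values are skipped, the counts may add up to fewer rows than the frame contains, which is worth keeping in mind when quoting totals.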
sns.set_style('whitegrid')
plt.figure(figsize=(6, 4))
sns.countplot(x=df["sex"], hue=df["sex"])
plt.show()
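Pie charts become hard to read when many slices are tiny, so rare labels are often folded into a single bucket before plotting. A minimal sketch of that idea as a reusable helper follows; the function name `lump_rare`, its parameters, and the toy counts are assumptions made for illustration only.

```python
import pandas as pd

def lump_rare(counts: pd.Series, threshold: int, label: str = "Other") -> pd.Series:
    """Keep entries with count >= threshold and sum the rest under `label`."""
    keep = counts[counts >= threshold].copy()
    rest = counts[counts < threshold].sum()
    if rest > 0:
        keep[label] = rest   # append the combined bucket as a new entry
    return keep

# Hypothetical counts, not taken from nobel.csv
toy_counts = pd.Series({"A": 50, "B": 30, "C": 5, "D": 3})
lumped = lump_rare(toy_counts, threshold=20)
print(lumped.to_dict())
```

The total is preserved by construction, so the percentages on the pie still sum to 100.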
# Combine small frequencies into 'Other countries'
threshold = 20 # You can adjust this threshold as needed
country_counts = df["birth_country"].value_counts()
other_countries = country_counts[country_counts < threshold].sum()
country_counts = country_counts[country_counts >= threshold]
country_counts["Other countries"] = other_countries
plt.figure(figsize=(14, 6))
country_counts.plot.pie(autopct='%1.1f%%')
plt.ylabel('')
plt.show()
Which decade had the highest ratio of US-born Nobel Prize winners to total winners in all categories?
df['decade'] = (df['year'] // 10) * 10
df['us_born'] = df['birth_country'] == 'United States of America'
total_winners = df.groupby('decade')['full_name'].count()
us_born = df.groupby('decade')['us_born'].sum()
ratio = us_born / total_winners
# Decade with the highest share of US-born winners
max_decade_usa = int(ratio.idxmax())
plt.figure(figsize=(10, 6))
sns.lineplot(data=ratio, marker='o')
plt.title('Ratio of US-born Nobel Prize Winners to Total Winners by Decade')
plt.xlabel('Decade')
plt.ylabel('Ratio')
plt.grid(True)
plt.show()
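The ratio above is a per-decade share of a boolean flag, which can also be computed in one step, since the mean of a boolean column is the fraction of `True` values. The sketch below shows this on a hypothetical toy frame (the years and countries are made up). Note that it divides by all rows in each decade, whereas counting `full_name` skips rows with a missing name.

```python
import pandas as pd

# Hypothetical toy data, not taken from nobel.csv
toy = pd.DataFrame({
    "year": [1901, 1905, 1912, 1918, 1919],
    "birth_country": [
        "Germany",
        "United States of America",
        "United States of America",
        "France",
        "United States of America",
    ],
})
toy["decade"] = (toy["year"] // 10) * 10
toy["us_born"] = toy["birth_country"] == "United States of America"

# Mean of a boolean column = fraction of True values in each group
ratio_toy = toy.groupby("decade")["us_born"].mean()
best_decade = int(ratio_toy.idxmax())

print(ratio_toy.to_dict(), best_decade)
```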
Which decade and Nobel Prize category combination had the highest proportion of female laureates?
# Female winners per decade & category
total_win = df[df['sex'] == 'Female'].groupby(['decade','category'])['full_name'].count().reset_index(name='count')
# Total winners per decade
total_winners = df.groupby(['decade'])['full_name'].count().reset_index(name='total_count')
# Total winners per decade & category
total_cat = df.groupby(['decade','category'])['full_name'].count().reset_index(name='total_category')
# Merge
total_win = total_win.merge(total_winners, on='decade')
total_win = total_win.merge(total_cat, on=['decade', 'category'])
# Add ratios
total_win['ratio_w'] = total_win['count'] / total_win['total_count']
total_win['ratio_w_in_cat'] = total_win['count'] / total_win['total_category']
# Plot
plt.figure(figsize=(12, 6))
sns.set_style('whitegrid')
sns.set_palette("tab10")
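# Sum the per-category shares so each bar is the decade's overall female share (the default estimator would average them)
sns.barplot(data=total_win, x='decade', y='ratio_w', estimator=np.sum, ci=None)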
plt.show()
plt.figure(figsize=(12, 6))
sns.set_palette("Paired")
sns.barplot(data=total_win, x='decade', y='ratio_w_in_cat', hue='category', ci=None)
plt.show()
# The first figure shows that the 2020s have the highest overall share of female laureates.
# The second figure shows that Literature in the 2020s has the highest share of female laureates within a category.
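The decade and category combination can also be read off programmatically rather than from the figures. A minimal sketch, assuming a boolean helper column (here called `female`, a name introduced only for this example) and made-up rows: grouping on both keys and taking the mean gives the female share per combination, including combinations with no women, and `idxmax()` returns the winning `(decade, category)` pair.

```python
import pandas as pd

# Hypothetical toy data, not taken from nobel.csv
toy = pd.DataFrame({
    "decade":   [2010, 2010, 2010, 2020, 2020, 2020],
    "category": ["Peace", "Peace", "Physics", "Literature", "Literature", "Physics"],
    "sex":      ["Female", "Male", "Male", "Female", "Female", "Male"],
})
toy["female"] = toy["sex"] == "Female"

# Share of female laureates for every (decade, category) pair
share = toy.groupby(["decade", "category"])["female"].mean()

# With a two-level index, idxmax() returns a (decade, category) tuple
top_decade, top_category = share.idxmax()
max_female_toy = {int(top_decade): top_category}

print(max_female_toy)
```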
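# Look up the decade/category pair with the highest within-category female share
top_row = total_win.loc[total_win['ratio_w_in_cat'].idxmax()]
max_female_dict = {int(top_row['decade']): top_row['category']}
max_female_dict
Who was the first woman to receive a Nobel Prize, and in what category?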
# Number of female laureates per year; the first index entry is the earliest year
f_women = df[df['sex'] == 'Female'].groupby('year')['full_name'].count()
name = df[(df['year'] == f_women.index[0]) & (df['sex'] == 'Female')][['full_name', 'category']]
# Read the answer from the data instead of typing it in by hand
first_woman_name = name['full_name'].iloc[0]
first_woman_category = name['category'].iloc[0]
name
Which individuals or organizations have won more than one Nobel Prize throughout the years?
# Count prizes per laureate; reset_index already returns a DataFrame
total_indev_comp = df.groupby(['laureate_type', 'full_name'])['full_name'].count().reset_index(name='count_names')
total_indev_comp_win = total_indev_comp[total_indev_comp['count_names'] > 1]
repeat_list = total_indev_comp_win["full_name"].tolist()
sns.set_style('darkgrid')
sns.barplot(data=total_indev_comp_win, y='full_name', x='count_names', hue='count_names')
plt.title('Individuals or Organizations with Multiple Nobel Prizes')
plt.xlabel('Number of Prizes')
plt.ylabel('Full Name')
plt.show()
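For a quick cross-check of the repeat winners, counting names directly with `value_counts()` gives the same kind of list without the intermediate frame. The sketch below uses hypothetical placeholder names, and it assumes each laureate's name is spelled identically on every row.

```python
import pandas as pd

# Hypothetical placeholder names, not taken from nobel.csv
toy = pd.DataFrame({
    "full_name": [
        "Laureate A", "Laureate B", "Laureate A",
        "Organization C", "Organization C", "Organization C",
    ],
})

counts = toy["full_name"].value_counts()
# Keep names that appear at least twice, sorted for a stable order
repeat_toy = sorted(counts[counts >= 2].index)

print(repeat_toy)
```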