Analyzing the Nobel Minds
The Nobel Prize is perhaps the most prestigious international award in the scientific world. Since 1901, the Nobel Committees have handed out the awards to the brightest minds for groundbreaking accomplishments in chemistry, literature, physics, physiology or medicine, economics, and peace. In addition to worldwide respect and substantial prize money, each recipient receives a gold medal bearing an image of Alfred Nobel (1833 - 1896), who established the prize.
The Nobel Foundation has made available a dataset of all prize winners from the first awards in 1901 through 2023. The dataset used in this project comes from the Nobel Prize API and is available in the nobel.csv file in the data folder.
In this project, we'll get our hands dirty exploring the Nobel Prize dataset and try to answer interesting questions related to the winners.
1. Getting Started
First things first: we will import some handy tools, namely pandas, NumPy, Matplotlib's pyplot, and seaborn. Then we will read the dataset into a DataFrame and take a look at the first few rows.
#Loading in required libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
#Read the data and create a DF
nobel = pd.read_csv("data/nobel.csv")
# Taking a look at the first several winners
nobel.head(6)
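Before aggregating anything, it can help to check how many values are missing in the columns used later. The snippet below is a minimal sketch on a hypothetical three-row stand-in for nobel.csv; the column names (year, category, sex, birth_country) are the ones this project uses, but the rows themselves are made up, and the idea that an organization row has no recorded sex or birth country is an assumption about the data.

```python
import pandas as pd

# Hypothetical stand-in for nobel.csv (made-up rows, not real laureates).
sample = pd.DataFrame({
    "year": [1901, 1901, 1917],
    "category": ["Physics", "Peace", "Peace"],
    "sex": ["Male", "Male", None],               # assumption: organizations have no sex
    "birth_country": ["Germany", "Switzerland", None],
})

# Count missing values per column before trusting any aggregate.
missing = sample.isna().sum()
print(missing)
```

On the real data, the same check would be `nobel.isna().sum()`.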
2. Who Wins the Nobel Prize?
At first sight, Europeans seem to be leading the way, as almost all of the 1901 winners were from Europe. But judging the data from only a few rows can be misleading. In this part, we will find the total number of winners, the most commonly awarded gender, and the 10 countries with the most winners.
#Total number of winners
total = len(nobel)
print(f"So far, {total} people and organizations have been awarded the Nobel Prize.")
# winners by gender.
winner_gender = nobel['sex'].value_counts()
print(winner_gender)
#the most commonly awarded gender
top_gender = winner_gender.index[0]
print(f"The gender with the most Nobel laureates is: {top_gender}.")
# Top 10 countries with the most winners.
print(nobel['birth_country'].value_counts().head(10))
#the most awarded country
top_country = nobel["birth_country"].value_counts().index[0]
print(f"The most common birth country of Nobel laureates is: {top_country}.")
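Raw counts are easier to compare as shares. Note also that value_counts skips missing values by default, so the counts may not add up to the number of rows. The sketch below illustrates both points on a toy Series (made-up values, not the real dataset).

```python
import pandas as pd

# Toy data: five winners, one with no recorded sex.
sex = pd.Series(["Male", "Male", "Female", None, "Male"])

# Default behaviour: missing values are ignored.
counts = sex.value_counts()

# Keep missing values as their own row.
counts_with_missing = sex.value_counts(dropna=False)

# Shares instead of counts (computed over non-missing values only).
shares = sex.value_counts(normalize=True)

print(counts_with_missing)
print(shares)
```

Applied to this project, that would be `nobel['sex'].value_counts(normalize=True)`.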
3. America First
The results show that the USA is the actual frontrunner in the Nobel Prize race. Remember that almost all of the winners in 1901 were from Europe? When did the USA get ahead of Europe in the Nobel Prize charts?
#convert years to decades
nobel["decade"] = nobel["year"]//10*10
#label us-born laureates
nobel["usa_born_winner"] = nobel["birth_country"] == "United States of America"
#proportion of the us-born winners by decade
prop_usa_winners = nobel.groupby("decade", as_index= False)["usa_born_winner"].mean()
print(prop_usa_winners)
#decade with the highest proportion of US-born laureates
max_decade_usa = prop_usa_winners[prop_usa_winners["usa_born_winner"] == prop_usa_winners["usa_born_winner"].max()]["decade"].values[0]
print(f"{max_decade_usa}s had the highest proportion of US-born winners")
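The boolean-mask lookup above works, but the same row can be fetched more directly with idxmax, which returns the index label of the first maximum. This is only an alternative sketch on toy proportions (the numbers are made up), not a change to the result.

```python
import pandas as pd

# Toy per-decade proportions (made-up numbers).
prop = pd.DataFrame({
    "decade": [1980, 1990, 2000],
    "usa_born_winner": [0.30, 0.40, 0.42],
})

# idxmax gives the index label of the row with the largest proportion.
peak_row = prop.loc[prop["usa_born_winner"].idxmax()]
peak_decade = int(peak_row["decade"])
peak_share = float(peak_row["usa_born_winner"])

print(f"{peak_decade}s: {peak_share:.0%}")
```

Pulling the whole row once also yields the peak proportion, which is handy when annotating a plot.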
4. Visualizing the USA Dominance
Visualizing the data may give us a different perspective, so let's plot the table above.
#prepare the key coordinates of the plot
peakx = max_decade_usa
peaky = prop_usa_winners[prop_usa_winners["usa_born_winner"] == prop_usa_winners["usa_born_winner"].max()]["usa_born_winner"].values[0]
#plot the proportion us-born laureates by decade
plt.figure(figsize=(10,5))
plt.plot(prop_usa_winners["decade"], prop_usa_winners["usa_born_winner"], ls = ":", marker = "o")
plt.plot(peakx, peaky, marker="o", markerfacecolor = "r")
# Annotate the peak using the computed coordinates rather than a hard-coded decade
plt.annotate(f"Peak Period {peakx}s ({round(peaky*100, 2)}%)", xy=(peakx, peaky), xytext=(1920, 0.33), fontsize=10, arrowprops={"arrowstyle": "->", "connectionstyle": "arc3, rad= -0.2", "color": "k"})
plt.title("Proportion of US-born Nobel Prize Winners by Decade", y = 1.05)
plt.xlabel("Decades")
plt.ylabel("Proportion of US-born Nobel Laureates")
plt.xticks(np.arange(1900, 2023, step=10))
plt.show()
5. Gender Diversity of the Nobel Prize Winners
Starting from the 1940s, one of every three Nobel winners was American, and this proportion peaked in the 2000s. It is interesting that this period coincides with the mass migration of European scientists to the USA, when the great minds of Europe fled the rise of fascist governments. Does the list of winners reflect the zeitgeist? If so, can we also observe gender inequality in Nobel Prize history? Historically, women have been under-represented in scientific circles relative to men, due to men's privileged social status. In section 2, we saw that there are only 65 female compared to 905 male Nobel laureates. Is this imbalance the same across all prize categories? How has it changed over time? Are women still under-represented among Nobel laureates?
# Calculating the proportion of female laureates per decade and category
nobel['female_win'] = nobel['sex'] == 'Female'
fem_dec_cat = nobel.groupby(['decade', 'category'], as_index=False)['female_win'].mean()
# Setting the style first so it applies to the plot below
sns.set_style("darkgrid")
# Plotting female winners with % winners on the y-axis
ax = sns.lineplot(x='decade', y='female_win', hue='category', data=fem_dec_cat)
# Adding %-formatting to the y-axis
from matplotlib.ticker import PercentFormatter
ax.yaxis.set_major_formatter(PercentFormatter(1.0))
Wow! The chaotic fluctuations in the plot are reminiscent of the cryptocurrency market. But one can easily spot a trend: since the 2000s, more women have been awarded the Nobel Prize. This trend is visible in literature, peace, and, to a lesser degree, medicine. It is promising to see more women being recognized, yet the gender gap persists in fields such as physics, economics, and chemistry. So which category and decade have the highest proportion of female laureates?
#highest proportion of female winners by decade and category
max_fem_dec_cat = fem_dec_cat[fem_dec_cat["female_win"] == fem_dec_cat["female_win"].max()]
fem_decade = max_fem_dec_cat["decade"].values[0]
fem_category = max_fem_dec_cat["category"].values[0]
#answer
print(f"{fem_category} category has the highest proportion of female winners in {fem_decade}s.")
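A proportion computed within a single decade and category can rest on very few laureates, so the top cell may be less impressive than it looks. One way to check, sketched here on made-up toy rows, is to report the group size next to the share; the column names female_share and n_winners are introduced only for this illustration.

```python
import pandas as pd

# Toy rows (made up) with the same columns the analysis relies on.
toy = pd.DataFrame({
    "decade":   [2000, 2000, 2000, 2010, 2010, 2010, 2010],
    "category": ["Peace", "Peace", "Physics", "Peace", "Physics", "Physics", "Physics"],
    "sex":      ["Female", "Male", "Male", "Female", "Male", "Male", "Female"],
})
toy["female_win"] = toy["sex"] == "Female"

# Share of female winners and number of winners per decade and category.
summary = (
    toy.groupby(["decade", "category"], as_index=False)
       .agg(female_share=("female_win", "mean"), n_winners=("female_win", "count"))
       .sort_values("female_share", ascending=False)
)
print(summary)
```

In this toy table the top cell has a 100% female share but contains a single winner, which is exactly the situation the extra column is meant to expose.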