1. The most Nobel of Prizes

The Nobel Prize is perhaps the world's best-known scientific award. Besides the honor, prestige, and substantial prize money, the recipient also gets a gold medal showing Alfred Nobel (1833 - 1896), who established the prize. Every year it's given to scientists and scholars in the categories chemistry, literature, physics, physiology or medicine, economics, and peace. The first Nobel Prize was handed out in 1901, and at that time the Prize was very Eurocentric and male-focused, but nowadays it's not biased in any way whatsoever. Surely. Right?
Well, we're going to find out! The Nobel Foundation has made available a dataset of all prize winners from the start of the prize, in 1901, to 2016. Let's load it in and take a look.
# Loading in required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Reading in the Nobel Prize data
nobel = pd.read_csv('datasets/nobel.csv')
# Taking a look at the first several winners
nobel.head(6)

2. So, who gets the Nobel Prize?
Just looking at the first couple of prize winners, or Nobel laureates as they are also called, we already see a celebrity: Wilhelm Conrad Röntgen, the guy who discovered X-rays. And actually, we see that all of the winners in 1901 were guys who came from Europe. But that was back in 1901. Looking at all winners in the dataset, from 1901 to 2016, which sex and which country are the most commonly represented?
(For country, we will use the birth_country of the winner, as the organization_country is NaN for all shared Nobel Prizes.)
# Summarize the columns and their dtypes
nobel.info()
# Count the missing values (NaN) in each column
nobel.isna().sum()
nobel.head(2)
# Display the number of (possibly shared) Nobel Prizes handed
# out between 1901 and 2016 (both years included)
display('The number of Nobel Prizes handed out between 1901 and 2016.')
display(len(nobel.loc[(nobel['year'] >= 1901) & (nobel['year'] <= 2016), 'prize']))
# Display the number of prizes won by male and female recipients.
display('The number of prizes won by male and female recipients.')
display(nobel['sex'].value_counts())
# Display the number of prizes won by the top 10 nationalities.
nobel['birth_country'].value_counts().head(10)

3. USA dominance
Not so surprising perhaps: the most common Nobel laureate between 1901 and 2016 was a man born in the United States of America. But in 1901 all the winners were European. When did the USA start to dominate the Nobel Prize charts?
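One way to approach this question is to bin each year into its decade and then take the mean of a True/False flag per decade. The following is only a minimal sketch on invented toy data (the `toy` frame and its rows are assumptions for illustration; only the column names mirror the notebook). It uses integer floor division, which should give the same decades as the `np.floor` approach for whole-number years.

```python
import pandas as pd

# Toy data, invented for illustration only; column names mirror the notebook.
toy = pd.DataFrame({
    'year': [1901, 1905, 1932, 1938, 1939],
    'birth_country': ['Germany', 'France', 'United States of America',
                      'Italy', 'United States of America'],
})

# Integer floor division drops the last digit of the year: 1938 // 10 * 10 == 1930.
toy['decade'] = toy['year'] // 10 * 10

# The mean of a boolean column is the share of True values in each group.
toy['usa_born_winner'] = toy['birth_country'] == 'United States of America'
toy_prop = toy.groupby('decade', as_index=False)['usa_born_winner'].mean()
print(toy_prop)
```

In this toy frame the 1900s have no USA-born winners and the 1930s have two out of three.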
# Example: flooring years to the start of their decade
year = pd.Series([1843, 1877, 1923])
(np.floor(year / 10) * 10).astype(int)
# Calculating the proportion of USA born winners per decade
nobel['usa_born_winner'] = nobel['birth_country'] == 'United States of America'
# Divide by 10 and floor first, then multiply by 10, so that 1938 becomes 1930
nobel['decade'] = (np.floor(nobel['year'] / 10) * 10).astype(int)
prop_usa_winners = nobel.groupby('decade', as_index=False)['usa_born_winner'].mean()
# Display the proportions of USA born winners per decade
display(prop_usa_winners)

4. USA dominance, visualized
A table is OK, but to see when the USA started to dominate the Nobel charts we need a plot!
# Setting the plotting theme
sns.set()
# and setting the size of all plots.
plt.rcParams['figure.figsize'] = [11, 7]
# Plotting USA born winners 
ax = sns.lineplot(x='decade',y='usa_born_winner',data=nobel)
# Adding %-formatting to the y-axis using PercentFormatter.
# See this Stack Overflow answer on how to use PercentFormatter:
# https://stackoverflow.com/questions/31357611/format-y-axis-as-percent/36319915#36319915
from matplotlib.ticker import PercentFormatter
ax.yaxis.set_major_formatter(PercentFormatter(1.0))
plt.show()

5. What is the gender of a typical Nobel Prize winner?
So the USA first became the dominant winner of the Nobel Prize in the 1930s and has kept the leading position ever since. But one group that was in the lead from the start, and never seems to let go, is men. Maybe it shouldn't come as a shock that there is some imbalance between how many male and female prize winners there are, but how significant is this imbalance? And is it better or worse within specific prize categories like physics, medicine, literature, etc.?
# Calculating the proportion of female laureates per decade
nobel['female_winner'] = nobel['sex'] == 'Female'
# Mean of 'female_winner' grouped by ['decade', 'category']
prop_female_winners = nobel.groupby(['decade','category'],as_index=False)['female_winner'].mean()
# Plotting female winners per category with % winners on the y-axis
ax = sns.lineplot(x='decade',y='female_winner',data = nobel,hue='category')
ax.yaxis.set_major_formatter(PercentFormatter(1.0))
plt.show()
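# Count the non-female (False) and female (True) winners
print(nobel['female_winner'].value_counts())
# The mean of the boolean column is the overall share of female winners:
# mean = (862*0+49*1)/(862+49)
print('mean:', nobel['female_winner'].mean())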
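The arithmetic in the comment above relies on True counting as 1 and False as 0, so the mean of a boolean column equals the share of True values. A small sketch of that equivalence, using an invented toy Series rather than the real data (the values below are assumptions for illustration only):

```python
import pandas as pd

# Toy flags, invented for illustration: 2 female winners out of 8.
female_winner = pd.Series([False, True, False, False, False, True, False, False])

# Summing a boolean Series counts the True values (True is 1, False is 0).
n_female = int(female_winner.sum())
n_total = len(female_winner)

# Share computed by hand from the counts...
share_from_counts = n_female / n_total
# ...and the same share obtained directly as the mean of the flags.
share_from_mean = female_winner.mean()
print(share_from_counts, share_from_mean)
```

Both values should agree, which is why the notebook can use `.mean()` on `female_winner` to get a proportion.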