The Nobel Prize has been among the most prestigious international awards since 1901. Each year, awards are bestowed in chemistry, literature, physics, physiology or medicine, economics, and peace. In addition to the honor, prestige, and substantial prize money, the recipient also gets a gold medal with an image of Alfred Nobel (1833 - 1896), who established the prize.
The Nobel Foundation has made available a dataset of all prize winners from the first awards in 1901 through 2023. The dataset used in this project comes from the Nobel Prize API and is available in the nobel.csv file in the data folder.
In this project, you'll get a chance to explore and answer several questions related to this prizewinning data. We also encourage you to explore any further questions that interest you!
# Loading in required libraries
import pandas as pd
import seaborn as sns
import numpy as np
# Load Dataset into DataFrame
nobel = pd.read_csv("data/nobel.csv")
print(nobel.head())
# Extract the most common values from 'sex' and 'birth_country'
# value_counts() sorts by frequency (descending), so .index[0] is the most common value
top_gender = nobel['sex'].value_counts().index[0]
top_country = nobel['birth_country'].value_counts().index[0]
print("\n The gender with the most Nobel Prize winners is:", top_gender)
print("\n The most common birth country of Nobel Prize winners is:", top_country)
# Create a boolean column flagging winners born in the USA
nobel['usa_born_winner'] = nobel['birth_country'] == 'United States of America'
# Create a decade column: divide 'year' by 10, wrap that in np.floor(), multiply by 10, then chain .astype(int)
nobel['decade'] = (np.floor(nobel['year'] / 10) * 10).astype(int)
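The decade calculation above can be checked on a few hand-picked years. The following is a minimal sketch on a made-up DataFrame (the name `toy_years` and its values are illustrative assumptions, not rows from nobel.csv). It also shows that integer floor division appears to give the same result without the cast.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from nobel.csv
toy_years = pd.DataFrame({"year": [1901, 1909, 1910, 1999, 2023]})

# Same approach as above: divide, floor, multiply back, cast to int
toy_years["decade_floor"] = (np.floor(toy_years["year"] / 10) * 10).astype(int)

# Equivalent integer-only form using floor division
toy_years["decade_intdiv"] = (toy_years["year"] // 10) * 10

print(toy_years)
```

Either form maps 1909 to 1900 and 1910 to 1910; the floor-division version simply stays in integer arithmetic throughout.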
# Group by 'decade', take the .mean() of 'usa_born_winner', and store the result in a new DataFrame
prop_usa_winners = nobel.groupby('decade', as_index=False)['usa_born_winner'].mean()
# Filter the DataFrame to the row where the proportion equals its .max()
# Use .values[0] to keep only the decade value
max_decade_usa = prop_usa_winners[prop_usa_winners['usa_born_winner'] == prop_usa_winners['usa_born_winner'].max()]['decade'].values[0]
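As a possible alternative to the boolean-mask filter above, `idxmax()` can locate the row with the largest value directly. The sketch below uses hypothetical proportions (the name `toy_props` and its numbers are made up for illustration) and compares both approaches.

```python
import pandas as pd

# Hypothetical proportions per decade, for illustration only
toy_props = pd.DataFrame({
    "decade": [1980, 1990, 2000, 2010],
    "usa_born_winner": [0.30, 0.40, 0.42, 0.31],
})

# Boolean-mask approach, as in the code above
is_max = toy_props["usa_born_winner"] == toy_props["usa_born_winner"].max()
by_mask = toy_props[is_max]["decade"].values[0]

# idxmax() returns the index label of the first maximum; .loc then reads the decade
by_idxmax = toy_props.loc[toy_props["usa_born_winner"].idxmax(), "decade"]

print(by_mask, by_idxmax)
```

Both approaches return the first matching decade when there is a tie, so they should agree on data like this.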
# This step is optional. Use the relplot() function to draw a relational line plot
ax1 = sns.relplot(x="decade", y="usa_born_winner", data=prop_usa_winners, kind="line")
# Flag female winners, then group by 'decade' and 'category'
nobel['female_winner'] = nobel['sex'] == 'Female'
prop_female_winners = nobel.groupby(['decade', 'category'], as_index=False)['female_winner'].mean()
# Filter the new DataFrame to the row with the highest proportion, keeping only 'decade' and 'category'
max_female_decade_category = prop_female_winners[prop_female_winners['female_winner'] == prop_female_winners['female_winner'].max()][['decade', 'category']]
# Build max_female_dict, mapping the decade (key) to the category (value)
max_female_dict = {max_female_decade_category['decade'].values[0]: max_female_decade_category['category'].values[0]}
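The two-column grouping and the dictionary step can be traced end to end on a tiny example. This is a sketch under stated assumptions: `toy_prizes` and its rows are invented for illustration, and `idxmax()` is swapped in for the boolean filter used above.

```python
import pandas as pd

# Hypothetical rows, not taken from nobel.csv
toy_prizes = pd.DataFrame({
    "decade": [2000, 2000, 2000, 2010, 2010, 2010],
    "category": ["Peace", "Peace", "Physics", "Peace", "Physics", "Physics"],
    "female_winner": [True, False, False, True, True, False],
})

# The mean of a boolean column is the proportion of True values in each group
toy_female_props = toy_prizes.groupby(
    ["decade", "category"], as_index=False
)["female_winner"].mean()

# Pick the row with the highest proportion and build a {decade: category} dict
top_row = toy_female_props.loc[toy_female_props["female_winner"].idxmax()]
toy_female_dict = {int(top_row["decade"]): top_row["category"]}

print(toy_female_props)
print(toy_female_dict)
```

Casting the key with `int()` keeps the dictionary key a plain Python integer rather than a NumPy scalar.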
# This step is also optional. Use relplot() and add the hue parameter, mapping it to the 'category' variable
ax2 = sns.relplot(x="decade", y="female_winner", data=prop_female_winners, kind="line", hue='category')
# Filter the DataFrame to female winners using the 'female_winner' column
# Find the row with the earliest year using .min() and save it as min_row
nobel_women = nobel[nobel['female_winner']]
min_row = nobel_women[nobel_women['year'] == nobel_women['year'].min()]
first_woman_name = min_row['full_name'].values[0]
first_woman_category = min_row['category'].values[0]
print(f"\n The first woman to win a Nobel Prize Award was {first_woman_name}, in the category of {first_woman_category}.")
# Use .value_counts() to count how many times each name appears
# Keep the names with two or more prizes, then save them as a list
counts = nobel['full_name'].value_counts()
repeats = counts[counts >= 2].index
repeat_list = list(repeats)
print("\n The repeat winners are:", repeat_list)
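The repeat-winner logic above can be illustrated on a short list of placeholder names. The sketch below assumes made-up values (`toy_names` and the names in it are hypothetical) and shows how `value_counts()` and the `>= 2` filter interact.

```python
import pandas as pd

# Hypothetical placeholder names, not real laureates
toy_names = pd.Series(
    ["A. Example", "B. Sample", "A. Example",
     "C. Placeholder", "B. Sample", "A. Example"],
    name="full_name",
)

# Count occurrences of each name (sorted by count, descending)
toy_counts = toy_names.value_counts()

# Keep only names appearing at least twice and convert the index to a list
toy_repeats = list(toy_counts[toy_counts >= 2].index)

print(toy_repeats)
```

Note that this approach relies on each winner's name being spelled identically across rows; differently formatted names for the same laureate would be counted separately.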