The Nobel Prize has been among the most prestigious international awards since 1901. Each year, awards are bestowed in chemistry, literature, physics, physiology or medicine, economics, and peace. In addition to the honor, prestige, and substantial prize money, the recipient also gets a gold medal with an image of Alfred Nobel (1833 - 1896), who established the prize.
The Nobel Foundation has made available a dataset of all prize winners from the first awards in 1901 through 2023. The dataset used in this project comes from the Nobel Prize API and is available in the nobel.csv file in the data folder.
In this project, you'll explore and answer several questions about this prizewinning data. We encourage you to then explore further questions that interest you!
# Loading in required libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Start coding here!
df = pd.read_csv('data/nobel.csv')
# Display the first few rows to understand the structure
print(df.head())
# Check column names and data types
print(df.info())
# Check for missing values
print(df.isnull().sum())
# Most common gender
top_gender = df['sex'].value_counts().index[0]
print("Most common gender:", top_gender)
# Most common birth country
top_country = df['birth_country'].value_counts().index[0]
print("Most common birth country:", top_country)
#Which decade had the highest ratio of US-born Nobel Prize winners to total winners in all categories?
# Create a column to flag US-born winners
df['us_born_winner'] = df['birth_country'] == 'United States of America'
# Create a decade column
df['decade'] = (np.floor(df['year'] / 10) * 10).astype(int)
# Group by decade and calculate the mean ratio of US-born winners
us_born_ratio = df.groupby('decade', as_index=False)['us_born_winner'].mean()
us_born_ratio.rename(columns={'us_born_winner': 'us_born_ratio'}, inplace=True)
# Find the decade with the highest ratio of US-born winners
max_decade_usa = us_born_ratio.loc[us_born_ratio['us_born_ratio'].idxmax(), 'decade']
print("Decade with the highest ratio of US-born winners:", max_decade_usa)
# Create a relational line plot to visualize the trend
sns.relplot(
    x='decade',
    y='us_born_ratio',
    data=us_born_ratio,
    kind='line',
    marker='o'
)
# Add labels and title
plt.title('Ratio of US-born Nobel Prize Winners by Decade')
plt.xlabel('Decade')
plt.ylabel('US-born Winner Ratio')
plt.show()
# Which decade and category combination had the highest proportion of female laureates?
# Add a column to indicate female winners
df['female_winner'] = df['sex'] == 'Female'
# Group by decade and category, then calculate the mean proportion of female winners
female_proportion = (
    df.groupby(['decade', 'category'], as_index=False)['female_winner']
    .mean()
    .rename(columns={'female_winner': 'female_proportion'})
)
# Find the row with the highest proportion of female winners
max_female_row = female_proportion.loc[female_proportion['female_proportion'].idxmax()]
# Create a dictionary with the decade and category
max_female_dict = {max_female_row['decade']: max_female_row['category']}
print("Decade and category with the highest proportion of female laureates:", max_female_dict)
# Create a relational line plot with multiple categories
sns.relplot(
    x='decade',
    y='female_proportion',
    hue='category',
    data=female_proportion,
    kind='line',
    marker='o'
)
# Add labels and title
plt.title('Proportion of Female Nobel Laureates by Decade and Category')
plt.xlabel('Decade')
plt.ylabel('Proportion of Female Laureates')
plt.show()
# Find the first woman to win a Nobel Prize
# Filter the DataFrame for female winners
female_winners = df[df['sex'] == 'Female']
# Find the row with the earliest year
first_female_winner = female_winners[female_winners['year'] == female_winners['year'].min()]
# Extract relevant information
first_female_year = first_female_winner['year'].values[0]
first_woman_category = first_female_winner['category'].values[0]
first_woman_name = first_female_winner['full_name'].values[0]
print(f"The first woman to win a Nobel Prize was {first_woman_name} in {first_female_year}, in the category of {first_woman_category}.")
# Determine repeat winners
# Count the number of wins for each laureate
winner_counts = df['full_name'].value_counts()
# Filter for repeat winners (counts >= 2)
repeat_winners = winner_counts[winner_counts >= 2].index
# Save the names as a list
repeat_list = list(repeat_winners)
print("Repeat winners:", repeat_list)
Analyze this project and give me a description to write in my portfolio.
Nobel Prize Data Analysis Project
This project explores the history of Nobel Prize winners from 1901 to 2023 using a dataset of laureate information. Key analyses include:
- Finding the most common gender and birth country among laureates.
- Identifying the decade with the highest ratio of US-born winners to total winners.
- Identifying the decade and category combination with the highest proportion of female laureates.
- Identifying the first woman to win a Nobel Prize, including her name, the year, and the prize category.
- Detecting repeat Nobel laureates by counting the number of wins for each individual.
The project demonstrates data filtering, aggregation, visualization, and extraction of historical insights using Python, pandas, and seaborn. It answers specific questions about gender milestones, national trends, and patterns of repeated excellence in Nobel history.
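The filtering and aggregation patterns named above can be tried on a small table before running them on the full dataset. The sketch below is illustrative only: the `toy` DataFrame and every value in it are made up, and only the column names mirror those used in the analysis.

```python
import pandas as pd

# Hypothetical toy data: names and values are invented for illustration.
toy = pd.DataFrame({
    "year": [1903, 1911, 1935, 1947, 1962],
    "full_name": ["Laureate A", "Laureate A", "Laureate B",
                  "Laureate C", "Laureate D"],
    "sex": ["Female", "Female", "Male", "Female", "Male"],
    "birth_country": ["Poland", "Poland", "United States of America",
                      "Austria", "United States of America"],
})

# Bin each year into its decade with integer floor division.
toy["decade"] = toy["year"] // 10 * 10

# The mean of a boolean flag is the share of rows where the flag is True.
toy["us_born_winner"] = toy["birth_country"] == "United States of America"
us_ratio = toy.groupby("decade")["us_born_winner"].mean()

# idxmax returns the first decade holding the maximum ratio.
best_decade = int(us_ratio.idxmax())

# Names that appear at least twice are repeat winners.
counts = toy["full_name"].value_counts()
repeat_list = counts[counts >= 2].index.tolist()

# Earliest row among female laureates.
first_woman = toy[toy["sex"] == "Female"].nsmallest(1, "year").iloc[0]

print(best_decade, repeat_list, first_woman["full_name"])
```

Note that `idxmax` breaks ties by returning the first matching label, so when several decades share the top ratio only the earliest one is reported.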