The Nobel Prize has been among the most prestigious international awards since 1901. Each year, awards are bestowed in chemistry, literature, physics, physiology or medicine, economics, and peace. In addition to the honor, prestige, and substantial prize money, the recipient also gets a gold medal with an image of Alfred Nobel (1833 - 1896), who established the prize.
The Nobel Foundation has made available a dataset of all prize winners from the first awards in 1901 through 2023. The dataset used in this project comes from the Nobel Prize API and is stored in the nobel.csv file in the data folder.
In this project, you'll get a chance to explore and answer several questions related to this prizewinning data. We encourage you to then explore further questions that interest you!
# R Code
# Loading in required libraries
library(tidyverse) # Attaches readr, dplyr, and ggplot2, so separate library() calls are not needed
# Start coding here!
1. Load the dataset and find the most common gender and birth country
Load the dataset into a data frame, then extract the most frequent values of sex and birth_country and store them as top_gender and top_country.
# Step 1: Load the dataset
file_path <- "./data/nobel.csv"
nobel <- suppressWarnings(
  read_csv(file_path, show_col_types = FALSE) # Suppress both column type display and parsing warnings
)
# Step 1a: Find the most common gender and birth country
top_gender <- nobel %>%
  count(sex, sort = TRUE) %>%
  slice(1) %>%
  pull(sex)
top_country <- nobel %>%
  count(birth_country, sort = TRUE) %>%
  slice(1) %>%
  pull(birth_country)
cat("Most common gender:", top_gender, "\n")
cat("Most common birth country:", top_country, "\n")
2. Identify the decade with the highest ratio of US-born winners
To calculate the proportion, first add a column that flags winners whose birth country is "United States of America", then add a decade column, and use both to compute the proportion of US-born winners per decade. Save the decade with the highest proportion as max_decade_usa.
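As a minimal sketch of the idea, assuming a small made-up data frame (the `toy` rows below are invented for illustration and are not taken from nobel.csv), the decade can be derived with simple arithmetic, and the proportion is the mean of a logical flag. Base R is used here so the snippet runs without the dataset or the tidyverse.

```r
# Hypothetical toy data: years and birth countries are invented for illustration
toy <- data.frame(
  year = c(1901, 1909, 1910, 1915, 1919, 1923),
  birth_country = c("France", "United States of America", "Germany",
                    "United States of America", "United States of America",
                    "Sweden")
)

# Flag US-born rows; TRUE/FALSE values average to a proportion
toy$us_born <- toy$birth_country == "United States of America"

# Map each year to the start of its decade, e.g. 1915 -> 1910
toy$decade <- floor(toy$year / 10) * 10

# Proportion of US-born rows per decade
toy_ratio <- aggregate(us_born ~ decade, data = toy, FUN = mean)

# Decade with the highest proportion
toy_max_decade <- toy_ratio$decade[which.max(toy_ratio$us_born)]

print(toy_ratio)
print(toy_max_decade)
```

Because `mean()` treats TRUE as 1 and FALSE as 0, no separate counting step is needed.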
# Step 2: Identify the decade with the highest ratio of US-born winners
# Step 2a: Create a flag column for US-born winners
nobel <- nobel %>%
  mutate(us_born_winner = birth_country == "United States of America")
# Step 2b: Create a decade column
nobel <- nobel %>%
  mutate(decade = floor(year / 10) * 10)
# Step 2c: Calculate the ratio of US-born winners by decade
us_winners_ratio <- nobel %>%
  group_by(decade) %>%
  summarize(us_winner_ratio = mean(us_born_winner, na.rm = TRUE), .groups = "drop")
# Step 2d: Identify the decade with the highest ratio
max_decade_usa <- us_winners_ratio %>%
  filter(us_winner_ratio == max(us_winner_ratio)) %>%
  pull(decade)
cat("Decade with the highest US-born winners ratio:", max_decade_usa, "\n")
# Step 2e: Create a line plot for US-born winners' ratio over decades
ggplot(us_winners_ratio, aes(x = decade, y = us_winner_ratio)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Proportion of US-born Nobel Laureates by Decade",
    x = "Decade",
    y = "Proportion of US-born Winners"
  ) +
  theme_minimal()
3. Find the decade and category with the highest proportion of female laureates
You can copy and modify your code from the previous task to find the proportion of female winners per decade and category, then create a list called max_female_list holding the decade and category pair with the highest proportion of female winners.
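The pair with the highest proportion of female winners need not be the pair with the largest number of them, since a small group can reach a high proportion with very few laureates. The sketch below illustrates this on made-up rows (the `toy` data frame is hypothetical and not taken from nobel.csv), again in base R so it runs on its own.

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  decade   = c(2000, 2000, 2000, 2000, 2010),
  category = c("Peace", "Peace", "Peace", "Peace", "Literature"),
  sex      = c("Female", "Female", "Male", "Male", "Female")
)

# Logical flag for female rows
toy$female <- toy$sex == "Female"

# Proportion and count of female rows per decade/category pair
prop <- aggregate(female ~ decade + category, data = toy, FUN = mean)
cnt  <- aggregate(female ~ decade + category, data = toy, FUN = sum)

# The two criteria can select different pairs
top_by_prop  <- prop[which.max(prop$female), c("decade", "category")]
top_by_count <- cnt[which.max(cnt$female), c("decade", "category")]

print(top_by_prop)
print(top_by_count)
```

In this toy data the proportion picks the single-laureate group, while the count picks the larger group, so it is worth checking group sizes before interpreting the result.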
# Step 3: Find the decade and category with the highest proportion of female laureates
# Step 3a: Create a flag column for female winners
nobel <- nobel %>%
  mutate(female_winner = sex == "Female")
# Step 3b: Group by decade and category, calculate the proportion of female winners
female_winner_ratio <- nobel %>%
  group_by(decade, category) %>%
  summarize(female_ratio = mean(female_winner, na.rm = TRUE), .groups = "drop")
# Step 3c: Sort by female_ratio, decade, and category
max_female_row <- female_winner_ratio %>%
  arrange(desc(female_ratio), desc(decade), category) %>%
  slice(1)
# Step 3d: Store the results as a list
max_female_list <- list(
  decade = max_female_row$decade,
  category = max_female_row$category
)
cat("Decade and category with the highest proportion of female laureates:\n")
print(max_female_list)
# Step 3e: Create a line plot for female winners' proportions by decade and category
ggplot(female_winner_ratio, aes(x = decade, y = female_ratio, color = category)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Proportion of Female Nobel Laureates by Decade and Category",
    x = "Decade",
    y = "Proportion of Female Laureates"
  ) +
  theme_minimal()
4. Find first woman to win a Nobel Prize
Filter the data for the rows with Female winners and find the earliest year and corresponding category in this subset.
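A minimal base R sketch of this filter-then-minimum pattern, assuming a made-up data frame (all names and years in `toy` are hypothetical placeholders, not real laureates):

```r
# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  year      = c(1901, 1905, 1903, 1911),
  category  = c("Physics", "Peace", "Chemistry", "Chemistry"),
  sex       = c("Male", "Female", "Female", "Female"),
  full_name = c("Laureate A", "Laureate B", "Laureate C", "Laureate C")
)

# Keep only female rows; which() drops rows where sex is NA
toy_women <- toy[which(toy$sex == "Female"), ]

# which.min() returns the position of the first minimum year
toy_first <- toy_women[which.min(toy_women$year), ]

print(toy_first)
```

If several women share the earliest year, `which.min()` keeps only the first of those rows, which mirrors taking a single row after filtering on the minimum year.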
# Step 4: Find the first woman to win a Nobel Prize
female_winners <- nobel %>%
  filter(female_winner)
first_female_winner <- female_winners %>%
  filter(year == min(year)) %>%
  slice(1)
first_woman_name <- first_female_winner$full_name
first_woman_category <- first_female_winner$category
first_woman_year <- first_female_winner$year
# End the message with a newline so later output starts on its own line
cat("First woman to win a Nobel Prize:", first_woman_name,
    "for", first_woman_category, "in", paste0(first_woman_year, "."), "\n")
5. Determine repeat winners
Count the number of times each winner has won, then select those with counts of two or more, saving the full names as a data frame called repeats.
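The counting step can be sketched in base R with `table()`, assuming a made-up vector of names (the values in `toy_names` are hypothetical placeholders). Note that counting by name assumes each winner's name is spelled identically on every row.

```r
# Hypothetical names, invented for illustration only
toy_names <- c("Laureate A", "Laureate B", "Laureate A", "Organization C",
               "Organization C", "Organization C", "Laureate D")

# Count how often each name appears
name_counts <- table(toy_names)

# Keep names that appear at least twice, most frequent first
toy_repeats <- sort(name_counts[name_counts >= 2], decreasing = TRUE)

# Convert to a data frame with one row per repeat winner
toy_repeats_df <- data.frame(
  full_name = names(toy_repeats),
  n = as.integer(toy_repeats)
)

print(toy_repeats_df)
```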
# Step 5: Identify repeat winners with counts
repeats <- nobel %>%
  # a. Count occurrences of each full_name, most frequent first
  count(full_name, sort = TRUE) %>%
  # b. Keep winners with counts of 2 or more; the result is a data frame
  filter(n >= 2)
# Display the results
cat("Individuals or organizations with multiple Nobel Prizes:\n")
print(repeats)
Congratulations, you completed the project!