The Nobel Prize has been among the most prestigious international awards since 1901. Each year, awards are bestowed in chemistry, literature, physics, physiology or medicine, economics, and peace. In addition to the honor, prestige, and substantial prize money, the recipient also gets a gold medal with an image of Alfred Nobel (1833 - 1896), who established the prize.
The Nobel Foundation has made available a dataset of all prize winners from the first awards in 1901 through 2023. The dataset used in this project comes from the Nobel Prize API and is stored in the nobel.csv file in the data folder.
In this project, you'll get a chance to explore and answer several questions related to this prizewinning data. We then encourage you to explore further questions that interest you!
# Loading in required libraries
import pandas as pd
import seaborn as sns
import numpy as np
# Start coding here!
df = pd.read_csv('data/nobel.csv')
df.head()

# Count the laureates of each gender
top_gender_a = df['sex'].value_counts()
# Select the most commonly awarded gender
top_gender = str(top_gender_a.index[0])
top_gender

df.dtypes

# Count the laureates born in each country
top_country_a = df['birth_country'].value_counts()
# Select the most common birth country of the awardees
top_country = str(top_country_a.index[0])
print(top_country)
# Flag laureates born in the United States
df['US_born'] = df['birth_country'] == 'United States of America'
# Create a decade column (e.g. 1987 -> 1980); integer division already floors
df['decade'] = ((df['year'] // 10) * 10).astype(int)

# Group the data by the decade column
decade_usa = df.groupby("decade", as_index=False)['US_born'].mean()
# Sort in descending order so the decade with the highest US-born ratio comes first
new_decade_usa = decade_usa.sort_values("US_born", ascending=False)
# Get the decade with the highest ratio of US-born winners
max_decade_usa = new_decade_usa['decade'].values[0]
print(max_decade_usa)
# Create a relational line plot to visualise the ratio per decade
import matplotlib.pyplot as plt
g = sns.relplot(x='decade', y='US_born', data=new_decade_usa, kind='line', marker='v', color='g')
# Set the title before showing the figure so that it is rendered
g.fig.suptitle('Ratio of US-born Nobel Prize winners per decade')
plt.show()
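The per-decade ratio relies on the fact that pandas treats True as 1 and False as 0, so the mean of a boolean column is the share of True values in each group. A minimal sketch on a small made-up table (the `toy` frame and its values are hypothetical and are not taken from nobel.csv):

```python
import pandas as pd

# Hypothetical toy data for illustration only, not from nobel.csv
toy = pd.DataFrame({
    "year": [1901, 1905, 1909, 1911, 1912, 1918],
    "birth_country": [
        "Germany", "United States of America", "France",
        "United States of America", "United States of America", "Sweden",
    ],
})

# Boolean flag: True where the laureate was born in the USA
toy["US_born"] = toy["birth_country"] == "United States of America"
# Integer division floors the year to the start of its decade
toy["decade"] = (toy["year"] // 10) * 10

# The mean of a boolean column is the proportion of True values per group
toy_ratio = toy.groupby("decade", as_index=False)["US_born"].mean()
print(toy_ratio)
```

Here one of the three laureates in the 1900s and two of the three in the 1910s are flagged, so the means are one third and two thirds respectively.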
# Flag female laureates
df['female_winner'] = df['sex'] == 'Female'
# Group the data by decade and category
decade_female = df.groupby(["decade","category"], as_index=False)['female_winner'].mean()
decade_female
# Sort so the decade/category pair with the highest proportion of female winners comes first
sorted_df = decade_female.sort_values(by='female_winner', ascending=False)
sorted_df.head()
# Get the decade and category of the top row
max_row = sorted_df.iloc[0][['decade', 'category']]
# Create and store the answer in a dictionary, using plain Python types
max_female_dict = {int(max_row['decade']): str(max_row['category'])}
print(max_female_dict)

# Create a relational line plot to visualise the proportion per decade and category
# errorbar=None needs seaborn >= 0.12; older versions use ci=None instead
g = sns.relplot(x='decade', y='female_winner', data=sorted_df, kind='line', marker='o', hue='category', errorbar=None)
# Set the title before showing the figure so that it is rendered
g.fig.suptitle('Proportion of female laureates per decade')
plt.show()

# Filter the dataframe to female winners
first_woman = df[df['female_winner']]
first_woman.head()
# Keep the row(s) from the earliest year in which a woman won
min_woman = first_woman[first_woman['year'] == first_woman['year'].min()]
min_woman.head()
# Save the answers as strings
first_woman_name = min_woman['full_name'].values[0]
first_woman_category = min_woman['category'].values[0]
print(first_woman_name)
print(first_woman_category)

# Count how many times each full_name appears
repeated_awardees = df['full_name'].value_counts()
repeated_awardees.head(10)
# Save the names that appear at least twice in a list
repeat_list = repeated_awardees[repeated_awardees >= 2].index.tolist()
print(repeat_list)
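To see how filtering the result of value_counts() by a threshold isolates repeated names without knowing in advance how many there are, here is a minimal sketch on made-up names (`toy_names` and its values are hypothetical, not real laureates):

```python
import pandas as pd

# Hypothetical names for illustration only
toy_names = pd.Series(["A", "B", "A", "C", "B", "A", "D"], name="full_name")

# value_counts() returns one count per distinct name, largest first
counts = toy_names.value_counts()

# Keep only the names that occur at least twice
toy_repeats = counts[counts >= 2].index.tolist()
print(toy_repeats)
```

Only "A" (three occurrences) and "B" (two occurrences) pass the filter, while "C" and "D" are dropped.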