Everyone Can Learn Data Scholarship
# Run this cell to see the result (click on Run on the right, or Ctrl|CMD + Enter)
100 * 1.75 * 20
1️⃣ Part 1 (Python) - Dinosaur data
Background
You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.
# Import pandas, numpy, and plotly express
import pandas as pd
import numpy as np
import plotly.express as px
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
# Show dinosaurs
dinosaurs
# Find unique dinosaurs
unique_dinosaurs = dinosaurs["name"].unique()
# Print unique_dinosaurs
unique_dinosaurs
# Count missing values per column
missing_dinosaurs = dinosaurs.isnull().sum()
# Drop rows with any missing value; copy so later column assignments are safe
filter_dinosaurs = dinosaurs.dropna().copy()
# Find the largest dinosaur by length (max_ma is an age column, not a size)
largest_dinosaurs = filter_dinosaurs.loc[filter_dinosaurs['length_m'].idxmax()]
# Print the results
print(f"The largest dinosaur is {largest_dinosaurs['name']} with a length_m of {largest_dinosaurs['length_m']}")
print(largest_dinosaurs)
print("Missing data in the dataset:")
print(missing_dinosaurs)
# Dinosaur type with the most occurrences
filter_dinosaurs_counts = filter_dinosaurs['type'].value_counts().reset_index()
filter_dinosaurs_counts.columns = ['type', 'count']
import matplotlib.pyplot as plt
import numpy as np
# Visualization
fig, ax = plt.subplots()
ax.bar(filter_dinosaurs_counts['type'], filter_dinosaurs_counts['count'], color=plt.cm.viridis(np.linspace(0, 1, len(filter_dinosaurs_counts))))
ax.set_title('Number of Dinosaurs per Type')
ax.set_xlabel('Dinosaur Type')
ax.set_ylabel('Number of Dinosaurs')
plt.xticks(rotation=45)
plt.show()
# Calculate dinosaur age
filter_dinosaurs['age'] = filter_dinosaurs['max_ma'] - filter_dinosaurs['min_ma']
filter_dinosaurs['age']
import seaborn as sns
# ci=None turns off the confidence band around the fitted line
sns.regplot(data=filter_dinosaurs, x='age', y='length_m', ci=None, line_kws={"color": "red"})
plt.xlabel('Age (million years)')
plt.ylabel('Dinosaur Length (m)')
plt.title('Dinosaur Length vs. Age')
plt.show()
The scatter plot shows one point per dinosaur, with age on the x-axis and length on the y-axis. The points are concentrated at the lower age and length values.
The red trend line slopes downward, indicating a negative correlation between age and length: on average, younger dinosaurs tend to be slightly longer than older ones, although there is considerable variability in the data.
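To put a number on the trend described above, the correlation between the two columns can be computed directly. This is a minimal sketch, assuming a DataFrame with `age` and `length_m` columns like the one used for the plot. The `toy` rows are invented values for illustration only and are not taken from dinosaurs.csv.

```python
import pandas as pd

# Toy data for illustration only (NOT from dinosaurs.csv)
toy = pd.DataFrame({
    "age": [1.0, 2.0, 3.0, 4.0, 5.0],
    "length_m": [10.0, 8.0, 9.0, 5.0, 4.0],
})

# Pearson correlation: a negative value means length tends to fall as age rises
pearson_r = toy["age"].corr(toy["length_m"])

# Spearman rank correlation, computed as the Pearson correlation of the ranks;
# it is less sensitive to a few unusually long specimens
spearman_r = toy["age"].rank().corr(toy["length_m"].rank())

print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_r:.2f}")
```

On the real data, the same two lines would be applied to `filter_dinosaurs['age']` and `filter_dinosaurs['length_m']`. A value close to zero would suggest that the downward slope of the trend line is weak.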
import pandas as pd
import numpy as np
import plotly.express as px
# Create an interactive map
if 'lat' in filter_dinosaurs.columns and 'lng' in filter_dinosaurs.columns:
    fig = px.scatter_geo(filter_dinosaurs, lat='lat', lon='lng', hover_name='name', color='region',
                         title='Dinosaur Locations',
                         projection='natural earth')
    fig.show()
else:
    print("The dataset does not contain 'lat' and 'lng' columns.")
filter_dinosaurs['length_m'].describe()
How many different dinosaur names are present in the data?
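One way to answer this question is to count the distinct values in the `name` column. This is a minimal sketch, assuming the column is called `name` as in the earlier cells. The `toy` rows are invented records for illustration, not data from dinosaurs.csv.

```python
import pandas as pd

# Toy records for illustration only (NOT from dinosaurs.csv)
toy = pd.DataFrame({"name": ["Tyrannosaurus", "Triceratops", "Tyrannosaurus", None]})

# nunique() counts distinct values and ignores missing entries by default
n_names = toy["name"].nunique()

# value_counts() shows how often each name occurs
name_counts = toy["name"].value_counts()

print(n_names)
print(name_counts)
```

Applied to the full table, `dinosaurs['name'].nunique()` gives the count before rows with missing values are dropped, while `filter_dinosaurs['name'].nunique()` gives the count after. The two numbers can differ.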
2️⃣ Part 2 (SQL) - Understanding movie data
Background
You have just been hired by a large movie studio to perform data analysis. Your manager, an executive at the company, wants to make new movies that "recapture the magic of old Hollywood." So you've decided to look at the most successful films that came out before Titanic in 1997 to identify patterns and help generate ideas that could turn into future successful films.
SELECT *
FROM cinema.films
LIMIT 10