Skip to content
1 hidden cell
1 hidden cell
Zoobia's notebook.ipynb
Everyone Can Learn Data Scholarship
1️⃣ Part 1 (Python) - Dinosaur data 🦕
📖 Background
You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.
1 hidden cell
# Import the pandas and numpy packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
# Count different dinosaur names
num_dinosaur_names = dinosaurs['name'].nunique()
print(f"Number of different dinosaur names: {num_dinosaur_names}")
# Handle missing data
df_cleaned = dinosaurs.dropna(subset=['length_m'])
# Identify the largest dinosaur
largest_dinosaur = df_cleaned.loc[df_cleaned['length_m'].idxmax()]
print(f"The largest dinosaur is {largest_dinosaur['name']} with a length of {largest_dinosaur['length_m']} meters.")
# Count the occurrences of each dinosaur type
dinosaur_type_counts = df_cleaned['type'].value_counts()
# Create a bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=dinosaur_type_counts.index, y=dinosaur_type_counts.values, palette='viridis')
plt.title('Number of Dinosaurs per Type')
plt.xlabel('Dinosaur Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
# Rename columns to match your description if necessary
dinosaurs.rename(columns={
'length_m': 'length',
'max_ma': 'max_age',
'min_ma': 'min_age'
}, inplace=True)
# Calculate the average age if 'age' column doesn't exist
dinosaurs['age'] = (dinosaurs['max_age'] + dinosaurs['min_age']) / 2
# Verify the columns
print("Columns in the DataFrame:", dinosaurs.columns)
# Create a scatter plot for dinosaur length vs. age
plt.figure(figsize=(10, 6))
sns.scatterplot(x='age', y='length', data=dinosaurs, hue='type', palette='viridis', s=100)
plt.title('Dinosaur Length vs. Age')
plt.xlabel('Age (millions of years ago)')
plt.ylabel('Length (meters)')
plt.show()
# Create a map centered at the average location
map_center = [dinosaurs['lat'].mean(), dinosaurs['lng'].mean()]
dino_map = folium.Map(location=map_center, zoom_start=2)
# Add points for each dinosaur
for _, row in dinosaurs.iterrows():
folium.Marker(
location=[row['lat'], row['lng']],
popup=f"{row['name']} ({row['class']}), Age: {row['age']} million years",
tooltip=row['name']
).add_to(dino_map)
# Display the map
display(dino_map)
# Count occurrences by dinosaur type and age
type_age_distribution = dinosaurs.groupby(['type', 'age']).size().unstack(fill_value=0)
print(type_age_distribution)# Preview the dataframe
dinosaurs💪 Challenge I
ACCEPTED!!!
- How many different dinosaur names are present in the data.
- ans. 1042 different dinosaurs names are present in the data.
- Which was the largest dinosaur? What about missing data in the dataset?
- ans. The largest dinosaur is Supersaurus with a length of 35.0 meters. Drop rows for missing data in this dataset.
- What dinosaur type has the most occurrences in this dataset? Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type. Use the AI assistant to tweak your visualization (colors, labels, title...).
- ans. Ornithopod has the most occurences in this dataset.
- Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.
- ans. No! Dinosaurs got smaller over time. Although 150 million years ago the dinosaurs were the largest known species to walk on earth.
- Use the AI assitant to create an interactive map showing each record.
- ans. DONE!
- Any other insights you found during your analysis?
- ans. The last table can be used to analyze the distribution and prevalence of different dinosaur types over geological time periods.
2️⃣ Part 2 (SQL) - Understanding movie data 🎥
📖 Background
You have just been hired by a large movie studio to perform data analysis. Your manager, an executive at the company, wants to make new movies that "recapture the magic of old Hollywood." So you've decided to look at the most successful films that came out before Titanic in 1997 to identify patterns and help generate ideas that could turn into future successful films.
1 hidden cell
DataFrameas
df
variable
SELECT *
FROM cinema.films
LIMIT 10;
SELECT COUNT(*) AS num_movies
FROM cinema.films;
SELECT COUNT(*) AS num_missing_rows
FROM cinema.films
WHERE gross IS NULL OR budget IS NULL;
SELECT COUNT(DISTINCT certification) AS num_certifications
FROM cinema.films;
SELECT country, COUNT(*) AS num_movies_produced
FROM cinema.films
GROUP BY country
ORDER BY num_movies_produced DESC
LIMIT 5;
SELECT language, AVG(duration) AS avg_duration
FROM cinema.films
WHERE language IN ('English', 'French')
GROUP BY language;💪 Challenge II
ACCEPTED!!!
- How many movies are present in the database.
- ans. 4968 movies are present in the database.
- There seems to be a lot of missing data in the gross and budget columns. How many rows have missing data? What would you recommend your manager to do with these rows?
- ans. 1076 rows have missing data. I could recommend including imputation, removal of rows, or further investigation depending on the importance and context of the missing data.
- How many different certifications or ratings are present in the database?
- ans. _13 _ different certifications or ratings are present in the database.
- What are the top five countries in terms of number of movies produced?
- ans. USA has produced 3750 movies, UK has produced 443 movies, France has produced 153 movies, Canada has produced 123 movies and Germany has produced 97 movies.
- What is the average duration of English versus French movies? (Don't forget you can use the AI assistant!)
- ans. French movies have an average duration of 104.6666666667 while English movies have an average duration of 107.6365405405.
- Any other insights you found during your analysis?
- ans. We can also investigate the distribution of movie genres.