Everyone Can Learn Data Scholarship by Mebarek

1️⃣ Part 1 (Python) - Dinosaur data 🦕

📖 Background

You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.

💾 The data

You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):

Column name	Description
occurrence_no	The original occurrence number from the Paleobiology Database.
name	The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil).
diet	The main diet (omnivorous, carnivorous, herbivorous).
type	The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur).
length_m	The maximum length, from head to tail, in meters.
max_ma	The age at which the first fossil records of the dinosaur were found, in million years.
min_ma	The age at which the last fossil records of the dinosaur were found, in million years.
region	The current region where the fossil record was found.
lng	The longitude where the fossil record was found.
lat	The latitude where the fossil record was found.
class	The taxonomical class of the dinosaur (Saurischia or Ornithischia).
family	The taxonomical family of the dinosaur (if known).

The data was enriched with data from Wikipedia.

# Import the pandas and numpy packages
import pandas as pd
import numpy as np
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')

# Preview the dataframe
dinosaurs

💪 Challenge I

Help your colleagues at the museum to gain insights on the fossil record data. Include:

How many different dinosaur names are present in the data?
Which was the largest dinosaur? What about missing data in the dataset?
What dinosaur type has the most occurrences in this dataset? Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type. Use the AI assistant to tweak your visualization (colors, labels, title...).
Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.
Use the AI assistant to create an interactive map showing each record.
Any other insights you found during your analysis?

# Count the number of unique dinosaur names
unique_names_count = dinosaurs['name'].nunique()

# Print the result
print("Number of different dinosaur names present in the data:", unique_names_count)

# Finding the largest dinosaur
largest_dinosaur = dinosaurs[dinosaurs['length_m'] == dinosaurs['length_m'].max()]
largest_dinosaur_name = largest_dinosaur['name'].iloc[0]
largest_dinosaur_length = largest_dinosaur['length_m'].iloc[0]

print("The largest dinosaur is:", largest_dinosaur_name)
print("Its length is:", largest_dinosaur_length, "meters")

# Checking for missing data
missing_data = dinosaurs.isnull().sum()
print("\nMissing data in the dataset:")
print(missing_data)

The largest dinosaur in the dataset is the Supersaurus, with a length of 35.0 meters.
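Because the dataset holds one row per fossil record, several rows (and possibly several names) can share the maximum length, and taking only the first matching row would hide that. The sketch below shows one way to list every name tied for the maximum. The `toy` frame and its dinosaur names are invented for illustration; with the real data one would use `dinosaurs` in its place.

```python
import pandas as pd

# Hypothetical miniature stand-in for the dinosaurs dataframe (invented values)
toy = pd.DataFrame({
    "name": ["Alphasaurus", "Betasaurus", "Alphasaurus", "Gammasaurus", "Deltasaurus"],
    "length_m": [35.0, 12.0, 35.0, None, 35.0],
})

# max() skips missing values by default
max_length = toy["length_m"].max()

# Boolean mask of every record that reaches the maximum length
is_largest = toy["length_m"] == max_length

# Distinct names among those records, in order of first appearance
largest_names = toy.loc[is_largest, "name"].unique().tolist()
n_records = int(is_largest.sum())

print("Maximum length:", max_length, "meters")
print("Names at that length:", largest_names)
print("Number of records at that length:", n_records)
```

If the list contains a single name, reporting one largest dinosaur is safe; otherwise the tie is worth mentioning.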

Regarding missing data in the dataset:

There are no missing values in the occurrence_no, name, max_ma, min_ma, lng, lat, and class columns.

diet, type, length_m, region, and family columns have missing values.

diet, type, and family have 1355 missing values each.

length_m has 1383 missing values.

region has 42 missing values.
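To put these counts in proportion, it may help to report the share of missing values per column as well. The sketch below is one possible way to do that. The `toy` frame and the `missing_summary` helper are hypothetical and not part of the original notebook; with the real data one would call `missing_summary(dinosaurs)`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the dinosaurs dataframe (invented values)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "diet": ["herbivorous", np.nan, "carnivorous", np.nan],
    "length_m": [12.0, np.nan, 3.0, np.nan],
    "region": ["X", "Y", np.nan, "Z"],
})

def missing_summary(df):
    """Return the missing count and percentage per column, largest first."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),
        # The mean of a boolean mask is the fraction of missing rows
        "percent": (df.isnull().mean() * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```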

import matplotlib.pyplot as plt
import seaborn as sns

# Count occurrences of each dinosaur type
dinosaur_type_counts = dinosaurs['type'].value_counts()

# Create the horizontal bar chart with Seaborn
sns.barplot(
    x=dinosaur_type_counts.values,  
    y=dinosaur_type_counts.index, 
    orient='h',                   
    color='skyblue',
)
# Add title, labels, and formatting
plt.title('Number of Dinosaurs per Type')
plt.xlabel('Number of Occurrences')
plt.ylabel('Dinosaur Type')
plt.xticks(fontsize=12)
plt.tight_layout()
plt.show()

The bar chart shows the number of fossil records per dinosaur type. Ornithopods have the most occurrences (about 811), followed by large theropods (about 733) and small theropods (about 717). Sauropods account for about 665 occurrences, while ceratopsians and armored dinosaurs have about 363 and 307, respectively.
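Note that `value_counts()` drops missing values by default, so records without a `type` do not appear in the chart. A minimal sketch of a table that keeps them, under an assumed "unknown" label, could look like this. The `types` series is invented for illustration; with the real data one would start from `dinosaurs['type']`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dinosaurs['type'] (invented values)
types = pd.Series(
    ["ornithopod", "sauropod", np.nan, "ornithopod", np.nan, "ceratopsian"],
    name="type",
)

# Label missing types explicitly so they are counted as their own group
counts = types.fillna("unknown").value_counts()

# Share of all records that each group represents, in percent
shares = (counts / counts.sum() * 100).round(1)

table = pd.DataFrame({"records": counts, "percent": shares})
print(table)
```

Showing the "unknown" group makes it clear how much of the dataset the per-type comparison actually covers.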

# Filter out rows with missing values in 'length_m' and 'max_ma' columns
valid_data = dinosaurs.dropna(subset=['length_m', 'max_ma'])

# Create the scatter plot with Seaborn
sns.scatterplot(
    x='max_ma',
    y='length_m',
    data=valid_data,
    hue='type',  # Color points by dinosaur type (optional)
    palette='viridis',  # Color scheme for different types (optional)
    alpha=0.7
)
plt.title('Relation between Dinosaur Length and Age')
plt.xlabel('Age (million years)')
plt.ylabel('Length (meters)')
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.tight_layout()
plt.show()

The scatter plot doesn't show a clear correlation between dinosaur length and age. Here, age means how long ago the dinosaur lived (max_ma, in million years), not how long an individual or a species lived. Both long and short dinosaurs appear among the oldest records and among the youngest ones.

There are a few plausible reasons why the correlation is weak. Dinosaurs came in many shapes and sizes: some species were gigantic, while others were very small. This size variation likely reflects a variety of factors, including diet, habitat, and evolutionary history, so geological age is only one of several factors related to size. In addition, length_m is missing for 1383 records, so the plot covers only part of the dataset. Overall, the scatter plot suggests that dinosaurs did not simply get bigger over time.
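The visual impression can be backed by a number. The sketch below computes a Spearman rank correlation (as the Pearson correlation of the ranks) and the median length per geological period. The `toy` frame is invented for illustration, and the period boundaries (66, 145, 201, and 252 million years ago) are rounded values chosen here as an assumption, not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature stand-in for the dinosaurs dataframe (invented values)
toy = pd.DataFrame({
    "max_ma": [230.0, 200.0, 160.0, 150.0, 100.0, 70.0, 68.0, None],
    "length_m": [2.0, 6.0, 25.0, 1.5, 12.0, 3.0, 12.0, 9.0],
})
valid = toy.dropna(subset=["max_ma", "length_m"])

# Spearman rank correlation, computed as Pearson correlation of the ranks.
# A positive value means older records tend to be longer; a negative one
# means younger records tend to be longer.
rho = valid["max_ma"].rank().corr(valid["length_m"].rank())

# Assumed, rounded period boundaries in million years ago
bins = [66, 145, 201, 252]
labels = ["Cretaceous", "Jurassic", "Triassic"]
valid = valid.assign(period=pd.cut(valid["max_ma"], bins=bins, labels=labels))

# Median length per period; the median is less sensitive to a few giants
median_by_period = valid.groupby("period", observed=True)["length_m"].median()

print("Spearman rank correlation:", round(rho, 2))
print(median_by_period)
```

A correlation close to zero on the real data would support the reading of the scatter plot, while the per-period medians show whether typical sizes shifted even without a strong overall trend.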

import folium

# Create a map centered at the mean latitude and longitude of the dinosaur records
mean_lat = dinosaurs['lat'].mean()
mean_lng = dinosaurs['lng'].mean()
mymap = folium.Map(location=[mean_lat, mean_lng], zoom_start=4)

# Add a marker for each dinosaur record
for index, row in dinosaurs.iterrows():
    folium.Marker(location=[row['lat'], row['lng']], popup=row['name']).add_to(mymap)

# Save the map as an HTML file
mymap.save('dinosaur_records_map.html')

mymap