Dinosaur and Films Dataset Analysis
Part 1 (Python) - Dinosaur data
Background
You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.
The data
You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):
| Column name | Description |
|---|---|
| occurence_no | The original occurrence number from the Paleobiology Database. |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
| diet | The main diet (omnivorous, carnivorous, herbivorous). |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
| length_m | The maximum length, from head to tail, in meters. |
| max_ma | The age in which the first fossil records of the dinosaur were found, in millions of years. |
| min_ma | The age in which the last fossil records of the dinosaur were found, in millions of years. |
| region | The current region where the fossil record was found. |
| lng | The longitude where the fossil record was found. |
| lat | The latitude where the fossil record was found. |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
| family | The taxonomical family of the dinosaur (if known). |
The data was enriched with data from Wikipedia.
import numpy as np
import pandas as pd
# Load the data
df = pd.read_csv('data/dinosaurs.csv')
#1. How many different dinosaur names are present in the data? 1042
# nunique() ignores missing values by default, so no separate dropna() is needed
unique_dinosaur_names = df['name'].nunique()
print("Number of different dinosaur names:", unique_dinosaur_names)
The dataset contains 1,042 distinct dinosaur names. This count uses only the name column; no other columns of the data frame were cleaned first.
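As a small illustration of why the count does not need a separate cleaning step, the sketch below uses a toy stand-in for the name column (the values are made up for the example, not taken from the dataset) and compares `nunique()` with and without missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'name' column (hypothetical values, not from the dataset)
names = pd.Series(["Supersaurus", "Allosaurus", np.nan, "Allosaurus", "Stegosaurus"])

# By default, nunique() leaves missing values out of the count
distinct_default = names.nunique()

# With dropna=False, NaN is counted as one extra distinct value
distinct_with_nan = names.nunique(dropna=False)

print("Distinct names (NaN ignored):", distinct_default)
print("Distinct names (NaN counted):", distinct_with_nan)
```

If the two numbers differ on the real column, the difference only reflects the presence of missing names, not additional dinosaurs.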
#2. Which was the largest dinosaur? Supersaurus
# idxmax() returns the label of the first row with the maximum length
largest_dinosaur = df.loc[df['length_m'].idxmax()]
# What about missing data in the dataset?
# Count the missing values in each column
missing_data_summary = df.isnull().sum()
print("Largest Dinosaur:")
print(largest_dinosaur)
print("\nMissing Data Summary:")
print(missing_data_summary)
The largest dinosaur identified in the dataset is Supersaurus, a herbivorous sauropod. Detailed information about this dinosaur includes:
Length: 35.0 meters
Era: Lived approximately between 155.7 and 145.0 million years ago (Late Jurassic period)
Region: Fossils found in Colorado
Class: Saurischia
Family: Diplodocidae
This finding highlights the remarkable size of Supersaurus, which is among the largest dinosaurs ever discovered.
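Because `idxmax()` returns only the first row that holds the maximum, it can hide ties. The sketch below, on a toy frame with invented names, shows one way to list every name that shares the maximum length; whether the real dataset contains such ties is not checked here.

```python
import pandas as pd

# Toy data with a deliberate tie for the maximum length (hypothetical names)
toy = pd.DataFrame({
    "name": ["A-saurus", "B-saurus", "C-saurus", "D-saurus"],
    "length_m": [12.0, 35.0, 35.0, None],
})

# max() skips missing values by default
max_length = toy["length_m"].max()

# All names that reach the maximum length, not just the first one
longest = toy.loc[toy["length_m"] == max_length, "name"].unique().tolist()

# For comparison: idxmax() reports only the first matching row
first_only = toy.loc[toy["length_m"].idxmax(), "name"]

print("Maximum length:", max_length)
print("All names at the maximum:", longest)
print("idxmax() result:", first_only)
```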
Missing Data Summary
A review of the dataset reveals some gaps that could affect the analysis:
Diet: 1,355 records are missing dietary information.
Type: 1,355 records are missing the type classification.
Length: 1,383 records are missing length information.
Region: 42 records are missing region information.
Family: 1,457 records are missing family classification.
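Raw counts are easier to judge next to the share of rows they represent. A minimal sketch, on a toy frame with made-up values, of a summary table that shows both:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the dinosaur data (hypothetical values)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "diet": ["herbivorous", np.nan, np.nan, "carnivorous"],
    "length_m": [10.0, np.nan, 3.0, 7.5],
})

# isna().sum() counts missing values; isna().mean() gives the missing fraction
missing = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
}).sort_values("missing_count", ascending=False)

print(missing)
```

Applied to the real data frame, the same two expressions would put each missing count in proportion to the total number of records.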
Given these gaps, I will clean the data separately for each question, dropping only the rows that are missing the columns that question needs. Cleaning the entire CSV file at once would discard a significant number of rows.
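The difference between the two cleaning strategies can be sketched on a toy frame (invented values): `dropna(subset=[...])` removes only rows missing the listed columns, while a bare `dropna()` removes any row with a gap anywhere.

```python
import numpy as np
import pandas as pd

# Toy frame: 'type' is missing once, 'family' is missing twice (hypothetical values)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "type": ["sauropod", np.nan, "ornithopod", "sauropod"],
    "family": [np.nan, "X-idae", np.nan, "Y-idae"],
})

# Targeted cleaning: keep every row that has a 'type'
rows_subset = len(toy.dropna(subset=["type"]))

# Blanket cleaning: keep only rows with no missing value in any column
rows_all = len(toy.dropna())

print("Rows kept with dropna(subset=['type']):", rows_subset)
print("Rows kept with dropna():", rows_all)
```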
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the data
df = pd.read_csv('data/dinosaurs.csv')
# Clean the data: Drop rows with missing 'type' values
clean_df = df.dropna(subset=['type'])
# Group by 'type' and count occurrences
type_counts = clean_df['type'].value_counts()
# Create the bar plot
plt.figure(figsize=(12, 8))
sns.barplot(x=type_counts.index, y=type_counts.values, palette='viridis')
# Add labels and title
plt.xlabel('Dinosaur Type', fontsize=14)
plt.ylabel('Number of Occurrences', fontsize=14)
plt.title('Number of Dinosaurs per Type', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y')
# Display the plot
plt.show()
# Save the cleaned dataframe to a new CSV file
cleaned_file_path = 'data/dinosaurs_cleaned.csv'
clean_df.to_csv(cleaned_file_path, index=False)
Result:
During cleaning, I removed only the entries with null values in the "type" column. I chose this targeted approach rather than cleaning the entire CSV file, which would have discarded a significant number of rows.
The ornithopods are like the popular kids in school, showing up nearly 800 times in the fossils. Right behind them are the large theropods and small theropods, each strutting their stuff with over 700 appearances. The sauropods also make a grand entrance, making a statement with over 600 sightings. On the other hand, the ceratopsians and armored dinosaurs are the shy ones in the group, with the ceratopsians just above 400 and the armored dinosaurs slightly under 400 appearances.
It's clear from this lineup that the ornithopods and theropods were the life of the party back in their dinosaur days, stealing the spotlight as the most widespread and diverse groups around.
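To put the counts side by side with their share of all typed records, one option is `value_counts(normalize=True)`. This is only a sketch on a toy series with invented entries, not the notebook's actual output.

```python
import pandas as pd

# Toy stand-in for the 'type' column (hypothetical values)
types = pd.Series(["ornithopod", "ornithopod", "sauropod", "large theropod", None])

# value_counts() ignores missing values by default
counts = types.value_counts()

# normalize=True returns fractions of the non-missing total
shares = (types.value_counts(normalize=True) * 100).round(1)

summary = pd.DataFrame({"count": counts, "share_pct": shares})
print(summary)
```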
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
# Load the data
df = pd.read_csv('data/dinosaurs.csv')
# Calculate the midpoint of the age interval
df['mid_ma'] = (df['max_ma'] + df['min_ma']) / 2
df['duration'] = np.abs(df['max_ma'] - df['min_ma']) # Duration of existence
# Drop rows with missing values in 'mid_ma', 'max_ma', or 'type'
df.dropna(subset=['mid_ma', 'max_ma','type'], inplace=True)
# Drop rows with any remaining missing values
df.dropna(inplace=True)
# Create the plot
plt.figure(figsize=(12, 8))
ax = plt.gca()
# Count the occurrences of each dinosaur type
type_counts = df['type'].value_counts()
# Create a color palette based on the unique types
unique_types = df['type'].unique()
palette = sns.color_palette("tab10", len(unique_types))
type_colors = {type_name: palette[i] for i, type_name in enumerate(unique_types)}
# Plot each data point as an ellipse
for idx, row in df.iterrows():
    ellipse = mpatches.Ellipse((row['mid_ma'], row['length_m']),
                               width=row['duration'] / 2, height=.1,  # Adjust these values to stretch the marker
                               edgecolor=type_colors[row['type']], facecolor='none')
    ax.add_patch(ellipse)
# Customize the plot
plt.xlim(df['mid_ma'].max() + 10, df['mid_ma'].min() - 10) # Extend limits for better visualization
plt.ylim(0, max(df['length_m']) + 10)
plt.title('Dinosaur Length Over Time with Elliptical Markers')
plt.xlabel('Age Midpoint (Millions of Years Ago)')
plt.ylabel('Length (Meters)')
# Create a legend
handles = [mpatches.Patch(color=type_colors[type_name], label=type_name) for type_name in unique_types]
plt.legend(handles=handles, title="Dinosaur Type")
# Show the plot
plt.show()
#4. Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.
# Set up the plot
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(18, 8))
# Iterate over each dinosaur type
types = df['type'].unique()
for dinosaur_type in types:
    subset = df[df['type'] == dinosaur_type]
    # Plot the scatter points and the regression line
    sns.scatterplot(x='mid_ma', y='length_m', data=subset, label=dinosaur_type, ax=ax1)
    sns.regplot(x='mid_ma', y='length_m', data=subset, order=2, scatter=False, ax=ax1)
# Customize the scatter plot
ax1.set_title('Dinosaur Length Over Time by Type with Polynomial Regression')
ax1.set_xlabel('Midpoint Age (Millions of Years Ago)')
ax1.set_ylabel('Length (Meters)')
ax1.invert_xaxis() # Invert the x-axis so older ages (higher values) are on the left
ax1.legend()
# Plot the KDE for all fossils
sns.kdeplot(data=df['mid_ma'], bw_adjust=1, color='blue', label='All Fossils', ax=ax2)
# Customize the KDE plot
ax2.set_title('Density Plot of All Fossils Over Time')
ax2.set_xlabel('Midpoint Age (Millions of Years Ago)')
ax2.set_ylabel('Density')
ax2.invert_xaxis() # Invert the x-axis so older ages (higher values) are on the left
ax2.legend()
# Set the same x-axis limits for both subplots
min_x = df['mid_ma'].min()
max_x = df['mid_ma'].max()
ax1.set_xlim(max_x, min_x)
ax2.set_xlim(max_x, min_x)
plt.tight_layout()
plt.show()
I'm not a paleontologist, but I am a data scientist. According to the theory of evolution, all living organisms share a common origin. Going back hundreds of millions of years, we would expect early organisms to be simpler and smaller. As we move along the evolutionary tree, diversity increases, and with it the number of organisms.
The first chart shows the relationship between fossil age and length for the six types of dinosaurs that lived in North America. Lengths were initially small and then increased over time. Some types, such as the sauropods, reached massive sizes and then decreased in length over time, possibly due to the destruction of vegetation or because such large bodies could not adapt to changing conditions.
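A visual trend can be backed by a number. The sketch below fits a straight line per type with `np.polyfit`, which is a simpler, swapped-in summary than the second-order regression drawn in the chart. The data and the helper name `length_trend` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: one type grows over time, the other stays constant (hypothetical values)
toy = pd.DataFrame({
    "type": ["sauropod"] * 4 + ["ornithopod"] * 4,
    "mid_ma": [200.0, 180.0, 160.0, 140.0, 120.0, 100.0, 80.0, 70.0],
    "length_m": [8.0, 14.0, 20.0, 26.0, 6.0, 6.0, 6.0, 6.0],
})

def length_trend(group):
    """Linear slope of length, in meters per million years, with time running forward."""
    # Ages count backwards (larger = older), so negate them to make time run forward
    years_forward = -group["mid_ma"]
    slope, _intercept = np.polyfit(years_forward, group["length_m"], deg=1)
    return slope

# One slope per dinosaur type; a positive value means length increased over time
trends = {type_name: length_trend(group) for type_name, group in toy.groupby("type")}

for type_name, slope in trends.items():
    print(f"{type_name}: {slope:.3f} m per million years")
```

A single linear slope cannot capture a rise followed by a decline, so it complements the curved fit rather than replacing it.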
The second chart shows that the density of fossil records increases as time progresses toward the present.
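A density curve is smoothed, so it can help to look at plain counts per time window as well. The following sketch bins toy midpoint ages (made-up values) into 50-million-year windows with `pd.cut`; the bin edges are an arbitrary choice for the example, not geological period boundaries.

```python
import pandas as pd

# Toy midpoint ages in millions of years (hypothetical values)
mid_ma = pd.Series([230.0, 160.0, 150.0, 100.0, 80.0, 75.0, 70.0, 68.0])

# Arbitrary 50-million-year windows; each bin includes its upper edge
bins = [50, 100, 150, 200, 250]
labels = ["50-100", "100-150", "150-200", "200-250"]

binned = pd.cut(mid_ma, bins=bins, labels=labels)

# Count records per window, ordered from youngest to oldest window
counts = binned.value_counts().sort_index()
print(counts)
```

Counts like these describe how many fossil records fall in each window, which depends on preservation and sampling as well as on how many animals lived at the time.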
The logic indicates that there is no need to know every detail of the evolutionary tree to show that it exists; it is enough to construct the tree from the fossil record, as illustrated here.