Blockbusters and Beasts: A Data-Driven Adventure through Dinosaurs and Movies

Unearthing Insights from the Past: A Deep Dive into Dinosaur Fossil Records 🦕

📖 Background

You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.

💾 The data

You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):

Column name	Description
occurrence_no	The original occurrence number from the Paleobiology Database.
name	The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil).
diet	The main diet (omnivorous, carnivorous, herbivorous).
type	The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur).
length_m	The maximum length, from head to tail, in meters.
max_ma	The age at which the first fossil records of the dinosaur were found, in million years.
min_ma	The age at which the last fossil records of the dinosaur were found, in million years.
region	The current region where the fossil record was found.
lng	The longitude where the fossil record was found.
lat	The latitude where the fossil record was found.
class	The taxonomical class of the dinosaur (Saurischia or Ornithischia).
family	The taxonomical family of the dinosaur (if known).

The data was enriched with data from Wikipedia.
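
Since part of the task is to advise on data quality, the column descriptions above suggest a few constraints worth checking: max_ma should not be smaller than min_ma, latitude and longitude should fall within valid ranges, lengths should be positive, and occurrence numbers should be unique. The following is a minimal sketch of such a check, not part of the original analysis. The function name `check_fossil_records`, the `id_col` parameter, and the small sample frame are hypothetical and only serve as illustration.

```python
import pandas as pd


def check_fossil_records(df: pd.DataFrame, id_col: str = "occurrence_no") -> pd.Series:
    """Count rows that violate basic constraints implied by the column descriptions.

    `id_col` is an assumption: adjust it if the CSV header is spelled differently.
    Missing values never count as violations here (comparisons with NaN are False).
    """
    issues = {
        # max_ma is the older age bound, so it should not be smaller than min_ma
        "max_ma < min_ma": (df["max_ma"] < df["min_ma"]).sum(),
        "lat outside [-90, 90]": (df["lat"].abs() > 90).sum(),
        "lng outside [-180, 180]": (df["lng"].abs() > 180).sum(),
        "length_m <= 0": (df["length_m"] <= 0).sum(),
        "duplicate " + id_col: df[id_col].duplicated().sum(),
    }
    return pd.Series(issues, name="n_rows")


# Hypothetical sample rows, made up purely to exercise each check once
sample = pd.DataFrame({
    "occurrence_no": [1, 2, 2, 4],
    "max_ma": [150.0, 70.0, 83.5, 66.0],
    "min_ma": [145.0, 72.1, 70.6, 66.0],
    "lat": [45.0, -95.0, 10.0, 0.0],
    "lng": [-110.0, 20.0, 181.0, 0.0],
    "length_m": [12.0, None, 0.0, 3.5],
})

report = check_fossil_records(sample)
print(report)
```

Once the real data is loaded, the same function could be called on the full dataframe; any non-zero count would be worth reporting back to the museum.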

# Import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
pio.templates.default = "ggplot2"
plt.style.use('ggplot')
# sns.set_style('whitegrid')

# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')

# Preview the dataframe
dinosaurs.head()

# Print dataset overview
dinosaurs.info()

The dataset comprises 4951 entries across 12 columns, of which 7 are complete (occurrence_no, name, max_ma, min_ma, lng, lat, and class). The diet, type, length_m, family, and region columns all contain missing values, so any analysis involving them needs imputation or another handling strategy. After a quick look at the dinosaur names, we will count the exact number of missing entries in each column.

# Unique dinosaur names in the dataset
dinosaurs['name'].unique()

# Number of unique dinosaur names in the dataset
dinosaurs['name'].nunique()

We have an impressive total of 1042 unique dinosaur names in our dataset. This abundance makes me curious about the diversity and complexity of the prehistoric world.

# Get missing data
dinosaurs.isna().sum()

We observe that there are over 1,000 missing entries in the diet, type, length_m, and family columns, with the region column having slightly fewer than 50 missing entries. The next step is to calculate the percentage of missing data in each of these columns.

# Percentage of missing values per column
(dinosaurs.isna().sum() / len(dinosaurs) * 100)
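
For the data-quality advice it can be handy to see counts and percentages side by side. The sketch below combines the two steps above into one table; the helper name `missing_summary` and the sample frame are hypothetical, used here only so the example runs on its own.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages for columns with at least one gap."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        # isna().mean() is the fraction of missing rows per column
        "pct_missing": (df.isna().mean() * 100).round(2),
    })
    # Keep only incomplete columns, most affected first
    incomplete = summary[summary["n_missing"] > 0]
    return incomplete.sort_values("pct_missing", ascending=False)


# Hypothetical sample data, not taken from the real dataset
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "diet": ["herbivorous", None, None, "carnivorous"],
    "length_m": [12.0, None, 3.0, 4.0],
})

summary = missing_summary(sample)
print(summary)
```

On the real data this would be called as `missing_summary(dinosaurs)`.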

The summary above shows that diet, type, length_m, and family are each missing in more than 27% of records, while region is missing in fewer than 1%. To address these gaps, I've opted for imputation as the preferred strategy.
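
The text does not say which imputation method is used, so the following is only one possible sketch under stated assumptions: categorical columns get an explicit "unknown" label, and length_m is filled with the median length of the same dinosaur type, falling back to the overall median. The function name `impute_dinosaurs`, the flag column `length_m_imputed`, and the sample rows are hypothetical.

```python
import pandas as pd


def impute_dinosaurs(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with missing values filled (assumed strategy, see lead-in)."""
    out = df.copy()

    # Remember which lengths were imputed so they can be excluded later if needed
    out["length_m_imputed"] = out["length_m"].isna()

    # Categorical columns: keep missingness visible instead of guessing a value
    for col in ["diet", "type", "family", "region"]:
        out[col] = out[col].fillna("unknown")

    # Numeric column: median within each type, then the overall median as fallback
    overall_median = out["length_m"].median()
    type_median = out.groupby("type")["length_m"].transform("median")
    out["length_m"] = out["length_m"].fillna(type_median).fillna(overall_median)
    return out


# Hypothetical sample rows, made up for illustration
sample = pd.DataFrame({
    "type": ["sauropod", "sauropod", "sauropod", None, "ornithopod"],
    "length_m": [20.0, 30.0, None, None, 8.0],
    "diet": ["herbivorous", None, "herbivorous", None, "herbivorous"],
    "family": [None, None, None, None, None],
    "region": ["USA", None, "USA", "China", "USA"],
})

imputed = impute_dinosaurs(sample)
print(imputed)
```

Filling more than a quarter of a column changes its distribution, so comparing results with and without the imputed rows (via the flag column) is a reasonable safeguard.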
Next, we will visualize the distribution of values in the length_m column to gain insights into the spread and patterns of dinosaur lengths.

