Unearthing Insights from the Past: A Deep Dive into Dinosaur Fossil Records 🦕
📖 Background
You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.
💾 The data
You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):
Column name | Description |
---|---|
occurrence_no | The original occurrence number from the Paleobiology Database. |
name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
diet | The main diet (omnivorous, carnivorous, herbivorous). |
type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
length_m | The maximum length, from head to tail, in meters. |
max_ma | The age in which the first fossil records of the dinosaur were found, in million years. |
min_ma | The age in which the last fossil records of the dinosaur were found, in million years. |
region | The current region where the fossil record was found. |
lng | The longitude where the fossil record was found. |
lat | The latitude where the fossil record was found. |
class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
family | The taxonomical family of the dinosaur (if known). |
The data was enriched with data from Wikipedia.
# Import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
pio.templates.default = "ggplot2"
plt.style.use('ggplot')
# sns.set_style('whitegrid')
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
# Preview the dataframe
dinosaurs.head()
# Print dataset overview
dinosaurs.info()
The dataset comprises 4951 entries across 12 columns, 7 of which are complete (`occurrence_no`, `name`, `max_ma`, `min_ma`, `lng`, `lat`, and `class`). The `diet`, `type`, `length_m`, `family`, and `region` columns have missing values, so analyses involving these variables will need imputation or another handling strategy. Next, we will find the exact number of missing entries in each column.
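As a cross-check on reading the `info()` output by eye, the split between complete and incomplete columns can also be derived directly. The sketch below runs on a small made-up frame (`toy` and its values are invented for illustration, not taken from the dataset); with the real data, `toy` would be replaced by `dinosaurs`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset; all values are invented.
toy = pd.DataFrame({
    "name": ["Allosaurus", "Stegosaurus", "Triceratops", "Velociraptor"],
    "diet": ["carnivorous", np.nan, "herbivorous", np.nan],
    "length_m": [12.0, 9.0, np.nan, np.nan],
    "max_ma": [157.3, 157.3, 70.6, 83.5],
})

# A column is complete when none of its values is missing.
is_complete = toy.notna().all()
complete_cols = is_complete[is_complete].index.tolist()
incomplete_cols = is_complete[~is_complete].index.tolist()

print("Complete:", complete_cols)
print("Incomplete:", incomplete_cols)
```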
# Unique dinosaur names in the dataset
dinosaurs['name'].unique()
# Number of unique dinosaur names in the dataset
dinosaurs['name'].nunique()
We have an impressive total of 1042 unique dinosaur names in our dataset. This abundance makes me curious about the diversity and complexity of the prehistoric world.
# Get missing data
dinosaurs.isna().sum()
There are over 1,000 missing entries in each of the `diet`, `type`, `length_m`, and `family` columns, while the `region` column has slightly fewer than 50. The next step is to calculate the percentage of missing data in each of these columns.
# Percentage of missing values per column
(dinosaurs.isna().sum() / len(dinosaurs) * 100)
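The counts and percentages can also be put side by side in one table, which makes the columns easier to compare. This is a minimal sketch on an invented frame (`toy`, `missing_summary`, and the column labels `missing` and `missing_pct` are illustrative names, not part of the dataset); on the real data, `toy` would be `dinosaurs`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset; all values are invented.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "diet": ["carnivorous", np.nan, "herbivorous", np.nan],
    "length_m": [12.0, 9.0, np.nan, 2.0],
    "region": ["Utah", "Colorado", "Montana", "Alberta"],
})

missing_count = toy.isna().sum()
# The mean of a boolean mask is the share of True values, i.e. the share missing.
missing_pct = toy.isna().mean() * 100

# Keep only columns with at least one missing value, worst first.
missing_summary = (
    pd.DataFrame({"missing": missing_count, "missing_pct": missing_pct.round(2)})
    .query("missing > 0")
    .sort_values("missing_pct", ascending=False)
)
print(missing_summary)
```

Using `isna().mean()` gives the same result as dividing the counts by `len(toy)`, just in one step.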
- From the summary above, it's evident that we have missing data in several key attributes: `diet`, `type`, `length_m`, and `family` are each missing in over 27% of records, while `region` is missing in less than 1% of cases. To address these missing values, I've opted for imputation as the preferred strategy.
- Next, we will visualize the distribution of values in the `length_m` column to gain insights into the spread and patterns of dinosaur lengths.
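The text above settles on imputation but does not say which method is used. One common option, shown here purely as an assumption, is to fill the numeric `length_m` column with its median and to give missing categorical values an explicit "Unknown" label. The frame `toy`, the flag column `length_m_imputed`, and the "Unknown" label are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset; all values are invented.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "diet": ["carnivorous", np.nan, "herbivorous", "herbivorous", np.nan],
    "type": ["large theropod", np.nan, "ceratopsian", "sauropod", np.nan],
    "length_m": [12.0, np.nan, 9.0, 25.0, np.nan],
})

# Work on a copy so the raw data stays available for comparison.
imputed = toy.copy()

# Record which lengths were filled in, so they can be excluded later if needed.
imputed["length_m_imputed"] = imputed["length_m"].isna()

# Numeric column: fill with the median, which is robust to very long outliers.
imputed["length_m"] = imputed["length_m"].fillna(imputed["length_m"].median())

# Categorical columns: use an explicit label rather than guessing a category.
for col in ["diet", "type"]:
    imputed[col] = imputed[col].fillna("Unknown")

print(imputed)
```

Median imputation concentrates values at one point, so a histogram of the imputed column will show an artificial spike there; plotting the observed values only (rows where `length_m_imputed` is `False`) avoids that distortion.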