Insights from Dinosaur Fossil Records (Python) & Movie Data (SQL)
Everyone Can Learn Data Scholarship
1️⃣ Part 1 (Python) - Dinosaur data 🦕
📖 Background
You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.
💾 The data
You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):
Column name | Description |
---|---|
occurrence_no | The original occurrence number from the Paleobiology Database. |
name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
diet | The main diet (omnivorous, carnivorous, herbivorous). |
type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
length_m | The maximum length, from head to tail, in meters. |
max_ma | The age in which the first fossil records of the dinosaur were found, in million years. |
min_ma | The age in which the last fossil records of the dinosaur were found, in million years. |
region | The current region where the fossil record was found. |
lng | The longitude where the fossil record was found. |
lat | The latitude where the fossil record was found. |
class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
family | The taxonomical family of the dinosaur (if known). |
The data was enriched with data from Wikipedia.
# Import the data handling (pandas, numpy) and plotting (matplotlib, seaborn, missingno) packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
# Double-check that every occurrence number is unique
dinosaurs['occurrence_no'].is_unique
# Preview the dataframe
dinosaurs.head()
dinosaurs.info()
dinosaurs.describe()
1. How many different dinosaur names are present in the data?
There are 1042 different dinosaur names in the data.
len(dinosaurs['name'].unique())
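One detail worth checking when counting names this way: `unique()` keeps a missing value as its own entry, while `nunique()` skips it by default. The sketch below uses a small made-up frame (the names and the `toy` variable are illustrative, not the museum data) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the museum dataset
toy = pd.DataFrame({"name": ["Allosaurus", "Allosaurus", "Stegosaurus", np.nan]})

# unique() returns NaN as a distinct entry, so it is included in the length
count_with_nan = len(toy["name"].unique())

# nunique() drops NaN by default (dropna=True)
count_without_nan = toy["name"].nunique()

print(count_with_nan, count_without_nan)
```

If the `name` column has no missing values, both approaches give the same count.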
# get categorized Period/Epoch data from International Commission on Stratigraphy, https://stratigraphy.org/
age = pd.read_excel('age.xlsx')
age['numerical age (Ma)'] = age['numerical age (Ma)'].astype(str)
age['numerical age (Ma)'] = age['numerical age (Ma)'].str.replace('~','')
age['numerical age (Ma)'] = age['numerical age (Ma)'].str.split('±').str.get(0).str.strip().astype(float).round(2)
age['EP']= age['Epoch']+' '+age['Period']
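# Use string names for the aggregations; passing the builtins min/max is deprecated in recent pandas
age_group = age.groupby('EP')['numerical age (Ma)'].agg(['min', 'max']).sort_values('min').reset_index()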
age_group
# Categorize the max_ma with epoch
bins =[age_group.loc[len(age_group)-1,'max']]
for v in age_group.loc[:,'min']:
    bins.append(v)
bins = sorted(bins)
labels = age_group['EP'].values
dinosaurs['epoch'] = pd.cut(dinosaurs['max_ma'], bins= bins, labels=labels)
bins
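To make the binning step easier to follow, here is a minimal sketch of how `pd.cut` assigns labels from a sorted list of edges. The edges and labels below are illustrative placeholders, not the values read from `age.xlsx`, and `ages` is a made-up series.

```python
import pandas as pd

# Illustrative epoch boundaries in Ma (placeholders, not the ICS chart values)
edges = [66.0, 100.5, 145.0, 163.5]
labels = ["Late Cretaceous", "Early Cretaceous", "Late Jurassic"]

# Made-up ages to categorize
ages = pd.Series([70.0, 100.5, 150.0, 200.0])

# n + 1 sorted edges define n intervals; by default each interval is
# open on the left and closed on the right, e.g. (66.0, 100.5]
epochs = pd.cut(ages, bins=edges, labels=labels)
print(epochs)
```

Two things follow from the defaults: a value that sits exactly on an edge falls into the younger interval, and any value outside the outermost edges becomes NaN, so it is worth checking how many rows end up without an epoch.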
2. Which was the largest dinosaur? What about missing data in the dataset?
The largest dinosaur is Supersaurus, with a body length of up to 35.0 meters!
Around 30% of the values in the 'length_m' column are missing, so the gaps should be filled with either the mean or the median. Because the distribution of body length is right-skewed, the median is the preferred fill value.
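The reasoning above can be sketched on a small made-up frame. The names and lengths below are hypothetical and only mimic a right-skewed column with one very long specimen; they are not taken from the museum data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing length and one extreme value
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "length_m": [2.0, 3.0, np.nan, 4.0, 35.0],
})

# Name of the row holding the maximum length (idxmax skips NaN)
largest = toy.loc[toy["length_m"].idxmax(), "name"]

# Fraction of missing values in the column
missing_share = toy["length_m"].isna().mean()

# The mean is pulled up by the long right tail; the median is not
mean_length = toy["length_m"].mean()
median_length = toy["length_m"].median()

# Fill the gaps with the median
filled = toy["length_m"].fillna(median_length)

print(largest, missing_share, mean_length, median_length)
```

In this toy case the single 35 m value pulls the mean far above the typical length, while the median stays close to the bulk of the data, which is the motivation for median imputation on a skewed column.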
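# For each column, count the rows with a missing value whose name also appears
# in rows where that column is known (i.e. gaps that same-name records could fill)
for v in ['family','type','diet','length_m']:
    print(dinosaurs[dinosaurs[v].isna()]['name'].isin(dinosaurs[~dinosaurs[v].isna()]['name']).sum())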