Hop'N a History Tour: Dinosaurs & Movies
Background of our tour
Welcome to the historic tour of dinosaurs and movies!
In the first part of our tour, we will dig deeper into the fossil records and uncover some cool insights. Then, we will step into the fascinating world of movie history.
Ready for adventure? Hop in!
1. Mysterious World of Dinosaurs
Purpose
To dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.
The data
You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):
| Column name | Description |
|---|---|
| occurence_no | The original occurrence number from the Paleobiology Database. |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
| diet | The main diet (omnivorous, carnivorous, herbivorous). |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
| length_m | The maximum length, from head to tail, in meters. |
| max_ma | The age in which the first fossil records of the dinosaur were found, in million years. |
| min_ma | The age in which the last fossil records of the dinosaur were found, in million years. |
| region | The current region where the fossil record was found. |
| lng | The longitude where the fossil record was found. |
| lat | The latitude where the fossil record was found. |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
| family | The taxonomical family of the dinosaur (if known). |
The data was enriched with data from Wikipedia.
# Import the pandas and numpy packages
import pandas as pd
import numpy as np
# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
# Preview the dataframe
dinosaurs
Question 1: How many different dinosaur names are present in the data?
There are 1042 different dinosaur names in the dataset.
dinosaurs['name'].nunique()
Question 2: Which was the largest dinosaur? What about missing data in the dataset?
This is a tricky question. If we take the data at hand and read "largest" as longest (the dataset records length_m, not height or mass), Supersaurus and Argentinosaurus are the winners. However, 754 of the 1042 uniquely named dinosaurs have no length information, which is 72% (754/1042) of the names. Before using this column for any decision making or forecasting, I would gather more data from other sources. If that is not possible, I would use a KNN imputer to fill in the missing values.
# Inspect whether there is any missing data, especially in the length_m column
dinosaurs.info()
# Investigate missing information
dinosaurs1 = dinosaurs[dinosaurs['length_m'].isnull()]
# Find the dinosaur names with missing length and the associated record counts
pd.crosstab(index=dinosaurs1['name'], columns='count')
# What is the maximum length excluding the missing data? 35
dinosaurs.describe()
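The two numbers quoted in the answer (the share of names without a length, and the names tied for the maximum length) could be checked with a short sketch like the one below. It runs on a small made-up dataframe called `toy`, which is a hypothetical stand-in for `dinosaurs`, and it assumes a name counts as "missing" only when none of its records has a `length_m` value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dinosaurs dataframe (made-up rows, not real records)
toy = pd.DataFrame({
    "name": ["Alpha", "Alpha", "Beta", "Gamma", "Delta"],
    "length_m": [12.0, 12.0, np.nan, 35.0, 35.0],
})

# One value per name: the largest recorded length (NaN if no record has one)
length_by_name = toy.groupby("name")["length_m"].max()

# Share of unique names with no length information at all
n_names = length_by_name.size
n_missing = int(length_by_name.isna().sum())
share_missing = n_missing / n_names

# All names tied for the maximum recorded length
longest = length_by_name[length_by_name == length_by_name.max()].index.tolist()

print(f"{n_missing} of {n_names} names have no length ({share_missing:.0%})")
print("Longest:", longest)
```

Swapping `toy` for `dinosaurs` should give the equivalent figures for the real data, provided the assumption about what counts as missing matches the one used in the answer.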
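As a rough illustration of the KNN imputation fallback mentioned in the answer, here is a minimal sketch using scikit-learn's `KNNImputer`. The rows are made up, and the choice of `max_ma` and `min_ma` as neighbor features and of `n_neighbors=2` are my assumptions for the example, not a tuned setup for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up rows for illustration only (not real fossil records)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "max_ma": [150.0, 152.0, 70.0, 151.0],
    "min_ma": [145.0, 148.0, 66.0, 146.0],
    "length_m": [30.0, 34.0, 2.0, np.nan],
})

# Assumption: the age columns are used to find "similar" dinosaurs
features = ["max_ma", "min_ma", "length_m"]

# Each missing length is replaced by the mean length of the 2 nearest rows,
# with distance measured on the features that are present in both rows
imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(toy[features])

# Keep the original column and store the filled values next to it
toy["length_m_imputed"] = imputed[:, features.index("length_m")]
print(toy)
```

With 72% of the names lacking a length, imputed values would rest on a small minority of observed ones, so it would be worth comparing the imputed distribution against the observed one before relying on it. On real data the features would also need scaling, since KNN distances are sensitive to units.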