Hop'N a History Tour: 🦕 & 🎥

🌏 Background of our tour

Welcome to the historic tour of dinosaurs and movies!

In the first part of our tour, we will dig deeper into the fossil records and uncover some cool insights. Then, we will step into the fascinating world of movie history.

Ready for adventure? Hop in!

1️⃣ 🐍: Mysterious World of Dinosaurs 🦕

📖 Purpose

Dive into the fossil records to find interesting insights, and advise the museum on the quality of the data.

💾 The data

You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):

Column name	Description
occurence_no	The original occurrence number from the Paleobiology Database.
name	The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil).
diet	The main diet (omnivorous, carnivorous, herbivorous).
type	The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur).
length_m	The maximum length, from head to tail, in meters.
max_ma	The age in which the first fossil records of the dinosaur were found, in million years.
min_ma	The age in which the last fossil records of the dinosaur were found, in million years.
region	The current region where the fossil record was found.
lng	The longitude where the fossil record was found.
lat	The latitude where the fossil record was found.
class	The taxonomical class of the dinosaur (Saurischia or Ornithischia).
family	The taxonomical family of the dinosaur (if known).

The data was enriched with data from Wikipedia.

# Import the pandas and numpy packages
import pandas as pd
import numpy as np

# Import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')

# Preview the dataframe
dinosaurs

Question 1: How many different dinosaur names are present in the data?

There are 1042 different dinosaur names in the dataset.

dinosaurs['name'].nunique()

Question 2: Which was the largest dinosaur? What about missing data in the dataset?

This is a tricky question. If "largest dinosaur" is taken to mean the longest, which is what the length_m column measures, Supersaurus and Argentinosaurus are the winners in the data at hand. However, 754 named dinosaurs have no length information, which is 72% (754/1042) of the unique dinosaur names. Before using this column for any decision making or forecasting, I would gather more data from other sources. If that is not possible, I would use a KNN imputer to fill the missing values.
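As a rough illustration of the KNN imputation idea, here is a minimal sketch using scikit-learn's KNNImputer on a toy table. The rows are made up, and the choice of max_ma and min_ma as neighbor features and n_neighbors=2 are assumptions for the example, not something tested on the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric frame with made-up values (NOT the real dinosaur records).
# Assumption: max_ma and min_ma are used as the features that define "neighbors".
toy = pd.DataFrame({
    "length_m": [2.0, np.nan, 30.0, 28.0, 2.5],
    "max_ma": [70.0, 71.0, 150.0, 152.0, 69.0],
    "min_ma": [66.0, 66.0, 145.0, 145.0, 66.0],
})

# Each missing length is replaced by the mean length of the 2 nearest rows,
# where distance is computed on the columns that are present in both rows.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

On the real data, the numeric columns would likely need scaling first, since age in million years and coordinates sit on very different ranges.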

# Inspect whether there is any missing data, especially in the length_m column
dinosaurs.info()

# Keep only the records with no length information
dinosaurs1 = dinosaurs[dinosaurs['length_m'].isnull()]

# Count how many records each dinosaur name has among the rows with missing length
pd.crosstab(index=dinosaurs1['name'], columns='count')

# What is the maximum length (length_m), excluding the missing data? 35
dinosaurs.describe()
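The share of names without a length and the names tied for the maximum length can also be computed directly rather than read off the tables. The sketch below uses a made-up toy frame and a hypothetical helper, summarize_lengths; it assumes a name counts as missing only when none of its records carries a length.

```python
import numpy as np
import pandas as pd

def summarize_lengths(df):
    """Summarize length_m coverage per unique name (hypothetical helper)."""
    # A name has a length if at least one of its records is non-missing
    has_length = df.groupby("name")["length_m"].apply(lambda s: s.notna().any())
    n_names = int(has_length.size)
    n_missing = int((~has_length).sum())
    # Maximum recorded length and every name that reaches it
    max_length = df["length_m"].max()
    longest = sorted(df.loc[df["length_m"] == max_length, "name"].unique())
    return {
        "n_names": n_names,
        "n_missing": n_missing,
        "share_missing": n_missing / n_names,
        "max_length": max_length,
        "longest": longest,
    }

# Toy stand-in with made-up rows (NOT the real dataset)
toy = pd.DataFrame({
    "name": ["A", "A", "B", "C", "C", "D"],
    "length_m": [35.0, 35.0, np.nan, 12.0, 12.0, np.nan],
})

summary = summarize_lengths(toy)
print(summary)
```

Calling summarize_lengths(dinosaurs) on the loaded dataframe would give the counts behind the 754/1042 figure, provided the same definition of "missing" is used.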

