
PART 1: Understanding Dinosaur Fossil Data

Key Findings

Here are some of my findings from working on the dinosaurs dataset:

  1. There are 1042 unique dinosaur names in the dataset.
  2. Supersaurus and Argentinosaurus were the largest dinosaurs, with a length of 35 meters each.
  3. The dataset has 4951 data points, and only the following columns have missing values:
     • family - 1457 missing values
     • length_m - 1383 missing values
     • diet - 1355 missing values
     • type - 1355 missing values
     • region - 42 missing values
  4. The most frequently occurring dinosaur type was the ornithopod.
  5. Dinosaurs became bigger with time.
  6. Herbivorous dinosaurs are the most common in the dataset.
  7. Alberta is the leading region for dinosaur fossil discoveries in the dataset.
  8. Saurischia is the most frequent taxonomic class of dinosaurs discovered.

1.1 Background

You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to explore the fossil records to find some interesting insights, and advise the museum on the quality of the data.

1.2 Objectives

The objective of this research is to explore the fossil records, find insights and advise the museum on the quality of their data.

Below are some of the questions my analysis aims to answer:

  1. How many different dinosaur names are present in the data?
  2. Which was the largest dinosaur? What about missing data in the dataset?
  3. What dinosaur type has the most occurrences in this dataset?
  4. Did dinosaurs get bigger over time?
  5. Create an interactive map showing each record using AI.
  6. Any other insights found during the analysis.

1.3 Introduction

I am currently an intern at a National Museum where they recently created a database containing all dinosaur records of past field campaigns. I am tasked with the responsibility of exploring the dataset, deriving interesting insights, and advising the management on the quality of the data they have.

1.4 Data Description

The data originates from the Paleobiology Database and is enriched with data from Wikipedia. The following list describes the key variables of the data:

  1. occurence_no - The original occurrence number from the Paleobiology Database.
  2. name - The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil).
  3. diet - The main diet (omnivorous, carnivorous, herbivorous).
  4. type - The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur).
  5. length_m - The maximum length, from head to tail, in meters.
  6. max_ma - The age in which the first fossil records of the dinosaur were found, in million years.
  7. min_ma - The age in which the last fossil records of the dinosaur were found, in million years.
  8. region - The current region where the fossil record was found.
  9. lng - The longitude where the fossil record was found.
  10. lat - The latitude where the fossil record was found.
  11. class - The taxonomical class of the dinosaur (Saurischia or Ornithischia).
  12. family - The taxonomical family of the dinosaur (if known).
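
As a quick data-quality step, the column list above can be turned into a schema check. The following is a minimal sketch, not part of the original analysis: the helper `missing_columns`, the constant `EXPECTED_COLUMNS`, and the `toy` frame are hypothetical names introduced for illustration, and the column spellings are copied from the list above, so they may need adjusting to match the header of the actual CSV file.

```python
import pandas as pd

# Column names copied from the data description above (assumed spellings).
EXPECTED_COLUMNS = [
    "occurence_no", "name", "diet", "type", "length_m", "max_ma",
    "min_ma", "region", "lng", "lat", "class", "family",
]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    return [col for col in expected if col not in df.columns]


# Toy frame standing in for the real file (hypothetical values).
toy = pd.DataFrame(
    {"name": ["Supersaurus"], "length_m": [35.0], "diet": ["herbivorous"]}
)

# Lists every described column that the toy frame does not contain.
print(missing_columns(toy))
```

On the real data, calling `missing_columns(dinosaurs)` after loading the file should return an empty list if the header matches the description.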

1.5 Exploratory Data Analysis

To better understand the data, we begin by importing the relevant Python packages, loading the data into a DataFrame, and previewing it. We will also check for missing values and compute basic descriptive statistics.

# Import the pandas, NumPy, and Matplotlib packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')

# Preview the first five rows of the DataFrame
dinosaurs.head(5)

# Print the non-null count and data type of each column
dinosaurs.info()

# Compute descriptive statistics for the numerical columns
dinosaurs.describe()
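
The missing-value check mentioned above can also be made explicit with a per-column count. The following is a minimal sketch under stated assumptions: it uses a small hypothetical `toy` frame rather than the real file so that it runs on its own, and the names `missing_counts`, `missing_share`, and `summary` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dinosaurs DataFrame (hypothetical values).
toy = pd.DataFrame({
    "name": ["Supersaurus", "Argentinosaurus", "Edmontosaurus", "Tyrannosaurus"],
    "diet": ["herbivorous", "herbivorous", None, "carnivorous"],
    "length_m": [35.0, 35.0, np.nan, np.nan],
})

# Count missing values per column and keep only columns that have any.
missing_counts = toy.isna().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)

# Express the counts as a percentage of all rows.
missing_share = (missing_counts / len(toy) * 100).round(1)

summary = pd.DataFrame({"missing": missing_counts, "percent": missing_share})
print(summary)
```

Replacing `toy` with `dinosaurs` would give the same kind of summary for the full dataset.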

1.6 Main Analysis

Time now to delve deeper into our data.

1. How many different dinosaur names are present in the data?

There are 1042 unique dinosaur names in the dataset.

# Calculating the number of unique dinosaur names
dinosaurs_unique_names = dinosaurs['name'].nunique()

# Print the result
print(f"The number of unique dinosaur names in the dataset is: {dinosaurs_unique_names}")

2. Which was the largest dinosaur? What about missing data in the dataset?

2.1 Which was the largest dinosaur?

The two largest dinosaurs in our dataset, each with a length of 35 meters, are:

  1. Supersaurus
  2. Argentinosaurus

Supersaurus, whose name means "super lizard", is a genus of diplodocid sauropod dinosaur that lived in North America during the Late Jurassic period. It was a very large sauropod, with the largest specimens reaching 33–35 meters (108–115 ft) in length and weighing approximately 35–40 metric tons.

Argentinosaurus is a genus of giant sauropod dinosaur that lived during the Late Cretaceous period in what is now Argentina. It is among the largest known land animals of all time. The largest Argentinosaurus measured 30–35 meters (98–115 ft) long and weighed 65–80 metric tons.
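
A tie for the maximum length, like the one reported above, is easy to miss when only a single top row is inspected. The following is a minimal sketch of one way to return every name that shares the maximum `length_m`. It assumes a small hypothetical `toy` frame in place of the real data, and the variables `max_length` and `largest` are illustrative names, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dinosaurs DataFrame (hypothetical rows).
toy = pd.DataFrame({
    "name": ["Supersaurus", "Argentinosaurus", "Supersaurus",
             "Tyrannosaurus", "Coelophysis"],
    "length_m": [35.0, 35.0, 35.0, 12.0, np.nan],
})

# max() skips missing values by default, so NaN lengths are ignored.
max_length = toy["length_m"].max()

# Keep every distinct name whose length equals the maximum, so ties are preserved.
largest = sorted(toy.loc[toy["length_m"] == max_length, "name"].unique().tolist())

print(f"Maximum length: {max_length} m")
print(f"Largest dinosaurs: {largest}")
```

Filtering on equality with the maximum, instead of sorting and taking the first row, keeps all tied names. Records with a missing `length_m` cannot be ranked at all, which is worth bearing in mind given how many lengths are missing in the dataset.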