
Part 1 (Python) - Dinosaur Fossil Records Analysis 📚🦕

Introduction 🌍

Dinosaurs have long fascinated people of all ages, captivating imaginations with their sheer size, diversity, and dominance over the Earth millions of years ago. The Paleobiology Database has amassed a rich collection of fossil records that offers a window into this ancient world. This report explores that dataset to describe the diversity, size, diet, and geographical distribution of dinosaurs. By doing so, we hope to provide the museum with information it can use to enhance its exhibits and educational programs.

Data Overview 📊

The dataset consists of 4,951 entries detailing dinosaur fossils from the Paleobiology Database, enriched with data from Wikipedia. The key attributes include:

| Column name | Description |
|---|---|
| occurence_no | The original occurrence number from the Paleobiology Database. |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
| diet | The main diet (omnivorous, carnivorous, herbivorous). |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
| length_m | The maximum length, from head to tail, in meters. |
| max_ma | The age at which the first fossil records of the dinosaur were found, in millions of years. |
| min_ma | The age at which the last fossil records of the dinosaur were found, in millions of years. |
| region | The current region where the fossil record was found. |
| lng | The longitude where the fossil record was found. |
| lat | The latitude where the fossil record was found. |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
| family | The taxonomical family of the dinosaur (if known). |


🔄 Loading the dataset

# Import pandas, NumPy, and Matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
dinosaurs = pd.read_csv('data/dinosaurs.csv')
print(dinosaurs)
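
Before deciding which columns count as critical, it can help to see how many values each column is missing. The sketch below is illustrative only: `missing_summary` is a hypothetical helper, and the sample rows are made up rather than taken from `data/dinosaurs.csv`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    missing = df.isna().sum()
    return pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(df) * 100).round(1),
    }).sort_values("missing", ascending=False)


# Hypothetical rows for illustration; not real fossil records.
sample = pd.DataFrame({
    "name": ["Genus A", "Genus B", "Genus C", "Genus D"],
    "length_m": [12.0, np.nan, np.nan, 3.5],
    "family": [None, "Family X", None, None],
})

summary = missing_summary(sample)
print(summary)
```

On the real data, the same helper could be called as `missing_summary(dinosaurs)`.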

Data Cleaning 🧹

Before diving into the analysis, we ensured the data's integrity by:

  • Dropping rows with missing critical values.
  • Removing duplicate entries.
  • Verifying data types and consistency.

This resulted in a cleaned dataset with 3,551 records.

# Drop rows with missing critical values
dinosaurs_cleaned = dinosaurs.dropna(subset=['name', 'type', 'length_m', 'lat', 'lng', 'max_ma', 'region'])

# Drop duplicate rows if any
dinosaurs_cleaned = dinosaurs_cleaned.drop_duplicates()

# Display basic information after cleaning
# (info() prints its summary directly and returns None, so it is not wrapped in print)
dinosaurs_cleaned.info()
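
The consistency step listed above is not shown in code, so here is one way it might look. This is a sketch under assumptions: the rules (valid latitude and longitude ranges, positive lengths, `max_ma` not smaller than `min_ma`) are plausible sanity checks rather than the checks actually used for this report, `consistency_report` is a hypothetical helper, and the sample rows are invented.

```python
import pandas as pd


def consistency_report(df: pd.DataFrame) -> dict:
    """Count rows that break simple sanity rules (assumed rules, not the report's)."""
    return {
        # Series.between is inclusive; missing values also count as out of range.
        "lat_out_of_range": int((~df["lat"].between(-90, 90)).sum()),
        "lng_out_of_range": int((~df["lng"].between(-180, 180)).sum()),
        "non_positive_length": int((df["length_m"] <= 0).sum()),
        # max_ma is the older bound, so it should not be smaller than min_ma.
        "max_ma_below_min_ma": int((df["max_ma"] < df["min_ma"]).sum()),
    }


# Hypothetical rows for illustration; the second row is deliberately inconsistent.
sample = pd.DataFrame({
    "lat": [45.0, 95.0],
    "lng": [-110.0, 20.0],
    "length_m": [12.0, 0.0],
    "max_ma": [150.0, 66.0],
    "min_ma": [145.0, 70.0],
})

report = consistency_report(sample)
print(report)
```

A report with all zeros on `dinosaurs_cleaned` would suggest the cleaned records pass these particular checks; it would not rule out other kinds of errors.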

Analysis and Insights 🔍

1. Dinosaur Diversity 🦖

Insight: There are 288 unique dinosaur names in the dataset.

# Number of unique dinosaur names
unique_dinosaurs = dinosaurs_cleaned['name'].nunique()
print(f'There are {unique_dinosaurs} different dinosaur names in the dataset.')
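
The overall count can also be broken down by dinosaur type, which may be more useful for planning exhibits. The following is a minimal sketch using `groupby` with `nunique`; the helper name `names_per_type` and the sample rows are hypothetical.

```python
import pandas as pd


def names_per_type(df: pd.DataFrame) -> pd.Series:
    """Count distinct names within each dinosaur type, largest first."""
    return df.groupby("type")["name"].nunique().sort_values(ascending=False)


# Hypothetical rows for illustration; "Genus A" appears twice on purpose.
sample = pd.DataFrame({
    "name": ["Genus A", "Genus A", "Genus B", "Genus C"],
    "type": ["sauropod", "sauropod", "sauropod", "ornithopod"],
})

per_type = names_per_type(sample)
print(per_type)
```

Repeated fossil records of the same name are counted once per type, which matches how `nunique` is used for the overall figure.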

2. The Largest Dinosaur 🦕

Insight: The largest dinosaur recorded, measured by length, is Supersaurus at 35.0 meters.

# Identify the largest dinosaur by length
largest_dinosaur = dinosaurs_cleaned.loc[dinosaurs_cleaned['length_m'].idxmax()]
print("The largest dinosaur is:")
print(largest_dinosaur)

Supersaurus, a massive herbivorous sauropod, exemplifies the awe-inspiring size that some dinosaurs achieved.
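
One caveat with `idxmax` is that it returns only the first row holding the maximum, so any other name sharing that length would go unnoticed. The sketch below shows one way to check for ties; `longest_names` is a hypothetical helper and the sample rows are invented, not drawn from the dataset.

```python
import pandas as pd


def longest_names(df: pd.DataFrame) -> list:
    """Return every distinct name whose length equals the maximum length."""
    max_length = df["length_m"].max()
    tied = df.loc[df["length_m"] == max_length, "name"]
    return sorted(tied.unique())


# Hypothetical rows for illustration; two names share the maximum on purpose.
sample = pd.DataFrame({
    "name": ["Genus A", "Genus B", "Genus B", "Genus C"],
    "length_m": [12.0, 30.0, 30.0, 30.0],
})

tied_names = longest_names(sample)
print(tied_names)
```

If `longest_names(dinosaurs_cleaned)` returns a single name, the insight above holds without qualification.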

3. Dinosaur Type Prevalence 📊

Insight: Ornithopods are the most common dinosaur type in the cleaned dataset, by number of fossil records.

# Count the occurrences of each dinosaur type
dinosaur_type_counts = dinosaurs_cleaned['type'].value_counts()

plt.figure(figsize=(10, 6))
dinosaur_type_counts.plot(kind='bar', color='green')
plt.title('Number of Dinosaurs per Type')
plt.xlabel('Dinosaur Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Understanding the prevalence of different dinosaur types helps the museum emphasize the variety of dinosaur adaptations and ecosystems.
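
Raw counts are easier to communicate as shares of the whole. As a small sketch, `value_counts(normalize=True)` gives each type's proportion of records; the helper name `type_shares` and the sample rows are hypothetical.

```python
import pandas as pd


def type_shares(df: pd.DataFrame) -> pd.Series:
    """Return each type's share of records as a percentage, largest first."""
    return (df["type"].value_counts(normalize=True) * 100).round(1)


# Hypothetical rows for illustration; not real fossil records.
sample = pd.DataFrame({
    "type": ["ornithopod", "ornithopod", "sauropod", "ceratopsian"],
})

shares = type_shares(sample)
print(shares)
```

These percentages describe fossil records in the dataset, not how common each type was when the animals were alive.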
