Part 1 (Python) - Dinosaur Fossil Records Analysis
Introduction
Dinosaurs have long fascinated people of all ages, captivating imaginations with their sheer size, diversity, and dominance over the Earth millions of years ago. The Paleobiology Database has amassed a rich collection of fossil records that offers a window into this ancient world. This report aims to uncover insights from this dataset on the diversity, size, diet, and geographical distribution of dinosaurs. By doing so, we hope to provide the museum with information it can use to enhance its exhibits and educational programs.
Data Overview
The dataset consists of 4,951 entries detailing various dinosaur fossils, enriched with data from Wikipedia (source). The key attributes include:
| Column name | Description |
|---|---|
| occurence_no | The original occurrence number from the Paleobiology Database. |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
| diet | The main diet (omnivorous, carnivorous, herbivorous). |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
| length_m | The maximum length, from head to tail, in meters. |
| max_ma | The age at which the first fossil records of the dinosaur were found, in millions of years. |
| min_ma | The age at which the last fossil records of the dinosaur were found, in millions of years. |
| region | The current region where the fossil record was found. |
| lng | The longitude where the fossil record was found. |
| lat | The latitude where the fossil record was found. |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
| family | The taxonomical family of the dinosaur (if known). |
Loading the dataset
# Import pandas, NumPy, and Matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load the dataset
dinosaurs = pd.read_csv('data/dinosaurs.csv')
print(dinosaurs)
Data Cleaning
Before diving into the analysis, we ensured the data's integrity by:
- Dropping rows with missing critical values.
- Removing duplicate entries.
- Verifying data types and consistency.
This resulted in a cleaned dataset with 3,551 records.
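The third step, verifying data types and consistency, could be sketched roughly as follows. This is a minimal illustration rather than the exact checks used for this report: the sample rows are hypothetical placeholders, and the helper `find_inconsistencies` and the specific rules (valid coordinate ranges, `max_ma` not smaller than `min_ma`, positive length) are assumptions about what "consistency" means for these columns.

```python
import pandas as pd

# Hypothetical placeholder rows that mimic the dataset's columns (not real records).
sample = pd.DataFrame({
    "name": ["dino_a", "dino_b", "dino_c", "dino_d"],
    "length_m": [35.0, 10.0, 2.0, 0.0],
    "max_ma": [155.0, 140.0, 70.0, 100.0],
    "min_ma": [145.0, 130.0, 84.0, 90.0],
    "lat": [38.5, 95.0, 43.5, -30.0],
    "lng": [-108.9, -1.3, 103.7, 25.0],
})

numeric_cols = ["length_m", "max_ma", "min_ma", "lat", "lng"]

# Data types: list any column that should be numeric but is not.
non_numeric = [c for c in numeric_cols
               if not pd.api.types.is_numeric_dtype(sample[c])]


def find_inconsistencies(df: pd.DataFrame) -> pd.DataFrame:
    """Return one boolean column per check; True marks a failing row."""
    return pd.DataFrame({
        "bad_lat": ~df["lat"].between(-90, 90),
        "bad_lng": ~df["lng"].between(-180, 180),
        # The first appearance (max_ma) should not be younger than the last (min_ma).
        "bad_age_order": df["max_ma"] < df["min_ma"],
        "bad_length": df["length_m"] <= 0,
    }, index=df.index)


flags = find_inconsistencies(sample)
failing = sample[flags.any(axis=1)]

print("Non-numeric columns:", non_numeric)
print(failing)
```

Rows flagged this way can be inspected before deciding whether to correct or drop them, which keeps the cleaning decisions visible instead of silently discarding data.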
# Drop rows with missing critical values
dinosaurs_cleaned = dinosaurs.dropna(subset=['name', 'type', 'length_m', 'lat', 'lng', 'max_ma', 'region'])
# Drop duplicate rows if any
dinosaurs_cleaned = dinosaurs_cleaned.drop_duplicates()
# Display basic information after cleaning
# info() prints its summary directly and returns None, so no print() is needed
dinosaurs_cleaned.info()
Analysis and Insights
1. Dinosaur Diversity
Insight: There are 288 unique dinosaur names in the dataset.
# Number of unique dinosaur names
unique_dinosaurs = dinosaurs_cleaned['name'].nunique()
print(f'There are {unique_dinosaurs} different dinosaur names in the dataset.')
2. The Largest Dinosaur
Insight: The largest dinosaur recorded is Supersaurus, with a length of 35.0 meters.
# Identify the largest dinosaur by length
largest_dinosaur = dinosaurs_cleaned.loc[dinosaurs_cleaned['length_m'].idxmax()]
print("The largest dinosaur is:")
print(largest_dinosaur)
Supersaurus, a massive herbivorous sauropod, exemplifies the awe-inspiring size that some dinosaurs achieved.
3. Dinosaur Type Prevalence
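One caveat: `idxmax()` returns only the first row that holds the maximum length, and the dataset contains many fossil records per name. A hedged sketch of a per-name ranking that also surfaces ties is shown below; the records are hypothetical placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical placeholder records (names and lengths are made up for illustration).
records = pd.DataFrame({
    "name": ["dino_a", "dino_a", "dino_b", "dino_c", "dino_d"],
    "length_m": [35.0, 35.0, 12.0, 30.0, 35.0],
})

# Collapse repeated fossil records to a single length per name.
length_by_name = records.groupby("name")["length_m"].max()

# The two longest names, longest first.
top_two = length_by_name.nlargest(2)

# Every name that shares the maximum length, so ties are not hidden.
longest = length_by_name[length_by_name == length_by_name.max()].index.tolist()

print(top_two)
print("Longest:", longest)
```

Applied to `dinosaurs_cleaned`, the same pattern would show whether Supersaurus stands alone at 35.0 meters or shares that length with another name.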
Insight: The most common dinosaur type is ornithopod.
# Count the occurrences of each dinosaur type
dinosaur_type_counts = dinosaurs_cleaned['type'].value_counts()
plt.figure(figsize=(10, 6))
dinosaur_type_counts.plot(kind='bar', color='green')
plt.title('Number of Fossil Records per Type')
plt.xlabel('Dinosaur Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Understanding the prevalence of different dinosaur types helps the museum emphasize the variety of dinosaur adaptations and ecosystems.
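The bar chart counts fossil records, so a type with a few heavily sampled dinosaurs can look more common than a type with many distinct names. As a rough sketch of how the two views could be compared, using hypothetical placeholder rows rather than the real data:

```python
import pandas as pd

# Hypothetical placeholder records (not taken from the dataset).
records = pd.DataFrame({
    "name": ["dino_a", "dino_a", "dino_a", "dino_b", "dino_c", "dino_d"],
    "type": ["ornithopod", "ornithopod", "ornithopod",
             "sauropod", "sauropod", "large theropod"],
})

# View 1: number of fossil records per type (what value_counts() measures).
records_per_type = records["type"].value_counts()
most_common_by_records = records_per_type.idxmax()

# View 2: number of distinct dinosaur names per type.
names_per_type = records.groupby("type")["name"].nunique()
most_common_by_names = names_per_type.idxmax()

print("Most records:", most_common_by_records)
print("Most distinct names:", most_common_by_names)
```

In this made-up sample the two answers differ, which is why it may be worth reporting both when describing a type as the most common.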