Competition - Everyone Can Learn Data Scholarship - Bara Al Sedih workspace

Part 1 (Python) - Dinosaur Fossil Records Analysis 📚🦕

Introduction 🌍

Dinosaurs have long fascinated people of all ages, captivating imaginations with their sheer size, diversity, and dominance over the Earth millions of years ago. The Paleobiology Database has amassed a rich collection of fossil records that offers a window into this ancient world. This report draws insights from that dataset on the diversity, size, diet, and geographical distribution of dinosaurs, with the goal of giving the museum information it can use to enhance its exhibits and educational programs.

Data Overview 📊

The dataset consists of 4,951 entries detailing various dinosaur fossils, enriched with data from Wikipedia. The key attributes include:

Column name	Description
occurence_no	The original occurrence number from the Paleobiology Database.
name	The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil).
diet	The main diet (omnivorous, carnivorous, herbivorous).
type	The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur).
length_m	The maximum length, from head to tail, in meters.
max_ma	The age in which the first fossil records of the dinosaur were found, in millions of years.
min_ma	The age in which the last fossil records of the dinosaur were found, in millions of years.
region	The current region where the fossil record was found.
lng	The longitude where the fossil record was found.
lat	The latitude where the fossil record was found.
class	The taxonomical class of the dinosaur (Saurischia or Ornithischia).
family	The taxonomical family of the dinosaur (if known).


🔄 Loading the dataset

# Import the pandas, NumPy, and Matplotlib packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
dinosaurs = pd.read_csv('data/dinosaurs.csv')
print(dinosaurs)
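
Printing the whole frame shows only a truncated view, so it can help to summarize missing values per column before deciding which rows to drop. The following is a minimal sketch on a tiny made-up sample (the rows and the `sample` and `summary` names are illustrative assumptions, not taken from the real file); the same calls should apply to `dinosaurs`.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic a few of the dataset's columns (illustrative only)
sample = pd.DataFrame({
    "name": ["Supersaurus", "Iguanodon", None, "Velociraptor"],
    "type": ["sauropod", "ornithopod", "sauropod", None],
    "length_m": [35.0, np.nan, 12.0, 1.8],
})

# Count and percentage of missing values for each column
summary = pd.DataFrame({
    "missing": sample.isna().sum(),
    "missing_pct": (sample.isna().mean() * 100).round(1),
})

print(sample.shape)  # (rows, columns)
print(summary)
```

Columns with a high share of missing values are the ones most likely to shrink the dataset when passed to `dropna`.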

Data Cleaning 🧹

Before diving into the analysis, we ensured the data's integrity by:

Dropping rows with missing critical values.
Removing duplicate entries.
Verifying data types and consistency.

This resulted in a cleaned dataset with 3,551 records.

# Drop rows with missing critical values
dinosaurs_cleaned = dinosaurs.dropna(subset=['name', 'type', 'length_m', 'lat', 'lng', 'max_ma', 'region'])

# Drop duplicate rows if any
dinosaurs_cleaned = dinosaurs_cleaned.drop_duplicates()

# Display basic information after cleaning
# info() prints its summary directly and returns None, so no print() is needed
dinosaurs_cleaned.info()
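
The "verifying data types and consistency" step is not shown in code above, so here is one hedged sketch of what such a check could look like. The helper `consistency_report` and the sample rows are hypothetical, and the rules (numeric columns, `max_ma` not smaller than `min_ma`, positive lengths, valid coordinates) are assumptions about what "consistent" means for this dataset.

```python
import pandas as pd

# Made-up rows that mimic the dataset's numeric columns (illustrative only)
sample = pd.DataFrame({
    "name": ["Supersaurus", "Iguanodon", "Velociraptor"],
    "length_m": [35.0, 10.0, 1.8],
    "max_ma": [155.7, 130.0, 85.8],
    "min_ma": [145.0, 125.45, 70.6],
    "lat": [38.5, 50.7, 43.5],
    "lng": [-108.1, -1.3, 104.0],
})

NUMERIC_COLUMNS = ["length_m", "max_ma", "min_ma", "lat", "lng"]


def consistency_report(df: pd.DataFrame) -> dict:
    """Count rows that break a few basic sanity rules (hypothetical helper)."""
    return {
        # Columns expected to be numeric but stored as another dtype
        "non_numeric_columns": [
            col for col in NUMERIC_COLUMNS
            if not pd.api.types.is_numeric_dtype(df[col])
        ],
        # max_ma is the older bound, so it should not be below min_ma
        # (rows with a missing bound compare as False and are not counted)
        "age_range_inverted": int((df["max_ma"] < df["min_ma"]).sum()),
        "non_positive_length": int((df["length_m"] <= 0).sum()),
        # between() is False for missing values, so those count as out of range
        "lat_out_of_range": int((~df["lat"].between(-90, 90)).sum()),
        "lng_out_of_range": int((~df["lng"].between(-180, 180)).sum()),
    }


report = consistency_report(sample)
print(report)
```

A report with an empty list and all-zero counts suggests the cleaned frame passes these particular checks; it does not rule out other problems such as misspelled names.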

Analysis and Insights 🔍

1. Dinosaur Diversity 🦖

Insight: There are 288 unique dinosaur names in the dataset.

# Number of unique dinosaur names
unique_dinosaurs = dinosaurs_cleaned['name'].nunique()
print(f'There are {unique_dinosaurs} different dinosaur names in the dataset.')

2. The Largest Dinosaur 🦕

Insight: The largest dinosaur recorded, measured by maximum length, is Supersaurus at 35.0 meters.

# Identify the largest dinosaur by length
largest_dinosaur = dinosaurs_cleaned.loc[dinosaurs_cleaned['length_m'].idxmax()]
print("The largest dinosaur is:")
print(largest_dinosaur)

Supersaurus, a massive herbivorous sauropod, exemplifies the awe-inspiring size that some dinosaurs achieved.
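
Because the dataset holds one row per fossil record, `idxmax()` returns only the first row that reaches the maximum length. As a rough sketch, grouping by name first gives a ranking of the longest dinosaurs rather than a single record. The sample values below are made up for illustration, and `longest_by_name` is an assumed variable name.

```python
import pandas as pd

# Made-up fossil records; one dinosaur name can appear in several rows
sample = pd.DataFrame({
    "name": ["Supersaurus", "Supersaurus", "Iguanodon", "Velociraptor", "Diplodocus"],
    "length_m": [35.0, 35.0, 10.0, 1.8, 26.0],
})

# Collapse records to one length per name, then keep the three longest
longest_by_name = sample.groupby("name")["length_m"].max().nlargest(3)
print(longest_by_name)
```

Applied to `dinosaurs_cleaned`, a ranking like this could give the museum a short list of candidates for a "giants" exhibit instead of a single headline figure.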

3. Dinosaur Type Prevalence 📊

Insight: The most common dinosaur type among the fossil records is ornithopod.

# Count the occurrences of each dinosaur type
dinosaur_type_counts = dinosaurs_cleaned['type'].value_counts()

plt.figure(figsize=(10, 6))
dinosaur_type_counts.plot(kind='bar', color='green')
plt.title('Number of Fossil Records per Dinosaur Type')
plt.xlabel('Dinosaur Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Understanding the prevalence of different dinosaur types helps the museum emphasize the variety of dinosaur adaptations and ecosystems.
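
The bar chart shows raw counts; expressing them as shares can make the comparison between types easier to state in text. The following is a small sketch using `value_counts(normalize=True)` on a made-up series (the values and the `type_share` and `most_common_type` names are illustrative assumptions); on the real data the series would be `dinosaurs_cleaned['type']`.

```python
import pandas as pd

# Made-up type labels standing in for the 'type' column (illustrative only)
sample_types = pd.Series(
    ["ornithopod", "ornithopod", "sauropod", "large theropod", "ornithopod"],
    name="type",
)

# Share of records per type, as a percentage rounded to one decimal
type_share = (sample_types.value_counts(normalize=True) * 100).round(1)

# Label of the type with the largest share
most_common_type = type_share.idxmax()

print(type_share)
print(f"Most common type: {most_common_type}")
```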
