Everyone Can Learn Data Scholarship
Submission of John Paul Curada from Polytechnic University of the Philippines - Manila
Reader's Guide:
- For optimal viewing of this report, please use a desktop or laptop.
- For an enhanced reading experience with interactive visualizations, please switch to
Reportview and wait for the visualizations to load.
PART I: Understanding Fossil Records
Image from Paleobiology Database
Key Findings
Here are the main findings from my investigation of the Paleobiology Database data:
- There are 1042 different dinosaur names in the dataset.
- The largest dinosaurs recorded in the dataset are Supersaurus and Argentinosaurus, each with a length of 35 meters.
- The Ornithopod dinosaur type occurs most frequently in the dataset.
- Age alone is not a strong predictor of dinosaur size, as there is no strong correlation between the two.
- Discoveries like feathered dinosaurs in Liaoning, China, shed light on the dino-bird connection.
- Alberta, Canada boasts rich fossil beds showcasing dinosaur diversity, particularly large theropods.
- Australia's scarcity of dino finds might be due to landmass shifts, poor conditions for fossil preservation, or less digging.
- Regions with abundant herbivores and carnivores likely enjoyed warm climates and plentiful food sources.
- The abundance of fossils in an area indicates favorable conditions for bone preservation.
- Pangea's breakup led to different dinosaur types on separate continents.
1.1 Background
I am applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. My job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.
1.2 Objectives
My main objective is to help my colleagues at the museum to gain insights on the fossil record data. Specifically, I aim to answer the following questions.
- How many different dinosaur names are present in the data?
- Which was the largest dinosaur? What about missing data in the dataset?
- What dinosaur type has the most occurrences in this dataset? Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type. Use the AI assistant to tweak your visualization (colors, labels, title...).
- Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.
- Use the AI assistant to create an interactive map showing each record.
- Any other insights you found during your analysis?
1.3 Introduction
This report delves into the fascinating world of dinosaurs through a data-driven analysis of the Paleobiology Database. We explore a wealth of information on these prehistoric giants, encompassing diversity, size, geographic distribution, and potential environmental factors. By examining fossil records, the report aims to shed light on the connection between dinosaur species, their habitats, and the influence of continental drift on their evolution. This data exploration will provide valuable insights for paleontologists and anyone curious about the lives and legacy of dinosaurs.
1.4 Data Description
A real dataset containing dinosaur records from the Paleobiology Database:
| Column name | Description | Data type |
|---|---|---|
| occurence_no | The original occurrence number from the Paleobiology Database. | int64 |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). | object |
| diet | The main diet (omnivorous, carnivorous, herbivorous). | object |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). | object |
| length_m | The maximum length, from head to tail, in meters. | float64 |
| max_ma | The age in which the first fossil records of the dinosaur were found, in million years. | float64 |
| min_ma | The age in which the last fossil records of the dinosaur were found, in million years. | float64 |
| region | The current region where the fossil record was found. | object |
| lng | The longitude where the fossil record was found. | float64 |
| lat | The latitude where the fossil record was found. | float64 |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). | object |
| family | The taxonomical family of the dinosaur (if known). | object |
The data was enriched with data from Wikipedia.
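Since part of the task is to advise on data quality, the table above can double as a checklist. The following is a minimal sketch, not part of the original analysis: `EXPECTED_DTYPES` and `check_schema` are hypothetical names introduced here, and the column names and types are copied from the table as written.

```python
import pandas as pd

# Expected schema, copied from the data description table (assumed to be accurate).
EXPECTED_DTYPES = {
    "occurence_no": "int64",
    "name": "object",
    "diet": "object",
    "type": "object",
    "length_m": "float64",
    "max_ma": "float64",
    "min_ma": "float64",
    "region": "object",
    "lng": "float64",
    "lat": "float64",
    "class": "object",
    "family": "object",
}


def check_schema(df: pd.DataFrame, expected: dict[str, str]) -> dict[str, list[str]]:
    """Compare a DataFrame against an expected {column: dtype} mapping.

    Returns the columns that are missing, unexpected, or stored with a
    different dtype than the one listed in `expected`.
    """
    missing = [col for col in expected if col not in df.columns]
    unexpected = [col for col in df.columns if col not in expected]
    mismatched = [
        col
        for col in expected
        if col in df.columns and str(df[col].dtype) != expected[col]
    ]
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}
```

Calling `check_schema(dinosaurs_df, EXPECTED_DTYPES)` after loading the file would return three lists, all empty if the file matches the table. Depending on the pandas version, text columns may be reported under a dedicated string dtype instead of `object`, in which case they would show up under `mismatched`.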
1.5 Exploratory Data Analysis
# Import necessary packages
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.templates.default = "plotly_white"

def extract_csv(file_path):
    """
    Load a CSV file using pandas and provide a summary of its content.

    This function reads a CSV file into a DataFrame using pandas. It prints a summary of the DataFrame,
    including the number of rows and columns, data types of the columns, and a count of missing values in
    each column. It returns the DataFrame.

    Parameters
    ----------
    file_path : str
        The file path of the CSV file to be loaded.

    Returns
    -------
    DataFrame
    """
    df = pd.read_csv(file_path)
    print(f"Here is a little bit of information about the data stored in \n{file_path}:")
    print(f"\nThere are {df.shape[0]} rows and {df.shape[1]} columns in this DataFrame.")
    print("\nThe columns in this DataFrame take the following types: ")
    print(df.dtypes.to_string())  # I added `to_string()` here to remove the `dtype: object`
    print("\nThe columns in this DataFrame have the following missing values count: ")
    print(df.isna().sum().to_string())
    print("\nThe columns in this DataFrame have the following unique values count: ")
    print(df.nunique())
    print("\nTo view the extracted DataFrame, display the value returned by this function.\n\n")
    return df

# Call the extract function
dinosaurs_df = extract_csv("data/dinosaurs.csv")
# Load the dataset into Pandas DataFrame
dinosaurs_df = pd.read_csv("data/dinosaurs.csv")

A preliminary examination (Exploratory Data Analysis - EDA) reveals the dinosaurs.csv dataset to be a comprehensive compilation of dinosaur occurrences, encompassing 4,951 individual entries distributed across 12 distinct columns. Each column represents a specific dinosaur characteristic, providing a detailed profile for each occurrence.
- Lots of details: It has things like dinosaur names, what they ate (diet), their type (e.g., sauropod, ornithopod, large theropod), size (length), location (region, latitude/longitude), and even their family.
- Not all info is there: Some information is missing for some dinosaurs, especially about what they ate, their type, and their length (around 27% for each).
- Reliable info for location: Luckily, things like location (region, latitude/longitude) and when they lived (based on the min_ma/max_ma age estimates) are well documented.
- Great for exploring: This table is a great starting point to learn more about dinosaurs, even though some information is missing.
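The missing-value shares mentioned above can be computed directly rather than read off the raw counts. The sketch below is illustrative only: `missing_share` is a hypothetical helper, and the `toy` rows are made up for the example and are not taken from dinosaurs.csv.

```python
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest share first."""
    # isna() marks missing cells; the column mean of booleans is the missing fraction.
    return (df.isna().mean() * 100).round(1).sort_values(ascending=False)


# Toy rows for illustration only (not real records from the dataset).
toy = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "diet": ["herbivorous", None, "carnivorous", None],
        "length_m": [3.0, None, 12.0, 7.5],
    }
)
print(missing_share(toy))
```

On the real data, `missing_share(dinosaurs_df)` would give one percentage per column. As a cross-check, the summary statistics report 3,568 length values out of 4,951 rows, which leaves 1,383 rows without a length, or about 27.9%.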
DataFrame's Summary Information
# Computing descriptive statistics for the DataFrame
dinosaurs_df.describe()

This section explores key characteristics of the dinosaurs identified in the Paleobiology Database through descriptive statistics.
- Dino Sizes: Our analysis focused on the size of 3,568 dinosaurs (out of a total of 4,951). These dinosaurs exhibited a remarkable range in length, from a mere 0.45 meters for the smallest to a gigantic 35 meters for the largest. On average, dinosaurs measured approximately 8.21 meters in length, but with a significant variation in size, as indicated by the standard deviation of 6.63 meters.
- Dino Time Period: The fossils tell us dinosaurs roamed Earth for a long time, from a whopping 252.17 million years ago to a more recent 70.6 million years ago. The average fossil is about 117.52 million years old (first appearance) and 106.62 million years old (last appearance), with a spread of about 45 million years in each direction (based on standard deviation).
- Dino Distribution: A geographical analysis of the fossils indicates a higher concentration in the northern and western hemispheres, suggesting a potential bias in fossil discoveries across different regions.
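The hemisphere observation can be checked by counting records on each side of the equator and the prime meridian. This is a rough sketch under stated assumptions: `hemisphere_counts` is a hypothetical helper, the column names `lat` and `lng` follow the data description, and coordinates of exactly 0 are counted as north or east by convention.

```python
import pandas as pd


def hemisphere_counts(df: pd.DataFrame, lat_col: str = "lat", lng_col: str = "lng") -> pd.DataFrame:
    """Cross-tabulate records by north/south and east/west hemisphere."""
    # Only rows with both coordinates present can be placed in a hemisphere.
    located = df.dropna(subset=[lat_col, lng_col])
    # Latitude >= 0 is north of the equator; longitude >= 0 is east of the prime meridian.
    north_south = located[lat_col].ge(0).map({True: "north", False: "south"})
    east_west = located[lng_col].ge(0).map({True: "east", False: "west"})
    return pd.crosstab(north_south, east_west)
```

Running `hemisphere_counts(dinosaurs_df)` would return a small table with rows north/south and columns east/west. Large counts in the north row and the west column would support the pattern described above, although the counts reflect where fossils have been collected rather than where dinosaurs actually lived.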