Dinosaur Data and Film Data Analysis with Python and SQL

1️⃣ Part 1 (Python) - Dinosaur data 🦕

💾 The data dictionary

Source: the Paleobiology Database (source link):

Column name	Description
occurence_no	The original occurrence number from the Paleobiology Database.
name	The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil).
diet	The main diet (omnivorous, carnivorous, herbivorous).
type	The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur).
length_m	The maximum length, from head to tail, in meters.
max_ma	The age at which the first fossil records of the dinosaur were found, in million years.
min_ma	The age at which the last fossil records of the dinosaur were found, in million years.
region	The current region where the fossil record was found.
lng	The longitude where the fossil record was found.
lat	The latitude where the fossil record was found.
class	The taxonomical class of the dinosaur (Saurischia or Ornithischia).
family	The taxonomical family of the dinosaur (if known).

The data was enriched with data from Wikipedia.


The data


The Dinosaur Data Analysis

Image: Dinosaur museum (from the Houston Museum of Natural Science)

The national museum for natural history has recently created a database containing all dinosaur records from past field campaigns and requires some insights from this data.

The Objectives of this analysis:

Extract fascinating insights for my colleagues at the museum.
Examine the quality of the data to provide valuable advice to the museum on any issues I discover.

Phase 1 - Data Quality Assessment

Data Quality Assessment scope:

Validity: Does the dataset contain any invalid or erroneous data entries?
Duplication: Are there any duplicate entries or redundant information in the dataset?
Completeness: Are there any missing data entries or incomplete records in the dataset?
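The duplication and completeness questions above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's hidden code: the `sample` frame and the `duplication_and_completeness` helper are hypothetical, and only a few of the 12 columns are shown.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset is loaded elsewhere in the notebook.
sample = pd.DataFrame({
    "occurence_no": [1, 2, 2, 3],
    "name": ["Allosaurus", "Stegosaurus", "Stegosaurus", "Triceratops"],
    "diet": ["carnivorous", "herbivorous", "herbivorous", None],
    "length_m": [12.0, 9.0, 9.0, None],
})

def duplication_and_completeness(df):
    """Return the count of fully duplicated rows and the missing-value count per column."""
    n_duplicates = int(df.duplicated().sum())  # rows identical to an earlier row
    missing = df.isna().sum()                  # Series indexed by column name
    return n_duplicates, missing

n_dupes, missing_per_column = duplication_and_completeness(sample)
print(n_dupes)
print(missing_per_column[missing_per_column > 0])
```

Counting only fully identical rows is a deliberate simplification; checking duplicates on a subset of columns (for example the occurrence number alone) may be more appropriate for the real data.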

I'll begin the DQA (Data Quality Assessment) by getting a general understanding of the dataset and then validating it:

Checking if the dataset aligns with what is in the data dictionary
Checking if there are errors or typos
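One way to check alignment with the data dictionary is to compare column names. The sketch below assumes the column names are spelled exactly as in the dictionary (including `occurence_no`); `compare_columns` and the empty `sample` frame are hypothetical stand-ins for the real dataset.

```python
import pandas as pd

# Column names as listed in the data dictionary.
EXPECTED_COLUMNS = [
    "occurence_no", "name", "diet", "type", "length_m", "max_ma",
    "min_ma", "region", "lng", "lat", "class", "family",
]

def compare_columns(df, expected):
    """Return (columns missing from df, columns in df not in the dictionary)."""
    actual = set(df.columns)
    missing = sorted(set(expected) - actual)
    unexpected = sorted(actual - set(expected))
    return missing, unexpected

# Hypothetical frame that lacks "class" and carries an extra "notes" column.
sample = pd.DataFrame(columns=[
    "occurence_no", "name", "diet", "type", "length_m", "max_ma",
    "min_ma", "region", "lng", "lat", "family", "notes",
])

missing, unexpected = compare_columns(sample, EXPECTED_COLUMNS)
print(missing, unexpected)
```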


The dataset contains 4,951 rows (records) and 12 columns.

According to the data dictionary, these columns have specific allowed values:

Diet: omnivorous, carnivorous, herbivorous (3 unique values)
Type: small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur (6 unique values)
Class: Saurischia or Ornithischia (2 unique values)

I'll make sure all values in these columns match the allowed options.
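A check of this kind could look like the sketch below. The allowed values come from the data dictionary; the `invalid_values` helper and the sample rows (with a deliberate typo) are hypothetical.

```python
import pandas as pd

# Allowed values per column, taken from the data dictionary.
ALLOWED = {
    "diet": {"omnivorous", "carnivorous", "herbivorous"},
    "type": {"small theropod", "large theropod", "sauropod",
             "ornithopod", "ceratopsian", "armored dinosaur"},
    "class": {"Saurischia", "Ornithischia"},
}

def invalid_values(df, allowed):
    """Map each column to the sorted list of observed values outside its allowed set."""
    problems = {}
    for column, valid in allowed.items():
        observed = set(df[column].dropna().unique())  # missing values are ignored here
        bad = observed - valid
        if bad:
            problems[column] = sorted(bad)
    return problems

# Hypothetical rows; "herbivorus" is an intentional misspelling.
sample = pd.DataFrame({
    "diet": ["carnivorous", "herbivorus", "omnivorous"],
    "type": ["large theropod", "sauropod", "small theropod"],
    "class": ["Saurischia", "Saurischia", "Saurischia"],
})

print(invalid_values(sample, ALLOWED))
```

An empty dictionary means every non-missing value is one of the allowed options.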


All values matched: there are no misspellings, typos, or invalid values.

Based on domain knowledge, I'm assuming that:

dinosaurs went extinct approximately 65 million years ago
the largest dinosaur length ever recorded is 60 meters.

So I'm going to confirm that:

the most recent age (min_ma) of each dinosaur's fossil record is greater than 65 million years.
the length of each dinosaur is no more than 60 meters (a larger value could indicate that a wrong unit, e.g. feet, was used for some rows)

I'll also confirm that the length_m, max_ma, and min_ma columns contain only positive values: a length or an age cannot be negative, so a negative value would indicate an error.
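These range checks can be sketched as a single function that counts violations. The thresholds (60 meters, 65 million years) are the ones stated above; `range_violations` and the sample rows are hypothetical, and missing values are simply not flagged because comparisons with NaN evaluate to False.

```python
import pandas as pd

def range_violations(df, max_length_m=60, extinction_ma=65):
    """Count rows that break each plausibility rule for length and age."""
    numeric = ["length_m", "max_ma", "min_ma"]
    return {
        # any of the three numeric columns is zero or negative
        "non_positive": int((df[numeric] <= 0).any(axis=1).sum()),
        # longer than the assumed maximum, possibly a unit mix-up
        "too_long": int((df["length_m"] > max_length_m).sum()),
        # fossil record not older than the assumed extinction date
        "too_recent": int((df["min_ma"] <= extinction_ma).sum()),
    }

# Hypothetical rows: one too long, one negative length, one too recent.
sample = pd.DataFrame({
    "length_m": [12.0, 98.0, -3.0, None],
    "max_ma": [155.0, 150.0, 70.0, 80.0],
    "min_ma": [145.0, 140.0, 66.0, 60.0],
})

print(range_violations(sample))
```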


The results confirm that:

All dinosaurs have a length less than or equal to 60 meters.
All dinosaurs have a minimum age greater than 65 million years.
The length_m, max_ma, and min_ma columns contain only positive values.

Next, I'm going to validate the max_ma and min_ma columns further.

According to the data dictionary, max_ma is the geological age, in million years, of the earliest fossil record of a particular dinosaur species or genus, while min_ma is the geological age, in million years, of the latest fossil record of the same species or genus.

This means that max_ma is the upper boundary of the time period while min_ma is the lower boundary. Therefore, I'm going to validate that:

The max_ma value is always greater than the min_ma value for every record.
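A minimal sketch of this check, assuming the dataset is a pandas DataFrame with `max_ma` and `min_ma` columns; the `inverted_age_ranges` helper and the sample rows are hypothetical.

```python
import pandas as pd

def inverted_age_ranges(df):
    """Return rows where max_ma is not strictly greater than min_ma."""
    return df[df["max_ma"] <= df["min_ma"]]

# Hypothetical rows; the second one has its boundaries swapped.
sample = pd.DataFrame({
    "name": ["Allosaurus", "Stegosaurus", "Triceratops"],
    "max_ma": [155.0, 145.0, 68.0],
    "min_ma": [145.0, 155.0, 66.0],
})

violations = inverted_age_ranges(sample)
print(violations["name"].tolist())
```

An empty result means every record satisfies the rule. Rows where either age is missing are not returned, since comparisons with NaN evaluate to False.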
