1️⃣ Part 1 (Python) - Dinosaur data 🦕
💾 The data dictionary
Source: the Paleobiology Database (source link):
| Column name | Description |
|---|---|
| occurence_no | The original occurrence number from the Paleobiology Database. |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
| diet | The main diet (omnivorous, carnivorous, herbivorous). |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
| length_m | The maximum length, from head to tail, in meters. |
| max_ma | The age in which the first fossil records of the dinosaur were found, in million years. |
| min_ma | The age in which the last fossil records of the dinosaur were found, in million years. |
| region | The current region where the fossil record was found. |
| lng | The longitude where the fossil record was found. |
| lat | The latitude where the fossil record was found. |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
| family | The taxonomical family of the dinosaur (if known). |
The data was enriched with data from Wikipedia.
The data
The Dinosaur Data Analysis

Image from Houston Museum of Natural Science
The national museum for natural history has recently created a database containing all dinosaur records of past field campaigns and requires some insights from this data.
The objectives of this analysis:
- Extract fascinating insights for my colleagues at the museum.
- Examine the quality of the data to provide valuable advice to the museum on any issues I discover.
Phase 1 - Data Quality Assessment
Data Quality Assessment scope:
- Validity: Does the dataset contain any invalid or erroneous data entries?
- Duplication: Are there any duplicate entries or redundant information in the dataset?
- Completeness: Are there any missing data entries or incomplete records in the dataset?
I'll begin the DQA (Data Quality Assessment) by getting a general understanding of the dataset and then validating it:
- Checking whether the dataset aligns with what is in the data dictionary
- Checking for errors and typos
The dataset contains 4951 records (rows) and 12 columns.
According to the data dictionary, these columns have specific allowed values:
- Diet: omnivorous, carnivorous, herbivorous (3 unique values)
- Type: small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur (6 unique values)
- Class: Saurischia or Ornithischia (2 unique values)
I'll make sure all values in these columns match the allowed options.
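The check described above can be sketched with pandas roughly as follows. This is a minimal sketch, not the notebook's original cell: the small `dinosaurs` frame is made-up sample data, and `ALLOWED` and `invalid_values` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Made-up sample rows for illustration; the real dataset has 4951 rows.
dinosaurs = pd.DataFrame({
    "diet": ["carnivorous", "herbivorous", "omnivorous", None],
    "type": ["large theropod", "sauropod", "small theropod", "ornithopod"],
    "class": ["Saurischia", "Saurischia", "Saurischia", "Ornithischia"],
})

# Allowed values copied from the data dictionary.
ALLOWED = {
    "diet": {"omnivorous", "carnivorous", "herbivorous"},
    "type": {"small theropod", "large theropod", "sauropod",
             "ornithopod", "ceratopsian", "armored dinosaur"},
    "class": {"Saurischia", "Ornithischia"},
}


def invalid_values(df, allowed):
    """Return, per column, the non-missing values that are not in the allowed set."""
    return {
        col: set(df[col].dropna().unique()) - values
        for col, values in allowed.items()
    }


# An empty set for a column means every non-missing value is allowed.
print(invalid_values(dinosaurs, ALLOWED))
```

Missing values are dropped before the comparison so that they are not reported as invalid; they belong to the completeness check instead.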
The values matched (no misspelling, typo, or invalid value).
Based on domain knowledge, it is a known fact that:
- dinosaurs ceased to exist approximately 65 million years ago
- the largest dinosaur length ever recorded is 60 meters.
So I'm going to confirm that:
- the latest age (min_ma) recorded for each dinosaur is greater than 65 million years.
- the length of each dinosaur is no more than 60 meters (a larger value could indicate that a wrong unit, e.g. feet, was used for some rows).
I'll also confirm that the length_m, max_ma, and min_ma columns contain only positive values. A negative length or age is impossible, so negative values would indicate an error.
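One way to run these range checks is sketched below, assuming the data sits in a pandas DataFrame. The sample rows are made up, and the names `EXTINCTION_MA`, `MAX_LENGTH_M`, and `checks` are hypothetical. `describe()` is used here as one possible way to produce a summary table of minimum and maximum values; the original cell may have done this differently.

```python
import pandas as pd

# Made-up sample rows for illustration only.
dinosaurs = pd.DataFrame({
    "length_m": [12.0, 35.0, 0.8, None],
    "max_ma": [83.5, 157.3, 72.1, 145.0],
    "min_ma": [70.6, 145.0, 66.0, 99.6],
})

EXTINCTION_MA = 65  # approximate extinction age used in the text
MAX_LENGTH_M = 60   # largest recorded length assumed in the text

numeric_cols = ["length_m", "max_ma", "min_ma"]

# Summary table: the min and max rows show the value range of each column.
print(dinosaurs[numeric_cols].describe())

# Explicit pass/fail checks. Missing values are skipped here because
# completeness is assessed separately.
checks = pd.Series({
    "length <= 60 m": (dinosaurs["length_m"].dropna() <= MAX_LENGTH_M).all(),
    "min_ma > 65": (dinosaurs["min_ma"].dropna() > EXTINCTION_MA).all(),
    "all positive": (dinosaurs[numeric_cols].min() > 0).all(),
})
print(checks)
```

Each entry in `checks` is `True` only if every non-missing value in the column satisfies the rule, so a single `False` points to the column that needs a closer look.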
The table above confirms that:
- All dinosaurs have a length less than or equal to 60 meters.
- All dinosaurs have a minimum age greater than 65 million years.
- The length_m, max_ma, and min_ma columns contain only positive values.
Next, I'm going to validate the max_ma and min_ma columns further.
According to the data dictionary, max_ma is the geological age, in million years, of the earliest fossil record of a particular dinosaur species or genus, while min_ma is the geological age, in million years, of the latest fossil record of the same species or genus.
Because ages count backward from the present, the earlier record has the larger value. This means that max_ma is the upper boundary of the time period while min_ma is the lower boundary. Therefore, I'm going to validate that:
- The max_ma value is always greater than the min_ma value for every record.
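A minimal sketch of this ordering check, assuming a pandas DataFrame with the `max_ma` and `min_ma` columns from the data dictionary. The rows and the `violations` name are made up for illustration, and the last row is deliberately inverted to show what a failing record looks like.

```python
import pandas as pd

# Made-up sample rows; the names are placeholders, not real records.
dinosaurs = pd.DataFrame({
    "name": ["A", "B", "C"],
    "max_ma": [83.5, 157.3, 70.6],
    "min_ma": [70.6, 145.0, 83.5],  # row "C" is deliberately inverted
})

# Rows that violate the rule max_ma > min_ma.
# Rows with a missing age compare as False and are therefore not flagged here.
violations = dinosaurs[dinosaurs["max_ma"] <= dinosaurs["min_ma"]]

print(f"{len(violations)} record(s) where max_ma is not greater than min_ma")
print(violations)
```

Selecting the violating rows, rather than only computing a single True/False result, makes it easy to report the affected records back to the museum.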