
Dino Data 🦕

The fossil records from field campaigns offer interesting insights about geographic distributions and changes in average dinosaur sizes over hundreds of millions of years. Here's what we're working with:

💾 The data

| Column name | Description |
| --- | --- |
| occurence_no | The original occurrence number from the Paleobiology Database. |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
| diet | The main diet (omnivorous, carnivorous, herbivorous). |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
| length_m | The maximum length, from head to tail, in meters. |
| max_ma | The age at which the first fossil records of the dinosaur were found, in millions of years. |
| min_ma | The age at which the last fossil records of the dinosaur were found, in millions of years. |
| region | The current region where the fossil record was found. |
| lng | The longitude where the fossil record was found. |
| lat | The latitude where the fossil record was found. |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
| family | The taxonomical family of the dinosaur (if known). |

The data was enriched with data from Wikipedia.


Here are pairwise comparisons of the numerical columns. The diagonal (top left to bottom right) shows a histogram for each column; the panels above the diagonal are scatter plots, and the panels below it are kernel density estimate (KDE) plots.

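import seaborn as sns
import matplotlib.pyplot as plt

# Skip the first column (occurence_no, per the table above); PairGrid keeps only numeric columns by default
ppd = sns.PairGrid(dinosaurs.iloc[:, 1:])
ppd.map_diag(sns.histplot)      # histograms on the diagonal
ppd.map_upper(sns.scatterplot)  # scatter plots above the diagonal
ppd.map_lower(sns.kdeplot)      # KDE contours below the diagonal
plt.show()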

A longitude near 600 must be a mistake, since valid longitudes run from -180 to 180; let's see whether that record disappears once rows with nulls are removed.


We're only going to concern ourselves with the rows where no columns are null. Here are summary statistics after dropping null-containing rows:
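
The cell that does this is not shown here, so the following is only a minimal sketch of the step, assuming pandas. The `sample` frame and its values are made up for illustration and are not rows from the dataset; in the notebook, the same calls would be applied to `dinosaurs`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `dinosaurs`; every value here is invented for illustration
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "length_m": [12.0, 26.0, np.nan, 8.0],
    "max_ma": [155.0, 155.0, 145.0, 70.0],
    "min_ma": [145.0, 145.0, 140.0, 66.0],
    "lng": [-108.2, -105.9, 600.0, -104.5],
    "lat": [39.1, 41.9, 40.0, 46.3],
})

# Keep only the rows in which no column is null
complete = sample.dropna()

# Summary statistics (count, mean, std, min, quartiles, max) for the numeric columns
summary = complete.describe()
print(summary)
```

In this toy frame the row with the impossible longitude also has a missing length, so `dropna()` removes it as a side effect, which mirrors what happened in the real data.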


The record with an out-of-bounds longitude was removed; it evidently had a null in a column, too.
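
Relying on a null elsewhere in the row to remove a bad coordinate is a matter of luck, so an explicit range check may be worth adding. The sketch below is one way to do it, not the notebook's own code: `out_of_bounds` is a hypothetical helper, and the `coords` values are invented.

```python
import pandas as pd

def out_of_bounds(df, lng_col="lng", lat_col="lat"):
    """Return rows whose non-null coordinates fall outside valid ranges."""
    # Longitude must lie in [-180, 180] and latitude in [-90, 90].
    # Nulls are not flagged here; they are a separate data-quality question.
    bad_lng = df[lng_col].notna() & ~df[lng_col].between(-180, 180)
    bad_lat = df[lat_col].notna() & ~df[lat_col].between(-90, 90)
    return df[bad_lng | bad_lat]

# Invented coordinates for illustration only
coords = pd.DataFrame({
    "lng": [-108.2, 600.0, 12.5, None],
    "lat": [39.1, 40.0, -95.0, 10.0],
})

suspect = out_of_bounds(coords)
print(suspect)
```

Running the same helper on `dinosaurs` before and after dropping nulls would confirm directly whether any out-of-range record survives.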

# 3
import matplotlib.pyplot as plt
import seaborn as sns

# Count occurrences per type once and reuse the result for both panels
type_counts = dinosaurs['type'].value_counts()

# One distinct color per type, in the same order as the bars
bar_colors = sns.color_palette("husl", len(type_counts))

print('Distributions by type of dinosaur:')
fig, ax = plt.subplots(1, 2, figsize=(14, 6))

# Left panel: absolute counts
type_counts.plot(kind='barh', ax=ax[0], color=bar_colors)
ax[0].set_title('Absolute Counts by Type')

# Right panel: relative counts as a percentage of the total
(100 * type_counts / type_counts.sum()).plot(kind='barh', ax=ax[1], color=bar_colors)
ax[1].set_title('Relative Counts by Type (Percent of Total)')

plt.tight_layout()
plt.show()

It appears there was a period around 155 mya (give or take 10 my) when large sauropods were abundant; they then possibly experienced a population crash around 140 mya before partly recovering.

The dataset in this analysis represents a minuscule fraction of true dinosaur distributions, which are very difficult to estimate accurately, since finding fossils is the primary means of counting and taxonomy. Older specimens are, logically, rarer and harder to find, having had more time exposed to environments that could damage the fossil. That noted, average sizes in the dataset increased gradually from the time of the great sauropods toward the extinction event that wiped out most dinosaur species.
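
One way to check trends like these is to bin records by age and compare counts and mean lengths per bin. The sketch below is only an illustration under stated assumptions: the `toy` rows are invented, the midpoint of `max_ma` and `min_ma` is used as a single age per record, and the 20-million-year bin width is arbitrary. None of this is confirmed to be how the figures above were produced.

```python
import pandas as pd

# Invented rows for illustration; column names follow the data table
toy = pd.DataFrame({
    "type": ["sauropod", "sauropod", "small theropod", "ornithopod", "large theropod"],
    "length_m": [25.0, 21.0, 2.0, 9.0, 12.0],
    "max_ma": [160.0, 155.0, 130.0, 80.0, 70.0],
    "min_ma": [150.0, 145.0, 120.0, 70.0, 66.0],
})

# Assumption: represent each record by the midpoint of its age range
toy["mid_ma"] = (toy["max_ma"] + toy["min_ma"]) / 2

# Assumption: 20-million-year bins from 60 to 180 mya
bins = list(range(60, 181, 20))
toy["age_bin"] = pd.cut(toy["mid_ma"], bins=bins)

# Record count and mean length per age bin, oldest bin first
by_age = (
    toy.groupby("age_bin", observed=True)["length_m"]
    .agg(["count", "mean"])
    .sort_index(ascending=False)
)
print(by_age)
```

Reading the count and mean columns side by side helps separate a real change in size from a bin that simply contains very few fossils.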