

1️⃣ Part 1 (Python) - Dinosaur data 🦕

📖 Background

You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.

💾 The data

You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):

| Column name | Description |
| --- | --- |
| occurence_no | The original occurrence number from the Paleobiology Database. |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
| diet | The main diet (omnivorous, carnivorous, herbivorous). |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
| length_m | The maximum length, from head to tail, in meters. |
| max_ma | The age in which the first fossil records of the dinosaur were found, in million years. |
| min_ma | The age in which the last fossil records of the dinosaur were found, in million years. |
| region | The current region where the fossil record was found. |
| lng | The longitude where the fossil record was found. |
| lat | The latitude where the fossil record was found. |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
| family | The taxonomical family of the dinosaur (if known). |

The data was enriched with data from Wikipedia.
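
Since part of the job is to advise on data quality, a per-column summary of types, missing values and unique values is a useful first look. The sketch below is only an illustration: `quality_report` is a hypothetical helper, the sample rows are made up to mimic the column layout described above, and the `max_ma >= min_ma` check assumes that the first (older) record should never be younger than the last one.

```python
import numpy as np
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing count, missing share (%) and unique count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_missing': df.isna().sum(),
        'pct_missing': (df.isna().mean() * 100).round(1),
        'n_unique': df.nunique(),
    })

# Made-up rows that only mimic the column layout; not real records
sample = pd.DataFrame({
    'name': ['Dino A', 'Dino B', 'Dino B', 'Dino C'],
    'diet': ['herbivorous', None, 'carnivorous', 'carnivorous'],
    'length_m': [10.0, np.nan, 2.0, np.nan],
    'max_ma': [150.0, 70.0, 70.0, 80.0],
    'min_ma': [145.0, 66.0, 72.0, 75.0],
})

report = quality_report(sample)
print(report)

# Assumed sanity check: the oldest age should not be smaller than the youngest
bad_ages = sample[sample['max_ma'] < sample['min_ma']]
print(f'{len(bad_ages)} row(s) have max_ma < min_ma')
```

Running the same helper on the real `dinosaurs` dataframe, once it is loaded, would show at a glance which columns need cleaning before any statistics are computed.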

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium

# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
# Preview the dataframe
dinosaurs
# View basic statistics of the dataset
print(dinosaurs.describe())
# Number of unique Dinosaur names in the dataset
number_of_unique_names = len(dinosaurs['name'].drop_duplicates())
print(f'The dataset has {number_of_unique_names} different dinosaur names present in it.')

# Largest and smallest dinosaurs (ties are listed together instead of assuming exactly two)
max_length = dinosaurs['length_m'].max()
min_length = dinosaurs['length_m'].min()
largest_dinosaurs = dinosaurs.loc[dinosaurs['length_m'] == max_length, 'name'].unique()
smallest_dinosaurs = dinosaurs.loc[dinosaurs['length_m'] == min_length, 'name'].unique()
print(f'Largest dinosaur(s) at {max_length} meters: {", ".join(largest_dinosaurs)}. Smallest dinosaur(s) at {min_length} meters: {", ".join(smallest_dinosaurs)}.')

# Number of missing data points in the length_m column
missing_length = dinosaurs['length_m'].isna().sum()
# count() already ignores missing values, so no dropna() is needed
available_length = dinosaurs['length_m'].count()
print(f'There are {missing_length} missing data points in the length_m column and {available_length} available.')

# Finding the mean, median and standard deviation of the dinosaur length
measures_length = dinosaurs[['length_m']].agg(['mean', 'median', 'std'])
print(f'The mean, median and standard deviation of dinosaur length are as follows:\n{measures_length}')
# Plotting the distribution of Dinosaur length
plt.style.use('seaborn-v0_8-colorblind')
sns.histplot(x='length_m', data=dinosaurs, bins=20, element='bars', fill=True)
plt.xlabel('Dinosaur Length (meters)')
plt.title('Distribution of Dinosaur Length')
plt.show()
print('The histogram above is right-skewed, so missing data points are better filled with the median than with the mean.')
# Fill missing data in length_m column
dinosaurs['length_m'] = dinosaurs['length_m'].fillna(dinosaurs['length_m'].median())
# Boxplot of Dinosaur length
sns.boxplot(y='length_m', data=dinosaurs)
plt.xlabel('Dinosaurs')
plt.ylabel('Length (meters)')
plt.title('Boxplot of dinosaur length', y=1.1)
plt.show()

The boxplot above shows the length of all dinosaurs after the missing data points were filled with the median.
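
One thing worth keeping in mind when reading a boxplot drawn after imputation: filling gaps with the median stacks extra values at the center, which narrows the box. The sketch below illustrates this with a made-up right-skewed series (not values from the museum data); `iqr` is a small hypothetical helper.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed lengths with gaps; not values from the dataset
lengths = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, 6.0, 30.0, np.nan, np.nan, np.nan])

# Same imputation strategy as in the notebook: fill gaps with the median
filled = lengths.fillna(lengths.median())

def iqr(s: pd.Series) -> float:
    """Interquartile range; quantile() skips missing values."""
    return s.quantile(0.75) - s.quantile(0.25)

# In a right-skewed sample the mean is pulled above the median
print(f'mean={lengths.mean():.2f}, median={lengths.median():.2f}')

# The box (IQR) and the standard deviation both shrink after filling
print(f'IQR before={iqr(lengths):.2f}, after={iqr(filled):.2f}')
print(f'std before={lengths.std():.2f}, after={filled.std():.2f}')
```

If the aim is to describe the measured lengths rather than the imputed ones, plotting the column before filling it (seaborn ignores missing values) may be the safer choice.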

# Boxplot of the length of different dinosaur types
sns.catplot(y='type', x='length_m', data=dinosaurs, kind='box')
plt.xlabel('Length of Dinosaur (meters)')
plt.ylabel('Dinosaur Type')
plt.title('Boxplot of the length of each dinosaur type', y=1.1)
plt.show()

The plot above shows that sauropods are the largest dinosaur type, with a median length of 21 meters. Small theropods are the smallest type, with a median length of 2 meters and 5 outliers.
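
Medians and outlier counts are easier to state precisely when computed rather than read off the chart. Here is a minimal sketch, assuming the conventional 1.5 × IQR rule (the default whisker setting in seaborn and matplotlib boxplots). `summarize_by_type` is a hypothetical helper and the toy lengths are invented for illustration.

```python
import pandas as pd

def summarize_by_type(df, group_col='type', value_col='length_m'):
    """Median and number of 1.5*IQR outliers for each group."""
    rows = []
    for name, values in df.groupby(group_col)[value_col]:
        values = values.dropna()
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        fence = 1.5 * (q3 - q1)
        is_outlier = (values < q1 - fence) | (values > q3 + fence)
        rows.append({
            group_col: name,
            'median': values.median(),
            'n_outliers': int(is_outlier.sum()),
        })
    return (pd.DataFrame(rows)
            .set_index(group_col)
            .sort_values('median', ascending=False))

# Invented lengths for two types; not values from the dataset
toy = pd.DataFrame({
    'type': ['sauropod'] * 5 + ['small theropod'] * 6,
    'length_m': [18.0, 20.0, 21.0, 22.0, 25.0, 1.5, 2.0, 2.0, 2.0, 2.5, 9.0],
})

summary = summarize_by_type(toy)
print(summary)
```

Note that this counts rows, so a dinosaur that appears in several occurrence records is counted once per record, whereas identical points overlap on the chart.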

# Dinosaur type with most occurrences in the dataset and diet preference
g = sns.catplot(y='type', data=dinosaurs, kind='count', palette='Set1', hue='diet')
g.set_axis_labels('Count', 'Dinosaur Type')
# FacetGrid.figure is the preferred accessor in recent seaborn; .fig is deprecated
g.figure.suptitle('Frequency of dinosaur types by diet', y=1.05)
plt.show()

print(f'The dinosaur type with the most occurrences in the dataset is {dinosaurs["type"].value_counts().idxmax()}, as shown in the chart above.\nMost dinosaurs are herbivorous. Omnivorous dinosaurs are the least common and are found only in the small theropod, large theropod and sauropod types. Carnivorous dinosaurs belong only to the small and large theropod types.')
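
Statements about which diets occur in which types can be checked directly with a cross-tabulation instead of being read off the count plot. The sketch below uses a made-up table, so its counts say nothing about the real data.

```python
import pandas as pd

# Made-up type/diet pairs; not rows from the dataset
toy = pd.DataFrame({
    'type': ['sauropod', 'sauropod',
             'ornithopod', 'ornithopod', 'ornithopod',
             'small theropod', 'small theropod',
             'large theropod'],
    'diet': ['herbivorous', 'herbivorous',
             'herbivorous', 'herbivorous', 'herbivorous',
             'carnivorous', 'omnivorous',
             'carnivorous'],
})

# Rows are dinosaur types, columns are diets, cells are occurrence counts
counts = pd.crosstab(toy['type'], toy['diet'])
print(counts)

# Most frequent type overall
most_common_type = toy['type'].value_counts().idxmax()

# Types that contain at least one carnivorous record
types_with_carnivores = counts.index[counts['carnivorous'] > 0].tolist()

print(most_common_type)
print(types_with_carnivores)
```

Applied to the real `dinosaurs` dataframe, the same `pd.crosstab(dinosaurs['type'], dinosaurs['diet'])` call would show whether the diet claims in the printed summary hold for every type.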