Insights on Dinosaur and Movie data using Python and SQL

Truly, everyone can learn data.

1️⃣ Part 1 (Python) - Dinosaur data 🦕

📖 Background

You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.

💾 The data

You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):

Column name	Description
occurence_no	The original occurrence number from the Paleobiology Database.
name	The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil).
diet	The main diet (omnivorous, carnivorous, herbivorous).
type	The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur).
length_m	The maximum length, from head to tail, in meters.
max_ma	The age in which the first fossil records of the dinosaur were found, in millions of years.
min_ma	The age in which the last fossil records of the dinosaur were found, in millions of years.
region	The current region where the fossil record was found.
lng	The longitude where the fossil record was found.
lat	The latitude where the fossil record was found.
class	The taxonomical class of the dinosaur (Saurischia or Ornithischia).
family	The taxonomical family of the dinosaur (if known).

The data was enriched with data from Wikipedia.

# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium

# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')

# Preview the dataframe
dinosaurs

# View basic statistics of the dataset
print(dinosaurs.describe())
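
Because the brief asks for advice on data quality, a few explicit checks can complement describe(). The sketch below is one possible approach, not the notebook's own method: quality_report is a hypothetical helper, and the toy rows are invented purely so the example runs on its own. It relies on the column descriptions above, assuming max_ma is always the older bound (so max_ma >= min_ma) and that lng and lat are in decimal degrees.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarise a few basic data-quality signals (hypothetical helper)."""
    # between() is False for missing values, so missing coordinates count as bad too
    bad_lng = ~df["lng"].between(-180, 180)
    bad_lat = ~df["lat"].between(-90, 90)
    return {
        # Missing values per column
        "missing": df.isna().sum().to_dict(),
        # Rows that are exact copies of an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # max_ma is assumed to be the older bound, so it should not be below min_ma
        "age_order_violations": int((df["max_ma"] < df["min_ma"]).sum()),
        # Coordinates outside the valid longitude/latitude ranges
        "bad_coordinates": int((bad_lng | bad_lat).sum()),
    }


# Toy rows invented for illustration only; they are NOT taken from the museum data.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "length_m": [2.0, None, 10.0],
    "max_ma": [100.0, 70.0, 150.0],
    "min_ma": [90.0, 80.0, 140.0],
    "lng": [10.0, 200.0, -50.0],
    "lat": [45.0, 0.0, -30.0],
})

report = quality_report(toy)
print(report)
```

On the real data, the same call would be quality_report(dinosaurs).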

# Number of unique Dinosaur names in the dataset
number_of_unique_names = len(dinosaurs['name'].drop_duplicates())
print(f'The dataset has {number_of_unique_names} different dinosaur names present in it.')

# Largest and smallest dinosaurs (every name tied for the maximum length is reported)
max_length = dinosaurs['length_m'].max()
min_length = dinosaurs['length_m'].min()
largest_dinosaurs = dinosaurs.loc[dinosaurs['length_m'] == max_length, 'name'].drop_duplicates()
smallest_dinosaur = dinosaurs.loc[dinosaurs['length_m'] == min_length, 'name'].iloc[0]
print(f'The largest dinosaur(s), at {max_length} meters: {", ".join(largest_dinosaurs)}. The smallest dinosaur is {smallest_dinosaur}, with a length of {min_length} meters.')

# Number of missing data points in the length_m column
missing_length = dinosaurs['length_m'].isnull().sum()
available_length = dinosaurs['length_m'].count()  # count() already ignores missing values
print(f'There are {missing_length} missing data points in the length_m column, with {available_length} available.')

# Finding the mean, median and standard deviation of the dinosaur length
measures_length = dinosaurs[['length_m']].agg(['mean', 'median', 'std'])
print(f'The mean, median and standard deviation of dinosaur length are as follows:\n{measures_length}')

# Plotting the distribution of Dinosaur length
plt.style.use('seaborn-v0_8-colorblind')
sns.histplot(x='length_m', data=dinosaurs, bins=20, element='bars', fill=True)
plt.xlabel('Dinosaur Length (meters)')
plt.title('Distribution of Dinosaur Length')
plt.show()
print('The histogram above is right-skewed, so missing lengths are better filled with the median than with the mean, which the long right tail pulls upward.')
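
The right skew read off the histogram can also be checked numerically. This is a minimal sketch under stated assumptions: skew_summary is a hypothetical helper and toy_lengths is made-up data, chosen only so that the example runs without the CSV. A mean well above the median, together with a positive Series.skew(), points to a right-skewed distribution.

```python
import pandas as pd


def skew_summary(lengths: pd.Series) -> dict:
    """Return mean, median and sample skewness of the non-missing values."""
    clean = lengths.dropna()
    return {
        "mean": clean.mean(),
        "median": clean.median(),
        "skew": clean.skew(),  # positive values indicate a longer right tail
    }


# Invented lengths for illustration only (one large value creates a right tail).
toy_lengths = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 30.0, None])

summary = skew_summary(toy_lengths)
print(summary)

# The single large value drags the mean up, while the median stays put.
print(summary["mean"] > summary["median"])
```

On the real data, the call would be skew_summary(dinosaurs['length_m']), run before the missing values are filled.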

# Fill missing data in length_m column
dinosaurs['length_m'] = dinosaurs['length_m'].fillna(dinosaurs['length_m'].median())
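
Filling every gap with the overall median pulls all types toward the same value, which may matter once lengths are compared by type. As a possible alternative, and not what this notebook does, the sketch below fills each gap with the median of the dinosaur's own type and falls back to the overall median. fill_length_by_type is a hypothetical helper and the toy rows are invented.

```python
import pandas as pd


def fill_length_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing length_m with the per-type median, then the overall median."""
    out = df.copy()
    # Median length of each row's own type, aligned to the original rows
    type_median = out.groupby("type")["length_m"].transform("median")
    # Overall median of the measured lengths, used as a fallback
    overall_median = out["length_m"].median()
    # Types with no measured length at all fall back to the overall median
    out["length_m"] = out["length_m"].fillna(type_median).fillna(overall_median)
    return out


# Toy rows invented for illustration only; they are NOT taken from the dataset.
toy = pd.DataFrame({
    "type": ["sauropod", "sauropod", "sauropod",
             "small theropod", "small theropod", "small theropod"],
    "length_m": [20.0, 24.0, None, 1.0, 3.0, None],
})

filled = fill_length_by_type(toy)
print(filled)
```

Whether this is preferable depends on how many types have enough measured lengths to give a stable median.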

# Boxplot of Dinosaur length
sns.boxplot(y='length_m', data=dinosaurs)
plt.xlabel('Dinosaurs')
plt.ylabel('Length (meters)')
plt.title('Boxplot of Dinosaur length', y=1.1)
plt.show()

The above is a boxplot of all dinosaur lengths after filling the missing data points with the median.

# Boxplot of the length of different Dinosaur types 
sns.catplot(y='type', x='length_m', data=dinosaurs, kind='box')
plt.xlabel('Length of Dinosaur (meters)')
plt.ylabel('Dinosaur Type')
plt.title('Boxplot of length for each dinosaur type', y=1.1)
plt.show()

The plot above shows that sauropods are the largest dinosaur type, with a median length of 21 meters, while small theropods are the smallest, with a median length of 2 meters and 5 outliers.
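
Medians and outlier counts read off a boxplot can be cross-checked in a table. The sketch below is one way to do that, assuming the plot uses the common 1.5 × IQR whisker rule (the seaborn default); length_stats_by_type and iqr_outliers are hypothetical helpers, and the toy rows are invented rather than taken from the dataset.

```python
import pandas as pd


def iqr_outliers(lengths: pd.Series) -> int:
    """Count values beyond 1.5 * IQR from the quartiles."""
    q1, q3 = lengths.quantile(0.25), lengths.quantile(0.75)
    iqr = q3 - q1
    outside = (lengths < q1 - 1.5 * iqr) | (lengths > q3 + 1.5 * iqr)
    return int(outside.sum())


def length_stats_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Median length and outlier count per type, largest median first."""
    grouped = df.groupby("type")["length_m"]
    stats = grouped.agg(median="median", outliers=iqr_outliers)
    return stats.sort_values("median", ascending=False)


# Toy rows invented for illustration only.
toy = pd.DataFrame({
    "type": ["sauropod"] * 4 + ["small theropod"] * 6,
    "length_m": [10.0, 12.0, 14.0, 16.0, 1.0, 2.0, 2.0, 2.0, 3.0, 12.0],
})

stats = length_stats_by_type(toy)
print(stats)
```

Calling length_stats_by_type(dinosaurs) on the real data would give the figures to compare against the plot.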

# Dinosaur type with most occurrences in the dataset and diet preference
g = sns.catplot(y='type', data=dinosaurs, kind='count', palette='Set1', hue='diet')
g.set_axis_labels('Count', 'Dinosaur Type') 
g.fig.suptitle('Frequency of Dinosaur types and diet.', y=1.05)  
plt.show()

print(f'The dinosaur type with the most occurrences in the dataset is {dinosaurs["type"].value_counts().idxmax()}, as shown in the chart above.\nMost dinosaurs are herbivorous. Omnivorous dinosaurs are the least common and are found only in the small theropod, large theropod and sauropod types. Carnivorous dinosaurs belong only to the small and large theropod types.')
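
The diet statements above are read from the chart; a cross-tabulation makes the same check explicit. This is a rough sketch, not part of the original analysis: diet_by_type is a hypothetical helper and the toy rows are invented so the block runs on its own.

```python
import pandas as pd


def diet_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Count records for every combination of dinosaur type and diet."""
    return pd.crosstab(df["type"], df["diet"])


# Toy rows invented for illustration only; they are NOT taken from the dataset.
toy = pd.DataFrame({
    "type": ["ornithopod", "ornithopod", "small theropod",
             "small theropod", "large theropod", "sauropod"],
    "diet": ["herbivorous", "herbivorous", "carnivorous",
             "omnivorous", "carnivorous", "herbivorous"],
})

counts = diet_by_type(toy)
print(counts)

# Types that contain at least one carnivorous record
carnivore_types = sorted(counts.index[counts["carnivorous"] > 0])
print(carnivore_types)
```

With the real data, diet_by_type(dinosaurs) would show directly which types contain each diet and how many records fall in each cell.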

