Everyone Can Learn Data Scholarship
1️⃣ Part 1 (Python) - Dinosaur data 🦕
💾 The data
A real dataset containing dinosaur records from the Paleobiology Database (source):
| Column name | Description |
|---|---|
| occurence_no | The original occurrence number from the Paleobiology Database. |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
| diet | The main diet (omnivorous, carnivorous, herbivorous). |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
| length_m | The maximum length, from head to tail, in meters. |
| max_ma | The age at which the first fossil records of the dinosaur were found, in million years. |
| min_ma | The age at which the last fossil records of the dinosaur were found, in million years. |
| region | The current region where the fossil record was found. |
| lng | The longitude where the fossil record was found. |
| lat | The latitude where the fossil record was found. |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
| family | The taxonomical family of the dinosaur (if known). |
The data was enriched with data from Wikipedia.
# Import the pandas, numpy & matplotlib packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
# Preview the dataframe
dinosaurs
1. Introduction
The objective of this analysis is to provide valuable insights into the dinosaur fossil records for the National Museum of Natural History. By delving into the data, I aim to answer key questions about the diversity, size, distribution, and trends of dinosaurs over time. This analysis will assist the museum in enhancing the quality of its database and furthering its research on these magnificent creatures.
# inspecting the Dataframe
dinosaurs.head()
dinosaurs.info()
dinosaurs.describe()
2. Analysis
2.1. Count of the different dinosaur names
To determine the diversity of dinosaurs in the dataset, I counted the unique values in the name column using the pandas method .nunique(). Because the name column usually holds the genus name, this gives an overview of how many distinct genera (rather than individual species) are represented in the fossil records.
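As a minimal sketch of what this count does, the toy table below (made-up rows, not the real dataset) contains a repeated name and a missing name. It assumes the default behaviour of .nunique(), which skips missing values, and uses .value_counts() to show how many records each name contributes.

```python
import pandas as pd

# Hypothetical toy records, not taken from the real dataset
toy = pd.DataFrame({"name": ["Allosaurus", "Allosaurus", "Stegosaurus", None]})

# nunique() counts distinct values and skips missing ones by default
distinct_names = toy["name"].nunique()

# value_counts() shows how many records each name contributes
records_per_name = toy["name"].value_counts()

print("Distinct names:", distinct_names)
print(records_per_name)
```

A name that appears in many records still counts once, so the result measures variety, not the number of fossils.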
# Counting the unique names
unique_names = dinosaurs['name'].nunique()
print("Number of different dinosaur names = " + str(unique_names))
2.2. Largest Dinosaur and Handling Missing Data
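To identify the largest dinosaur, I first converted the length column to numeric values (entries that cannot be parsed become missing) and took the maximum, which ignores missing values. I then filled the missing lengths with the median length so that later steps have no gaps; because the median cannot exceed the maximum, this does not change the result. The largest dinosaur in the dataset is Supersaurus with a length of 35 meters.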
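The order of these steps can be illustrated on a small made-up table. This is only a sketch: the names, the "unknown" entry, and the length_m_filled column are hypothetical, and idxmax() is used here as one possible way to fetch the row of the maximum.

```python
import numpy as np
import pandas as pd

# Hypothetical toy lengths; "unknown" stands in for a non-numeric entry
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "length_m": ["2.0", "unknown", "35", np.nan],
})

# Coerce to numbers; anything that cannot be parsed becomes NaN
toy["length_m"] = pd.to_numeric(toy["length_m"], errors="coerce")

# idxmax() skips NaN and returns the row label of the largest value
largest = toy.loc[toy["length_m"].idxmax()]

# Fill the gaps with the median of the known lengths, in a separate column
median_length = toy["length_m"].median()
toy["length_m_filled"] = toy["length_m"].fillna(median_length)

print(largest["name"], largest["length_m"])
print(toy)
```

Keeping the filled values in a separate column makes it easy to tell measured lengths from imputed ones in later steps.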
# Converting the length column to numeric to ensure accurate analysis
dinosaurs['length_m'] = pd.to_numeric(dinosaurs['length_m'], errors='coerce')
# Finding the maximum length
max_length = dinosaurs['length_m'].max()
# Getting the row with the maximum length
largest_dinosaur = dinosaurs[dinosaurs['length_m'] == max_length].iloc[0]
print("The largest dinosaur is: " + largest_dinosaur['name'])
# Handling the missing data
# Filling missing values with the median length
median_length = dinosaurs['length_m'].median()
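# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
dinosaurs['length_m'] = dinosaurs['length_m'].fillna(median_length)
print("Filled missing length values with the median length: " + str(median_length))
2.3. Most Common Dinosaur Type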
I analyzed the occurrences of each dinosaur type using the .value_counts() method and visualized the results with a bar chart. This allowed me to easily compare the prevalence of different dinosaur types. The most common dinosaur type as seen in the visualization is the ornithopod.
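The most common type can also be read directly from the counts instead of from the chart. The sketch below uses made-up rows and relies on .value_counts() excluding missing values by default; the share variable is an illustrative addition.

```python
import pandas as pd

# Hypothetical toy records, not taken from the real dataset
toy = pd.DataFrame(
    {"type": ["sauropod", "ornithopod", "ornithopod", "ceratopsian", None]}
)

# Counts per type, sorted from most to least frequent (missing values excluded)
type_counts = toy["type"].value_counts()

# Label with the highest count
most_common_type = type_counts.idxmax()

# Fraction of typed records that each type accounts for
share = type_counts / type_counts.sum()

print(most_common_type, type_counts[most_common_type])
print(share)
```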
# Counting occurrence of each dinosaur type and storing in a DataFrame
dinosaur_type = dinosaurs['type'].value_counts().reset_index()
dinosaur_type.columns = ['type', 'count']
# Creating a bar chart to visualize the count of each dinosaur type
# Bar chart set-up
plt.figure(figsize=(10,6))
bars = plt.bar(dinosaur_type['type'], dinosaur_type['count'], color='skyblue')
plt.xlabel('Dinosaur Type')
plt.ylabel('Count')
plt.title('Count of Each Dinosaur Type')
plt.xticks(rotation=45)
plt.show()
2.4. Dinosaur Size Over Time
To explore whether dinosaurs grew larger over time, I calculated the average age of each fossil record (the mean of max_ma and min_ma) and created a scatter plot of length against that age, with a linear trend line. Ages are given in millions of years before the present, so larger values on the x-axis are older. The plot indicates a slight trend of dinosaurs getting somewhat larger over time, but it is not strong or definitive: lengths remain relatively consistent across ages, with no clear indication of a significant size increase.
# converting the length and age to numeric to ensure accurate analysis
dinosaurs['length_m'] = pd.to_numeric(dinosaurs['length_m'], errors='coerce')
dinosaurs['max_ma'] = pd.to_numeric(dinosaurs['max_ma'], errors='coerce')
dinosaurs['min_ma'] = pd.to_numeric(dinosaurs['min_ma'], errors='coerce')
# calculating average age
dinosaurs['avg_ma'] = dinosaurs[['max_ma', 'min_ma']].mean(axis=1)
# drop rows with missing values for 'length_m' & 'avg_ma'
# (length_m was already median-filled in 2.2, so in practice only avg_ma can still be missing)
dinosaurs = dinosaurs.dropna(subset=['length_m', 'avg_ma'])
# Creating a scatter plot to show relation between dinosaur length and average age
# scatter plot set-up
plt.figure(figsize=(10,6))
scatter_plot = plt.scatter(dinosaurs['avg_ma'], dinosaurs['length_m'], alpha=0.5, color='red')
plt.xlabel('Average age (millions of years)')
plt.ylabel('Length (meters)')
plt.title('Dinosaur length vs Average age')
plt.grid(True)
# Adding a trend line to better visualize the relationship
z = np.polyfit(dinosaurs['avg_ma'], dinosaurs['length_m'], 1)
p = np.poly1d(z)
plt.plot(dinosaurs['avg_ma'], p(dinosaurs['avg_ma']), "b--")
plt.show()
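The strength and direction of the trend can also be put into numbers rather than judged by eye. The following is a minimal sketch on made-up values, assuming a straight-line fit and the Pearson correlation are adequate summaries; the toy table and the grew_over_time flag are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: age in millions of years ago, length in meters
toy = pd.DataFrame({
    "avg_ma": [200.0, 150.0, 100.0, 70.0],
    "length_m": [5.0, 6.0, 8.0, 9.0],
})

# Straight-line fit: length = slope * age + intercept
slope, intercept = np.polyfit(toy["avg_ma"], toy["length_m"], 1)

# Pearson correlation between age and length (between -1 and 1)
r = toy["avg_ma"].corr(toy["length_m"])

# Age counts backwards from the present, so a negative slope
# means that more recent records tend to be longer
grew_over_time = slope < 0

print("slope (m per million years):", slope)
print("correlation:", r)
print("larger in more recent records:", grew_over_time)
```

A correlation close to zero would support the reading that size stayed roughly constant, whatever the sign of the slope.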