1️⃣ Part 1 (Python) - Dinosaur data 🦕
📖 Background
You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.
💾 The data
You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):
Column name | Description |
---|---|
occurrence_no | The original occurrence number from the Paleobiology Database. |
name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
diet | The main diet (omnivorous, carnivorous, herbivorous). |
type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
length_m | The maximum length, from head to tail, in meters. |
max_ma | The age in which the first fossil records of the dinosaur were found, in million years. |
min_ma | The age in which the last fossil records of the dinosaur were found, in million years. |
region | The current region where the fossil record was found. |
lng | The longitude where the fossil record was found. |
lat | The latitude where the fossil record was found. |
class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
family | The taxonomical family of the dinosaur (if known). |
The data was enriched with data from Wikipedia.
Introduction
Welcome, everyone. Today, we are embarking on an exciting project that marries the rich world of paleontology with cutting-edge data analytics. Our focus is on the National Museum's dinosaur fossil record database. We will be cleaning, analyzing, and transforming this data to unlock valuable insights that can enhance our understanding and decision-making processes.
Imagine our database as a treasure trove, filled with meticulously collected dinosaur fossils from various expeditions. Each fossil holds critical information about the dinosaurs' size, diet, and the era they lived in. However, like any extensive collection, our database has some gaps and inconsistencies. These incomplete entries are akin to missing pieces of a complex puzzle.
Our mission is to meticulously clean and organize this data, ensuring its accuracy and completeness. By doing so, we will turn this database into a robust tool that can provide us with profound insights and drive strategic decisions. This process will enhance the museum's value and impact, highlighting the power of data analytics in uncovering the hidden stories within our historical records.
Data Wrangling and Cleaning
Just like restoring a precious artifact, data cleaning requires a delicate touch. Here's our toolbox of techniques, keeping in mind the unique nature of museum data:
# Import the packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv', na_values=' ')
# Count unique dinosaur names
n_unique_names = dinosaurs['name'].nunique()
print(f"There are {n_unique_names} unique dinosaur names present in the data.")
# Preview the dataframe
dinosaurs
Exploratory Data Analysis
# Display basic information and check for missing values
print('Basic Information:')
dinosaurs.info()
print('\nMissing Values:')
dinosaurs.isnull().sum()
Missing Data Percentage:
Several key columns (occurrence_no, name, max_ma, min_ma, lng, and lat) have no missing values. The remaining columns do, with percentages taken over the 4951 rows in the dataset:
- diet (1355 missing values) - Approximately 27.4% missing data
- type (1355 missing values) - Approximately 27.4% missing data
- length_m (1383 missing values) - Approximately 27.9% missing data
- region (42 missing values) - Approximately 0.8% missing data
- family (1457 missing values) - Approximately 29.4% missing data
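The counts and percentages above can be derived in one step rather than by hand. The following is a minimal sketch on a small made-up dataframe (the `toy` frame and its values are assumptions for illustration, not the museum data); the same two calls should work on `dinosaurs`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dinosaurs dataframe (illustrative values only)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "diet": ["herbivorous", np.nan, "carnivorous", np.nan],
    "length_m": [7.0, np.nan, 3.0, 12.0],
})

# isnull().sum() counts missing cells per column;
# isnull().mean() gives the share of missing cells per column
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
})
print(missing)
```

Computing the percentage from the data keeps the reported figures consistent with the row count if the file is later updated.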
Challenges with Archaeological Data:
Data about ancient times and archaeological objects can be inherently incomplete:
- Incomplete Records: Ancient cultures may not have documented everything in detail.
- Degradation and Loss: Artifacts may be damaged or lost over time.
- Unidentified Objects: The purpose or function of some objects might be unknown.
Fossil records are inherently incomplete for similar reasons:
- Incomplete Fossils: Not all parts of a dinosaur might be recovered during excavation.
- Indeterminate Features: Some fossils may not have clear features to determine diet or type.
- Limited Knowledge: Our understanding of certain dinosaur species might be incomplete.
Checking Data Distribution:
Data distribution refers to how the data points are spread out. We'll examine the distribution of features like "length" and "age" to identify any potential skewness (lopsidedness) or unexpected patterns. This helps us choose appropriate statistical methods for analysis.
Balancing Data (if applicable):
The dataset provided contains information about various dinosaur fossils, including details such as their names, diets, types, sizes, and locations where they were found. However, there are some patterns and potential issues that could lead to distortion in the data analysis process. Let's explore these in detail:
Statistical Analysis of Dataset:
The dataset provided includes several variables related to dinosaur fossil records, and it presents a statistical summary for each variable. Let's break down the summary statistics and explain the potential distortions observed in the data.
print('Statistical Analysis of Dataset:')
dinosaurs.describe()
Distortions Explained in Dinosaur Fossil Record Data:
This analysis highlights potential distortions in the dinosaur fossil record data:
Missing Data:
- The "length_m" variable has a lower count (3568) compared to others (4951), indicating missing data points. This can bias results if the missing data isn't randomly distributed. We'll need to investigate the reason for missing values and determine if imputation techniques are appropriate.
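One rough way to check whether missing values are randomly distributed is to compare the share of missing values across groups. The sketch below uses a made-up dataframe and groups by region purely as an example; which grouping column is most informative for the real data is an open question.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dinosaurs dataframe (illustrative values only)
toy = pd.DataFrame({
    "region": ["X", "X", "Y", "Y", "Y", "Z"],
    "length_m": [5.0, np.nan, np.nan, np.nan, 8.0, 2.0],
})

# Share of rows with a missing length_m within each region
missing_by_region = toy["length_m"].isnull().groupby(toy["region"]).mean()
print(missing_by_region)
```

If the shares differ strongly between groups, the missingness is probably not random, and simple imputation (for example, filling with the overall mean) could bias the results.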
Skewness and Outliers:
- Length (meters): The mean (8.21 meters) is higher than the median (6.7 meters), suggesting a positive skew (a long tail toward larger values). The maximum value (35 meters) is far above the third quartile (10 meters), indicating outliers (data points far from the typical range). We'll need to carefully examine these outliers with paleontologists to determine if they represent genuine discoveries or data errors.
- Maximum Age (million years) and Minimum Age (million years): Both span a wide range (70.6 to 252.17 and 66 to 247.2, respectively) with a high standard deviation (around 45). The mean is higher than the median for both, which suggests a right skew (a long tail toward older ages). We'll need to consider this skew when choosing statistical methods for analysis.
- Longitude: The range (-153.25 to 565) extends beyond what is geographically possible, since valid longitude values run from -180 to 180. We'll need to investigate and correct the out-of-range values, as they likely represent data entry errors.
- Latitude: The minimum (-84.33) and maximum (78.10) values fall within the expected range of -90 to 90. The high standard deviation (23.96) reflects a wide geographical spread, which is reasonable considering global fossil collection.
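The skew and outlier observations above can be checked numerically. The sketch below uses a short made-up series of lengths and the common 1.5 × IQR rule to flag outliers; that rule is an assumption chosen for illustration, not a threshold taken from the museum's workflow.

```python
import pandas as pd

# Made-up lengths in meters, with one very large value
lengths = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 35.0], name="length_m")

# Positive skewness means a long tail toward larger values
skewness = lengths.skew()

# Flag values outside 1.5 * IQR beyond the quartiles
q1, q3 = lengths.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = lengths[(lengths < lower_fence) | (lengths > upper_fence)]

print(f"skewness: {skewness:.2f}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print(outliers)
```

A flagged value is only a candidate for review. A 35-meter sauropod may well be genuine, so flagged rows should be inspected rather than dropped automatically.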
Geographical Data Anomalies:
- The extreme value (565) in longitude is a clear anomaly and likely an error. We'll correct this value to ensure accurate analysis.
- Latitude values show high variability, but this is likely due to the global distribution of fossil finds. We'll keep this variability in mind when interpreting geographical patterns.
By addressing these distortions, we can ensure the data accurately reflects the dinosaur fossil record and provides reliable insights for further analysis.
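The coordinate check described above can be sketched as a simple range test. The dataframe below is made up for illustration, and replacing out-of-range longitudes with NaN is only one cautious option, assumed here because the true coordinate cannot be recovered from the value alone.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dinosaurs dataframe (illustrative values only)
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "lng": [-153.25, 565.0, 12.5],
    "lat": [-84.33, 40.0, 78.10],
})

# Valid geographic ranges: longitude in [-180, 180], latitude in [-90, 90]
valid_lng = toy["lng"].between(-180, 180)
valid_lat = toy["lat"].between(-90, 90)

# Rows with at least one impossible coordinate
invalid_rows = toy[~(valid_lng & valid_lat)]
print(invalid_rows)

# Mark impossible longitudes as missing instead of guessing a correction
cleaned = toy.copy()
cleaned.loc[~valid_lng, "lng"] = np.nan
```

Working on a copy keeps the original values available in case the source record can later be looked up and corrected properly.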
# Histograms to show data distributions
dinosaurs.hist(figsize=(15, 10))
plt.suptitle('Histograms of Features')
plt.show()
Duplicate Entries: