PART 1: Understanding Dinosaur Fossil Data
1.1 Background
You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.
1.2 Objectives
The objective of this research is to explore the fossil records, find insights and advise the museum on the quality of their data.
Below are some of the questions my analysis aims to answer:
1. How many different dinosaur names are present in the data?
2. Which was the largest dinosaur? What about missing data in the dataset?
3. What dinosaur type has the most occurrences in this dataset?
4. Did dinosaurs get bigger over time?
5. Create an interactive map showing each record using AI.
6. Any other insights found during the analysis.
1.3 Introduction
I am currently an intern at a national museum of natural history that recently created a database containing all dinosaur records from past field campaigns. My task is to explore the dataset, derive interesting insights, and advise the management on the quality of their data.
1.4 Data Description
The data comes from the Paleobiology Database (source), enriched with data from Wikipedia. The following table describes the key variables:
Column name | Description |
---|---|
occurence_no | The original occurrence number from the Paleobiology Database. |
name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
diet | The main diet (omnivorous, carnivorous, herbivorous). |
type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
length_m | The maximum length, from head to tail, in meters. |
max_ma | The age at which the first fossil records of the dinosaur were found, in millions of years. |
min_ma | The age at which the last fossil records of the dinosaur were found, in millions of years. |
region | The current region where the fossil record was found. |
lng | The longitude where the fossil record was found. |
lat | The latitude where the fossil record was found. |
class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
family | The taxonomical family of the dinosaur (if known). |
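The column descriptions above imply a few constraints that can serve as quick data-quality checks. The following is a minimal sketch, not the museum's validation procedure. It assumes that `max_ma` (first appearance) should never be smaller than `min_ma` (last appearance), and that `lng` and `lat` are in decimal degrees. The `toy` data frame and its values are made up for illustration, and the sketch uses base R so it runs without the CSV file.

```r
# Toy rows standing in for the real file; all values are made up for illustration.
toy <- data.frame(
  name   = c("A_saurus", "B_saurus", "C_saurus"),
  max_ma = c(150, 70, 80),
  min_ma = c(145, 66, 90),       # third row is deliberately inconsistent
  lng    = c(-110.5, 200, 12.3), # second row is deliberately out of range
  lat    = c(40.1, 10, -45),
  stringsAsFactors = FALSE
)

# Flag rows that break the constraints implied by the column descriptions.
# Assumption: max_ma is the older bound, so max_ma >= min_ma.
checks <- data.frame(
  name         = toy$name,
  age_order_ok = toy$max_ma >= toy$min_ma,
  lng_ok       = toy$lng >= -180 & toy$lng <= 180,
  lat_ok       = toy$lat >= -90 & toy$lat <= 90,
  stringsAsFactors = FALSE
)
checks$all_ok <- checks$age_order_ok & checks$lng_ok & checks$lat_ok

print(checks)
```

On real data, rows with missing ages or coordinates would yield `NA` in these flags and would need to be handled separately.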
1.5 Exploratory Data Analysis
To better understand the data, we begin by importing the relevant R packages, loading the data into a data frame, and previewing it. We also check for missing values and compute basic summary statistics.
library(tidyverse)
library(dplyr)
library(ggplot2)
library(leaflet)
# Load the data
dinosaurs <- read_csv('data/dinosaurs.csv', show_col_types = FALSE)
# Preview the first rows of the data
head(dinosaurs)
# Summary statistics per column (numeric columns also report their NA count)
summary(dinosaurs)
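A compact overview of each column's data type and number of non-missing values can complement the summary statistics. The sketch below is one possible way to build such an overview in base R. The `toy` data frame is a made-up stand-in for `dinosaurs`, and the `overview` name is introduced only for this example.

```r
# Made-up stand-in for the real data, with a few missing values.
toy <- data.frame(
  name     = c("A_saurus", "B_saurus", "C_saurus"),
  diet     = c("herbivorous", NA, "carnivorous"),
  length_m = c(12.5, NA, 3),
  stringsAsFactors = FALSE
)

# One row per column: its name, its class, and how many values are present.
overview <- data.frame(
  column      = names(toy),
  type        = unname(vapply(toy, function(x) class(x)[1], character(1))),
  non_missing = unname(colSums(!is.na(toy))),
  stringsAsFactors = FALSE
)

print(overview)
```

Replacing `toy` with `dinosaurs` should give the same kind of overview for the full dataset.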
1.6 Main Analysis
Time to delve deeper into our data.
- How many different dinosaur names are present in the data?
There are 1042 unique dinosaur names in the dataset.
# number of different dinosaur names present in the data
num_unique_name <- dinosaurs %>%
  distinct(name) %>%
  nrow()
num_unique_name
print(paste("The number of unique dinosaur names in the dataset is", num_unique_name))
- Which was the largest dinosaur? What about missing data in the dataset?
The two largest dinosaurs in our dataset, each with a length of 35 meters, are:
- Supersaurus
- Argentinosaurus
# Find the largest dinosaur in length
largest_dinosaur <- dinosaurs %>%
  filter(length_m == max(length_m, na.rm = TRUE)) %>%
  # keep one row per name so repeated occurrences are not listed twice
  distinct(name, length_m)
largest_dinosaur
2.2 Addressing Missing Data in Our Dataset
Our dataset contains 5592 missing values.
What can cause missing data?
- Lack of information
- Data entry errors
- Intentional omissions by participants
- Equipment malfunction
- Data loss
- Privacy concerns
The missing data percentage of the total data set for each column is as follows:
- family - 29%
- length_m - 28%
- diet - 27%
- type - 27%
- region - 0.8%
Missing data can affect our analysis in the following ways:
- Missing data can skew the distribution of our data.
- Missing data can lead to poor insights, causing us to draw wrong conclusions and make bad decisions.
- Missing data makes our data less representative of the population.
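# R code to find missing data
# Total number of missing values in the whole dataset
missing_data <- sum(is.na(dinosaurs))
print(missing_data)
# Number of missing values per column
missing_values <- colSums(is.na(dinosaurs))
print(missing_values)
# Percentage of missing values per column
missing_percentage <- colMeans(is.na(dinosaurs)) * 100
print(missing_percentage)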
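The per-column percentages for diet, type, and length_m are very close to one another, which may mean that the same records lack all three values. Whether that holds for the real data is an assumption that would need checking. The sketch below shows one way to test it in base R on a made-up `toy` data frame; the names `core_cols`, `complete_share`, and `n_all_core_missing` are introduced only for this example.

```r
# Made-up stand-in for the real data.
toy <- data.frame(
  name     = c("A_saurus", "B_saurus", "C_saurus", "D_saurus"),
  diet     = c("herbivorous", NA, "carnivorous", NA),
  type     = c("sauropod", NA, "small theropod", NA),
  length_m = c(30, NA, 2, NA),
  family   = c("FamilyA", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Percentage of rows with no missing value in any column.
complete_share <- mean(complete.cases(toy)) * 100

# Rows where diet, type and length_m are all missing at the same time.
core_cols <- c("diet", "type", "length_m")
all_core_missing <- rowSums(is.na(toy[core_cols])) == length(core_cols)
n_all_core_missing <- sum(all_core_missing)

print(complete_share)
print(n_all_core_missing)
```

If nearly all rows with a missing `length_m` also lack `diet` and `type`, the gaps probably come from names that could not be matched to the enrichment source rather than from independent data entry problems. That interpretation is a hypothesis, not a finding.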
3. Which dinosaur type appears most frequently in this dataset?
Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type.
3.1 Which dinosaur type appears most frequently in the dataset?
The most frequently occurring dinosaur type in the dataset is the ornithopod. Ornithopods were a group of ornithischian dinosaurs characterized by their bipedal stance, meaning they walked on two legs. They were one of the most successful herbivorous dinosaur lineages, existing from the Late Triassic Period to the Late Cretaceous Period (approximately 229 million to 65.5 million years ago). Ornithopods were similar to present-day ruminants like cattle and deer; they had horny beaks for cropping vegetation and molar-like cheek teeth for grinding food. The Ornithopoda group included several subgroups, such as Fabrosauridae, Heterodontosauridae, Hypsilophodontidae, Iguanodontidae, and Hadrosauridae.
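The count of records per type behind this statement can be tabulated as in the sketch below. It is a minimal base R example on a made-up `toy` data frame, not the output for the real dataset. `table()` drops missing types by default, so records without a type are not counted.

```r
# Made-up stand-in for the type column of the real data.
toy <- data.frame(
  type = c("ornithopod", "sauropod", "ornithopod", NA,
           "large theropod", "ornithopod", "sauropod"),
  stringsAsFactors = FALSE
)

# Number of records per type, most frequent first (NA values are excluded).
type_counts <- sort(table(toy$type), decreasing = TRUE)
print(type_counts)

# The first entry after sorting is the most frequent type.
most_common_type <- names(type_counts)[1]
print(most_common_type)
```

Passing `type_counts` to `barplot()` would turn the same table into a simple bar chart.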