Understanding Dinosaur Fossils and the Film Industry, by Orabueze Blessing

PART 1: Understanding Dinosaur Fossil Data

1.1 Background

You're applying for a summer internship at a national museum of natural history. The museum recently created a database containing all dinosaur records from past field campaigns. Your job is to dive into the fossil records, find some interesting insights, and advise the museum on the quality of the data.

1.2 Objectives

The objective of this research is to explore the fossil records, find insights and advise the museum on the quality of their data.

Below are some of the questions my analysis aims to answer:

1. How many different dinosaur names are present in the data?

2. Which was the largest dinosaur? What about missing data in the dataset?

3. What dinosaur type has the most occurrences in this dataset?

4. Did dinosaurs get bigger over time?

5. Create an interactive map showing each record using AI.

6. Any other insights found during the analysis.

1.3 Introduction

I am currently an intern at a national museum that recently created a database containing all dinosaur records from past field campaigns. I am tasked with exploring the dataset, deriving interesting insights, and advising management on the quality of the data.

1.4 Data Description

The data comes from the Paleobiology Database and has been enriched with data from Wikipedia. The following table describes the key variables:

Column name	Description
occurence_no	The original occurrence number from the Paleobiology Database.
name	The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil).
diet	The main diet (omnivorous, carnivorous, herbivorous).
type	The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur).
length_m	The maximum length, from head to tail, in meters.
max_ma	The age at which the first fossil records of the dinosaur were found, in million years.
min_ma	The age at which the last fossil records of the dinosaur were found, in million years.
region	The current region where the fossil record was found.
lng	The longitude where the fossil record was found.
lat	The latitude where the fossil record was found.
class	The taxonomical class of the dinosaur (Saurischia or Ornithischia).
family	The taxonomical family of the dinosaur (if known).
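The column descriptions above imply a few consistency rules that can be checked directly, which is one way to start assessing data quality. The following is a minimal sketch, assuming that max_ma should never be smaller than min_ma and that coordinates fall within the usual longitude and latitude ranges. The data frame `dino_sample` is a hypothetical stand-in with made-up rows, not real records.

```r
library(dplyr)

# Hypothetical sample rows that mimic the documented columns (not real records)
dino_sample <- data.frame(
  name   = c("A", "B", "C"),
  max_ma = c(150, 70, 100),
  min_ma = c(145, 72, 90),
  lng    = c(10.5, -200, 30),
  lat    = c(45, 12, 95)
)

# Flag rows that break the rules implied by the column descriptions
quality_flags <- dino_sample %>%
  mutate(
    bad_age   = max_ma < min_ma,                # first record should be the older one
    bad_coord = abs(lng) > 180 | abs(lat) > 90  # outside valid coordinate ranges
  ) %>%
  filter(bad_age | bad_coord)

quality_flags
```

Running the same pipeline on the real `dinosaurs` data frame would list any records that deserve a closer look before further analysis.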

1.5 Exploratory Data Analysis

To better understand the data, we begin by importing the relevant R packages, loading the data into a data frame, and previewing it. We also check for missing values and compute basic statistics.

library(tidyverse)
library(dplyr)
library(ggplot2)
library(leaflet)

# Load the data
dinosaurs <- read_csv('data/dinosaurs.csv', show_col_types = FALSE)

# Preview the first rows of the data
head(dinosaurs)

# Show each column's data type and a sample of its values
glimpse(dinosaurs)

# Summary statistics per column (numeric summaries include the NA count)
summary(dinosaurs)

1.6 Main Analysis

Time to delve deeper into our data.

How many different dinosaur names are present in the data?

There are 1042 unique dinosaur names in the dataset.

# number of different dinosaur names present in the data 
num_unique_name <- dinosaurs %>%
  distinct(name) %>%
  nrow()
num_unique_name
print(paste("The number of unique dinosaur names in the dataset is", num_unique_name))

Which was the largest dinosaur? What about missing data in the dataset?

The two largest dinosaurs in our dataset with a length of 35 meters are:

Supersaurus
Argentinosaurus

# Find the largest dinosaur(s) by length, listing each name only once
largest_dinosaur <- dinosaurs %>%
  filter(length_m == max(length_m, na.rm = TRUE)) %>%
  distinct(name, length_m)

largest_dinosaur

2.2 Addressing Missing Data in our dataset

Our dataset contains 5592 missing data points.

What can cause missing data?

Lack of information
Data entry errors
Intentional omissions by participants
Equipment malfunction
Data loss
Privacy concerns

The missing data percentage of the total data set for each column is as follows:

family - 29%
length_m - 28%
diet - 27%
type - 27%
region - 0.8%

Missing data can affect our analysis in the following ways:

Missing data can skew the distribution of our data.
Missing data can lead to poor insights, causing us to draw wrong conclusions and make bad decisions.
Missing data makes our data less representative of the population.

# R code to find missing data
missing_data <- sum(is.na(dinosaurs))
print(missing_data)
missing_values <- colSums(is.na(dinosaurs))
print(missing_values)
missing_percentage <- colMeans(is.na(dinosaurs)) * 100
print(missing_percentage)
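The per-column percentages are easier to read when sorted, and similar percentages (such as those for diet and type) raise the question of whether the same rows are affected. The sketch below shows one possible way to check both. It uses a hypothetical toy data frame `toy` so that it runs on its own; in the real analysis, `dinosaurs` would take its place.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; in the real analysis `dinosaurs` would take its place
toy <- data.frame(
  name     = c("A", "B", "C", "D"),
  diet     = c("herbivorous", NA, "carnivorous", NA),
  type     = c("sauropod", NA, "large theropod", NA),
  length_m = c(12, NA, NA, NA)
)

# Percentage of missing values per column, largest first
missing_summary <- toy %>%
  summarise(across(everything(), ~ mean(is.na(.x)) * 100)) %>%
  pivot_longer(everything(), names_to = "column", values_to = "pct_missing") %>%
  arrange(desc(pct_missing))

# Share of rows where diet and type are either both present or both missing
same_missing_pattern <- mean(is.na(toy$diet) == is.na(toy$type))

missing_summary
same_missing_pattern
```

A value of `same_missing_pattern` close to 1 would suggest that the two columns are missing together, which may point to a single upstream cause (for example, records that could not be matched during enrichment) rather than independent gaps.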

3. Which dinosaur type appears most frequently in this dataset?

Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type.

3.1 Which dinosaur type appears most frequently in the dataset?

The most frequently occurring dinosaur type in the dataset is the ornithopod. Ornithopods were a group of ornithischian dinosaurs characterized by their bipedal stance, meaning they walked on two legs. They were one of the most successful herbivorous dinosaur lineages, existing from the Late Triassic Period to the Late Cretaceous Period (approximately 229 million to 65.5 million years ago). Ornithopods were similar to present-day ruminants like cattle and deer: they had horny beaks for cropping vegetation and molar-like cheek teeth for grinding food. The Ornithopoda group included several subgroups, such as Fabrosauridae, Heterodontosauridae, Hypsilophodontidae, Iguanodontidae, and Hadrosauridae.
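The number of records per type can be tabulated and plotted along the following lines. This is a minimal sketch rather than the original analysis code: the toy data frame `toy` and the column name `n_records` are assumptions made for illustration, and records with a missing type are dropped before counting.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for the `type` column of `dinosaurs`
toy <- data.frame(
  type = c("ornithopod", "sauropod", "ornithopod", NA,
           "large theropod", "ornithopod")
)

# Count records per type, most frequent first, ignoring missing types
type_counts <- toy %>%
  filter(!is.na(type)) %>%
  count(type, sort = TRUE, name = "n_records")

# Horizontal bar chart with the most frequent type at the top
type_plot <- ggplot(type_counts,
                    aes(x = reorder(type, n_records), y = n_records)) +
  geom_col() +
  coord_flip() +
  labs(x = "Dinosaur type", y = "Number of records")

type_counts
type_plot
```

Replacing `toy` with `dinosaurs` would give the table and bar chart for the full dataset.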
