Understanding Dinosaur's Fossils and the Film Industry By Orabueze Blessing
  • PART 1: Understanding Dinosaurs Fossil Data

    1.1 Background

    You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.

    1.2 Objectives

    The objective of this research is to explore the fossil records, find insights and advise the museum on the quality of their data.

    Below are some of the questions my analysis aims to answer:

    1. How many different dinosaur names are present in the data?

    2. Which was the largest dinosaur? What about missing data in the dataset?

    3. What dinosaur type has the most occurrences in this dataset?

    4. Did dinosaurs get bigger over time?

    5. Create an interactive map showing each record using AI.

    6. Any other insights found during the analysis.

    1.3 Introduction

    I am currently an intern at a national museum that recently created a database containing all dinosaur records from past field campaigns. I am tasked with exploring the dataset, deriving interesting insights, and advising management on the quality of the data.

    1.4 Data Description

    The data comes from the Paleobiology Database (source), enriched with data from Wikipedia. The following table describes the key variables:

    | Column name | Description |
    | --- | --- |
    | occurence_no | The original occurrence number from the Paleobiology Database. |
    | name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
    | diet | The main diet (omnivorous, carnivorous, herbivorous). |
    | type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
    | length_m | The maximum length, from head to tail, in meters. |
    | max_ma | The age of the earliest fossil records of the dinosaur, in millions of years. |
    | min_ma | The age of the latest fossil records of the dinosaur, in millions of years. |
    | region | The current region where the fossil record was found. |
    | lng | The longitude where the fossil record was found. |
    | lat | The latitude where the fossil record was found. |
    | class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
    | family | The taxonomical family of the dinosaur (if known). |
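
    Since part of the task is to advise on data quality, the column descriptions above suggest a few simple sanity checks. The sketch below is only an illustration: `check_ranges` is a hypothetical helper, and the rules it applies (latitude within -90 to 90, longitude within -180 to 180, `max_ma` not smaller than `min_ma`, positive lengths) are assumptions derived from the descriptions, not checks specified by the museum.

    ```r
    library(dplyr)

    # Hypothetical helper: count records that violate basic plausibility rules.
    # Assumes the columns lat, lng, max_ma, min_ma and length_m described above.
    check_ranges <- function(data) {
      data %>%
        summarise(
          bad_lat    = sum(lat < -90 | lat > 90, na.rm = TRUE),     # latitude outside valid range
          bad_lng    = sum(lng < -180 | lng > 180, na.rm = TRUE),   # longitude outside valid range
          bad_age    = sum(max_ma < min_ma, na.rm = TRUE),          # earliest age younger than latest age
          bad_length = sum(length_m <= 0, na.rm = TRUE)             # non-positive length
        )
    }

    # Usage, once the data has been loaded into `dinosaurs`:
    # check_ranges(dinosaurs)
    ```

    Missing values are ignored here (`na.rm = TRUE`), so the counts only reflect values that are present but implausible.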

    1.5 Exploratory Data Analysis

    To better understand the data, we begin by importing the relevant R packages, loading the data into a data frame, and previewing it. We also check for missing values and compute basic summary statistics.

    library(tidyverse)  # attaches dplyr, ggplot2 and readr, among others
    library(leaflet)
    
    # Load the data
    dinosaurs <- read_csv('data/dinosaurs.csv', show_col_types = FALSE)
    
    # Preview the first rows
    head(dinosaurs)
    
    # Show each column's data type and first values
    glimpse(dinosaurs)
    
    # Summary statistics per column, including NA counts for numeric columns
    summary(dinosaurs)

    1.6 Main Analysis

    Time to delve deeper into our data.

    1. How many different dinosaur names are present in the data?

    There are 1042 unique dinosaur names in the dataset.

    # Count the distinct dinosaur names in the data
    num_unique_name <- dinosaurs %>%
      distinct(name) %>%
      nrow()
    print(paste("The number of unique dinosaur names in the dataset is", num_unique_name))

    2. Which was the largest dinosaur? What about missing data in the dataset?

    The two largest dinosaurs in our dataset with a length of 35 meters are:

    1. Supersaurus
    2. Argentinosaurus

    # Find the longest dinosaur(s), keeping one row per name
    largest_dinosaur <- dinosaurs %>%
      filter(length_m == max(length_m, na.rm = TRUE)) %>%
      distinct(name, length_m)
    
    largest_dinosaur
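
    To put the maximum in context, it can help to look at the few longest names rather than only the top one. The following is a minimal sketch, assuming the `name` and `length_m` columns from section 1.4; `longest_dinosaurs` and its `n` argument are hypothetical names introduced for illustration.

    ```r
    library(dplyr)

    # Hypothetical helper: the n longest dinosaurs, one row per name.
    longest_dinosaurs <- function(data, n = 5) {
      data %>%
        filter(!is.na(length_m)) %>%                  # drop records with no recorded length
        distinct(name, length_m) %>%                  # collapse repeated occurrences of a name
        slice_max(length_m, n = n, with_ties = TRUE)  # keep ties, so the result may exceed n rows
    }

    # Usage, assuming `dinosaurs` has been loaded as in section 1.5:
    # longest_dinosaurs(dinosaurs, n = 5)
    ```

    Because ties are kept, two names sharing the maximum length both appear even when `n = 1`.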

    2.2 Addressing Missing Data in our dataset

    Our dataset contains 5592 missing data points.

    What can cause missing data?

    • Lack of information
    • Data entry error
    • Intentional omissions by participants
    • Equipment malfunction
    • Data loss
    • Privacy concerns

    The missing data percentage of the total data set for each column is as follows:

    • family - 29%
    • length_m - 28%
    • diet - 27%
    • type - 27%
    • region - 0.8%

    Missing data can affect our analysis in the following ways:

    1. Missing data can skew the distribution of our data.
    2. Missing data can lead to poor insights, causing us to draw wrong conclusions and make bad decisions.
    3. Missing data makes our data less representative of the population.

    # Total number of missing values in the dataset
    missing_data <- sum(is.na(dinosaurs))
    print(missing_data)
    
    # Number of missing values per column
    missing_values <- colSums(is.na(dinosaurs))
    print(missing_values)
    
    # Percentage of missing values per column
    missing_percentage <- colMeans(is.na(dinosaurs)) * 100
    print(missing_percentage)
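
    The same per-column figures can also be collected into a single sorted table, which may be easier to include in a report. This is a sketch rather than the method used above; `missing_summary` is a hypothetical helper, and it relies on `tidyr::pivot_longer`, which is attached with the tidyverse.

    ```r
    library(dplyr)
    library(tidyr)

    # Hypothetical helper: one row per column with its missing count and percentage.
    missing_summary <- function(data) {
      n_rows <- nrow(data)
      data %>%
        summarise(across(everything(), ~ sum(is.na(.x)))) %>%   # NA count per column
        pivot_longer(everything(),
                     names_to = "column",
                     values_to = "n_missing") %>%                # reshape to long format
        mutate(pct_missing = round(100 * n_missing / n_rows, 1)) %>%
        arrange(desc(n_missing))                                 # worst columns first
    }

    # Usage, assuming `dinosaurs` has been loaded as in section 1.5:
    # missing_summary(dinosaurs)
    ```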

    3. Which dinosaur type appears most frequently in this dataset?

    Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type.

    3.1 Which dinosaur type appears most frequently in the dataset?

    The most frequently occurring dinosaur type in the dataset is the ornithopod. Ornithopods were a group of ornithischian dinosaurs characterized by their bipedal stance, meaning they walked on two legs. They were one of the most successful herbivorous dinosaur lineages, existing from the Late Triassic Period to the Late Cretaceous Period (approximately 229 million to 65.5 million years ago). Ornithopods were similar to present-day ruminants like cattle and deer; they had horny beaks for cropping vegetation and molar-like cheek teeth for grinding food. The Ornithopoda group included several subgroups, such as Fabrosauridae, Heterodontosauridae, Hypsilophodontidae, Iguanodontidae, and Hadrosauridae.
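
    One way to produce the table and bar chart for this question is sketched below. It assumes the `type` column from section 1.4 and counts fossil records (rows), not unique dinosaur names; `count_by_type` and `plot_type_counts` are hypothetical helpers, and records with a missing type are dropped before counting.

    ```r
    library(dplyr)
    library(ggplot2)

    # Hypothetical helper: number of records per dinosaur type, most frequent first.
    count_by_type <- function(data) {
      data %>%
        filter(!is.na(type)) %>%                      # exclude records with unknown type
        count(type, sort = TRUE, name = "n_records")
    }

    # Hypothetical helper: horizontal bar chart of the counts.
    plot_type_counts <- function(type_counts) {
      ggplot(type_counts, aes(x = reorder(type, n_records), y = n_records)) +
        geom_col() +
        coord_flip() +                                # horizontal bars keep type labels readable
        labs(x = "Dinosaur type", y = "Number of records")
    }

    # Usage, assuming `dinosaurs` has been loaded as in section 1.5:
    # type_counts <- count_by_type(dinosaurs)
    # type_counts
    # plot_type_counts(type_counts)
    ```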