Everyone Can Learn Data Scholarship

1️⃣ Part 1 (Python) - Dinosaur data 🦕

🦕 Challenge 1: You're applying for a summer internship at the national museum of natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find interesting insights, and advise the museum on the quality of the data.

Phase 0. Data Gathering

First, let us begin by acquiring the necessary tools for our analysis and importing our dataset.

# Import necessary libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import folium

# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')

# Create a copy of the DataFrame so the original stays untouched
ds = dinosaurs.copy()

We then familiarize ourselves with the data while keeping in mind the aim of our analysis: diving into the fossil records to find interesting insights, and advising the museum on the quality of the data.

During this process, we derive some interesting prompts/questions which could help propel our analysis forward (presented in the Other Insights section).

# Preview the first 15 rows of the DataFrame to get a feel for its content
ds.head(15)

Phase 1 - Data Quality Assessment

Data quality assessment is a crucial step in ensuring that our data is fit for use in our specific context.

In this phase, we assess the structure and content of our data. We check for data quality and tidiness issues, beginning with incorrect data types, missing or incomplete data, duplicate data, and inaccurate data, and then move on to other issues as they surface.
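The checks listed above can be sketched in a few lines of pandas. This is a minimal illustration rather than the method used in this analysis: the helper name `quality_summary` and the toy records are hypothetical and are not taken from dinosaurs.csv.

```python
import pandas as pd

def quality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: its dtype, missing-value count, and unique-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a unique value
    })

# Hypothetical toy records, used only to illustrate the checks
toy = pd.DataFrame({
    "name": ["A", "B", "B", None],
    "length_m": [1.5, 2.0, 2.0, None],
})

summary = quality_summary(toy)

# Rows that exactly repeat an earlier row
n_duplicate_rows = int(toy.duplicated().sum())

print(summary)
print("duplicate rows:", n_duplicate_rows)
```

Running the same helper on `ds` would give a one-table overview of data types, missing values, and cardinality before each issue is examined in turn.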

Let us begin this assessment by reviewing some summary statistics about our data.

# Display summary statistics for the numeric columns to spot anomalies or patterns
ds.describe()

After going over the statistics above, the following was noticed:

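  • The longitude (lng) column was flagged as possibly containing an erroneous entry. Valid longitudes range from -180 to +180, so only values outside that interval are invalid. The maximum reported here, 56.5, falls inside the valid range, so this value should be re-checked against the describe() output before the issue is documented and addressed.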

  • Additionally, the lengths (length_m) and ages (max_ma and min_ma) span a wide range, which piqued my curiosity. After verification they do not appear to be faulty: the values are realistic and agree with other sources such as https://www.dinosaurreport.com/longest-dinosaur/ and https://www.oldest.org/animals/dinosaurs/
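A coordinate range check like the one described above can be made explicit in code instead of being read off the summary table. The sketch below assumes a latitude column named `lat` exists alongside `lng` (only `lng` is mentioned in this section). The helper `out_of_range` and the toy values are hypothetical and are not rows from the dataset.

```python
import pandas as pd

# Valid geographic bounds in degrees
LNG_BOUNDS = (-180.0, 180.0)
LAT_BOUNDS = (-90.0, 90.0)

def out_of_range(df: pd.DataFrame, column: str, bounds: tuple) -> pd.DataFrame:
    """Return the rows whose non-missing value in `column` lies outside `bounds` (inclusive)."""
    low, high = bounds
    values = df[column]
    # Missing values are excluded here; they belong to the null-value check
    mask = values.notna() & ~values.between(low, high)
    return df[mask]

# Hypothetical toy coordinates, used only to illustrate the check
toy = pd.DataFrame({
    "lng": [-120.3, 56.5, 565.0, None],
    "lat": [45.0, -91.2, 10.0, 30.0],
})

bad_lng = out_of_range(toy, "lng", LNG_BOUNDS)
bad_lat = out_of_range(toy, "lat", LAT_BOUNDS)

print(bad_lng)
print(bad_lat)
```

Applied to `ds`, an empty result for both columns would indicate that every recorded coordinate is geographically possible, while any returned rows would be the ones to document for the museum.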

Let's move on to crosschecking our data for quality issues, beginning with checking the data types of each column.

a. Checking for incorrect data types

#inspecting data types
ds.info()

The data types seem appropriate for each column. Now we proceed to check for null values.

b. Checking for null values