Competition - Everyone Can Learn Data Scholarship

1️⃣ Part 1 (Python) - Dinosaur data 🦕

📖 Background

You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.

# Import pandas and NumPy for data handling, seaborn and matplotlib for plotting
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
# Preview the dataframe
dinosaurs.head(5)

1. How many different dinosaur names are present in the data?

I counted the unique dinosaur names in the "name" column with nunique(). The result is stored in the variable different_name and printed with a short descriptive message.

different_name = dinosaurs["name"].nunique()
print("Number of different dinosaur names =", different_name)
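
One detail worth checking when counting names: nunique() ignores missing values by default. The sketch below illustrates this on a small made-up frame; the variable toy and its contents are hypothetical stand-ins, not records from dinosaurs.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({"name": ["Dino A", "Dino B", "Dino A", np.nan]})

# Default behaviour: NaN is not counted as a name
unique_names = toy["name"].nunique()

# dropna=False counts NaN as one extra distinct value
unique_with_nan = toy["name"].nunique(dropna=False)

print("Unique names:", unique_names)
print("Unique values including NaN:", unique_with_nan)
```

If the two numbers differ on the real data, the "name" column contains missing entries that deserve a closer look.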

2. Which was the largest dinosaur? What about missing data in the dataset?

We identified the largest dinosaur in the dataset by length. The idxmax() function returns the index label of the row with the maximum 'length_m' value, and we use that label to look up the dinosaur's name and length before printing them.

# Index label of the longest dinosaur (idxmax skips missing lengths and returns the first match)
largest_idx = dinosaurs["length_m"].idxmax()
largest_dinosaur = dinosaurs.loc[largest_idx, "name"]
largest_dinosaur_length = dinosaurs.loc[largest_idx, "length_m"]
print("Largest dinosaur is:", largest_dinosaur, largest_dinosaur_length)
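
Because idxmax() returns only the first row that reaches the maximum, ties are silently dropped. A minimal sketch of how tied records could be listed instead, using a made-up frame (the names and lengths below are hypothetical, not values from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; lengths are invented for illustration
toy = pd.DataFrame({
    "name": ["Dino A", "Dino B", "Dino C", "Dino D"],
    "length_m": [35.0, 35.0, 2.0, np.nan],
})

# max() skips NaN by default, like idxmax()
max_length = toy["length_m"].max()

# Keep every name whose length equals the maximum, not just the first one
largest = toy.loc[toy["length_m"] == max_length, "name"].unique().tolist()

print("Longest dinosaur(s):", largest, "at", max_length, "m")
```

On the real data this would also reveal whether the same dinosaur appears in several fossil records at the maximum length.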

About missing values

We checked each column of the 'dinosaurs' dataset for missing values. The expression below returns the count of missing values per column, which gives a quick summary of which columns are incomplete and by how much.

dinosaurs.isnull().sum()
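
Raw counts are easier to judge next to the share of rows they represent. The following sketch combines both in one table; the frame toy and the name summary are hypothetical and only illustrate the idea on invented values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "name": ["Dino A", "Dino B", "Dino C", "Dino D"],
    "diet": ["herbivorous", None, "carnivorous", "herbivorous"],
    "length_m": [10.0, np.nan, np.nan, 3.0],
})

# isnull().mean() is the fraction of missing rows per column
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": toy.isnull().mean() * 100,
}).sort_values("missing", ascending=False)

print(summary)
```

Sorting by the missing count puts the least complete columns at the top, which is a convenient starting point for advising on data quality.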

Filling Missing Values in the Diet Column

Here we handle missing values in the 'diet' column. We flag rows where 'diet' is either the literal string 'null' or missing, then fill each flagged row with 'omnivorous', 'herbivorous', or 'carnivorous', drawn uniformly at random with NumPy's random.choice function. Because no random seed is set, the filled values change from run to run.

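# Diet column: flag rows where diet is the literal string 'null' or is missing
null_rows = (dinosaurs['diet'] == 'null') | dinosaurs['diet'].isnull()
# Draw one diet uniformly at random for each flagged row
random_choices = np.random.choice(['omnivorous', 'herbivorous', 'carnivorous'], size=null_rows.sum())
dinosaurs.loc[null_rows, 'diet'] = random_choices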
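
As a possible variation (not the method used above), the random fill can be made reproducible with a seeded generator and weighted by how often each diet already occurs, so the imputed rows do not distort the observed proportions. This is a sketch on invented data; toy, the seed 42, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real 'diet' column
toy = pd.DataFrame({
    "diet": ["herbivorous", "carnivorous", "herbivorous", None, "omnivorous", None]
})

# Seeded generator so the imputation gives the same result on every run
rng = np.random.default_rng(42)

missing = toy["diet"].isnull()

# Relative frequency of each diet among the rows that are not missing
observed = toy.loc[~missing, "diet"].value_counts(normalize=True)

# Sample replacements in proportion to the observed frequencies
toy.loc[missing, "diet"] = rng.choice(
    observed.index.to_numpy(),
    size=missing.sum(),
    p=observed.to_numpy(),
)

print(toy["diet"].value_counts())
```

Whichever variant is used, randomly assigned diets are placeholders rather than facts about the fossils, so it may be worth keeping the missing mask to tell imputed rows apart later.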