Competition - Everyone Can Learn Data Scholarship

1️⃣ Part 1 (Python) - Dinosaur data 🦕

📖 Background

You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.

💾 The data

You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):

Column name	Description

The data was enriched with data from Wikipedia.

# Import the pandas and numpy packages
import pandas as pd
import numpy as np
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')

# Preview the dataframe
dinosaurs

💪 Challenge I

Help your colleagues at the museum to gain insights on the fossil record data. Include:

How many different dinosaur names are present in the data?
Which was the largest dinosaur? What about missing data in the dataset?
What dinosaur type has the most occurrences in this dataset? Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type. Use the AI assistant to tweak your visualization (colors, labels, title...).
Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.
Use the AI assistant to create an interactive map showing each record.
Any other insights you found during your analysis?

# 1. Number of unique dinosaur names
import pandas as pd
dinosaurs = pd.read_csv('data/dinosaurs.csv')
total_names = dinosaurs["name"].count()     # non-missing name entries
unique_names = dinosaurs["name"].nunique()  # distinct names
print(f"Total name entries: {total_names}")
print(f"Unique names: {unique_names}")

Our dataset contains a total of 4,951 dinosaur name entries. However, only about 1,042 of these names are unique, which means that many dinosaurs appear in more than one record.

# 2. Largest dinosaur and missing data
import pandas as pd
dinosaurs = pd.read_csv('data/dinosaurs.csv')
# Record with the greatest length (idxmax skips missing lengths)
largest = dinosaurs.loc[dinosaurs["length_m"].idxmax(), ["name", "length_m"]]
print(largest)
# Number of missing values in each column
missing_values = dinosaurs.isnull().sum()
print(missing_values)

In our exploration of dinosaur sizes, we found that the largest dinosaur in the data is the Supersaurus, at an impressive 35 meters long. Note that length_m measures length, not height, so this giant is the longest dinosaur in our records.

When we looked at the quality of our data, we found some gaps. There are about 570 missing values in the names column. The diet and type columns are missing 1,355 values each. We also have 1,383 records missing for the length_m column. Additionally, the family column has 1,457 blank records. However, the region column is in better shape, with only 42 missing records.

These missing values are important to note as they can impact our analysis. Each gap tells us where we need more information to get a complete picture of these ancient creatures.
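Raw counts are easier to judge next to the size of the dataset. As a rough sketch, the missing values could also be reported as a percentage of all records. The helper name `missing_summary` and the tiny sample table below are made up for illustration and are not the museum data.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the missing count and percentage per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Made-up sample rows, for illustration only (not the museum data)
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "type": ["sauropod", None, None, "ornithopod"],
    "length_m": [35.0, np.nan, 2.0, 9.0],
})
summary = missing_summary(sample)
print(summary)
```

On the real data, the same helper could be called as missing_summary(dinosaurs).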

# 3. Dinosaurs with the most occurrences (top 5 names)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dinosaurs = pd.read_csv('data/dinosaurs.csv')
df = dinosaurs.groupby("name").size().sort_values(ascending=False).head(5)
cc=['green','lightgreen','lightgrey','grey','grey']
plt.figure(figsize=(10, 6))
df.plot(kind='bar', color=cc)
plt.xlabel('Dinosaur Name')
plt.ylabel('Occurrences')
plt.title('Top 5 Most Common Dinosaurs')
plt.xticks(rotation=45)
for index, value in enumerate(df):
    plt.text(index, value, str(value), ha='center', va='bottom')
plt.show()

In our journey through dinosaur history, some species appear more often than others. The Richardoestesia is the most common dinosaur in our records, showing up 151 times. This makes it the top dinosaur in terms of how often we find its fossils.

Next, we have the Saurornitholestes, which appears 136 times. This dinosaur is second on our list, known for being quick and alert.

Following closely is the Triceratops, famous for its three horns and large frill. It shows up 125 times in our records, making it one of the most familiar dinosaurs.

Then, we have the Iguanodon and the Troodon, each appearing 111 times. The Iguanodon is known for its unique thumb spikes, and the Troodon is noted for being quite smart.

These numbers help us understand which dinosaurs are most common in the fossil record. Keep in mind that the counts also reflect where and how intensively fossils have been collected, not only how common each dinosaur was when it was alive.
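The challenge question asks about dinosaur types, while the chart above counts dinosaur names. A minimal sketch of a per-type count follows, assuming the `type` column mentioned in the missing-data summary holds the dinosaur type. The helper name `count_by_type` and the sample rows are made up for illustration.

```python
import pandas as pd

def count_by_type(df: pd.DataFrame, column: str = "type") -> pd.Series:
    """Count records per category, most frequent first (missing values are excluded)."""
    return df[column].value_counts()

# Made-up sample rows, for illustration only (not the museum data)
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "type": ["small theropod", "sauropod", "small theropod", None, "ornithopod"],
})
type_counts = count_by_type(sample)
print(type_counts)
```

On the real data, count_by_type(dinosaurs).plot(kind='bar') would give a bar chart of records per type, and the first entry of the result would be the type with the most occurrences.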

# 4. Dinosaur length over time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress
dinosaurs = pd.read_csv('data/dinosaurs.csv')
if 'length_m' not in dinosaurs.columns or 'max_ma' not in dinosaurs.columns or 'min_ma' not in dinosaurs.columns:
    raise ValueError("The dataset must contain 'length_m', 'max_ma', and 'min_ma' columns")

# Calculate the average age for each dinosaur
dinosaurs['average_age'] = (dinosaurs['max_ma'] + dinosaurs['min_ma']) / 2

# Preprocess the data (e.g., handling missing values)
dinosaurs.dropna(subset=['length_m', 'average_age'], inplace=True)

# Fit a linear regression line
slope, intercept, r_value, p_value, std_err = linregress(dinosaurs['average_age'], dinosaurs['length_m'])
regression_line = slope * dinosaurs['average_age'] + intercept

# Plot the data
plt.figure(figsize=(10, 6))
plt.scatter(dinosaurs['average_age'], dinosaurs['length_m'], alpha=0.5,color="green", label='Dinosaur Data')
plt.plot(dinosaurs['average_age'], regression_line, color='red', label=f'Regression Line\n$R^2={r_value**2:.2f}$')

# Add labels and title
plt.title('Dinosaur Length vs. Average Age')
plt.xlabel('Average Age (millions of years ago)')
plt.ylabel('Length (meters)')
plt.legend()
plt.grid(True)
plt.gca().invert_xaxis()  # Invert x-axis to show age increasing to the left
plt.show()

Our data tells a story about how dinosaur size changed over millions of years. Here, age means how long ago a dinosaur lived (the midpoint of max_ma and min_ma), not the age of an individual animal. Reading the plot from the oldest records toward the most recent, dinosaur lengths at first tend to increase.

Around 150 million years ago, this increase in size is at its strongest, and the records from this period include remarkably long dinosaurs. As we move closer to 125 million years ago, a shift occurs: the increase in length appears to slow down.

This slowdown persists in the younger records. It might be linked to ecological and environmental factors that we can explore further. Note that the straight regression line only captures the overall direction of the trend, and the R² value in the legend shows how much of the variation in length it actually explains.

Overall, the analysis points to a history of dinosaur evolution marked by a period of strong size increase, followed by a more tempered pace as time progressed.
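A single straight-line fit can hide a rise followed by a slowdown. One way to check for such a pattern is to compare the median length in successive time bins. This is only a sketch: the helper name `median_length_by_period`, the 25-million-year bin width, and the sample rows are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def median_length_by_period(df: pd.DataFrame, bin_width: float = 25.0) -> pd.Series:
    """Median length per age bin, oldest bin first (bin width in millions of years)."""
    data = df.dropna(subset=["length_m", "max_ma", "min_ma"]).copy()
    data["average_age"] = (data["max_ma"] + data["min_ma"]) / 2
    # Label each record with the lower edge of the bin it falls into
    data["age_bin"] = (data["average_age"] // bin_width) * bin_width
    medians = data.groupby("age_bin")["length_m"].median()
    return medians.sort_index(ascending=False)

# Made-up sample rows, for illustration only (not the museum data)
sample = pd.DataFrame({
    "max_ma":   [155.0, 160.0, 130.0, 70.0, 72.0, 100.0],
    "min_ma":   [145.0, 150.0, 120.0, 66.0, 68.0, 90.0],
    "length_m": [30.0, 20.0, 10.0, 4.0, 12.0, np.nan],
})
medians = median_length_by_period(sample)
print(medians)
```

On the real data, median_length_by_period(dinosaurs) returns one median per bin. Medians are less sensitive than means to a few very long dinosaurs, which is why they are used here.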

# 5. Interactive map of the records
import os
import pandas as pd
import numpy as np
import folium
import matplotlib.pyplot as plt
from IPython.display import display, IFrame

# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
required_columns = ['length_m', 'max_ma', 'min_ma', 'lat', 'lng']
if not all(column in dinosaurs.columns for column in required_columns):
    raise ValueError(f"The dataset must contain columns: {required_columns}")

# Calculate the average age for each dinosaur
dinosaurs['average_age'] = (dinosaurs['max_ma'] + dinosaurs['min_ma']) / 2

# Preprocess the data (e.g., handling missing values)
dinosaurs.dropna(subset=['length_m', 'average_age', 'lat', 'lng'], inplace=True)

# Create a base map
m = folium.Map(location=[dinosaurs['lat'].mean(), dinosaurs['lng'].mean()], zoom_start=2)

# Add points to the map
for idx, row in dinosaurs.iterrows():
    folium.CircleMarker(
        location=[row['lat'], row['lng']],
        radius=5,
        popup=(
            f"Length: {row['length_m']} meters<br>"
            f"Average Age: {row['average_age']} million years ago<br>"
            f"Location: ({row['lat']}, {row['lng']})"
        ),
        color='blue',
        fill=True,
        fill_color='blue'
    ).add_to(m)

# Save the map to an HTML file
file_path = 'dinosaurs_map.html'
m.save(file_path)

# Check if the file was created and display it
if os.path.exists(file_path):
    print(f"Map saved successfully as {file_path}")
    display(IFrame(file_path, width=700, height=500))
else:
    print(f"Failed to create the map: {file_path}")

# 6. Distribution of dinosaur lengths
import pandas as pd
import matplotlib.pyplot as plt
dinosaurs = pd.read_csv('data/dinosaurs.csv')
# Plot a histogram of dinosaur lengths, explicitly skipping records with no length
plt.figure(figsize=(10, 6))
plt.hist(dinosaurs['length_m'].dropna(), bins=20, color='blue')
plt.xlabel('Dinosaur Length (meters)')
plt.ylabel('Frequency')
plt.title('Distribution of Dinosaur Lengths')
plt.grid(True)
plt.show()

