Best analysis ever! Andres Rafael Tito

Everyone Can Learn Data Scholarship

📖 Background

The second "Everyone Can Learn Data" Scholarship from DataCamp is now open for entries.

The challenges below test the coding skills you gained from beginner courses on Python, R, or SQL. Pair them with the help of AI and your creative thinking skills and win $5,000 for your future data science studies!

The scholarship is open to secondary and undergraduate students, and to other students preparing for graduate-level studies (working toward their bachelor's degree). Postgraduate students (PhD) and students who have already graduated (master's degree) cannot apply.

The challenge consists of two parts. Make sure to complete both parts before submitting. Good luck!

1️⃣ Part 1 (Python) - Dinosaur data 🦕

📖 Background

You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.

💾 The data

You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):

Column name	Description
occurrence_no	The original occurrence number from the Paleobiology Database.
name	The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil).
diet	The main diet (omnivorous, carnivorous, herbivorous).
type	The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur).
length_m	The maximum length, from head to tail, in meters.
max_ma	The age at which the first fossil records of the dinosaur were found, in million years.
min_ma	The age at which the last fossil records of the dinosaur were found, in million years.
region	The current region where the fossil record was found.
lng	The longitude where the fossil record was found.
lat	The latitude where the fossil record was found.
class	The taxonomical class of the dinosaur (Saurischia or Ornithischia).
family	The taxonomical family of the dinosaur (if known).

The data was enriched with data from Wikipedia.

# Import the pandas and numpy packages
import pandas as pd
import numpy as np
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')

# Preview the dataframe
dinosaurs.head()

💪 Challenge I

Help your colleagues at the museum to gain insights on the fossil record data. Include:

How many different dinosaur names are present in the data?
Which was the largest dinosaur? What about missing data in the dataset?
What dinosaur type has the most occurrences in this dataset? Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type. Use the AI assistant to tweak your visualization (colors, labels, title...).
Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.
Use the AI assistant to create an interactive map showing each record.
Any other insights you found during your analysis?

# 1. How many different dinosaur names are present in the data?

# Number of unique dinosaur names
unique_dinosaur_names = dinosaurs['name'].nunique()
print(f"\n There are {unique_dinosaur_names} unique dinosaur names in the dataset.")

# 2. Which was the largest dinosaur? What about missing data in the dataset?

# Determine the largest dinosaur by length
largest_dinosaur = dinosaurs.loc[dinosaurs['length_m'].idxmax()]
print(f"\n The largest dinosaur by length is {largest_dinosaur['name']} at {largest_dinosaur['length_m']} m.")

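# Check for missing data in every column, not only occurrence_no
missing_data = dinosaurs.isnull().sum()
print(f"\n There are {missing_data.sum()} missing values in total. Missing values per column:")
print(missing_data[missing_data > 0])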
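Since the museum also asked for advice on data quality, the missing-value count can be extended with a few basic sanity checks. The following is a minimal sketch, not part of the original analysis: the `toy` DataFrame holds made-up rows that only borrow the column names from the dataset, and the helper names `missing_summary` and `range_problems` are hypothetical.

```python
import pandas as pd

# Made-up records for illustration only; not taken from dinosaurs.csv
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "length_m": [3.0, None, 12.0, None],
    "max_ma": [150.0, 80.0, 70.0, 200.0],
    "min_ma": [145.0, 85.0, 66.0, 190.0],
    "lat": [45.0, -95.0, 10.0, 30.0],
    "lng": [100.0, 20.0, 190.0, -60.0],
})


def missing_summary(df):
    """Count and percentage of missing values per column, worst column first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


def range_problems(df):
    """Number of rows failing each sanity check (missing values are not counted)."""
    return {
        "max_ma < min_ma": int((df["max_ma"] < df["min_ma"]).sum()),
        "lat outside [-90, 90]": int((df["lat"].abs() > 90).sum()),
        "lng outside [-180, 180]": int((df["lng"].abs() > 180).sum()),
    }


print(missing_summary(toy))
print(range_problems(toy))
```

Calling both helpers on `dinosaurs` instead of `toy` would give the museum a per-column view of the gaps and flag records whose age range or coordinates look implausible.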

# 3. What dinosaur type has the most occurrences in this dataset? Create a visualization to display the number of dinosaurs per type. 

# Import the matplotlib and seaborn packages
import matplotlib.pyplot as plt
import seaborn as sns

# Count the occurrences of each dinosaur type
dinosaur_type_counts = dinosaurs['type'].value_counts()

# Create a bar chart
plt.figure(figsize = (10,6))
sns.barplot(x = dinosaur_type_counts.values, y = dinosaur_type_counts.index, palette = 'viridis')
plt.title('Number of dinosaurs per type')
plt.xlabel('Count')
plt.ylabel('Dinosaur type')
plt.show()

# Get the most frequent dinosaur type
most_dinosaur_type = dinosaur_type_counts.idxmax()
most_dinosaur_count = dinosaur_type_counts.max()
print(f"\n The most frequent dinosaur type is {most_dinosaur_type} with {most_dinosaur_count} occurrences.")

# 4. Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.

# Create a scatterplot 
plt.figure(figsize = (10,6))
sns.scatterplot(data = dinosaurs, x = 'max_ma', y = 'length_m', hue = 'type', palette = 'viridis', alpha = 0.7)
plt.title('Dinosaur length over time')
plt.xlabel('Age (millions of years ago)')
plt.ylabel('Length (meters)')
plt.show()

# Import linregress from scipy
from scipy.stats import linregress

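# Perform linear regression of length on age
valid_data = dinosaurs.dropna(subset=['max_ma', 'length_m'])
slope, intercept, r_value, p_value, std_error = linregress(valid_data['max_ma'], valid_data['length_m'])
print(f"\n Regression slope: {slope:.3f}")
print(f"\n P-value: {p_value:.3g}")
# max_ma counts millions of years before the present, so a negative slope means younger dinosaurs were longer
if p_value < 0.05 and slope < 0:
    print("\n There is a significant relationship between age and length: as age decreases (moving closer to the present), length increases. In conclusion, yes, the dinosaurs did get bigger over time.")
else:
    print("\n The regression does not show a significant increase in length toward the present, so this data does not support the claim that dinosaurs got bigger over time.")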
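A single regression line can hide how the typical length changes from one period to the next. One complementary view, sketched here under stated assumptions, is the median length per age bin. The `toy` data is invented for illustration, and both the helper name `median_length_by_age_bin` and the 50-million-year bin width are arbitrary choices rather than part of the original analysis.

```python
import pandas as pd

# Made-up records for illustration only; not taken from dinosaurs.csv
toy = pd.DataFrame({
    "max_ma": [230.0, 210.0, 160.0, 150.0, 100.0, 80.0, 70.0, None],
    "length_m": [2.0, 3.0, 9.0, None, 7.0, 12.0, 10.0, 5.0],
})


def median_length_by_age_bin(df, bin_width=50):
    """Median length and record count per age bin, oldest bin first."""
    # Keep only rows where both the age and the length are known
    valid = df.dropna(subset=["max_ma", "length_m"])
    # Build bin edges that are multiples of bin_width and cover every age
    start = int(valid["max_ma"].min() // bin_width) * bin_width
    stop = int(valid["max_ma"].max() // bin_width + 1) * bin_width
    edges = list(range(start, stop + bin_width, bin_width))
    # Left-closed bins such as [50, 100), [100, 150), ...
    bins = pd.cut(valid["max_ma"], bins=edges, right=False)
    summary = valid.groupby(bins, observed=True)["length_m"].agg(["median", "count"])
    return summary.sort_index(ascending=False)


summary = median_length_by_age_bin(toy)
print(summary)
```

Reading the resulting table from top to bottom moves from the oldest bin toward the present, and the `count` column shows how many records back each median.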

# 5. Use the AI assistant to create an interactive map showing each record.

# Import folium package
import folium
from folium.plugins import MarkerCluster

# Calculate average coordinates for centering the map
average_lat = dinosaurs['lat'].mean()
average_lon = dinosaurs['lng'].mean()

# Create a base map
mymap = folium.Map(location = [average_lat, average_lon], zoom_start = 2)

# Initialize a MarkerCluster
marker_cluster = MarkerCluster().add_to(mymap)

# Add markers to the cluster
for idx, row in dinosaurs.iterrows():
    if pd.notnull(row['lat']) and pd.notnull(row['lng']):
        folium.Marker(location = [row['lat'], row['lng']],
                      popup=f"Name: {row['name']}<br>Type: {row['type']}<br>Length: {row['length_m']} m").add_to(marker_cluster)

# Save the map to an HTML file and show it
mymap.save('dinosaur_map.html')
mymap

# 6. Distribution of dinosaur lengths
plt.figure(figsize=(10, 6))
sns.histplot(dinosaurs['length_m'].dropna(), bins=30, kde=True, color='blue')
plt.title('Distribution of Dinosaur Lengths')
plt.xlabel('Length (meters)')
plt.ylabel('Frequency')
plt.show()
print(f"\n The histogram shows the frequency of dinosaurs for different length ranges, and the KDE line provides a smooth estimate of the distribution, helping to identify trends more clearly.")

2️⃣ Part 2 (SQL) - Understanding movie data 🎥

📖 Background

You have just been hired by a large movie studio to perform data analysis. Your manager, an executive at the company, wants to make new movies that "recapture the magic of old Hollywood." So you've decided to look at the most successful films that came out before Titanic in 1997 to identify patterns and help generate ideas that could turn into future successful films.

💾 The data

You have access to the following table, cinema.films:

Column name	Description
id	Unique movie identifier.
title	The title of the movie.
release_year	The year the movie was released to the public.
country	The country in which the movie was released.
duration	The runtime of the movie, in minutes.
language	The original language the movie was produced in.
certification	The rating the movie was given based on its suitability for audiences.
gross	The revenue the movie generated at the box office, in USD.
budget	The available budget the production had for producing the movie, in USD.

You can click the "Browse tables" button in the upper right-hand corner of the SQL cell below to view the available tables. They will show on the left of the notebook.

The data was sourced from IMDb.
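As a starting point for the "most successful films before Titanic" question, a query could filter on `release_year` and sort by `gross`. The sketch below is only an illustration under several assumptions: it uses Python's built-in `sqlite3` with an in-memory `films` table as a stand-in for `cinema.films`, every row is invented, "before Titanic" is read as `release_year < 1997`, and box-office gross is taken as the measure of success.

```python
import sqlite3

# In-memory stand-in for cinema.films; every row is invented for illustration
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE films (
        id INTEGER PRIMARY KEY,
        title TEXT,
        release_year INTEGER,
        certification TEXT,
        gross INTEGER,
        budget INTEGER
    )
""")
conn.executemany(
    "INSERT INTO films (title, release_year, certification, gross, budget) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("Film A", 1975, "PG", 260_000_000, 8_000_000),
        ("Film B", 1994, "PG-13", 330_000_000, 55_000_000),
        ("Film C", 1996, "R", 20_000_000, None),
        ("Film D", 1999, "PG", 400_000_000, 100_000_000),
    ],
)

# Highest-grossing films released before 1997, skipping rows without a gross value
query = """
    SELECT title, release_year, gross
    FROM films
    WHERE release_year < 1997
      AND gross IS NOT NULL
    ORDER BY gross DESC
    LIMIT 2
"""
top_films = conn.execute(query).fetchall()
print(top_films)
```

In the notebook's SQL cell the same `SELECT` would target `cinema.films` directly, and the `LIMIT` value can be raised to inspect more titles.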

