📖 Background

The second "Everyone Can Learn Data" Scholarship from DataCamp is now open for entries.

The challenges below test the coding skills you gained from beginner courses on Python, R, or SQL. Pair them with the help of AI and your creative thinking skills to win $5,000 for your future data science studies!

The scholarship is open to secondary and undergraduate students, and to other students preparing for graduate-level studies (working toward their bachelor's degree). Postgraduate students (PhD) and students who have already graduated (master's degree) cannot apply.

The challenge consists of two parts. Make sure to complete both parts before submitting. Good luck!

1️⃣ Part 1 (Python) - Dinosaur data 🦕

📖 Background

You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.

credits: https://unsplash.com/photos/white-and-red-koi-fish-a_WdM0_T_Fs

💾 The data

You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):

| Column name | Description |
|---|---|
| occurence_no | The original occurrence number from the Paleobiology Database. |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
| diet | The main diet (omnivorous, carnivorous, herbivorous). |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
| length_m | The maximum length, from head to tail, in meters. |
| max_ma | The age at which the first fossil records of the dinosaur were found, in million years. |
| min_ma | The age at which the last fossil records of the dinosaur were found, in million years. |
| region | The current region where the fossil record was found. |
| lng | The longitude where the fossil record was found. |
| lat | The latitude where the fossil record was found. |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
| family | The taxonomical family of the dinosaur (if known). |

The data was enriched with data from Wikipedia.

SOLUTION TO CHALLENGE 1

# Import the pandas and numpy packages
import pandas as pd
import numpy as np
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
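# Count missing values per column before filling them; after fillna() every count would be zero
missing_data_count = dinosaurs.isnull().sum()
# Fill missing numeric values with zero (0) and missing text values with 'Unknown',
# so that text columns such as 'type' do not end up with a 0 category
numeric_cols = dinosaurs.select_dtypes(include='number').columns
dinosaurs[numeric_cols] = dinosaurs[numeric_cols].fillna(0)
dinosaurs = dinosaurs.fillna('Unknown')
# QUESTION 1
# Get the number of unique dinosaur names
num_unique_names = dinosaurs['name'].nunique()
print("Number of different dinosaur names present:", num_unique_names, '\n')
# QUESTION 2

# Get the largest dinosaur (the record with the greatest length_m)
largest_dinosaur = dinosaurs.loc[dinosaurs['length_m'].idxmax()]
print("The largest dinosaur is:", largest_dinosaur['name'], '\n')
print("Maximum length (m):", largest_dinosaur['length_m'], '\n')

# Report the missing data counted before the fill
print("Missing data in the dataset:")
print(missing_data_count)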
# QUESTION 3
import matplotlib.pyplot as plt

# Count occurrences of each dinosaur type
dinosaur_type_counts = dinosaurs['type'].value_counts()

# Define a colormap
colors = plt.cm.viridis(np.linspace(0, 1, len(dinosaur_type_counts)))

# Plotting the bar chart
plt.figure(figsize=(12, 8))
bars = plt.bar(range(len(dinosaur_type_counts)), dinosaur_type_counts.values, color=colors)

# Add a title and labels to the axes
plt.title('Number of Dinosaurs per Type', fontsize=16)
plt.xlabel('Dinosaur Type', fontsize=14)
plt.ylabel('Number of Occurrences', fontsize=14)

# Set x-tick labels to the dinosaur types
plt.xticks(ticks=range(len(dinosaur_type_counts)), labels=dinosaur_type_counts.index, rotation=45, ha='right', fontsize=12)

# Add a grid for the y-axis
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Add value labels on top of each bar
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2.0, yval, int(yval), va='bottom', ha='center', fontsize=12)

# Adding a custom legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=colors[i], label=dinosaur_type_counts.index[i]) for i in range(len(dinosaur_type_counts))]
plt.legend(handles=legend_elements, title='Dinosaur Types', bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=12)

# Apply the tight layout after adding the legend so that the layout accounts for it
plt.tight_layout()

# Show the bar chart
plt.show()
# QUESTION 4
import plotly.express as px

# Create an interactive scatter plot using Plotly
fig = px.scatter(
    dinosaurs,
    x='max_ma',
    y='length_m',
    color='length_m',
    size='length_m',
    hover_data=['name', 'type'],  # Show the dinosaur name and type when hovering over a point
    labels={'max_ma': 'Age (million years)', 'length_m': 'Length (meters)'},
    title='Dinosaur Length vs. Age'
)

# layout for better aesthetics
fig.update_layout(
    title={'x': 0.5, 'xanchor': 'center'},
    xaxis=dict(showgrid=True, gridcolor='lightgray', gridwidth=0.5),
    yaxis=dict(showgrid=True, gridcolor='lightgray', gridwidth=0.5),
    plot_bgcolor='white'
)

# Show the interactive scatter plot
fig.show()
# QUESTION 5
import folium
from folium.plugins import FastMarkerCluster

# Create a map centered at the mean latitude and longitude
map_dinosaurs = folium.Map(location=[dinosaurs['lat'].mean(), dinosaurs['lng'].mean()], zoom_start=3, tiles='Stamen Terrain')

# Create a FastMarkerCluster for better performance
marker_cluster = FastMarkerCluster(data=list(zip(dinosaurs['lat'], dinosaurs['lng']))).add_to(map_dinosaurs)

# Add custom markers for each dinosaur record
for index, row in dinosaurs.iterrows():
    popup_text = f"""
    <b>Name:</b> {row['name']}<br>
    <b>Type:</b> {row['type']}<br>
    <b>Length:</b> {row['length_m']} meters<br>
    <b>Age:</b> {row['max_ma']} million years ago
    """
    folium.Marker(
        location=[row['lat'], row['lng']],
        popup=folium.Popup(popup_text, max_width=300),
        icon=folium.Icon(color='green', icon='info-sign')
    ).add_to(marker_cluster)

# Add a legend for marker colors
legend_html = """
<div style="position: fixed; 
     bottom: 50px; left: 50px; width: 120px; height: 110px; 
     background-color: white; border-radius: 5px; z-index:9999;
     padding: 10px; font-size:14px;">
     <p><b>Legend</b></p>
     <p><i class="fa fa-map-marker fa-2x" style="color:green"></i> Dinosaur</p>
</div>
"""
map_dinosaurs.get_root().html.add_child(folium.Element(legend_html))

# Display the map directly in the notebook
map_dinosaurs

QUESTION 6 SOLUTION

ANALYSIS

  1. It appears that there's a wide variety of dinosaur types represented in the dataset, with sauropods and theropods being particularly common.

  2. The scatter plot showing the relationship between dinosaur length and age suggests that there's no clear trend of dinosaurs getting consistently larger over time. There are instances of both small and large dinosaurs across different ages.

  3. The distribution of dinosaur fossils seems to be quite widespread geographically, indicating the global distribution of these creatures during the Mesozoic era.
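
The second observation, that length shows no clear trend with age, could also be checked with a number rather than by eye. The following is a minimal sketch and not part of the submitted solution: it assumes a Spearman rank correlation is an acceptable measure, the function name `age_length_correlation` is hypothetical, and the rows in `toy` are invented purely for illustration. On the real data, `toy` would be replaced by the `dinosaurs` DataFrame loaded from the CSV, before any missing values are filled.

```python
import pandas as pd

# Toy stand-in for the dinosaurs table; these rows are made up for illustration only
toy = pd.DataFrame({
    "max_ma": [230.0, 200.0, 160.0, 150.0, 100.0, 70.0, None],
    "length_m": [2.0, 25.0, 7.0, 1.5, 30.0, 3.0, 12.0],
})


def age_length_correlation(df: pd.DataFrame) -> float:
    """Spearman rank correlation between age (max_ma) and length (length_m)."""
    # Drop rows where either value is missing instead of filling them with zero,
    # because zero-filled lengths or ages would distort the ranking
    clean = df.dropna(subset=["max_ma", "length_m"])
    # Spearman correlation is the Pearson correlation of the ranks
    return clean["max_ma"].rank().corr(clean["length_m"].rank())


rho = age_length_correlation(toy)
print(f"Spearman correlation between age and length: {rho:.2f}")
```

A value close to zero would be consistent with the absence of a monotonic trend, while a value near 1 or -1 would contradict it. Rank correlation is used here because dinosaur lengths are strongly skewed, which makes a plain Pearson correlation sensitive to a few very large sauropods.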

2️⃣ Part 2 (SQL) - Understanding movie data 🎥

📖 Background

You have just been hired by a large movie studio to perform data analysis. Your manager, an executive at the company, wants to make new movies that "recapture the magic of old Hollywood." So you've decided to look at the most successful films that came out before Titanic in 1997 to identify patterns and help generate ideas that could turn into future successful films.

credits: https://unsplash.com/photos/person-watching-movie-AtPWnYNDJnM

💾 The data

You have access to the following table, cinema.films:

| Column name | Description |
|---|---|
| id | Unique movie identifier. |
| title | The title of the movie. |
| release_year | The year the movie was released to the public. |
| country | The country in which the movie was released. |
| duration | The runtime of the movie, in minutes. |
| language | The original language the movie was produced in. |
| certification | The rating the movie was given based on its suitability for audiences. |
| gross | The revenue the movie generated at the box office, in USD. |
| budget | The available budget the production had for producing the movie, in USD. |

You can click the "Browse tables" button in the upper right-hand corner of the SQL cell below to view the available tables. They will show on the left of the notebook.

The data was sourced from IMDb.
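
To make the table description concrete, here is a minimal sketch of how a table shaped like cinema.films could be queried for the highest-grossing films released before 1997. It is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` module with an in-memory table named `films` as a stand-in for the notebook's real database, it includes only a subset of the columns, and every row is an invented placeholder rather than real IMDb data.

```python
import sqlite3

# In-memory SQLite database standing in for the notebook's SQL engine (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE films (
        id INTEGER PRIMARY KEY,
        title TEXT,
        release_year INTEGER,
        gross INTEGER,
        budget INTEGER
    )
""")

# Placeholder rows, invented for illustration only
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Film A", 1975, 300, 10),
        (2, "Film B", 1990, 500, 50),
        (3, "Film C", 1996, None, 80),   # missing gross, excluded by the query
        (4, "Film D", 1998, 900, 200),   # released after 1997, excluded by the query
        (5, "Film E", 1985, 400, 20),
    ],
)

# Highest-grossing films released before 1997, ignoring rows without a gross value
top_before_1997 = conn.execute("""
    SELECT title, release_year, gross
    FROM films
    WHERE release_year < 1997
      AND gross IS NOT NULL
    ORDER BY gross DESC
    LIMIT 3
""").fetchall()

print(top_before_1997)
conn.close()
```

Filtering out rows with a missing `gross` keeps incomplete records from appearing in the ranking. Against the real table, the same `SELECT` would reference `cinema.films` instead of `films`.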

SOLUTION TO CHALLENGE 2