Everyone Can Learn Data Scholarship

📖 Background

The second "Everyone Can Learn Data" Scholarship from DataCamp is now open for entries.

The challenges below test the coding skills you gained from beginner courses in Python, R, or SQL. Pair them with the help of AI and your creative thinking, and win $5,000 for your future data science studies!

The scholarship is open to secondary and undergraduate students, and to other students preparing for graduate-level studies (working toward their bachelor's degree). Postgraduate students (PhD candidates) and graduates (master's degree holders) cannot apply.

The challenge consists of two parts; make sure to complete both parts before submitting. Good luck!

💡 Learn more

The following DataCamp courses can help you review the skills needed to get started with this challenge:

  • Intermediate Python
  • Introduction to the Tidyverse in R
  • Introduction to SQL

ℹ️ Introduction to Data Science Notebooks

You can skip this section if you are already familiar with data science notebooks.

Data science notebooks

A data science notebook is a document containing text cells (what you're reading now) and code cells. What makes a notebook unique is that it's interactive: you can change or add code cells, then run a cell by selecting it and clicking the Run button to the right (▶, or Run All at the top) or pressing Ctrl + Enter.

The result will be displayed directly in the notebook.

Try running the Python cell below:

# Run this cell to see the result (click on Run on the right, or Ctrl|CMD + Enter)
100 * 1.75 * 20

Modify any of the numbers and rerun the cell.

You can add a Markdown, Python|R, or SQL cell by clicking on the Add Markdown, Add Code, and Add SQL buttons that appear as you move the mouse pointer near the bottom of any cell.

🤖 You can also make use of our AI assistant by asking it what you want to do. See it in action here.

Here at DataCamp, we call our interactive notebook Workspace. You can find out more about Workspace here.

1️⃣ Part 1 (Python) - Dinosaur data 🦕

📖 Background

You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.

💾 The data

You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):

| Column name | Description |
| --- | --- |
| occurence_no | The original occurrence number from the Paleobiology Database. |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
| diet | The main diet (omnivorous, carnivorous, herbivorous). |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
| length_m | The maximum length, from head to tail, in meters. |
| max_ma | The age at which the first fossil records of the dinosaur were found, in millions of years. |
| min_ma | The age at which the last fossil records of the dinosaur were found, in millions of years. |
| region | The current region where the fossil record was found. |
| lng | The longitude where the fossil record was found. |
| lat | The latitude where the fossil record was found. |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
| family | The taxonomical family of the dinosaur (if known). |

The data was enriched with data from Wikipedia.
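
Since part of the task is to advise on data quality, a few consistency checks implied by the column descriptions can be sketched as follows. The toy records and the helper name `check_quality` are assumptions made for illustration; none of the values come from the real dataset.

```python
import pandas as pd

# Toy records (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "max_ma": [150.0, 70.0, 66.0],
    "min_ma": [145.0, 72.0, 60.0],   # row B is inconsistent: min_ma > max_ma
    "lat": [45.0, 95.0, -10.0],      # row B has an impossible latitude
    "lng": [-110.0, 20.0, 30.0],
})

def check_quality(df):
    """Flag rows that break basic rules (hypothetical helper).

    Missing coordinates are also flagged, because between() is False for NaN.
    """
    return pd.DataFrame({
        "age_order_bad": df["max_ma"] < df["min_ma"],
        "lat_bad": ~df["lat"].between(-90, 90),
        "lng_bad": ~df["lng"].between(-180, 180),
    })

flags = check_quality(toy)
print(flags.sum())  # number of flagged rows per rule
```

Running the same helper on the loaded `dinosaurs` dataframe would show how many records, if any, break each rule.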

# Import the pandas and numpy packages
import pandas as pd
import numpy as np
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
# Preview the dataframe
dinosaurs

# How many different dinosaur names are present in the data?
# nunique() counts distinct names and ignores missing values
count = dinosaurs['name'].nunique()
print(f"The total number of different dinosaur names in the data = {count}")
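
# Which was the largest dinosaur? What about missing data in the dataset?
largest_dinosaur = dinosaurs.nlargest(5, 'length_m')
print(largest_dinosaur)
# Report the name(s) at the maximum length instead of hard-coding the answer
max_length = dinosaurs['length_m'].max()
largest_names = dinosaurs.loc[dinosaurs['length_m'] == max_length, 'name'].unique()
print(f"The largest dinosaur(s): {', '.join(largest_names)} ({max_length} m)")

# Count missing values per column, then drop rows with any missing value
print(dinosaurs.isnull().sum())
dinosaurs_dropped = dinosaurs.dropna()
print(dinosaurs_dropped)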
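
Calling `dropna()` without arguments discards every row that has a missing value in any column, which can remove more records than an analysis needs. A minimal sketch of a per-column summary and a targeted drop, using a hypothetical toy dataframe rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in two different columns
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "length_m": [10.0, np.nan, 3.5, np.nan],
    "family": ["X", "Y", None, "Z"],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
print(missing)

# Drop rows only where the column needed for this analysis is missing
with_length = toy.dropna(subset=["length_m"])
all_complete = toy.dropna()
print(len(with_length), len(all_complete))
```

Here the targeted drop keeps a record whose family is unknown but whose length is available, while the unconditional drop loses it.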

# What dinosaur type has the most occurrences in this dataset? Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type. Use the AI assistant to tweak your visualization (colors, labels, title...).
# pandas, numpy, and the dinosaurs dataframe are already loaded above
import matplotlib.pyplot as plt

# Count occurrences per dinosaur type (value_counts sorts in descending order)
dino_counts = dinosaurs['type'].value_counts().reset_index()
dino_counts.columns = ['type', 'count']
print(f"Most common type: {dino_counts.loc[0, 'type']}")

# Set the style before creating the figure so that it applies to this plot
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(5, 6))
plt.bar(
    x=dino_counts["type"],
    height=dino_counts["count"],
    color='purple',       # To set the color of the bars
    edgecolor='black',    # To set the color of the bar edges
    linewidth=1.5,        # To set the width of the bar edges
    alpha=0.8             # To set the transparency of the bars
)

plt.xlabel('Dinosaur Type', fontsize=14, fontweight='bold')  # To set the x-axis label with font size and style
plt.ylabel('Count', fontsize=14, fontweight='bold')           # To set the y-axis label with font size and style
plt.title('Number of Dinosaurs per Type', fontsize=16, fontweight='bold')  # To set the chart title with font size and style
plt.xticks(fontsize=12, rotation=45)    # To set the font size and rotation of the x-axis tick labels
plt.yticks(fontsize=12)    # To set the font size of the y-axis tick labels
plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.5, color = "black")  # To display gridlines with a dashed style and reduced opacity

plt.show()
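
The task also accepts a table as the visualization. A small sketch of a count table with each type's share of the total, built on a hypothetical toy series instead of the real `type` column:

```python
import pandas as pd

# Toy type labels (hypothetical, for illustration only)
toy_types = pd.Series(
    ["sauropod", "ornithopod", "sauropod",
     "large theropod", "ornithopod", "ornithopod"],
    name="type",
)

counts = toy_types.value_counts()  # sorted from most to least frequent
table = pd.DataFrame({
    "count": counts,
    "share_pct": (counts / counts.sum() * 100).round(1),
})
print(table)

most_common = counts.idxmax()
print(f"Most common type: {most_common}")
```

Replacing `toy_types` with `dinosaurs['type']` would give the same table for the real records.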
# Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.

import matplotlib.pyplot as plt
import seaborn as sns

# Create a new column for the midpoint of the age range
dinosaurs['mid_ma'] = (dinosaurs['max_ma'] + dinosaurs['min_ma']) / 2

# Plot the relationship between dinosaur length and their age
plt.figure(figsize=(12, 6))
sns.scatterplot(data=dinosaurs, x='mid_ma', y='length_m', hue='type', alpha=0.7)
plt.title('Dinosaur Length Over Time')
plt.xlabel('Age (Ma)')
plt.ylabel('Length (m)')
plt.legend(title='Type')
plt.gca().invert_xaxis()  # Invert x-axis to show time progressing from left to right
plt.show()
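
A scatter plot shows the pattern visually; one way to put a number on it is a Pearson correlation and a straight-line fit. The sketch below uses hypothetical toy values, and a linear fit is only one assumption about the shape of the trend. Because age is measured in millions of years before the present, a negative slope means that more recent dinosaurs are longer.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): length grows as age decreases
toy = pd.DataFrame({
    "mid_ma": [200.0, 150.0, 100.0, 70.0],
    "length_m": [4.0, 6.0, 8.0, 9.2],
})

# Keep only rows where both values are known
clean = toy.dropna(subset=["mid_ma", "length_m"])

# Pearson correlation between age and length
r = clean["mid_ma"].corr(clean["length_m"])

# Least-squares line: length_m = slope * mid_ma + intercept
slope, intercept = np.polyfit(clean["mid_ma"], clean["length_m"], deg=1)

print(f"r = {r:.2f}, slope = {slope:.3f} m per million years")
```

On real data the correlation is unlikely to be this clean, so it is worth reading the number together with the plot.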

# Use the AI assistant to create an interactive map showing each record.
import folium
from folium.plugins import MarkerCluster

# Create a base map
m = folium.Map(location=[20, 0], zoom_start=2)

# Create a marker cluster
marker_cluster = MarkerCluster().add_to(m)

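# Add markers to the map, skipping any record without coordinates (folium rejects NaN locations)
for _, row in dinosaurs.dropna(subset=['lat', 'lng']).iterrows():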
    folium.Marker(
        location=[row['lat'], row['lng']],
        popup=(
            f"<strong>Name:</strong> {row['name']}<br>"
            f"<strong>Type:</strong> {row['type']}<br>"
            f"<strong>Diet:</strong> {row['diet']}<br>"
            f"<strong>Length (m):</strong> {row['length_m']}<br>"
            f"<strong>Age (Ma):</strong> {row['mid_ma']}<br>"
            f"<strong>Region:</strong> {row['region']}<br>"
            f"<strong>Family:</strong> {row['family']}<br>"
        )
    ).add_to(marker_cluster)

# Display the map
m
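
Formatting row values straight into the popup prints `nan` for missing fields and inserts text into HTML unescaped. A possible refinement is a small helper that builds the popup text; `popup_html` and the example record below are hypothetical and not part of folium or the dataset.

```python
import html
import math

def popup_html(record, fields):
    """Build popup HTML from a dict-like record (hypothetical helper).

    Values are HTML-escaped, and missing values are shown as "unknown".
    """
    lines = []
    for label, key in fields:
        value = record.get(key)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            text = "unknown"
        else:
            text = html.escape(str(value))
        lines.append(f"<strong>{html.escape(label)}:</strong> {text}")
    return "<br>".join(lines)

# Example record with markup in the name and a missing length
record = {"name": "Example <i>saurus</i>", "length_m": float("nan")}
fields = [("Name", "name"), ("Length (m)", "length_m")]
print(popup_html(record, fields))
```

Because a pandas row also supports `.get`, the result of `popup_html(row, fields)` could be passed as the `popup` argument in the loop above.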