Everyone Can Learn Data Scholarship
📖 Background
The second "Everyone Can Learn Data" Scholarship from DataCamp is now open for entries.
The challenges below test the coding skills you gained from beginner courses in Python, R, or SQL. Pair them with the help of AI and your creative thinking skills, and you could win $5,000 for your future data science studies!
The scholarship is open to secondary and undergraduate students, as well as other students preparing for graduate-level studies (working toward a bachelor's degree). Postgraduate students (PhD candidates) and students who have already graduated with a master's degree cannot apply.
The challenge consists of two parts. Make sure to complete both parts before submitting. Good luck!
💡 Learn more
The following DataCamp courses can help you review the skills needed to get started on this challenge:
ℹ️ Introduction to Data Science Notebooks
You can skip this section if you are already familiar with data science notebooks.
Data science notebooks
A data science notebook is a document containing text cells (what you're reading now) and code cells. What makes a notebook unique is that it's interactive: you can change or add code cells, then run a cell by selecting it and clicking the Run button to the right (▶, or Run All at the top) or by pressing Ctrl/Cmd + Enter.
The result will be displayed directly in the notebook.
Try running the Python cell below:
# Run this cell to see the result (click on Run on the right, or Ctrl|CMD + Enter)
100 * 1.75 * 20
Modify any of the numbers and rerun the cell.
You can add a Markdown, Python|R, or SQL cell by clicking on the Add Markdown, Add Code, and Add SQL buttons that appear as you move the mouse pointer near the bottom of any cell.
🤖 You can also use our AI assistant by asking it what you want to do. See it in action here.
Here at DataCamp, we call our interactive notebook Workspace. You can find out more about Workspace here.
1️⃣ Part 1 (Python) - Dinosaur data 🦕
📖 Background
You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.
💾 The data
You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):
| Column name | Description |
|---|---|
| occurence_no | The original occurrence number from the Paleobiology Database. |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
| diet | The main diet (omnivorous, carnivorous, herbivorous). |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
| length_m | The maximum length, from head to tail, in meters. |
| max_ma | The age at which the first fossil records of the dinosaur were found, in millions of years. |
| min_ma | The age at which the last fossil records of the dinosaur were found, in millions of years. |
| region | The current region where the fossil record was found. |
| lng | The longitude where the fossil record was found. |
| lat | The latitude where the fossil record was found. |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
| family | The taxonomical family of the dinosaur (if known). |
The data was enriched with data from Wikipedia.
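Because part of the job is advising the museum on data quality, a few basic checks can be scripted. The following is a minimal sketch on hypothetical rows (not the real dataset). It assumes that `max_ma` is the older bound of each age range, and `quality_report` is an illustrative helper name rather than part of any library.

```python
import pandas as pd

# Hypothetical sample rows, NOT taken from the real dataset;
# column names follow the data table above.
sample = pd.DataFrame(
    {
        "name": ["Alphasaurus", "Betasaurus", "Gammasaurus", "Alphasaurus"],
        "length_m": [12.0, None, 3.5, 12.0],
        "max_ma": [150.0, 70.0, 66.0, 150.0],
        "min_ma": [145.0, 72.0, 60.0, 145.0],
        "lat": [45.0, 95.0, -30.0, 45.0],
        "lng": [-110.0, 10.0, 20.0, -110.0],
    }
)


def quality_report(df: pd.DataFrame) -> dict:
    """Return a few simple data-quality counts for a fossil-record table."""
    return {
        # Share of missing values per column, as a fraction between 0 and 1
        "missing_share": df.isna().mean().to_dict(),
        # Rows that repeat an earlier row exactly
        "duplicate_rows": int(df.duplicated().sum()),
        # Assumption: max_ma is the older bound, so it should not be below min_ma
        "inverted_age_ranges": int((df["max_ma"] < df["min_ma"]).sum()),
        # Latitude must lie in [-90, 90] and longitude in [-180, 180]
        "invalid_coordinates": int(
            (~df["lat"].between(-90, 90) | ~df["lng"].between(-180, 180)).sum()
        ),
    }


report = quality_report(sample)
print(report)
```

The same function could be called on the full `dinosaurs` dataframe once it is loaded; which checks matter most depends on how the museum plans to use the records.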
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import plotly.express as px
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
# 1. How many different dinosaur names are present in the data?
unique_dinosaur_names = dinosaurs['name'].nunique()
print(f"Number of different dinosaur names: {unique_dinosaur_names}")
# 2. Which was the largest dinosaur? What about missing data in the dataset?
largest_dinosaur = dinosaurs.loc[dinosaurs['length_m'].idxmax()]
print(f"Largest dinosaur: {largest_dinosaur['name']} with length {largest_dinosaur['length_m']} meters")
# Visualize missing data
msno.matrix(dinosaurs)
plt.title('Missing Data in Dinosaur Dataset')
plt.show()
# 3. What dinosaur type has the most occurrences in this dataset? Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type.
dinosaur_type_counts = dinosaurs['type'].value_counts()
print(dinosaur_type_counts)
# Bar chart for dinosaur types
plt.figure(figsize=(10, 6))
sns.barplot(x=dinosaur_type_counts.index, y=dinosaur_type_counts.values, palette='viridis')
plt.title('Number of Dinosaurs per Type')
plt.xlabel('Dinosaur Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
# 4. Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.
plt.figure(figsize=(10, 6))
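sns.scatterplot(data=dinosaurs, x='max_ma', y='length_m', hue='type', palette='viridis')
# Larger Ma values are older, so invert the x-axis to make time run from left to right
plt.gca().invert_xaxis()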
plt.title('Dinosaur Length Over Time')
plt.xlabel('Age (million years ago)')
plt.ylabel('Length (meters)')
plt.show()
# 5. Use the AI assistant to create an interactive map showing each record.
fig = px.scatter_geo(dinosaurs, lat='lat', lon='lng', hover_name='name', color='type',
title='Interactive Map of Dinosaur Fossil Records')
fig.show()
# 6. Any other insights you found during your analysis?
# Example: Distribution of dinosaur lengths
plt.figure(figsize=(10, 6))
sns.histplot(dinosaurs['length_m'].dropna(), kde=True, color='blue')
plt.title('Distribution of Dinosaur Lengths')
plt.xlabel('Length (meters)')
plt.ylabel('Frequency')
plt.show()
# Preview the dataframe
dinosaurs
💪 Challenge I
Help your colleagues at the museum to gain insights on the fossil record data. Include:
1. How many different dinosaur names are present in the data?
2. Which was the largest dinosaur? What about missing data in the dataset?
3. What dinosaur type has the most occurrences in this dataset? Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type.
4. Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.
5. Use the AI assistant to create an interactive map showing each record.
6. Any other insights you found during your analysis?
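For the question of whether dinosaurs got bigger over time, a numeric summary can complement a scatter plot. The sketch below uses hypothetical records (not the real dataset) and two choices that are assumptions of this example: the midpoint of `max_ma` and `min_ma` as a single age estimate, and 50-million-year bins. It computes a Spearman rank correlation (as the Pearson correlation of ranks) and the median length per bin.

```python
import pandas as pd

# Hypothetical records, NOT from the real dataset;
# column names follow the data table.
records = pd.DataFrame(
    {
        "max_ma": [230.0, 200.0, 160.0, 150.0, 100.0, 70.0],
        "min_ma": [220.0, 190.0, 150.0, 140.0, 90.0, 66.0],
        "length_m": [2.0, 6.0, 5.0, 20.0, 12.0, 25.0],
    }
)

# Use the midpoint of each record's age range as a single age estimate
records["mid_ma"] = (records["max_ma"] + records["min_ma"]) / 2

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
# A negative value means younger records (smaller Ma) tend to be longer.
rho = records["mid_ma"].rank().corr(records["length_m"].rank())

# Median length per 50-million-year bin (bin edges are an arbitrary choice here)
bins = [50, 100, 150, 200, 250]
records["age_bin"] = pd.cut(records["mid_ma"], bins=bins)
median_by_bin = records.groupby("age_bin", observed=True)["length_m"].median()

print(f"Spearman correlation between age and length: {rho:.2f}")
print(median_by_bin)
```

On real data, rows with a missing `length_m` would need to be dropped or handled first, and a correlation over all records mixes very different dinosaur types, so a per-type breakdown may be more informative.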
2️⃣ Part 2 (SQL) - Understanding movie data 🎥
📖 Background
You have just been hired by a large movie studio to perform data analysis. Your manager, an executive at the company, wants to make new movies that "recapture the magic of old Hollywood." So you've decided to look at the most successful films that came out before Titanic in 1997 to identify patterns and help generate ideas that could turn into future successful films.
💾 The data
You have access to the following table, cinema.films:
| Column name | Description |
|---|---|
| id | Unique movie identifier. |
| title | The title of the movie. |
| release_year | The year the movie was released to the public. |
| country | The country in which the movie was released. |
| duration | The runtime of the movie, in minutes. |
| language | The original language the movie was produced in. |
| certification | The rating the movie was given based on its suitability for audiences. |
| gross | The revenue the movie generated at the box office, in USD. |
| budget | The available budget the production had for producing the movie, in USD. |
You can click the "Browse tables" button in the upper right-hand corner of the SQL cell below to view the available tables. They will show on the left of the notebook.
The data was sourced from IMDb.
SELECT *
FROM cinema.films
LIMIT 10
💪 Challenge II
Help your team leader understand the data that's available in the cinema.films dataset. Include:
- How many movies are present in the database?
# Assumes `df` holds the full result of `SELECT * FROM cinema.films` (no LIMIT clause)
num_movies = df.shape[0]
print(f"Number of movies: {num_movies}")
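One caveat: if `df` is the result of a preview query capped with `LIMIT 10`, `df.shape[0]` reports the size of the preview rather than the size of the table. The sketch below illustrates the difference using Python's built-in `sqlite3` module and a small in-memory table with hypothetical rows. The table is only a stand-in for `cinema.films`, not the real database.

```python
import sqlite3

# In-memory stand-in for cinema.films with hypothetical rows (not real IMDb data)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE films (id INTEGER PRIMARY KEY, title TEXT, release_year INTEGER)"
)
rows = [(i, f"Film {i}", 1980 + i) for i in range(1, 26)]
conn.executemany("INSERT INTO films VALUES (?, ?, ?)", rows)

# A preview query capped with LIMIT returns at most 10 rows, whatever the table size
preview_rows = conn.execute("SELECT * FROM films LIMIT 10").fetchall()

# COUNT(*) counts every row in the table
(total_movies,) = conn.execute("SELECT COUNT(*) FROM films").fetchone()

print(f"Rows in preview: {len(preview_rows)}")
print(f"Number of movies: {total_movies}")
conn.close()
```

In the notebook itself, the equivalent would be a SQL cell running `SELECT COUNT(*) FROM cinema.films`, which avoids loading every row just to count them.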