Everyone Can Learn Data Scholarship
Background
The second "Everyone Can Learn Data" Scholarship from DataCamp is now open for entries.
The challenges below test the coding skills you gained from beginner courses in Python, R, or SQL. Pair them with the help of AI and your creative thinking skills to win $5,000 for your future data science studies!
The scholarship is open to secondary and undergraduate students, and to other students preparing for graduate-level studies (those working toward their Bachelor's degree). Postgraduate students (PhDs) and graduated students (Master's degree) cannot apply.
The challenge consists of two parts. Make sure to complete both parts before submitting. Good luck!
Part 1 (Python) - Dinosaur Discovery Insights
Background
You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.
The data
You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):
| Column name | Description |
|---|---|
| occurence_no | The original occurrence number from the Paleobiology Database. |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
| diet | The main diet (omnivorous, carnivorous, herbivorous). |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
| length_m | The maximum length, from head to tail, in meters. |
| max_ma | The age at which the first fossil records of the dinosaur were found, in million years. |
| min_ma | The age at which the last fossil records of the dinosaur were found, in million years. |
| region | The current region where the fossil record was found. |
| lng | The longitude where the fossil record was found. |
| lat | The latitude where the fossil record was found. |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
| family | The taxonomical family of the dinosaur (if known). |
The data was enriched with data from Wikipedia.
# Import the pandas and numpy packages
import pandas as pd
import numpy as np
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
# Preview the dataframe
dinosaurs
Challenge I
Help your colleagues at the museum to gain insights on the fossil record data. Include:
- How many different dinosaur names are present in the data?
- Which was the largest dinosaur? What about missing data in the dataset?
- What dinosaur type has the most occurrences in this dataset? Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type. Use the AI assistant to tweak your visualization (colors, labels, title...).
- Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.
- Use the AI assistant to create an interactive map showing each record.
- Any other insights you found during your analysis?
# Load and inspect the dataset
import pandas as pd
# Load the dataset
data = pd.read_csv('data/dinosaurs.csv')
# Count unique dinosaur names
unique_dinosaur_names = data['name'].nunique()
print(f"There are {unique_dinosaur_names} different dinosaur names in the dataset.")
# Find the largest dinosaur (every row that shares the maximum length)
largest_dinosaur = data[data['length_m'] == data['length_m'].max()]
# Count missing values per column
missing_data = data.isnull().sum()
print(largest_dinosaur)
print(missing_data)
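Raw missing-value counts are easier to judge when shown as a share of all rows. The sketch below is one possible way to report that. The helper name `missing_summary` and the toy rows are made up for illustration and are not taken from the museum dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Columns with the most gaps come first
    return summary.sort_values("missing", ascending=False)


# Hypothetical toy rows, used only to demonstrate the helper
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "diet": ["herbivorous", None, "carnivorous", None],
    "length_m": [12.0, np.nan, 3.5, 7.0],
})
print(missing_summary(toy))
```

Calling `missing_summary(data)` on the loaded dataframe should produce the same kind of table for the real columns.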
import matplotlib.pyplot as plt
# Count the number of occurrences per dinosaur type
type_counts = data['type'].value_counts()
# Plotting
plt.figure(figsize=(10, 6))
type_counts.plot(kind='bar', color='skyblue')
plt.title('Number of Dinosaurs per Type')
plt.xlabel('Dinosaur Type')
plt.ylabel('Occurrences')
plt.xticks(rotation=45)
plt.show()
plt.figure(figsize=(10, 6))
plt.scatter(data['max_ma'], data['length_m'], alpha=0.5)
plt.title('Dinosaur Length Over Time')
plt.xlabel('Age in Million Years Ago')
plt.ylabel('Length in Meters')
plt.gca().invert_xaxis()
plt.show()
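A scatter plot shows the overall shape, but a single number can make the trend easier to state. The sketch below computes a Pearson correlation and a least-squares slope on a few hypothetical rows, not real records. Because a larger `max_ma` means an older fossil, a negative slope here would suggest that more recent dinosaurs tend to be longer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records, not taken from the museum dataset
toy = pd.DataFrame({
    "max_ma": [230.0, 200.0, 160.0, 120.0, 90.0, 70.0],
    "length_m": [2.0, 4.0, np.nan, 9.0, 12.0, 11.0],
})

# Drop rows with a missing age or length before measuring the relationship
clean = toy.dropna(subset=["max_ma", "length_m"])

# Pearson correlation between age and length
corr = clean["max_ma"].corr(clean["length_m"])

# Least-squares line: length_m = slope * max_ma + intercept
slope, intercept = np.polyfit(clean["max_ma"], clean["length_m"], deg=1)

print(f"correlation: {corr:.2f}, slope: {slope:.3f} m per million years")
```

The same steps could be applied to `data[['max_ma', 'length_m']]`. Correlation only measures a linear association, so it is best read alongside the plot.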
import folium
# Create a base map (named fossil_map to avoid shadowing the built-in map)
fossil_map = folium.Map(location=[20, 0], zoom_start=2)
# Add one marker per fossil record
for idx, row in data.iterrows():
    folium.Marker(location=[row['lat'], row['lng']], popup=f"{row['name']} ({row['type']})").add_to(fossil_map)
# Display the map
fossil_map
# Step 6: additional insights
# Analyzing the relationship between dinosaur diet and type
import seaborn as sns
plt.figure(figsize=(12, 8))
sns.countplot(x='type', hue='diet', data=data)
plt.title('Dinosaur Types by Diet')
plt.xlabel('Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Diet')
plt.show()
# Analyzing temporal trends in dinosaur discoveries
# 'max_ma' is the age, in million years, of the first (oldest) fossil record; ages above 250 fall outside the bins and become NaN
data['DiscoveryPeriod'] = pd.cut(data['max_ma'], bins=[0, 50, 100, 150, 200, 250], labels=['0-50', '51-100', '101-150', '151-200', '201-250'])
plt.figure(figsize=(12, 6))
sns.countplot(x='DiscoveryPeriod', data=data)
plt.title('Dinosaur Discoveries Over Geological Periods')
plt.xlabel('Geological Period (Million Years Ago)')
plt.ylabel('Number of Discoveries')
plt.show()
For this part of the analysis, I explored two further insights: the relationship between dinosaur types and their diets, and temporal trends in the number of fossil records.
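The temporal analysis relies on `pd.cut`, whose default intervals are open on the left and closed on the right, so boundary values land in the lower bin and values outside the outer edges become NaN. The sketch below illustrates this with hypothetical ages and label strings chosen only for the example.

```python
import pandas as pd

# Hypothetical ages in million years, chosen to sit on or near bin edges
ages = pd.Series([50.0, 50.1, 100.0, 249.9, 251.0])

bins = [0, 50, 100, 150, 200, 250]
labels = ["0-50", "50-100", "100-150", "150-200", "200-250"]

# Default right=True: each bin is (left, right], so 50.0 falls in "0-50"
binned = pd.cut(ages, bins=bins, labels=labels)

print(binned.tolist())
print("Values outside the bins:", int(binned.isna().sum()))
```

Checking `binned.isna().sum()` on the real column is a quick way to confirm that no records are silently dropped from the count plot.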
Part 2 (SQL) - Classic Cinema Analysis
Background
You have just been hired by a large movie studio to perform data analysis. Your manager, an executive at the company, wants to make new movies that "recapture the magic of old Hollywood." So you've decided to look at the most successful films that came out before Titanic in 1997 to identify patterns and help generate ideas that could turn into future successful films.