Everyone Can Learn Data Scholarship
Background
The second "Everyone Can Learn Data" Scholarship from DataCamp is now open for entries.
The challenges below test the coding skills you gained from beginner courses in Python, R, or SQL. Pair them with the help of AI and your creative thinking skills, and win $5,000 for your future data science studies!
The scholarship is open to secondary and undergraduate students, and to other students preparing for graduate-level studies (working toward their bachelor's degree). Postgraduate students (PhD) and graduates (master's degree) cannot apply.
The challenge consists of two parts. Make sure to complete both parts before submitting. Good luck!
Learn more
The following DataCamp courses can help review the skills to get started for this challenge:
Introduction to Data Science Notebooks
You can skip this section if you are already familiar with data science notebooks.
Data science notebooks
A data science notebook is a document containing text cells (what you're reading now) and code cells. What is unique about a notebook is that it is interactive: you can change or add code cells, then run a cell by selecting it and clicking the Run button on the right (▶, or Run All at the top) or pressing Ctrl + Enter.
The result will be displayed directly in the notebook.
Try running the Python cell below:
# Run this cell to see the result (click on Run on the right, or Ctrl|CMD + Enter)
100 * 1.75 * 20
Modify any of the numbers and rerun the cell.
You can add a Markdown, Python|R, or SQL cell by clicking on the Add Markdown, Add Code, and Add SQL buttons that appear as you move the mouse pointer near the bottom of any cell.
You can also use our AI assistant by asking it what you want to do. See it in action here.
Here at DataCamp, we call our interactive notebook Workspace. You can find out more about Workspace here.
Welcome All
My name is Karmel Hassan Jaradat. I am 19 years old and I live in the State of Palestine, specifically the city of Hebron. I'm a second-year computer science student, and I've delved into the world of front-end development. I have begun a journey of continuous learning to enhance my programming skills.
Part 1 (Python) - Dinosaur data
Background
You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.
# Import the pandas and numpy packages
import pandas as pd
import numpy as np
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
import pandas as pd
import numpy as np
# Load the dataset, printing a clear message if the file is missing
try:
    dinosaurs = pd.read_csv('data/dinosaurs.csv')
    print("Dataset loaded successfully")
except FileNotFoundError:
    print("The dataset file was not found. Please check the dataset file path and name")

if 'dinosaurs' in locals():
    print(dinosaurs.head())

    # Number of different dinosaur names
    unique_dinosaur_names = dinosaurs['name'].nunique()
    print(f"Different dinosaur names: {unique_dinosaur_names}")

    # Largest dinosaur by length (idxmax ignores missing lengths)
    largest_dinosaur = dinosaurs.loc[dinosaurs['length_m'].idxmax()]
    print(f"Largest dinosaur: {largest_dinosaur['name']} with length {largest_dinosaur['length_m']} meters")

    # Missing values per column
    missing_data = dinosaurs.isnull().sum()
    print("Missing data in each column:")
    print(missing_data)

    # Occurrences per dinosaur type (value_counts skips missing types)
    type_counts = dinosaurs['type'].value_counts()
    print("Number of occurrences per dinosaur type:")
    print(type_counts)

    # Bar chart of the number of dinosaurs per type
    import matplotlib.pyplot as plt
    plt.figure(figsize=(10, 6))
    type_counts.plot(kind='bar', color='skyblue')
    plt.title('Number of Dinosaurs per Type')
    plt.xlabel('Dinosaur Type')
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.show()

    # Scatter plot of length against age
    plt.figure(figsize=(10, 6))
    plt.scatter(dinosaurs['max_ma'], dinosaurs['length_m'], alpha=0.5)
    plt.title('Dinosaur Length vs. Age')
    plt.xlabel('Age (million years ago)')
    plt.ylabel('Length (meters)')
    plt.show()

    # Interactive map with one marker per fossil record
    import folium
    map_center = [dinosaurs['lat'].mean(), dinosaurs['lng'].mean()]
    dino_map = folium.Map(location=map_center, zoom_start=2)
    for _, row in dinosaurs.iterrows():
        folium.Marker([row['lat'], row['lng']], popup=f"{row['name']} ({row['length_m']}m)").add_to(dino_map)
    dino_map.save('dinosaur_map.html')
    print("Interactive map saved as 'dinosaur_map.html'")
else:
    print("DataFrame 'dinosaurs' is not defined. Please check the dataset file path and name")
The largest dinosaur identified from the dataset is the Supersaurus, a herbivorous sauropod. Detailed information about this dinosaur includes:
- Length: 35.0 meters
- Era: Lived approximately between 155.7 and 145.0 million years ago (Late Jurassic period)
- Region: Fossils found in Colorado
- Class: Saurischia
- Family: Diplodocidae
This significant finding highlights the remarkable size of the Supersaurus, which is among the largest dinosaurs ever discovered.
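One caveat when looking up the largest dinosaur: idxmax() returns only the first row that holds the maximum, so any ties go unreported. The sketch below uses a small made-up table (the names and lengths other than the 35.0 m Supersaurus are hypothetical, not taken from the dataset) to show one way to list every record that shares the maximum length.

```python
import pandas as pd

# Hypothetical toy records for illustration; the real data is in data/dinosaurs.csv.
toy = pd.DataFrame({
    "name": ["Supersaurus", "Example tie", "Small example", "No length"],
    "length_m": [35.0, 35.0, 12.0, None],
})

# idxmax() skips missing values and returns the first row with the maximum.
first_largest = toy.loc[toy["length_m"].idxmax(), "name"]

# Comparing against max() keeps every row that shares the maximum length.
all_largest = toy.loc[toy["length_m"] == toy["length_m"].max(), "name"].tolist()

print(first_largest)
print(all_largest)
```

If the second list holds more than one name, the "largest dinosaur" answer should mention all of them.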
Missing Data Summary
A review of the dataset reveals some gaps that could affect the analysis:
- Diet: 1,355 records are missing dietary information.
- Type: 1,355 records are missing type classification.
- Length: 1,383 records are missing length information.
- Region: 42 records are missing region information.
- Family: 1,457 records are missing family classification.
Given these gaps, I clean the data separately for each question, removing only the rows that are missing the column a given analysis needs. Dropping every incomplete row from the CSV file at once would discard a significant number of rows.
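As a minimal sketch of this per-question cleaning, assuming pandas and a small made-up table (the rows below are hypothetical), dropna(subset=...) removes only the rows that are missing the column a particular analysis depends on:

```python
import pandas as pd

# Hypothetical toy records for illustration; the real data is in data/dinosaurs.csv.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "type": ["sauropod", None, "ornithopod", None],
    "length_m": [35.0, 12.0, None, None],
})

# Dropping every incomplete row keeps only fully populated records.
fully_clean = toy.dropna()

# Per-question cleaning: each analysis drops only rows missing its own column.
for_type_counts = toy.dropna(subset=["type"])
for_length_stats = toy.dropna(subset=["length_m"])

print(len(fully_clean), len(for_type_counts), len(for_length_stats))
```

In this toy table the blanket dropna() keeps one row, while each per-question subset keeps two.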
The Result:
For the type counts, I removed only the entries with null values in the "type" column, so that rows missing other fields were still counted.
The ornithopods are the most prevalent, appearing nearly 800 times in the fossil records.
Following closely are the large theropods and small theropods, each with over 700 occurrences.
The sauropods also make a notable appearance, with over 600 occurrences.
On the other hand, the ceratopsians and armored dinosaurs are less common, with ceratopsians just above 400 and armored dinosaurs slightly under 400 occurrences.
These counts indicate that ornithopods and theropods are the most frequently recorded groups in this dataset, which suggests they were widespread during their era.
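Because many records have no type, it can help to report the missing share next to the counts. The following is a small sketch on made-up values (the series below is hypothetical) showing how value_counts behaves with and without missing entries:

```python
import pandas as pd

# Hypothetical toy column for illustration; the real column is dinosaurs['type'].
types = pd.Series(
    ["ornithopod", "ornithopod", "sauropod", None, "small theropod", None],
    name="type",
)

# By default, value_counts() leaves missing values out of the tally.
counts = types.value_counts()

# dropna=False adds a row for the missing entries.
counts_with_missing = types.value_counts(dropna=False)

# normalize=True gives each type's share of the records with a known type.
share = types.value_counts(normalize=True)

print(counts)
print(counts_with_missing)
print(share)
```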
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
try:
    dinosaurs = pd.read_csv('data/dinosaurs.csv')
    print("Dataset loaded successfully")
except FileNotFoundError:
    print("The dataset file was not found. Please check the dataset file path and name")

if 'dinosaurs' in locals():
    # Calculate the midpoint of the age interval
    dinosaurs['mid_ma'] = (dinosaurs['max_ma'] + dinosaurs['min_ma']) / 2

    # Set up the plot
    fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(18, 8))

    # Iterate over each dinosaur type (dropna skips records with no type,
    # which would otherwise produce an empty subset)
    dinosaur_types = dinosaurs['type'].dropna().unique()
    for dinosaur_type in dinosaur_types:
        subset = dinosaurs[dinosaurs['type'] == dinosaur_type]
        # Plot the scatter points and the regression line
        sns.scatterplot(x='mid_ma', y='length_m', data=subset, label=dinosaur_type, ax=ax1)
        sns.regplot(x='mid_ma', y='length_m', data=subset, order=2, scatter=False, ax=ax1)

    # Customize the scatter plot
    ax1.set_title('Dinosaur Length Over Time by Type with Polynomial Regression')
    ax1.set_xlabel('Midpoint Age (Millions of Years Ago)')
    ax1.set_ylabel('Length (Meters)')
    ax1.invert_xaxis()  # Invert the x-axis so older ages (higher values) are on the left
    ax1.legend()

    # Plot the KDE for all fossils
    sns.kdeplot(data=dinosaurs['mid_ma'], bw_adjust=1, color='blue', label='All Fossils', ax=ax2)

    # Customize the KDE plot
    ax2.set_title('Density Plot of All Fossils Over Time')
    ax2.set_xlabel('Midpoint Age (Millions of Years Ago)')
    ax2.set_ylabel('Density')
    ax2.invert_xaxis()  # Invert the x-axis so older ages (higher values) are on the left
    ax2.legend()

    # Set the same x-axis limits for both subplots (max first keeps the axis inverted)
    min_x = dinosaurs['mid_ma'].min()
    max_x = dinosaurs['mid_ma'].max()
    ax1.set_xlim(max_x, min_x)
    ax2.set_xlim(max_x, min_x)

    plt.tight_layout()
    plt.show()
else:
    print("DataFrame 'dinosaurs' is not defined. Please check the dataset file path and name.")
Code explanations:
- Loading data: The code loads the dataset and prints a success or error message.
- Midpoint calculation: The midpoint of the age interval (mid_ma) is calculated for each dinosaur record as the average of max_ma and min_ma.
- Preparing graphs: A figure with two side-by-side plots is created. The first shows the relationship between dinosaur length and time period for each dinosaur type, with a second-order polynomial regression line per type. The second shows the density of all fossils over time.
- Customizing graphs: Titles and axis labels are set, and the x-axis is inverted so that older ages appear on the left. Both plots use the same x-axis limits.
- Displaying graphs: tight_layout() adjusts the spacing before the figure is displayed.
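To make the midpoint and regression steps concrete, here is a rough numeric sketch on made-up records (every row except the 155.7 to 145.0 interval and the 35.0 m length is hypothetical). It fits a degree-2 polynomial with numpy.polyfit, which is what a second-order regression line amounts to, though it is meant only as an illustration and not as a reproduction of the plot above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records for illustration; the real data is in data/dinosaurs.csv.
toy = pd.DataFrame({
    "max_ma": [155.7, 100.0, 72.0, 70.0],
    "min_ma": [145.0, 90.0, 66.0, 66.0],
    "length_m": [35.0, 12.0, 9.0, 3.0],
})

# Midpoint of each age interval, in millions of years ago.
toy["mid_ma"] = (toy["max_ma"] + toy["min_ma"]) / 2

# Degree-2 least-squares fit; coefficients are ordered from highest power to lowest.
coeffs = np.polyfit(toy["mid_ma"], toy["length_m"], deg=2)
predict = np.poly1d(coeffs)

# Estimated length at a midpoint age of 100 million years ago.
estimate = predict(100.0)
print(toy["mid_ma"].tolist())
print(estimate)
```

With so few points the fitted curve is not meaningful in itself; the point is only the shape of the calculation.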