Competition - Everyone Can Learn Data Scholarship

Everyone Can Learn Data Scholarship

📖 Background

The second "Everyone Can Learn Data" Scholarship from DataCamp is now open for entries.

The challenges below test your coding skills you gained from beginner courses on either Python, R, or SQL. Pair them with the help of AI and your creative thinking skills and win $5,000 for your future data science studies!

The scholarship is open to secondary and undergraduate students, and other students preparing for graduate-level studies (getting their Bachelor degree). Postgraduate students (PhDs) or graduated students (Master degree) cannot apply.

The challenge consist of two parts, make sure to complete both parts before submitting. Good luck!

💡 Learn more

The following DataCamp courses can help review the skills to get started for this challenge:

ℹ️ Introduction to Data Science Notebooks

You can skip this section if you are already familiar with data science notebooks.

Data science notebooks

A data science notebook is a document containing text cells (what you're reading now) and code cells. What is unique with a notebook is that it's interactive: You can change or add code cells and then run a cell by selecting it and then clicking the Run button to the right ( ▶, or Run All on top) or hitting control + enter.

The result will be displayed directly in the notebook.

Try running the Python cell below:

# Run this cell to see the result (click on Run on the right, or Ctrl|CMD + Enter)
100 * 1.75 * 20

Modify any of the numbers and rerun the cell.

You can add a Markdown, Python|R, or SQL cell by clicking on the Add Markdown, Add Code, and Add SQL buttons that appear as you move the mouse pointer near the bottom of any cell.

🤖 You can also make use of our AI assistent, by asking it what you want to do. See it in action here.

Here at DataCamp, we call our interactive notebook Workspace. You can find out more about Workspace here.

2 hidden cells

# Import the pandas and numpy packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import folium


# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
dinosaur_names = dinosaurs['name']

# How many different dinosaur names are present in the data?
unique_dinosaur_names = dinosaur_names.nunique()
print(f'The number of unique dinosaur names is: {unique_dinosaur_names}')

#Which was the largest dinosaur? What about missing data in the dataset?
print(dinosaurs.head())
dinosaurs.dropna(subset=['name'], inplace=True)

# Drop rows where 'length' is missing
dinosaurs.dropna(subset=['length_m'], inplace=True)

# Ensure the 'length' columns are numeric
dinosaurs['length_m'] = pd.to_numeric(dinosaurs['length_m'], errors='coerce')

# Determine the largest dinosaur based on length
largest_dinosaur_by_length = dinosaurs.loc[dinosaurs['length_m'].idxmax()]
print(f"The largest dinosaur by length is {largest_dinosaur_by_length['name']} with a length of {largest_dinosaur_by_length['length_m']} meters.")

# Count occurrences by dinosaur type
dinosaur_type_counts = dinosaurs['type'].value_counts()

# Create a bar chart
plt.figure(figsize=(12, 8))
bars = plt.bar(dinosaur_type_counts.index, dinosaur_type_counts.values, color='seagreen')

# Add labels and title
plt.xlabel('Dinosaur Type', fontsize=14)
plt.ylabel('Number of Occurrences', fontsize=14)
plt.title('Number of Dinosaurs per type', fontsize=16)

# Customize the x-axis and y-axis labels
plt.xticks(rotation=45, ha='right', fontsize=12)
plt.yticks(fontsize=12)

# Add a grid for better readability
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Add the counts above the bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2.0, height, f'{int(height)}', ha='center', va='bottom', fontsize=12)

# Show the plot
plt.tight_layout()
plt.show()


#What dinosaur type has the most occurrences in this dataset? Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type. Use the AI assistant to tweak your visualization (colors, labels, title...).
#Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.

class_to_age = {
    'Saurischia': 150,  # Midpoint for late Jurassic to early Cretaceous
    'Ornithischia': 100, # Midpoint for late Jurassic to late Cretaceous
    'Theropoda': 130,   # Spanning Triassic to Cretaceous
    'Sauropoda': 160    # Mostly in the Jurassic period
}

# Map the dinosaur classes to ages
dinosaurs['age'] = dinosaurs['class'].map(class_to_age)

# Handle missing data
# Drop rows where 'name', 'length', or 'age' is missing
dinosaurs.dropna(subset=['name', 'length_m', 'age'], inplace=True)

# Ensure the 'length' and 'age' columns are numeric
dinosaurs['length_m'] = pd.to_numeric(dinosaurs['length_m'], errors='coerce')
dinosaurs['age'] = pd.to_numeric(dinosaurs['age'], errors='coerce')

# Create a scatter plot to show the relationship between length and age
plt.figure(figsize=(12, 8))
sns.scatterplot(data=dinosaurs, x='age', y='length_m', hue='name', palette='viridis', s=100)

# Add labels and title
plt.xlabel('Age (in millions of years)', fontsize=14)
plt.ylabel('Length (in meters)', fontsize=14)
plt.title('Dinosaur Length vs. Age', fontsize=16)

# Show a legend
plt.legend(title='Dinosaur Name', bbox_to_anchor=(1.05, 1), loc='upper left')

# Show the plot
plt.tight_layout()
plt.show()

#Use the AI assitant to create an interactive map showing each record.
import folium

# Create a map object
m = folium.Map(location=[20, 0], zoom_start=2)

# Add points to the map
for idx, row in dinosaurs.iterrows():
    folium.Marker([row['lat'], row['lng']], popup=f"{row['name']} ({row['age']} million years ago)").add_to(m)



#Other Insights 
# Correlation between Length and Age
correlation = dinosaurs['length_m'].corr(dinosaurs['age'])
print(f'Correlation between Length and Age: {correlation}')

# Box Plot of Dinosaur Lengths
plt.figure(figsize=(10, 6))
sns.boxplot(y=dinosaurs['length_m'], color='green')
plt.title('Box Plot of Dinosaur Lengths')
plt.ylabel('Length (in meters)')
plt.show()

# Count of Dinosaurs by Category
# Example categorical column 'class'
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Example dataset
data = {
    'name': ['Tyrannosaurus', 'Diplodocus', 'Triceratops'],
    'class': ['Theropoda', 'Sauropoda', 'Ornithischia']
}

# Creating the DataFrame
dinosaurs = pd.DataFrame(data)

# Plotting the count of dinosaurs by class
plt.figure(figsize=(10, 6))
sns.countplot(x='class', data=dinosaurs, palette='viridis')
plt.title('Count of Dinosaurs by Class')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()


# Length vs. Age with Regression Line
data = {
    'name': ['Tyrannosaurus', 'Diplodocus', 'Triceratops'],
    'age': [68, 154, 66],  # Age in millions of years
    'length_m': [12.3, 27.0, 9.0]  # Length in meters
}

# Creating the DataFrame
dinosaurs = pd.DataFrame(data)

# Plotting Length vs. Age with Regression Line
plt.figure(figsize=(10, 6))
sns.regplot(x='age', y='length_m', data=dinosaurs, scatter_kws={'s':100}, line_kws={'color':'red'})
plt.title('Dinosaur Length vs. Age with Regression Line')
plt.xlabel('Age (in millions of years)')
plt.ylabel('Length (in meters)')
plt.show()

dinosaurs

💪 Challenge I

Help your colleagues at the museum to gain insights on the fossil record data. Include:

How many different dinosaur names are present in the data?
Which was the largest dinosaur? What about missing data in the dataset?
What dinosaur type has the most occurrences in this dataset? Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type. Use the AI assistant to tweak your visualization (colors, labels, title...).
Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.
Use the AI assitant to create an interactive map showing each record.
Any other insights you found during your analysis?

2️⃣ Part 2 (SQL) - Understanding movie data 🎥

📖 Background

You have just been hired by a large movie studio to perform data analysis. Your manager, an executive at the company, wants to make new movies that "recapture the magic of old Hollywood." So you've decided to look at the most successful films that came out before Titanic in 1997 to identify patterns and help generate ideas that could turn into future successful films.

💾 The data

You have access to the following table, cinema.films:

Column name	Description
id	Unique movie identifier.
title	The title of the movie.
release_year	The year the movie was released to the public.
country	The country in which the movie was released.
duration	The runtime of the movie, in minutes.
language	The original language the movie was produced in.
certification	The rating the movie was given based on their suitability for audiences.
gross	The revenue the movie generated at the box office, in USD.
budget	The available budget the production had for producing the movie, in USD.

You can click the "Browse tables" button in the upper right-hand corner of the SQL cell below to view the available tables. They will show on the left of the notebook.

The data was sourced from IMDb.

DataFrameas

go_analysis

variable

-- Query to count the total number of movies
SELECT COUNT(*) AS movie_count
FROM cinema.films;
-- Query to count total rows and rows with missing data in gross and budget columns
SELECT 
    COUNT(*) AS total_rows,
    COUNT(*) FILTER (WHERE gross IS NULL) AS missing_gross,
    COUNT(*) FILTER (WHERE budget IS NULL) AS missing_budget,
    COUNT(*) FILTER (WHERE gross IS NULL OR budget IS NULL) AS rows_with_missing_data
FROM cinema.films;
-- Query for the total number of different certifications or rating present in the Database
SELECT COUNT(DISTINCT certification) AS Number_of_certifications
FROM cinema.films;
-- Query for top five countries in terms of number of movies produced?
SELECT country, COUNT(*) AS movie_count
FROM cinema.films
GROUP BY country
ORDER BY movie_count DESC
LIMIT 5;
-- Query for the average duration of English versus French movies
SELECT language, ROUND(AVG(duration),2) AS average_duration
FROM cinema.films
WHERE language IN ('English', 'French')
GROUP BY language;



--Other Insights include
--Average budget by Country
SELECT country, AVG(budget) AS average_budget
FROM cinema.films
GROUP BY country
ORDER BY average_budget DESC;
--Movie Count by Certification:
SELECT certification, COUNT(*) AS movie_count
FROM cinema.films
GROUP BY certification
ORDER BY movie_count DESC;
--Year with Most Movies Produced
SELECT release_year, COUNT(*) AS movie_count
FROM cinema.films
GROUP BY release_year
ORDER BY movie_count DESC
LIMIT 1;
-- Distribution of Movies by Duration
SELECT 
    CASE 
        WHEN duration < 40 THEN 'Short'
        WHEN duration BETWEEN 40 AND 120 THEN 'Feature'
        ELSE 'Long'
    END AS duration_category, 
    COUNT(*) AS movie_count
FROM cinema.films
GROUP BY duration_category
ORDER BY movie_count DESC;
--Most Common Languages 
SELECT language, COUNT(*) AS movie_count
FROM cinema.films
GROUP BY language
ORDER BY movie_count DESC
LIMIT 10;
-- Average Duration by Certifications 
SELECT certification, ROUND(AVG(duration),2) AS average_duration
FROM cinema.films
GROUP BY certification
ORDER BY average_duration DESC;

💪 Challenge II

Help your team leader understand the data that's available in the cinema.films dataset. Include:

How many movies are present in the database?
There seems to be a lot of missing data in the gross and budget columns. How many rows have missing data? What would you recommend your manager to do with these rows?
How many different certifications or ratings are present in the database?
What are the top five countries in terms of number of movies produced?
What is the average duration of English versus French movies? (Don't forget you can use the AI assistant!)
Any other insights you found during your analysis?

🧑‍⚖️ Judging criteria

Public Upvotes - The top 100 most upvoted entries will be judged according to the criteria below. Entries in position 101 or below will not proceed to the next stage. Only votes made by accounts registered before voting opens will count towards final decisions.

CATEGORY	WEIGHTING	DETAILS
Response quality	65%	Accuracy (20%) - The response must be representative of the original data and free from errors. Clarity (20%) - The response must be easy to understand and clearly expressed. Completeness (15%) - The response must be a full report that responds to the question posed. Insights (10%) - The response must contain some insights based on the data using your own judgment and interpretation.
Storytelling	20%	How well the response is connected to the original data. How the narrative and whole response connects together. The report contains sufficient depth but is also concise. How the response flows from one point to the next.
Presentation	15%	How legible/understandable the response is. How well-formatted the response is. Spelling and grammar.

In the event of a tie, user XP may be used as a tie-breaker.

‌
‌
‌

Competition - Everyone Can Learn Data Scholarship

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Everyone Can Learn Data Scholarship

📖 Background

💡 Learn more

ℹ️ Introduction to Data Science Notebooks

Data science notebooks

💪 Challenge I

2️⃣ Part 2 (SQL) - Understanding movie data 🎥

📖 Background

💾 The data

You have access to the following table, cinema.films:

💪 Challenge II

🧑‍⚖️ Judging criteria

Everyone Can Learn Data Scholarship