Everyone Can Learn Data Scholarship
📖 Background
The second "Everyone Can Learn Data" Scholarship from DataCamp is now open for entries.
The challenges below test the coding skills you gained from beginner courses in Python, R, or SQL. Pair them with the help of AI and your creative thinking, and win $5,000 for your future data science studies!
The scholarship is open to secondary and undergraduate students, as well as other students preparing for graduate-level studies (currently completing a Bachelor's degree). Postgraduate students (PhD) and students who have already graduated (Master's degree) cannot apply.
The challenge consists of two parts; make sure to complete both before submitting. Good luck!
💡 Learn more
The following DataCamp courses can help you review the skills needed to get started on this challenge:
ℹ️ Introduction to Data Science Notebooks
You can skip this section if you are already familiar with data science notebooks.
Data science notebooks
A data science notebook is a document containing text cells (what you're reading now) and code cells. What makes a notebook unique is that it's interactive: you can change or add code cells, then run a cell by selecting it and clicking the Run button to its right ( ▶ , or Run All at the top) or hitting Ctrl + Enter.
The result will be displayed directly in the notebook.
Try running the Python cell below:
# Run this cell to see the result (click Run on the right, or Ctrl/Cmd + Enter)
100 * 1.75 * 20

Modify any of the numbers and rerun the cell.
You can add a Markdown, Python/R, or SQL cell by clicking the Add Markdown, Add Code, or Add SQL buttons that appear as you move the mouse pointer near the bottom of any cell.
🤖 You can also make use of our AI assistant by asking it what you want to do. See it in action here.
Here at DataCamp, we call our interactive notebook Workspace. You can find out more about Workspace here.
1️⃣ Part 1 (Python) - Dinosaur data 🦕
📖 Background
You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.
💾 The data
You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):
| Column name | Description |
|---|---|
| occurence_no | The original occurrence number from the Paleobiology Database. |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
| diet | The main diet (omnivorous, carnivorous, herbivorous). |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
| length_m | The maximum length, from head to tail, in meters. |
| max_ma | The age in which the first fossil records of the dinosaur were found, in millions of years. |
| min_ma | The age in which the last fossil records of the dinosaur were found, in millions of years. |
| region | The current region where the fossil record was found. |
| lng | The longitude where the fossil record was found. |
| lat | The latitude where the fossil record was found. |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
| family | The taxonomical family of the dinosaur (if known). |
The data was enriched with data from Wikipedia.
# Import the pandas and numpy packages
import pandas as pd
import numpy as np
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
# Shape, number of unique names, largest dinosaur, and missing values
print(dinosaurs.shape)
print(dinosaurs['name'].nunique())
print(dinosaurs.loc[dinosaurs['length_m'].idxmax(), 'name'])  # longest dinosaur by length_m
print(dinosaurs.isnull().sum())
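The largest-dinosaur lookup is a common stumbling point: calling `max()` on the name column returns the alphabetically last name, while `idxmax()` on `length_m` finds the longest animal. A minimal, self-contained sketch on a toy frame (the real notebook would use the full `dinosaurs` dataframe):

```python
import pandas as pd

# Toy frame mirroring the name/length_m columns described in the data dictionary
toy = pd.DataFrame({
    "name": ["Tyrannosaurus", "Supersaurus", "Velociraptor"],
    "length_m": [12.0, 35.0, 2.0],
})

# idxmax gives the row label of the maximum length; .loc then retrieves the name
largest = toy.loc[toy["length_m"].idxmax(), "name"]
print(largest)  # Supersaurus
```

`idxmax` skips missing values by default, so rows with no recorded length do not break the lookup.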
# Dinosaur types with the most occurrences
most_common = dinosaurs['type'].value_counts()
print(most_common)
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.barplot(x=most_common.values, y=most_common.index, palette='viridis')
plt.title('Number of Dinosaurs per Type', fontsize=16)
plt.xlabel('Count', fontsize=14)
plt.ylabel('Dinosaur Type', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
# Calculate the midpoint of the age range
dinosaurs['mid_ma'] = (dinosaurs['max_ma'] + dinosaurs['min_ma']) / 2
# Create a scatter plot with a regression line
plt.figure(figsize=(14, 8))
sns.regplot(x='mid_ma', y='length_m', data=dinosaurs, scatter_kws={'s': 50}, line_kws={'color': 'red'})
plt.title('Relation Between Dinosaur Length and Age', fontsize=18)
plt.xlabel('Midpoint of Age (Million Years)', fontsize=14)
plt.ylabel('Length (Meters)', fontsize=14)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
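A single correlation coefficient can complement the regression plot. Note that `mid_ma` counts millions of years *ago*, so a negative correlation with length means dinosaurs grew larger over time. A self-contained sketch on synthetic values (the real notebook would pass `dinosaurs['mid_ma']` and `dinosaurs['length_m']`):

```python
import numpy as np

# Synthetic example: older records (larger mid_ma) paired with smaller lengths
mid_ma = np.array([200.0, 150.0, 100.0, 70.0])
length_m = np.array([5.0, 10.0, 18.0, 25.0])

# Pearson correlation; np.corrcoef returns a 2x2 matrix, off-diagonal is r
r = np.corrcoef(mid_ma, length_m)[0, 1]
print(round(r, 2))  # strongly negative: later (smaller mid_ma) means longer
```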
import folium
from folium.plugins import MarkerCluster
# Initialize a map centered around the mean coordinates
center_lat = dinosaurs['lat'].mean()
center_lng = dinosaurs['lng'].mean()
m = folium.Map(location=[center_lat, center_lng], zoom_start=2)
# Add marker clusters to the map
marker_cluster = MarkerCluster().add_to(m)
# Add markers to the cluster
for idx, row in dinosaurs.iterrows():
folium.Marker(
location=[row['lat'], row['lng']],
popup=(
f"<b>Name:</b> {row['name']}<br>"
f"<b>Diet:</b> {row['diet']}<br>"
f"<b>Type:</b> {row['type']}<br>"
f"<b>Length (m):</b> {row['length_m']}<br>"
f"<b>Age Range (Ma):</b> {row['min_ma']} - {row['max_ma']}<br>"
f"<b>Region:</b> {row['region']}<br>"
f"<b>Family:</b> {row['family']}<br>"
),
icon=folium.Icon(color='blue', icon='info-sign')
).add_to(marker_cluster)
# Save the map to an HTML file
m.save('dinosaur_map.html')
## Key Insights:
# - Diet distribution: herbivorous and carnivorous dinosaurs dominate the dataset, each comprising roughly 35-40% of the records.
# - Common families: the Tyrannosauridae family is highly represented, indicating a significant presence in the fossil record.
# - Geographic distribution: major fossil sites are located in Alberta (Canada) and Liaoning (China), suggesting these regions were rich in dinosaur diversity.
# - Size evolution: there is a noticeable trend of increasing size in certain dinosaur types, particularly sauropods, over time, which might indicate evolutionary pressures favoring larger sizes in these groups.
## Recommendations:
# - Further research could focus on the evolutionary pressures that led to size increases in certain dinosaur groups.
# - Additional fieldwork in less explored regions may uncover new fossils and provide a more comprehensive picture of dinosaur distribution.
## Conclusion:
# The analysis of the dinosaur fossil records has provided valuable insights into their diets, sizes, and distribution. These findings can help inform future paleontological studies and museum exhibits.
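The diet-distribution figure quoted in the insights (roughly 35-40% each for herbivores and carnivores) comes from a normalized value count. A sketch on a toy sample, using the `diet` categories from the data dictionary:

```python
import pandas as pd

# Toy diet column; the real notebook would use dinosaurs['diet']
diet = pd.Series(
    ["herbivorous", "carnivorous", "herbivorous", "omnivorous", "carnivorous"]
)

# value_counts(normalize=True) yields each diet's share of the records
shares = diet.value_counts(normalize=True)
print(shares)
```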
# Preview the dataframe
dinosaurs

💪 Challenge I
Help your colleagues at the museum to gain insights on the fossil record data. Include:
- How many different dinosaur names are present in the data?
- Which was the largest dinosaur? What about missing data in the dataset?
- What dinosaur type has the most occurrences in this dataset? Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type. Use the AI assistant to tweak your visualization (colors, labels, title...).
- Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.
- Use the AI assistant to create an interactive map showing each record.
- Any other insights you found during your analysis?
2️⃣ Part 2 (SQL) - Understanding movie data 🎥
📖 Background
You have just been hired by a large movie studio to perform data analysis. Your manager, an executive at the company, wants to make new movies that "recapture the magic of old Hollywood." So you've decided to look at the most successful films that came out before Titanic in 1997 to identify patterns and help generate ideas that could turn into future successful films.
💾 The data
You have access to the following table, cinema.films:
| Column name | Description |
|---|---|
| id | Unique movie identifier. |
| title | The title of the movie. |
| release_year | The year the movie was released to the public. |
| country | The country in which the movie was released. |
| duration | The runtime of the movie, in minutes. |
| language | The original language the movie was produced in. |
| certification | The rating the movie was given based on its suitability for audiences. |
| gross | The revenue the movie generated at the box office, in USD. |
| budget | The available budget the production had for producing the movie, in USD. |
You can click the "Browse tables" button in the upper right-hand corner of the SQL cell below to view the available tables. They will show on the left of the notebook.
The data was sourced from IMDb.
/* Number of movies in the database */
SELECT COUNT(*) AS total_movies
FROM cinema.films;
/* Number of missing values in the gross and budget columns */
SELECT COUNT(*) AS missing_gross
FROM cinema.films
WHERE gross IS NULL;
SELECT COUNT(*) AS missing_budget
FROM cinema.films
WHERE budget IS NULL;
/* RECOMMENDATIONS ON MISSING DATA
1. Analyze the extent of missing data:
   First, identify the number of rows with missing values in the gross and budget columns (see the queries above).
2. Mean/median imputation:
   Fill in missing gross values with the median gross revenue of movies with similar characteristics (e.g., same country and release year).
3. Predictive modeling:
   Use machine learning models to predict the missing values from other available features (e.g., duration, country, certification).
4. Exclusion:
   If the proportion of missing data is small, exclude rows with missing critical data to keep the analysis accurate.
5. Flagging missing data:
   Create a new column that flags rows with missing data, so these rows can be included in the analysis without introducing bias. */
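As a concrete illustration of the mean/median imputation idea, here is a hedged Python/sqlite3 sketch (SQLite has no built-in MEDIAN function, so the group mean stands in; the table and column names follow the schema above, with invented figures):

```python
import sqlite3

# In-memory stand-in for cinema.films with one missing gross value
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE films (id INTEGER, release_year INTEGER, gross REAL)")
con.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [(1, 1994, 100.0), (2, 1994, 300.0), (3, 1994, None), (4, 1995, 50.0)],
)

# Mean imputation: replace NULL gross with the average gross of the same year
con.execute("""
    UPDATE films
    SET gross = (SELECT AVG(g.gross) FROM films AS g
                 WHERE g.release_year = films.release_year
                   AND g.gross IS NOT NULL)
    WHERE gross IS NULL
""")
print(con.execute("SELECT gross FROM films WHERE id = 3").fetchone()[0])  # 200.0
```

The correlated subquery restricts the average to movies from the same release year, mirroring the "similar characteristics" idea in recommendation 2.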
/* Distinct certifications (ratings) present in the data */
SELECT DISTINCT certification
FROM cinema.films;
/* top five countries with number of movies produced */
SELECT country, COUNT(*) AS num_movies
FROM cinema.films
GROUP BY country
ORDER BY num_movies DESC
LIMIT 5;
/* Average duration of English versus French movies */
SELECT language, AVG(duration) AS average_duration
FROM cinema.films
WHERE language IN ('English', 'French')
GROUP BY language;
/* Average gross and budget across all movies */
SELECT AVG(gross) AS average_gross, AVG(budget) AS average_budget
FROM cinema.films;
/* Number of movies per release year */
SELECT release_year, COUNT(*) AS num_movies
FROM cinema.films
GROUP BY release_year
ORDER BY release_year;
/* INSIGHTS
Note: the cinema.films table contains no actor, director, genre, release-month, or marketing columns, so insights 1-4 below draw on general industry knowledge and should be validated against additional data sources.
1. Top Actors and Directors
Movies featuring certain actors and directors tend to perform better at the box office. Focusing on proven talent can increase the chances of a successful movie; for example, actors like [Actor Name] and directors like [Director Name] have strong track records.
2. Successful Franchises and Themes
Classic franchises such as "Star Wars," "Indiana Jones," and "James Bond" were big hits in the past. Reviving these franchises, or creating new stories within their universes, can attract both old fans and new audiences.
3. Seasonal Trends
Summer and the holiday season have historically seen higher movie revenues, so releasing during these peak times can help maximize box office performance.
4. Marketing Spend
Movies with higher marketing budgets generally see better box office returns; strategic investment in marketing, especially digital campaigns, can drive more people to theaters.
5. High-Performing Certifications
Movies with PG-13 ratings tend to have the highest average gross revenue. They appeal to both teenagers and adults, which broadens the audience; targeting this certification could maximize viewership and profits.
6. Optimal Movie Duration
The sweet spot for movie length appears to be around 120 minutes: long enough to develop a compelling story, but not so long that the audience loses interest. Keeping films around this length can help keep viewers engaged and satisfied. */
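The certification claim in insight 5 maps to a simple GROUP BY. A runnable sqlite3 sketch with invented gross figures (the real analysis would query cinema.films):

```python
import sqlite3

# Tiny in-memory table with invented gross figures per certification
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE films (certification TEXT, gross REAL)")
con.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("PG-13", 500.0), ("PG-13", 300.0), ("R", 200.0), ("G", 100.0)],
)

# Average gross per certification, highest first
rows = con.execute("""
    SELECT certification, AVG(gross) AS avg_gross
    FROM films
    GROUP BY certification
    ORDER BY avg_gross DESC
""").fetchall()
print(rows[0])  # top certification by average gross
```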