Everyone Can Learn Data Scholarship

📖 Background

The second "Everyone Can Learn Data" Scholarship from DataCamp is now open for entries.

The challenges below test the coding skills you gained from beginner courses in Python, R, or SQL. Pair them with the help of AI and your creative thinking skills to win $5,000 for your future data science studies!

The scholarship is open to secondary and undergraduate students, and other students preparing for graduate-level studies (getting their Bachelor's degree). Postgraduate students (PhDs) or graduated students (Master's degree) cannot apply.

The challenge consists of two parts; make sure to complete both parts before submitting. Good luck!

1️⃣ Part 1 (Python) - Dinosaur data 🦕

📖 Background

You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.

💾 The data

You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):

Column name | Description
--- | ---
occurrence_no | The original occurrence number from the Paleobiology Database.
name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil).
diet | The main diet (omnivorous, carnivorous, herbivorous).
type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur).
length_m | The maximum length, from head to tail, in meters.
max_ma | The age in which the first fossil records of the dinosaur were found, in millions of years.
min_ma | The age in which the last fossil records of the dinosaur were found, in millions of years.
region | The current region where the fossil record was found.
lng | The longitude where the fossil record was found.
lat | The latitude where the fossil record was found.
class | The taxonomical class of the dinosaur (Saurischia or Ornithischia).
family | The taxonomical family of the dinosaur (if known).

The data was enriched with data from Wikipedia.

💪 Challenge I

Help your colleagues at the museum to gain insights on the fossil record data. Include:

  1. How many different dinosaur names are present in the data?
  2. Which was the largest dinosaur? What about missing data in the dataset?
  3. What dinosaur type has the most occurrences in this dataset? Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type. Use the AI assistant to tweak your visualization (colors, labels, title...).
  4. Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.
  5. Use the AI assistant to create an interactive map showing each record.
  6. Any other insights you found during your analysis?

Dinosaur Dataset Exploration

1- Exploring the Dinosaur Dataset

Let's start our report by exploring a fascinating dataset about dinosaurs. Our goal is to uncover insights about their diet, types, lengths, and the regions they inhabited. By starting with a general overview of the data, we set the stage for a deeper exploration into the lives of these incredible creatures.

# Import necessary packages for data analysis and visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dinosaurs dataset from a CSV file
dinosaurs = pd.read_csv('data/dinosaurs.csv')

# Preview the dataframe
dinosaurs

Initial Observations

From our initial look, we can see that the dataset provides a wealth of information that can help us paint a detailed picture of these prehistoric creatures. However, we also noticed that some values are missing, especially in the diet, type, length_m, and family columns. This is common in large datasets and will require us to clean the data before diving deeper.

Findings So Far

Comprehensive Data: The dataset is rich with detailed information about various dinosaur species, providing a strong foundation for our analysis.

2- Understanding the Data

To effectively analyze our dataset, it's essential to first understand its structure and completeness. We need to identify any columns with missing values so we can address them appropriately in our analysis. Below is a summary of our initial findings based on the dataset's completeness and structure:

# Display information about the dataset
# (info() prints its report directly and returns None, so no print() is needed)
dinosaurs.info()

Insights from the Data

We found several key insights:

  1. Dataset Size and Structure: The dataset comprises 4951 rows and 12 columns, providing a substantial amount of data for our analysis. The columns include a mix of data types: integers, floats, and objects.

  2. Completeness of Data:

    • The columns occurrence_no, name, max_ma, min_ma, region, lng, lat, and class are complete with no missing values.
    • The columns diet, type, length_m, and family contain missing values. Specifically:
      • diet: 3596 non-null entries (1355 missing).
      • type: 3596 non-null entries (1355 missing).
      • length_m: 3568 non-null entries (1383 missing).
      • family: 3494 non-null entries (1457 missing).
  3. Memory Usage: The dataset uses approximately 464.3 KB of memory, which is manageable for our analysis.

  4. Data Cleaning: The presence of missing values in several columns highlights the need for data cleaning before we proceed with further analysis. Addressing these missing values will ensure the accuracy and reliability of our insights.

By understanding the completeness and structure of the dataset, we can plan our data cleaning and preprocessing steps effectively. This foundational understanding will help us explore the diverse and fascinating world of dinosaurs more accurately.
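The number of missing values per column follows from the info() output by subtracting each non-null count from the row total. A minimal sketch of that arithmetic, using only the figures reported above; the variable names are illustrative and not part of the original notebook:

```python
# Figures taken from the info() summary above
total_rows = 4951
non_null = {"diet": 3596, "type": 3596, "length_m": 3568, "family": 3494}

# Missing values = total rows minus non-null entries
missing = {col: total_rows - count for col, count in non_null.items()}

for col, count in missing.items():
    share = count / total_rows
    print(f"{col}: {count} missing ({share:.1%} of rows)")
```

Each of the four incomplete columns is missing a little over a quarter of its values, so dropping every incomplete row would discard a large share of the records.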

3- Statistical Summary of the Dinosaur Dataset

Understanding the distribution and characteristics of our dinosaur dataset is crucial. The statistical summary provides a comprehensive overview of the dataset, offering insights into the diversity and scale of these prehistoric creatures.

# Describe the dataset
print(dinosaurs.describe(include='all'))

Insights from the Statistical Summary

The statistical summary of the dinosaur dataset reveals several important insights:

  • Diversity: The dataset includes 1042 unique dinosaur names, highlighting the extensive diversity of species covered.
  • Class Representation: There are two main classes, with "Saurischia" being the most common, appearing 3074 times.
  • Family Representation: The dataset lists 75 different families, with "Dromaeosauridae" being the most frequent, appearing 450 times.
  • Numerical Distribution: The occurrence number ranges from 1 to 1,365,954, with an average of approximately 683,832.

These observations showcase the dataset's significant diversity and provide a comprehensive view of dinosaur taxonomy. This understanding will aid in exploring the varied and fascinating world of dinosaurs in greater detail.
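The categorical figures in this summary can also be computed column by column instead of being read off the describe() table. The sketch below assumes a small hypothetical frame (`sample`) standing in for the real `dinosaurs` data, and `summarize_categories` is an illustrative helper rather than something from the original notebook:

```python
import pandas as pd

# Hypothetical stand-in rows, NOT taken from dinosaurs.csv
sample = pd.DataFrame({
    "name": ["Velociraptor", "Velociraptor", "Triceratops", "Diplodocus"],
    "class": ["Saurischia", "Saurischia", "Ornithischia", "Saurischia"],
    "family": ["Dromaeosauridae", "Dromaeosauridae", "Ceratopsidae", None],
})

def summarize_categories(df, columns):
    """Return the unique count, most frequent value, and its frequency per column."""
    rows = []
    for col in columns:
        counts = df[col].value_counts()  # sorted descending, NaN excluded
        rows.append({
            "column": col,
            "unique": df[col].nunique(),
            "top": counts.index[0],
            "freq": int(counts.iloc[0]),
        })
    return pd.DataFrame(rows).set_index("column")

summary = summarize_categories(sample, ["name", "class", "family"])
print(summary)
```

Run against the real data with `summarize_categories(dinosaurs, ["name", "class", "family"])`, the same helper would be expected to return the unique counts and most frequent values quoted above.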

4- Checking for and Calculating Missing Data Percentages

To ensure the accuracy and reliability of our analysis, it is essential to identify and address any missing values in our dinosaur dataset. We will check for missing values across all columns and calculate the percentage of missing values for each column. This will help us understand the extent of the missing data and prioritize which columns need the most attention during the data cleaning process.
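One way to carry out this check is sketched below. It assumes a small hypothetical frame (`sample`) in place of the real `dinosaurs` data, and `missing_report` is an illustrative helper name, not the notebook's own code:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows, NOT taken from dinosaurs.csv
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "diet": ["herbivorous", None, "carnivorous", None],
    "length_m": [3.0, np.nan, 12.0, 7.5],
})

def missing_report(df):
    """Count missing values per column and express them as a percentage of rows."""
    is_missing = df.isna()
    report = pd.DataFrame({
        "missing": is_missing.sum(),                  # number of NaN/None per column
        "percent": (is_missing.mean() * 100).round(2),  # share of rows, in percent
    })
    # Columns needing the most attention come first
    return report.sort_values("percent", ascending=False)

report = missing_report(sample)
print(report)
```

Calling `missing_report(dinosaurs)` would produce the same two columns for the real dataset, sorted so the least complete columns appear at the top.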
