Dinosaurs and Movie - A data analyst's version of Netflix and Chill
  • Everyone Can Learn Data Scholarship

    📖 Background

    The second "Everyone Can Learn Data" Scholarship from DataCamp is now open for entries.

    The challenges below test the coding skills you gained from beginner courses on Python, R, or SQL. Pair them with the help of AI and your creative thinking skills, and win $5,000 for your future data science studies!

    The scholarship is open to secondary and undergraduate students, and to other students preparing for graduate-level studies (working toward their Bachelor's degree). Postgraduate students (PhD) and graduates (Master's degree) cannot apply.

    The challenge consists of two parts; make sure to complete both before submitting. Good luck!

    1️⃣ Part 1 (Python) - Dinosaur data 🦕

    📖 Background

    You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.

    💾 The data

    You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):

    Column name  | Description
    occurence_no | The original occurrence number from the Paleobiology Database.
    name         | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil).
    diet         | The main diet (omnivorous, carnivorous, herbivorous).
    type         | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur).
    length_m     | The maximum length, from head to tail, in meters.
    max_ma       | The age at which the first fossil records of the dinosaur were found, in millions of years.
    min_ma       | The age at which the last fossil records of the dinosaur were found, in millions of years.
    region       | The current region where the fossil record was found.
    lng          | The longitude where the fossil record was found.
    lat          | The latitude where the fossil record was found.
    class        | The taxonomical class of the dinosaur (Saurischia or Ornithischia).
    family       | The taxonomical family of the dinosaur (if known).

    The data was enriched with data from Wikipedia.

    # Import the pandas and numpy packages
    import pandas as pd
    import numpy as np
    # Load the data
    dinosaurs = pd.read_csv('data/dinosaurs.csv')
    # Preview the dataframe
    dinosaurs

    💪 Challenge I

    Help your colleagues at the museum to gain insights on the fossil record data. Include:

    1. How many different dinosaur names are present in the data?
    2. Which was the largest dinosaur? What about missing data in the dataset?
    3. What dinosaur type has the most occurrences in this dataset? Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type. Use the AI assistant to tweak your visualization (colors, labels, title...).
    4. Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.
    5. Use the AI assistant to create an interactive map showing each record.
    6. Any other insights you found during your analysis?

    Answer to Question 1:

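    There are 1,042 unique dinosaur names in our dataset. The most common is Richardoestesia with 151 entries.

    Richardoestesia is best friends with Mayslowtesia and Jeremyspeedasaurous. (Top Gear reference)

    # Count the distinct dinosaur names, then find the most frequent one
    n_unique_names = dinosaurs['name'].nunique()
    most_freq_dinosaur = dinosaurs['name'].value_counts().idxmax()
    n_unique_names, most_freq_dinosaur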

    Answer to Question 2:

    The largest dinosaur (in terms of length) was Supersaurus, measuring 35 meters from head to tail. It was an American herbivore that lived 156 to 145 million years ago.

    You don't see many American herbivores nowadays, though.

    # Look up the name of the record with the greatest length (idxmax skips missing values)
    largest_dinosaur = dinosaurs.loc[dinosaurs['length_m'].idxmax(), 'name']
    largest_dinosaur
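
The diet, region, and age range quoted above come from the same row as the maximum length. A minimal sketch of pulling that whole row at once follows. The `sample` frame and its dinosaur names are invented for illustration and are not records from the dataset, and `largest_record` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical sample rows, invented for illustration only (not real records)
sample = pd.DataFrame({
    'name': ['Alphasaurus', 'Betasaurus', 'Gammasaurus'],
    'diet': ['herbivorous', 'carnivorous', None],
    'length_m': [12.0, 30.5, None],
    'max_ma': [150.0, 155.0, 70.0],
    'min_ma': [140.0, 145.0, 66.0],
    'region': ['Region A', 'Region B', 'Region C'],
})


def largest_record(df: pd.DataFrame) -> pd.Series:
    """Return the full row of the longest dinosaur; idxmax ignores missing lengths."""
    return df.loc[df['length_m'].idxmax()]


largest = largest_record(sample)
# Show only the fields that the written answer refers to
print(largest[['name', 'diet', 'length_m', 'max_ma', 'min_ma', 'region']])
```

Calling `largest_record(dinosaurs)` on the real dataframe should return the corresponding row, assuming the column names match the table above.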

    There seems to be significant (>25%) missing data in the following columns:

    1. diet
    2. type
    3. length_m
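    4. family

    # Number of missing values per column
    count_missing = dinosaurs.isna().sum()

    # Share of missing values per column, as a percentage of all rows
    percentage_missing = (count_missing / len(dinosaurs) * 100).round(2)

    # Combine both measures into one summary table
    missing_summary = pd.DataFrame({
        'Count Missing': count_missing,
        'Percentage Missing': percentage_missing
    })

    missing_summary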
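
As a rough sketch, the columns above the 25% mark can also be selected programmatically rather than read off the summary table. The `sample` frame is invented for illustration, and `columns_above_missing_threshold` and its `threshold_pct` parameter are hypothetical names, not part of the original analysis.

```python
import pandas as pd

# Hypothetical sample rows, invented for illustration only (not real records)
sample = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'diet': ['herbivorous', None, None, 'carnivorous'],
    'length_m': [3.0, None, 7.5, 1.2],
    'region': ['X', 'Y', 'Z', 'W'],
})


def columns_above_missing_threshold(df: pd.DataFrame,
                                    threshold_pct: float = 25.0) -> pd.Series:
    """Return the percentage of missing values for columns strictly above the threshold."""
    # The mean of a boolean mask is the fraction of True values, i.e. missing cells
    pct_missing = df.isna().mean() * 100
    return pct_missing[pct_missing > threshold_pct].sort_values(ascending=False)


flagged = columns_above_missing_threshold(sample)
print(flagged)
```

In this sample only `diet` is flagged: `length_m` sits at exactly 25% and the comparison is strict. Whether the cutoff should be strict is a judgment call.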
    

    Answer to Question 3:

    The most common dinosaur type is ornithopod with 811 entries.

    The lack of armored dinosaurs is why the comet killed off most of them.
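
One possible way to build the per-type table behind this answer is sketched below. The `sample` frame is invented for illustration, and `count_by_type` is a hypothetical helper. Records with a missing `type` are left out of the counts, which is the default behavior of `value_counts`.

```python
import pandas as pd

# Hypothetical sample rows, invented for illustration only (not real records)
sample = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D', 'E'],
    'type': ['ornithopod', 'sauropod', 'ornithopod', None, 'large theropod'],
})


def count_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Count records per dinosaur type, most frequent first (missing types excluded)."""
    return (
        df['type']
        .value_counts()
        .rename_axis('type')
        .reset_index(name='occurrences')
    )


type_counts = count_by_type(sample)
print(type_counts)
```

In a notebook with matplotlib installed, `type_counts.plot.bar(x='type', y='occurrences')` would draw the same counts as a bar chart.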