Competition - Everyone Can Learn Data Scholarship
  • Everyone Can Learn Data Scholarship - John

    Introduction

    Hello there, I am John Elomunait, and today, we will delve into fascinating data insights from two distinct datasets. Our first part focuses on dinosaur fossil records from a national museum, while the second part explores movie data to uncover the secrets of old Hollywood's success. Let's embark on this journey of discovery and learning.

    1️⃣ Part 1 (Python) - Dinosaur data 🦕

    📖 Background

    You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.

    National Museum of Natural History: Dinosaur Fossil Insights

    The National Museum of Natural History showcases dinosaur skulls, skeletons, models, murals, and fossils, acquired through various means. Dinosaurs attract significant public interest, boosting museum attendance. This report explores the museum's new dinosaur fossil database to uncover insights and assess data quality, enhancing accessibility and preserving these resources for future generations.

    💾 The data

    You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):

    | Column name | Description |
    | --- | --- |
    | occurence_no | The original occurrence number from the Paleobiology Database. |
    | name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
    | diet | The main diet (omnivorous, carnivorous, herbivorous). |
    | type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
    | length_m | The maximum length, from head to tail, in meters. |
    | max_ma | The age at which the first fossil records of the dinosaur were found, in millions of years. |
    | min_ma | The age at which the last fossil records of the dinosaur were found, in millions of years. |
    | region | The current region where the fossil record was found. |
    | lng | The longitude where the fossil record was found. |
    | lat | The latitude where the fossil record was found. |
    | class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
    | family | The taxonomical family of the dinosaur (if known). |

    The data was enriched with data from Wikipedia.
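Going by the column descriptions, `max_ma` (first fossil record) should never be smaller than `min_ma` (last fossil record), which gives a simple data-quality check. The sketch below runs that check on a small made-up frame; the names and values are illustrative assumptions, not rows from the museum data.

```python
import pandas as pd

# Hypothetical toy records; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "name": ["Genus A", "Genus B", "Genus C"],
    "max_ma": [150.0, 70.0, 100.0],
    "min_ma": [145.0, 72.0, 100.0],
})

# A record is suspicious when its oldest age is younger than its youngest age.
inconsistent = toy[toy["max_ma"] < toy["min_ma"]]
print(inconsistent)
```

The same boolean mask could be applied to the loaded `dinosaurs` dataframe to list any records that need review.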

    # Import the pandas and numpy packages
    import pandas as pd
    import numpy as np
    # Load the data
    dinosaurs = pd.read_csv('data/dinosaurs.csv')
    # Preview the dataframe
    dinosaurs

    💪 Challenge I

    Help your colleagues at the museum to gain insights on the fossil record data. Include:

    1. How many different dinosaur names are present in the data?
    2. Which was the largest dinosaur? What about missing data in the dataset?
    3. What dinosaur type has the most occurrences in this dataset? Create a visualization (table, bar chart, or equivalent) to display the number of dinosaurs per type. Use the AI assistant to tweak your visualization (colors, labels, title...).
    4. Did dinosaurs get bigger over time? Show the relation between the dinosaur length and their age to illustrate this.
    5. Use the AI assistant to create an interactive map showing each record.
    6. Any other insights you found during your analysis?

    1. Counting Different Dinosaur Names:

    There are 1042 different dinosaur names in the data.

    # Count unique dinosaur names
    unique_dinosaur_names = dinosaurs['name'].nunique()
    print(f"There are {unique_dinosaur_names} different dinosaur names in the data.")
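A count from `nunique()` treats names that differ only in case or surrounding whitespace as distinct, so it can overstate the number of names if the column is not clean. The following is a minimal sketch of how one might check for that, using a made-up frame rather than the real data; whether the museum data contains such variants is not known from this report.

```python
import pandas as pd

# Hypothetical toy values standing in for the `name` column.
toy = pd.DataFrame({"name": ["Allosaurus", "allosaurus ", "Stegosaurus", None]})

# nunique() ignores missing values by default.
raw_count = toy["name"].nunique()

# Strip whitespace and lowercase before counting to expose near-duplicates.
normalized_count = toy["name"].str.strip().str.lower().nunique()

print(f"raw: {raw_count}, normalized: {normalized_count}")
```

If the two counts match on the real `dinosaurs['name']` column, the reported figure is not inflated by formatting differences.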

    2. Finding the Largest Dinosaur and Handling Missing Data:

    The largest dinosaur by length:

    The largest dinosaur in the dataset is Supersaurus, with a length of 35.0 meters.

    # Finding the largest dinosaur by length
    largest_dinosaur = dinosaurs.loc[dinosaurs['length_m'].idxmax()]
    print(f"The largest dinosaur is {largest_dinosaur['name']} with a length of {largest_dinosaur['length_m']} meters.")
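Note that `idxmax()` returns only the first row holding the maximum, so any other dinosaur sharing the same length would go unreported. A minimal sketch of a tie-aware variant is shown here on made-up data; the names and lengths are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy records; a dataset of occurrences can repeat the same name.
toy = pd.DataFrame({
    "name": ["Genus A", "Genus B", "Genus B", "Genus C"],
    "length_m": [35.0, 35.0, 35.0, None],
})

# max() skips missing lengths by default.
max_length = toy["length_m"].max()

# Collect every distinct name recorded at the maximum length.
longest_names = sorted(toy.loc[toy["length_m"] == max_length, "name"].unique().tolist())
print(max_length, longest_names)
```

Records with a missing `length_m` are excluded from this comparison, so the result describes the largest dinosaur among those with a known length.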

    Missing data

    The table below summarizes the missing values in the dataset and shows the percentage of missing data for each column:

    # Summarize missing data
    # Count the missing values in each column of the dataset
    missing_data = dinosaurs.isnull().sum()
    
    # Calculate the percentage of missing data for each column
    missing_percentage = (missing_data / len(dinosaurs)) * 100
    
    missing_data_df = pd.DataFrame({'Missing Values': missing_data, 'Percentage': missing_percentage})
    missing_data_df
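The summary is easier to read when the columns with the most gaps come first. As a sketch under the assumption that only the percentage is needed, `isna().mean()` gives the missing share per column directly; the frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with gaps in two columns.
toy = pd.DataFrame({
    "name": ["Genus A", "Genus B", "Genus C", "Genus D"],
    "length_m": [1.0, None, None, 4.0],
    "family": ["x", None, "y", "z"],
})

# The mean of a boolean mask is the fraction of missing values per column.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Applied to `dinosaurs`, this yields the same percentages as the table above, ordered from the most to the least incomplete column.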