"Welcome aboard! I'm thrilled to present my Data Analyst Notebook, where you're about to embark on an exciting journey into the realm of data analysis. Get ready to dive into the vibrant world of Python and SQL as we uncover the art of exploring, cleaning, and manipulating data. This notebook isn't just about learning; it's about embracing the thrill of being at the forefront of the data analyst universe. So buckle up and get ready to unleash your analytical prowess!"
1️⃣ Part 1 (Python) - Dinosaur data 🦕
📖 Background
As the museum's newly appointed data analyst, I'm excited to embark on a journey into the rich world of dinosaur fossils through the recently curated database of past field campaigns. My role is to dive deep into these fossil records, extracting insights that illuminate the prehistoric past while also ensuring the data's accuracy and reliability. Join me as we uncover fascinating discoveries and advise the museum on optimizing the management and utilization of its invaluable dinosaur records.
💾 The data
You have access to a real dataset containing dinosaur records from the Paleobiology Database (source):
| Column name | Description |
|---|---|
| occurence_no | The original occurrence number from the Paleobiology Database. |
| name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
| diet | The main diet (omnivorous, carnivorous, herbivorous). |
| type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
| length_m | The maximum length, from head to tail, in meters. |
| max_ma | The age in which the first fossil records of the dinosaur were found, in million years. |
| min_ma | The age in which the last fossil records of the dinosaur were found, in million years. |
| region | The current region where the fossil record was found. |
| lng | The longitude where the fossil record was found. |
| lat | The latitude where the fossil record was found. |
| class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
| family | The taxonomical family of the dinosaur (if known). |
The data was enriched with data from Wikipedia.
Introduction
In this analysis, I evaluate the quality of a dataset containing dinosaur fossil records. The dataset comprises 4951 entries with 12 attributes each. The aim is to ensure the data's accuracy, completeness, and consistency before proceeding to any further analysis.
# Import the packages used for analysis, visualization, and imputation
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import folium
from sklearn.impute import SimpleImputer
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
print(dinosaurs)
Initial Data Inspection
The initial inspection of the dataset shows the following structure and summary statistics:
The .info() and .head() functions in pandas are essential for getting a quick overview and summary of a DataFrame. Here’s what each of them does:
print(dinosaurs.info())
.info()
The .info() method provides a concise summary of a DataFrame, including:
- The class type of the DataFrame.
- The number of entries (rows).
- The column labels and data types.
- The number of non-null values in each column.
- The memory usage of the DataFrame.
print(dinosaurs.head())
.head()
The .head() method returns the first n rows of a DataFrame (by default, it returns the first 5 rows). This is useful for quickly inspecting the data.
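As a minimal illustration of the two calls, here is a small made-up DataFrame (illustrative values only, not the fossil dataset):

```python
import pandas as pd

# Small illustrative DataFrame (made-up values, not the museum data)
toy = pd.DataFrame({
    "name": ["Allosaurus", "Triceratops", "Diplodocus"],
    "length_m": [9.7, 8.0, 24.0],
})

toy.info()          # prints dtypes, non-null counts, and memory usage
print(toy.head(2))  # returns only the first two rows
```

Note that `.info()` prints its summary directly rather than returning it, while `.head()` returns a new DataFrame you can inspect or assign.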
Data Quality Assessment and Cleaning
Missing Values Analysis
The initial inspection revealed missing values in several columns. Here is the summary of missing values:
# Check if there are missing values
missing_val = dinosaurs.isnull().sum()
print("Missing Values are:\n\n", missing_val)
Results:
- Columns `diet`, `type`, `length_m`, `region`, and `family` have missing values.
To clean the dataset, we can use two approaches:

1. Delete rows with missing values. This is simple but may lead to loss of valuable data:
   cleaned_dinosaurs = dinosaurs.dropna()
2. Impute with averages or midpoints, filling the missing values with the mean, median, or mode.

I used the second approach so as not to lose many types of dinosaurs: when I first dropped the rows with missing values, a large share of the records disappeared and left a noticeable gap in the results.
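To see why dropping rows can be costly, here is a sketch on a toy frame (synthetic values chosen only for illustration): `dropna` discards every row that has any missing field, while imputation keeps all rows.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (synthetic, not the fossil dataset)
toy = pd.DataFrame({
    "length_m": [12.0, np.nan, 6.5, np.nan, 30.0],
    "diet": ["carnivorous", None, "herbivorous", "herbivorous", None],
})

dropped = toy.dropna()  # approach 1: delete any row with a missing value

imputed = toy.copy()    # approach 2: fill the gaps instead
imputed["length_m"] = imputed["length_m"].fillna(imputed["length_m"].mean())
imputed["diet"] = imputed["diet"].fillna(imputed["diet"].mode()[0])

print(len(toy), len(dropped), len(imputed))  # 5 2 5
```

In this toy example dropping rows keeps only 2 of the 5 records, while imputation preserves all 5, which mirrors the gap I saw on the real data.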
# Solving missing values issue with imputation
# For numerical columns, we can use mean or median imputation
num_cols = dinosaurs.select_dtypes(include=np.number).columns
imputer_num = SimpleImputer(strategy='mean')
dinosaurs[num_cols] = imputer_num.fit_transform(dinosaurs[num_cols])
# For categorical columns, we can use the most frequent value imputation
cat_cols = dinosaurs.select_dtypes(include='object').columns
imputer_cat = SimpleImputer(strategy='most_frequent')
dinosaurs[cat_cols] = imputer_cat.fit_transform(dinosaurs[cat_cols])
# Assign the cleaned data to cleaned_dinosaurs
cleaned_dinosaurs = dinosaurs.copy()
# Check again if there are still missing values
missing_val_after_imputation = cleaned_dinosaurs.isnull().sum()
print("Missing Values After Imputation:\n\n", missing_val_after_imputation)
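The same mean/most-frequent imputation can also be done with pandas alone, without scikit-learn's SimpleImputer. A minimal sketch on a toy frame (made-up values, not the museum data):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in a numeric and a categorical column (synthetic values)
toy = pd.DataFrame({
    "length_m": [9.0, np.nan, 21.0],
    "family": ["Allosauridae", None, None],
})

# Numeric columns: fill gaps with the column mean
num_cols = toy.select_dtypes(include=np.number).columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# Categorical columns: fill gaps with the most frequent value (mode)
cat_cols = toy.select_dtypes(include="object").columns
for col in cat_cols:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy.isnull().sum().sum())  # 0
```

Both routes give the same result here; SimpleImputer is convenient when the imputation needs to be reused inside a scikit-learn pipeline, while the pandas version avoids the extra dependency.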