Everyone Can Learn Data Scholarship
1️⃣ Part 1 (Python) - Dinosaur data 🦕
📖 Background
You're applying for a summer internship at a national museum for natural history. The museum recently created a database containing all dinosaur records of past field campaigns. Your job is to dive into the fossil records to find some interesting insights, and advise the museum on the quality of the data.
💾 The data
You have access to a real dataset containing dinosaur records from the Paleobiology Database:
Column name | Description |
---|---|
occurence_no | The original occurrence number from the Paleobiology Database. |
name | The accepted name of the dinosaur (usually the genus name, or the name of the footprint/egg fossil). |
diet | The main diet (omnivorous, carnivorous, herbivorous). |
type | The dinosaur type (small theropod, large theropod, sauropod, ornithopod, ceratopsian, armored dinosaur). |
length_m | The maximum length, from head to tail, in meters. |
max_ma | The age in which the first fossil records of the dinosaur were found, in millions of years. |
min_ma | The age in which the last fossil records of the dinosaur were found, in millions of years. |
region | The current region where the fossil record was found. |
lng | The longitude where the fossil record was found. |
lat | The latitude where the fossil record was found. |
class | The taxonomical class of the dinosaur (Saurischia or Ornithischia). |
family | The taxonomical family of the dinosaur (if known). |
The dataset was enriched with information from Wikipedia.
# Import the pandas and numpy packages
import pandas as pd
import numpy as np
# Load the data
dinosaurs = pd.read_csv('data/dinosaurs.csv')
# Preview the dataframe
dinosaurs
# Inspect column data types and non-null counts
dinosaurs.info()
# Summary statistics for the numeric columns
dinosaurs.describe()
# Count missing values per column
dinosaurs.isnull().sum()
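Since the museum also wants advice on data quality, it helps to express those missing-value counts as a share of all records. A minimal sketch (the rounding and the missing_pct name are our own choices):
# Express missing values as a percentage of all records
missing_pct = dinosaurs.isnull().mean().mul(100).round(2)
# Show only the affected columns, most incomplete first
missing_pct[missing_pct > 0].sort_values(ascending=False)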
Imputation Strategy
Handling missing values in a dataset is crucial to maintaining the integrity and usability of the data for analysis or modeling.
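For context, the simplest strategy would be to fill each categorical gap with the column's most frequent value. A minimal baseline sketch (run on a copy so the original dataframe is untouched) shows its limitation: it ignores every other column, which is what motivates the predictive models below.
# Naive baseline: fill missing categorical values with the column mode
baseline = dinosaurs.copy()
for col in ['diet', 'type']:
    baseline[col] = baseline[col].fillna(baseline[col].mode()[0])
# Verify that no missing values remain in these columns
baseline[['diet', 'type']].isnull().sum()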
1. "Diet" Column
We will develop a predictive model to impute the missing values for the "diet" column. We'll explore different approaches and choose a suitable model for this task.
Here's the plan:
- Data Preparation: We'll create a dataset with non-missing values for "diet" and use it to train a model.
- Feature Engineering: We'll use relevant features such as "name," "max_ma," "min_ma," "region," and others to predict "diet."
- Model Selection: We'll explore various machine learning algorithms to find the one that best fits this task.
- Evaluation: We'll evaluate the model's performance using appropriate metrics.
- Imputation: Once we have a trained model, we'll impute the missing values in the "diet" column.
We examined two models: a Random Forest Classifier and a Decision Tree Classifier. The Decision Tree performed slightly better in terms of accuracy and F1-score, so we will proceed with it to impute the missing values in the "diet" column of the original dataset (a sketch of the comparison itself follows the evaluation below).
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report
# Work on an alias of the original dataframe (imputed values written to
# data will therefore also appear in dinosaurs)
data = dinosaurs
# Filtering the dataset to include only rows with non-missing values for "diet"
data_diet = data[~data["diet"].isnull()]
# Selecting relevant features and the target variable
features = ['max_ma', 'min_ma', 'region', 'lng', 'lat', 'class', 'name']
target = 'diet'
X = data_diet[features]
y = data_diet[target]
# Splitting the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Column transformer for preprocessing
categorical_features = ['region', 'class', 'name']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)
    ], remainder='passthrough')
# Creating a pipeline with preprocessor and Decision Tree model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', DecisionTreeClassifier(random_state=42))
])
# Training the model
pipeline.fit(X_train, y_train)
# Making predictions
y_pred = pipeline.predict(X_test)
# Evaluating the model's performance on the held-out test set
report = classification_report(y_test, y_pred)
print(report)
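For reference, here is one way the Decision Tree vs. Random Forest comparison mentioned above could be reproduced. This is a minimal sketch using 5-fold cross-validated accuracy, reusing the preprocessor, X, and y defined above; the exact splits and metrics behind our comparison may have differed.
# Sketch: cross-validate both candidate models on the rows where "diet" is known
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

for model_name, model in [("Decision Tree", DecisionTreeClassifier(random_state=42)),
                          ("Random Forest", RandomForestClassifier(random_state=42))]:
    candidate = Pipeline([
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    scores = cross_val_score(candidate, X, y, cv=5, scoring='accuracy')
    print(f"{model_name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")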
# Now that we have a well-performing model, let's use it to impute the missing values in the "diet" column
# Filtering rows with missing values for "diet"
data_missing_diet = data[data["diet"].isnull()]
# Selecting relevant features
X_missing_diet = data_missing_diet[features]
# Predicting the missing values
imputed_diets = pipeline.predict(X_missing_diet)
# Updating the original dataset with the imputed values
data.loc[data["diet"].isnull(), "diet"] = imputed_diets
# Checking how many missing values are left in the "diet" column
data["diet"].isnull().sum()
# Distribution of diet categories after imputation
dinosaurs.diet.value_counts()
2. "Type" Column
We will develop a predictive model to impute the missing values in the "type" column, this time incorporating the imputed "diet" column as one of the features. Note that because "diet" was itself partly imputed, any errors from that step may propagate into the "type" predictions. We'll explore several models and evaluate which one performs best.
Let's go through the following steps:
- Data Preparation: Create a dataset with non-missing values for "type" and split it into training and test sets.
- Feature Engineering: Preprocess relevant features, including the newly imputed "diet" column.
- Model Training: Train different models using the prepared dataset.
- Evaluation: Evaluate each model's performance using appropriate metrics.
from sklearn.ensemble import RandomForestClassifier
# Filtering the dataset to include only rows with non-missing values for "type"
data_type = data[~data["type"].isnull()]
# Selecting relevant features and the target variable
features_type = ['max_ma', 'min_ma', 'region', 'lng', 'lat', 'class', 'name', 'diet']
target_type = 'type'
X_type = data_type[features_type]
y_type = data_type[target_type]
# Splitting the dataset into training and test sets
X_train_type, X_test_type, y_train_type, y_test_type = train_test_split(X_type, y_type, test_size=0.25, random_state=42)
# Creating transformers for categorical features for type prediction
categorical_features_type = ['region', 'class', 'name', 'diet']
categorical_transformer_type = OneHotEncoder(handle_unknown='ignore')
# Column transformer to apply appropriate transformations to different columns
preprocessor_type = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer_type, categorical_features_type)
    ], remainder='passthrough')
# Defining a function to train and evaluate different models
def train_and_evaluate_model_type(model, model_name):
    # Creating a pipeline with preprocessor and the given model
    pipeline_type = Pipeline([
        ('preprocessor', preprocessor_type),
        ('model', model)
    ])
    # Training the model
    pipeline_type.fit(X_train_type, y_train_type)
    # Making predictions on the held-out test set
    y_pred_type = pipeline_type.predict(X_test_type)
    # Evaluating the model
    report_type = classification_report(y_test_type, y_pred_type)
    print(f"Performance of {model_name}:\n{report_type}\n")
# Evaluating a Decision Tree model
train_and_evaluate_model_type(DecisionTreeClassifier(random_state=42), "Decision Tree")
# Evaluating a Random Forest model
train_and_evaluate_model_type(RandomForestClassifier(random_state=42), "Random Forest")
The models performed as follows in predicting the "type" column:
Decision Tree Model:
- Accuracy: 98%
- Precision, recall, and F1-score: between 96% and 100% for every class
Random Forest Model:
- Accuracy: 99%
- Precision, recall, and F1-score: between 97% and 100% for every class
The Random Forest model performed slightly better than the Decision Tree model, achieving a 99% accuracy compared to 98% for the Decision Tree.
Let's proceed to use the Random Forest model to impute the missing values in the "type" column of the original dataset.
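Below is a minimal sketch of that final imputation step, mirroring the "diet" imputation above. It assumes the Random Forest pipeline is refit on all rows where "type" is known (rather than only on the training split); the names pipeline_rf, data_missing_type, and X_missing_type are introduced here for illustration.
# Refit the Random Forest pipeline on all rows where "type" is known
pipeline_rf = Pipeline([
    ('preprocessor', preprocessor_type),
    ('model', RandomForestClassifier(random_state=42))
])
pipeline_rf.fit(X_type, y_type)
# Predict "type" for the rows where it is missing and write the values back
data_missing_type = data[data["type"].isnull()]
X_missing_type = data_missing_type[features_type]
data.loc[data["type"].isnull(), "type"] = pipeline_rf.predict(X_missing_type)
# Confirm no missing values remain in "type"
data["type"].isnull().sum()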