
Can you estimate the age of an abalone?

📖 Background

You are working as an intern for an abalone farming operation in Japan. For operational and environmental reasons, it is important to estimate the age of the abalones when they go to market.

Determining an abalone's age involves counting the number of rings in a cross-section of the shell through a microscope. Since this method is somewhat cumbersome and complex, you are interested in helping the farmers estimate the age of the abalone using its physical characteristics.

💾 The data

You have access to the following historical data (source):

Abalone characteristics:
  • "sex" - M, F, and I (infant).
  • "length" - longest shell measurement in mm.
  • "diameter" - perpendicular to the length (mm).
  • "height" - measured with meat in the shell (mm).
  • "whole_wt" - whole abalone weight in grams.
  • "shucked_wt" - the weight of abalone meat (grams).
  • "viscera_wt" - gut-weight (grams).
  • "shell_wt" - the weight of the dried shell (grams).
  • "rings" - number of rings in a shell cross-section.
  • "age" - the age of the abalone: the number of rings + 1.5.

Acknowledgments: Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn, and Wes B Ford (1994) "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288).

import pandas as pd
abalone = pd.read_csv('./data/abalone.csv')
abalone
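
Before going further, we can sanity-check the definition of "age" given in the column list (age = rings + 1.5). The snippet below is a minimal sketch: the helper name age_matches_rings and the toy rows are illustrative assumptions, not part of the original notebook or dataset. On the real data the call would be age_matches_rings(abalone).

```python
import pandas as pd


def age_matches_rings(df: pd.DataFrame, offset: float = 1.5) -> bool:
    """Return True if every 'age' value equals 'rings' + offset."""
    diff = (df["rings"] + offset - df["age"]).abs()
    return bool(diff.max() < 1e-9)


# Toy rows invented for illustration; they are NOT taken from abalone.csv.
toy = pd.DataFrame({"rings": [7, 9, 15], "age": [8.5, 10.5, 16.5]})
toy_bad = pd.DataFrame({"rings": [7, 9], "age": [8.5, 11.0]})

print(age_matches_rings(toy))
print(age_matches_rings(toy_bad))
```

If the check holds on the full dataset, "rings" and "age" carry the same information, so only one of them should be used as the target and the other should be left out of the predictors.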

💪 Competition challenge

Create a report that covers the following:

  1. How does weight change with age for each of the three sex categories?
  2. Can you estimate an abalone's age using its physical characteristics?
  3. Investigate which variables are better predictors of age for abalones.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error as MSE

Preliminary EDA

Before answering any of the questions, we'll dive a bit into the data. We can look at the correlations between the different features to better understand their relationships.

To simplify the heatmap we will only visualise half of the matrix.

Feature correlation

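# Keep only numeric columns: 'sex' is categorical, and recent pandas versions
# raise an error if non-numeric columns are passed to corr().
corr_matrix = abalone.select_dtypes(include="number").corr()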

cmap = sns.diverging_palette(h_neg=240, h_pos=10, as_cmap=True)

mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Hide the upper triangle (and the diagonal) so that each pair is shown only once.
sns.heatmap(corr_matrix, cmap=cmap, mask=mask, annot=True, fmt=".2f")
plt.show()

We can observe many pairs of features with high correlations (> 0.9), including:

Length and diameter: The correlation is almost perfect (0.99), so knowing one of the two measurements tells us nearly everything about the other.

Weights: The different weight features are also highly correlated. In particular, the part weights (shucked_wt, viscera_wt and shell_wt) correlate strongly with whole_wt, which is expected because the whole weight is essentially the sum of the individual weights.

Rings and age: The correlation coefficient is 1, which is expected since 'age' is a linear function of 'rings' (age = rings + 1.5).

In general, most features are highly correlated with each other. The exception is age (and thus rings), whose highest correlation is with shell_wt (0.63). This already suggests that no single feature will predict age well and that we will need to combine several.
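
The highly correlated pairs discussed above can also be listed programmatically instead of being read off the heatmap. This is a minimal sketch: the helper high_corr_pairs and the toy frame are illustrative assumptions, and on the real data the call would be high_corr_pairs(abalone, threshold=0.9).

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes(include="number").corr()
    # Keep the strictly lower triangle: each pair appears once, diagonal excluded.
    keep = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    pairs = corr.where(keep).stack().dropna()
    return pairs[pairs.abs() > threshold].sort_values(ascending=False)


# Toy rows invented for illustration; they are NOT taken from abalone.csv.
toy = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0, 5.0],
    "diameter": [0.8, 1.6, 2.4, 3.2, 4.1],
    "age": [9.5, 7.5, 12.5, 8.5, 10.5],
    "sex": ["M", "F", "I", "M", "F"],
})

result = high_corr_pairs(toy)
print(result)
```

The result is a Series indexed by (feature, feature) pairs, which makes it easy to decide which near-duplicate features could be dropped before modelling.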

Abalone age distribution

plt.figure(figsize=(10, 6))
sns.histplot(data=abalone, x="age")
plt.show()

Age has a unimodal, right-skewed distribution with a mode of about 11 years. It is not normally distributed, although it is fairly close to a normal distribution.
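
The visual impression of right skew can be backed by a number. The sketch below compares the mean with the median and computes the sample skewness using pandas' Series.skew(); the helper shape_summary and the toy values are illustrative assumptions. On the real data the call would be shape_summary(abalone["age"]).

```python
import pandas as pd


def shape_summary(values: pd.Series) -> dict:
    """Mean, median and sample skewness of a numeric series."""
    return {
        "mean": float(values.mean()),
        "median": float(values.median()),
        "skew": float(values.skew()),  # > 0 indicates a longer right tail
    }


# Toy ages invented for illustration; they are NOT taken from abalone.csv.
toy_age = pd.Series([8.5, 9.5, 10.5, 10.5, 11.5, 12.5, 20.5])

summary = shape_summary(toy_age)
print(summary)
```

A positive skewness together with a mean above the median is consistent with a right-skewed distribution.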

When we look at the cumulative distribution for the different abalone sex categories (below), we see that around 80% of abalones are 15 years old or younger. This share is slightly higher for males than for females, but the difference is small.
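
The value read off the cumulative plot can be computed directly as the share of abalones at or below a given age within each sex category. This is a minimal sketch: the helper share_at_or_below and the toy rows are illustrative assumptions, and the 15-year threshold is the one mentioned in the text. On the real data the call would be share_at_or_below(abalone, 15).

```python
import pandas as pd


def share_at_or_below(df: pd.DataFrame, threshold: float = 15.0) -> pd.Series:
    """Fraction of rows per sex whose age is <= threshold (the ECDF at threshold)."""
    return (df["age"] <= threshold).groupby(df["sex"]).mean()


# Toy rows invented for illustration; they are NOT taken from abalone.csv.
toy = pd.DataFrame({
    "sex": ["M", "M", "F", "F", "I"],
    "age": [8.5, 16.5, 10.5, 12.5, 6.5],
})

shares = share_at_or_below(toy)
print(shares)
```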

plt.figure(figsize=(6, 4))
sns.ecdfplot(data=abalone, x="age", hue="sex")
plt.show()

1. How does weight change with age for each of the three sex categories?

We will start with some EDA to understand how weight changes with age for each sex category.

We start by looking at how the whole abalone weight changes with age. Infants are concentrated at lower weights, with a mean value of around 400 milligrams. Males and females have very similar distributions, with a mean close to 1 gram.
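
The group means quoted above can be checked with a simple aggregation per sex category. This is a minimal sketch: the helper weight_summary_by_sex and the toy rows are illustrative assumptions. On the real data the call would be weight_summary_by_sex(abalone, "whole_wt").

```python
import pandas as pd


def weight_summary_by_sex(df: pd.DataFrame, weight_col: str = "whole_wt") -> pd.DataFrame:
    """Mean, median and row count of one weight column for each sex category."""
    return df.groupby("sex")[weight_col].agg(["mean", "median", "count"])


# Toy rows invented for illustration; they are NOT taken from abalone.csv.
toy = pd.DataFrame({
    "sex": ["I", "I", "M", "M", "F", "F"],
    "whole_wt": [0.3, 0.5, 0.9, 1.1, 1.0, 1.2],
})

summary = weight_summary_by_sex(toy)
print(summary)
```

Reporting the median next to the mean helps show whether a few heavy individuals are pulling the average up.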

sns.pairplot(data=abalone, vars=["whole_wt", "age"], hue="sex")

We also look at how age relates to the different weight types: shucked (meat), viscera and shell. Although the weight scales differ, the relationship with age is similar. First, the relationship is non-linear in all cases. Second, for all weight types we see a similar trend for males and females: weight is larger for abalones between 10 and 20 years old and smaller for abalones younger than 10 or older than 20.
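
One way to make this non-linear pattern explicit is to bin age into the three bands mentioned above and compare mean weights per band and sex. This is a minimal sketch: the helper mean_weight_by_age_band, the band labels and the toy rows are illustrative assumptions. On the real data the call would be mean_weight_by_age_band(abalone, ["shucked_wt", "viscera_wt", "shell_wt"]).

```python
import numpy as np
import pandas as pd


def mean_weight_by_age_band(df: pd.DataFrame, weight_cols: list) -> pd.DataFrame:
    """Mean of each weight column per sex and age band (<10, 10-20, >=20 years)."""
    band = pd.cut(
        df["age"],
        bins=[0, 10, 20, np.inf],
        labels=["<10", "10-20", ">=20"],
        right=False,  # intervals are [0, 10), [10, 20), [20, inf)
    ).rename("age_band")
    return df.groupby(["sex", band], observed=True)[weight_cols].mean()


# Toy rows invented for illustration; they are NOT taken from abalone.csv.
toy = pd.DataFrame({
    "sex": ["M", "M", "M", "M", "F", "F"],
    "age": [8.5, 12.5, 14.5, 22.5, 9.5, 15.5],
    "shucked_wt": [0.2, 0.5, 0.7, 0.4, 0.3, 0.6],
})

bands = mean_weight_by_age_band(toy, ["shucked_wt"])
print(bands)
```

If the middle band has the highest mean for each adult sex on the real data, that supports the rise-then-fall pattern described above.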

Infants have lower weights than males and females. The boxplot at the bottom shows the differences in weight between the infant and adult populations more clearly.
