Can you estimate the age of an abalone?
Background
You are working as an intern for an abalone farming operation in Japan. For operational and environmental reasons, it is important to estimate the age of the abalones when they go to market.
Determining an abalone's age involves counting the number of rings in a cross-section of the shell through a microscope. Since this method is somewhat cumbersome and complex, you are interested in helping the farmers estimate the age of the abalone using its physical characteristics.
The data
You have access to the following historical data (source):
Abalone characteristics:
- "sex" - M, F, and I (infant).
- "length" - longest shell measurement.
- "diameter" - perpendicular to the length.
- "height" - measured with meat in the shell.
- "whole_wt" - whole abalone weight.
- "shucked_wt" - the weight of abalone meat.
- "viscera_wt" - gut-weight.
- "shell_wt" - the weight of the dried shell.
- "rings" - number of rings in a shell cross-section.
- "age" - the age of the abalone: the number of rings + 1.5.
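The definition of "age" above can be illustrated with a minimal sketch. The ring counts below are made-up values for illustration, not rows from the dataset.

```python
import pandas as pd

# Toy ring counts (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({"rings": [7, 9, 15]})

# Age as defined in the data description: number of rings + 1.5
toy["age"] = toy["rings"] + 1.5

print(toy)
```

Because "age" is computed directly from "rings", the "rings" column should probably be left out of the predictors in any model that estimates age. Otherwise the model would simply recover the formula.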
Acknowledgments: Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn, and Wes B Ford (1994) "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288).
Importing packages
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
Exploratory Data Analysis
# Read the data
abalone = pd.read_csv('./data/abalone.csv')
# Look for missing values
display(abalone.isna().any())
# Display column types and non-null counts (info() prints its own output)
abalone.info()
There are no missing values, but sex is an object (string) variable. It will need to be converted into numerical values.
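One possible way to do this conversion is sketched below using pandas only. This is an assumption about how the encoding could be done, not necessarily the approach used later in this analysis. The toy frame and the names sex_code and dummies are illustrative.

```python
import pandas as pd

# Toy column with the three categories described above (illustrative values)
toy = pd.DataFrame({"sex": ["M", "F", "I", "F"]})

# Option 1: integer codes. Categories are sorted alphabetically,
# so F -> 0, I -> 1, M -> 2.
toy["sex_code"] = toy["sex"].astype("category").cat.codes

# Option 2: one-hot encoding, one 0/1 column per category.
dummies = pd.get_dummies(toy["sex"], prefix="sex", dtype=int)

print(toy)
print(dummies)
```

Integer codes impose an arbitrary order (F < I < M), which a linear model would treat as meaningful. One-hot columns avoid that at the cost of extra columns.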
Let's look at some statistics. The sex attribute is a categorical variable with three possible values: M for male, F for female, and I for infant. A bar plot of the count of each category shows that the dataset is balanced.
The female and male groups have similar mean weights. The infant group presents lower weights.
sns.countplot(data=abalone, x='sex')
plt.show()
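The balance seen in the bar plot can also be checked numerically. The sketch below uses a small made-up column rather than the real data. With the real data, the same call would be applied to abalone['sex'].

```python
import pandas as pd

# Toy labels (illustrative only): 3 x M, 2 x F, 3 x I
toy = pd.DataFrame({"sex": ["M", "F", "I", "M", "F", "I", "M", "I"]})

# Share of each category; a balanced column has shares close to 1/3 each
shares = toy["sex"].value_counts(normalize=True).sort_index()

print(shares)
```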
# Compute the mean age and mean weights for each sex
display(abalone.groupby('sex')[['age', 'whole_wt', 'shucked_wt', 'viscera_wt', 'shell_wt']].mean())
Now, let's focus on the influence of age on weight in the three sex groups. There are several weight measurements (whole_wt, shucked_wt, viscera_wt and shell_wt). We will first look at these four measures according to age and sex.
sns.jointplot(data=abalone, x='age', y='whole_wt', hue='sex', s=10)
sns.jointplot(data=abalone, x='age', y='shucked_wt', hue='sex', s=10)
sns.jointplot(data=abalone, x='age', y='viscera_wt', hue='sex', s=10)
sns.jointplot(data=abalone, x='age', y='shell_wt', hue='sex', s=10)
plt.show()
We notice a shift between infants and adults (F and M). Moreover, there seems to be no difference between the M and F distributions.
To simplify the following analysis, we will focus on the whole weight.
abalone_F = abalone[abalone['sex'] == 'F']
abalone_M = abalone[abalone['sex'] == 'M']
abalone_I = abalone[abalone['sex'] == 'I']
sns.lmplot(data=abalone, x="age", y="whole_wt", hue="sex", col="sex", height=6, aspect=1)
plt.show()
slope_M, intercept_M = np.polyfit(x=abalone_M["age"], y=abalone_M["whole_wt"], deg=1)
slope_F, intercept_F = np.polyfit(x=abalone_F["age"], y=abalone_F["whole_wt"], deg=1)
slope_I, intercept_I = np.polyfit(x=abalone_I["age"], y=abalone_I["whole_wt"], deg=1)
print('Male slope : ', round(slope_M, 2))
print('Female slope : ', round(slope_F, 2))
print('Infant slope : ', round(slope_I, 2))
By plotting the whole weight against age, we observe that the regression lines of the Male and Female groups are flatter than the Infant regression line.
Measuring the slopes, we notice that the change in weight for each unit increase in age is larger in the Infant group (0.08) than in the Male group (0.06) or the Female group (0.04). Weight increases faster with age in the Infant group.
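To make the reading of these slopes concrete, here is a minimal sketch on synthetic data. The points lie exactly on a line with slope 0.08 and an arbitrary intercept of 0.1; both numbers are chosen for illustration and are not fitted values from the dataset.

```python
import numpy as np

# Synthetic points lying exactly on whole_wt = 0.08 * age + 0.1
age = np.array([5.0, 7.0, 9.0, 11.0])
whole_wt = 0.08 * age + 0.1

# With deg=1, np.polyfit returns the coefficients highest power first:
# [slope, intercept]
slope, intercept = np.polyfit(age, whole_wt, deg=1)

# Predicted weight gain over two units of age
gain = slope * 2

print(round(slope, 2), round(intercept, 2), round(gain, 2))
```

A slope of 0.08 therefore means that the fitted line rises by 0.08 weight units for each additional unit of age.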
corr_F = abalone_F['age'].corr(abalone_F['whole_wt'])
corr_M = abalone_M['age'].corr(abalone_M['whole_wt'])
corr_I = abalone_I['age'].corr(abalone_I['whole_wt'])
print('Age VS Whole_wt : ')
print('Female Group : ', round(corr_F, 2))
print('Male Group : ', round(corr_M, 2))
print('Infant Group : ', round(corr_I, 2))
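The correlation and the regression slope measure different things: the slope says how much weight changes per unit of age, while the Pearson correlation says how tightly the points follow a line. For a simple linear fit the two are linked by slope = r * std(y) / std(x). The sketch below checks this on made-up values, not on the abalone data.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "age": [4.5, 6.5, 8.5, 10.5, 12.5],
    "whole_wt": [0.20, 0.45, 0.50, 0.80, 0.85],
})

# Pearson correlation, the default method of Series.corr
r = toy["age"].corr(toy["whole_wt"])

# Least-squares slope from a degree-1 polynomial fit
slope, _ = np.polyfit(toy["age"], toy["whole_wt"], deg=1)

# Same slope recovered from the correlation and the standard deviations
slope_from_r = r * toy["whole_wt"].std() / toy["age"].std()

print(round(r, 2), round(slope, 3), round(slope_from_r, 3))
```

A group can thus have a steep slope and a modest correlation, or the reverse, so both numbers are worth comparing across the three sex groups.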