💾 The data description
Abalone characteristics:
- "sex" - M, F, and I (infant).
- "length" - longest shell measurement.
- "diameter" - perpendicular to the length.
- "height" - measured with meat in the shell.
- "whole_wt" - whole abalone weight.
- "shucked_wt" - the weight of abalone meat.
- "viscera_wt" - gut-weight.
- "shell_wt" - the weight of the dried shell.
- "rings" - number of rings in a shell cross-section.
- "age" - the age of the abalone: the number of rings + 1.5.
Acknowledgments: Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn, and Wes B Ford (1994) "The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288).
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
abalone = pd.read_csv('./data/abalone.csv')
abalone.head()
# Summary statistics for the numeric variables:
abalone.describe().round(4)  # round to 4 decimals to trim long fractions while keeping every value readable
These statistics show the range and distribution of values for each variable.
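The data description defines age as the number of rings + 1.5, which can be checked directly before any analysis. Below is a minimal sketch of such a check. The helper name `age_matches_rings` and the tiny `demo` frame are hypothetical and only there for illustration; on the real data the function would be called on `abalone`.

```python
import numpy as np
import pandas as pd


def age_matches_rings(df: pd.DataFrame, offset: float = 1.5) -> bool:
    """Return True if every row satisfies age == rings + offset."""
    # np.isclose avoids false negatives from floating-point rounding
    return bool(np.isclose(df["age"], df["rings"] + offset).all())


# Made-up rows for illustration only; these are not taken from the dataset.
demo = pd.DataFrame({"rings": [7, 9, 15], "age": [8.5, 10.5, 16.5]})
print(age_matches_rings(demo))
```

If the check holds on the full dataset, rings and age carry exactly the same information, which matters when judging predictors later on.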
1. How does weight change with age for each of the three sex categories?
I'll plot the weight of both the whole abalone and just its meat to see whether there is any notable difference. The scatter plots below show how weight changes with age, with each color representing a sex category: male, female, or infant.
fig, ax = plt.subplots(1, 2, sharex=True)
sex_hue={'F':'salmon', 'M':'lightskyblue', 'I':'bisque'}
# pass x and y as keyword arguments; recent seaborn versions no longer accept them positionally
sns.scatterplot(x=abalone.age, y=abalone.whole_wt, hue=abalone.sex, alpha=0.3, s=10, palette=sex_hue, ax=ax[0])
sns.scatterplot(x=abalone.age, y=abalone.shucked_wt, hue=abalone.sex, alpha=0.3, s=10, palette=sex_hue, ax=ax[1])
# Add a legend, title, and y-axis label to both charts
for i, x in enumerate(['whole abalone', 'abalone meat']):
    ax[i].legend(title="Sex", fontsize=8, title_fontsize=9, handletextpad=0.3)
    ax[i].set_title(f"Weight of {x} by age", fontsize=10)
    ax[i].set_ylabel(f"Weight of {x}")
    # use the same y-axis range on both charts so they can be compared directly
    ax[i].set_ylim([0, 3])
# tight_layout automatically keeps proper spacing between the plots
fig.tight_layout()
As the plots show, the weight of the whole abalone increases more noticeably with age, while the meat alone gains much less mass.
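One way to put a number on this visual comparison is to fit a straight line of weight against age for each sex and compare the slopes. This is only a rough sketch, not part of the original analysis: the helper `weight_slopes_by_sex` and the `demo` values are hypothetical, and a straight line is an assumption that may not fit the real, curved growth pattern well.

```python
import numpy as np
import pandas as pd


def weight_slopes_by_sex(df: pd.DataFrame) -> pd.DataFrame:
    """Fit weight = slope * age + intercept per sex and return the slopes."""
    rows = {}
    for sex, group in df.groupby("sex"):
        rows[sex] = {
            # np.polyfit returns [slope, intercept] for deg=1
            col: np.polyfit(group["age"], group[col], deg=1)[0]
            for col in ("whole_wt", "shucked_wt")
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Made-up values for illustration only; these are not rows from the dataset.
demo = pd.DataFrame({
    "sex": ["F", "F", "F", "M", "M", "M"],
    "age": [8.5, 10.5, 12.5, 8.5, 10.5, 12.5],
    "whole_wt": [0.5, 0.9, 1.3, 0.4, 0.8, 1.2],
    "shucked_wt": [0.20, 0.30, 0.40, 0.15, 0.25, 0.35],
})
slopes = weight_slopes_by_sex(demo)
print(slopes.round(3))
```

A larger slope for `whole_wt` than for `shucked_wt` within the same sex would agree with what the scatter plots suggest.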
2. Estimating an abalone's age using its physical characteristics
We have already looked at how the weight measures change with age: the whole abalone gets heavier, while its meat gains less. Now consider the shell: it grows in size and weight over time, as does the number of rings, so shell measurements may be better characteristics for estimating an abalone's age (setting rings aside, since age is computed directly from them).
To check whether any of the variables correlates strongly with age, I plotted each one against age: length, diameter, height, whole weight (without separating the sexes), meat weight, viscera weight, shell weight, and rings, in that order:
fig, ax = plt.subplots(2, 4)
# A nested for loop fills the 2x4 grid with eight charts. The y-axis variable names are stored in this list
y=["length", "diameter", "height", "whole_wt", "shucked_wt", "viscera_wt", "shell_wt", "rings"]
s=0  # index of the current variable in y (and of its color)
colors=['salmon', 'sandybrown', 'khaki', 'y', 'mediumaquamarine','lightskyblue','plum', 'lightgrey']
for i in range(2):  # the loops apply the same settings to every chart in the grid and avoid repetitive code
    for j in range(4):
        ax[i,j].scatter(abalone["age"],abalone[y[s]], color=colors[s], s=10, alpha=0.1)
        ax[i,j].set_title(f"{y[s]} vs age", fontsize = 10)
        ax[i,j].set_ylabel(y[s])
        ax[i,j].set_xlabel("age")
        s+=1
ax[0, 1].set_ylim([0, 0.8])
fig.tight_layout()
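The impression from the scatter plots can be backed up with Pearson correlation coefficients between each variable and age. The sketch below is an addition, not part of the original notebook: the helper `age_correlations` and the `demo` values are hypothetical. Pearson correlation only captures linear association, so a curved relationship can score lower than it looks on a chart.

```python
import pandas as pd


def age_correlations(df: pd.DataFrame, columns) -> pd.Series:
    """Pearson correlation of each listed column with age, strongest first."""
    return df[list(columns)].corrwith(df["age"]).sort_values(ascending=False)


# Made-up values for illustration only; these are not rows from the dataset.
demo = pd.DataFrame({
    "age": [8.5, 10.5, 12.5, 14.5],
    "length": [0.40, 0.50, 0.60, 0.70],
    "height": [0.10, 0.15, 0.12, 0.16],
})
result = age_correlations(demo, ["length", "height"])
print(result.round(3))
```

On the real data the call would be `age_correlations(abalone, y)`, reusing the list of variable names defined for the charts.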
Let's check whether any pair of variables, taken together, also gives a good result:
sns.pairplot(abalone, hue="age", vars=y[0:6])
3. Which variables are better predictors of an abalone's age?
The best single predictor of an abalone's age is certainly the number of rings, since age is defined as rings + 1.5. Some other variables can also give good predictions: length and diameter work well for this. Alone, together, or paired with other variables, they give the most consistent results.
The weaker the relationship between a variable and age, the harder it is to get a meaningful prediction.
Let's take a closer look at one chart:
sns.scatterplot(data=abalone, x="diameter", y="length", hue="age", alpha=0.4, size="age")
plt.title("Abalone diameter vs. length, colored and sized by age")
These two variables, length and diameter, have a strong relationship with each other, and both are also related to age. Together they show that age increases as diameter and length increase.
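To see how far length and diameter alone can go, one option is an ordinary least-squares fit of age on both variables. This is a sketch under stated assumptions rather than the method used in this notebook: the helper `fit_age_linear` and the `demo` values are hypothetical, the model is assumed to be linear, and the R² it reports is measured on the same rows it was fitted on.

```python
import numpy as np
import pandas as pd


def fit_age_linear(df: pd.DataFrame, features=("length", "diameter")):
    """Least-squares fit of age on the given features.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    columns = [df[f].to_numpy(dtype=float) for f in features]
    X = np.column_stack([np.ones(len(df))] + columns)
    target = df["age"].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    ss_total = ((target - target.mean()) ** 2).sum()
    r_squared = 1.0 - (residuals ** 2).sum() / ss_total
    return coef, r_squared


# Made-up values built as age = 2 + 10 * length + 5 * diameter,
# for illustration only; these are not rows from the dataset.
demo = pd.DataFrame({
    "length": [0.30, 0.40, 0.50, 0.60, 0.70],
    "diameter": [0.25, 0.30, 0.42, 0.45, 0.58],
})
demo["age"] = 2 + 10 * demo["length"] + 5 * demo["diameter"]

coef, r_squared = fit_age_linear(demo)
print(coef.round(3), round(r_squared, 3))
```

Because length and diameter are strongly related to each other, the individual coefficients from such a fit on the real data may be unstable, even when the overall prediction is reasonable.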
Thank you for reading this! Good luck!