Find good quality wines via unsupervised learning

Here we explore the chemical composition of wines with the goal of relating wine quality to chemical traits. We use an unsupervised learning technique, k-means clustering, to uncover groups of wines according to these traits.

by jortega

# load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.colors as mcolors
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn import metrics
# load wine quality data
wine = pd.read_csv("./winequality.csv")
wine.info()

Exploratory Data Analysis

# check nulls
print(wine.isnull().sum())

This dataset has two measures of wine quality: a binary classification of whether the wine is good (0 = not good, 1 = good), and a numeric scale up to 10, where a score of 10 marks the best quality wines.

# check quality vs good
pd.crosstab(wine['good'], wine['quality'])

In this cross-tab we see that a quality score of seven or above corresponds to "good" wines.
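
The threshold read off the cross-tab can also be checked programmatically. The sketch below uses a small made-up frame that only borrows the column names `quality` and `good`; the values are illustrative and do not come from the dataset, and the cutoff of 7 is the one suggested by the cross-tab.

```python
import pandas as pd

# Toy data (hypothetical values) with the same column names as the wine dataset
toy = pd.DataFrame({
    "quality": [3, 5, 6, 7, 8, 9],
    "good":    [0, 0, 0, 1, 1, 1],
})

# Rebuild the binary flag from the numeric score, assuming a cutoff of 7
derived = (toy["quality"] >= 7).astype(int)

# True only when the flag is fully determined by the threshold
matches = bool((derived == toy["good"]).all())
print(matches)
```

Running the same comparison on `wine` would confirm whether the two quality measures agree for every row, not just in the aggregated table.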

# check distribution for chemical components
cc = [col for col in wine.columns if col not in ["good", "quality", "color"]]
wine[cc].hist(figsize=(10, 10), color='green', alpha=0.7)
plt.show()
# check outliers with boxplot charts
sns.set_theme()

fig, axs = plt.subplots(2, 2, figsize=(10, 8))
chemical = ['residual sugar', 'chlorides', 'free sulfur dioxide', 'density']

# one boxplot per chemical, split by the binary quality flag
for ax, name in zip(axs.flat, chemical):
    sns.boxplot(data=wine, y=name, x='good', ax=ax, hue='good')
    ax.set_xlabel(name)
plt.show()
# remove outliers using the IQR rule (upper bound only)
Q1 = wine[chemical].quantile(0.25)
Q3 = wine[chemical].quantile(0.75)
IQR = Q3 - Q1

# calculate upper bound: values at or above Q3 + 1.5 * IQR are treated as outliers
upper = Q3 + 1.5 * IQR
print("Upper bound:\n", upper)

# keep only the rows below the upper bound for each chemical
for name in chemical:
    wine = wine[wine[name] < upper[name]]

# reset index
wine = wine.reset_index(drop=True)
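# check for correlated measures
corr = wine[cc].astype(float).corr()
# hide the lower triangle and diagonal, which duplicate the upper triangle;
# a boolean mask avoids leaving cells unmasked when a correlation is exactly 0
mask = np.tril(np.ones_like(corr, dtype=bool))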
plt.figure(figsize=(7,7))
plt.title('Feature correlation', size=15)
sns.heatmap(corr, vmax=1.0, square=True, annot=True, cmap = "coolwarm", fmt = '.1g', mask = mask)
plt.show()

Alcohol and density seem highly correlated. Let's dig deeper into this relationship.

# visualize relationship between alcohol and density on a random sample of 500 wines
sample = wine.sample(n=500)

sns.relplot(data=sample, x="density", y="alcohol", hue="good", alpha=0.7)
plt.show()
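
To put a number on a relationship like this one, the Pearson coefficient for a single pair of columns can be computed with `Series.corr`. The sketch below runs on synthetic data: the column names match the dataset, but the values, the slope, and the noise level are assumptions chosen only to produce a visible trend.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the two columns (assumed values, not the real data)
alcohol = rng.uniform(8, 14, size=200)
# Assumed toy relationship: density falls as alcohol rises, plus noise
density = 1.0 - 0.001 * alcohol + rng.normal(0, 0.001, size=200)
toy = pd.DataFrame({"alcohol": alcohol, "density": density})

# Pearson correlation between the two columns (the pandas default method)
r = toy["alcohol"].corr(toy["density"])
print(round(r, 2))
```

On the real data the same call would be `wine["alcohol"].corr(wine["density"])`, which gives the value shown in the corresponding heatmap cell at higher precision.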

Data pre-processing

The next step is to use the chemical components to create clusters and visualize how each cluster relates to wine quality. Before clustering, we transform the data with a robust scaler, which centers each feature on its median and scales it by its interquartile range so that all features are on comparable scales.
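
As a minimal sketch of what median-centered scaling does, the example below applies scikit-learn's `RobustScaler` with its default settings to a tiny made-up matrix. The array `X` is hypothetical and includes one deliberate outlier; it is not taken from the wine data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy feature matrix (hypothetical values); the last row holds an outlier
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0],
              [100.0, 50.0]])

# Defaults: subtract the median, divide by the 25th-75th percentile range
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

print(scaler.center_)               # per-column medians
print(scaler.scale_)                # per-column interquartile ranges
print(np.median(X_scaled, axis=0))  # medians of the scaled columns
```

Because the median and interquartile range barely move when a single extreme value is present, the outlier in the first column does not distort the scale of the other rows the way a mean and standard deviation would.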