Global Wine Market
  • Global Wine Markets 2015

    📖 Background

    With the end-of-year holidays approaching, many people like to relax or celebrate with a glass of wine. That makes wine an important industry in many countries, and understanding this market matters to many people's livelihoods.

    You work at a multinational consumer goods organization that is considering entering the wine production industry. Managers at your company would like to understand the market better before making a decision.

    💾 The data

    This dataset is a subset of the University of Adelaide's Annual Database of Global Wine Markets.

    The dataset consists of a single CSV file, data/wine.csv.

    Each row in the dataset represents the wine market in one country. There are 34 metrics for the wine industry covering both the production and consumption sides of the market.

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set()
    
    wine = pd.read_csv("data/wine.csv")
    # Give the third column a shorter name, then preview the first rows
    wine.rename(columns = {wine.columns[2] : 'Vine Area'}, inplace=True)
    wine.head()
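
    The data description says each row represents one country, which is worth verifying before any per-country comparison. The following is a minimal sketch on a toy frame, not the real dataset; the helper `duplicated_keys` and the column name "Country" are assumptions made for illustration.

```python
import pandas as pd

def duplicated_keys(df: pd.DataFrame, key: str) -> list:
    """Return the values of `key` that appear in more than one row."""
    counts = df[key].value_counts()
    return sorted(counts[counts > 1].index.tolist())

# Toy frame standing in for the real data; "Country" is a hypothetical column name
toy = pd.DataFrame({
    "Country": ["France", "Italy", "Spain", "Italy"],
    "Vine Area": [10.0, 8.0, 9.0, 8.0],
})

repeated = duplicated_keys(toy, "Country")
print(repeated)  # countries that occupy more than one row
```

    An empty list would mean the one-row-per-country assumption holds for the chosen key column.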

    We will run several standard checks on the data. Let's see how many values are missing and look at the data types.

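    # drop_duplicates returns a new DataFrame, so assign the result back
    wine = wine.drop_duplicates()
    # info() prints its summary directly and returns None, so no display() is needed
    wine.info()
    display(wine.describe())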
    
    def percent_hbar(df, old_threshold=None):
        percent_of_nulls = (df.isnull().sum()/len(df)*100).sort_values().round(2)
        threshold = percent_of_nulls.mean()
        ax = percent_of_nulls.plot(kind='barh', figsize=(20, 16), title='% of NaN (from {} lines)'.format(len(df)), 
                                   color='#86bf91', legend=False, fontsize=17)
        # Label every non-zero bar with its percentage; bars sit at integer y positions
        for pos, (name, pct) in enumerate(percent_of_nulls.items()):
            if pct > 0:
                # Red marks columns above the mean share of missing values
                color = 'red' if pct > threshold else 'blue'
                ax.text(pct + 0.1, pos, f'{pct}%', color=color, va='center',
                        fontweight='bold', fontsize='large')
        if old_threshold is not None:
            plt.axvline(x=old_threshold,linewidth=1, color='r', linestyle='--')
            ax.text(old_threshold+0.3, .10, '{0:.2%}'.format(old_threshold/100), color='r', fontweight='bold', fontsize='large')
            plt.axvline(x=threshold,linewidth=1, color='green', linestyle='--')
            ax.text(threshold+0.3, .7, '{0:.2%}'.format(threshold/100), color='green', fontweight='bold', fontsize='large')
        else:
            plt.axvline(x=threshold,linewidth=1, color='r', linestyle='--')
            ax.text(threshold+0.3, .7, '{0:.2%}'.format(threshold/100), color='r', fontweight='bold', fontsize='large')
        ax.set_xlabel('')
        return ax, threshold
    
    plot, threshold = percent_hbar(wine)
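
    The same information as the bar chart can also be read from a small table, which is easier to scan when there are many columns. This is a sketch on a toy frame; the helper name `missing_summary` and the toy columns are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, count of NaN, and percent of NaN."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(2),
    })
    # Columns with the largest share of missing values come first
    return summary.sort_values("pct_missing", ascending=False)

# Toy frame with hypothetical columns, used only to show the output shape
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

summary = missing_summary(toy)
print(summary)
```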
    
    variables = pd.DataFrame(columns=['Variable','Number of unique values','Values'])
    
    for i, var in enumerate(wine.columns):
        variables.loc[i] = [var, wine[var].nunique(), wine[var].unique().tolist()]
    variables.set_index('Variable', inplace=True)    
    variables
    # Note: -1 is a sentinel for missing values; later statistics will treat it as a real number
    wine.fillna(-1, inplace=True)
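
    Filling missing values with -1 means any later statistic treats -1 as a genuine measurement. The sketch below uses made-up numbers, not the wine data, to show how this can change a correlation; pandas' `corr` excludes missing values pairwise when they are left as NaN.

```python
import numpy as np
import pandas as pd

# Made-up data: y is exactly 2 * x, with one value missing
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, np.nan, 8.0, 10.0],
})

# NaN left in place: the incomplete row is skipped
corr_skipping_nan = toy["x"].corr(toy["y"])

# NaN replaced by the sentinel: -1 enters the calculation as a real value
corr_with_sentinel = toy["x"].corr(toy["y"].fillna(-1))

print(corr_skipping_nan, corr_with_sentinel)
```

    In this toy case the sentinel lowers the correlation of a perfectly linear pair, so coefficients computed after the fill should be read with that in mind for columns with many missing values.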
    

    Let's look at the correlations between the numeric columns.

    plt.figure(figsize=(30, 22))
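    # Compute the matrix once; numeric_only=True skips text columns (requires pandas >= 1.5)
    corr = wine.corr(numeric_only=True)
    # np.bool was removed in NumPy 1.24, so use the built-in bool
    mask = np.triu(np.ones_like(corr, dtype=bool))
    heatmap = sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='Blues')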
    heatmap.set_title('Triangle Correlation Heatmap', fontdict={'fontsize':14}, pad=18);
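
    A large heatmap is hard to read cell by cell, so it can help to list the strongest pairs directly. The following is a sketch under stated assumptions: the helper `strongest_pairs` and the toy columns are hypothetical, and `numeric_only=True` assumes pandas 1.5 or later.

```python
import pandas as pd

def strongest_pairs(df: pd.DataFrame, top: int = 3) -> pd.DataFrame:
    """Rank column pairs by absolute correlation, listing each pair once."""
    corr = df.corr(numeric_only=True)
    names = corr.columns
    # Walk the strict lower triangle so (a, b) and (b, a) are not both listed
    records = [
        (names[i], names[j], corr.iloc[i, j])
        for i in range(len(names))
        for j in range(i)
    ]
    pairs = pd.DataFrame(records, columns=["first", "second", "corr"]).dropna()
    pairs = pairs.sort_values("corr", key=lambda s: s.abs(), ascending=False)
    return pairs.head(top).reset_index(drop=True)

# Toy frame with hypothetical columns; "label" is text and is ignored by corr
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 4, 3, 2, 1],
    "label": ["p", "q", "r", "s", "t"],
})

top = strongest_pairs(toy, top=2)
print(top)
```

    Sorting by absolute value keeps strong negative relationships near the top, which a sequential colour map such as 'Blues' tends to hide.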