Tell me who makes and drinks wine
    Global Wine Markets 2014 to 2016

    📖 Background

    With the end-of-year holidays approaching, many people like to relax or celebrate with a glass of wine. That makes wine an important industry in many countries, and understanding this market matters to the livelihoods of many people.

    You work at a multinational consumer goods organization that is considering entering the wine production industry. Managers at your company would like to understand the market better before making a decision.

    💾 The data

    This dataset is a subset of the University of Adelaide's Annual Database of Global Wine Markets.

    The dataset consists of a single CSV file, data/wine.csv.

    Each row in the dataset represents the wine market in one country. There are 34 metrics for the wine industry covering both the production and consumption sides of the market.

    💪 Challenge

    Explore the dataset to understand the global wine market. Your published notebook should contain a short report on the state of the market, including summary statistics, visualizations, and text describing any insights you found.

    Introduction

    The dataset used for this analysis covers 52 countries and has 36 columns: two identifiers (Region and Country) and the 34 wine-market metrics. The columns are:

    • Region: The geographic region where the country is located (e.g., Western Europe, North America).
    • Country: The name of the country.
    • Vine Area ('000 ha): The area of land dedicated to vineyards in thousands of hectares.
    • Cropland under vines (%): The percentage of total cropland that is under vine cultivation.
    • Wine produced (ML): The volume of wine produced in millions of liters.
    • Wine consumed (ML): The volume of wine consumed in millions of liters.
    • Wine consumed (l/capita): The volume of wine consumed per capita in liters.
    • Wine expenditure (US$m 2015): The total expenditure on wine in millions of 2015 US dollars.
    • Per capita wine expenditure (US$ 2015): The per capita expenditure on wine in 2015 US dollars.
    • Population (millions): The population of the country in millions.
    • GDP (billion US$ real 1990): The Gross Domestic Product of the country in billions of US dollars, in real terms at 1990 prices.
    • GDP per capita ('000 US$): The GDP per capita in thousands of US dollars.
    • Wine export vol. (ML): The volume of wine exported in millions of liters.
    • Wine import vol. (ML): The volume of wine imported in millions of liters.
    • Value of wine exports (US$ mill): The total value of wine exports in millions of US dollars.
    • Value of wine imports (US$ mill): The total value of wine imports in millions of US dollars.
    • Bottled still wine exports (ML): The volume of bottled still wine exported in millions of liters.
    • Bottled still wine imports (ML): The volume of bottled still wine imported in millions of liters.
    • Sparkling wine exports (ML): The volume of sparkling wine exported in millions of liters.
    • Sparkling wine imports (ML): The volume of sparkling wine imported in millions of liters.
    • Bulk wine exports (ML): The volume of bulk wine exported in millions of liters.
    • Bulk wine imports (ML): The volume of bulk wine imported in millions of liters.
    • Unit value exports (US$/litre): The average value per liter of wine exports in US dollars.
    • Unit value imports (US$/litre): The average value per liter of wine imports in US dollars.
    • % of global prod'n volume: The percentage of the global production volume contributed by the country.
    • % of global cons'n volume: The percentage of the global consumption volume contributed by the country.
    • % of '15 global wine expend.: The percentage of the global wine expenditure in 2015 contributed by the country.
    • Wine as % of alcohol cons'n volume: The percentage of wine in the total alcohol consumption volume.
    • Exports as % of prod'n volume: The percentage of wine production volume that is exported.
    • Imports as % of cons'n volume: The percentage of wine consumption volume that is imported.
    • Wine self- suff. (%): The percentage of wine consumption that is met by domestic production.
    • % of world export volume: The percentage of the world export volume contributed by the country.
    • % of world export value: The percentage of the world export value contributed by the country.
    • % of world import volume: The percentage of the world import volume contributed by the country.
    • % of world import value: The percentage of the world import value contributed by the country.
    • Index of wine comp. advant.: An index measuring the comparative advantage of the country in wine production and trade.

    The demographic and economic indicators (population, GDP, and GDP per capita) provide context for interpreting the wine market figures.
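Several of the share and ratio columns listed above follow from the volume and value columns by simple arithmetic. The sketch below illustrates this with made-up figures for two hypothetical countries. It assumes that self-sufficiency is production divided by consumption and that unit value is export value divided by export volume; the source database may define these slightly differently.

```python
import pandas as pd

# Made-up figures for two hypothetical countries (not taken from data/wine.csv)
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Wine produced (ML)": [1000.0, 50.0],
    "Wine consumed (ML)": [400.0, 200.0],
    "Wine export vol. (ML)": [650.0, 5.0],
    "Wine import vol. (ML)": [40.0, 160.0],
    "Value of wine exports (US$ mill)": [1950.0, 20.0],
})

# Share of production that is exported
toy["Exports as % of prod'n volume"] = (
    100 * toy["Wine export vol. (ML)"] / toy["Wine produced (ML)"]
)

# Share of consumption that is imported
toy["Imports as % of cons'n volume"] = (
    100 * toy["Wine import vol. (ML)"] / toy["Wine consumed (ML)"]
)

# Assumed definition: domestic production relative to domestic consumption
toy["Wine self- suff. (%)"] = (
    100 * toy["Wine produced (ML)"] / toy["Wine consumed (ML)"]
)

# US$ million divided by million litres gives US$ per litre
toy["Unit value exports (US$/litre)"] = (
    toy["Value of wine exports (US$ mill)"] / toy["Wine export vol. (ML)"]
)

print(toy.set_index("Country").T.round(2))
```

Under this definition, a self-sufficiency value above 100 marks a net producer (country A) and a value below 100 marks a market that depends on imports (country B).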

    The analysis proceeds through data preparation and exploratory data analysis (EDA), followed by market segmentation, geographic analysis, and competitive analysis.

    The aim is to identify key market segments, understand regional differences in wine production and consumption, and evaluate the competitive landscape, so that the organization can make an informed decision about entering the wine production industry.

    Imports

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    import geopandas as gpd
    import folium
    from folium.plugins import HeatMap
    
    
    pd.set_option('display.float_format', '{:,.2f}'.format)
    wine = pd.read_csv("data/wine.csv")

    Data Preparation

    # Load the dataset
    file_path = 'data/wine.csv'
    wine_data = pd.read_csv(file_path)
    
    # Display the initial dataset information
    print("Initial Dataset Information:")
    wine_data.info()
    
    # Handle missing values
    # Fill missing numerical values with the column mean.
    # include='number' also captures integer columns, which a float64-only filter would silently drop.
    numerical_cols = wine_data.select_dtypes(include='number').columns
    imputer = SimpleImputer(strategy='mean')
    wine_data_imputed = pd.DataFrame(
        imputer.fit_transform(wine_data[numerical_cols]),
        columns=numerical_cols,
        index=wine_data.index,
    )
    
    # For categorical columns, fill missing values with the mode (most frequent value)
    categorical_cols = wine_data.select_dtypes(include=['object']).columns
    wine_data_categorical = wine_data[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]))
    
    # Combine the categorical data and the imputed numerical data
    wine_data_cleaned = pd.concat([wine_data_categorical, wine_data_imputed], axis=1)
    
    # Keep an unscaled copy so that summary statistics and plots use the original units
    df = wine_data_cleaned.copy()
    
    # Standardize numerical features
    scaler = StandardScaler()
    numerical_features = wine_data_cleaned.select_dtypes(include=['float64']).columns
    wine_data_cleaned[numerical_features] = scaler.fit_transform(wine_data_cleaned[numerical_features])
    
    # Save the cleaned, standardized dataset for future use (numerical columns are z-scores, not original units)
    cleaned_file_path = 'data/cleaned_wine_data.csv'
    wine_data_cleaned.to_csv(cleaned_file_path, index=False)
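To see what the mean imputation and standardization steps above do to a column, here is a small sketch on a made-up frame. It uses plain pandas rather than SimpleImputer and StandardScaler, and it assumes the scaler's default behaviour (centering on the mean and dividing by the population standard deviation, ddof=0). The column values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing number, one missing category, and an integer column
toy = pd.DataFrame({
    "Region": ["X", "X", None, "Y"],
    "Wine produced (ML)": [10.0, np.nan, 30.0, 50.0],
    "Population (millions)": [5, 10, 15, 20],
})

num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

filled = toy.copy()

# Mean imputation: each missing number becomes its column mean
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].mean())

# Mode imputation: each missing category becomes the most frequent value
for col in cat_cols:
    filled[col] = filled[col].fillna(filled[col].mode()[0])

# Standardization: z = (x - mean) / population standard deviation
scaled = (filled[num_cols] - filled[num_cols].mean()) / filled[num_cols].std(ddof=0)

print(filled)
print(scaled.round(3))
```

After scaling, every numerical column has mean 0 and unit variance, which is why the unscaled copy is the one to use for reporting figures in their original units.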
    

    EDA

    # Summary statistics on the unscaled data
    summary_stats = df.describe()
    summary_stats
    
    # Frequency summary for the categorical columns
    frequency_distribution = wine_data_cleaned.describe(include=['object'])
    print("\nFrequency Distribution for Categorical Columns:")
    print(frequency_distribution)
    
    # Histogram and box plot of wine production on a shared x-axis
    fig, axs = plt.subplots(2, 1, figsize=(16, 6), sharex=True)
    
    sns.histplot(df['Wine produced (ML)'], kde=True, color='blue', ax=axs[0])
    axs[0].set_title('Distribution of Wine Production (ML)')
    
    sns.boxplot(data=df, x='Wine produced (ML)', color='blue', ax=axs[1])
    axs[1].set_title('Box Plot of Wine Production (ML)')
    
    plt.tight_layout()
    plt.show()
    
    # Bar chart of the number of countries per region
    plt.figure(figsize=(16, 4))
    tmp = df.groupby('Region')['Wine produced (ML)'].count().sort_values(ascending=False).reset_index()
    tmp.columns = ['Region', 'Count']
    sns.barplot(data=tmp, x='Region', y='Count', color='blue')
    plt.title('Frequency of Regions')
    plt.tight_layout()
    plt.show()
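The box plot of wine production draws individual points beyond its whiskers. With seaborn's default setting (whiskers at 1.5 times the interquartile range), the same cut-offs can be computed directly to list which values are flagged. The sketch below uses made-up production volumes, and the helper name `iqr_outliers` is hypothetical.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the values that fall outside the box-plot whisker limits."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

# Made-up production volumes: mostly small producers plus one very large one
production = pd.Series(
    [5.0, 10.0, 20.0, 30.0, 40.0, 60.0, 4500.0],
    name="Wine produced (ML)",
)

outliers = iqr_outliers(production)
print(outliers)
```

On the real data, applying the helper to `df['Wine produced (ML)']` and looking up the matching rows in `df['Country']` would name the countries behind the outlying points.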