Decoding Customer Feedback: Predictive Analytics for Java June's Review Enhancement

Objectives

Java June is a company that owns coffee shops in a number of locations in Europe. The company knows that stores with more reviews typically get more new customers, because new customers consider the number of reviews when picking between two shops. The company wants more insight into what leads to more reviews, and it is also interested in whether there is a link between the number of reviews and a store's rating. This report is intended to answer both questions.

Import & Cleaning

Data Loading and Preliminary Inspection

The first step in my analysis was to load the dataset into a Pandas DataFrame, a flexible structure well suited to handling and analyzing tabular data efficiently.

Upon loading the data, my immediate task was to understand its structure and identify any missing values, since unhandled gaps would undermine the reliability of the analysis. Using df.info(), I observed that the 'Rating' and 'Reviews' columns each have 2 missing entries. More notably, the 'Dine in option' and 'Takeout option' columns have 60 and 56 missing values, respectively. These two columns are also not stored as booleans, which is the format that best represents a yes/no feature.

import pandas as pd

# Load the data into a DataFrame
df = pd.read_csv('coffee.csv')

# Display a summary of the DataFrame (column dtypes and non-null counts)
df.info()
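The per-column gaps that df.info() implies can also be counted directly. The following is a minimal sketch, assuming a small made-up DataFrame in place of coffee.csv: the values in `sample` are hypothetical and only the column names follow the code in this report.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for coffee.csv (values are made up)
sample = pd.DataFrame({
    'Rating': [4.6, np.nan, 4.3, 4.7],
    'Reviews': [206.0, 24.0, np.nan, 11.0],
    'Dine in option': [True, np.nan, np.nan, True],
    'Takeout option': [True, True, np.nan, np.nan],
})

# Count missing entries per column, largest first
missing_counts = sample.isna().sum().sort_values(ascending=False)
print(missing_counts)

# The option columns hold True/NaN, so pandas stores them as generic objects
print(sample.dtypes)
```

Running the same two calls on the real `df` should reproduce the counts reported above and show which columns still need a type conversion.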

Handling Missing Values

The dataset contains several columns, each with its unique characteristics and potential missing values. The following table outlines how I will address missing values in each column:

Column Name	Criteria
Region	Nominal. Store location from 10 possible regions (A to J). Missing values should be replaced with “Unknown”.
Place name	Nominal. The name of the store. Missing values should be replaced with “Unknown”.
Place type	Nominal. Type of coffee shop (Coffee shop, Cafe, Espresso bar, Others). Missing values should be replaced with “Unknown”.
Rating	Ordinal. Average rating of the store from reviews on a 5-point scale. Missing values should be replaced with 0.
Reviews	Nominal. The number of reviews given to the store. Missing values should be replaced with the overall median number.
Price	Ordinal. The price range of products in the store ($, $$, $$$). Missing values should be replaced with “Unknown”.
Delivery Option	Nominal. Indicates if delivery is available (True or False). Missing values should be replaced with False.
Dine in Option	Nominal. Indicates if dine-in is available (True or False). Missing values should be replaced with False.
Takeaway Option	Nominal. Indicates if take away is available (True or False). Missing values should be replaced with False.

To implement this, I defined a dictionary mapping each column that contains missing values to its fill value. The 'Reviews' column is handled separately because its fill value, the median of the existing values, has to be computed from the data. Here’s the code segment illustrating this process:

# Define a dictionary for filling missing values
fill_values = {
    'Rating': 0,
    # 'Reviews' is filled separately below, using the column median
    'Dine in option': False,
    'Takeout option': False
}

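# Fill missing values according to the dictionary
df.fillna(fill_values, inplace=True)

# Store the two option columns as booleans
# (assumes the recorded values are True/False)
option_cols = ['Dine in option', 'Takeout option']
df[option_cols] = df[option_cols].astype(bool)

# Fill 'Reviews' with the median of the observed values; assigning the result
# back avoids calling fillna in place on a column selection
df['Reviews'] = df['Reviews'].fillna(df['Reviews'].median())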

# Display the first few rows of the dataframe to confirm filling of missing values
display(df.head())

# Display information about the dataframe to confirm data types and non-null counts
df.info()
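The fill rules can be checked end to end on a toy frame before trusting them on the full dataset. This is a sketch under stated assumptions rather than the report's exact pipeline: the `toy` data is hypothetical, and the option columns are converted through pandas' nullable 'boolean' dtype, which is one of several ways to reach a plain bool column.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering every kind of gap described above
toy = pd.DataFrame({
    'Rating': [4.6, np.nan, 4.3],
    'Reviews': [10.0, np.nan, 30.0],
    'Dine in option': [True, np.nan, True],
    'Takeout option': [np.nan, True, np.nan],
})

# Rating: missing values become 0
toy['Rating'] = toy['Rating'].fillna(0)

# Reviews: missing values become the median of the observed values
toy['Reviews'] = toy['Reviews'].fillna(toy['Reviews'].median())

# Option columns: go through the nullable 'boolean' dtype, fill the gaps
# with False, then convert to plain bool once no missing values remain
for col in ['Dine in option', 'Takeout option']:
    toy[col] = toy[col].astype('boolean').fillna(False).astype(bool)

# Confirm that nothing is missing and the option columns are boolean
print(toy.isna().sum())
print(toy.dtypes)
```

The two print calls play the same role as df.head() and df.info() above: every missing count should be zero and both option columns should report a bool dtype.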

Exploratory Data Analysis

Analysis of Store Ratings

To gain insights into customer satisfaction, I visualized the distribution of store ratings using a bar plot. The plot revealed two key observations:

No Low Ratings: Among stores with a recorded rating, none is rated below 3.9, indicating a generally high level of customer satisfaction across the stores in the dataset.

Concentration of Higher Ratings: A significant portion of stores received ratings of 4.3 or higher, with 4.6 and 4.7 being particularly common.

This distribution suggests that customers who chose to leave ratings generally had positive experiences. The absence of lower ratings could indicate a tendency for dissatisfied customers to refrain from rating, or it might reflect an overall high standard across the stores. However, this observation warrants further investigation to understand the complete customer experience spectrum.

import matplotlib.pyplot as plt

# Create a count plot for the 'Rating' column
plt.figure(figsize=(10,6))
df['Rating'].value_counts().sort_index().plot(kind='bar', color='skyblue', edgecolor='black')

# Adding title and labels
plt.title('Number of Stores Given Each Rating', fontsize=15)
plt.xlabel('Ratings', fontsize=12)
plt.ylabel('Number of Stores', fontsize=12)
plt.xticks(rotation=0)  # Keep the x-axis tick labels horizontal
plt.grid(axis='y', linestyle='--', linewidth=0.7)

# Show plot
plt.tight_layout()
plt.show()
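The two observations drawn from the plot can also be checked numerically. The sketch below assumes a hypothetical `ratings` series, and it assumes that a rating of 0 marks a store whose rating was missing (the fill rule from the table), so those placeholders are excluded before summarizing.

```python
import pandas as pd

# Hypothetical ratings; 0.0 stands for a store whose rating was missing
ratings = pd.Series([4.6, 4.7, 0.0, 4.3, 3.9, 4.6, 4.1, 4.7], name='Rating')

# Keep only stores with a recorded rating
rated = ratings[ratings > 0]

# Lowest recorded rating and share of stores rated 4.3 or higher
lowest = rated.min()
share_high = (rated >= 4.3).mean()

print(f"Lowest recorded rating: {lowest}")
print(f"Share of stores rated 4.3 or higher: {share_high:.1%}")

# Per-rating counts, in the same order as the bars in the plot
print(rated.value_counts().sort_index())
```

Applied to the real data, `df['Rating']` would take the place of `ratings`.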
