Coffee Shop Ratings and Reviews
Java June is a company that owns coffee shops in a number of locations in Europe.
The company knows that stores with more reviews typically get more new customers. This is because new customers consider the number of reviews when picking between two shops.
They want to get more insight into what leads to more reviews.
They are also interested in whether there is a link between the number of reviews and the rating.
They want a report to answer these questions.
Task 1
Before you start your analysis, you will need to make sure the data is clean.
The table below shows what the data should look like.
Create a cleaned version of the dataframe.
- You should start with the data in the file "coffee.csv".
- Your output should be a dataframe named clean_data.
- All column names and values should match the table below.
| Column Name | Criteria |
|---|---|
| Region | Nominal. Where the store is located. One of 10 possible regions (A to J). Missing values should be replaced with “Unknown”. |
| Place name | Nominal. The name of the store. Missing values should be replaced with “Unknown”. |
| Place type | Nominal. The type of coffee shop. One of “Coffee shop”, “Cafe”, “Espresso bar”, and “Others”. Missing values should be replaced with “Unknown”. |
| Rating | Ordinal. Average rating of the store from reviews. On a 5-point scale. Missing values should be replaced with 0. |
| Reviews | Discrete. The number of reviews given to the store. Missing values should be replaced with the overall median number. |
| Price | Ordinal. The price range of products in the store. One of '$', '$$' or '$$$'. Missing values should be replaced with “Unknown”. |
| Delivery Option | Nominal. If delivery is available. Either True or False. Missing values should be replaced with False. |
| Dine in Option | Nominal. If dine in is available. Either True or False. Missing values should be replaced with False. |
| Takeaway Option | Nominal. If take away is available. Either True or False. Missing values should be replaced with False. |
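The criteria above can also be expressed as a single fill-value mapping. The sketch below is illustrative only; the column names are assumptions taken from the cleaning code used later in this report ('Dine in option', 'Takeout option') rather than the exact csv headers, and the notebook's own cleaning cell follows.
# Illustrative sketch: table-driven imputation covering every column in the criteria table.
# Column names are assumed to match the cleaning code below; adjust if the csv headers differ.
import pandas as pd

raw = pd.read_csv('coffee.csv')
fill_values = {
    'Region': 'Unknown', 'Place name': 'Unknown', 'Place type': 'Unknown',
    'Rating': 0, 'Reviews': raw['Reviews'].median(),  # overall median number of reviews
    'Price': 'Unknown',
    'Delivery option': False, 'Dine in option': False, 'Takeout option': False,
}
raw = raw.fillna(value=fill_values)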
# Import relevant modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
%matplotlib inline
# Read in the csv file and clean the data frame by imputing missing values
df = pd.read_csv('coffee.csv')
df['Rating'] = df['Rating'].fillna(0)
df['Reviews'] = df['Reviews'].fillna(df['Reviews'].median())
df['Dine in option'] = df['Dine in option'].fillna(False)
df['Takeout option'] = df['Takeout option'].fillna(False)
clean_data = df
clean_data
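As a quick sanity check against the criteria above (a sketch only, not part of the required output):
# Sketch: confirm the imputed columns contain no remaining missing values
# and that Rating stays on the 0-5 scale
assert clean_data[['Rating', 'Reviews', 'Dine in option', 'Takeout option']].notna().all().all()
assert clean_data['Rating'].between(0, 5).all()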
Task 2
The team at Java June believe that the number of reviews changes depending on the rating.
Produce a table showing the difference in the median number of reviews by rating, along with the minimum and maximum number of reviews, to investigate this question for the team.
- You should start with the data in the file 'coffee.csv'.
- Your output should be a data frame named reviews_by_rating.
- It should include the four columns rating, med_review, min_review, max_review.
- Your answers should be rounded to 1 decimal place.
# Remove an outlier data point with more than 17,000 reviews
df = df[df['Reviews'] < 17000]
# Seaborn boxplot of rating vs. number of reviews
sns.boxplot(data=df, x='Rating', y='Reviews', palette='gist_ncar_r', hue='Rating', legend=False)
# Median, min and max number of reviews per rating, renamed and rounded to match the spec
reviews_by_rating = (df.groupby('Rating')['Reviews']
                       .agg(med_review='median', min_review='min', max_review='max')
                       .round(1).reset_index().rename(columns={'Rating': 'rating'}))
reviews_by_rating
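To put a single number on the link between rating and review count raised in the brief, a rank correlation is one option. This is a sketch only; it assumes scipy is available, which is not otherwise used in this report.
# Sketch: Spearman rank correlation between Rating and Reviews
from scipy.stats import spearmanr

rho, p_value = spearmanr(df['Rating'], df['Reviews'])
print(f"Spearman correlation between rating and number of reviews: rho = {rho:.2f} (p = {p_value:.3f})")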
Task 3
Fit a baseline model to predict the number of reviews a store will get.
- Fit your model using the data contained in “train.csv”.
- Use “validation.csv” to predict new values based on your model. You must return a dataframe named base_result that includes Place name and rating. The rating column must be your predicted values (the predicted number of reviews).
# Import training data (train.csv) for the baseline model and clean up the data frame
from sklearn.metrics import mean_squared_error as MSE
train = pd.read_csv('train.csv')
train.columns = train.columns.str.replace('.', ' ', regex=False)  # treat '.' literally when renaming headers
train['Rating'] = train['Rating'].fillna(0)
train['Reviews'] = train['Reviews'].fillna(train['Reviews'].median())
train['Dine in option'] = train['Dine in option'].fillna(False)
train['Takeout option'] = train['Takeout option'].fillna(False)
# Drop the place name column for model fitting and split the features and the number of
# reviews into X_train and y_train. The target is log-transformed to avoid negative predictions.
train = train.drop(columns=['Place name'])
features = train.drop(columns=['Reviews'])
X_train = pd.get_dummies(features)
y_train = np.log(train['Reviews'])
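# Note (assumption): np.log requires every store to have at least one review; a zero
# review count would produce -inf. If zeros can occur, np.log1p on the target together
# with np.expm1 on the predictions is a drop-in variant of the same idea.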
# Fit baseline linear regression model
lr = LinearRegression()
lr.fit(X_train,y_train)
# Import validation data (validation.csv) and clean up the data frame in the same way
validation = pd.read_csv('validation.csv')
validation.columns = validation.columns.str.replace('.', ' ', regex=False)
validation['Rating'] = validation['Rating'].fillna(0)
validation['Dine in option'] = validation['Dine in option'].fillna(False)
validation['Takeout option'] = validation['Takeout option'].fillna(False)
# Keep the place name column so it can be reattached after prediction
names = validation['Place name']
# Drop the place name column, one-hot encode the features, and later take the exponent
# of the predicted values to map them back onto the review-count scale
validation = validation.drop(columns=['Place name'])
X_test = pd.get_dummies(validation)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0) # Reorder columns to match X_train
y_pred_lr = lr.predict(X_test)
rating = np.exp(y_pred_lr)
base_result = pd.DataFrame({'Place name':names,'rating':rating})
base_result
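Before fitting the comparison model, the baseline can be sanity-checked on its own training data (a sketch only, not part of the required output):
# Sketch: baseline in-sample RMSE, back-transformed from the log scale to review counts
train_rmse = np.sqrt(MSE(np.exp(y_train), np.exp(lr.predict(X_train))))
print(f"Baseline in-sample RMSE: {train_rmse:.1f} reviews")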
Task 4
Fit a comparison model to predict the number of reviews a store will get.
- Fit your model using the data contained in “train.csv”.
- Use “validation.csv” to predict new values based on your model. You must return a dataframe named compare_result that includes Place name and rating. The rating column must be your predicted values.
# Fit comparison model to training data using Random Forest Regressor from sklearn
rf = RandomForestRegressor()
rf.fit(X_train,y_train)
y_pred_rf = rf.predict(X_test)
rating_rf = np.exp(y_pred_rf)
compare_result = pd.DataFrame({'Place name':names,'rating':rating_rf})
compare_result
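The brief also asks what leads to more reviews. The fitted forest's feature importances give a rough, exploratory answer; this is a sketch only, and importances for one-hot encoded columns should be read with care.
# Sketch: rank the encoded features by how much the random forest relies on them
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importances.head(10))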
# Build y_test from the cleaned data: the actual review counts for the validation stores,
# dropping duplicated place names (including a second 'Dim Kavy' entry of type 'Others')
# so the actual values line up with the predictions
y_test = clean_data[clean_data['Place name'].isin(names)].drop_duplicates(subset=['Place name','Place type'], keep='first')
remover = (y_test['Place name'] == 'Dim Kavy') & (y_test['Place type'] == 'Others')
y_test = y_test[~remover]['Reviews'].values
y_test
# Calculate root mean squared error of both linear regression and random forest models (RF performs slightly better)
rmse_lr = np.sqrt(MSE(y_test,rating))
rmse_rf = np.sqrt(MSE(y_test,rating_rf))
print(f"The Root Mean Squared Error (RMSE) for the Linear Regression model is {rmse_lr} and for the Random Forest Regressor is {rmse_rf}")import seaborn as sns
# Set seaborn style
sns.set(style="darkgrid")
# Plot the predicted review counts from both models against the actual data using seaborn
plt.figure(figsize=(10, 6))
ax = sns.lineplot(data=pd.DataFrame({'rating': rating, 'rating_rf': rating_rf, 'y_test': y_test}))
plt.xlabel('Shop Index')
plt.ylabel('Number of Reviews')
plt.title('Predicted vs. Actual Review Counts')
leg = plt.legend(['Linear Regression', 'Random Forest Regressor', 'Actual Reviews'])
# Make the legend handles match the plotted line styles so every series shows up in the legend
for handle, line in zip(leg.get_lines(), ax.get_lines()):
    handle.set_color(line.get_color())
    handle.set_linestyle(line.get_linestyle())
plt.show()