Coffee Shop Ratings and Reviews
Java June is a company that owns coffee shops in a number of locations in Europe.
The company knows that stores with more reviews typically get more new customers. This is because new customers consider the number of reviews when picking between two shops.
They want to get more insight into what leads to more reviews.
They are also interested in whether there is a link between the number of reviews and the rating.
They want a report to answer these questions.
Task 1
Before you start your analysis, you will need to make sure the data is clean.
The table below shows what the data should look like.
Create a cleaned version of the dataframe.
- You should start with the data in the file "coffee.csv".
- Your output should be a dataframe named clean_data.
- All column names and values should match the table below.
| Column Name | Criteria |
|---|---|
| Region | Nominal. Where the store is located. One of 10 possible regions (A to J). Missing values should be replaced with “Unknown”. |
| Place name | Nominal. The name of the store. Missing values should be replaced with “Unknown”. |
| Place type | Nominal. The type of coffee shop. One of “Coffee shop”, “Cafe”, “Espresso bar”, and “Others”. Missing values should be replaced with “Unknown”. |
| Rating | Ordinal. Average rating of the store from reviews. On a 5-point scale. Missing values should be replaced with 0. |
| Reviews | Discrete. The number of reviews given to the store. Missing values should be replaced with the overall median number. |
| Price | Ordinal. The price range of products in the store. One of '$', '$$' or '$$$'. Missing values should be replaced with “Unknown”. |
| Delivery Option | Nominal. If delivery is available. Either True or False. Missing values should be replaced with False. |
| Dine in Option | Nominal. If dine in is available. Either True or False. Missing values should be replaced with False. |
| Takeaway Option | Nominal. If take away is available. Either True or False. Missing values should be replaced with False. |
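The criteria above can also be expressed as a single fill-value mapping. The sketch below is illustrative only; the column names are assumptions taken from the cleaning code used later in this report ('Dine in option', 'Takeout option') rather than the exact csv headers, and the notebook's own cleaning cell follows.
# Illustrative sketch: table-driven imputation covering every column in the criteria table.
# Column names are assumed to match the cleaning code below; adjust if the csv headers differ.
import pandas as pd

raw = pd.read_csv('coffee.csv')
fill_values = {
    'Region': 'Unknown', 'Place name': 'Unknown', 'Place type': 'Unknown',
    'Rating': 0, 'Reviews': raw['Reviews'].median(),  # overall median number of reviews
    'Price': 'Unknown',
    'Delivery option': False, 'Dine in option': False, 'Takeout option': False,
}
raw = raw.fillna(value=fill_values)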
# Import relevant modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
%matplotlib inline
# Read in the csv file and clean the data frame by imputing missing values
df = pd.read_csv('coffee.csv')
df['Rating'] = df['Rating'].fillna(0)
df['Reviews'] = df['Reviews'].fillna(df['Reviews'].median())
df['Dine in option'] = df['Dine in option'].fillna(False)
df['Takeout option'] = df['Takeout option'].fillna(False)
clean_data = df
clean_data
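As a quick sanity check against the criteria above (a sketch only, not part of the required output):
# Sketch: confirm the imputed columns contain no remaining missing values
# and that Rating stays on the 0-5 scale
assert clean_data[['Rating', 'Reviews', 'Dine in option', 'Takeout option']].notna().all().all()
assert clean_data['Rating'].between(0, 5).all()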
Task 2
The team at Java June believe that the number of reviews changes depending on the rating.
Produce a table showing the difference in the median number of reviews by rating, along with the minimum and maximum number of reviews, to investigate this question for the team.
- You should start with the data in the file 'coffee.csv'.
- Your output should be a data frame named reviews_by_rating.
- It should include the four columns rating, med_review, min_review, max_review.
- Your answers should be rounded to 1 decimal place.
# Remove an outlier data point with more than 17,000 reviews
df = df[df['Reviews'] < 17000]
# Seaborn boxplot of rating vs. number of reviews
sns.boxplot(data=df, x='Rating', y='Reviews', palette='gist_ncar_r', hue='Rating', legend=False)
# Median, min and max number of reviews per rating, renamed and rounded to match the spec
reviews_by_rating = (df.groupby('Rating')['Reviews']
                       .agg(med_review='median', min_review='min', max_review='max')
                       .round(1).reset_index().rename(columns={'Rating': 'rating'}))
reviews_by_rating
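To put a single number on the link between rating and review count raised in the brief, a rank correlation is one option. This is a sketch only; it assumes scipy is available, which is not otherwise used in this report.
# Sketch: Spearman rank correlation between Rating and Reviews
from scipy.stats import spearmanr

rho, p_value = spearmanr(df['Rating'], df['Reviews'])
print(f"Spearman correlation between rating and number of reviews: rho = {rho:.2f} (p = {p_value:.3f})")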
Task 3
Fit a baseline model to predict the number of reviews a store will get.
- Fit your model using the data contained in “train.csv”.
- Use “validation.csv” to predict new values based on your model. You must return a dataframe named base_result that includes Place name and rating. The rating column must be your predicted values (the predicted number of reviews).
# Import training data (train.csv) for the baseline model and clean up the data frame
from sklearn.metrics import mean_squared_error as MSE
train = pd.read_csv('train.csv')
train.columns = train.columns.str.replace('.', ' ', regex=False)  # treat '.' literally when renaming headers
train['Rating'] = train['Rating'].fillna(0)
train['Reviews'] = train['Reviews'].fillna(train['Reviews'].median())
train['Dine in option'] = train['Dine in option'].fillna(False)
train['Takeout option'] = train['Takeout option'].fillna(False)
# Drop the place name column for model fitting and split the features and the number of
# reviews into X_train and y_train. The target is log-transformed to avoid negative predictions.
train = train.drop(columns=['Place name'])
features = train.drop(columns=['Reviews'])
X_train = pd.get_dummies(features)
y_train = np.log(train['Reviews'])
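# Note (assumption): np.log requires every store to have at least one review; a zero
# review count would produce -inf. If zeros can occur, np.log1p on the target together
# with np.expm1 on the predictions is a drop-in variant of the same idea.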
# Fit baseline linear regression model
lr = LinearRegression()
lr.fit(X_train,y_train)
# Import validation data (validation.csv) and clean up the data frame in the same way
validation = pd.read_csv('validation.csv')
validation.columns = validation.columns.str.replace('.', ' ', regex=False)
validation['Rating'] = validation['Rating'].fillna(0)
validation['Dine in option'] = validation['Dine in option'].fillna(False)
validation['Takeout option'] = validation['Takeout option'].fillna(False)
# Keep the place name column so it can be reattached after prediction
names = validation['Place name']
# Drop the place name column, one-hot encode the features, and later take the exponent
# of the predicted values to map them back onto the review-count scale
validation = validation.drop(columns=['Place name'])
X_test = pd.get_dummies(validation)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0) # Reorder columns to match X_train
y_pred_lr = lr.predict(X_test)
rating = np.exp(y_pred_lr)
base_result = pd.DataFrame({'Place name':names,'rating':rating})
base_result
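Before fitting the comparison model, the baseline can be sanity-checked on its own training data (a sketch only, not part of the required output):
# Sketch: baseline in-sample RMSE, back-transformed from the log scale to review counts
train_rmse = np.sqrt(MSE(np.exp(y_train), np.exp(lr.predict(X_train))))
print(f"Baseline in-sample RMSE: {train_rmse:.1f} reviews")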
Task 4
Fit a comparison model to predict the number of reviews a store will get.
- Fit your model using the data contained in “train.csv”.
- Use “validation.csv” to predict new values based on your model. You must return a dataframe named compare_result that includes Place name and rating. The rating column must be your predicted values.
# Fit comparison model to training data using Random Forest Regressor from sklearn
rf = RandomForestRegressor()
rf.fit(X_train,y_train)
y_pred_rf = rf.predict(X_test)
rating_rf = np.exp(y_pred_rf)
compare_result = pd.DataFrame({'Place name':names,'rating':rating_rf})
compare_result
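The brief also asks what leads to more reviews. The fitted forest's feature importances give a rough, exploratory answer; this is a sketch only, and importances for one-hot encoded columns should be read with care.
# Sketch: rank the encoded features by how much the random forest relies on them
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importances.head(10))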
# Build y_test from the cleaned data: the actual review counts for the validation stores,
# dropping duplicated place names (including a second 'Dim Kavy' entry of type 'Others')
# so the actual values line up with the predictions
y_test = clean_data[clean_data['Place name'].isin(names)].drop_duplicates(subset=['Place name','Place type'], keep='first')
remover = (y_test['Place name'] == 'Dim Kavy') & (y_test['Place type'] == 'Others')
y_test = y_test[~remover]['Reviews'].values
y_test
# Calculate root mean squared error of both linear regression and random forest models (RF performs slightly better)
rmse_lr = np.sqrt(MSE(y_test,rating))
rmse_rf = np.sqrt(MSE(y_test,rating_rf))
print(f"The Root Mean Squared Error (RMSE) for the Linear Regression model is {rmse_lr} and for the Random Forest Regressor is {rmse_rf}")import seaborn as sns
# Set seaborn style
sns.set(style="darkgrid")
# Plot the predicted review counts from both models against the actual data using seaborn
plt.figure(figsize=(10, 6))
ax = sns.lineplot(data=pd.DataFrame({'rating': rating, 'rating_rf': rating_rf, 'y_test': y_test}))
plt.xlabel('Shop Index')
plt.ylabel('Number of Reviews')
plt.title('Predicted vs. Actual Review Counts')
leg = plt.legend(['Linear Regression', 'Random Forest Regressor', 'Actual Reviews'])
# Make the legend handles match the plotted line styles so every series shows up in the legend
for handle, line in zip(leg.get_lines(), ax.get_lines()):
    handle.set_color(line.get_color())
    handle.set_linestyle(line.get_linestyle())
plt.show()