Sample Exam: Coffee Shops
Java June is a company that owns coffee shops in a number of locations in Europe.
The company knows that stores with more reviews typically get more new customers. This is because new customers consider the number of reviews when picking between two shops.
They want to get more insight into what leads to more reviews.
They are also interested in whether there is a link between the number of reviews and rating.
They want a report to answer these questions.
Task 1
Before you start your analysis, you will need to make sure the data is clean.
The table below shows what the data should look like.
Create a cleaned version of the dataframe.
-
You should start with the data in the file "coffee.csv".
-
Your output should be a dataframe named
clean_data. -
All column names and values should match the table below.
| Column Name | Criteria |
|---|---|
| Region | Nominal. Where the store is located. One of 10 possible regions (A to J). Missing values should be replaced with “Unknown”. |
| Place name | Nominal. The name of the store. Missing values should be replaced with “Unknown”. |
| Place type | Nominal. The type of coffee shop. One of “Coffee shop”, “Cafe”, “Espresso bar”, and “Others”. Missing values should be replaced with “Unknown”. |
| Rating | Ordinal. Average rating of the store from reviews. On a 5 point scale. Missing values should be replaced with 0. |
| Reviews | Nominal. The number of reviews given to the store. Missing values should be replaced with the overall median number. |
| Price | Ordinal. The price range of products in the store. One of '$', '$$' or '$$$'. Missing values should be replaced with ”Unknown”. |
| Delivery Option | Nominal. If delivery is available. Either True or False. Missing values should be replaced with False. |
| Dine in Option | Nominal. If dine in is available. Either True or False. Missing values should be replaced with False. |
| Takeaway Option | Nominal. If take away is available. Either True or False. Missing values should be replaced with False. |
import pandas as pd
df = pd.read_csv('coffee.csv')
df.columns
df['Region'].fillna('Unknown',inplace=True)
df['Place name'].fillna('Unknown', inplace=True)
df['Place type'].fillna('Unknown', inplace=True)
df['Rating'].fillna(0, inplace=True)
reviews_median = df['Reviews'].median() # Corrected line
df['Reviews'].fillna(reviews_median, inplace=True)
df['Price'].fillna('Unknown', inplace=True)
df['Delivery option'].fillna('False', inplace=True)
df['Dine in option'].fillna('False', inplace=True)
df['Takeout option'].fillna('False', inplace=True)
df.to_csv('clean_data.csv', index=False)reviews_by_rating = df.groupby('Rating')['Reviews'].agg(['median', 'min', 'max']).round(1).reset_index() reviews_by_rating.columns = ['rating', 'med_review', 'min_review', 'max_review'] reviews_by_rating
# Use this cell to write your code for task 2
df.Reviews.describe()
reviews_by_rating = df.groupby('Rating')['Reviews'].agg(['median','min','max']).round(1).reset_index()
reviews_by_rating.columns = ['rating', 'med_review', 'min_review', 'max_review']
reviews_by_rating
# print(df.Reviews.unique())
# reviews_by_rating['med_review']=271.5
# reviews_by_rating['min_review']=3.0
# reviews_by_rating['max_review']=17937.0
# print(reviews_by_rating.head())
# print(df.Rating.unique())
# pd.set_option('display.max_columns',None)
# print(df.head())
Task 3
Fit a baseline model to predict the number of reviews a store will get.
-
Fit your model using the data contained in “train.csv”
-
Use “validation.csv” to predict new values based on your model. You must return a dataframe named
base_result, that includesPlace nameandrating. The rating column must be your predicted values.
import sklearn
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
# Prepare the data for training the model. The x or independent variable is rating and the value we want to predict is Reviews.
X_train = traindf[['Rating']]
y_train = traindf['Reviews']
missing_values = X_train.isnull().sum()
print(missing_values)
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_train.dropna(inplace = True)
y_train = y_train[X_train.index]
# Fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Prepare the validation data
X_valid = validdf[['Rating']]
# Impute missing values in the validation data
X_valid_imputed = imputer.transform(X_valid)
# Predict the number of reviews for the validation data
y_pred = model.predict(X_valid_imputed)
# Create the base_result dataframe
base_result = validdf[['Place.name', 'Rating']].copy()
base_result['Predicted number of reviews'] = y_pred
base_result
Task 4
Fit a comparison model to predict the number of reviews a store will get.
-
Fit your model using the data contained in “train.csv”
-
Use “validation.csv” to predict new values based on your model. You must return a dataframe named
compare_result, that includesPlace nameandrating. The rating column must be your predicted values.
# Prepare the data for training the model
X_train = train_df[['Rating']]
y_train = train_df['Reviews']
# Check for missing values in y_train
missing_values_y = y_train.isnull().sum()
print("Missing values in y_train:", missing_values_y)
# Handle missing values in y_train
y_train.dropna(inplace=True)
X_train = X_train.loc[y_train.index]
# Handle missing values in X_train
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
# Fit the linear regression model
model = LinearRegression()
model.fit(X_train_imputed, y_train)
# Load the validation data
valid_df = pd.read_csv("validation.csv")
# Prepare the validation data
X_valid = valid_df[['Rating']]
# Handle missing values in X_valid
X_valid_imputed = imputer.transform(X_valid)
# Predict the number of reviews for the validation data
y_pred = model.predict(X_valid_imputed)
# Create the compare_result dataframe
compare_result = valid_df[['Place.name', 'Rating']].copy()
compare_result['rating'] = y_pred
# Display the result or return the dataframe as needed
print(compare_result)