Project: SAMPLE EXAM Data Scientist Associate Practical

Sample Exam: Coffee Shops

Java June is a company that owns coffee shops in a number of locations in Europe.

The company knows that stores with more reviews typically get more new customers. This is because new customers consider the number of reviews when picking between two shops.

They want to get more insight into what leads to more reviews.

They are also interested in whether there is a link between the number of reviews and rating.

They want a report to answer these questions.

Task 1

Before you start your analysis, you will need to make sure the data is clean.

The table below shows what the data should look like.

Create a cleaned version of the dataframe.

You should start with the data in the file "coffee.csv".
Your output should be a dataframe named clean_data.
All column names and values should match the table below.

Column Name	Criteria
Region	Nominal. Where the store is located. One of 10 possible regions (A to J). Missing values should be replaced with “Unknown”.
Place name	Nominal. The name of the store. Missing values should be replaced with “Unknown”.
Place type	Nominal. The type of coffee shop. One of “Coffee shop”, “Cafe”, “Espresso bar”, and “Others”. Missing values should be replaced with “Unknown”.
Rating	Ordinal. Average rating of the store from reviews. On a 5 point scale. Missing values should be replaced with 0.
Reviews	Nominal. The number of reviews given to the store. Missing values should be replaced with the overall median number.
Price	Ordinal. The price range of products in the store. One of '$', '$$' or '$$$'. Missing values should be replaced with ”Unknown”.
Delivery Option	Nominal. If delivery is available. Either True or False. Missing values should be replaced with False.
Dine in Option	Nominal. If dine in is available. Either True or False. Missing values should be replaced with False.
Takeaway Option	Nominal. If take away is available. Either True or False. Missing values should be replaced with False.

import pandas as pd
df = pd.read_csv('coffee.csv') 
df.columns
df['Region'].fillna('Unknown',inplace=True)
df['Place name'].fillna('Unknown', inplace=True)
df['Place type'].fillna('Unknown', inplace=True)
df['Rating'].fillna(0, inplace=True)
reviews_median = df['Reviews'].median()  # Corrected line
df['Reviews'].fillna(reviews_median, inplace=True)
df['Price'].fillna('Unknown', inplace=True)
df['Delivery option'].fillna('False', inplace=True)
df['Dine in option'].fillna('False', inplace=True)
df['Takeout option'].fillna('False', inplace=True)
df.to_csv('clean_data.csv', index=False)

reviews_by_rating = df.groupby('Rating')['Reviews'].agg(['median', 'min', 'max']).round(1).reset_index() reviews_by_rating.columns = ['rating', 'med_review', 'min_review', 'max_review'] reviews_by_rating

# Use this cell to write your code for task 2
df.Reviews.describe()
reviews_by_rating = df.groupby('Rating')['Reviews'].agg(['median','min','max']).round(1).reset_index()
reviews_by_rating.columns = ['rating', 'med_review', 'min_review', 'max_review']
reviews_by_rating
# print(df.Reviews.unique())
# reviews_by_rating['med_review']=271.5
# reviews_by_rating['min_review']=3.0
# reviews_by_rating['max_review']=17937.0
# print(reviews_by_rating.head())
# print(df.Rating.unique())
# pd.set_option('display.max_columns',None)
# print(df.head())

Task 3

Fit a baseline model to predict the number of reviews a store will get.

Fit your model using the data contained in “train.csv”
Use “validation.csv” to predict new values based on your model. You must return a dataframe named base_result, that includes Place name and rating. The rating column must be your predicted values.

import sklearn
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# Prepare the data for training the model. The x or independent variable is rating and the value we want to predict is Reviews. 
X_train = traindf[['Rating']]
y_train = traindf['Reviews']
missing_values = X_train.isnull().sum()
print(missing_values)

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_train.dropna(inplace = True)
y_train = y_train[X_train.index]


# Fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Prepare the validation data
X_valid = validdf[['Rating']]

# Impute missing values in the validation data 
X_valid_imputed = imputer.transform(X_valid) 

# Predict the number of reviews for the validation data 
y_pred = model.predict(X_valid_imputed)

# Create the base_result dataframe
base_result = validdf[['Place.name', 'Rating']].copy()
base_result['Predicted number of reviews'] = y_pred
base_result

Task 4

Fit a comparison model to predict the number of reviews a store will get.

Fit your model using the data contained in “train.csv”
Use “validation.csv” to predict new values based on your model. You must return a dataframe named compare_result, that includes Place name and rating. The rating column must be your predicted values.

# Prepare the data for training the model
X_train = train_df[['Rating']]
y_train = train_df['Reviews']

# Check for missing values in y_train
missing_values_y = y_train.isnull().sum()
print("Missing values in y_train:", missing_values_y)

# Handle missing values in y_train
y_train.dropna(inplace=True)
X_train = X_train.loc[y_train.index]

# Handle missing values in X_train
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)

# Fit the linear regression model
model = LinearRegression()
model.fit(X_train_imputed, y_train)

# Load the validation data
valid_df = pd.read_csv("validation.csv")

# Prepare the validation data
X_valid = valid_df[['Rating']]

# Handle missing values in X_valid
X_valid_imputed = imputer.transform(X_valid)

# Predict the number of reviews for the validation data
y_pred = model.predict(X_valid_imputed)

# Create the compare_result dataframe
compare_result = valid_df[['Place.name', 'Rating']].copy()
compare_result['rating'] = y_pred

# Display the result or return the dataframe as needed
print(compare_result)

Project: SAMPLE EXAM Data Scientist Associate Practical

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Sample Exam: Coffee Shops

Task 1

Task 3

Task 4

Sample Exam: Coffee Shops