1. Introduction to the Notebook
This workbook was created in preparation for the Associate DS practical exam. I'm trying things out and exploring the DataCamp Workspace using the available practice case study about coffee shops.
1.1 Introduction to the practical exam, company background
Java June is a coffee franchise looking to expand its business to a new market. Their strategy for rapid and sustainable growth is to get the highest possible number of reviews in the year after a new coffee shop opens. Based on their data, all coffee franchises in other markets can get over 450 reviews on average after one year.
1.2 Customer Question
The practical exam poses a question from the perspective of Java June's marketing manager:
"Can you predict whether a newly opened coffee shop can get over 450 reviews based on its characteristics?"
1.2.1 First thoughts on Customer Question
Our audience, Java June's marketing manager, is asking whether a newly opened coffee shop can or cannot get more than 450 reviews. This looks like a binary classification task: the newly opened coffee shop either can or cannot get more than 450 reviews (yes or no). Another option might be a regression task, predicting the number of reviews for newly opened coffee shops with a particular confidence interval. Let's have a look at the data and see which route we take.
1.3 Dataset
The dataset contains information about coffee shops after 1 year of opening in this new market. The data is available in a DataCamp Workspace, which you can find from the certification dashboard.
The dataset needs to be validated based on the description below:
| Column Name | Criteria |
|---|---|
| Region | Character, one of 10 possible regions (A to J) where coffee shop is located. |
| Place name | Character, name of the shop. |
| Place type | Character, the type of coffee shop, one of “Coffee shop”, “Cafe”, “Espresso bar”, and “Others”. |
| Rating | Numeric, coffee shop rating (on a 5 point scale). Remove the rows if the rating is missing. |
| Enough Reviews | Binary, whether the number of reviews is over 450 or not, either True or False. |
| Price | Character, price category, one of 3 categories. |
| Delivery option | Binary, describing whether there is a delivery option, either True or False. |
| Dine in option | Binary, describing whether there is a dine-in option, either True or False. Replace missing values with False. |
| Takeout option | Binary, describing whether there is a takeout option, either True or False. Replace missing values with False. |
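The categorical criteria in the table above can be checked programmatically. The following is a minimal sketch, not part of the exam material: the helper `find_invalid_values` and the sample rows are hypothetical, and only the two columns whose allowed values are spelled out in the data dictionary are covered.

```python
import pandas as pd

# Allowed values taken from the data dictionary above.
ALLOWED_VALUES = {
    "Region": set("ABCDEFGHIJ"),
    "Place type": {"Coffee shop", "Cafe", "Espresso bar", "Others"},
}


def find_invalid_values(frame: pd.DataFrame, allowed: dict) -> dict:
    """Return, per column, the non-missing values that fall outside the allowed set."""
    invalid = {}
    for column, valid in allowed.items():
        observed = set(frame[column].dropna().unique())
        invalid[column] = observed - valid
    return invalid


# Hypothetical rows, used only to illustrate the check.
sample = pd.DataFrame({
    "Region": ["A", "C", "K"],
    "Place type": ["Cafe", "Espresso bar", "Bakery"],
})

print(find_invalid_values(sample, ALLOWED_VALUES))
```

An empty set for every column would mean the column satisfies its criterion; missing values are ignored here and need a separate check.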
1.4 Thoughts on solution
We find several characteristics from which information could be gained: Region, Place type, Rating, Price, Delivery option, Dine in option, and Takeout option. Each row is an instance, identified by a Place name. The column we will use as our target variable is 'Enough Reviews', which is a binary field.
Given the data type of our target variable, a classification task will be the way forward.
1.5 Submission requirements
The following requirements must be met.
- You are going to create a written report summarizing your findings. Use the project task list provided below for guidance in the tasks you should complete and information to include in the report.
- You will need to use DataCamp Workspace to complete your analysis, write up your findings and share visualizations.
- You must use the data we provide for the analysis.
- Use the grading rubric provided below to check your work before submitting the report.
1.6 Project tasks
The following tasks must be completed for this project.
1.6.1 Data validation
- Check the data against the criteria in the data dictionary.
- For each column in the data, describe the validation tasks you complete and what you found. Have you made any changes to the data to enable further analysis?
1.6.2 Exploratory Analysis
- Explore the characteristics of the numerical and categorical variables.
- Create at least two different data visualizations that include only a single variable.
- Create at least one data visualization that includes two or more variables.
- Describe what you found in the exploratory analysis. Have you made any changes to those variables to enable model fitting?
1.6.3 Model fitting
- Describe what category of machine learning models are suitable to address the problem (e.g. regression, classification, clustering).
- Choose and fit a baseline model.
- Choose and fit a comparison model.
- Explain the reason for choosing the two models above.
1.6.4 Model evaluation
- Evaluate the performance of two models by appropriate metrics.
- Compare the evaluation results between two models and describe what that means for addressing the business problem.
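The model fitting and evaluation tasks could be sketched roughly as follows. This is an illustration under assumptions, not the exam's prescribed approach: the data is synthetic, and the choice of `DummyClassifier` as baseline and `LogisticRegression` as comparison model is one possible pairing among many.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two numeric features and a binary target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Stratify so both splits keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = {
    "baseline (most frequent class)": DummyClassifier(strategy="most_frequent"),
    "comparison (logistic regression)": LogisticRegression(),
}

# Fit each model and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }
    print(name, scores[name])
```

Scoring both models with the same metrics on the same test split is what makes the comparison meaningful; a comparison model that cannot beat the most-frequent-class baseline adds little for the business question.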
1.7 Before we get started
There are grading criteria and an example solution available as well, but I won't drop them in here. I'll refer to them when I need them instead.
2. Importing the data and libraries/packages
The practice practical exam provides a .csv file, 'coffeeDSA.csv'. I've uploaded the file to this workspace.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression
# reading the csv to a Pandas dataframe
df = pd.read_csv('coffeeDSA.csv')
# print top 10 rows to see if we were successful
print(df.head(10))
The csv file is now successfully read into a Pandas dataframe. Let's get some info on the data and compare it to the criteria described in section 1.3 of this notebook.
3. Data validation
df.info()
Character fields
- Region
- Place name
- Place type
- Price
The above fields are supposed to be of Dtype character/object. However, the info above shows that the fields 'Dine in option' and 'Takeout option' are of Dtype object as well. These two fields are supposed to be of Dtype bool/Binary as per the description.
[Action: Transform the 'Dine in option', 'Takeout option' into boolean variables]
Numeric fields
- Rating
The Rating variable is of Dtype float64, which is exactly what it needs to be according to the set criteria.
Binary fields
- Enough Reviews
- Delivery option
- Dine in option
- Takeout option
These fields are supposed to be of Dtype Boolean. However, only the fields 'Enough Reviews' and 'Delivery option' are indeed of Dtype Boolean. We already have an action defined for the fields 'Dine in option' and 'Takeout option', which are not yet of Dtype Boolean.
Further Criteria
[Action: Inspect for missing values and replace or remove as per the requirements]
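One way to carry out this inspection is a per-column count of missing values. The sketch below runs on a few hypothetical rows rather than the exam data; the name `missing_summary` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the kinds of gaps described in the criteria.
sample = pd.DataFrame({
    "Rating": [4.6, np.nan, 4.1, 4.8],
    "Dine in option": [True, np.nan, np.nan, True],
    "Takeout option": [True, True, np.nan, True],
})

# Count and share of missing values for every column at once.
missing_summary = pd.DataFrame({
    "missing": sample.isna().sum(),
    "share": sample.isna().mean().round(2),
})
print(missing_summary)
```

Seeing all columns side by side makes it easier to decide which ones need the "remove" rule and which need the "replace with False" rule from the data dictionary.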
3.1 Search for missing values and act according to the defined criteria
# Count missing values in 'Dine in option'
num_missing_dine_in = df['Dine in option'].isnull().sum()
# Count missing values in 'Takeout option'
num_missing_takeout = df['Takeout option'].isnull().sum()
print(num_missing_dine_in, num_missing_takeout)
# Replace missing values with False, as per the criteria
df['Dine in option'] = df['Dine in option'].fillna(False)
df['Takeout option'] = df['Takeout option'].fillna(False)
For the variables 'Dine in option' and 'Takeout option' we find 60 and 56 missing values, respectively. Since these are replaced with False, we expect at least that many instances where dine-in and takeout, respectively, are not an option.
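The order of operations matters here: missing values should be filled before any conversion to bool, because NaN is truthy in Python. A small sketch on a hypothetical column illustrates the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical column: True where the option exists, NaN where nothing was recorded.
option = pd.Series([True, np.nan, True, np.nan], dtype=object)

# Converting first turns every NaN into True, because bool(float("nan")) is True.
converted_first = option.astype(bool)

# Filling first gives the result the data dictionary asks for.
filled_first = option.fillna(False).astype(bool)

print(converted_first.tolist())  # [True, True, True, True]
print(filled_first.tolist())     # [True, False, True, False]
```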
# Check for how many instances the rating is missing
missing_ratings = df['Rating'].isnull().sum()
print(missing_ratings)
Only 2 ratings are missing out of 200 rows. The criteria require removing these rows, and doing so will not have much impact on the dataset.
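If only rows with a missing rating should be removed, the `subset` argument of `dropna` makes that intent explicit. The sketch below uses hypothetical rows (including made-up Price values) to show how the two calls differ when another column still contains a gap.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one missing Rating, one missing value in another column.
sample = pd.DataFrame({
    "Place name": ["Shop 1", "Shop 2", "Shop 3"],
    "Rating": [4.5, np.nan, 4.9],
    "Price": ["$$", "$$", np.nan],
})

# Drops only the row whose Rating is missing.
rating_only = sample.dropna(subset=["Rating"])

# Drops every row that has a missing value in any column.
any_column = sample.dropna()

print(len(rating_only), len(any_column))  # 2 1
```

When every other column has already been cleaned, both calls give the same result; the explicit subset simply guards against dropping more rows than intended.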
# Drop instances with missing values. We don't need to specify a subset, since all other missing values were dealt with above.
df.dropna(inplace=True)
3.2 Transform Dtypes as per our findings
# As per the first action defined above, we transform the two object columns to Boolean as per the criteria.
# If more fields had to be transformed, we could define a list 'columns_to_convert' of column names in our df
# and loop over it, something like: for col in columns_to_convert: df[col] = df[col].astype(bool)
df['Dine in option'] = df['Dine in option'].astype(bool)
df['Takeout option'] = df['Takeout option'].astype(bool)
# Check the resulting dtypes, the for loop way.
columns_to_test = ['Dine in option', 'Takeout option']
for col in columns_to_test:
    print(f"{col}: {df[col].dtype}")
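As an alternative to converting column by column, `DataFrame.astype` accepts a dictionary that maps column names to dtypes. The sketch below assumes a hypothetical frame whose object columns hold only True/False values, as they would after missing values have been filled.

```python
import pandas as pd

# Hypothetical frame with object columns that hold only True/False after filling.
sample = pd.DataFrame({
    "Dine in option": pd.Series([True, False, True], dtype=object),
    "Takeout option": pd.Series([False, False, True], dtype=object),
    "Rating": [4.5, 4.7, 4.9],
})

columns_to_convert = ["Dine in option", "Takeout option"]

# Convert only the listed columns; all other columns keep their dtype.
converted = sample.astype({col: bool for col in columns_to_convert})
print(converted.dtypes)
```

This returns a new dataframe rather than modifying the original, so the result has to be assigned back if the conversion should persist.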