Data Analyst: Michele Bedin (www.michelebedin.com)
- Step 1 - Project proposal
- Step 2 - Understand the data
- Step 3 - EDA
- Step 4 - Statistical tests
- Step 5 - Regression modelling
- Step 6 - Machine learning models (current)
- Step 7 - Work delivery
Introduction
You are a data professional at a data analysis company called Automatidata. Their client, the New York City Taxi & Limousine Commission (TLC), was impressed by your work and has asked you to create a machine learning model to predict whether a customer will not tip. The TLC wants to use the model in an app that alerts taxi drivers to customers who are unlikely to tip, since drivers depend on tips.
Step 6: Build a machine learning model
In this activity you will practise using tree-based modelling techniques to make predictions about a binary target class.
The purpose of this model is to find ways to generate more revenue for taxi drivers.
The goal of this model is to predict whether or not a customer is a generous tipper.
This activity is divided into three parts:
Task 1: Ethical Considerations
- Consider the ethical implications of the request
- Should the objective of the model be changed?
Task 2: Feature Engineering
- Perform feature selection, extraction and transformation to prepare the data for modelling.
Task 3: Modelling
- Build models, evaluate them and recommend next steps.
PACE framework
This notebook refers to the PACE problem-solving framework. The notebook's components are labelled with the corresponding PACE stage: Plan, Analyse, Construct and Execute.
PACE: Plan
At this stage, consider the following questions:
What are you asked to do?
Predict whether a customer will not leave a tip.
What are the ethical implications of the model? What are the consequences of your model making errors? What is the likely effect of a false negative (i.e. the model says a customer will tip, but the customer actually does not)? What is the likely effect of a false positive (i.e. the model says a customer will not tip, but the customer actually does)?
False negative errors could erode drivers' trust in the app. False positives could restrict access to the taxi service for customers wrongly labelled as bad tippers. Both scenarios could damage the company's reputation and raise ethical questions about fairness and discrimination.
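To make the two error types concrete, here is a minimal sketch in plain Python that counts them for a classifier whose positive class is "will not tip". The function name `confusion_counts` and the label lists are hypothetical and chosen only for illustration; they are not taken from the TLC data.

```python
def confusion_counts(y_true, y_pred):
    """Count outcomes where 1 = 'will not tip' (positive class), 0 = 'will tip'."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            counts["tp"] += 1  # correctly flagged a non-tipper
        elif predicted == 1 and actual == 0:
            counts["fp"] += 1  # flagged a customer who actually tipped
        elif predicted == 0 and actual == 1:
            counts["fn"] += 1  # missed a customer who did not tip
        else:
            counts["tn"] += 1  # correctly predicted a tipper
    return counts


# Hypothetical labels, made up for this example
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)
```

In this framing, each false positive is a customer who might be refused a ride despite tipping, and each false negative is a driver who was told to expect a tip and did not receive one.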
Do the benefits of this model outweigh the potential problems?
The risks of discrimination and loss of trust seem to outweigh the potential benefits of extra earnings for drivers.
Would you proceed with the request to create this model? Why or why not?
No. Limiting equal access to taxis is ethically questionable and carries high risks.
Could the objective be modified to make it less problematic?
We could develop a model that identifies the most generous customers, to help drivers increase revenue without unfairly excluding certain categories of people.
Suppose we modify the modelling objective so that, instead of predicting people who will not leave a tip, we predict people who are particularly generous, that is, those who will tip 20% or more. Consider the following questions:
What features do you need to make this prediction?
Ideally, we would need the behavioural history of each customer to know the tips left in the past. I would also include times, dates and places of departure and arrival, estimated fares and method of payment.
What would be the target variable?
The target variable would be binary (1 or 0), indicating whether the customer is expected to leave a tip ≥ 20%.
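As a rough illustration of how such a target could be derived, the sketch below computes a tip percentage and a binary flag for one trip. It assumes each trip record has a tip amount and a total amount that includes the tip, and that the percentage is measured against the pre-tip total. The names `tip_percent`, `is_generous`, `tip_amount` and `total_amount` are assumptions for this example, not confirmed column names.

```python
def tip_percent(tip_amount, total_amount):
    """Tip as a fraction of the pre-tip total (assumed definition)."""
    pre_tip_total = total_amount - tip_amount
    if pre_tip_total <= 0:
        return 0.0  # guard against zero or malformed totals
    return tip_amount / pre_tip_total


def is_generous(tip_amount, total_amount, threshold=0.20):
    """Return 1 if the tip fraction meets the threshold, else 0."""
    return int(tip_percent(tip_amount, total_amount) >= threshold)


# Hypothetical trips: (tip_amount, total_amount including the tip)
trips = [(3.00, 13.00), (1.00, 11.00), (0.00, 8.50)]
generous = [is_generous(tip, total) for tip, total in trips]
print(generous)
```

Whether the percentage should be taken against the fare alone or the full pre-tip total is a modelling decision that would need to be checked against the data dictionary.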
What metrics should you use to evaluate your model? Do you have enough information to decide now?
This is a supervised learning classification task. We could use accuracy, precision, recall, F-score or area under the ROC curve. However, we currently do not have enough information to decide which metric is most appropriate; we need to know the class balance of the target variable.
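Once labels are available, the class balance and the candidate metrics can be checked along the following lines. This is a plain-Python sketch with made-up labels; the helper names `class_balance` and `precision_recall_f1` are hypothetical, and a library such as scikit-learn would normally be used in the notebook itself.

```python
from collections import Counter


def class_balance(labels):
    """Share of each class in the target variable."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (1 = generous tipper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels, made up for this example
y_true = [1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 1, 0, 1, 1]

balance = class_balance(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(balance, precision, recall, f1)
```

If one class dominates, accuracy can look high for a model that always predicts the majority class, which is why the balance should be inspected before settling on a metric.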