Clipboard Health - Pricing Case Study
You’re launching a ride-hailing service that matches riders with drivers for trips between the Toledo Airport and Downtown Toledo. It’ll be active for only 12 months. You’ve been forced to charge riders $30 for each ride. You can pay drivers what you choose for each individual ride.
The supply pool (“drivers”) is very deep. When a ride is requested, a very large pool of drivers see a notification informing them of the request. They can choose whether or not to accept it. Based on a similar ride-hailing service in the same market, you have some data on which ride requests were accepted and which were not. (The PAY column is what drivers were offered and the ACCEPTED column reflects whether any driver accepted the ride request.)
The demand pool (“riders”) can be acquired at a cost of $30 per rider at any time during the 12 months. There are 10,000 riders in Toledo, but you can’t acquire more than 1,000 in a given month. You start with 0 riders. “Acquisition” means that the rider has downloaded the app and may request rides. Requested rides may or may not be accepted by a driver. In the first month that riders are active, they request rides based on a Poisson distribution where lambda = 1. For each subsequent month, riders request rides based on a Poisson distribution where lambda is the number of rides that they found a match for in the previous month. (As an example, a rider that requests 3 rides in month 1 and finds 2 matches has a lambda of 2 going into month 2.) If a rider finds no matches in a month (which may happen either because they request no rides in the first place based on the Poisson distribution or because they request rides and find no matches), they leave the service and never return.
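As an illustrative sketch of these dynamics, the following minimal simulation runs one rider's lifecycle under the rules above; the flat acceptance probability accept_prob is an assumption made purely for illustration, since in the real problem it depends on the pay offered to drivers.
import numpy as np
rng = np.random.default_rng(42)
def simulate_rider_lifecycle(accept_prob: float, months: int = 12) -> int:
    """Simulates one rider per the prompt; returns their total matched rides.
    Assumes every requested ride is matched independently with probability
    `accept_prob` -- an illustrative simplification.
    """
    lam = 1.0  # first active month: Poisson(lambda=1)
    total_matches = 0
    for _ in range(months):
        requests = rng.poisson(lam)
        matches = rng.binomial(requests, accept_prob)  # 0 when requests == 0
        total_matches += matches
        if matches == 0:  # no matches this month -> the rider leaves for good
            break
        lam = matches  # next month's lambda = this month's matches
    return total_matches
# Average lifetime matches per rider at, e.g., a 90% acceptance rate
print(np.mean([simulate_rider_lifecycle(0.9) for _ in range(100_000)]))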
Submit a written document that proposes a pricing strategy to maximize the profit of the business over the 12 months. You should expect that this singular document will serve as a proposal for
- A quantitative executive team that wants to know how you’re thinking about the problem and what assumptions you’re making but that does not know probability theory
- Your data science peers so they can push on your thinking
Please submit any work you do, code or math, with your solution.
1. Strategy introduction
Since customer acquisition cost and the $30 ride price are both fixed, and rider behavior cannot be controlled, the only lever available is driver pay; the proposed strategy is therefore built around customer retention.
The pricing algorithm operates under the assumption that a driver should be given an extra incentive if and when, by rejecting a ride, they would cause a customer to drop out immediately or trigger a negative ripple effect over the following months.
An example of such an effect is a customer's lambda (the expected number of rides they will request the following month) falling below a useful threshold, which in turn damages that customer's relationship with the company for the remaining months of the simulation.
On the other hand, excessive incentives erode much of the profit, even when applied aggressively only in the first months.
If a balance can be found between these two forces, the strategy becomes viable.
Such a balance does exist: the strategy proposed here yields a 111.00% increase in net profit over the baseline.
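To frame the balance quantitatively: if A(p) denotes the probability that drivers accept a ride offered at pay p (to be estimated from the historical data), each matched ride earns $30 − p, so a purely myopic policy would maximize the expected margin per request, A(p) · (30 − p). The retention dynamics described above add future value to every match, which pushes the profit-maximizing pay above the myopic optimum; the strategy below searches for that balance empirically.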
2. Imports
# Base modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Extra modules
import scipy.stats as stats
from collections import Counter
import time
# Type hinting
from typing import Optional, Callable, Union, List
# ML
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, classification_report
3. Setup
# Reproducibility
seed = 42
np.random.seed(seed)
# Plots
plt.rcParams['figure.figsize'] = (10, 5)
plt.rcParams['figure.dpi'] = 150
plt.rcParams['figure.autolayout'] = True
plot_suptitle_font_size = 14
plot_title_fontsize = 12
plot_labels_fontsize = 12
plot_legend_ticklabels_font_size = 9
sns.set()
4. Custom functions
def subset_no_yes(df: pd.DataFrame) -> tuple:
"""Subsets source DataFrame according to outcome (no / yes).
Args:
df: Pandas DataFrame containing 'PAY' column and 'ACCEPTED' column
Returns:
        Tuple: [pay for negative outcomes, pay for positive outcomes]
"""
pay_for_negative_outcomes = df.loc[df['ACCEPTED'] == 0, 'PAY']
pay_for_positive_outcomes = df.loc[df['ACCEPTED'] == 1, 'PAY']
return pay_for_negative_outcomes, pay_for_positive_outcomes
def plot_histograms(pay_no: pd.Series, pay_yes: pd.Series) -> None:
"""Calculates bin size and plots histograms for two Pandas Series.
Args:
pay_no: Pay data for negative outcomes
pay_yes: Pay data for positive outcomes
"""
# Binning
bins_pay_no = int(np.sqrt(len(pay_no)))
bins_pay_yes = int(np.sqrt(len(pay_yes)))
# Histograms
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(6, 4))
ax.hist(pay_no, bins=bins_pay_no, label='No', color='C0', histtype='stepfilled', alpha=0.5, density=False)
ax.hist(pay_yes, bins=bins_pay_yes, label='Yes', color='C1', histtype='stepfilled', alpha=0.5, density=False)
plt.suptitle('Ride Acceptance', fontsize=14, y=0.95)
ax.set_title('Histograms', fontsize=plot_title_fontsize)
ax.set_xlabel('Pay $', fontsize=plot_labels_fontsize)
ax.set_ylabel('Number of rides', fontsize=plot_labels_fontsize)
ax.tick_params(labelsize=plot_legend_ticklabels_font_size)
ax.legend(fontsize=plot_legend_ticklabels_font_size)
plt.show()
return None
def find_current_intersection(pay_no: pd.Series, pay_yes: pd.Series, report: bool = False) -> float:
"""Finds intersections between two Gaussian functions.
    The result is restricted to the interval between the lowest minimum and the highest maximum of the two Series.
Args:
pay_no: Pay data for negative outcomes
pay_yes: Pay data for positive outcomes
report: Whether to print the intersection value
Notes:
        This function has been adapted from the code found at the following link:
https://stackoverflow.com/questions/22579434/python-finding-the-intersection-point-of-two-gaussian-curves
Returns:
Float: intersection value in dollars
"""
pay_no_mean = pay_no.mean()
pay_yes_mean = pay_yes.mean()
pay_no_std = pay_no.std()
pay_yes_std = pay_yes.std()
minimum = min(pay_no.min(), pay_yes.min())
maximum = max(pay_no.max(), pay_yes.max())
coef_1 = 1 / (2 * pay_no_std ** 2) - 1 / (2 * pay_yes_std ** 2)
coef_2 = pay_yes_mean / (pay_yes_std ** 2) - pay_no_mean / (pay_no_std ** 2)
coef_3 = pay_no_mean ** 2 / (2 * pay_no_std ** 2) - pay_yes_mean ** 2 / \
(2 * pay_yes_std ** 2) - np.log(pay_yes_std / pay_no_std)
intersections = [np.around(root, 2) for root in np.roots([coef_1, coef_2, coef_3]) if minimum <= root <= maximum]
    if len(intersections) != 1:
        raise ValueError(f"Expected exactly one intersection in the data range, found {len(intersections)}")
output = intersections[0]
if report:
print(f"Current intersection: $ {output}\n")
return output
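For reference, the coefficients above come from setting the two normal densities equal and taking logarithms, which yields a quadratic in the pay x (subscript 0 for 'No', 1 for 'Yes'):
(1/(2σ₀²) − 1/(2σ₁²))·x² + (μ₁/σ₁² − μ₀/σ₀²)·x + (μ₀²/(2σ₀²) − μ₁²/(2σ₁²) − ln(σ₁/σ₀)) = 0
np.roots solves this quadratic, and the root that falls inside the data range is the reported intersection.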
def plot_pdf_cdf(pay_no: pd.Series, pay_yes: pd.Series, intersection: float) -> None:
"""Calculates normal distribution from data, then plots KDE & Normal PDF, and ECDF & Normal CDF
Args:
pay_no: Pay data for negative outcomes
pay_yes: Pay data for positive outcomes
intersection: Current point of intersection between the two distributions
"""
# Generate normal distributions based on data
normal_dist_from_pay_no = stats.norm(loc=pay_no.mean(), scale=pay_no.std())
normal_dist_from_pay_yes = stats.norm(loc=pay_yes.mean(), scale=pay_yes.std())
# Calculate x & y for Empirical CDF
pay_no_ecdf_x, pay_no_ecdf_y = ecdf(pay_no)
pay_yes_ecdf_x, pay_yes_ecdf_y = ecdf(pay_yes)
# KDE Plots & Normal PDF, ECDF & Normal CDF
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12.8, 4.8))
sns.lineplot(ax=axes[0], x=pay_no, y=normal_dist_from_pay_no.pdf(pay_no), linestyle='--', label='No (normal)',
color='C0', linewidth=1)
sns.lineplot(ax=axes[0], x=pay_yes, y=normal_dist_from_pay_yes.pdf(pay_yes), linestyle='--', label='Yes (normal)',
color='C1', linewidth=1)
sns.kdeplot(pay_no, ax=axes[0], fill=True, label='No', color='C0')
sns.kdeplot(pay_yes, ax=axes[0], fill=True, label='Yes', color='C1')
axes[0].vlines(x=intersection, ymin=0, ymax=normal_dist_from_pay_no.pdf(intersection),
linestyles=':', colors='black', label=f"Int.: $ {intersection}", linewidth=1)
sns.lineplot(ax=axes[1], x=pay_no, y=normal_dist_from_pay_no.cdf(pay_no), label='No (normal)', color='black',
linewidth=1, linestyle='--')
sns.lineplot(ax=axes[1], x=pay_yes, y=normal_dist_from_pay_yes.cdf(pay_yes), label='Yes (normal)', color='black',
linewidth=1, linestyle='-')
sns.lineplot(ax=axes[1], x=pay_no_ecdf_x, y=pay_no_ecdf_y, linestyle='', marker='.', label='No', color='C0',
markeredgewidth=0, markersize=8, alpha=0.5)
sns.lineplot(ax=axes[1], x=pay_yes_ecdf_x, y=pay_yes_ecdf_y, linestyle='', marker='.', label='Yes', color='C1',
markeredgewidth=0, markersize=8, alpha=0.5)
axes[1].vlines(x=intersection, ymin=0, ymax=1, linestyles=':', colors='black', label=f"Int.: $ {intersection}", linewidth=1)
plt.suptitle('Ride Acceptance', fontsize=14, y=0.95)
axes[0].set_title('KDE Plots & Normal PDF', fontsize=plot_title_fontsize)
axes[0].set_xlabel('Pay $', fontsize=plot_labels_fontsize)
axes[0].set_ylabel('Density', fontsize=plot_labels_fontsize)
axes[0].tick_params(labelsize=plot_legend_ticklabels_font_size)
axes[0].legend(fontsize=plot_legend_ticklabels_font_size)
axes[1].set_title('ECDF & Normal CDF', fontsize=plot_title_fontsize)
axes[1].set_xlabel('Pay $', fontsize=plot_labels_fontsize)
axes[1].set_ylabel('Fraction of data', fontsize=plot_labels_fontsize)
axes[1].tick_params(labelsize=plot_legend_ticklabels_font_size)
axes[1].legend(fontsize=plot_legend_ticklabels_font_size)
plt.show()
return None
def statistical_overview(series: pd.Series, label: str = 'Series', of: float = 1.5, evf: float = 3.0,
summary: bool = False, full_report: bool = False) -> tuple:
"""Calculates outlier fences and extreme value fences for a numeric Pandas Series.
Optionally displays a comprehensive statistical description of the data.
Args:
series: A Pandas Series object
label: A label for the Series
of: Outlier Factor (for inner fences)
evf: Extreme Value Factor (for outer fences)
summary: Whether to display only outliers count and extreme values count
full_report: Whether to display the full statistical description
Returns:
Tuple: outliers count, extreme values count,
[lower outer fence, lower inner fence, upper inner fence, upper outer fence]
"""
total_values_inc_nan = series.size
total_values_exc_nan = series.count()
q1 = np.around(series.quantile(0.25), 2)
q3 = np.around(series.quantile(0.75), 2)
    iqr = np.around(q3 - q1, 2)
lower_outer_fence = np.around(q1 - evf * iqr, 2)
lower_inner_fence = np.around(q1 - of * iqr, 2)
upper_inner_fence = np.around(q3 + of * iqr, 2)
upper_outer_fence = np.around(q3 + evf * iqr, 2)
outliers_count = series[((lower_outer_fence < series) & (series <= lower_inner_fence)) |
((upper_inner_fence < series) & (series <= upper_outer_fence))].count()
non_outliers_count = total_values_inc_nan - outliers_count
extreme_values_count = series[(series < lower_outer_fence) | (series > upper_outer_fence)].count()
non_extreme_values_count = total_values_inc_nan - extreme_values_count
if full_report:
print(f"SERIES: {label}\n")
print(f"Size: {total_values_inc_nan}")
print(f"Count: {total_values_exc_nan}")
print(f"NaN: {total_values_inc_nan - total_values_exc_nan}")
print(f"Min: {np.around(np.min(series), 2)}")
print(f"Max: {np.around(np.max(series), 2)}")
print(f"Mean: {np.around(np.nanmean(series), 2)}")
print(f"Std: {np.around(np.nanstd(series), 2)}")
print(f"Median: {np.around(np.nanmedian(series), 2)}")
print(f"Q1: {q1}")
print(f"Q3: {q3}")
print(f"IQR: {iqr}\n")
print(f"Outlier Factor: {of}")
print(f"Extreme Value Factor: {evf}\n")
print(f"Lower outer fence: {lower_outer_fence}")
print(f"Lower inner fence: {lower_inner_fence}")
print(f"Upper inner fence: {upper_inner_fence}")
print(f"Upper outer fence: {upper_outer_fence}\n")
print(f"Outliers: {outliers_count}")
print(f"Non-outliers: {non_outliers_count}")
print(f"Extreme values: {extreme_values_count}")
print(f"Non-extreme values: {non_extreme_values_count}\n")
print(f"Unbiased skew: {np.around(series.skew())}\n")
if summary:
print(f"SERIES: {label}")
print(f"Outliers: {outliers_count}")
print(f"Extreme values: {extreme_values_count}\n")
return outliers_count, extreme_values_count, [lower_outer_fence, lower_inner_fence, upper_inner_fence, upper_outer_fence]
def remove_outliers_and_extreme_values(df: pd.DataFrame, fences_no: list, fences_yes: list, report: bool = False) -> pd.DataFrame:
"""Removes outliers and extreme values from a Pandas DataFrame.
Args:
df: Source DataFrame
        fences_no: Fence values for the negative-outcome data
        fences_yes: Fence values for the positive-outcome data
report: Whether to print confirmation that DataFrame has been cleaned
Returns:
Pandas DataFrame: clean DataFrame, without outliers or extreme values
"""
mask = ((df['ACCEPTED'] == 0) & (df['PAY'] > fences_no[1]) & (df['PAY'] < fences_no[2])) | \
((df['ACCEPTED'] == 1) & (df['PAY'] > fences_yes[1]) & (df['PAY'] < fences_yes[2]))
df = df.loc[mask]
if report:
print(f"DataFrame has been cleaned: {np.invert(mask).sum()} values removed\n")
return df
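A minimal usage sketch tying the two functions together (it assumes df, pay_no and pay_yes as created in Section 5 below):
# Compute the fences per outcome, then drop outliers and extreme values
_, _, fences_no = statistical_overview(pay_no, label='PAY | ACCEPTED = 0', summary=True)
_, _, fences_yes = statistical_overview(pay_yes, label='PAY | ACCEPTED = 1', summary=True)
df_clean = remove_outliers_and_extreme_values(df, fences_no, fences_yes, report=True)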
def ecdf(data: np.ndarray):
"""Compute ECDF for a one-dimensional array of values."""
n = len(data)
x = np.sort(data)
y = np.arange(1, n + 1) / n
return x, y
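For example, ecdf(np.array([3.0, 1.0, 2.0])) returns (approximately) ([1., 2., 3.], [0.33, 0.67, 1.0]): each sorted value is paired with the fraction of observations at or below it.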
def generate_train_val_test(df: pd.DataFrame, test_fraction: float = 0.3, seed: int = 42) -> tuple:
"""Generates train & validation subset, and test subset.
Args:
df: DataFrame to split
test_fraction: Fraction of the DataFrame to randomly sample, for the test set
seed: Controls reproducibility
Returns:
Tuple of NumPy arrays: X_train_val, X_test, y_train_val, y_test
"""
    np.random.seed(seed)
df_test = df.sample(frac=test_fraction, random_state=seed, axis=0)
X_test = df_test['PAY'].to_numpy().reshape(-1, 1)
y_test = df_test['ACCEPTED'].to_numpy()
df_train_val = df[~df.index.isin(df_test.index)]
X_train_val = df_train_val['PAY'].to_numpy().reshape(-1, 1)
y_train_val = df_train_val['ACCEPTED'].to_numpy()
return X_train_val, X_test, y_train_val, y_test
def tune_train_test_logreg_svm(x_train_val: np.ndarray, x_test: np.ndarray, y_train_val: np.ndarray, y_test: np.ndarray,
cv: int = 10, seed: int = 42, full_reports: bool = False, plots: bool = False) -> list:
"""Tunes and evaluates a LogisticRegression model and a SupportVectorClassifier model.
Args:
x_train_val: feature - combined train and validation
x_test: feature - test
y_train_val: target - combined train and validation
y_test: target - test
cv: Number of folds for `cross_val_score` and `GridSearchCV`
seed: Controls reproducibility
full_reports: Whether to print accuracy scores, confusion matrix and classification report for each model
plots: Whether to plot confusion matrix and ROC curve
Returns:
List: [best Log Reg model, best SVM model]
"""
    np.random.seed(seed)
models = {'Log Reg': LogisticRegression(random_state=seed, max_iter=2000),
'SVC': SVC(probability=True, random_state=seed)}
param_grids = {'Log Reg': {'C': [0.01, 0.1, 1, 10, 100],
'solver': ['liblinear', 'saga'],
'penalty': ['l1', 'l2']},
'SVC': {'C': [0.01, 0.1, 1, 10],
'gamma': [0.001, 0.01, 0.1, 1],
'kernel': ['sigmoid', 'rbf']}
}
best_models = []
for name, mod in models.items():
# initialize model, cross validate on train & validation subset
model = mod
cv_model = cross_val_score(estimator=model, X=x_train_val, y=y_train_val, cv=cv, n_jobs=-1)
accuracy_train_val_untuned = np.around(np.mean(cv_model) * 100, 2)
# tune hyperparams and cross validate on train & validation subset
grid_cv_model = GridSearchCV(estimator=model, param_grid=param_grids[name], cv=cv, n_jobs=-1)
grid_cv_model.fit(x_train_val, y_train_val)
accuracy_train_val_tuned = np.around(grid_cv_model.best_score_ * 100, 2)
# get best tuned model and best params
best_model = grid_cv_model.best_estimator_
best_params = grid_cv_model.best_params_
best_models.append(best_model)
# cross validate on test subset
cv_model_test = cross_val_score(estimator=best_model, X=x_test, y=y_test, cv=cv, n_jobs=-1)
accuracy_test_tuned = np.around(np.mean(cv_model_test) * 100, 2)
# confusion matrix & classification report
y_pred = best_model.predict(x_test)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
if full_reports:
print(f"MODEL: {name}\n")
print(f"• Mean accuracy - train & val set (untuned): {accuracy_train_val_untuned}%")
print(f"• Best accuracy - train & val set (tuned): {accuracy_train_val_tuned}%")
print(f"• Mean accuracy - test set (tuned): {accuracy_test_tuned}%\n")
print(f"• Best parameters: {best_params}\n")
print(f"• Confusion Matrix:\n{conf_matrix}\n")
print(f"• Classification Report:\n{class_report}\n")
if plots:
# Get probabilities for positive outcome and calculate AUC Score
            y_pred_proba = best_model.predict_proba(x_test)[:, 1]
auc_score = np.around(roc_auc_score(y_test, y_pred_proba) * 100, 2)
# Get False Positive Rate and True Positive Rate (no need for Thresholds)
fpr, tpr, _ = roc_curve(y_test, y_pred_proba, drop_intermediate=True)
# Confusion matrix, ROC Curve
fig, axes = plt.subplots(nrows=1, ncols=2)
sns.heatmap(ax=axes[0], data=conf_matrix, fmt='g', cmap='cividis', square=True, linewidths=2,
annot=True, annot_kws={'size': plot_legend_ticklabels_font_size}, cbar_kws={'shrink':0.8})
cbar = axes[0].collections[0].colorbar
cbar.ax.tick_params(labelsize=plot_legend_ticklabels_font_size-1)
plt.suptitle(f"{name}", fontsize=plot_suptitile_font_size, y=0.95)
axes[0].set_title('Confusion Matrix', fontsize=plot_title_fontsize)
axes[0].set_xlabel('Predicted outcome', fontsize=plot_labels_fontsize)
axes[0].set_ylabel('True outcome', fontsize=plot_labels_fontsize)
axes[0].xaxis.set_ticklabels(['No', 'Yes'], fontsize=plot_legend_ticklabels_font_size)
axes[0].yaxis.set_ticklabels(['No', 'Yes'], fontsize=plot_legend_ticklabels_font_size, rotation=0)
axes[1].plot([0, 1], [0, 1], linestyle='--', color='black', alpha=0.75, label='Baseline')
axes[1].plot(fpr, tpr, color='green', linewidth=5, solid_capstyle='round',
marker='.', markersize=7, markerfacecolor='white', label=f"{name}")
axes[1].set_title('ROC Curve', fontsize=plot_title_fontsize)
axes[1].set_xlabel('False Positive Rate', fontsize=plot_labels_fontsize)
axes[1].set_ylabel('True Positive Rate', fontsize=plot_labels_fontsize)
axes[1].tick_params(labelsize=plot_legend_ticklabels_font_size)
axes[1].annotate(text=f"• Accuracy: {accuracy_test_tuned}%\n• AUC Score: {auc_score}",
xy=(0.25, 0.675), alpha=0.5, fontsize=plot_legend_ticklabels_font_size)
axes[1].legend(fontsize=plot_legend_ticklabels_font_size, loc='lower right')
plt.show()
return best_models
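A minimal usage sketch of the split-and-tune steps (assuming df_clean from the outlier-removal sketch above):
# Split the cleaned data, then tune and evaluate both classifiers
X_train_val, X_test, y_train_val, y_test = generate_train_val_test(df_clean)
logreg_best, svm_best = tune_train_test_logreg_svm(X_train_val, X_test, y_train_val, y_test,
                                                   full_reports=True, plots=True)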
def get_best_model(logreg_model: object, svm_model: object, intersection: float, segment: int = 1, seed: int = 42,
report: bool = False, show_performance_df: bool = False, show_trimmed: bool = True) -> object:
"""Find best model between LogisticRegression and SupportVectorClassifier by using a custom metric.
• Custom metric: the distance in dollars between each model's first positive outcome and the point of
intersection between the curves, using synthetic data. The shortest distance wins.
Args:
logreg_model: Tuned and trained instance of the LogisticRegression() class.
svm_model: Tuned and trained instance of the SVC() class.
intersection: The point of intersection between the Gaussian functions
segment: Interval in dollars that establishes lower/upper bound around the point of intersection
seed: Controls reproducibility
report: Whether to display custom metric for each model with indication of best model
show_performance_df: Whether to display the performance DataFrame
show_trimmed: Whether to display a trimmed version of the performance DataFrame, centered around the first outcomes
Returns:
Model (LogisticRegression or SupportVectorClassifier)
"""
seed = seed
np.random.seed(seed)
start = np.around(intersection - segment / 2, 2)
stop = np.around(intersection + segment / 2, 2)
step = 0.01
num_values = np.ceil((stop - start) / step)
# Evaluate model performance around point of intersection on synthetic data, at one-cent ($ 0.01) level
synthetic_data = np.arange(start, stop, step).reshape(-1, 1)
pred_logreg = logreg_model.predict(synthetic_data)
pred_svm = svm_model.predict(synthetic_data)
performance_df = pd.DataFrame({'Synthetic Data': synthetic_data.ravel(),
'Log Reg': pred_logreg.ravel(),
'SVM': pred_svm.ravel()})
if performance_df['Log Reg'].sum() == 0 and performance_df['SVM'].sum() == 0:
raise ValueError(f"WARNING: no positive values found between $ {start} and $ {stop}. "
f"Check 'segment' value (currently: {segment})")
if performance_df['Log Reg'].sum() == num_values and performance_df['SVM'].sum() == num_values:
raise ValueError(f"WARNING: only positive values found between $ {start} and $ {stop}. "
f"Raise 'segment' value (currently: {segment})")
pay_first_positive_outcome_logreg = np.around(performance_df.loc[performance_df['Log Reg'] == 1, 'Synthetic Data'].iloc[0], 2)
pay_first_positive_outcome_svm = np.around(performance_df.loc[performance_df['SVM'] == 1, 'Synthetic Data'].iloc[0], 2)
# Calculate distance of first positive outcome from point of intersection
delta_logreg = np.around(abs(intersection - pay_first_positive_outcome_logreg), 2)
delta_svm = np.around(abs(intersection - pay_first_positive_outcome_svm), 2)
# Select best model to use
    model = svm_model if delta_logreg >= delta_svm else logreg_model
if report:
print(f"Distance of first positive outcome from point of intersection ({intersection}):")
print(f" • Log Reg: {delta_logreg} ({pay_first_positive_outcome_logreg}) {'<-' if model is logreg else ''}")
print(f" • SVM: {delta_svm} ({pay_first_positive_outcome_svm}) {'<- best model' if model is svm else ''}\n")
if show_performance_df:
if show_trimmed:
index_first_positive_outcome_logreg = performance_df.loc[performance_df['Log Reg'] == 1, 'Log Reg'].idxmin()
index_first_positive_outcome_svm = performance_df.loc[performance_df['SVM'] == 1, 'SVM'].idxmin()
display(performance_df.iloc[index_first_positive_outcome_logreg: index_first_positive_outcome_svm + 1])
else:
display(performance_df)
return model
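And the final model-selection step, continuing the sketch above:
# Pick the classifier whose decision boundary sits closest to the intersection
intersection = find_current_intersection(pay_no, pay_yes, report=True)
best_model = get_best_model(logreg_best, svm_best, intersection=intersection, report=True)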
def generate_rides_dictionary(lam: int = 1, size=None) -> dict:
"""Generates a dictionary of rides (as keys) and number of customers (as values) with the Poisson distribution.
Example for lam=1 and size=100: {0: 39, 1: 37, 2: 17, 3: 4, 4: 3}
Args:
lam: Lambda value, expectation of interval
size: Samples to be drawn from the distribution
Returns:
A dictionary: the keys are the number of events (rides) and the values are the relative frequency (customers).
"""
requested_rides = np.random.poisson(lam=lam, size=size)
num_of_rides, num_of_customers = np.unique(requested_rides, return_counts=True)
return dict(zip(num_of_rides, num_of_customers))
5. Ingest Data
# Ingest data, create master DataFrame (`master_df`) and working DataFrame `df`
csv_path = 'driverAcceptanceData.csv'
master_df = pd.read_csv(csv_path, index_col=0)
df = master_df.copy()
# Display a few rows of the DataFrame
display(df.sample(n=8))
# Display stats
display(df.describe())
# Check for missing values
print(f"Missing values:\n{df.isna().sum()}")
# Subset 'PAY' according to 'ACCEPTED': 0 -> `pay_no`, 1 -> `pay_yes`
pay_no, pay_yes = subset_no_yes(df=df)
6. EDA & Outliers
Exploratory Data Analysis is a fundamental step in any machine learning pipeline: it helps us understand the dataset and uncover its underlying patterns and statistical properties.
6.1. Histograms