Case Study Automatidata

Data Analyst: Michele Bedin (www.michelebedin.com)

Introduction

Data consulting firm Automatidata recently hired you as a new member of their data analysis team. Their new client, the NYC Taxi and Limousine Commission (New York City TLC), wants the Automatidata team to build a multiple linear regression model to predict taxi fares using existing data collected over the course of a year. The project is nearly finished: the team has already completed an initial action plan, initial Python coding work, exploratory data analysis (EDA) and A/B testing.

The Automatidata team has reviewed the results of the A/B tests. Now it is time to work on the prediction of taxi fares. You have impressed your colleagues at Automatidata with your hard work and attention to detail. The data team believes you are ready to build the regression model and update the New York City TLC client on your progress.

Step 5: Build a multiple linear regression model

In this activity, you will construct a multiple linear regression model. As you have learnt, multiple linear regression helps you estimate the linear relationship between a continuous dependent variable and two or more independent variables. For data science professionals, this is a useful skill because it lets you relate the variable you are measuring to several other variables at once, which supports a deeper and more flexible analysis.
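As a minimal sketch of the idea described above, the following fits a linear model with two independent variables on synthetic data. Everything in it is an assumption made for illustration: the variable names `distance` and `duration`, the "true" coefficients and the noise level are invented, and none of it comes from the TLC dataset or from the model built later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data only: names, ranges and coefficients are invented for illustration
rng = np.random.default_rng(seed=0)
n = 200
distance = rng.uniform(0.5, 10.0, size=n)   # hypothetical trip distance
duration = rng.uniform(3.0, 45.0, size=n)   # hypothetical trip duration
noise = rng.normal(0.0, 0.5, size=n)

# Assumed relationship: fare = 2.5 + 2.0 * distance + 0.3 * duration + noise
fare = 2.5 + 2.0 * distance + 0.3 * duration + noise

# Two independent variables (columns of X), one continuous dependent variable (fare)
X = np.column_stack([distance, duration])
model = LinearRegression().fit(X, fare)

print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("R^2 on the fitted data:", model.score(X, fare))
```

Each fitted coefficient estimates the change in the dependent variable for a one-unit change in that independent variable while the other is held constant, which is what makes the multi-variable view more informative than a series of one-variable fits.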

Completing this activity will help you practise planning and constructing a multiple linear regression model based on a specific business need. Its structure is designed to emulate the proposals you are likely to be assigned in your career as a data professional, so it should also help you prepare for those situations.

The purpose of this project is to demonstrate knowledge of EDA and a multiple linear regression model.

The goal is to build a multiple linear regression model and evaluate it.

This activity consists of three parts:

Part 1: EDA and testing of model assumptions

  • What are the purposes of EDA before building a multiple linear regression model?

Part 2: Model construction and evaluation

  • What resources do you find yourself using as you complete this step?

Part 3: Interpretation of model results

  • What key insights have emerged from your models?

  • What business recommendations do you propose on the basis of the models built?

PACE framework

This notebook, like those for the previous steps, follows the PACE problem-solving framework: Plan, Analyse, Construct and Execute.

PACE: Plan

Task 1: Imports and data loading

Import the packages required to build the linear regression model.

# Imports
# Packages for numerics + dataframes
import pandas as pd
import numpy as np

# Packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Packages for date conversions for calculating trip durations
from datetime import datetime, date, timedelta

# Packages for OLS, multiple linear regression and model evaluation
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics  # General access to scikit-learn's metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error

# Load dataset into dataframe
# Note: index_col is not set here, so an unnamed leading column in the CSV
# is read in as "Unnamed: 0"; pass index_col=0 to use it as the index instead
df0 = pd.read_csv("data/2017_Yellow_Taxi_Trip_Data.csv")
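The loading step mentions `index_col` and the "Unnamed: 0" column. As a small sketch of that pandas behaviour, the snippet below reads the same text twice, once with the defaults and once with `index_col=0`. The two-row CSV and its column names are hypothetical stand-ins made up for this example; they are not taken from the taxi file.

```python
import io
import pandas as pd

# Hypothetical CSV text whose first header cell is empty, as happens when a
# dataframe is written out together with its index
csv_text = ",VendorID,fare_amount\n0,1,13.0\n1,2,16.0\n"

# Default read: the unnamed first column is kept as a regular column
default_read = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column becomes the index instead
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```

Whether to set `index_col` depends on whether later steps need that first column as ordinary data; if they do, reading with the defaults and keeping "Unnamed: 0" is the simpler choice.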