Case Study Automatidata

Data Analyst: Michele Bedin (www.michelebedin.com)

Step 1 - Project proposal
Step 2 - Understand the data
Step 3 - EDA (current)
Step 4 - Statistical tests
Step 5 - Regression modelling
Step 6 - Machine learning models
Step 7 - Work delivery

Introduction

You are the new data professional in a fictitious consulting company: Automatidata. The team is still at the beginning of the project, having just completed an initial action plan and some initial coding work in Python.

Luana Rodriquez, the senior analyst at Automatidata, is pleased with the work you have already done and asks you to assist her with some EDA and data visualisation work for the New York City Taxi and Limousine Commission (New York City TLC) project, in order to gain a general understanding of what taxi ridership looks like. The management team is requesting a Python notebook showing how the data was structured and cleaned, as well as any matplotlib/seaborn visualisations that help to understand the data. At the very least, a box plot of the ride duration and some time series graphs, such as a breakdown by quarter or month, should be included.

In addition, the management team has recently asked that all EDAs include Tableau visualisations. For this taxi data, create a Tableau dashboard showing a map of New York City with taxi/limo rides by month. Make sure it is easy to understand for someone who is not data savvy, and remember that the assistant director of the New York City TLC is visually impaired.

Step 3: Exploratory data analysis (EDA)

In this activity you will examine the data provided and prepare it for analysis. You will also design a professional data visualisation that tells a story and helps make data-driven decisions for business needs.

The purpose of this project is to conduct an exploratory data analysis on a provided dataset. Your mission is to continue the investigation started in Step 2 and perform further EDA on this data with the goal of learning more about the variables.

The objective is to clean the dataset and create a visualisation.

This activity consists of 4 tasks:

Task 1: importing, linking and loading
Task 2: data exploration (data cleaning)
Task 3: constructing visualisations
Task 4: evaluating and sharing results

PACE

PACE problem-solving framework: Plan, Analyse, Construct and Execute.

PACE: Plan

Identify outliers:

What are the best methods to identify outliers?
There are several techniques for identifying outliers in data. A common starting point is summary statistics: comparing the mean() and median() of a column (with pandas or numpy) shows whether extreme values are pulling the mean away from the centre of the data, while the minimum, maximum and quartiles show the range of values. In addition, visualising the data through a box plot or histogram can help to identify outliers visually.
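
As a minimal sketch of these checks, the snippet below compares the mean and median of a column and then flags values that fall outside 1.5 times the interquartile range, which is the rule a box plot uses to draw its whiskers. The sample values are hypothetical and are not taken from the TLC data, and the 1.5 multiplier is a common convention assumed here, not a requirement of the project.

```python
import pandas as pd

# Hypothetical trip distances in miles (illustrative only, not TLC data).
distances = pd.Series(
    [0.9, 1.2, 1.5, 1.8, 2.0, 2.3, 2.7, 3.1, 3.4, 33.0],
    name="trip_distance",
)

# A mean well above the median hints at large values in the right tail.
mean_value = distances.mean()
median_value = distances.median()

# Interquartile range (IQR) fences, assuming the usual 1.5 multiplier.
q1 = distances.quantile(0.25)
q3 = distances.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences.
outliers = distances[(distances < lower_bound) | (distances > upper_bound)]

print(f"mean={mean_value:.2f}, median={median_value:.2f}")
print(f"fences=({lower_bound:.2f}, {upper_bound:.2f})")
print(outliers)
```

On real data the same fences could be applied to any numeric column, and a call such as sns.boxplot(x=distances) would show the flagged points beyond the whiskers.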

How do you decide to keep or exclude outliers from any future model?
The decision to keep or exclude outliers depends on several factors, including the nature of the data and the assumptions of the model being built. If it is certain that the outliers are errors or anomalies, and the data will be used for modelling or machine learning, it may be appropriate to remove the outliers. If the data set is small, it may be more appropriate to derive new values to replace those of the outliers. Finally, if the dataset is only expected to be used for exploratory data analysis, or for a model that is resistant to outliers, it may make more sense to leave the outliers in the data.
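
The three options described above (remove, replace, keep) can be sketched as follows. This is only an illustration under stated assumptions: the fare values are hypothetical, the upper fence uses the 1.5 x IQR convention, and capping at that fence is just one possible way of deriving replacement values.

```python
import pandas as pd

# Hypothetical fares in dollars (illustrative only, not TLC data).
fares = pd.DataFrame(
    {"fare_amount": [6.5, 8.0, 9.5, 11.0, 12.5, 14.0, 16.0, 999.0]}
)

# Upper fence, assuming the 1.5 x IQR convention.
q1 = fares["fare_amount"].quantile(0.25)
q3 = fares["fare_amount"].quantile(0.75)
upper_bound = q3 + 1.5 * (q3 - q1)
is_outlier = fares["fare_amount"] > upper_bound

# Option 1: remove the rows, e.g. when the values are known to be errors.
fares_dropped = fares[~is_outlier].copy()

# Option 2: replace the values, e.g. for a small dataset.
# Here every value above the fence is capped at the fence itself.
fares_capped = fares.copy()
fares_capped["fare_amount"] = fares_capped["fare_amount"].clip(upper=upper_bound)

# Option 3: keep the data unchanged, e.g. for EDA or an outlier-resistant model.
fares_kept = fares.copy()

print(len(fares_dropped), fares_capped["fare_amount"].max(), len(fares_kept))
```

Working on copies keeps the original frame intact, so the effect of each choice can be compared before committing to one.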

Task 1: Importing, linking and loading

For the EDA, import the most useful packages, such as pandas, numpy and matplotlib. Then load the dataset.

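# Packages for data manipulation
import pandas as pd
import numpy as np
import datetime as dt

# Packages for visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset into a DataFrame
df = pd.read_csv('data/2017_Yellow_Taxi_Trip_Data.csv')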
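
After loading, a quick look at the shape, column types and first rows helps to confirm that the file was read as expected. The sketch below uses a tiny in-memory stand-in for the CSV so that it runs on its own. The column names and values are assumptions made for illustration and may not match the real file, and parse_dates is shown as one optional way to read timestamp columns as datetimes.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the CSV file; the column names
# and values are assumptions for illustration only.
sample_csv = io.StringIO(
    "tpep_pickup_datetime,tpep_dropoff_datetime,trip_distance\n"
    "2017-03-25 08:55:43,2017-03-25 09:09:47,3.34\n"
    "2017-04-11 14:53:28,2017-04-11 15:19:58,1.80\n"
)

# parse_dates converts the listed columns to datetime64 while reading.
sample_df = pd.read_csv(
    sample_csv,
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
)

print(sample_df.shape)   # (rows, columns)
sample_df.info()         # column names, dtypes and non-null counts
print(sample_df.head())  # first rows
```

The same three checks can be run on df once the real file has been loaded.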
