Data Analyst: Michele Bedin (www.michelebedin.com)
- Step 1 - Project proposal
- Step 2 - Understand the data (current)
- Step 3 - EDA
- Step 4 - Statistical tests
- Step 5 - Regression modelling
- Step 6 - Machine learning models
- Step 7 - Work delivery
Introduction
Welcome to the Automatidata project!
You have just started working as a data professional at Automatidata, a fictitious consulting company. Their client, the New York City Taxi and Limousine Commission (New York City TLC), hired the Automatidata team because of their reputation for helping clients develop data-driven solutions.
The team is still in the early stages of the project. Earlier (step 1), your supervisor, DeShawn Washington, asked you to complete a project proposal. You have now received the news that your proposal has been approved and that the New York City TLC has granted the Automatidata team access to its data. To get precise information, the New York City TLC's data must be analysed, key variables must be identified, and the dataset must be prepared for analysis.
Step 2 - Understand the data
The PACE problem-solving framework will be used to carry out this project.
Objective: identifying relevant data types and variables using Python
Pace: Plan
Task 1: Understanding the situation
- How can we best prepare ourselves to understand and organise the information provided on taxis?
Start by exploring the dataset and reviewing the Data Dictionary. Reading the data fields and what they represent helps in preparing to understand the information. Reading the fact sheet can also be helpful. However, the main objective is to load the data into Python, examine it, and give DeShawn initial observations. The next step is to gather background information to learn more about the data and check for anomalies.
Specifically, to be better prepared to understand and organise the information provided on taxis, the following steps could be followed:
- Examine the dataset: First, it is important to examine the dataset to get a general idea of the data. This may include viewing the first few rows, understanding the number of rows and columns, and identifying the types of data present.
- Understand the variables: Each column in the dataset represents a different variable. Understanding what each variable means and how it can be used in the analysis is essential. The variable descriptions provided can be beneficial in this respect.
- Identify missing data: Missing data can influence the data analysis. Identifying any missing data and deciding how to handle it is essential.
- Plan the organisation of data: After understanding the variables and identifying missing data, you can plan how to organise the data for analysis. This may include deciding which variables to use in the analysis, handling missing data, and structuring the data for analysis.
- Use data analysis tools: Python and Pandas can be beneficial for organising and analysing data. These tools can help you manipulate data, perform calculations, and create visualisations.
- Documenting the process: Finally, it is important to document the process of understanding and organising data. This may include recording observations, decisions, and process steps. This documentation can be useful for future reference and for communicating the analysis results to others.
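The first-look steps listed above could be sketched in Pandas roughly as follows. This is a minimal illustration, not the project's actual code: the tiny inline sample, its column names, and the `summarise` helper are all hypothetical stand-ins, since the real TLC file and its fields are not shown in this section.

```python
import io

import pandas as pd

# Made-up sample standing in for the TLC data. The column names and
# values are illustrative assumptions, not taken from the real file.
SAMPLE_CSV = """trip_id,passenger_count,trip_distance,fare_amount,payment_type
1,1,2.5,12.0,card
2,2,,8.5,cash
3,1,0.9,,card
4,,5.1,20.0,card
"""


def summarise(df):
    """Return a small dictionary of first-look facts about a DataFrame."""
    return {
        "n_rows": df.shape[0],
        "n_cols": df.shape[1],
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
    }


# Load the data (here from an in-memory string instead of a file path).
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# View the first few rows and the column types.
print(df.head())
df.info()

# Collect row/column counts, data types, and missing values per column.
summary = summarise(df)
print(summary)
```

With a real file, `pd.read_csv` would take the file path instead of the in-memory string, and the printed summary could be copied into the project notes as part of documenting the process.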
pAce: Analyze