Initial Data Analysis of Motorcycle Parts Sales Data
📖 Background
You work in the accounting department of a company that sells motorcycle parts. The company operates three warehouses in a large metropolitan area.
You’ve recently learned data manipulation and plotting, and have offered to help your colleague analyze past sales data. Your colleague wants to capture sales by payment method. She also needs to know the average unit price for each product line.
💪 Challenge
Create a report to answer your colleague's questions. Include:
- What are the total sales for each payment method?
- What is the average unit price for each product line?
- Create plots to visualize findings for questions 1 and 2.
- [Optional] Investigate further (e.g., average purchase value by client type, total purchase value by product line, etc.)
- Summarize your findings.
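The two aggregation questions map directly onto pandas groupby operations. A minimal sketch on a small synthetic frame (column names assumed to match the data dictionary used in this report):

```python
import pandas as pd

# Synthetic stand-in for the real sales data; values are illustrative only
df = pd.DataFrame({
    "payment": ["Cash", "Transfer", "Cash", "Credit card"],
    "product_line": ["Engine", "Engine", "Frame & body", "Frame & body"],
    "quantity": [2, 1, 4, 3],
    "unit_price": [10.0, 20.0, 5.0, 8.0],
})
df["total"] = df["quantity"] * df["unit_price"]

# Q1: total sales for each payment method
sales_by_payment = df.groupby("payment")["total"].sum()

# Q2: average unit price for each product line
avg_price_by_line = df.groupby("product_line")["unit_price"].mean()
```

The same two Series can be passed straight to a bar plot for question 3.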
Environment Set-up
First we load the libraries we'll need to conduct the analysis -- pandas for data inspection, cleaning and aggregation; matplotlib and seaborn for visualization.
We then load in the sales dataset "sales_data.csv" containing three months of sales data for the company. We use the parse_dates option to ensure that the "date" column is loaded correctly.
# Importing the pandas module
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Reading in the sales data
df = pd.read_csv('data/sales_data.csv', parse_dates=['date'])
Inspecting the Data
💾 The data
The sales data has the following fields:
- "date" - The date, from June to August 2021.
- "warehouse" - The company operates three warehouses: North, Central, and West.
- "client_type" - There are two types of customers: Retail and Wholesale.
- "product_line" - Type of products purchased.
- "quantity" - How many items were purchased.
- "unit_price" - Price per item sold.
- "total" - Total sale = quantity * unit_price.
- "payment" - How the client paid: Cash, Credit card, Transfer.
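Since "total" is defined as quantity * unit_price, it is worth verifying that the column is internally consistent before aggregating on it. A hedged sketch on a toy frame (real checks would run on the loaded dataframe):

```python
import pandas as pd

# Toy frame with the same three numeric columns; values are illustrative
df = pd.DataFrame({
    "quantity": [2, 5],
    "unit_price": [9.99, 4.50],
    "total": [19.98, 22.50],
})

# Flag rows where total deviates from quantity * unit_price,
# allowing a small tolerance for float rounding
mismatch = ~((df["quantity"] * df["unit_price"]).sub(df["total"]).abs() < 1e-6)
print(mismatch.sum())  # 0 when the column is internally consistent
```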
We use .head() to view the first few rows of the file.
df.head()
We then do a cursory inspection of the data to get some basic information and see if there might be any issues that need to be addressed.
- We use .shape to see the size of the dataset in terms of number of rows (observations) and columns (features).
- We use .describe() to get summary data on the numerical variables: count, mean, standard deviation, min, max, quartiles.
- We use .dtypes to look at the data types assigned to each variable to see if any are assigned an inappropriate type (e.g., a numerical variable assigned as text/object).
- We use .isna().sum() to look for variables with missing data.
- We use .drop_duplicates(inplace=True) to drop any duplicate rows (inplace drops the rows directly in the dataframe without creating a new object).
- We use .unique() on the categorical variables to see the distinct categories in each feature, and to look for any clean-up that may be needed, such as spelling errors, the same category spelled two different ways, or categories that could be collapsed.
- We use "min" and "max" on the date variable to find the first and last date in the dataset and check for out-of-range dates that may need to be fixed or removed.
It looks like our dataset is pretty clean. That's great news!
Based on the information below, we found that:
- Our data set contains 1000 rows and 8 columns
- All of our variables look to be appropriate data types and no changes are needed
- Date is in datetime
- warehouse, client_type, product_line, and payment are object (text)
- quantity is an integer
- unit_price and total (unit_price * quantity) are float
- None of the categorical variables seem to have any text cleanup issues in the names or need to be collapsed
- There don't appear to be any out-of-range dates
print("The size of the data set is: ", df.shape)
print("Summary of the data set")
df.describe()
print("The data types for the data set")
df.dtypes
print("Check for Missing Values")
df.isna().sum()
print("Drop duplicate rows, then recheck size of data set")
df.drop_duplicates(inplace=True)
df.shape
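The .unique() and date-range checks described above are not shown in the cells here; they can be sketched as follows (on a small synthetic frame standing in for the loaded data, with column names from the data dictionary):

```python
import pandas as pd

# Synthetic stand-in for the loaded sales data
df = pd.DataFrame({
    "payment": ["Cash", "Transfer", "Cash", "Credit card"],
    "date": pd.to_datetime(["2021-06-01", "2021-07-15", "2021-08-30", "2021-06-20"]),
})

# Distinct categories in a categorical column, to spot spelling variants
print(sorted(df["payment"].unique()))  # ['Cash', 'Credit card', 'Transfer']

# First and last dates, to spot out-of-range values
print(df["date"].min(), df["date"].max())
```

Running the same calls on each categorical column of the real dataframe confirms the clean-category and date-range findings listed above.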