Online Retailer - Data Analytics Project
Introduction
The purpose of this project is to apply what I've been learning about Data Science and Python to a business's data. The dataset is from Kraggle, and contains about 1 and a half year's worth of data for a Brazilian retailer, including sales, customers, products, reviews and more.
I will be using Python and Machine Learning to clean, examine and analyze the data, to see what insights I can extract from it. I plan to continue to revisit this dataset and project as I learn more, and I'm hoping it showcases some of the skills I have developed, and show what I can do with a business's data.
Importing Basic Libraries and Data
#importing basic libraries
import pandas as pd
import numpy as np
Importing Data from CSV files
*The individual data sheets will be show at the end.
#Customer Information
cust = pd.read_csv('olist_customers_dataset.csv')
#Order Item Information
item = pd.read_csv('olist_order_items_dataset.csv')
#Payment Information
pay = pd.read_csv('olist_order_payments_dataset.csv')
#Reviews Data
review = pd.read_csv('olist_order_reviews_dataset.csv')
#All available products
prod = pd.read_csv('olist_products_dataset.csv')
#Seller information
seller = pd.read_csv('olist_sellers_dataset.csv')
#Translations for category names
trans = pd.read_csv('product_category_name_translation.csv')
#Order Information, including shipping
orders = pd.read_csv('olist_orders_dataset.csv')
#Making all of the date datetime format
orders[['order_purchase_timestamp', 'order_approved_at', 'order_delivered_carrier_date', 'order_delivered_customer_date', 'order_estimated_delivery_date']] = orders[['order_purchase_timestamp', 'order_approved_at', 'order_delivered_carrier_date', 'order_delivered_customer_date', 'order_estimated_delivery_date']].apply(pd.to_datetime)
#The dataset only contains partial data after August, so for the sake of the project, I'm just removing it.
orders = orders[orders['order_purchase_timestamp'] < '8/22/2018 11:59']
Data Cleaning
This data will be used for data analysis and machine learning, so it will need to be cleaned and consolidated. I will make a couple general, clean tables, such as sales, and customers, and then tables specific to the visuals and analysis.
General Sales
The idea behind this is to consolidate many of the above tables into one general 'sales' table, where each row is one order, and contains general information such as the customer, and costs, while removing any detail such as specific items, or payment types.
#Starting off by combining orders and customer data
sales = pd.merge(orders, cust, how = 'left', on = 'customer_id')
#Dropping extra date columns
sales = sales.drop(columns = ['order_approved_at', 'order_delivered_carrier_date','order_delivered_customer_date', 'order_estimated_delivery_date'])
#Adding in items info to get price and freight value data
sales = pd.merge(sales, item, how = 'left', on = 'order_id')
#dropping unnecessary rows
sales = sales.drop(columns = ['seller_id', 'shipping_limit_date', 'order_item_id', 'product_id'])
#grouping by the order id and adding up price and freight value, to leave one row per order
sales = sales.groupby(by = ['order_id', 'customer_id', 'order_status', 'order_purchase_timestamp', 'customer_unique_id', 'customer_zip_code_prefix', 'customer_city', 'customer_state'], as_index = False).sum()
#adding 'total_price' column for easier comparison
sales['total_price'] = sales['price'] + sales['freight_value']
#adding in payment infor to get payment made so far
sales = pd.merge(sales, pay, on = 'order_id')
#dropping excess data
sales = sales.drop(columns = ['payment_sequential', 'payment_type', 'payment_installments'])
#adding up all payments, again leaving one row per order
sales = sales.groupby(by = ['order_id', 'customer_id', 'order_status', 'order_purchase_timestamp', 'customer_unique_id', 'customer_zip_code_prefix', 'customer_city', 'customer_state', 'price', 'freight_value','total_price'], as_index = False).sum()
#Adding a 'balanace_remaining' column
sales['balance_remaining'] = sales['total_price'] - sales['payment_value']
#Adding in review data
sales = pd.merge(sales, review, on = 'order_id', how = 'left')
#I only want the review_score, as all other data can be obtained if needed
sales = sales.drop(columns = ['review_comment_title', 'review_comment_message', 'review_creation_date', 'review_answer_timestamp', 'review_id'])
#renaming the columns
sales.rename(columns = {'order_purchase_timestamp':'order_date', 'customer_unique_id':'customer_un_id', 'customer_zip_code_prefix':'zip_prefix','customer_city':'city', 'customer_state':'state'}, inplace = True)
sales
Customer Lifetime Value (CLV) table