Skip to content
Online Retail Store - Data Cleaning, Analysis, Machine Learning
  • AI Chat
  • Code
  • Report
  • Online Retailer - Data Analytics Project

    Introduction

    The purpose of this project is to apply what I've been learning about Data Science and Python to a business's data. The dataset is from Kraggle, and contains about 1 and a half year's worth of data for a Brazilian retailer, including sales, customers, products, reviews and more.

    I will be using Python and Machine Learning to clean, examine and analyze the data, to see what insights I can extract from it. I plan to continue to revisit this dataset and project as I learn more, and I'm hoping it showcases some of the skills I have developed, and show what I can do with a business's data.

    Importing Basic Libraries and Data

    #importing basic libraries
    import pandas as pd
    import numpy as np
    Importing Data from CSV files

    *The individual data sheets will be show at the end.

    #Customer Information
    cust = pd.read_csv('olist_customers_dataset.csv')
    
    #Order Item Information
    item = pd.read_csv('olist_order_items_dataset.csv')
    
    #Payment Information
    pay = pd.read_csv('olist_order_payments_dataset.csv')
    
    #Reviews Data
    review = pd.read_csv('olist_order_reviews_dataset.csv')
    
    #All available products
    prod = pd.read_csv('olist_products_dataset.csv')
    
    #Seller information
    seller = pd.read_csv('olist_sellers_dataset.csv')
    
    #Translations for category names
    trans = pd.read_csv('product_category_name_translation.csv')
    #Order Information, including shipping
    orders = pd.read_csv('olist_orders_dataset.csv')
    
    #Making all of the date datetime format
    orders[['order_purchase_timestamp', 'order_approved_at', 'order_delivered_carrier_date', 'order_delivered_customer_date', 'order_estimated_delivery_date']] = orders[['order_purchase_timestamp', 'order_approved_at', 'order_delivered_carrier_date', 'order_delivered_customer_date', 'order_estimated_delivery_date']].apply(pd.to_datetime)
    
    #The dataset only contains partial data after August, so for the sake of the project, I'm just removing it.
    orders = orders[orders['order_purchase_timestamp'] < '8/22/2018 11:59']

    Data Cleaning

    This data will be used for data analysis and machine learning, so it will need to be cleaned and consolidated. I will make a couple general, clean tables, such as sales, and customers, and then tables specific to the visuals and analysis.

    General Sales

    The idea behind this is to consolidate many of the above tables into one general 'sales' table, where each row is one order, and contains general information such as the customer, and costs, while removing any detail such as specific items, or payment types.

    #Starting off by combining orders and customer data
    sales = pd.merge(orders, cust, how = 'left', on = 'customer_id')
    
    #Dropping extra date columns
    sales = sales.drop(columns = ['order_approved_at', 'order_delivered_carrier_date','order_delivered_customer_date', 'order_estimated_delivery_date'])
    
    #Adding in items info to get price and freight value data
    sales = pd.merge(sales, item, how = 'left', on = 'order_id')
    
    #dropping unnecessary rows
    sales = sales.drop(columns = ['seller_id', 'shipping_limit_date', 'order_item_id', 'product_id'])
    
    #grouping by the order id and adding up price and freight value, to leave one row per order
    sales = sales.groupby(by = ['order_id', 'customer_id', 'order_status', 'order_purchase_timestamp', 'customer_unique_id', 'customer_zip_code_prefix', 'customer_city', 'customer_state'], as_index = False).sum()
    
    #adding 'total_price' column for easier comparison
    sales['total_price'] = sales['price'] + sales['freight_value']
    
    #adding in payment infor to get payment made so far
    sales = pd.merge(sales, pay, on = 'order_id')
    
    #dropping excess data
    sales = sales.drop(columns = ['payment_sequential', 'payment_type', 'payment_installments'])
    
    #adding up all payments, again leaving one row per order
    sales = sales.groupby(by = ['order_id', 'customer_id', 'order_status', 'order_purchase_timestamp', 'customer_unique_id', 'customer_zip_code_prefix', 'customer_city', 'customer_state', 'price', 'freight_value','total_price'], as_index = False).sum()
    
    #Adding a 'balanace_remaining' column
    sales['balance_remaining'] = sales['total_price'] - sales['payment_value']
    
    #Adding in review data
    sales = pd.merge(sales, review, on = 'order_id', how = 'left')
    
    #I only want the review_score, as all other data can be obtained if needed
    sales = sales.drop(columns = ['review_comment_title', 'review_comment_message', 'review_creation_date', 'review_answer_timestamp', 'review_id'])
    
    #renaming the columns
    sales.rename(columns = {'order_purchase_timestamp':'order_date', 'customer_unique_id':'customer_un_id', 'customer_zip_code_prefix':'zip_prefix','customer_city':'city', 'customer_state':'state'}, inplace = True)
    
    sales
    

    Customer Lifetime Value (CLV) table