Skip to content
certification case study
  • AI Chat
  • Code
  • Report
  • # import necessary packages
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np 
    import seaborn as sns
    sns.set()
    %matplotlib inline 
    
    from datetime import datetime as dt
    import plotly.express as px
    # load the dataset 
    df=pd.read_csv('office_supplies.csv')
    df.head()

    Customer Question

    The management would like you to answer the following:
    ● Are there products that do not sell as well in some locations?
    ● Are there any other patterns over time in each region that you can find in the data

    ASSESSING AND CLEANING

    
    
    df.info()
    df.describe()
    # CHECKING FOR DUPLICATE VALUES 
    df.duplicated().any()
    df[df.duplicated(keep=False)]

    ASSESSMENT REPORT

    1. column names cotains spaces and capital letters which should be corrected for easier accessing
    2. The Profit column which has a lot of null values (1993) which signifies some unknown error
    3. The 'Order Date' column is in string type instead of date
    4. There is a duplicated row in the dataset

    CLEANING BASED ON ASSESSMENT REPORT

    ISSUE 1
    # ACTION: replace spaces from all column names with underscore and format them as all lower case
    #CODE :
    df.columns = df.columns.str.lower().str.replace(' ','_')
    #TEST 
    display(df.columns.str.contains(' ').any(), df.columns)
    ISSSUE 2
    Dealing with null values in the Price column:

    There are 2 choices: fill with zero or drop them
    if i drop these rows, it will reduce the dataset and maybe orders that answers the questions im asking will be missing. though the question of this analysis is about sales of products in some regions, its safe to assume that there might be a correlation between low sales and 'null' profit so dropping these null profits might just mean dropping some important orders .
    if i fill with 0 theres a chance that ill be giving the wrong information as these null values might indicate that information about profits made hasn't been received yet and that doesnt necessarily mean a zero profit transaction

    the best option is to fill rows with null values in the Price column with 0 and not include the Price column too much in my analysis

    df[df.profit.isna() & df.discount >0] 
    
    # ACTION:  fill rows with null values in the Price column with 0 
    
    #CODE:
    df.profit.fillna(value=0,inplace=True)
    
    # TEST:
    df.profit.isnull().any()