Skip to content
# import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np 
import seaborn as sns
sns.set()
%matplotlib inline 

from datetime import datetime as dt
import plotly.express as px
# load the dataset 
df=pd.read_csv('office_supplies.csv')
df.head()

Customer Question

The management would like you to answer the following:
● Are there products that do not sell as well in some locations?
● Are there any other patterns over time in each region that you can find in the data

ASSESSING AND CLEANING



df.info()
df.describe()
# CHECKING FOR DUPLICATE VALUES 
df.duplicated().any()
df[df.duplicated(keep=False)]

ASSESSMENT REPORT

  1. column names cotains spaces and capital letters which should be corrected for easier accessing
  2. The Profit column which has a lot of null values (1993) which signifies some unknown error
  3. The 'Order Date' column is in string type instead of date
  4. There is a duplicated row in the dataset

CLEANING BASED ON ASSESSMENT REPORT

ISSUE 1
# ACTION: replace spaces from all column names with underscore and format them as all lower case
#CODE :
df.columns = df.columns.str.lower().str.replace(' ','_')
#TEST 
display(df.columns.str.contains(' ').any(), df.columns)
ISSSUE 2
Dealing with null values in the Price column:

There are 2 choices: fill with zero or drop them
if i drop these rows, it will reduce the dataset and maybe orders that answers the questions im asking will be missing. though the question of this analysis is about sales of products in some regions, its safe to assume that there might be a correlation between low sales and 'null' profit so dropping these null profits might just mean dropping some important orders .
if i fill with 0 theres a chance that ill be giving the wrong information as these null values might indicate that information about profits made hasn't been received yet and that doesnt necessarily mean a zero profit transaction

the best option is to fill rows with null values in the Price column with 0 and not include the Price column too much in my analysis

df[df.profit.isna() & df.discount >0] 

# ACTION:  fill rows with null values in the Price column with 0 

#CODE:
df.profit.fillna(value=0,inplace=True)

# TEST:
df.profit.isnull().any()