Data Description

About the dataset:

  • row_id: unique row identifier
  • order_id: unique order identifier
  • order_date: date the order was placed
  • ship_date: date the order was shipped
  • ship_mode: how the order was shipped
  • customer_id: unique customer identifier
  • customer_name: customer name
  • segment: customer segment the order belongs to
  • country: country of customer
  • city: city of customer
  • state: state of customer
  • postal_code: postal code of customer
  • region: Superstore region the customer belongs to
  • product_id: unique product identifier
  • category: category of product
  • sub_category: subcategory of product
  • product_name: name of product
  • sales: total sales of that product in the order
  • quantity: total units sold of that product in the order
  • discount: percent discount applied for that product in the order
  • profit: total profit for that product in the order
!pip install -q seaborn --upgrade
!pip install -q pandas --upgrade
# Standard library
import calendar
import datetime as dt
from operator import attrgetter

# Third-party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
df = pd.read_csv("capstone_superstore.csv")
df
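
The column list above uses snake_case names, while the cleaning code in this notebook refers to labels such as "Order Date" and "Postal Code". If matching names are wanted, a rename along these lines could bridge the two. This is only a sketch: the helper `to_snake_case` is hypothetical, and the sample labels are assumed to resemble those in capstone_superstore.csv. It works on a separate stand-in frame, so the rest of the notebook keeps using the original labels.

```python
import pandas as pd


def to_snake_case(name: str) -> str:
    """Lower-case a column label and replace spaces and hyphens with underscores."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")


# Stand-in frame with assumed column labels (not read from the real CSV).
sample = pd.DataFrame(columns=["Row ID", "Order Date", "Sub-Category", "Postal Code"])

# rename() accepts a callable and applies it to every column label.
renamed = sample.rename(columns=to_snake_case)
print(list(renamed.columns))
```

Applying the same call to `df` would require updating every later cell that indexes columns by their original labels.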

Data Cleaning

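# Inspect the column types as loaded
df.dtypes
# Drop the leftover index column from the CSV export
df.drop("Unnamed: 0", axis=1, inplace=True)
# Parse the date columns
df["Order Date"] = pd.to_datetime(df["Order Date"])
df["Ship Date"] = pd.to_datetime(df["Ship Date"])
# Postal codes are identifiers, not numbers, so store them as strings
df["Postal Code"] = df["Postal Code"].astype(str)
df.dtypes
# Count missing values per column and list any fully duplicated rows
df.isnull().sum()
df[df.duplicated()]
# Missing postal codes show up as the string "nan" after the cast above; fill them in
df[df["Postal Code"] == "nan"]
df.loc[df["Postal Code"] == "nan", "Postal Code"] = "05049"
# Confirm that no "nan" postal codes remain
df[df["Postal Code"] == "nan"]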
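
One caveat with casting after loading: if pandas infers the postal code column as numeric, leading zeros are gone before `astype(str)` runs. An alternative, sketched below, is to ask `read_csv` for a string column up front. The three-row CSV and its values are made up for illustration, and whether the real file is affected depends on how its postal codes are stored.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV; the values are invented for this example.
csv_text = "Order ID,Postal Code\nA-1,01234\nA-2,\nA-3,42420\n"

# Default inference: the empty cell forces a float column, dropping the leading zero.
as_float = pd.read_csv(io.StringIO(csv_text))
print(as_float["Postal Code"].dtype)

# Requesting the nullable string dtype keeps the text exactly as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"Postal Code": "string"})
print(as_text["Postal Code"].tolist())

# Missing codes stay genuinely missing, so isna() finds them without a "nan" sentinel.
missing = as_text["Postal Code"].isna()
print(int(missing.sum()))
```

With this approach the missing rows can be selected with `isna()` and filled through `df.loc[...]`, in the same way the notebook fills the "nan" strings.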