Sometimes, things that once worked perfectly suddenly hit a snag. This sales workflow is no exception. The load_and_check()
function worked great until the last update. Now, along with new data, there seem to be some issues. You've been asked to check it out!
The load_and_check()
function loads the sales.csv
dataset and proceeds to do some checks. First, it confirms that the dataset has the desired shape. Next, there are a couple of integrity checks designed to make sure that the data is consistent and that nothing suspicious is happening there.
The sales.csv
dataset contains several columns, but you should give special attention to these:
Total
: The total price, including 5% tax
Quantity
: Number of products purchased
Unit price
: Price of each product in USD
Tax
: 5% tax fee, calculated from the total price
Date
: Date of the purchase
As you work on fixing the pipeline, try to keep as much of the original code as possible. If you need to change something in the data, create new columns and keep all original ones intact. Remember to update the if
statements of checks if necessary!
import pandas as pd
import numpy as np
def load_and_check():
data = pd.read_csv('sales.csv')
# check data details and its description
display(data.head(), data.info(), data.isna().sum(), data.describe())
# check 1st column
display("column 1", data.iloc[:,[0]].head())
# strip whitespace
display(data.columns[0], data.columns[1])
for x in data.columns:
print(x)
# Correct way to strip whitespace from column names
data.columns = data.columns.str.strip()
display("after stripping whitespace", data.head(), data.isna().sum(), data.columns)
return data
data = load_and_check()
import pandas as pd
import numpy as np
def load_and_check():
data = pd.read_csv('sales.csv')
# check data details and its description
display(data.head(), data.info(), data.isna().sum(), data.describe())
# check 1st column
display("column 1", data.iloc[:,[0]].head())
# strip whitespace
data.iloc[:,[0]] = data.iloc[:,[0]].apply(lambda x: x.str.lstrip())
#print(data.head(), '\n')
display("after stripping whitespace", data.head(), data.isna().sum(), data.columns)
# Step 1: Check the loaded data
# Correct the string to a number ("17" -> 17)
if data.shape[1] != 17:
print("Please check that the data was loaded properly, different shape was expected.")
else:
print("Data loaded succesfully.")
# Step 2: Data integrity check
grouped_data = data.groupby(['Date'])['Total'].agg(['mean', 'std'])
display("grouped_data", grouped_data)
grouped_data['threshold'] = 3 * grouped_data['std']
display("grouped_data['threshold']", grouped_data['threshold'])
grouped_data['max'] = grouped_data['mean'] + grouped_data['threshold']
display("grouped_data['max']", grouped_data['max'])
grouped_data['min'] = grouped_data[['mean', 'threshold']].apply(lambda row: max(0, row['mean'] - row['threshold']), axis=1)
display("grouped_data['min']", grouped_data['min'])
data = pd.merge(data, grouped_data, on='Date', how='left')
display("data", data.head())
data['Condition_1'] = (data['Total'] >= data['min']) & (data['Total'] <= data['max'])
display("data['Condition_1']", data['Condition_1'])
data['Condition_1'].fillna(False, inplace=True)
display("data['Condition_1'] after fillna", data['Condition_1'])
data['Condition_2'] = round(data['Quantity'] * data['Unit price'] + data['Tax'], 1) == round(data['Total'], 1)
display("data['Condition_2']", data['Condition_2'])
# Do any necessary changes below this comment and before the if statement
# Create a new column 'Tax_calculated' for calculated tax, as the problem is some missing data in the Tax column, create a new column where you calculate it from the known data.
data['Tax_calculated'] = data['Total'] - data['Quantity'] * data['Unit price']
display("data['Tax_calculated']", data['Tax_calculated'])
# Create a new condition for the integrity check
data['Condition_3'] = round(data['Tax_calculated'], 1) == round(data['Quantity'] * data['Unit price'] * 0.05, 1)
display("data['Condition_3']", data['Condition_3'])
# Add the new condition to the if statement using logic operators
if (data['Condition_1'].sum() != data.shape[0])&((data['Condition_2'].sum() != data.shape[0])|(data['Condition_3'].sum() != data.shape[0])):
print("Something fishy is going on with the data! Better check the pipeline!")
else:
print("Data integrity check was succesful!")
return data
data = load_and_check()