Sometimes, things that once worked perfectly suddenly hit a snag. This sales workflow is no exception. The load_and_check()
function worked great until the last update. Now, along with new data, there seem to be some issues. You've been asked to check it out!
The load_and_check()
function loads the sales.csv
dataset and proceeds to do some checks. First, it confirms that the dataset has the desired shape. Next, there are a couple of integrity checks designed to make sure that the data is consistent and that nothing suspicious is happening there.
The sales.csv
dataset contains several columns, but you should give special attention to these:
Total
: The total price, including 5% tax
Quantity
: Number of products purchased
Unit price
: Price of each product in USD
Tax
: 5% tax fee, calculated from the total price
Date
: Date of the purchase
As you work on fixing the pipeline, try to keep as much of the original code as possible. If you need to change something in the data, create new columns and keep all original ones intact. Remember to update the if
statements of checks if necessary!
import pandas as pd
import numpy as np
def load_and_check():
data = pd.read_csv('sales.csv')
# Step 1: Check the loaded data
# Correct the string to a number
if data.shape[1] != 17:
print("Please check that the data was loaded properly, different shape was expected.")
else:
print("Data loaded succesfully.")
# Step 2: Data integrity check
data['Tax_calculated'] = data['Total'] - data['Quantity'] * data['Unit price']
grouped_data = data.groupby(['Date'])['Total'].agg(['mean', 'std'])
grouped_data['threshold'] = 3 * grouped_data['std']
grouped_data['max'] = grouped_data['mean'] + grouped_data['threshold']
grouped_data['min'] = grouped_data[['mean', 'threshold']].apply(lambda row: max(0, row['mean'] - row['threshold']), axis=1)
data = pd.merge(data, grouped_data, on='Date', how='left')
data['Condition_1'] = (data['Total'] >= data['min']) & (data['Total'] <= data['max'])
data['Condition_1'].fillna(False, inplace=True)
data['Condition_2'] = round(data['Quantity'] * data['Unit price'] + data['Tax_calculated'], 1) == round(data['Total'], 1)
# Do any necessary changes below this comment and before the if statement
# You can change the if statement if you find it necessary
if (data['Condition_1'].sum() != data.shape[0])&(data['Condition_2'].sum() != data.shape[0]):
print("Something fishy is going on with the data! Better check the pipeline!")
else:
print("Data integrity check was succesful!")
return data
data = load_and_check()
load_and_check()