Sometimes, things that once worked perfectly suddenly hit a snag. This sales workflow is no exception. The load_and_check() function worked great until the last update. Now, along with new data, there seem to be some issues. You've been asked to check it out!
The load_and_check() function loads the sales.csv dataset and proceeds to do some checks. First, it confirms that the dataset has the desired shape. Next, there are a couple of integrity checks designed to make sure that the data is consistent and that nothing suspicious is happening there.
The sales.csv dataset contains several columns, but you should give special attention to these:
Total: The total price, including 5% tax
Quantity: Number of products purchased
Unit price: Price of each product in USD
Tax: 5% tax fee, calculated from the total price
Date: Date of the purchase
As you work on fixing the pipeline, try to keep as much of the original code as possible. If you need to change something in the data, create new columns and keep all original ones intact. Remember to update the if statements of checks if necessary!
import pandas as pd
import numpy as np
def load_and_check():
data = pd.read_csv('sales.csv')
# Step 1: Check the loaded data
# Correct the string to a number
if data.shape[1] != "17":
print("Please check that the data was loaded properly, different shape was expected.")
else:
print("Data loaded succesfully.")
# Step 2: Data integrity check
grouped_data = data.groupby(['Date'])['Total'].agg(['mean', 'std'])
grouped_data['threshold'] = 3 * grouped_data['std']
grouped_data['max'] = grouped_data['mean'] + grouped_data.threshold
grouped_data['min'] = grouped_data[['mean', 'threshold']].apply(lambda row: max(0, row['mean'] - row['threshold']), axis=1)
data = pd.merge(data, grouped_data, on='Date', how='left')
data['Condition_1'] = (data['Total'] >= data['min']) & (data['Total'] <= data['max'])
data['Condition_1'].fillna(False, inplace=True)
data['Condition_2'] = round(data['Quantity'] * data['Unit price'] + data['Tax'], 1) == round(data['Total'], 1)
# Do any necessary changes below this comment and before the if statement
# You can change the if statement if you find it necessary
if (data['Condition_1'].sum() != data.shape[0])&(data['Condition_2'].sum() != data.shape[0]):
print("Something fishy is going on with the data! Better check the pipeline!")
else:
print("Data integrity check was succesful!")
return data
data = load_and_check()import pandas as pd
def load_and_check():
data = pd.read_csv('sales.csv')
# Step 1: Check the loaded data
expected_columns = ['Total', 'Quantity', 'Unit price', 'Tax', 'Date']
missing_columns = [col for col in expected_columns if col not in data.columns]
if missing_columns:
raise ValueError(f"Missing columns: {missing_columns}")
else:
print("Data loaded successfully.")
# Step 2: Data integrity check
data['Tax_calculated'] = data['Total'] - (data['Quantity'] * data['Unit price'])
data['Condition_1'] = (data['Tax_calculated'] == data['Tax'])
data['Condition_2'] = round(data['Quantity'] * data['Unit price'] + data['Tax'], 2) == round(data['Total'], 2)
if data['Condition_1'].all() and data['Condition_2'].all():
print("Data integrity check was successful!")
else:
print("Something fishy is going on with the data! Better check the pipeline!")
return data
data = load_and_check()