Project: Debugging a Sales Data Workflow

Sometimes, things that once worked perfectly suddenly hit a snag. This sales workflow is no exception. The load_and_check() function worked great until the last update. Now, along with new data, there seem to be some issues. You've been asked to check it out!

The load_and_check() function loads the sales.csv dataset and proceeds to do some checks. First, it confirms that the dataset has the desired shape. Next, there are a couple of integrity checks designed to make sure that the data is consistent and that nothing suspicious is happening there.

The sales.csv dataset contains several columns, but you should give special attention to these: Total: The total price, including 5% tax Quantity: Number of products purchased Unit price: Price of each product in USD Tax: 5% tax fee, calculated from the total price Date: Date of the purchase

As you work on fixing the pipeline, try to keep as much of the original code as possible. If you need to change something in the data, create new columns and keep all original ones intact. Remember to update the if statements of checks if necessary!

import pandas as pd
import numpy as np

def load_and_check():
    data = pd.read_csv('sales.csv')  
    
    # Step 1: Check the loaded data
    
    # Correct the string to a number
    if data.shape[1] != "17":
        print("Please check that the data was loaded properly, different shape was expected.")
    else:
        print("Data loaded succesfully.")

    # Step 2: Data integrity check
    grouped_data = data.groupby(['Date'])['Total'].agg(['mean', 'std'])
    grouped_data['threshold'] = 3 * grouped_data['std']
    grouped_data['max'] = grouped_data['mean'] + grouped_data.threshold
    grouped_data['min'] = grouped_data[['mean', 'threshold']].apply(lambda row: max(0, row['mean'] - row['threshold']), axis=1)
    data = pd.merge(data, grouped_data, on='Date', how='left')
    data['Condition_1'] = (data['Total'] >= data['min']) & (data['Total'] <= data['max'])
    data['Condition_1'].fillna(False, inplace=True)

    data['Condition_2'] = round(data['Quantity'] * data['Unit price'] + data['Tax'], 1) == round(data['Total'], 1)
    
    # Do any necessary changes below this comment and before the if statement
    # You can change the if statement if you find it necessary
    
    if (data['Condition_1'].sum() != data.shape[0])&(data['Condition_2'].sum() != data.shape[0]):
        print("Something fishy is going on with the data! Better check the pipeline!")
    else: 
        print("Data integrity check was succesful!")
        
    return data

data = load_and_check()

import pandas as pd

def load_and_check():
    data = pd.read_csv('sales.csv')

    # Step 1: Check the loaded data
    expected_columns = ['Total', 'Quantity', 'Unit price', 'Tax', 'Date']
    missing_columns = [col for col in expected_columns if col not in data.columns]

    if missing_columns:
        raise ValueError(f"Missing columns: {missing_columns}")
    else:
        print("Data loaded successfully.")

    # Step 2: Data integrity check
    data['Tax_calculated'] = data['Total'] - (data['Quantity'] * data['Unit price'])
    data['Condition_1'] = (data['Tax_calculated'] == data['Tax'])
    data['Condition_2'] = round(data['Quantity'] * data['Unit price'] + data['Tax'], 2) == round(data['Total'], 2)

    if data['Condition_1'].all() and data['Condition_2'].all():
        print("Data integrity check was successful!")
    else:
        print("Something fishy is going on with the data! Better check the pipeline!")

    return data

data = load_and_check()