Project: Debugging a Sales Data Workflow (Version1 )

Sometimes, things that once worked perfectly suddenly hit a snag. This sales workflow is no exception. The load_and_check() function worked great until the last update. Now, along with new data, there seem to be some issues. You've been asked to check it out!

The load_and_check() function loads the sales.csv dataset and proceeds to do some checks. First, it confirms that the dataset has the desired shape. Next, there are a couple of integrity checks designed to make sure that the data is consistent and that nothing suspicious is happening there.

The sales.csv dataset contains several columns, but you should give special attention to these: Total: The total price, including 5% tax Quantity: Number of products purchased Unit price: Price of each product in USD Tax: 5% tax fee, calculated from the total price Date: Date of the purchase

As you work on fixing the pipeline, try to keep as much of the original code as possible. If you need to change something in the data, create new columns and keep all original ones intact. Remember to update the if statements of checks if necessary!

import pandas as pd
import numpy as np

def load_and_check():
    data = pd.read_csv('sales.csv')
    
    # check data details and its description
    display(data.head(), data.info(), data.isna().sum(), data.describe())
    
    # check 1st column
    display("column 1", data.iloc[:,[0]].head())
    
    # strip whitespace
    display(data.columns[0], data.columns[1])
    
    for x in data.columns:
        print(x)
    
    # Correct way to strip whitespace from column names
    data.columns = data.columns.str.strip()
    display("after stripping whitespace", data.head(), data.isna().sum(), data.columns)
    
    return data

data = load_and_check()

import pandas as pd
import numpy as np

def load_and_check():
    data = pd.read_csv('sales.csv')
    
    # check data details and its description
    display(data.head(), data.info(), data.isna().sum(), data.describe())
    
    # check 1st column
    display("column 1", data.iloc[:,[0]].head())
    
    # strip whitespace
    data.iloc[:,[0]] = data.iloc[:,[0]].apply(lambda x: x.str.lstrip())
    
    #print(data.head(), '\n')
    display("after stripping whitespace", data.head(), data.isna().sum(), data.columns)
    
    # Step 1: Check the loaded data
    
    # Correct the string to a number ("17" -> 17)
    if data.shape[1] != 17:
        print("Please check that the data was loaded properly, different shape was expected.")
    else:
        print("Data loaded succesfully.")

    # Step 2: Data integrity check
    grouped_data = data.groupby(['Date'])['Total'].agg(['mean', 'std'])
    display("grouped_data", grouped_data)
    
    grouped_data['threshold'] = 3 * grouped_data['std']
    display("grouped_data['threshold']", grouped_data['threshold'])
    
    grouped_data['max'] = grouped_data['mean'] + grouped_data['threshold']
    display("grouped_data['max']", grouped_data['max'])
    
    grouped_data['min'] = grouped_data[['mean', 'threshold']].apply(lambda row: max(0, row['mean'] - row['threshold']), axis=1)
    display("grouped_data['min']", grouped_data['min'])
    
    data = pd.merge(data, grouped_data, on='Date', how='left')
    display("data", data.head())
    
    data['Condition_1'] = (data['Total'] >= data['min']) & (data['Total'] <= data['max'])
    display("data['Condition_1']", data['Condition_1'])
    
    data['Condition_1'].fillna(False, inplace=True)
    display("data['Condition_1'] after fillna", data['Condition_1'])
    
    data['Condition_2'] = round(data['Quantity'] * data['Unit price'] + data['Tax'], 1) == round(data['Total'], 1)
    display("data['Condition_2']", data['Condition_2'])
    
    # Do any necessary changes below this comment and before the if statement
    
    # Create a new column 'Tax_calculated' for calculated tax, as the problem is some missing data in the Tax column, create a new column where you calculate it from the known data.
    data['Tax_calculated'] = data['Total'] - data['Quantity'] * data['Unit price']
    display("data['Tax_calculated']", data['Tax_calculated'])
    
     # Create a new condition for the integrity check
    data['Condition_3'] = round(data['Tax_calculated'], 1) == round(data['Quantity'] * data['Unit price'] * 0.05, 1)
    display("data['Condition_3']", data['Condition_3'])
    
    # Add the new condition to the if statement using logic operators
    if (data['Condition_1'].sum() != data.shape[0])&((data['Condition_2'].sum() != data.shape[0])|(data['Condition_3'].sum() != data.shape[0])):
        print("Something fishy is going on with the data! Better check the pipeline!")
    else: 
        print("Data integrity check was succesful!")    
        
    return data

data = load_and_check()