Project: Debugging a Sales Data Workflow (Version 2)

Debugging a Sales Data Workflow

Sometimes, things that once worked perfectly suddenly hit a snag. Practice your knowledge of DataFrames to find the problem and fix it!

Programming is not only about writing code from scratch but also about reading and improving an already written code.

In this project, you will work to fix the start of a sales workflow. You will need to read an already-written code, understand where the problem is, and come up with a solution so things will run smoothly again.

As a data engineer, you often face unexpected challenges in workflows. In this scenario, the load_and_check() function, in charge of managing sales data, encounters issues after the latest update. Unfortunately, your colleague who usually handles this code is currently on holiday, leaving you to troubleshoot.

Your task is to identify and address the issues in the sales data pipeline without getting into every line of code. The load_and_check() function loads the sales.csv dataset and performs several checks. Initially, it verifies the dataset's shape, ensuring it matches expectations. Subsequently, integrity checks are conducted to maintain data consistency and flag any anomalies.

The sales.csv dataset has various columns, focusing on critical fields such as Total, Quantity, Unit price, Tax, and Date. It's essential that the Tax column accurately represents 5% of the subtotal, calculated from the Unit Price multiplied by Quantity.

Your goal is to sort out the pipeline issues, aiming for the code to return 2 success messages upon completion. While at it, try to keep the original structure as much as possible. Only change existing columns if necessary, and make sure the data remains accurate. Be mindful of updating any relevant if statements in the checks as needed.

Run the load_and_check() function and observe what happens. Follow the printouts to identify and fix any mistakes.
Review the load_and_check() function for issues. Focus on the two if checks to identify potential problems. You might need to edit both the functions and the underlying data. Modify existing columns if necessary, create new ones as you wish. Ensure the function only returns two success messages when completed.

# original version

import pandas as pd

def load_and_check():
    # Step 1: Load the data and check if it has the expected shape
    data = pd.read_csv('sales.csv')  
    
    if data.shape[1] != 18:
        print("Please check that the data was loaded properly!")
    else:
        print("Data loaded successfully.")

    # Step 2: Calculate statistical values and merge with the original data
    grouped_data = data.groupby(['Date'])['Total'].agg(['mean', 'std'])
    grouped_data['threshold'] = 3 * grouped_data['std']
    grouped_data['max'] = grouped_data['mean'] + grouped_data.threshold
    grouped_data['min'] = grouped_data[['mean', 'threshold']].apply(lambda row: max(0, row['mean'] - row['threshold']), axis=1)
    data = pd.merge(data, grouped_data, on='Date', how='left')

    # Condition_1 checks if 'Total' is within the acceptable range (min to max) for each date
    data['Condition_1'] = (data['Total'] >= data['min']) & (data['Total'] <= data['max'])
    data['Condition_1'].fillna(False, inplace=True)  

    # Condition_2 checks if the 'Tax' column is properly calculated as 5% of (Quantity * Unit price)
    data['Condition_2'] = round(data['Quantity'] * data['Unit price'] * 0.05, 1) == round(data['Tax'], 1)
        
    # Step 3: Check if all rows pass both Condition_1 and Condition_2
    # Success indicates data integrity; failure suggests potential issues.
    if (data['Condition_1'].sum() == data.shape[0]) and (data['Condition_2'].sum() == data.shape[0]):
        print("Data integrity check was successful! All rows pass the integrity conditions.")
    else:
        print("Something fishy is going on with the data! Integrity check failed for some rows!")
        
    return data

processed_data = load_and_check()

How to approach the project

Run the function Run the load_and_check() function as it is and see what happens.
Isolate the issues Carefully review the load_and_check() function. Spot any potential issues.

The first if statement Check the first if statement. Inspect the data DataFrame on your own and amend the condition if necessary. You might find the .shape attribute especially useful.
The second if statement Check Condition_1 and Condition_2 separately by using two if statements. Find out which one doesn't work as expected and why.

Fix the issue Examine how Condition_2 column is determined and think about why it might be False for certain rows.

Check values for the Tax column Since there might be missing data in the Tax column, it is essential to recalculate it. The simplest method is to calculate it as data['Quantity'] * data['Unit price'] * 0.05. Make sure not to modify or add any other columns, and then check if Condition_2 column returns True.

import pandas as pd

def load_and_check():
    # Step 1: Load the data and check if it has the expected shape
    data = pd.read_csv('sales.csv')  
    
    # Issue 1 fixed: Correct number of expected columns
    if data.shape[1] != 17: 
        print("Please check that the data was loaded properly!")
    else:
        print("Data loaded successfully.")

    # Step 2: Calculate statistical values and merge with the original data
    grouped_data = data.groupby(['Date'])['Total'].agg(['mean', 'std'])
    grouped_data['threshold'] = 3 * grouped_data['std']
    grouped_data['max'] = grouped_data['mean'] + grouped_data.threshold
    grouped_data['min'] = grouped_data[['mean', 'threshold']].apply(lambda row: max(0, row['mean'] - row['threshold']), axis=1)
    data = pd.merge(data, grouped_data, on='Date', how='left')
    
    # Issue 2 fixed:  Recalculating the 'Tax' column
    data['Tax'] = (data['Quantity'] * data['Unit price']).astype(float) * 0.05  # Assuming tax is 5% of the subtotal

    # Condition_1 checks if 'Total' is within the acceptable range (min to max) for each date
    data['Condition_1'] = (data['Total'] >= data['min']) & (data['Total'] <= data['max'])
    data['Condition_1'].fillna(False, inplace=True)  

    # Condition_2 checks if the 'Tax' column is properly calculated as 5% of (Quantity * Unit price)
    data['Condition_2'] = round(data['Quantity'] * data['Unit price'] * 0.05, 1) == round(data['Tax'], 1)
        
    # Step 3: Check if all rows pass both Condition_1 and Condition_2
    # Success indicates data integrity; failure suggests potential issues.
    if (data['Condition_1'].sum() == data.shape[0]) and (data['Condition_2'].sum() == data.shape[0]):
        print("Data integrity check was successful! All rows pass the integrity conditions.")
    else:
        print("Something fishy is going on with the data! Integrity check failed for some rows!")
    
    display(data.head(), data.info())
    
    return data

processed_data = load_and_check()

Project: Debugging a Sales Data Workflow (Version 2)

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Debugging a Sales Data Workflow

Debugging a Sales Data Workflow