In today’s fast-paced, data-driven world, companies rely heavily on accurate analyses of sales data to make informed decisions. However, the quality of the insights derived from such analyses depend significantly on the integrity of the underlying code.
You have been given some starting code with two functions: one that extracts and flattens JSON data into a structured format and the other that transforms electricity sales data by cleaning, filtering, and extracting relevant features. The company plans to use your revised code to improve the accuracy of sales analytics.
Your task is to identify potential errors in the functions and the underlying data that might result in logic and runtime errors, such as missing values, incorrect data types, or incompatible values (e.g., negatives). Enhance the custom functions provided by implementing exceptions to catch data quality issues and edge cases.
The data is available here ("sales_data_sample.csv") for the analyses. This data has 25 columns, but only two columns are analyzed, namely, quantity_ordered and price_each. A sample of the data is shown below.
# Import libraries
import pandas as pd
# Load data
sales_df = pd.read_csv("data/sales_data_sample.csv")
sales_df.head()def get_quantity_ordered_sum(sales_quantity_ordered):
"""Calculates the total sum on the 'quantity_ordered' column.
Args:
sales_quantity_ordered (pd.core.series.Series): The pandas Series for the 'quantity_ordered' column.
Returns:
total_quantity_ordered (int): The total sum of the 'quantity_ordered' column.
"""
total_quantity_ordered = 0 # Initialize total sum variable
try:
for quantity in sales_quantity_ordered:
if quantity < 0:
# Convert negative values to positive to ensure proper summation
quantity *= -1
total_quantity_ordered += quantity # Accumulate quantity values
except TypeError:
# Raise an error if a non-numeric value is encountered
raise TypeError("Invalid data type in 'quantity_ordered' column.")
return total_quantity_ordered # Return the computed total sumtotal_quantity_ordered=get_quantity_ordered_sum(sales_df['quantity_ordered'])
print(total_quantity_ordered)def get_price_each_average(sales_price_each, num_of_places=2):
"""Calculates the average on the 'price_each' column
using pandas built-in methods and rounds to the desired number of places.
Args:
sales_price_each (pd.core.series.Series): The pandas Series for the 'price_each' column.
num_of_places (int): The number of decimal places to round.
Returns:
average_price_each (float): The average of the 'price_each' column.
"""
try:
# Compute total sum of 'price_each' column
total_of_price_each = sales_price_each.sum()
# Compute length of the 'price_each' column
len_of_price_each = len(sales_price_each)
# Compute and round the average
average_price_each = round(
total_of_price_each / len_of_price_each, num_of_places
)
except TypeError:
# Handle cases where non-numeric values exist in the column
sales_price_each = pd.to_numeric(sales_price_each, errors='coerce') # Convert non-numeric to NaN
sales_price_each.fillna(sales_price_each.mean()) # Replace NaN with column mean
# Recalculate total sum and length after handling invalid data
total_of_price_each = sales_price_each.sum()
len_of_price_each = len(sales_price_each)
# Compute and round the average again
average_price_each = round(
total_of_price_each / len_of_price_each, num_of_places
)
return average_price_each # Return the computed average price# Add as many cells as you require
average_price_each= get_price_each_average(sales_df['price_each'])
print(average_price_each)