Walmart is the largest retailer in the United States, and like many others it has been expanding the e-commerce side of its business. By the end of 2022, e-commerce represented a roaring $80 billion in sales, about 13% of Walmart's total sales. One of the main factors affecting sales is public holidays, such as the Super Bowl, Labor Day, Thanksgiving, and Christmas.
In this project, you have been tasked with creating a data pipeline for the analysis of supply and demand around the holidays, along with conducting a preliminary analysis of the data. You will be working with two data sources: grocery sales and complementary data. You have been provided with the `grocery_sales` table in a PostgreSQL database, with the following features:

- `"index"` - unique ID of the row
- `"Store_ID"` - the store number
- `"Date"` - the week of sales
- `"Weekly_Sales"` - sales for the given store
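In the project environment, the result of your SQL query is exposed to Python as a pandas DataFrame named `grocery_sales`. Outside that environment you could load the table yourself; here is a minimal sketch, assuming a SQLAlchemy engine (the connection string below is a placeholder, not the project's actual credentials):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string: user, password, host, and database
# name are assumptions, not values provided by the project.
engine = create_engine("postgresql://user:password@localhost:5432/walmart")
grocery_sales = pd.read_sql("SELECT * FROM grocery_sales", engine)
print(grocery_sales.head())
```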
Also, you have the `extra_data.parquet` file that contains complementary data:

- `"IsHoliday"` - whether the week contains a public holiday (1 if yes, 0 if no)
- `"Temperature"` - temperature on the day of sale
- `"Fuel_Price"` - cost of fuel in the region
- `"CPI"` - prevailing consumer price index
- `"Unemployment"` - prevailing unemployment rate
- `"MarkDown1"`, `"MarkDown2"`, `"MarkDown3"`, `"MarkDown4"` - number of promotional markdowns
- `"Dept"` - department number in each store
- `"Size"` - size of the store
- `"Type"` - type of the store (depends on the `Size` column)
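Before building the pipeline, it can help to preview the complementary data with `pandas.read_parquet`; note this requires a parquet engine such as `pyarrow` or `fastparquet` to be installed:

```python
import pandas as pd

# Peek at the complementary data: column types and the first few rows
extra_df = pd.read_parquet("extra_data.parquet")
print(extra_df.dtypes)
print(extra_df.head())
```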
You will need to merge those files and perform some data manipulations. The transformed DataFrame can then be stored as the `clean_data` variable containing the following columns:

- `"Store_ID"`
- `"Month"`
- `"Dept"`
- `"IsHoliday"`
- `"Weekly_Sales"`
- `"CPI"`
- `"Unemployment"`
After merging and cleaning the data, you will have to analyze the monthly sales of Walmart and store the results of your analysis as the `agg_data` variable, which should look like:

| Month | Weekly_Sales |
|---|---|
| 1.0 | 33174.178494 |
| 2.0 | 34333.326579 |
| ... | ... |
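The aggregation itself reduces to grouping `clean_data` by `Month` and averaging `Weekly_Sales`. A minimal sketch, assuming `clean_data` has already been built as described above (the sample values look like per-month means, but the exact statistic is an assumption):

```python
# Average weekly sales per month; reset_index turns the
# grouped result back into a two-column DataFrame
agg_data = (
    clean_data.groupby("Month")["Weekly_Sales"]
    .mean()
    .reset_index()
)
print(agg_data.head())
```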
Finally, you should save `clean_data` and `agg_data` as CSV files. It is recommended to use `pandas` for this project.
```sql
-- Write your SQL query here
SELECT * FROM grocery_sales;
```
Convert the results of the query `SELECT * FROM grocery_sales` into a CSV file.
```python
import pandas as pd

grocery_sales_df = pd.DataFrame({
    'index': [],
    'Store_ID': [],
    'Date': [],
    'Dept': [],
    'Weekly_Sales': []
})

# Assuming grocery_sales_df is populated with the query results
# Save the DataFrame to a CSV file
grocery_sales_df.to_csv('grocery_sales1.csv', index=False)
print('grocery_sales1.csv file has been created.')
```
Explain why the above code returns an empty DataFrame.
The code provided returns an empty DataFrame because `grocery_sales_df` is initialized with empty lists for each column, so no data from the SQL query results is ever placed into it.

Here are the key points to consider:

- **Initialization with empty lists:** The DataFrame `grocery_sales_df` is created with empty lists for each column, which means it starts with no data.
- **Lack of data population:** The comment `# Assuming grocery_sales_df is populated with the query results` suggests an expectation that the DataFrame will be filled with the query results, but no code actually performs this population.
- **The `grocery_sales` variable:** The variable `grocery_sales` appears to contain the actual data from the SQL query, as it has the correct data types and is not empty. However, it is never used to populate `grocery_sales_df`.

To fix this, populate `grocery_sales_df` with the data from `grocery_sales` before saving it to a CSV file.
Populate `grocery_sales_df` with data from `grocery_sales` and print the first five rows.
```python
import pandas as pd

# Populate grocery_sales_df with data from grocery_sales
grocery_sales_df = grocery_sales.copy()

# Print the first five rows of the DataFrame
print(grocery_sales_df.head())

grocery_sales_df.to_csv('grocery_sales2.csv', index=False)
```
Export the `grocery_sales` DataFrame to a CSV file.
```python
# Export the grocery_sales DataFrame to a CSV file.
# index=False: the table already has its own "index" column,
# so writing the pandas RangeIndex as well would be redundant.
grocery_sales.to_csv('grocery_sales.csv', index=False)
print(grocery_sales.head())
```
```python
import pandas as pd
import os

# Extract function is already implemented for you
def extract(store_data, extra_data):
    # Read the complementary data and join it to the sales data on "index"
    extra_df = pd.read_parquet(extra_data)
    merged_df = store_data.merge(extra_df, on="index")
    return merged_df

# Call the extract() function and store the result as the "merged_df" variable
merged_df = extract(grocery_sales, "extra_data.parquet")
print(merged_df.head())

# Check for missing values
print(merged_df.isna().sum())
```
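Since both sources share the `"index"` key, you can optionally have pandas verify that the join is one-to-one. This is a sketch of an extra sanity check, not part of the provided `extract()` function:

```python
# Optional sanity check: validate="one_to_one" makes pandas raise a
# MergeError if "index" is duplicated on either side of the join
validated_df = grocery_sales.merge(
    pd.read_parquet("extra_data.parquet"),
    on="index",
    validate="one_to_one",
)
```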
```python
# Create the transform() function with one parameter: "raw_data"
def transform(raw_data):
    # Fill missing numerical values with the column mean
    raw_data.fillna(
        {
            'CPI': raw_data['CPI'].mean(),
            'Weekly_Sales': raw_data['Weekly_Sales'].mean(),
            'Unemployment': raw_data['Unemployment'].mean()
        }, inplace=True
    )
    # Convert the Date column to datetime
    raw_data["Date"] = pd.to_datetime(raw_data["Date"], format="%Y-%m-%d")
    # Extract the month value from the date
    raw_data['Month'] = raw_data['Date'].dt.month
    # Keep only rows where Weekly_Sales > 10,000
    raw_data = raw_data.loc[raw_data['Weekly_Sales'] > 10000, :]
    # Drop the columns that are not required in clean_data
    raw_data = raw_data.drop(["index", "Temperature", "Fuel_Price", "MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5", "Type", "Size", "Date"], axis=1)
    return raw_data

clean_data = transform(merged_df)
print(clean_data.head())
```
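What remains is the aggregation and the load step. A minimal sketch that reuses the groupby from earlier and writes both DataFrames to CSV; the file names and the `os.path.exists` check are assumptions, since the brief only says to save both as CSV files:

```python
# Compute the monthly aggregate (as sketched earlier)
agg_data = clean_data.groupby("Month")["Weekly_Sales"].mean().reset_index()

# Load step: write both DataFrames without the pandas index
# (file names are assumptions; the brief does not specify them)
clean_data.to_csv("clean_data.csv", index=False)
agg_data.to_csv("agg_data.csv", index=False)

# Simple validation that the files were written, using the os import above
for path in ("clean_data.csv", "agg_data.csv"):
    print(path, "exists:", os.path.exists(path))
```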