Walmart is the largest retailer in the United States, and like many others it has been expanding the e-commerce side of its business. By the end of 2022, e-commerce represented a roaring $80 billion in sales, about 13% of Walmart's total sales. One of the main factors affecting sales is public holidays, such as the Super Bowl, Labor Day, Thanksgiving, and Christmas.
In this project, you have been tasked with creating a data pipeline for the analysis of supply and demand around the holidays, along with conducting a preliminary analysis of the data. You will be working with two data sources: grocery sales and complementary data. You have been provided with the `grocery_sales` table in a PostgreSQL database, with the following features:

- `"index"` - unique ID of the row
- `"Store_ID"` - the store number
- `"Date"` - the week of sales
- `"Weekly_Sales"` - sales for the given store
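In the project environment, the result of your SQL query is exposed to Python as a pandas DataFrame named `grocery_sales`. Outside that environment you could load the table yourself; here is a minimal sketch, assuming a SQLAlchemy engine (the connection string below is a placeholder, not the project's actual credentials):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string: user, password, host, and database
# name are assumptions, not values provided by the project.
engine = create_engine("postgresql://user:password@localhost:5432/walmart")
grocery_sales = pd.read_sql("SELECT * FROM grocery_sales", engine)
print(grocery_sales.head())
```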
Also, you have the `extra_data.parquet` file that contains complementary data:

- `"IsHoliday"` - whether the week contains a public holiday (1 if yes, 0 if no)
- `"Temperature"` - temperature on the day of sale
- `"Fuel_Price"` - cost of fuel in the region
- `"CPI"` - prevailing consumer price index
- `"Unemployment"` - prevailing unemployment rate
- `"MarkDown1"`, `"MarkDown2"`, `"MarkDown3"`, `"MarkDown4"` - number of promotional markdowns
- `"Dept"` - department number in each store
- `"Size"` - size of the store
- `"Type"` - type of the store (depends on the `Size` column)
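Before building the pipeline, it can help to preview the complementary data with `pandas.read_parquet`; note this requires a parquet engine such as `pyarrow` or `fastparquet` to be installed:

```python
import pandas as pd

# Peek at the complementary data: column types and the first few rows
extra_df = pd.read_parquet("extra_data.parquet")
print(extra_df.dtypes)
print(extra_df.head())
```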
You will need to merge those files and perform some data manipulations. The transformed DataFrame can then be stored as the `clean_data` variable containing the following columns:

- `"Store_ID"`
- `"Month"`
- `"Dept"`
- `"IsHoliday"`
- `"Weekly_Sales"`
- `"CPI"`
- `"Unemployment"`
After merging and cleaning the data, you will have to analyze the monthly sales of Walmart and store the results of your analysis as the `agg_data` variable, which should look like:

| Month | Weekly_Sales |
|---|---|
| 1.0 | 33174.178494 |
| 2.0 | 34333.326579 |
| ... | ... |
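The aggregation itself reduces to grouping `clean_data` by `Month` and averaging `Weekly_Sales`. A minimal sketch, assuming `clean_data` has already been built as described above (the sample values look like per-month means, but the exact statistic is an assumption):

```python
# Average weekly sales per month; reset_index turns the
# grouped result back into a two-column DataFrame
agg_data = (
    clean_data.groupby("Month")["Weekly_Sales"]
    .mean()
    .reset_index()
)
print(agg_data.head())
```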
Finally, you should save `clean_data` and `agg_data` as CSV files. It is recommended to use `pandas` for this project.
```sql
-- Write your SQL query here
SELECT * FROM grocery_sales;
```
Convert the results of the query `SELECT * FROM grocery_sales` into a CSV file.
```python
import pandas as pd

grocery_sales_df = pd.DataFrame({
    'index': [],
    'Store_ID': [],
    'Date': [],
    'Dept': [],
    'Weekly_Sales': []
})

# Assuming grocery_sales_df is populated with the query results
# Save the DataFrame to a CSV file
grocery_sales_df.to_csv('grocery_sales1.csv', index=False)
print('grocery_sales1.csv file has been created.')
```
Explain why the above code returns an empty DataFrame.
The code provided returns an empty DataFrame because `grocery_sales_df` is initialized with empty lists for each column, so no data from the SQL query results is ever placed into it.

Here are the key points to consider:

- **Initialization with empty lists:** The DataFrame `grocery_sales_df` is created with empty lists for each column, which means it starts with no data.
- **Lack of data population:** The comment `# Assuming grocery_sales_df is populated with the query results` suggests an expectation that the DataFrame will be filled with the query results, but no code actually performs this population.
- **The `grocery_sales` variable:** The variable `grocery_sales` appears to contain the actual data from the SQL query, as it has the correct data types and is not empty. However, it is never used to populate `grocery_sales_df`.

To fix this, populate `grocery_sales_df` with the data from `grocery_sales` before saving it to a CSV file.
Populate `grocery_sales_df` with data from `grocery_sales` and print the first five rows.
```python
import pandas as pd

# Populate grocery_sales_df with data from grocery_sales
grocery_sales_df = grocery_sales.copy()

# Print the first five rows of the DataFrame
print(grocery_sales_df.head())

grocery_sales_df.to_csv('grocery_sales2.csv', index=False)
```
Export the `grocery_sales` DataFrame to a CSV file.
```python
# Export the grocery_sales DataFrame to a CSV file.
# index=False: the table already has its own "index" column,
# so writing the pandas RangeIndex as well would be redundant.
grocery_sales.to_csv('grocery_sales.csv', index=False)
print(grocery_sales.head())
```
```python
import pandas as pd
import os

# Extract function is already implemented for you
def extract(store_data, extra_data):
    # Read the complementary data and join it to the sales data on "index"
    extra_df = pd.read_parquet(extra_data)
    merged_df = store_data.merge(extra_df, on="index")
    return merged_df

# Call the extract() function and store the result as the "merged_df" variable
merged_df = extract(grocery_sales, "extra_data.parquet")
print(merged_df.head())

# Check for missing values
print(merged_df.isna().sum())
```
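Since both sources share the `"index"` key, you can optionally have pandas verify that the join is one-to-one. This is a sketch of an extra sanity check, not part of the provided `extract()` function:

```python
# Optional sanity check: validate="one_to_one" makes pandas raise a
# MergeError if "index" is duplicated on either side of the join
validated_df = grocery_sales.merge(
    pd.read_parquet("extra_data.parquet"),
    on="index",
    validate="one_to_one",
)
```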
```python
# Create the transform() function with one parameter: "raw_data"
def transform(raw_data):
    # Fill missing numerical values with the column mean
    raw_data.fillna(
        {
            'CPI': raw_data['CPI'].mean(),
            'Weekly_Sales': raw_data['Weekly_Sales'].mean(),
            'Unemployment': raw_data['Unemployment'].mean()
        }, inplace=True
    )
    # Convert the Date column to datetime
    raw_data["Date"] = pd.to_datetime(raw_data["Date"], format="%Y-%m-%d")
    # Extract the month value from the date
    raw_data['Month'] = raw_data['Date'].dt.month
    # Keep only rows where Weekly_Sales > 10,000
    raw_data = raw_data.loc[raw_data['Weekly_Sales'] > 10000, :]
    # Drop the columns that are not required in clean_data
    raw_data = raw_data.drop(["index", "Temperature", "Fuel_Price", "MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5", "Type", "Size", "Date"], axis=1)
    return raw_data

clean_data = transform(merged_df)
print(clean_data.head())
```
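What remains is the aggregation and the load step. A minimal sketch that reuses the groupby from earlier and writes both DataFrames to CSV; the file names and the `os.path.exists` check are assumptions, since the brief only says to save both as CSV files:

```python
# Compute the monthly aggregate (as sketched earlier)
agg_data = clean_data.groupby("Month")["Weekly_Sales"].mean().reset_index()

# Load step: write both DataFrames without the pandas index
# (file names are assumptions; the brief does not specify them)
clean_data.to_csv("clean_data.csv", index=False)
agg_data.to_csv("agg_data.csv", index=False)

# Simple validation that the files were written, using the os import above
for path in ("clean_data.csv", "agg_data.csv"):
    print(path, "exists:", os.path.exists(path))
```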