Source: https://www.kaggle.com/datasets/nishchay331/retail-store/data
The insights of the clothes retail data
Introduction
Clothes Retail Dataset represents a fictional men's clothes store chain. It has the main root store where online orders are processed, and multiple on-site stores split into geographical zones. A CEO wants to know general trends and hidden insights about sales to make right strategic decisions.
In this data analysis project, I will answer the business questions, such as:
- How much do people spend in these stores on average?
- What are the products with the highest sales?
- How does the revenue change throughout a year?
- Which zone(s) has the highest sales?
In this project, I used PostgreSQL for data preparation and cleaning, PowerBI for analysis and visualization. The Datacamp Workspace is used for building and formatting.
This data analysis and visualizations are based on publicly available Kaggle dataset. The data in a CSV format can be downloaded here
Data Preparation and Cleaning
First thing before the data cleaning process I created a table to hold the data and imported a CSV file. Then I validated all the data, column by column:
- user_id: A unique identifier of each customer who made a transaction. There were no null values. Duplicates are possible because each row represents one item, and a customer can make multiple transactions.
- bill_id: A unique combination of digits for each bill. No null values detected. Similiarly to the previous column, it is possible to have multiple items within one bill.
- line_item_amount: Amount paid for each individual item. There were no negative values, but 3535 records with 0 values found. I rounded all the values to 2 decimal points. Next, I deleted rows with values less than 1 as outliers. No null values found.
- bill_discount: A discount applied to each bill. Duplicates are possible, no null values detected. The values have been rounded to 0 decimal points.
- transaction_date: A timestamp containing information only about a day when an item was sold. Although no changes were made to the column itself, for the futher analysis a new column was created - day_of_week with data type varchar.
- description: A detailed description for each item. I removed 4 rows with null values.
- inventory_category: This column required the most of exploration and cleaning. First, I deleted 208,091 rows containing no information at all. Then, I assigned a category based on a keyword the description column contains (e.g. 'shirt' means the category should be 'SHIRT', 'sweater' - 'SWEATERS' and so on). This way I obtainted the following categories: SHIRTS, JERSEYS, JACKETS, TROUSERS, BELTS, SWEATERS, SOCKS, SHORTS, WALLETS, JOGGERS, DENIM, FOOTWEAR and MASKS. Finally, I deleted most of the records, keeping only 125,805 of them. The reason is that based on the description it was impossible to get the information about the item's category, size or color. It would be useless to keep this part of data for the analysis.
- colour: Represents a color of each item. I changed all null values to NOT CAPTURED, which affected 1367 rows.
- size: The information about item's size. Similiarly to the colour column, I replaced 1367 null values with NOT CAPTURED.
- zone_name: The area where a store is located. One of the following: ROOT, Central, North, East, South, West. No null values found thus no changes made to this column.
- store_name: Contains the information about the area plus 4-digit store code. I found no null values and left this column without changes.
- year: This column was deleted because every row contained the same value - '2022'.
-- Create a table to hold the data DROP TABLE IF EXISTS clothes_retail_salesdata_2022; CREATE TABLE clothes_retail_salesdata_2022 ( user_id varchar, bill_id varchar, line_item_amount double precision, bill_discount double precision, transaction_date date, description varchar, inventory_category varchar, colour varchar, size varchar, zone_name varchar, store_name varchar, year varchar ); -- Import data from CSV into the table (replace 'your_file_path' with the actual path) COPY clothes_retail_salesdata_2022 FROM './retail_filtered_salesdata_2022.csv' DELIMITER ',' CSV HEADER; -- Update the table with the rounded line_item_amount UPDATE clothes_retail_salesdata_2022 SET line_item_amount = ROUND(line_item_amount,2); -- Remove records where line_item_amount is less than 1 DELETE FROM clothes_retail_salesdata_2022 WHERE line_item_amount <1; -- Update the table with the rounded bill_discount UPDATE clothes_retail_salesdata_2022 SET bill_discount = ROUND(bill_discount,0); -- Add a new column to our table ALTER TABLE clothes_retail_salesdata_2022 ADD COLUMN day_of_week varchar; -- Compute data about day of the week UPDATE clothes_retail_salesdata_2022 SET day_of_week = DAYNAME(transaction_date)::varchar; -- Remove null values from the description column DELETE FROM clothes_retail_salesdata_2022 WHERE description IS NULL; -- Remove null values from the inventory_category column DELETE FROM clothes_retail_salesdata_2022 WHERE inventory_category IS NULL; -- Clean categorical data in the inventory_category column UPDATE clothes_retail_salesdata_2022 SET inventory_category = 'SHIRTS' WHERE inventory_category ILIKE '%shirt%'; UPDATE clothes_retail_salesdata_2022 SET inventory_category = 'JERSEYS' WHERE inventory_category ILIKE '%jersey%'; UPDATE clothes_retail_salesdata_2022 SET inventory_category = 'JACKETS' WHERE inventory_category ILIKE '%jacket%'; UPDATE clothes_retail_salesdata_2022 SET inventory_category = 'TROUSERS' WHERE inventory_category ILIKE '%trouser%'; UPDATE clothes_retail_salesdata_2022 SET inventory_category = 'BELTS' WHERE inventory_category ILIKE '%belt%'; UPDATE clothes_retail_salesdata_2022 SET inventory_category = 'SWEATERS' WHERE inventory_category ILIKE '%sweater%'; UPDATE clothes_retail_salesdata_2022 SET inventory_category = 'SOCKS' WHERE inventory_category ILIKE '%sock%'; UPDATE clothes_retail_salesdata_2022 SET inventory_category = 'SHORTS' WHERE inventory_category ILIKE '%shorts%'; UPDATE clothes_retail_salesdata_2022 SET inventory_category = 'WALLETS' WHERE inventory_category ILIKE '%wallet%'; UPDATE clothes_retail_salesdata_2022 SET inventory_category = 'JOGGERS' WHERE inventory_category ILIKE '%jogger%'; UPDATE clothes_retail_salesdata_2022 SET inventory_category = 'DENIM' WHERE inventory_category ILIKE '%denim%'; UPDATE clothes_retail_salesdata_2022 SET inventory_category = 'FOOTWEAR' WHERE inventory_category ILIKE '%shoe%' OR inventory_category ILIKE '%boot%'; UPDATE clothes_retail_salesdata_2022 SET inventory_category = 'MASKS' WHERE inventory_category ILIKE '%mask%'; -- Delete the rest of records where the item category cannot be determined DELETE FROM clothes_retail_salesdata_2022 WHERE LEN(inventory_category) >10; -- Assign values to null records of the colour column UPDATE clothes_retail_salesdata_2022 SET colour = 'NOT CAPTURED' WHERE colour IS NULL OR colour = 'NOT-CAPTURED'; -- Assign values to null records of the size column UPDATE clothes_retail_salesdata_2022 SET size = 'NOT CAPTURED' WHERE size IS NULL OR size = 'NOT-CAPTURED'; -- Delete the unnecessary column year ALTER TABLE clothes_retail_salesdata_2022 DROP COLUMN year; -- Query the modified data SELECT COUNT(*) FROM clothes_retail_salesdata_2022;
Analysis and Visualization
General insights
I started to leverage PowerBI by getting the general idea how our data is distributed. Based on the cleaned data, the total revenue reached 138,373,178.16 for the entire year. There were 63,012 transactions made, and almost 47,000 individual customers who purchased from the store chain.
Considering how many items people usually buy in one go, the distribution is highly right-skewed. That means roughly a half of all transactions have only one item in a bill, followed by 2 items (24.7% of transactions) and 3 items (8% of transactions) in a bill. Merely 0.7% is made of transactions with more than 10 items in each bill.
The average amount per transaction is distributed with similiar patterns. The median value is 1499, whereas the average is 2196. We can conclude that in general people tend to have small bills with few items.
Item characteristics
The revenue is unequally distributed among all clothes departments. The SHIRTS department accounts for more than two thirds of total sales (71.7%), followed by JOGGERS (20.2%) and JERSEYS (2.1%). The percentage of other departments is shown on a treemap below:
In the most of records the size of an item wasn't captured. Based on the fraction of observations containing this information, we can draw a conclusion that M (medium) might have been the most preferred size. The second most popular was L (large), followed by S (small) and XL (extra large).
Similiarly to size distribution, the information about item's color wasn't available for the most of the data. However, the graph below demonstrates insights about each color's popularity:
Indigo (shade of dark blue) was the most popular among sold items - 21.84% (records with color NOT CAPTURED were not taken into account). The other colors from top 10 were: black, charcoal, light grey, tan, dark grey, combo 2, dark denim, red and khaki.
More detailed information about colors can be viewed via this link to the dashboard:
Distribution in time
In terms of timeframe, the first half of 2022 recorded less sales than the second half of the year. The lowest sales were during the 1st quarter (January-March) - 31.1M. The 4th quarter was the most profitable - 38.37M.
When drilled one layer down, the distribution by each month becomes visible. July and October had the biggest revenue. It might be associated with broad sales of goods on reduction.
The data was also aggregated by days of the week. Weekend was the most profitable time of the week, with Sundays having almost 26M in total, followed by Saturdays with 22.64M. The weekdays had similiar values on average, from Monday to Friday.
In the link below you can explore further how the revenue is distributed across months, weeks and individual days. Have fun!
Store zone participation
Since the analyzed data comes from the store chain, each store had a unique identifier and a zone it belongs to (one of North, East, South, West, Central or ROOT). The further analysis revealed a big differences between these store zones in terms of revenue.
The vast majority (71.67%) of sales comes from the root e-commerce store, where orders were placed and processed via Internet. Comparing the revenue between stores on-site, the south stores share 68.1% of all stores combined. The next best selling zone, north, has mere 12.4% of participation.
Similiarly to the time period analysis, the chart above can be easily drilled down to see the performance of each individual store, via this link:
Key Findings
- The most popular number of items in one bill is 1;
- Based on the analyzed data, the most frequent choice would be: a shirt, size M, color navy;
- The 4th quarter of the year was the most profitable, weekends being more busy compared to weekdays;
- Almost 3/4 of the entire revenue was made by selling online through the root e-commerce store.
Recommendations
- Focus on putting small products like socks or wallets near the checkout points - people tend to grab additional items while waiting in a line;
- Shirts are the most popular type of clothes based on analyzed data. Creating more sales or new collections on it can boost store sales;
- Allocate more novelties during the 1st quarter of the year - the least busy time;
- Offer a small discounts for buying on-site - it might help reduce the number of orders on the main e-commerce store.
Thank you for reading till the very end! Your feedback and sharing are greatly appreciated.