Skip to content

Sports clothing and athleisure attire is a huge industry, worth approximately $193 billion in 2021 with a strong growth forecast over the next decade!

In this notebook, you will undertake the role of a product analyst for an online sports clothing company. The company is specifically interested in how it can improve revenue. You will dive into product data such as pricing, reviews, descriptions, and ratings, as well as revenue and website traffic, to produce recommendations for its marketing and sales teams.

You've been provided with four datasets to investigate:

brands.csv

ColumnsDescription
product_idUnique product identifier
brandBrand of the product

finance.csv

ColumnsDescription
product_idUnique product identifier
listing_priceOriginal price of the product
sale_priceDiscounted price of the product
discountDiscount off the listing price, as a decimal
revenueRevenue generated by the product

info.csv

ColumnsDescription
product_nameName of the product
product_idUnique product identifier
descriptionDescription of the product

reviews.csv

ColumnsDescription
product_idUnique product identifier
ratingAverage product rating
reviewsNumber of reviews for the product
import pandas as pd
import numpy as np

brands = pd.read_csv("brands.csv") 
finance = pd.read_csv("finance.csv")
info = pd.read_csv("info.csv")
reviews = pd.read_csv("reviews.csv")

# Start coding here...
merged_df = brands.merge(finance, on='product_id')
merged_df = merged_df.merge(info, on='product_id')
merged_df = merged_df.merge(reviews, on='product_id')
merged_df = merged_df.dropna()

# Question 1 - What is the volume of products and average revenue for Adidas and Nike products based on listing price quartiles

# assign the quartiles to variables to be used inside your if-function
first_quartile = merged_df['listing_price'].quantile(0.25)
second_quartile = merged_df['listing_price'].quantile(0.50)
third_quartile = merged_df['listing_price'].quantile(0.75)
fourth_quartile = merged_df['listing_price'].quantile(1.00)

def brand_label(row):
    if row['listing_price'] <= first_quartile:
        return 'Budget'
    elif row['listing_price'] <= second_quartile:
        return 'Average'
    elif row['listing_price'] <= third_quartile:
        return 'Expensive'
    elif row['listing_price'] <= fourth_quartile:
        return 'Elite'

merged_df['price_label'] = merged_df.apply(brand_label, axis=1)

adidas_vs_nike = merged_df.groupby(['brand', 'price_label'], as_index=False).agg({'product_id': 'count', 'revenue': 'mean'})
adidas_vs_nike = adidas_vs_nike.rename(columns={'product_id': 'num_products', 'revenue': 'mean_revenue'})
adidas_vs_nike['mean_revenue'] = adidas_vs_nike['mean_revenue'].round(2)

print('Question 1: \n\nThe table below shows that Elite Adidas products are more highly favoured in the market; while consumers buy more of the budget Nike products.')
print(adidas_vs_nike)

# Question 2 - Do any differences exist between the word count of a product's description and its mean rating

merged_df['description_length'] = merged_df['description'].apply(len)

# Upper description length limits
bins = [0, 100, 200, 300, 400, 500, 600, 700]

# Description length labels
labels = ["100", "200", "300", "400", "500", "600", "700"]

# Cut into bins
merged_df["description_length"] = pd.cut(merged_df["description_length"], bins=bins, labels=labels)

# Group by the bins
description_lengths = merged_df.groupby("description_length", as_index=False).agg(
    mean_rating=("rating", "mean"), 
    num_reviews=("reviews", "count")
).round(2)

print('\nQuestion 2: \n\nThe description_length df shows that there are differences between the word count of a product description and its mean rating. \n\nGenerally, the more descriptive a product, the higer the mean rating. \n\nHowever, the change in rating is not proportionate to the descriptiveness or otherwise of the product.')
print(description_lengths)