Project: Analyzing Online Sports Revenue

Sports clothing and athleisure attire is a huge industry, worth approximately $193 billion in 2021 with a strong growth forecast over the next decade!

In this notebook, you will undertake the role of a product analyst for an online sports clothing company. The company is specifically interested in how it can improve revenue. You will dive into product data such as pricing, reviews, descriptions, and ratings, as well as revenue and website traffic, to produce recommendations for its marketing and sales teams.

You've been provided with five datasets to investigate:

info.csv
finance.csv
reviews.csv
traffic.csv
brands.csv

The company has asked you to answer the following questions:

What is the volume of products and average revenue for Adidas and Nike products based on price quartiles?

Label products priced up to quartile one as "Budget", quartile 2 as "Average", quartile 3 as "Expensive", and quartile 4 as "Elite".
Store as a pandas DataFrame called adidas_vs_nike containing the following columns: "brand", "price_label", "num_products", and "mean_revenue".

Do any differences exist between the word count of a product's description and its mean rating?

Store the results as a pandas DataFrame called description_lengths containing the following columns: "description_length", "mean_rating", "num_reviews".

How does the volume of products and median revenue vary between clothing and footwear?

Create a pandas DataFrame called product_types containing the following columns: "num_clothing_products", "median_clothing_revenue", "num_footwear_products", "median_footwear_revenue".

# Start coding here... 
import pandas as pd

# Read in info.csv as a dataframe
info = pd.read_csv('info.csv')

# Read in finance.csv as a dataframe
finance = pd.read_csv('finance.csv')

# Read in reviews.csv as a dataframe
reviews = pd.read_csv('reviews.csv')

# Read in traffic.csv as a dataframe
traffic = pd.read_csv('traffic.csv')

# Read in brands.csv as a dataframe
brands = pd.read_csv('brands.csv')

info.head()

# Join the 5 datasets together
merged_df = info_df.merge(finance_df, on='product_id', how='outer')
merged_df = merged_df.merge(reviews_df, on='product_id', how='outer')
merged_df = merged_df.merge(traffic_df, on='product_id', how='outer')
merged_df = merged_df.merge(brands_df, on='product_id', how='outer')

# Display the merged dataframe
merged_df

# Remove null rows
merged_df = merged_df.dropna()

# Remove duplicate rows
merged_df = merged_df.drop_duplicates()

# Display the updated dataframe
merged_df

Q1: Vol and revenue based on brand and price point

# Calculate the quartiles of the listing price
quartiles = merged_df['listing_price'].quantile([0.25, 0.5, 0.75, 1])

# Create a function to assign price labels based on quartiles
def assign_price_label(price):
    if price <= quartiles[0.25]:
        return "Budget"
    elif price <= quartiles[0.5]:
        return "Average"
    elif price <= quartiles[0.75]:
        return "Expensive"
    else:
        return "Elite"

# Apply the function to create the price_label column
merged_df['price_label'] = merged_df['listing_price'].apply(assign_price_label)

# Display the updated dataframe
merged_df

# Group by brand, price_label, and product_id and calculate the average revenue
adidas_vs_nike = merged_df.groupby(['brand', 'price_label']).agg({'revenue': 'mean'})

# Reset the index to make brand and price_label columns
adidas_vs_nike = adidas_vs_nike.reset_index()

# Rename the columns
adidas_vs_nike = adidas_vs_nike.rename(columns={'revenue': 'mean_revenue'})

# Calculate the count of products in each partition
adidas_vs_nike['num_products'] = merged_df.groupby(['brand', 'price_label']).size().reset_index(name='count')['count']

# Round the mean_revenue column to two decimal places
adidas_vs_nike['mean_revenue'] = adidas_vs_nike['mean_revenue'].round(2)

# Display the dataframe
adidas_vs_nike

Q2: Word count vs rating

# Find the largest description_length
max(merged_df["description"].str.len())

# Store the length of each description
merged_df["description_length"] = merged_df["description"].str.len()

# Upper description length limits
lengthes = [0, 100, 200, 300, 400, 500, 600, 700]

# Description length labels
labels = ["100", "200", "300", "400", "500", "600", "700"]

# Cut into bins
merged_df["description_length"] = pd.cut(merged_df["description_length"], bins=lengthes, labels=labels)

# Group by the bins
description_lengths = merged_df.groupby("description_length", as_index=False).agg(
    mean_rating=("rating", "mean"), 
    num_reviews=("reviews", "count")
).round(2)

description_lengths

Q3: How does the volume of products and median revenue vary between clothing and footwear?

# Filter the footwear products
footwear_products = merged_df[merged_df['description'].str.contains('shoe*|trainer*|foot*')]

# Calculate the number of footwear products
num_footwear_products = len(footwear_products)

# Filter for clothing products
clothing_products = merged_df[~merged_df.isin(shoes["product_id"])]

# Remove null product_id values from clothing DataFrame
clothing_products.dropna(inplace=True)

# Calculate the number of clothing products
num_clothing_products = len(clothing_products)

# Calculate the median revenue for footwear products
median_footwear_revenue = footwear_products['revenue'].median()

# Calculate the median revenue for clothing products
median_clothing_revenue = clothing_products['revenue'].median()

# Create the product_types dataframe
product_types = pd.DataFrame({'num_clothing_products': [num_clothing_products],
                              'median_clothing_revenue': [median_clothing_revenue],
                              'num_footwear_products': [num_footwear_products],
                              'median_footwear_revenue': [median_footwear_revenue]},index=[0])

product_types

# # Start coding here... 
# import pandas as pd

# # Read in the data
# info = pd.read_csv("info.csv")
# finance = pd.read_csv("finance.csv")
# reviews = pd.read_csv("reviews.csv")
# traffic = pd.read_csv("traffic.csv")
# brands = pd.read_csv("brands.csv")

# # Merge the data
# merged_df = info.merge(finance, on="product_id", how="outer")
# merged_df = merged_df.merge(reviews, on="product_id", how="outer")
# merged_df = merged_df.merge(traffic, on="product_id", how="outer")
# merged_df = merged_df.merge(brands, on="product_id", how="outer")

# # Drop null values
# merged_df.dropna(inplace=True)

# # Add price labels based on listing_price quartiles
# merged_df["price_label"] = pd.qcut(merged_df["listing_price"], q=4, labels=["Budget", "Average", "Expensive", "Elite"])

# # Group by brand and price_label to get volume and mean revenue
# adidas_vs_nike = merged_df.groupby(["brand", "price_label"], as_index=False).agg(
#     num_products=("price_label", "count"), 
#     mean_revenue=("revenue", "mean")
# ).round(2).reset_index(drop=True)

# # Find the largest description_length
# max(merged_df["description"].str.len())

# # Store the length of each description
# merged_df["description_length"] = merged_df["description"].str.len()

# # Upper description length limits
# lengthes = [0, 100, 200, 300, 400, 500, 600, 700]

# # Description length labels
# labels = ["100", "200", "300", "400", "500", "600", "700"]

# # Cut into bins
# merged_df["description_length"] = pd.cut(merged_df["description_length"], bins=lengthes, labels=labels)

# # Group by the bins
# description_lengths = merged_df.groupby("description_length", as_index=False).agg(
#     mean_rating=("rating", "mean"), 
#     num_reviews=("reviews", "count")
# ).round(2)

# # List of footwear keywords
# mylist = "shoe*|trainer*|foot*"

# # Filter for footwear products
# shoes = merged_df[merged_df["description"].str.contains(mylist)]

# # Filter for clothing products
# clothing = merged_df[~merged_df.isin(shoes["product_id"])]

# # Remove null product_id values from clothing DataFrame
# clothing.dropna(inplace=True)

# # Create product_types DataFrame
# product_types = pd.DataFrame({"num_clothing_products": len(clothing), 
#                               "median_clothing_revenue": clothing["revenue"].median(), 
#                               "num_footwear_products": len(shoes), 
#                               "median_footwear_revenue": shoes["revenue"].median()}, 
#                               index=[0])