Skip to content

Product Sales

Background

  • Pens and Printer launched a new product line of office stationery 6 weeks ago.
  • As launching is expensive, we need to make sure we are using the best techniques to sell for each new product.

Methods:

  • Email: Customers in this group received an email when the product line was launched, and a further email three weeks later. This required very little work for the team.
  • Call: Customers in this group were called by a member of the sales team. On average members of the team were on the phone for around thirty minutes per customer.
  • Email and call: Customers in this group were first sent the product information email, then called a week later by the sales team to talk about their needs and how this new product may support their work. The email required little work from the team, the call was around ten minutes per customer.

Questions

  • How many customers were there for each approach?
  • What does the spread of the revenue look like overall? And for each method?
  • Was there any difference in revenue over time for each of the methods?
  • Based on the data, which method would you recommend we continue to use? Some of these methods take more time from the team so they may not be the best for us to use if the results are similar.
  • We don’t really know if there are other differences between the customers in each group, so anything you can tell us would be really helpful to give some context to what went well.
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Load the dataset from the provided URL
dataset_url = "https://s3.amazonaws.com/talent-assets.datacamp.com/product_sales.csv"
df = pd.read_csv(dataset_url)

# Set up Seaborn for better visualization aesthetics
sns.set_context('notebook')  # Set the context to 'notebook' for better readability
sns.set()  # Override pandas plot with seaborn style
sns.set_style('whitegrid')  # Set the style to 'whitegrid' for a clean look

Data Validation

The dataset contains 15000 rows and 8 columns before cleaning and validataion. I have validated all the columns against the criteria in the dataset table:

  • week: 6 unique values without missing values, from 1 to 6, same as product launching timeline. No cleaning is needed.
  • sales_method: additionally to 3 original categories had 2 misspelled categories, which were mapped according to the original ones.
  • customer_id: all values are unique with the same lenght without missing values. No cleaning is needed.
  • nb_sold: numeric values without missing values, from 7 to 16. No cleaning is needed.
  • revenue: numeric values with 1074 missing values. If we want to make non-biased conclusions, it's better not to impute them now, because we don't know product for each row, while each product has different price. Missing values were dropped.
  • years_as_customer: numeric values without missing values, however 2 values - 47 and 63 - don't make sense, as counting from our foundation year 1984 they are in the future. Imputed with the number of years from 1984 to today - 41.
  • nb_site_visits: numeric values without missing values, between 12 and 41. No cleaning is needed.
  • state: 50 possible values without missing values. No cleaning is needed.

After the data validation, the dataset contains 13926 rows and 8 columns without missing values.

# Replace misspelled sales_method categories with correct ones
sales_method_mapping = {
    'em + call': "Email + Call",
    'email': "Email"
}
df = df.replace(sales_method_mapping)

# Impute inappropriate years_as_customer values (47 and 63) with 41 which is biggest possible value
invalid_years = [47, 63]
imputed_value = 41
df.loc[df.years_as_customer.isin(invalid_years), 'years_as_customer'] = imputed_value
# Create a figure and axis for the plot
fig, ax = plt.subplots(figsize=(16, 8))

# Generate a bar plot to visualize missing values in the dataframe
msno.bar(df, ax=ax)

# Set the title and labels with appropriate font sizes for better readability
ax.set_title("Our data has missing values only in the revenue column", fontsize=32)
ax.set_xlabel("Data Columns", fontsize=24)
ax.set_ylabel("Non-Null Count", fontsize=24)

# Save the plot as a PNG file with tight bounding box
plt.savefig("missing_data.png", bbox_inches='tight')

# Display the plot
plt.show()
# Plot the count of missing revenue entries by sales method
ax = df[df.revenue.isna()].sales_method.value_counts().plot(
    kind='bar', label="Missing Revenue", color='grey'
)

# Overlay the total count of entries by sales method
df.sales_method.value_counts().plot(
    kind='bar', label="Total Entries", edgecolor='black', fill=False, ax=ax
)

# Adding percentage labels inside the bars for clarity
labels = {'Email': '7.3%', 'Call': '3.6%', 'Email + Call': '13.6%'}
for label, patch in zip(labels, ax.patches[:3]):
    height = patch.get_height()
    percentage_label = labels.get(label)
    ax.annotate(
        percentage_label, 
        (patch.get_x() + patch.get_width() / 2., height),
        ha='center', va='center', 
        xytext=(0, 9), textcoords='offset points'
    )

# Adding legend, title, and axis labels for better understanding
plt.legend()
plt.title("Proportion of Sold Products with Missing Revenues for Each Sales Method")
plt.xlabel("Sales Method")
plt.ylabel("Number of Customers")
plt.grid(False)
plt.xticks(rotation=0)

# Save the plot as a PNG file
plt.savefig("prop_missing.png")

# Display the plot
plt.show()
# Remove rows with missing revenue values and reset the index
df_cleaned = df.dropna(subset=['revenue']).reset_index(drop=True)

# Display information about the cleaned DataFrame
df_cleaned.info()

EDA

# Create a count plot for the number of customers by sales method
ax = sns.countplot(data=df_cleaned, x='sales_method')

# Set the labels and title for the plot
ax.set(
    xlabel="Sales Method", 
    ylabel="Number Of Customers", 
    title="Number Of Customers For Each Sales Technique"
)

# Save the plot as a high-resolution PNG file
plt.savefig("sales_methods.png", dpi=300)

# Display the plot
plt.show()
# Create a histogram to visualize the distribution of revenue
ax0 = sns.histplot(data=df_cleaned, x='revenue', alpha=0.9)

# Set the labels and title for the plot
ax0.set(
    ylabel="Number Of Customers", 
    xlabel="Revenue", 
    title="The Distribution of Revenue"
)

# Save the plot as a high-resolution PNG file
plt.savefig("revenue_dist.png", dpi=300)

# Display the plot
plt.show()
# Set the color palette to 'colorblind' for better accessibility
sns.set_palette('colorblind')

# Create a figure with two subplots arranged vertically
fig, (ax0, ax1) = plt.subplots(2, 1)

# Create a histogram to visualize the distribution of revenue, colored by sales method
ax0 = sns.histplot(
    data=df_cleaned, 
    x='revenue', 
    hue='sales_method', 
    alpha=0.9, 
    ax=ax0,
    legend=False
)

# Create a boxplot to visualize the spread of revenue for each sales method
ax1 = sns.boxplot(
    data=df_cleaned, 
    x='revenue', 
    y='sales_method', 
    hue='sales_method',
    legend=True, 
    ax=ax1
)

# Set labels for the histogram
ax0.set(
    ylabel="Number Of Customers"
)

# Set labels for the boxplot
ax1.set(
    xlabel="Revenue", 
    ylabel=None
)

# Remove y-axis ticks from the boxplot for a cleaner look
ax1.set_yticks([])

# Set the overall title for the figure
fig.suptitle("Revenue Spread For Each Sales Method")

# Save the figure as a high-resolution PNG file
plt.savefig("revenue_range.png", dpi=300)

# Display the figure
plt.show()
# Create a bar plot to visualize the average number of products sold by sales method
ax = sns.barplot(data=df_cleaned, y="nb_sold", x='sales_method')

# Set the labels and title for the plot
ax.set(
    xlabel="Sales Method",
    ylabel="Average Number Of Products Sold",
    title="Combined Sales Method Has Biggest Average Number Of Products Sold"
)

# Save the plot as a high-resolution PNG file
plt.savefig("nb_sold.png", dpi=300)

# Display the plot
plt.show()
# Set the figure size for the plot
plt.figure(figsize=(10, 8))

# Create a bar plot to visualize the total revenue for each sales method
ax = sns.barplot(
    data=df_cleaned,
    x='sales_method',
    y='revenue',
    estimator=sum  # Sum the revenue for each sales method
)

# Set the title and labels for the plot
ax.set(
    title="Total Revenue For Each Sales Method",
    xlabel="Sales Method",
    ylabel="Total Revenue"
)

# Save the plot as a high-resolution PNG file
plt.savefig("revenue_method.png", dpi=300)

# Display the plot
plt.show()
# Create a scatter plot to visualize the relationship between the number of products sold and revenue, colored by sales method
ax = sns.scatterplot(
    data=df_cleaned,
    y='revenue',
    x='nb_sold',
    hue='sales_method'
)

# Set the title and labels for the plot
ax.set(
    title="Higher-Priced Products Are Sold With The Combined Method",
    xlabel="Number Of Products Sold",
    ylabel="Revenue"
)

# Save the plot as a high-resolution PNG file
plt.savefig("revenue_per_product.png", dpi=300)

# Display the plot
plt.show()