Analysing Product Sales
Data Analyst Professional Practical Exam Submission


    📝 Task List

    Your written report should include written text summaries and graphics of the following:

    • Data validation:
      • Describe validation and cleaning steps for every column in the data
    • Exploratory Analysis:
      • Include two different graphics showing single variables only to demonstrate the characteristics of data
      • Include at least one graphic showing two or more variables to represent the relationship between features
      • Describe your findings
    • Definition of a metric for the business to monitor
      • How should the business use the metric to monitor the business problem
      • Can you estimate initial value(s) for the metric based on the current data
    • Final summary including recommendations that the business should undertake


    First, let's import the dataset into a pandas DataFrame and get a view of what we're working with.

    # Importing the dataset into a DataFrame
    import pandas as pd
    sales = pd.read_csv('product_sales.csv')
    print(sales)
    print(sales.columns)

    Validating Data

    Here we can see that there are 15,000 rows with the following column names and issues:

    1. week - no issues. Integer values 1-6, no null values found.
    2. sales_method - two stray values, 'em + call' and 'email', were being counted separately from their proper labels. They were recoded with replace, leaving 3 unique values: 'Email', 'Email + Call', 'Call'.
    3. customer_id - no issues; all 15,000 values are unique and non-null.
    4. revenue - 1,074 missing values. It will be worth a deeper dive to look for patterns within the missing set.
    5. years_as_customer - two outliers: customers recorded as patrons for 40 and 63 years, which is implausible given the company was established in 1984. The histogram reveals a right-skewed distribution.
    6. nb_site_visits - no correction required. The histogram most closely follows a Normal distribution.
    7. state - no correction required.
    #Validating the data
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    #Finding columns with null values-  While we can see that the data types match the description provided, the revenue column has null values
    print(sales.info())
    
    #Validate Week- This refers to the weeks in question numbered 1-6. 
    print(sales['week'].unique())
    
    #Validate sales_method - 'em + call' and 'email' are recoded as 'Email + Call' and 'Email' to consolidate the unique values
    print(sales['sales_method'].unique())
    sales['sales_method'] = sales['sales_method'].replace(['em + call','email'],['Email + Call','Email'])
    print(sales['sales_method'].unique())
    
    #Validate customer_id- 15000 unique ids
    print(sales['customer_id'].unique().shape)
    
    #Validate nb_sold (new products sold) - 10 unique non-null values. No correction required
    print(sales['nb_sold'].unique())
    
    #Validate revenue- 1074 Null values found. Imputing missing data below
    print(sales['revenue'].describe())
    print(sales['revenue'].isnull().sum())
    
    #Validate years_as_customer- Describe shows that most values tend to be under 10. There are 2 outliers.
    print(sales['years_as_customer'].describe())
    sns.boxplot(data=sales['years_as_customer'])
    plt.show()
    print(sales[sales['years_as_customer']>40])
    sns.histplot(sales['years_as_customer'])
    plt.show()
    
    #Validate nb_site_visits and state - no correction required
    print(sales['nb_site_visits'].describe())
    print(sales['nb_site_visits'].info())
    print(sales['nb_site_visits'].unique())
    print(sales['state'].describe())
    print(sales['state'].info())
    print(sales['state'].unique())
    sns.histplot(sales['nb_site_visits'])
    plt.show()
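The two implausible years_as_customer values could be corrected rather than dropped. A minimal sketch of one option, capping tenure at the company's maximum possible age (DATA_YEAR is an assumption standing in for the actual collection year, and demo stands in for the real sales frame):

```python
import pandas as pd

# Hypothetical sample mirroring the two outliers flagged above
demo = pd.DataFrame({'years_as_customer': [3, 10, 40, 63]})

# The company was established in 1984, so tenure cannot exceed its age.
# DATA_YEAR is an assumption -- replace it with the year the data was collected.
COMPANY_FOUNDED = 1984
DATA_YEAR = 2024
max_tenure = DATA_YEAR - COMPANY_FOUNDED

# Cap any tenure above the maximum possible value
demo['years_as_customer'] = demo['years_as_customer'].clip(upper=max_tenure)
print(demo['years_as_customer'].tolist())
```

Whether to cap, drop, or leave these two rows is a judgment call; capping keeps the rows available for the revenue analysis.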

    Imputing missing revenue

    Further analysis shows a positive correlation between nb_sold and revenue within each sales method, so the missing revenue values can be predicted from these columns and imputed.

    # Check the relationship between revenue column and nb_sold column by sales_method Email
    sales_Email = sales[sales['sales_method'] == 'Email']
    print(sales_Email[['revenue','nb_sold']].corr())
    # Check the relationship between revenue column and nb_sold column by sales_method Email + Call
    sales_Email_Call = sales[sales['sales_method'] == 'Email + Call']
    print(sales_Email_Call[['revenue','nb_sold']].corr())
    # Check the relationship between revenue column and nb_sold column by sales_method Call
    sales_Call = sales[sales['sales_method'] == 'Call']
    print(sales_Call[['revenue','nb_sold']].corr())
    # Visualize the relationship between revenue column and nb_sold column classified by sales_method.
    sns.relplot(x='nb_sold', y='revenue', data=sales,hue='sales_method')
    plt.show()
    
    #method for imputing revenue data
    import statsmodels.formula.api as smf
    # Rows with known revenue, used to fit the model
    sales_model = sales[~sales['revenue'].isna()]
    # Rows with missing revenue, to be predicted
    sales_estimate = sales[sales['revenue'].isna()]
    # Fit an OLS model on nb_sold and sales_method, then predict the missing values
    x = smf.ols('revenue ~ nb_sold + C(sales_method)', data = sales_model).fit()
    estimates = x.predict(sales_estimate)
    print(estimates)
    
    # Fill missing values with results
    sales['revenue'] = sales['revenue'].fillna(estimates)
    print(sales.info())
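The three per-method correlation checks above can also be computed in one pass with groupby. A sketch on a small hypothetical sample (demo stands in for the real sales frame):

```python
import pandas as pd

# Hypothetical sample with the same columns as product_sales.csv
demo = pd.DataFrame({
    'sales_method': ['Email', 'Email', 'Call', 'Call', 'Email + Call', 'Email + Call'],
    'nb_sold':      [7, 9, 8, 10, 9, 12],
    'revenue':      [90.0, 110.0, 35.0, 45.0, 170.0, 210.0],
})

# One correlation matrix per sales method, replacing the three separate filters
corr_by_method = demo.groupby('sales_method')[['nb_sold', 'revenue']].corr()
print(corr_by_method)
```

The result is a MultiIndex frame with one 2x2 correlation matrix per method, which avoids the copy-paste slip fixed above.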

    Analysis of sales_method

    1. Email accounts for about 50% of total sales
    2. Email alone outperforms Email + Call and Call
    3. The missing revenue values are not concentrated in any one sales method
    ax = sales['sales_method'].value_counts().plot(kind='bar', title='Sales Method Distribution')
    ax.bar_label(ax.containers[0], padding=1)
    plt.show()
    
    print(sales['sales_method'].value_counts(normalize=True))
    
    print(sales['sales_method'][sales['revenue'].isnull()].value_counts())

    Analysis of revenue

    1. The majority of revenue values fall between 50 and 100, with slightly more customers toward the 100 end
    2. A smaller group of higher values sits above 180
    sns.boxplot(sales['revenue'])
    plt.show()
    
    sns.histplot(sales['revenue'])
    plt.show()
    

    Analysis of nb_site_visits

    1. The histogram shows that customers visit the site about 25 times on average.
    2. Roughly 5,000 customers visited the site 24-26 times in the last 6 months.
    3. The distribution most closely follows a Normal distribution, with about 90% of customers visiting the site between 15 and 35 times.
    sns.histplot(sales['nb_site_visits'])
    plt.title('Number of times customers visited (last 6 months)')
    plt.show()
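The "90% of customers visit between 15 and 35 times" claim can be checked directly with Series.between. A sketch on simulated data (the mean of 25 and spread of 5 are assumptions read off the histogram; on the real frame, sales['nb_site_visits'] would replace the simulated series):

```python
import numpy as np
import pandas as pd

# Simulated visit counts standing in for sales['nb_site_visits']
rng = np.random.default_rng(42)
visits = pd.Series(rng.normal(25, 5, 15000).round().astype(int))

# Empirical share of customers with 15-35 visits (inclusive)
share = visits.between(15, 35).mean()
print(f"{share:.1%} of customers visited the site 15-35 times")
```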
    
    

    Analysing revenue by sales method

    1. Email + Call stands out for its wide spread, with a high median of about 175 and a mean of 183
    2. Call has a narrow spread, with a median of 50 and a mean of 47
    3. Email has a wide spread, though not as wide as Email + Call, and sits in the middle with a median of 100 and a mean of 97
    4. A bar plot with the estimator set to sum shows that total revenue from Email is the highest, followed by Email + Call and then Call
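The figures behind these findings can be produced with seaborn's boxplot and barplot (a sketch on a small hypothetical sample; on the real data, sales replaces demo, and the totals differ):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical sample standing in for the full sales frame
demo = pd.DataFrame({
    'sales_method': ['Call'] * 3 + ['Email'] * 4 + ['Email + Call'] * 2,
    'revenue': [40.0, 45.0, 50.0, 90.0, 95.0, 100.0, 105.0, 170.0, 190.0],
})

# Spread and center of revenue per method
sns.boxplot(x='sales_method', y='revenue', data=demo)
plt.show()

# Total revenue per method -- estimator=sum aggregates instead of averaging
sns.barplot(x='sales_method', y='revenue', data=demo, estimator=sum)
plt.show()
```

Passing estimator=sum to barplot is what turns the default per-method mean into the total-revenue comparison described in point 4.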