Data Analyst Associate Practical Exam Submission
Task 1
For every column in the data:
a. State whether the values match the description given in the table above.
b. State the number of missing values in the column.
c. Describe what you did to make values match the description if they did not match.
product_id column
- The values match the description given.
- There were no missing values in the column.
- No action was required to make the values match the description.
category column
- The values match the description given.
- There were 25 missing values represented by a '-' charachter.
- To make the values match the description a CASE statement was used to return 'Unknown' if the value for category was '-', or return the value for category if this condition was not met.
animal column
- The values match the description given.
- There were no missing values in the column.
- No action was required to make the values match the description.
size column
- The values do not match the description because the column contains a mix of lower case, upper case and capitalized initial values
- There were no missing values in the column.
- DuckDB would not allow use of postgreSQL INITCAP() function on the size column to return the values as described with only the first letter of the word capitalized e.g. 'Large', therefore a CASE statement was used to acheive this.
price column
- The values match the description given.
- There were 150 missing values represented by the string 'unlisted'.
- To calculate the median price value all values from the price column that were not 'unlisted' were selected and casted to double (because they were originally varchar data type and it is necessary to perform a calculation on the data). Then the postgreSQL PERCENTILE_CONT(0.5) function was used on the values to return the median value. The median value was 28.065. A CASE statement was used to return the median value rounded up to 28.07 if the value for price was 'unlisted' or return the original value if this condition was not met.
sales column
- The values match the description given.
- There were no missing values in the column.
- No action was required to make the values match the description.
rating column
- The values match the description given.
- There were 150 missing values represented by 'NA'.
- To make the values match the description a CASE statement was used to return '0' if the value for rating was 'NA', or return the value for rating if this condition was not met.
repeat_purchase column
- The values match the description given.
- There were no missing values in the column.
- No action was required to make the values match the description.
A SQL SELECT statement was used to select all data from all of the columns which did not require altering along with CASE statements to return clean data for the columns which did need altering. The resulting clean version of the 1500 record, 8 column dataset was then used to create the supporting visualizations using MS Excel.
Task 2
Create a visualization that shows how many products are repeat purchases. Use the visualization to:
a. State which category of the variable repeat purchases has the most observations
b. Explain whether the observations are balanced across categories of the variable repeat purchases
- The sales data contains more observations for repeat-purchases than non-repeat purchases.
- The observations were not balanced across the categories of the variable repeat purchases.
- Of the 1500 products included in the dataset the repeat-purchase product type had 906 observations, compared to just 594 observations for the non-repeat-purchase product type.
Task 3
Describe the distribution of all of the sales. Your answer must include a visualization that shows the distribution.
The two graphics below are provided to show the distribution of all of the sales values in US Dollars across all products last year.
- The box plot provided includes key figures including the mean, median, upper quartile, and lower quartile values.
- The histogram shows the distribution of sales values rounded to the nearest 100 US Dollars.
- There is quite a large range of 1969.02. The highest sales value of 2255.96 USD for product_id1443 is more than 7.5 times the lowest sales value of 286.94 USD for product_id671.
- Compared to the overall range of sales values, the interquartile range is relatively small (411.09). The middle 50% of sales values are between the lower quartile of 747.40 and the upper quartile of 1158.49.
- There is a small number of high-value outliers, however the mean value of 996.60 is very close to the median value of 1000.83 which shows that overall the data is mostly symmetrically distributed.
- Looking at the histogram showing the distibution of sales values rounded to the nearest 100USD it is easier to see that the distribution is mostly symmetrical.
Task 4
Describe the relationship between repeat purchases and sales. Your answer must include a visualization to demonstrate the relationship.
Finally, the data was analysed to investigate the relationship between repeat-purchases and sales.
- The total sum of sales is higher for repeat purchases than non-repeat-purchases.
- The total sales figure for repeat-purchases was 884,064 USD compared to the value of 610,851 USD for non-repeat-purchases.
- However, this is due to the higher number of repeat-purchase products being sold in general. (Remember that 906 of the products being sold were repeat-purchases compared to just 594 non-repeat-purchase products)
If we visualize the distribution of sales values for repeat-purchases vs non-repeat-purchases as side-by-side boxplots, we can see that generally sales values are actually higher for non-repeat-purchases than repeat-purchases.
- The Mean, Median, Lower Quartile and Upper Quartile values are all higher for the non-repeat-purchase products than repeat-purchase products.
- This would suggest that repeat-purchases are associated with lower sales values than non-repeat-purchases.
Based on all of the above, we suggest the PetMind team considers focussing their efforts on selling more non-repeat-purchase products because they generally have higher sales values than repeat-purchase products. However, the team should do their best to maintain the current sales levels seen by repeat-purchase products because this is is where the majority of their sales revenue came from last year. This report focussed on sales data, therefore further analysis should be done to determine which purchase type brings in more profit for the company.