Skip to content

In this short project, we are going to use R to clean some data that will be used later on to perform some analysis and visualization.

For this case, we're working with a company that sells bikes and accessories (AdventureWorks' dataset), they have given us access to its data to develop a BI dashboard for them. Before being able to create the dashboard, we first need to take a look into the data and get it tidy so it can be used adequately.

1. Loading the data

1.1 Packages & data frames

To begin, let's load the packages and the data:

install.packages('openxlsx')

library(tidyverse)
library(assertive)
library(forcats)
library(openxlsx)
budget <- read.csv("Budget.csv", sep = ";")
budgetPeriod <- read.csv("BudgetPeriod.csv", sep = ";")
customers <- read.csv("Customers.csv", sep = ";")
dimProductCategory <- read.csv("dimProductCategory.csv", sep = ";")
dimProductSubCategory <- read.csv("dimProductSubCategory.csv", sep = ";")
products <- read.csv("Products.csv", sep = ";")
sales <- read.csv("Sales.csv", sep = ";")
territory <- read.csv("Territory.csv", sep = ";")
calendar <- read.csv("calendar.csv", sep = ";")

1.2 First look at the tables

budget
budgetPeriod
customers
dimProductCategory
dimProductSubCategory
sales
territory

Now that we have our data we can start with the cleaning and transforming. I'll go table by table just to have a clean order that we can follow.

2. Customers Table

2.1 Creating new fields

Let's create 6 new fields: one for the age of each customer (Age), one for the age range (AgeRange), one that connects to the territory table (TerritoryKey), one that counts the # of items purchased by customer (TotalItemsPurchased), one the counts the # of distinct purchases made by each customer (RegularityCount) and one that lables customers by their regularity (RegularityLabel)

# Age

customers$BirthDate <- as.Date(as.character(customers$BirthDate) )

customers$DateFirstPurchase <- as.Date(as.character(customers$DateFirstPurchase) )

customers <- customers %>%
              mutate(Age = floor(eeptools::age_calc(dob = BirthDate, 
                                                    enddate = DateFirstPurchase, 
                                                    units = "years", 
                                                    precise = TRUE
                                                    )
                                 )
                     )

# AgeRange

customers <- customers %>%
              mutate(AgeRange = case_when(Age >= 15 & Age <= 25 ~ "15 - 25",
                                          Age > 25 & Age <= 35 ~ "26 - 35",
                                          Age > 35 & Age <= 45 ~ "36 - 45",
                                          Age > 45 & Age <= 55 ~ "46 - 55",
                                          Age > 55 & Age <= 65 ~ "56 - 65",
                                          TRUE ~ ">65")
                     )

# TerritoryKey


customers <- distinct(customers %>%
                      left_join(sales, by = "CustomerKey") %>%
                      select(1:17, SalesTerritoryKey)
                      )

# TotalItemsPurchased

number_items <- sales %>% count(CustomerKey, sort = TRUE)

customers <- customers %>% 
              left_join(number_items, by = "CustomerKey")

customers <- rename(customers, TotalItemsPurchased = n)

# RegularityCount

number_purchases <- sales %>% 
                      group_by(OrderDate, CustomerKey) %>%
                      arrange(-CustomerKey) %>%
                      distinct(OrderDate, CustomerKey) %>%
                      ungroup() %>%
                      count(CustomerKey, sort = TRUE)

customers <- customers %>% 
              left_join(number_purchases, by = "CustomerKey")


customers <- rename(customers, RegularityCount = n)

# RegularityLabel

customers <- customers %>%
              mutate(RegularityLabel = case_when(RegularityCount > 1 ~ "Regular customer",
                                                 TRUE ~ "one time buyer"))

3. Products Table

3.1 missing values

A minor issue seems to be happening between the 'ProductName', 'Size' and 'SizeRange' fields. The 'ProductName' field has some observations with multiple elements in them, so first let's handle that so we can have a clean 'ProductName' field:

products <- products %>%
  separate(col = ProductName, into = c("ProductName", "NewSize"), sep = ",", remove = FALSE)

As a product of the separation we have a new field called 'NewSize' Which we are going to use to finish the cleaning process for the 3 fields mentioned before:

products$NewSize <- str_trim(products$NewSize)

products$NewSize <- str_replace(products$NewSize, "Large", "L")

products <- products %>%
              mutate(NewSize = na_if(NewSize, "Black"),
                     NewSize = na_if(NewSize, "Blue"),
                     NewSize = na_if(NewSize, "Red") 
                     )

Let's have two fields for size: one for cm (which already exists) and one for clothing sizes:

products <- products %>%
  mutate(clothing_size = case_when(is.na(Size) ~ NewSize,
                                   TRUE ~ NA_character_)
         )