In this short project, we are going to use R to clean some data that will be used later on to perform some analysis and visualization.
For this case, we're working with a company that sells bikes and accessories (AdventureWorks' dataset), they have given us access to its data to develop a BI dashboard for them. Before being able to create the dashboard, we first need to take a look into the data and get it tidy so it can be used adequately.
1. Loading the data
1.1 Packages & data frames
To begin, let's load the packages and the data:
install.packages('openxlsx')
library(tidyverse)
library(assertive)
library(forcats)
library(openxlsx)
budget <- read.csv("Budget.csv", sep = ";")
budgetPeriod <- read.csv("BudgetPeriod.csv", sep = ";")
customers <- read.csv("Customers.csv", sep = ";")
dimProductCategory <- read.csv("dimProductCategory.csv", sep = ";")
dimProductSubCategory <- read.csv("dimProductSubCategory.csv", sep = ";")
products <- read.csv("Products.csv", sep = ";")
sales <- read.csv("Sales.csv", sep = ";")
territory <- read.csv("Territory.csv", sep = ";")
calendar <- read.csv("calendar.csv", sep = ";")
1.2 First look at the tables
budget
budgetPeriod
customers
dimProductCategory
dimProductSubCategory
sales
territory
Now that we have our data we can start with the cleaning and transforming. I'll go table by table just to have a clean order that we can follow.
2. Customers Table
2.1 Creating new fields
Let's create 6 new fields: one for the age of each customer (Age), one for the age range (AgeRange), one that connects to the territory table (TerritoryKey), one that counts the # of items purchased by customer (TotalItemsPurchased), one the counts the # of distinct purchases made by each customer (RegularityCount) and one that lables customers by their regularity (RegularityLabel)
# Age
customers$BirthDate <- as.Date(as.character(customers$BirthDate) )
customers$DateFirstPurchase <- as.Date(as.character(customers$DateFirstPurchase) )
customers <- customers %>%
mutate(Age = floor(eeptools::age_calc(dob = BirthDate,
enddate = DateFirstPurchase,
units = "years",
precise = TRUE
)
)
)
# AgeRange
customers <- customers %>%
mutate(AgeRange = case_when(Age >= 15 & Age <= 25 ~ "15 - 25",
Age > 25 & Age <= 35 ~ "26 - 35",
Age > 35 & Age <= 45 ~ "36 - 45",
Age > 45 & Age <= 55 ~ "46 - 55",
Age > 55 & Age <= 65 ~ "56 - 65",
TRUE ~ ">65")
)
# TerritoryKey
customers <- distinct(customers %>%
left_join(sales, by = "CustomerKey") %>%
select(1:17, SalesTerritoryKey)
)
# TotalItemsPurchased
number_items <- sales %>% count(CustomerKey, sort = TRUE)
customers <- customers %>%
left_join(number_items, by = "CustomerKey")
customers <- rename(customers, TotalItemsPurchased = n)
# RegularityCount
number_purchases <- sales %>%
group_by(OrderDate, CustomerKey) %>%
arrange(-CustomerKey) %>%
distinct(OrderDate, CustomerKey) %>%
ungroup() %>%
count(CustomerKey, sort = TRUE)
customers <- customers %>%
left_join(number_purchases, by = "CustomerKey")
customers <- rename(customers, RegularityCount = n)
# RegularityLabel
customers <- customers %>%
mutate(RegularityLabel = case_when(RegularityCount > 1 ~ "Regular customer",
TRUE ~ "one time buyer"))
3. Products Table
3.1 missing values
A minor issue seems to be happening between the 'ProductName', 'Size' and 'SizeRange' fields. The 'ProductName' field has some observations with multiple elements in them, so first let's handle that so we can have a clean 'ProductName' field:
products <- products %>%
separate(col = ProductName, into = c("ProductName", "NewSize"), sep = ",", remove = FALSE)
As a product of the separation we have a new field called 'NewSize' Which we are going to use to finish the cleaning process for the 3 fields mentioned before:
products$NewSize <- str_trim(products$NewSize)
products$NewSize <- str_replace(products$NewSize, "Large", "L")
products <- products %>%
mutate(NewSize = na_if(NewSize, "Black"),
NewSize = na_if(NewSize, "Blue"),
NewSize = na_if(NewSize, "Red")
)
Let's have two fields for size: one for cm (which already exists) and one for clothing sizes:
products <- products %>%
mutate(clothing_size = case_when(is.na(Size) ~ NewSize,
TRUE ~ NA_character_)
)