Certification: Data Analyst Associate

Case Study Project - Food Claims Process
author: Norbert Dzikowski
date: 2022-11-05

knitr::opts_chunk$set(echo = TRUE)

Case study company background

Vivendo is a fast food chain in Brazil with over 200 outlets. As with many fast food establishments, customers make claims against the company. For example, they blame Vivendo for suspected food poisoning.

Dataset

The claims.csv dataset contains the following columns:

Claim ID: Character (the unique identifier of the claim),
Time to Close: Numeric (number of days it took for the claim to be closed),
Claim Amount: Numeric (initial claim value in the currency of Brazil),
Amount Paid: Numeric (total amount paid after the claim closed, in the currency of Brazil),
Location: Character (location of the claim, one of "RECIFE", "SAO LUIS", "FORTALEZA" or "NATAL"),
Individuals on Claim: Numeric (number of individuals on this claim),
Linked Cases: Binary (whether this claim is believed to be linked with other cases, either TRUE or FALSE),
Cause: Character (the cause of the food poisoning injuries, one of "vegetable", "meat" or "unknown").
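The column descriptions above list a fixed set of allowed values for Location and Cause, so it can be worth checking a data frame against them before any analysis. The following is a minimal sketch on invented rows; the names `toy`, `allowed_locations` and `allowed_causes` are hypothetical and not part of the original analysis.

```r
# Toy rows that follow the column descriptions (values are invented).
toy <- data.frame(
  Location = c("RECIFE", "NATAL", "fortaleza", "SAO LUIS"),
  Cause    = c("meat", "", "vegetable", "unknown"),
  stringsAsFactors = FALSE
)

# Allowed values, copied from the dataset description.
allowed_locations <- c("RECIFE", "SAO LUIS", "FORTALEZA", "NATAL")
allowed_causes    <- c("vegetable", "meat", "unknown")

# Flag rows whose value falls outside the documented set.
bad_location <- !(toy$Location %in% allowed_locations)
bad_cause    <- !(toy$Cause %in% allowed_causes)

# Show the offending rows so they can be inspected and cleaned.
toy[bad_location | bad_cause, ]
```

In this toy example the lower-case "fortaleza" and the empty Cause are flagged, which is the kind of issue a cleaning step would then need to handle.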

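library(stringr)
library(dplyr)

# read.csv() expects "," as the field separator and "." as the decimal mark
data <- read.csv("claims.csv")

# Claim.Amount is read as text: drop the two-character currency prefix and
# the thousands separators, then convert to numeric
data$Claim.Amount <- substring(data$Claim.Amount, 3)
data$Claim.Amount <- gsub(",", "", data$Claim.Amount, fixed = TRUE)
data$Claim.Amount <- as.numeric(data$Claim.Amount)

# A negative number of days is not meaningful, so set such values to 0
data$Time.to.Close <- ifelse(data$Time.to.Close < 0, 0, data$Time.to.Close)
data$Amount.Paid <- as.numeric(data$Amount.Paid)

# Treat an empty cause as "unknown"
data$Cause <- ifelse(data$Cause == "", "unknown", data$Cause)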
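To make the cleaning steps easier to verify, here is a small self-contained sketch of the same ideas on invented values. It assumes the raw claim amounts carry an "R$" prefix and use "," as the thousands separator; `parse_brl` and the sample vectors are hypothetical helpers for illustration, not part of the original report.

```r
# Hypothetical raw values, assuming an "R$" prefix and "," thousands separators.
raw_amounts <- c("R$50,000.00", "R$1,234.50", "R$980.00")

parse_brl <- function(x) {
  x <- sub("^R\\$", "", x)             # drop the leading currency symbol
  x <- gsub(",", "", x, fixed = TRUE)  # drop the thousands separators
  as.numeric(x)
}

amounts <- parse_brl(raw_amounts)
amounts

# Clamping negative durations to zero can also be written with pmax().
raw_days <- c(-3, 10, 0, 250)
clean_days <- pmax(raw_days, 0)
clean_days
```

Matching the prefix explicitly, rather than cutting a fixed number of characters, leaves values without a prefix unchanged, which may or may not be what is wanted for a given file.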

The first few records of the cleaned data look like this.

head(data)

How does the number of claims differ across locations?

data %>%
  group_by(Location) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

Fortaleza has the largest number of claims. The counts for the remaining locations differ from one another by only a few claims.
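Raw counts are easier to compare when each location's share of all claims is shown next to them. The sketch below uses invented rows, not the real claims data, and the column names `n_claims` and `share` are assumptions made for illustration.

```r
library(dplyr)

# Invented rows for illustration only; not the real claims data.
toy <- data.frame(
  Location = c("FORTALEZA", "FORTALEZA", "FORTALEZA", "NATAL",
               "RECIFE", "RECIFE", "SAO LUIS", "SAO LUIS")
)

location_share <- toy %>%
  count(Location, name = "n_claims", sort = TRUE) %>%  # claims per location, largest first
  mutate(share = n_claims / sum(n_claims))             # fraction of all claims

location_share
```

The same pipeline could be applied to the cleaned `data` frame by replacing `toy` with `data`.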

What is the distribution of time to close claims?

library(ggplot2)

# Histogram of closing times with the mean marked by a blue vertical line
ggplot(data, aes(Time.to.Close)) +
  geom_histogram(binwidth = 300) +
  geom_vline(xintercept = mean(data$Time.to.Close), col = 'blue')

mean(data$Time.to.Close)

The histogram shows the distribution of time to close (the blue vertical line marks the mean).
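A mean on its own can be misleading if a few claims take much longer than the rest, so it may help to report the median and quartiles alongside it. The sketch below uses invented durations chosen to be right-skewed; whether the real data is skewed in this way is not established here.

```r
# Invented, right-skewed durations in days (illustration only).
days <- c(120, 300, 450, 600, 700, 900, 3500)

mean_days   <- mean(days)    # pulled upwards by the single very long claim
median_days <- median(days)  # middle value, robust to that outlier

mean_days
median_days

# Quartiles give a compact picture of the spread.
quantile(days, probs = c(0.25, 0.5, 0.75))
```

For the real data, the same calls would be applied to `data$Time.to.Close`.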

How does the average time to close claims differ by location?

data %>%
  group_by(Location) %>%
  summarise(mean = mean(Time.to.Close)) %>%
  arrange(desc(mean))

Sao Luis has the highest average time to close claims.
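Because a group mean depends on how many claims the group contains, a comparison by location is easier to trust when the claim count and median are shown next to the mean. This is a minimal sketch on invented rows; the column names `n_claims`, `mean_days` and `median_days` are assumptions, not names used in the original analysis.

```r
library(dplyr)

# Invented rows for illustration only; not the real claims data.
toy <- data.frame(
  Location      = c("SAO LUIS", "SAO LUIS", "NATAL", "NATAL", "NATAL", "RECIFE"),
  Time.to.Close = c(1200, 800, 500, 700, 600, 900)
)

by_location <- toy %>%
  group_by(Location) %>%
  summarise(
    n_claims    = n(),                    # how many claims back each average
    mean_days   = mean(Time.to.Close),
    median_days = median(Time.to.Close),
    .groups     = "drop"
  ) %>%
  arrange(desc(mean_days))

by_location
```

A location with a high mean but very few claims would stand out in the `n_claims` column and could be interpreted with more caution.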