DataCamp Associate Data Analyst Case Study Project - Food Claims Process
BY ABDULRAHEEM BASHIR
This case study concerns a fast food chain in Brazil whose customers file claims against it, for example over suspected food poisoning. The chain used in this case study is called Vivendo.
Vivendo is a fast food chain in Brazil with over 200 outlets. As with many fast food establishments, customers make claims against the company. For example, they blame Vivendo for suspected food poisoning.
The legal team, which processes these claims, is currently split across four locations. The new head of the legal department wants to know whether the time it takes to close claims differs across the locations.
Customer Question: The legal team would like you to answer the following questions:
- How does the number of claims differ across locations?
- What is the distribution of time to close claims?
- How does the average time to close claims differ by location?
Dataset: The dataset contains one row for each claim. The dataset can be downloaded from here.
The columns in the dataset are described below:
- Claim ID: Character, the unique identifier of the claim.
- Time to Close: Numeric, number of days it took for the claim to be closed.
- Claim Amount: Numeric, initial claim value in the currency of Brazil.
- Amount Paid: Numeric, total amount paid after the claim closed in the currency of Brazil.
- Location: Character, location of the claim, one of “RECIFE”, “SAO LUIS”, “FORTALEZA”, or “NATAL”.
- Individuals on Claim: Numeric, number of individuals on this claim.
- Linked Cases: Binary, whether this claim is believed to be linked with other cases, either TRUE or FALSE.
- Cause: Character, the cause of the food poisoning injuries, one of ‘vegetable’, ‘meat’, or ‘unknown’.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Reading the csv file
# saving it as a dataframe with the name claims
claims = pd.read_csv('claims.csv')
# displaying the first few rows of the dataframe
claims.head()
The preceding output shows that:
- The Claim ID column contains some undesired zeros. It also includes two pieces of information in this single column: the Claim ID and the Year of Claim.
- Some unwanted characters appear before the amount in the Claim Amount column.
- The Cause column has some empty values.
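One way to handle the unwanted characters in Claim Amount is to strip everything that is not a digit or a decimal point and then cast the result to a number. The sketch below uses a small made-up dataframe; the column name `Claim Amount`, the `R$` prefix, and the comma thousands separators are assumptions for illustration and may not match the real file.

```python
import pandas as pd

# Toy data: the "R$" prefix and comma separators are assumed formatting,
# not values taken from the real claims.csv.
sample = pd.DataFrame(
    {"Claim Amount": ["R$50,000.00", "R$180,000.00", "R$70,000.00"]}
)

# Remove every character that is not a digit or a decimal point,
# then convert the cleaned strings to floats.
sample["Claim Amount"] = (
    sample["Claim Amount"]
    .str.replace(r"[^0-9.]", "", regex=True)
    .astype(float)
)

print(sample["Claim Amount"].tolist())
print(sample["Claim Amount"].dtype)
```

If the file instead uses Brazilian number formatting (a period for thousands and a comma for decimals), the pattern would need to be adjusted before casting.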
# displaying some information about the dataframe
claims.info()
The preceding output shows that:
- The data type of the Claim Amount column is wrong: the column is described as numeric, but it is not stored as a numeric type.
- Approximately 80% of the Cause column entries are null.
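The share of missing values can also be computed directly rather than read off the `info()` output. This is a minimal sketch on made-up data, assuming the column is named `Cause`; on the real dataframe the same expression would be applied to `claims`.

```python
import numpy as np
import pandas as pd

# Toy data: four of the five entries are missing.
sample = pd.DataFrame({"Cause": ["meat", np.nan, np.nan, np.nan, np.nan]})

# isna() marks missing entries as True; the mean of a boolean
# Series is the fraction of True values.
null_share = sample["Cause"].isna().mean()

print(f"{null_share:.0%} of Cause entries are null")
```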
# Checking for the count of duplicates in the dataframe
claims.duplicated().sum()
The output above shows that the dataframe contains no duplicate rows.
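Note that `duplicated()` with no arguments only flags rows that are identical in every column. Because Claim ID is described as a unique identifier, it may also be worth checking that column on its own. The sketch below uses made-up IDs and assumes the column is named `Claim ID`.

```python
import pandas as pd

# Toy data: the first and third rows share an ID but differ elsewhere,
# so a full-row check does not flag them.
sample = pd.DataFrame(
    {
        "Claim ID": ["A-001", "A-002", "A-001"],
        "Time to Close": [120, 340, 125],
    }
)

full_row_dupes = sample.duplicated().sum()
id_dupes = sample.duplicated(subset=["Claim ID"]).sum()

print(full_row_dupes, id_dupes)
```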
# displaying some descriptive statistics about the data
claims.describe()
According to the output above, the minimum time to close a claim is -57 days, which is not a valid value because a claim cannot take a negative number of days to close.
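To see how many rows are affected, the negative values can be isolated with a boolean mask. The sketch below runs on made-up numbers and assumes the column is named `Time to Close`; whether such rows should be dropped, corrected, or investigated further is a separate decision that this example does not make.

```python
import pandas as pd

# Toy data containing one impossible negative duration.
sample = pd.DataFrame({"Time to Close": [2082, -57, 3591, 0]})

# Boolean mask: True where the duration is negative.
is_negative = sample["Time to Close"] < 0

invalid = sample[is_negative]   # rows that need attention
valid = sample[~is_negative]    # rows with plausible durations

print(len(invalid), "invalid row(s);", len(valid), "valid row(s)")
```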