Data Analyst Associate Case Study Submission
You can use any tool that you want to do your analysis and create visualizations. Use this template to write up your summary for submission.
You can use any markdown formatting you wish. If you are not familiar with Markdown, read the Markdown Guide before you start.
Company Background
Vivendo is a fast food chain in Brazil with over 200 outlets. As with many fast food establishments, customers make claims against the company. For example, they blame Vivendo for suspected food poisoning.
The legal team, who processes these claims, is currently split across four locations. The new head of the legal department wants to see if there are differences in the time it takes to close claims across the locations.
Customer Question
The legal team has given you a data set where each row is a claim made against the company. They would like you to answer the following questions:
- How does the number of claims differ across locations?
- What is the distribution of time to close claims?
- How does the average time to close claims differ by location?
Dataset
The dataset contains one row for each claim. The dataset can be downloaded from here.
Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
Data Validation
Describe the validation tasks you performed and what you found. Have you made any changes to the data to enable further analysis? Remember to describe what you did for every column in the data.
The original data has 98 rows and 8 columns. I first checked the data type of every column. I then stripped the currency formatting from the "Claim Amount" column, turning values such as "R$50,000.00" into "50000", and changed the column's data type from string to integer. The "Cause" column had 78 missing values, which I replaced with "unknown". The legal team processes claims in 4 locations: Fortaleza, Recife, Natal, and Sao Luis. Finally, I checked the data again against the data dictionary. Here is a summary of what I found:
- There are 98 rows and 8 columns, with 98 unique Claim IDs.
- There were no missing values after data cleaning.
- There are 4 different location categories, as expected.
- The linked-cases indicator is binary: True for claims believed to be linked with other cases, and False otherwise.
- There are 3 categories for the cause of the food poisoning injuries: 'vegetable', 'meat', and 'unknown'.
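The checks summarized above can be sketched roughly as follows. This is a minimal illustration on a hypothetical three-row sample, not the actual claims data; only the "Claim ID" and "Cause" column names come from the write-up, and the sample values are made up for the example.

```python
import pandas as pd

# Hypothetical sample standing in for the claims data (illustrative values only).
sample = pd.DataFrame({
    "Claim ID": ["A1", "A2", "A3"],
    "Cause": ["meat", None, "vegetable"],
})

# Count missing values per column before cleaning.
missing_before = sample.isna().sum()

# Replace missing causes with the placeholder category "unknown".
sample["Cause"] = sample["Cause"].fillna("unknown")

# Re-check: no missing values should remain, and every Claim ID should be unique.
missing_after = int(sample.isna().sum().sum())
ids_unique = sample["Claim ID"].is_unique
cause_categories = sorted(sample["Cause"].unique())

print(missing_before["Cause"], missing_after, ids_unique, cause_categories)
```

Running the same three checks on the full DataFrame (missing counts, uniqueness of the ID column, and the set of categories) is one way to confirm the data matches the dictionary after cleaning.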
Read dataset
import requests
url = 'https://raw.githubusercontent.com/isaaclangit/fundamentals_data_analyst/main/Data%20Camp/Data%20Camp%20Associate%20Certification/claims.csv'
res = requests.get(url, allow_redirects=True)
with open('claims.csv','wb') as file:
    file.write(res.content)
df = pd.read_csv('claims.csv')
df.head()
df.shape
# Count the unique Claim ID values
print("There are " + str(df['Claim ID'].nunique()) + " unique Claim ID")
df.info()
# Convert "R$50,000.00" to 50000 and cast the data type from string to integer
df['Claim Amount'] = df['Claim Amount'].apply(lambda st: st[st.find("$")+1:st.find(".")]).str.replace(",","")
df['Claim Amount'] = pd.to_numeric(df['Claim Amount'])
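As an alternative to slicing between the "$" and the ".", the same conversion could be sketched with a regular expression that drops every character except digits and the decimal point. This is only a sketch on hypothetical amounts in the "R$50,000.00" format, not the notebook's own method; note that it rounds to whole reais, whereas the slicing above truncates the cents.

```python
import pandas as pd

# Hypothetical amounts in the "R$50,000.00" format described above.
amounts = pd.Series(["R$50,000.00", "R$180,000.00", "R$3,760.00"])

# Drop everything except digits and the decimal point, then convert to numbers.
cleaned = amounts.str.replace(r"[^0-9.]", "", regex=True)
as_number = pd.to_numeric(cleaned)          # floating point, cents are kept
as_integer = as_number.round().astype(int)  # whole reais

print(as_integer.tolist())
```

Keeping the intermediate float makes it easy to check whether any amounts actually carry cents before they are discarded.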