In this assessment, the claims of a food chain company called Vivendo are examined. There are four locations. First, the raw CSV is loaded and displayed to get an idea of what the data looks like. The questions are then answered after each part of the code.
# Import pandas
import pandas as pd
# Import the data as a DataFrame
event_details = pd.read_csv("food_claims_2212.csv")
# Preview the DataFrame
event_details

Tasks
For every column in the data:
1a. State whether the values match the description given in the table above.
claim_id: OK
time_to_close: OK
claim_amount: OK
amount_paid: OK
location: OK
individuals_on_claim: OK
linked_cases: OK
cause: OK
event_details.info()

1b. State the number of missing values in the column.
amount_paid: 36 missing
linked_cases: 26 missing
The other columns don't have missing values.
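The missing-value counts can also be computed directly instead of being read off the `info()` output. The following is a minimal sketch, assuming the data is held in a pandas DataFrame such as `event_details`; the helper name `missing_counts` and the toy frame are hypothetical and only illustrate the call.

```python
import pandas as pd


def missing_counts(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    return counts[counts > 0]


# Toy data for illustration only (not the real claims file).
toy = pd.DataFrame(
    {
        "claim_id": [1, 2, 3],
        "amount_paid": [100.0, None, 250.0],
        "linked_cases": [False, None, None],
    }
)
print(missing_counts(toy))
```

Calling `missing_counts(event_details)` on the loaded data should list only the columns that contain missing values.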
# Import Numpy library
import numpy as np
median_val_ap = round(event_details['amount_paid'].median(),2)
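# Replace missing values: median for amount_paid, "FALSE" for linked_cases.
# The result is assigned back to the column because in-place fillna on a
# column selection triggers a warning in newer pandas versions and may not
# update the DataFrame.
event_details["amount_paid"] = event_details["amount_paid"].fillna(median_val_ap)
event_details["linked_cases"] = event_details["linked_cases"].fillna("FALSE")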
event_details.info()

1c. Describe what you did to make values match the description if they did not match.
I replaced the NULL values with the median for amount_paid and with FALSE for linked_cases. All columns now have 2000 records.
- Create a visualization that shows the number of claims in each location. Use the visualization to: a. State which category of the variable location has the most observations
# Count the number of claims in each location
category_totals = event_details.groupby("location", as_index=False)["claim_id"].count()
# Preview the new DataFrame
category_totals

Observations in each location
2b. Explain whether the observations are balanced across categories of the variable location. The RECIFE location seems to have significantly more observations than the other 3 locations. FORTALEZA and NATAL have a similar number of observations.
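A bar chart is one way to visualize the number of claims per location. The following is a minimal sketch, assuming a DataFrame with a `location` column such as `event_details`; the function name `plot_claims_per_location` and the toy frame are hypothetical and only illustrate the approach.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_claims_per_location(df: pd.DataFrame):
    """Draw a bar chart of the number of claims per location and return the Axes."""
    # value_counts sorts the locations from most to fewest claims
    counts = df["location"].value_counts()
    fig, ax = plt.subplots(figsize=(7, 5))
    ax.bar(list(counts.index), counts.to_numpy())
    ax.set(
        title="Observations in each location",
        xlabel="Location",
        ylabel="Number of claims",
    )
    return ax


# Toy data for illustration only (not the real claims file).
toy = pd.DataFrame({"location": ["RECIFE", "RECIFE", "NATAL", "FORTALEZA", "RECIFE"]})
ax = plot_claims_per_location(toy)
```

With the real data, `plot_claims_per_location(event_details)` would show one bar per location, which makes any imbalance between categories easy to see.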
- Describe the distribution of time to close for all claims. Your answer must include a visualization that shows the distribution.
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams.update({'figure.figsize':(7,5), 'figure.dpi':100})
# Plot Histogram on x
plt.hist(event_details["time_to_close"], bins=50)
plt.gca().set(title='Frequency Histogram', ylabel='Frequency', xlabel="Time to close (days)");

The time to close of most claims is around 180 days.
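A few summary statistics can back up the reading of the histogram. The following is a minimal sketch, assuming a DataFrame with a numeric `time_to_close` column such as `event_details`; the function name `summarize_time_to_close` and the toy values are hypothetical and are not taken from the claims file.

```python
import pandas as pd


def summarize_time_to_close(df: pd.DataFrame) -> pd.Series:
    """Return the median, quartiles, and skewness of time_to_close."""
    days = df["time_to_close"]
    return pd.Series(
        {
            "median": days.median(),
            "q1": days.quantile(0.25),
            "q3": days.quantile(0.75),
            "skew": days.skew(),  # positive values indicate a longer right tail
        }
    )


# Toy data for illustration only (not the real claims file).
toy = pd.DataFrame({"time_to_close": [170, 175, 180, 185, 300]})
summary = summarize_time_to_close(toy)
print(summary)
```

The median and quartiles describe where most claims fall, and the skewness indicates whether a tail of slow claims stretches the distribution to the right.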
- Describe the relationship between time to close and location. Your answer must include a visualization to demonstrate the relationship.