Certification - Data Analyst Associate - Food Claims2

In this assessment, the claims of a food chain company called Vivendo are examined. There are four locations. First, the raw CSV is loaded and displayed to get an idea of what the data looks like. The questions are then answered in turn, each after the relevant piece of code.

# Import pandas
import pandas as pd

# Import the data as a DataFrame
event_details = pd.read_csv("food_claims_2212.csv")

# Preview the DataFrame
event_details

Tasks

For every column in the data:

1a. State whether the values match the description given in the table above.

claim_id: OK
time_to_close: OK
claim_amount: OK
amount_paid: OK
location: OK
individuals_on_claim: OK
linked_cases: OK
cause: OK

event_details.info()

1b. State the number of missing values in the column.

amount_paid: 36 missing
linked_cases: 26 missing
The other columns have no missing values.
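As a cross-check on the counts read off `info()`, missing values can also be counted directly per column. The sketch below is a minimal illustration on a small made-up frame (`sample` and its values are hypothetical, not taken from food_claims_2212.csv); on the real data the equivalent call would be `event_details.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for event_details (illustrative values only)
sample = pd.DataFrame({
    "claim_id": [1, 2, 3, 4],
    "amount_paid": [100.0, np.nan, 250.5, np.nan],
    "linked_cases": [True, np.nan, False, False],
})

# Count the missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)
```

Columns with a count of zero have no missing values, which makes the result easier to scan than the non-null counts reported by `info()`.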

# Import Numpy library
import numpy as np

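# Median of the observed amount_paid values, rounded to 2 decimals
median_val_ap = round(event_details["amount_paid"].median(), 2)

# Replace missing values: the median for amount_paid, "FALSE" for linked_cases.
# Assigning the result back avoids in-place fillna on a selected column,
# which recent pandas versions warn about.
event_details["amount_paid"] = event_details["amount_paid"].fillna(median_val_ap)
event_details["linked_cases"] = event_details["linked_cases"].fillna("FALSE")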

event_details.info()

1c. Describe what you did to make values match the description if they did not match.

I replaced the missing values with the median for amount_paid and with FALSE for linked_cases. All columns now have 2000 non-null records.

Create a visualization that shows the number of claims in each location. Use the visualization to:

2a. State which category of the variable location has the most observations.

# Count the claims in each location
category_totals = event_details.groupby("location", as_index=False)["claim_id"].count()

# Preview the new DataFrame
category_totals

[Bar chart: "Observations in each location"; x-axis: location, y-axis: claim_id (count)]

2b. Explain whether the observations are balanced across categories of the variable location.

The observations are not balanced: the RECIFE location seems to have significantly more observations than the other 3 locations. FORTALEZA and NATAL have a similar number of observations.
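One way to back up a visual judgement about balance is to compare each location's share of all claims. The following is a minimal sketch on made-up labels (`sample` is hypothetical and its counts are not the real claim counts); with the real data, the same calls would be applied to `event_details["location"]`.

```python
import pandas as pd

# Hypothetical location labels (illustrative only, not the real claim counts)
sample = pd.DataFrame({
    "location": ["RECIFE", "RECIFE", "RECIFE", "FORTALEZA", "NATAL", "NATAL"],
})

# Number of claims per location, largest first
counts = sample["location"].value_counts()

# Share of all claims per location; roughly equal shares would indicate balance
shares = sample["location"].value_counts(normalize=True).round(2)

print(counts)
print(shares)
```

If one share is much larger than the others, the categories are not balanced.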

Describe the distribution of time to close for all claims. Your answer must include a visualization that shows the distribution.

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams.update({'figure.figsize':(7,5), 'figure.dpi':100})

# Plot a histogram of time_to_close
plt.hist(event_details["time_to_close"], bins=50)
plt.gca().set(title='Frequency Histogram', ylabel='Frequency', xlabel="Time to close (days)");

The time to close of most claims is around 180 days.
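A few summary statistics can make a statement like this more precise than reading the histogram alone. The sketch below uses made-up closing times (the values are hypothetical and only for illustration); on the real data the same methods would be called on `event_details["time_to_close"]`.

```python
import pandas as pd

# Hypothetical closing times in days (illustrative only)
time_to_close = pd.Series(
    [150, 170, 175, 180, 185, 190, 200, 320], name="time_to_close"
)

summary = time_to_close.describe()    # count, mean, std, min, quartiles, max
median_days = time_to_close.median()  # robust measure of the centre
skewness = time_to_close.skew()       # a value above 0 suggests a longer right tail

print(summary)
print(f"median: {median_days}, skewness: {skewness:.2f}")
```

Reporting the median together with the quartiles and the direction of the skew describes both where most claims sit and whether a tail of slow-closing claims exists.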

Describe the relationship between time to close and location. Your answer must include a visualization to demonstrate the relationship.
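One common way to show how a numeric variable relates to a categorical one is a box plot per category. The sketch below is only an illustration on made-up values (`sample` is hypothetical and says nothing about the real claims); it is not the chart from the original analysis.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample (illustrative values only, not from the real file)
sample = pd.DataFrame({
    "location": ["RECIFE", "RECIFE", "RECIFE", "NATAL", "NATAL", "NATAL"],
    "time_to_close": [170, 180, 195, 165, 185, 190],
})

# Median time to close per location
medians = sample.groupby("location")["time_to_close"].median()
print(medians)

# One array of closing times per location, in the same order as the medians
labels = list(medians.index)
groups = [
    sample.loc[sample["location"] == name, "time_to_close"].to_numpy()
    for name in labels
]

# Draw one box per location
fig, ax = plt.subplots(figsize=(7, 5))
ax.boxplot(groups)
ax.set_xticks(range(1, len(labels) + 1))
ax.set_xticklabels(labels)
ax.set(title="Time to close by location",
       xlabel="Location", ylabel="Time to close (days)")
```

Similar medians and box heights across locations would suggest that time to close does not depend much on location; clear shifts between boxes would suggest that it does.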