
Data Analyst Associate Practical Exam Submission

You can use any tool that you want to do your analysis and create visualizations. Use this template to write up your summary for submission.

You can use any markdown formatting you wish. If you are not familiar with Markdown, read the Markdown Guide before you start.

Task 1

claim_id:

  1. The values in the claim_id column are unique identifiers, matching the description.
  2. There are no missing values mentioned in the description.

time_to_close:

  1. The values in the time_to_close column should be discrete and any positive value.
  2. To determine the number of missing values, I will examine the dataset.
  3. If there are any missing values, I will replace them with the overall median time to close.

claim_amount:

  1. The values in the claim_amount column should be continuous and represent the initial claim requested in the currency of Brazil, rounded to 2 decimal places.
  2. To determine the number of missing values, I will examine the dataset.
  3. If there are any missing values, I will replace them with the overall median claim amount.

amount_paid:

  1. The values in the amount_paid column should be continuous and represent the final amount paid in the currency of Brazil, rounded to 2 decimal places.
  2. To determine the number of missing values, I will examine the dataset.
  3. If there are any missing values, I will replace them with the overall median amount paid.

location:

  1. The values in the location column should match the description, with options being "RECIFE", "SAO LUIS", "FORTALEZA", or "NATAL".
  2. To determine the number of missing values, I will examine the dataset.
  3. If there are any missing values, they will be removed.

individuals_on_claim:

  1. The values in the individuals_on_claim column should be discrete and represent the number of individuals on the claim, with a minimum of 1 person.
  2. To determine the number of missing values, I will examine the dataset.
  3. If there are any missing values, I will replace them with 0.

linked_cases:

  1. The values in the linked_cases column should match the description, with options being TRUE or FALSE.
  2. To determine the number of missing values, I will examine the dataset.
  3. If there are any missing values, I will replace them with FALSE.

cause:

  1. The values in the cause column should match the description, with options being "vegetable", "meat", or "unknown".
  2. To determine the number of missing values, I will examine the dataset.
  3. If there are any missing values, I will replace them with 'unknown'.
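
The cleaning rules listed above for each column could be sketched in pandas roughly as follows. This is only an illustration under stated assumptions: the `sample` rows are made up, the helper name `clean_claims` is hypothetical, and the stripping of an "R$" prefix assumes that claim_amount may be stored as text, which should be checked against the real file.

```python
import pandas as pd

# Hypothetical rows for illustration only; the real dataset may look different.
sample = pd.DataFrame({
    "claim_id": [1, 2, 3, 4],
    "time_to_close": [317, None, 195, 183],
    "claim_amount": ["R$ 74474.55", "R$ 52137.83", None, "R$ 24447.20"],
    "amount_paid": [51231.37, None, 42111.30, 23986.30],
    "location": ["RECIFE", "FORTALEZA", "SAO LUIS", None],
    "individuals_on_claim": [15, None, 12, 10],
    "linked_cases": [False, None, True, False],
    "cause": ["unknown", None, "meat", "vegetable"],
})


def clean_claims(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the per-column rules from Task 1 (hypothetical helper)."""
    out = df.copy()

    # Assumption: claim_amount may arrive as text such as "R$ 74474.55".
    out["claim_amount"] = pd.to_numeric(
        out["claim_amount"].astype(str).str.replace("R$", "", regex=False).str.strip(),
        errors="coerce",
    )

    # Numeric columns: fill missing values with the overall median.
    for col in ["time_to_close", "claim_amount", "amount_paid"]:
        out[col] = out[col].fillna(out[col].median())
    out["claim_amount"] = out["claim_amount"].round(2)
    out["amount_paid"] = out["amount_paid"].round(2)

    # individuals_on_claim: missing values become 0.
    out["individuals_on_claim"] = out["individuals_on_claim"].fillna(0).astype(int)

    # linked_cases: missing values become False.
    out["linked_cases"] = (
        out["linked_cases"].map(lambda v: False if pd.isna(v) else bool(v)).astype(bool)
    )

    # cause: missing values become "unknown".
    out["cause"] = out["cause"].fillna("unknown")

    # location: rows with a missing location are removed.
    return out.dropna(subset=["location"])


# Count missing values before cleaning, then clean.
missing_before = sample.isna().sum()
cleaned = clean_claims(sample)
print(missing_before)
print(cleaned)
```

Calling `isna().sum()` before and after cleaning gives the number of missing values per column, which is the figure the write-up for each column asks for.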


Task 2

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset into a DataFrame

df = pd.read_csv('https://s3.amazonaws.com/talent-assets.datacamp.com/food_claims_2212.csv')

# Count the number of claims in each location
location_counts = df['location'].value_counts()

# Create the bar plot
plt.figure(figsize=(8, 6))
location_counts.plot(kind='bar')
plt.xlabel('Location')
plt.ylabel('Number of Claims')
plt.title('Number of Claims in Each Location')
plt.xticks(rotation=45)

# Show the plot
plt.show()

We can use the bar plot of the number of claims in each location to judge whether the observations are balanced across the categories of the "location" variable. The observations are balanced if every location has roughly the same number of claims. They are unbalanced if the number of claims varies widely between locations.

After creating the bar plot with the code above, compare the heights of the bars for each location. Bars of similar height indicate a balanced distribution. If one or more bars stand out as noticeably taller or shorter than the others, the distribution is unbalanced.
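
The visual check can be backed by numbers. The sketch below computes each location's share of claims and the ratio between the largest and smallest category. The `locations` values are invented for illustration, and using a max-to-min ratio as a measure of imbalance is an assumption of this sketch, not something the task prescribes.

```python
import pandas as pd

# Hypothetical location values for illustration; not taken from the real file.
locations = pd.Series(
    ["RECIFE"] * 5 + ["SAO LUIS"] * 3 + ["FORTALEZA"] + ["NATAL"],
    name="location",
)

# Share of all claims that falls in each location.
shares = locations.value_counts(normalize=True)

# Ratio of the largest category to the smallest; values near 1 suggest balance.
imbalance_ratio = shares.max() / shares.min()

print(shares.round(2))
print(f"Largest category is {imbalance_ratio:.1f}x the smallest")
```

With the real data, the same two lines can be applied to `df['location']` in place of the made-up series.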


Task 3

The histogram shows the distribution of the "time_to_close" variable, representing the number of claims within different time intervals. The x-axis represents the time to close in days, and the y-axis represents the frequency or count of claims falling within each time interval. The height of each bar indicates the number of claims within that specific time interval.

You can observe the shape of the histogram to understand the distribution. Common distribution shapes include bell-shaped (normal), skewed (left or right), or multi-modal. Additionally, you can analyze measures such as the central tendency (mean or median) and the spread (variance or standard deviation) of the distribution to provide a more detailed description.
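
The measures mentioned above can be computed directly with pandas. The following is a small sketch on made-up closing times (the `days` values are hypothetical); with the real data, `df['time_to_close']` would take their place.

```python
import pandas as pd

# Hypothetical closing times in days, for illustration only.
days = pd.Series(
    [150, 160, 170, 175, 180, 185, 190, 200, 260, 330],
    name="time_to_close",
)

summary = {
    "mean": days.mean(),      # central tendency, sensitive to long tails
    "median": days.median(),  # central tendency, robust to long tails
    "std": days.std(),        # spread (sample standard deviation)
    "skew": days.skew(),      # > 0 suggests a right-skewed distribution
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A mean noticeably above the median, together with a positive skew, points to a right-skewed distribution, meaning a few claims take much longer to close than the typical claim.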


import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset into a DataFrame
df = pd.read_csv('https://s3.amazonaws.com/talent-assets.datacamp.com/food_claims_2212.csv')

# Plot the histogram of time to close (missing values are dropped so the bins can be computed)
plt.hist(df['time_to_close'].dropna(), bins=20, edgecolor='black')

# Set plot labels and title
plt.xlabel('Time to Close (days)')
plt.ylabel('Frequency')
plt.title('Distribution of Time to Close for All Claims')

# Display the plot
plt.show()

Task 4

We can use a box plot to investigate the relationship between location and time to close. By showing the distribution of time to close for each location side by side, the box plot lets us spot differences or patterns across locations.

Here is an example of Python code that uses the supplied dataset to create a box plot:

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset into a DataFrame
df = pd.read_csv('https://s3.amazonaws.com/talent-assets.datacamp.com/food_claims_2212.csv')

# Create a single box plot of time to close for each location
fig, ax = plt.subplots(figsize=(8, 6))
df.boxplot(column='time_to_close', by='location', ax=ax)

# Set plot labels and title
ax.set_title('Relationship between Time to Close and Location')
ax.set_xlabel('Location')
ax.set_ylabel('Time to Close (days)')
fig.suptitle('')  # Remove the automatic "Boxplot grouped by location" title
plt.show()
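
The comparison that the box plot shows visually can also be summarized in a table. Here is a minimal sketch using a grouped aggregation; the `sample` rows are hypothetical, and the choice of count, median, min, and max is just one reasonable set of statistics.

```python
import pandas as pd

# Hypothetical rows for illustration; not taken from the real file.
sample = pd.DataFrame({
    "location": ["RECIFE", "RECIFE", "RECIFE", "NATAL", "NATAL", "NATAL"],
    "time_to_close": [180, 190, 200, 170, 185, 210],
})

# Per-location summary of time to close, mirroring what each box displays.
by_location = (
    sample.groupby("location")["time_to_close"]
    .agg(["count", "median", "min", "max"])
)

print(by_location)
```

Similar medians and ranges across locations would suggest that location has little effect on time to close, while clear gaps would be worth investigating further.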