  • Data Analyst Associate Practical Exam Submission

    You can use any tool that you want to do your analysis and create visualizations. Use this template to write up your summary for submission.

    You can use any markdown formatting you wish. If you are not familiar with Markdown, read the Markdown Guide before you start.

    Data Validation

    Describe the validation tasks you performed and what you found. Have you made any changes to the data to enable further analysis? Remember to describe what you did for every column in the data.

    The original data consists of 8 columns and 98 rows. I started by checking the 'Claim ID' column for duplicates; there were none. I then verified that each column had the correct data type assigned to it. The initial claim value ('Claim Amount' column), recorded in Brazilian currency, had to be converted to a number: for example, 'R$50,000.00' becomes 50000. I used Python string methods such as str.replace() to strip the formatting and astype(float) to convert the result to numeric (float64). While exploring the data, I found a negative value (-57) in the 'Time to Close' column, possibly due to a calculation error. 'Time to Close' must be a positive number representing the days it takes to process the claim, so I replaced the value with a positive 57; the records should be investigated further to see whether more negative values appear. According to the case study, the causes of the food poisoning injuries ('Cause' column) should be 'vegetable', 'meat' or 'unknown', so I replaced any empty rows with 'unknown' and 'vegetables' with 'vegetable'. After cleaning, the columns looked as follows:
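The cleaning steps described above can be sketched roughly as follows. This is a minimal sketch, assuming the data is loaded into a pandas DataFrame with the column names from the brief; the function name is my own, not from the original analysis.

```python
import pandas as pd

def clean_claims(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the validation fixes described above to a copy of the data."""
    df = df.copy()
    # 'Claim Amount': strip the Brazilian currency formatting,
    # e.g. 'R$50,000.00' -> 50000.0, then convert to float64
    df["Claim Amount"] = (
        df["Claim Amount"]
        .str.replace("R$", "", regex=False)
        .str.replace(",", "", regex=False)
        .astype(float)
    )
    # 'Time to Close': a negative duration is assumed to be a sign error,
    # so replace it with its positive counterpart
    df["Time to Close"] = df["Time to Close"].abs()
    # 'Cause': fill empty rows with 'unknown' and collapse 'vegetables' into 'vegetable'
    df["Cause"] = df["Cause"].fillna("unknown").replace("vegetables", "vegetable")
    return df
```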

    • 'Claim ID' column: 98 unique rows (identifiers of the claim);
    • 'Time to Close' column: 95 unique values, the value 120 (days) was repeated 3 times;
    • 'Claim Amount' column: 98 values, the most common values: 40000.0 (8 times);
    • 'Amount Paid' column: 98 values; in every row 'Claim Amount' is greater than 'Amount Paid';
    • 'Location' column: 4 locations (as expected);
    • 'Individuals on Claim' column: there were 7 claims with 0 individuals. In my opinion, the legal department providing the data should be consulted to confirm what should be done with them. Since 'Individuals on Claim' was not the subject of the business questions, I decided to keep these records in the charts;
    • 'Linked Cases' column: 2 options - True or False (as expected);
    • 'Cause' column: 3 options (as expected).

    I used several Python tools such as: Pandas, NumPy, Matplotlib and Seaborn. The following Python methods and functions were helpful in determining the above:

    • value_counts()
    • describe()
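As an illustration of how these two methods were used, here is a sketch on a small toy frame (the real data has 98 rows):

```python
import pandas as pd

df = pd.DataFrame({
    "Location": ["SAO LUIS", "RECIFE", "SAO LUIS"],
    "Time to Close": [120, 57, 300],
})

# value_counts(): frequency of each category, used to confirm the expected
# sets of values in columns such as 'Location', 'Linked Cases' and 'Cause'
print(df["Location"].value_counts())

# describe(): summary statistics (count, mean, min, max, quartiles), used to
# spot out-of-range values such as a negative 'Time to Close'
print(df["Time to Close"].describe())
```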

    Data Discovery and Visualization

    Describe what you found in the analysis and how the visualizations answer the customer questions in the project brief. In your description you should:

    • Include at least two different data visualizations to demonstrate the characteristics of single variables
    • Include at least one data visualization to demonstrate the relationship between two or more variables
    • Describe how your analysis has answered the business questions in the project brief

    How does the number of claims differ across locations?

    There are four locations, and SAO LUIS has the highest number of claims of all of them (30). RECIFE is not far behind with 25 claims. FORTALEZA and NATAL are close to each other, with 22 and 21 claims respectively.
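The bar chart behind this comparison can be reproduced roughly as below. This is a sketch using the per-location counts reported above rather than the raw data; the output file name is my own choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Claim counts per location as reported above (30 + 25 + 22 + 21 = 98 claims)
counts = pd.Series({"SAO LUIS": 30, "RECIFE": 25, "FORTALEZA": 22, "NATAL": 21})

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(counts.index, counts.values)
ax.set_xlabel("Location")
ax.set_ylabel("Number of claims")
ax.set_title("Number of claims by location")
fig.savefig("claims_by_location.png")
```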

    What is the distribution of time to close claims?

    It is a right-skewed distribution: most values are clustered in the left tail while the right tail of the distribution is longer. Most claims take between 500 and 1,000 days to close, but there are also some claims that take more than 3,000 days.
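A histogram of 'Time to Close' shows this shape. The sketch below uses a placeholder right-skewed sample (drawn from a gamma distribution) standing in for the real column, since the raw data is not reproduced here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Placeholder right-skewed sample standing in for the real 'Time to Close' column
time_to_close = rng.gamma(shape=2.0, scale=400.0, size=98)

fig, ax = plt.subplots(figsize=(6, 4))
ax.hist(time_to_close, bins=20, edgecolor="black")
ax.set_xlabel("Time to close (days)")
ax.set_ylabel("Number of claims")
ax.set_title("Distribution of time to close claims")
fig.savefig("time_to_close_hist.png")
```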

    How does the average time to close claims differ by location?

    The box plot above compares the distributions of the time it takes to close a claim for each location. This plot shows the range of all data points as well as a summary displaying the median, quartiles and extreme points of the dataset.

    There are outliers in three locations: RECIFE, FORTALEZA and NATAL; they lie outside the upper whiskers. SAO LUIS is working on 30 claims, which helps explain why its interquartile range is much larger and its maximum value much higher than those of the other locations. The other locations have fairly similar interquartile ranges, which matches the fairly similar numbers of claims they handle, namely 25, 22, and 21.
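A per-location box plot of this kind can be sketched as below. Placeholder data stands in for the cleaned claims DataFrame; the real analysis would use the actual 'Location' and 'Time to Close' columns (the author used Seaborn, whose seaborn.boxplot produces an equivalent chart).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Placeholder data standing in for the cleaned claims DataFrame
df = pd.DataFrame({
    "Location": rng.choice(["SAO LUIS", "RECIFE", "FORTALEZA", "NATAL"], size=98),
    "Time to Close": rng.gamma(shape=2.0, scale=400.0, size=98),
})

fig, ax = plt.subplots(figsize=(7, 4))
# pandas' Matplotlib-based boxplot, grouped by location
df.boxplot(column="Time to Close", by="Location", ax=ax)
ax.set_xlabel("Location")
ax.set_ylabel("Time to close (days)")
fig.suptitle("")  # drop pandas' automatic group-by title
ax.set_title("Time to close claims by location")
fig.savefig("time_to_close_by_location.png")
```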

    Based on all of the above, there is a difference in the time it takes to close a claim across the locations. I recommend that the new head of the legal team distribute the workload equitably across the legal teams, or add more staff to the SAO LUIS legal team so that it can improve its performance.

    ✅ When you have finished...

    • Publish your Workspace using the option on the left
    • Check the published version of your report:
      • Can you see everything you want us to grade?
      • Are all the graphics visible?
    • Review the grading rubric, have you included everything that will be graded?
    • Head back to the Certification Dashboard to submit your practical exam