  • Data Analyst Associate

    Example Practical Exam Solution

    You can find the project information that accompanies this example solution in the resource center, Practical Exam Resources.

    You can use any tool that you want to do your analysis and create visualizations. Use this template to write up your summary for submission.

    You can use any markdown formatting you wish. If you are not familiar with Markdown, read the Markdown Guide before you start.

    Data Validation

    Describe the validation tasks you performed and what you found. Have you made any changes to the data to enable further analysis? Remember to describe what you did for every column in the data.


For the rest of the study, the data provided for this project was imported into Python with pandas' read_csv() function and stored under the name claims, like this:

```python
import pandas as pd

# path_to_claim_on_desktop is a placeholder for the CSV file's location
claims = pd.read_csv(path_to_claim_on_desktop)
```

claims is the name the dataset is referred to by throughout the rest of the analysis.

    1. Overall

The original claims dataset had 98 rows and 8 columns. This information was retrieved with the .shape attribute applied to claims as follows:

```python
claims.shape  # (98, 8)
```

By chaining claims.isna().sum(), I found that the only column with missing entries was the Cause column: 78 of the 98 available entries were missing.
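For reference, the missing-value count per column was produced like this:

```python
# Count missing values per column; only Cause showed missing entries (78 of 98)
claims.isna().sum()
```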

However, because this field has little impact on the course of our analysis, it was left as is at this stage (its missing values are filled in the Cause check below).

The fields of real interest are time_to_close and location.

2. Individual checks

The results retrieved by applying the .info() method to the dataset were very useful for validating the individual fields.
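For reference, the call is simply:

```python
# Summary of each column: name, non-null count, and dtype
claims.info()
```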

    a) Claim ID ---> No manipulation was made here

Project task expectation: Character, the unique identifier of the claim. In pandas, character (string) values have dtype 'object'; this information can be retrieved from the corresponding line of the .info() output.
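As a minimal sketch (assuming the column is literally named 'Claim ID', as in the heading above), the dtype of a single column can also be checked directly:

```python
# 'object' is pandas' dtype for string values
print(claims['Claim ID'].dtype)
```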

b) time_to_close ---> No manipulation was made here

Project task expectation: Numeric, the number of days to close the claim. The column already held int values, which are legitimate numeric values in Python, so this field was fine and needed no modification.

c) Claim Amount ---> This field was modified

Project task expectation: Numeric, initial claim value in the currency of Brazil. For example, “R$50,000.00” should be converted into 50000.

We had a problem here: the requirements set for the project constrain us to numeric values, but the .info() method revealed that the actual dtype of the 'Claim Amount' field was 'object' (i.e. string) instead of the expected numeric type.

Manipulations to fix it:

1. First we stripped the substring 'R$' from the values in the column and removed the ',' thousands separators.
2. The result was cast to int by applying the astype() method to the entire column, and the modified column was assigned back into the dataset.

The whole manipulation looked like this:

```python
# 1. Strip the 'R$' prefix and remove the ',' thousands separators
claims['Claim Amount'] = claims['Claim Amount'].str.strip('R$').str.replace(',', '')

# 2. Cast with astype(); going through float first handles decimal endings such as '.00'
claims['Claim Amount'] = claims['Claim Amount'].astype('float').astype('int')
```

The result is a Claim Amount column holding the integer version of the original values, on which numeric operations are now possible.
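As a quick illustration (using only the example value quoted in the project brief, not the real data), the cleaning chain behaves like this:

```python
import pandas as pd

# Illustrative check on the brief's example value "R$50,000.00"
example = pd.Series(["R$50,000.00"])
cleaned = example.str.strip('R$').str.replace(',', '').astype('float').astype('int')
print(cleaned.iloc[0])  # 50000
```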

    d) Amount Paid ---> No manipulation was made here

Project task expectation: Numeric, total amount paid after the claim closed, in the currency of Brazil. The actual dtype here was 'float64', which is the right type to store numbers with decimals.


    e) Location ---> No manipulation was made here

Project task expectation: Character, location of the claim, one of “RECIFE”, “SAO LUIS”, “FORTALEZA”, or “NATAL”. Note: the recommended dtype for this variable would be 'category', but its type was left untouched (the values are stored with dtype 'object').
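For completeness, a minimal sketch of the recommended (but not applied) conversion, assuming the column is stored as claims['location'] as in the validation below:

```python
# Optional: convert to the memory-efficient 'category' dtype (not applied in this analysis)
claims['location'] = claims['location'].astype('category')
```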

The only validation made here was to make sure the only values recorded in Location were “RECIFE”, “SAO LUIS”, “FORTALEZA”, or “NATAL”. Here was our approach:

```python
# Expected set of locations from the project brief
correct_locations = {"RECIFE", "SAO LUIS", "FORTALEZA", "NATAL"}

# Actual set of locations recorded in the data
claims_actual_locations = set(claims['location'].unique())

assert correct_locations == claims_actual_locations, \
    "This was printed because {} did not match {}".format(
        correct_locations, claims_actual_locations
    )
```

    f) Individual Case ---> No manipulation was made here

Project task expectation:

    g) Linked Case ---> No manipulation was made here

Project task expectation: True or False values, which correspond exactly to the boolean dtype present.

h) Cause ---> Missing values were filled here

Project task expectation: Character, the cause of the food poisoning injuries, one of ‘vegetable’, ‘meat’, or ‘unknown’; replace any empty rows with ‘unknown’. The field contained values of the proper type: 'object'.

One manipulation was made here:

```python
# Fill the 78 missing entries with the 'unknown' category, per the project brief
claims['cause'] = claims['cause'].fillna('unknown')
```

Verification:
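A minimal verification sketch, assuming only the three expected categories listed above:

```python
# Confirm no missing values remain and only expected categories appear
assert claims['cause'].isna().sum() == 0
assert set(claims['cause'].unique()) <= {'vegetable', 'meat', 'unknown'}
```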

    Data Discovery and Visualization

    Describe what you found in the analysis and how the visualizations answer the customer questions in the project brief. In your description you should:

    • Include at least two different data visualizations to demonstrate the characteristics of variables
    • Include at least one data visualization to demonstrate the relationship between two or more variables
    • Describe how your analysis has answered the business questions in the project brief


Our job here was to figure out whether there are differences in the time it takes to close claims across the locations.

We went through three major steps (a minimal plotting sketch for these steps follows the list):

1) Evaluate how the number of claims varies across locations
2) Plot the distribution of the time to close claims
3) Compare the average closure time (time_to_close) across locations
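A minimal plotting sketch for these three steps, assuming matplotlib and the lowercase column names used earlier (the actual figures can be produced with any tool):

```python
import matplotlib.pyplot as plt

# 1. Number of claims per location
claims['location'].value_counts().plot(kind='bar', title='Claims per location')
plt.show()

# 2. Distribution of time to close claims
claims['time_to_close'].plot(kind='hist', bins=20, title='Distribution of time_to_close')
plt.show()

# 3. Average closure time per location
claims.groupby('location')['time_to_close'].mean().plot(
    kind='bar', title='Average time_to_close by location'
)
plt.show()
```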

    How does the number of claims differ across locations?

There are four possible locations in our dataset. The location with the largest number of claims across the four is 'RECIFE', with 'SAO LUIS' second, although with roughly half as many claims.


    ✅ When you have finished...

    • Publish your Workspace using the option on the left
    • Check the published version of your report:
      • Can you see everything you want us to grade?
      • Are all the graphics visible?
    • Review the grading rubric. Have you included everything that will be graded?
    • Head back to the Certification Dashboard to submit your practical