Data Scientist Associate

Example Practical Exam Solution

You can find the project information that accompanies this example solution in the resource center, Practical Exam Resources.

Use this template to complete your analysis and write up your summary for submission.

Task 1

Before cleaning, the dataset contains 200 rows and 9 columns, some of which have missing values. I have validated all the columns against the criteria in the dataset table:

  • Region: Same as description without missing values, 10 Regions.
  • Place name: Same as description without missing values.
  • Rating: 2 missing values, so I replace them with 0.
  • Reviews: 2 missing values, so I replace them with the overall median number of reviews.
  • Price: Same as description without missing values, 3 categories.
  • Delivery option: Same as description without missing values.
  • Dine in option: 50+ missing values, so I replace them with 'False' and convert the column to boolean data type.
  • Take out option: 50+ missing values, so I replace them with 'False' and convert the column to boolean data type.

After the data validation, the dataset contains 200 rows and 9 columns.
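
The imputation rules listed above could be sketched in pandas roughly as follows. The column names (`rating`, `reviews`, `dine_in_option`, `take_out_option`) and the toy rows are assumptions for illustration; they are not taken from the real dataset or from the hidden notebook cells.

```python
import numpy as np
import pandas as pd


def clean_stores(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the imputation rules from the list above to a copy of df.

    All column names here are hypothetical.
    """
    out = df.copy()
    # Rating: missing values are replaced with 0.
    out["rating"] = out["rating"].fillna(0)
    # Reviews: missing values are replaced with the overall median.
    out["reviews"] = out["reviews"].fillna(out["reviews"].median())
    # Dine in / take out: True stays True, missing (NaN) becomes False,
    # and the result has boolean dtype.
    for col in ("dine_in_option", "take_out_option"):
        out[col] = out[col].eq(True)
    return out


# Tiny made-up frame, only to show the effect of each rule.
toy = pd.DataFrame({
    "rating": [4.6, np.nan, 4.7],
    "reviews": [100.0, 300.0, np.nan],
    "dine_in_option": [True, np.nan, True],
    "take_out_option": [np.nan, True, np.nan],
})
cleaned = clean_stores(toy)
```

The function works on a copy, so the row and column counts of the original frame are unchanged, which matches the observation that the shape is the same before and after validation.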


Task 2

Graph 1, Count of Rating, shows that the largest number of stores received a rating of 4.6, followed by 4.7. The majority of the stores received a rating higher than 4.5.
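
As a rough sketch of how the counts behind such a graph might be computed, the snippet below tallies stores per rating with pandas. The `rating` name and the values are made up for illustration and do not reproduce the real data.

```python
import pandas as pd

# Made-up ratings for illustration only; `rating` is an assumed column name.
ratings = pd.Series([4.6, 4.7, 4.6, 4.3, 4.7, 4.6, 4.8], name="rating")

# Number of stores per rating, most common rating first.
rating_counts = ratings.value_counts()

# Share of stores rated higher than 4.5.
share_above_45 = (ratings > 4.5).mean()
```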

Inspecting the Rating and Reviews variables

Task 3

The Reviews variable is our target variable. Graph 2-1, The Distribution of Number of Reviews, shows an outlier with more than 17500 reviews. Since we don't have a lot of data, we decided to apply a log transformation. Graph 2-2 shows that the transformed distribution is much closer to a normal distribution.
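
A minimal sketch of the log transformation is shown below. It assumes the natural logarithm and strictly positive review counts; the write-up does not say which logarithm was used, and the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed review counts with one very large value.
reviews = pd.Series([10.0, 50.0, 120.0, 400.0, 18000.0], name="reviews")

# Natural log of the target (an assumption; requires all values > 0).
log_reviews = np.log(reviews)

# Sample skewness before and after: the log scale pulls in the long right tail.
skew_before = reviews.skew()
skew_after = log_reviews.skew()

# Predictions made on the log scale can be mapped back with np.exp.
recovered = np.exp(log_reviews)
```

Comparing `skew_before` with `skew_after` gives a numeric counterpart to what the two distribution plots show visually.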

Task 4

In Graph 3-1, The Relationship between Reviews and Ratings, one outlier prevents us from interpreting the relationship correctly. After removing that outlier (Graph 3-2), we can see that the number of reviews has the largest range when the store is rated 4.5 or 4.6.
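
One way to put numbers on the range comparison is sketched below: drop rows above the 17500 threshold, then compute the spread of reviews per rating. The column names, the helper name `review_range_by_rating`, and the toy rows are hypothetical.

```python
import pandas as pd


def review_range_by_rating(df: pd.DataFrame, cutoff: float = 17500) -> pd.Series:
    """Return max minus min of reviews per rating, widest range first.

    Rows with more than `cutoff` reviews are treated as outliers and dropped.
    Column names `rating` and `reviews` are assumed for illustration.
    """
    kept = df[df["reviews"] <= cutoff]
    grouped = kept.groupby("rating")["reviews"]
    return (grouped.max() - grouped.min()).sort_values(ascending=False)


# Made-up rows; the 18000 entry plays the role of the outlier.
toy = pd.DataFrame({
    "rating": [4.5, 4.5, 4.6, 4.6, 4.7, 4.7],
    "reviews": [100, 2100, 50, 18000, 300, 400],
})
ranges = review_range_by_rating(toy)
```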

Inspecting the Relationships between Ratings and Target Variable (Reviews)

Make changes to enable modeling

Finally, to enable model fitting, I have made the following changes:

  • Remove the Place name column: its values are unique, so it is not used as a feature.
  • Convert all the categorical variables into numeric variables.
  • Remove the one outlier where the number of reviews is above 17500.
  • Apply a log transformation to the target variable.
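
The four changes above could be combined into one preparation step, sketched below. One-hot encoding via `pd.get_dummies` is an assumption, since the write-up does not say how the categorical variables were converted; the column names, the helper name `prepare_for_modeling`, and the toy rows are also hypothetical.

```python
import numpy as np
import pandas as pd


def prepare_for_modeling(df: pd.DataFrame, cutoff: float = 17500):
    """Return (features, log_target) following the four changes listed above."""
    # Remove the outlier row(s) and the unique-valued name column.
    kept = df[df["reviews"] <= cutoff].drop(columns=["place_name"])
    # Log-transform the target (natural log assumed; needs reviews > 0).
    target = np.log(kept["reviews"])
    # One-hot encode text columns (an assumed encoding), then cast every
    # column, including booleans, to float so all features are numeric.
    features = pd.get_dummies(kept.drop(columns=["reviews"])).astype(float)
    return features, target


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "region": ["A", "B", "A"],
    "place_name": ["x", "y", "z"],
    "rating": [4.5, 4.6, 4.7],
    "reviews": [100, 18000, 200],
    "price": ["$", "$$", "$"],
    "delivery_option": [True, False, True],
})
X, y = prepare_for_modeling(toy)
```

Filtering before encoding means a category that appears only in a removed row does not produce an all-zero column.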