
Data Scientist Professional Practical Exam

Company Background

Nearly New Nautical is a website that allows users to advertise their used boats for sale. When users list their boat, they have to provide a range of information about their boat. Boats that get lots of views bring more traffic to the website, and more potential customers.

To boost traffic to the website, the product manager wants to prevent listing boats that do not receive many views.

Customer Question

The product manager wants to know the following:

  • Can you predict the number of views a listing will receive based on the boat's features?

Success Criteria

The product manager would consider using your model if, on average, the predictions were within 50% of the true number of views a listing would receive.
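One way to make this criterion measurable is to read it as a mean absolute percentage error (MAPE) of at most 0.5. That reading is an assumption, not something the brief states explicitly. The helper name `mape_score` and the view counts below are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def mape_score(y_true, y_pred):
    """Mean of |prediction - truth| / truth across listings (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true) / y_true))

# Made-up view counts, for illustration only
true_views = [100, 200, 50]
predicted_views = [150, 100, 60]

mape = mape_score(true_views, predicted_views)
meets_criterion = mape <= 0.5  # assumed reading of "50% off on average"
print(f"MAPE: {mape:.2f}, meets criterion: {meets_criterion}")
```

Dividing by the true value is safe here only because the target has no zeros (the validation report lists a minimum of 13 views).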

Dataset

The data you will use for this analysis can be accessed here: "data/boat_data.csv"

Validation Report

The data set has 9888 rows and 10 columns. The validation process has revealed that every feature column contains data which is either missing or must be cleaned before processing.

Feature | Type | No. invalid samples | Notes
Number of views last 7 days | Numeric: 13 to 3263 | 0 |
Price | Character | 0 | Currency code (4 categories) and currency amount (digits)
Boat Type | Character: 26 non-exclusive categories, comma-separated | 0 |
Manufacturer | Character: 910 distinct values | 1338 |
Type | Character: 10 non-exclusive categories, comma-separated | 0 | There are 6 missing values but the categories are non-exclusive
Year Built | Numeric: 1885 to 2021 | 551 | All invalid samples have a value of 0
Length | Numeric: 1.04 to 100.00 | 9 |
Width | Numeric: 0.50 to 21.56 | 56 | Lowest value of 0.01 to be considered an outlier as it is 50 times smaller than next highest value and represents a boat width of 1cm, which is plainly absurd
Material | Character: 11 distinct values | 1749 |
Location | Character: 2995 distinct values | 36 | Majority (8945) of values in the format: "Country » Region"
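The notes on Year Built and Width suggest treating those values as missing before modelling. The sketch below shows one way to do that on a toy frame. The rows are made up, and it assumes the real columns are named `Year Built` and `Width` as in the table.

```python
import pandas as pd

# Toy rows standing in for the real data; values are made up
toy = pd.DataFrame({
    "Year Built": [2010, 0, 1995],
    "Width": [2.5, 0.01, 3.1],
})

# A Year Built of 0 is invalid per the table, so mask it to NaN
toy["Year Built"] = toy["Year Built"].mask(toy["Year Built"] == 0)

# Widths below 0.50 (the smallest valid width in the table) are treated as missing
toy["Width"] = toy["Width"].mask(toy["Width"] < 0.5)

print(toy.isnull().sum())
```

Masking to NaN rather than dropping rows keeps the other columns of those listings available and leaves the choice of imputation open.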
import re
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.linear_model import Lasso
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.inspection import permutation_importance
# Import data
df = pd.read_csv('data/boat_data.csv')
target_name = 'Number of views last 7 days'
pd.concat([df.isnull().sum(), df.dtypes], keys=['Number of Nulls', 'dtype'], axis=1)
Number of views last 7 days (target)
df[[target_name]].describe()
Price
# Validate the 4 currency codes and check that price values are not missing
currency_codes_and_values = df['Price'].str.extract(r'([\w£]+)\s(\d+)')
currency_codes = currency_codes_and_values[0]
currency_values = currency_codes_and_values[1]

print("Number of records per currency:\n", currency_codes.value_counts())
print("\nNumber of records missing currency value: ", currency_values.isnull().sum())
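The same pattern can also produce model-ready columns. The following is a minimal sketch on made-up price strings: the currency codes shown are illustrative, and the group names `currency` and `amount` are hypothetical.

```python
import pandas as pd

# Made-up price strings in the "code amount" layout described above
prices = pd.Series(["EUR 3490", "CHF 3770", "£ 12000", "DKK 25900"], name="Price")

# Named groups become the column names of the extracted frame
parts = prices.str.extract(r'(?P<currency>[\w£]+)\s(?P<amount>\d+)')

# The digits are extracted as strings, so convert them to numbers
parts["amount"] = pd.to_numeric(parts["amount"])

print(parts)
```

The amounts remain in their original currencies. Converting them to a common currency would need exchange rates that are not part of the data set.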
Boat Type
# Validate Boat Type - checking for multiple categories
print("Number of rows with multiple boat type values: ", df['Boat Type'].str.contains(',').sum())

unique_boat_types = df['Boat Type'].str.get_dummies(sep=',').columns
print("Number of unique boat type values: ", unique_boat_types.shape[0])
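`str.get_dummies` splits on the separator exactly as given and does not strip whitespace, so a value such as "Classic, Sport Boat" would yield a category with a leading space. Whether the real column contains such spacing is not known. The sketch below uses made-up values to show the effect and one possible guard.

```python
import pandas as pd

# Made-up values; the third one has a space after the comma
boat_types = pd.Series(["Motor Yacht", "Cabin Boat,Classic", "Classic, Sport Boat"])

# Naive split keeps the leading space in " Sport Boat"
naive = boat_types.str.get_dummies(sep=",")

# Normalise whitespace around commas before splitting
cleaned = (
    boat_types.str.replace(r"\s*,\s*", ",", regex=True)
    .str.get_dummies(sep=",")
)

print(list(naive.columns))
print(list(cleaned.columns))
```

If the two column lists match on the real data, the normalisation step is unnecessary and the count of unique categories above stands as is.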
Manufacturer
# Validate manufacturer
df['Manufacturer'].nunique()
Type
# Validate Type - checking for multiple categories (na=False so missing values count as no match)
print("Number of rows with multiple type values: ", df['Type'].str.contains(',', na=False).sum())

unique_types = df['Type'].str.get_dummies(sep=',').columns
print("Number of unique type values: ", unique_types.shape[0])