Data Scientist Professional Practical Exam
Company Background
Nearly New Nautical is a website that allows users to advertise their used boats for sale. When users list their boat, they have to provide a range of information about their boat. Boats that get lots of views bring more traffic to the website, and more potential customers.
To boost traffic to the website, the product manager wants to prevent listing boats that do not receive many views.
Customer Question
The product manager wants to know the following:
- Can you predict the number of views a listing will receive based on the boat's features?
Success Criteria
The product manager would consider using your model if, on average, the predictions were within 50% of the true number of views a listing would receive.
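One way to make this criterion measurable is the mean absolute percentage error (MAPE). Reading the criterion as "MAPE of at most 0.5" is an assumption, not something the brief states explicitly. The helper name and the view counts below are hypothetical and serve only as a minimal sketch.

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    """Mean of |prediction - truth| / truth, returned as a fraction (0.5 == 50%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true) / y_true))

# Hypothetical view counts, for illustration only
true_views = [100, 200, 400]
predicted_views = [150, 150, 400]

mape = mean_absolute_percentage_error(true_views, predicted_views)
# Assumed reading of the success criterion: average relative error of at most 50%
meets_criterion = mape <= 0.5
print(f"MAPE: {mape:.2f}, meets criterion: {meets_criterion}")
```

Dividing by the true value is safe here because the validation report gives a minimum of 13 views, so the target is never zero.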
Dataset
The data you will use for this analysis can be accessed here: "data/boat_data.csv"
Validation Report
The data set has 9888 rows and 10 columns. The validation process has revealed that every feature column contains data which is either missing or must be cleaned before processing.
Feature | Type | No. invalid samples | Notes |
---|---|---|---|
Number of views last 7 days | Numeric: 13 to 3263 | 0 | |
Price | Character | 0 | Currency code (4 categories) and currency amount (digits) |
Boat Type | Character: 26 non-exclusive categories, comma-separated | 0 | |
Manufacturer | Character: 910 distinct values | 1338 | |
Type | Character: 10 non-exclusive categories, comma-separated | 0 | There are 6 missing values; these are not counted as invalid because the categories are non-exclusive |
Year Built | Numeric: 1885 to 2021 | 551 | All invalid samples have a value of 0 |
Length | Numeric: 1.04 to 100.00 | 9 | |
Width | Numeric: 0.50 to 21.56 | 56 | The lowest value, 0.01, is treated as an outlier: it is 50 times smaller than the next smallest value and would represent a boat width of 1 cm, which is implausible |
Material | Character: 11 distinct values | 1749 | |
Location | Character: 2995 distinct values | 36 | Majority (8945) of values in the format: "Country » Region" |
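The notes for Year Built and Width suggest that some invalid values are encoded as numbers rather than as nulls. A minimal sketch of converting them to missing values is shown below. The sample rows are hypothetical, and the 0.5 cutoff for Width is an assumption taken from the valid range listed in the table.

```python
import pandas as pd

# Hypothetical rows, not taken from the real data set
sample = pd.DataFrame({
    'Year Built': [2017, 0, 1999],
    'Width': [3.5, 0.01, 2.46],
})

cleaned = sample.copy()
# A build year of 0 marks an invalid sample, so mask it out as NaN
cleaned['Year Built'] = cleaned['Year Built'].mask(cleaned['Year Built'] == 0)
# Assumed cutoff: widths below the smallest valid value (0.50) become NaN
cleaned['Width'] = cleaned['Width'].where(cleaned['Width'] >= 0.5)

n_missing = cleaned.isnull().sum()
print(n_missing)
```

Marking these values as missing, rather than dropping the rows, keeps the other features of those listings available for imputation later.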
import re
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.linear_model import Lasso
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.inspection import permutation_importance
# Import data
df = pd.read_csv('data/boat_data.csv')
target_name = 'Number of views last 7 days'
pd.concat([df.isnull().sum(), df.dtypes], keys=['Number of Nulls', 'dtype'], axis=1)
Number of views last 7 days (target)
df[[target_name]].describe()
Price
# Validate the 4 currency codes and check that price values are not missing
currency_codes_and_values = df['Price'].str.extract(r'([\w£]+)\s(\d+)')
currency_codes = currency_codes_and_values[0]
currency_values = currency_codes_and_values[1]
print("Number of records per currency:\n", currency_codes.value_counts())
print("\nNumber of records missing currency value: ", currency_values.isnull().sum())
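To illustrate what the extraction pattern does, here is a small sketch on made-up price strings. The sample values are hypothetical and only assume the "currency code, space, digits" shape described in the validation report.

```python
import pandas as pd

# Hypothetical price strings; the last one has no amount
prices = pd.Series(['CHF 3337', 'EUR 3490', 'DKK 25900', '£ 3500', 'EUR'])

# Group 0 captures the currency code, group 1 the digits of the amount
parts = prices.str.extract(r'([\w£]+)\s(\d+)')
codes = parts[0]
amounts = pd.to_numeric(parts[1])

print(codes.value_counts())
print("Rows without a parsable amount:", amounts.isnull().sum())
```

A row that does not match the pattern yields NaN in both columns, which is why counting nulls in the amount column doubles as a check for malformed prices.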
Boat Type
# Validate Boat Type - checking for multiple categories
print("Number of rows with multiple boat type values: ", df['Boat Type'].str.contains(',').sum())
unique_boat_types = df['Boat Type'].str.get_dummies(sep=',').columns
print("Number of unique boat type values: ", unique_boat_types.shape[0])
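As a sketch of how `str.get_dummies` handles comma-separated, non-exclusive categories, the example below uses three made-up values. The category names are hypothetical stand-ins for the 26 real ones.

```python
import pandas as pd

# Hypothetical values; each entry may list one or more categories
boat_type = pd.Series(['Motor Yacht', 'Cabin Boat,Classic', 'Classic,Motor Yacht'])

# One 0/1 indicator column per distinct category across all rows
indicators = boat_type.str.get_dummies(sep=',')
categories_per_row = indicators.sum(axis=1)

print(indicators)
print("Rows with multiple categories:", (categories_per_row > 1).sum())
```

Because a row can set several indicator columns at once, the number of columns gives the count of distinct categories even when no single row contains all of them.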
Manufacturer
# Validate manufacturer
df['Manufacturer'].nunique()
Type
# Validate Type - checking for multiple categories
print("Number of rows with multiple type values: ", df['Type'].str.contains(',').sum())
unique_types = df['Type'].str.get_dummies(sep=',').columns
print("Number of unique type values: ", unique_types.shape[0])