Data Scientist Professional Practical Exam

Company Background

Nearly New Nautical is a website that allows users to advertise their used boats for sale. When users list their boat, they have to provide a range of information about their boat. Boats that get lots of views bring more traffic to the website, and more potential customers.

To boost traffic to the website, the product manager wants to avoid listing boats that are unlikely to receive many views.

Customer Question

The product manager wants to know the following:

Can you predict the number of views a listing will receive based on the boat's features?

Success Criteria

The product manager would consider using your model if, on average, the predictions were no more than 50% off the true number of views a listing receives.
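One possible reading of this criterion is a mean absolute percentage error (MAPE) of at most 50%. The sketch below assumes that interpretation. The function name `mean_abs_pct_error` and the sample values are illustrative and not part of the brief.

```python
import numpy as np

def mean_abs_pct_error(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.5 means 50%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # The target is at least 13 views in this dataset, so no division by zero is expected.
    return float(np.mean(np.abs(y_pred - y_true) / y_true))

# Hypothetical true and predicted view counts
y_true = [100, 200, 400]
y_pred = [150, 100, 400]

mape = mean_abs_pct_error(y_true, y_pred)
print(f"MAPE: {mape:.1%}")
print("Meets criterion:", mape <= 0.5)
```

Other readings are possible (for example a median rather than a mean), so the choice of metric should be confirmed with the product manager.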

Dataset

The data you will use for this analysis can be accessed here: "data/boat_data.csv"

Validation Report

The data set has 9888 rows and 10 columns. The validation process has revealed that every feature column contains data which is either missing or must be cleaned before processing.

Feature	Type	No. invalid samples	Notes
Number of views last 7 days	Numeric: 13 to 3263	0
Price	Character	0	Currency code (4 categories) and currency amount (digits)
Boat Type	Character: 26 non-exclusive categories, comma-separated	0
Manufacturer	Character: 910 distinct values	1338
Type	Character: 10 non-exclusive categories, comma-separated	0	There are 6 missing values but the categories are non-exclusive
Year Built	Numeric: 1885 to 2021	551	All invalid samples have a value of 0
Length	Numeric: 1.04 to 100.00	9
Width	Numeric: 0.50 to 21.56	56	Lowest value of 0.01 to be considered an outlier as it is 50 times smaller than next highest value and represents a boat width of 1cm, which is plainly absurd
Material	Character: 11 distinct values	1749
Location	Character: 2995 distinct values	36	Majority (8945) of values in the format: "Country » Region"
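The checks summarised in the table can be sketched on a small toy frame. All values below are made up for illustration. The rules used (a year of 0 counts as missing, a width of 0.01 counts as invalid, and a location should contain " » ") are assumptions drawn from the notes column, not the exact validation procedure.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real data (values are made up)
toy = pd.DataFrame({
    "Year Built": [2017, 0, 1999, 0],
    "Width": [2.5, 0.01, 3.1, np.nan],
    "Location": ["Switzerland » Lake Geneva", "Germany", None, "Italy » Lago Maggiore"],
})

# A year of 0 is a placeholder, so mask it out as missing
year = toy["Year Built"].mask(toy["Year Built"] == 0)

# Count the implausible 0.01 width as invalid alongside true nulls
width_invalid = toy["Width"].isna() | (toy["Width"] <= 0.01)

# Locations that follow the "Country » Region" pattern
has_region = toy["Location"].str.contains(" » ", na=False, regex=False)

invalid_counts = pd.Series({
    "Year Built": int(year.isna().sum()),
    "Width": int(width_invalid.sum()),
    "Location (no region)": int((~has_region).sum()),
})
print(invalid_counts)
```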

import re
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, KFold
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, AdaBoostRegressor
from sklearn.linear_model import Lasso
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.inspection import permutation_importance

# Import data
df = pd.read_csv('data/boat_data.csv')
target_name = 'Number of views last 7 days'

pd.concat([df.isnull().sum(), df.dtypes], keys=['Number of Nulls', 'dtype'], axis=1)

Number of views last 7 days (target)

df[[target_name]].describe()

Price

# Validate the 4 currency codes and check that price values are not missing
currency_codes_and_values = df['Price'].str.extract(r'([\w£]+)\s(\d+)')
currency_codes = currency_codes_and_values.iloc[:, 0]
currency_values = currency_codes_and_values.iloc[:, 1]

print("Number of records per currency:\n", currency_codes.value_counts())
print("\nNumber of records missing currency value: ", currency_values.isnull().sum())
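As a possible next step, the same pattern can split the price into a currency column and a numeric amount. This is a minimal sketch on made-up strings. The sample currency codes and the column names `currency` and `amount` are assumptions for illustration only.

```python
import pandas as pd

# Made-up price strings in the "<code> <amount>" shape described in the table
prices = pd.Series(["CHF 3337", "EUR 3490", "£ 12000", "DKK 25900"])

# Named groups give the extracted columns readable labels
parts = prices.str.extract(r"(?P<currency>[\w£]+)\s(?P<amount>\d+)")

# The digits are extracted as strings, so convert them to numbers
parts["amount"] = pd.to_numeric(parts["amount"])

print(parts)
```

Amounts in different currencies are not directly comparable, so a conversion to a single currency would still be needed before modelling.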

Boat Type

# Validate Boat Type - checking for multiple categories
print("Number of rows with multiple boat type values: ", df['Boat Type'].str.contains(',').sum())

unique_boat_types = df['Boat Type'].str.get_dummies(sep=',').columns
print("Number of unique boat type values: ", unique_boat_types.shape[0])
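Because the categories are non-exclusive, one way to encode them is a multi-hot indicator matrix. The sketch below uses made-up values and assumes the separators may carry stray spaces. Whether the real column does is not confirmed here.

```python
import pandas as pd

# Made-up comma-separated categories; note the inconsistent spacing
boat_type = pd.Series(["Motor Yacht", "Cabin Boat,Classic", "Classic, Sport Boat"])

# Normalise ", " to "," so "Classic" and " Classic" are not counted separately
cleaned = boat_type.str.replace(r"\s*,\s*", ",", regex=True).str.strip()

# One indicator column per category; a row can have several 1s
dummies = cleaned.str.get_dummies(sep=",")

print(dummies.columns.tolist())
print(dummies.sum())
```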

Manufacturer

# Validate manufacturer
df['Manufacturer'].nunique()

Type

# Validate Type - checking for multiple categories
print("Number of rows with multiple type values: ", df['Type'].str.contains(',').sum())

unique_types = df['Type'].str.get_dummies(sep=',').columns
print("Number of unique type values: ", unique_types.shape[0])
