Data Scientist Professional Case Study
Company Background
Nearly New Nautical is a website that allows users to advertise their used boats for sale. When users list their boat, they have to provide a range of information about their boat. Boats that get lots of views bring more traffic to the website, and more potential customers.
To boost traffic to the website, the product manager wants to prevent listing boats that do not receive many views.
Customer Question
The product manager wants to know the following:
- Can you predict the number of views a listing will receive based on the boat's features?
Success Criteria
The product manager would consider using your model if, on average, the predictions were within 50% of the true number of views a listing would receive.
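One way to make this criterion measurable is the mean absolute percentage error (MAPE): the average of |predicted - actual| / |actual| across listings, with a target of at most 0.50. Reading the criterion as MAPE is an assumption, since the brief does not name a metric. The sketch below uses made-up view counts and a hypothetical helper name purely for illustration.

```python
def mape(y_true, y_pred):
    """Mean of |prediction - actual| / |actual|; undefined if any actual value is 0."""
    errors = [abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

# Hypothetical view counts, invented for this example (not from the dataset)
actual_views = [100, 200, 50]
predicted_views = [150, 180, 60]

score = mape(actual_views, predicted_views)

# Assumed interpretation of the success criterion: MAPE of at most 50%
meets_criterion = score <= 0.50
print(f"MAPE: {score:.3f}, meets criterion: {meets_criterion}")
```

The `mean_absolute_percentage_error` function from scikit-learn, imported later in this notebook, computes the same quantity for non-zero targets.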
Dataset
The data you will use for this analysis can be accessed here: "data/boat_data.csv"
| Column Name | Criteria |
|---|---|
| Price | Character, boat price listed in different currencies (e.g. EUR, £, CHF etc.) on the website. |
| Boat Type | Character, type of the boat |
| Manufacturer | Character, manufacturer of the boat. |
| Type | Character, condition of the boat and engine type (e.g. Diesel, Unleaded, etc.). |
| Year Built | Numeric, year the boat was built. |
| Length | Numeric, length of the boat in meters. |
| Width | Numeric, width of the boat in meters. |
| Material | Character, material of the boat (e.g. GRP, PVC, etc.). |
| Location | Character, location where the boat is listed. |
| Number of views last 7 days | Numeric, number of views of the listing in the last 7 days. |
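Because Price is stored as text that mixes a currency with an amount, it usually has to be split before it can be used as a numeric feature. The following is a minimal sketch that assumes each value looks like a currency token followed by a whole number (for example "EUR 3490"); the exact format in the file should be checked first, and `split_price` is a hypothetical helper, not part of the original analysis.

```python
import re

# Assumed format: a non-digit currency token, optional whitespace, then digits
PRICE_PATTERN = re.compile(r"^\s*(?P<currency>\D+?)\s*(?P<amount>\d+)\s*$")

def split_price(raw):
    """Return (currency, amount) for strings like 'EUR 3490', else (None, None)."""
    match = PRICE_PATTERN.match(raw)
    if match is None:
        return (None, None)
    return (match.group("currency").strip(), int(match.group("amount")))

# Illustrative strings, not rows taken from the dataset
examples = ["EUR 3490", "CHF 3337", "£ 25000", "price on request"]
parsed = [split_price(p) for p in examples]
print(parsed)
```

The amounts would still need converting to a single currency before modelling; the exchange rates to use are not given in the brief and would have to be chosen separately.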
Import
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
from sklearn.metrics import r2_score
plt.rcParams["figure.figsize"] = (15, 8)
# Read file into dataframe
df = pd.read_csv("data/boat_data.csv")
# Inspect dataframe by printing out the first few rows
df.head()
# General overview of the dataset
df.info()
# Check the number of rows and columns in the dataframe
df.shape
# Check the datatypes
df.dtypes
# Count unique values in each column
df.nunique()
# Count duplicate rows
df.duplicated().sum()
# Print the number of nulls in each column
df.isnull().sum()
# Fill missing values in the categorical columns with 'Unknown'
df.fillna({'Manufacturer': 'Unknown', 'Material': 'Unknown', 'Type': 'Unknown', 'Location': 'Unknown'}, inplace=True)
df.isnull().sum()
df.head()
# Fill missing values of length and width with the column mean
length_mean = df['Length'].mean()
width_mean = df['Width'].mean()
df.fillna({'Length': length_mean, 'Width': width_mean }, inplace=True)
df.isnull().sum()
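The mean imputation above uses every row in the dataset. If the data is later split into training and test sets, a common alternative is to compute the means from the training rows only and reuse them for the test rows, so that test data does not influence the fill values. The sketch below illustrates that variant on a small made-up frame; the toy values and the positional split are assumptions for illustration, not the notebook's actual workflow.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Length and Width columns
toy = pd.DataFrame({
    "Length": [4.0, np.nan, 6.0, 8.0, np.nan, 10.0],
    "Width": [1.5, 2.0, np.nan, 3.0, 2.5, np.nan],
})

# Simple positional split for illustration; a real split would be randomized
train = toy.iloc[:4].copy()
test = toy.iloc[4:].copy()

# Column means computed from the training rows only
train_means = train[["Length", "Width"]].mean()

# fillna with a Series fills each column using the value under its label
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(train_means)
print(test_filled)
```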