Data Scientist Professional Case Study

Company Background

Nearly New Nautical is a website that allows users to advertise their used boats for sale. When users list a boat, they must provide a range of information about it. Boats that get lots of views bring more traffic to the website, and more potential customers.

To boost traffic to the website, the product manager wants to avoid listing boats that are unlikely to receive many views.

Customer Question

The product manager wants to know the following:

Can you predict the number of views a listing will receive based on the boat's features?

Success Criteria

The product manager would consider using your model if, on average, its predictions were within 50% of the true number of views a listing receives.
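One possible way to make this criterion measurable is the mean absolute percentage error (MAPE): the average of |predicted - actual| / actual across listings, with a target of at most 0.5. This reading of "50% off on average" is an assumption, and the view counts below are made up purely for illustration.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.5 == 50%).

    Assumes every true value is positive; a listing with zero views
    would cause a division by zero.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    errors = [abs(pred - true) / true for true, pred in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


# Hypothetical view counts, not taken from the dataset
actual_views = [100, 200, 50]
predicted_views = [150, 180, 60]

score = mape(actual_views, predicted_views)

# Under this reading, the success criterion is met when MAPE <= 0.5
meets_criterion = score <= 0.5
print(f"MAPE: {score:.3f}, meets criterion: {meets_criterion}")
```

scikit-learn's `mean_absolute_percentage_error` computes the same quantity and could be used instead once a model has been fitted.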

Dataset

The data you will use for this analysis can be accessed here: "data/boat_data.csv"

Column Name	Criteria
Price	Character, boat price listed in different currencies (e.g. EUR, £, CHF, etc.) on the website.
Boat Type	Character, type of the boat.
Manufacturer	Character, manufacturer of the boat.
Type	Character, condition of the boat and engine type (e.g. Diesel, Unleaded, etc.).
Year Built	Numeric, year the boat was built.
Length	Numeric, length of the boat in meters.
Width	Numeric, width of the boat in meters.
Material	Character, material of the boat (e.g. GRP, PVC, etc.).
Location	Character, location where the boat is listed.
Number of views last 7 days	Numeric, number of views of the listing in the last 7 days.
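Because Price is stored as text with the currency embedded, it usually has to be split before it can serve as a numeric feature. The sketch below assumes each value looks like "<currency> <amount>" (for example "EUR 3490"); that layout and the sample rows are assumptions and should be checked against the actual file.

```python
import pandas as pd

# Hypothetical rows; the "<currency> <amount>" layout is an assumption
sample = pd.DataFrame({"Price": ["EUR 3490", "CHF 3337", "£ 2000"]})

# Capture the leading non-digit currency marker and the numeric amount
parts = sample["Price"].str.extract(
    r"^\s*(?P<Currency>\D+?)\s*(?P<Amount>\d+(?:\.\d+)?)\s*$"
)

sample["Currency"] = parts["Currency"]
sample["Amount"] = pd.to_numeric(parts["Amount"])

print(sample)
```

The amounts are still in different currencies after this step, so a conversion to a common currency would be needed before comparing prices across listings.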

Import

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
from sklearn.metrics import r2_score
plt.rcParams["figure.figsize"] = (15,8)

# Read file into dataframe
df = pd.read_csv("data/boat_data.csv")

# Inspect dataframe by printing out the first few rows
df.head()

# General overview of the dataset
df.info()

# Check the number of rows and columns in the dataframe
df.shape

# Check the datatypes
df.dtypes

# Check for unique values
df.nunique()

# Check for duplicates
df.duplicated().sum()

# Print out columns containing nulls and number of nulls
df.isnull().sum()

# Fill missing values in categorical columns with 'Unknown'
df.fillna({'Manufacturer': 'Unknown', 'Material': 'Unknown', 'Type': 'Unknown', 'Location': 'Unknown'}, inplace=True)
df.isnull().sum()

df.head()

# Fill missing values of length and width with the column mean
length_mean = df['Length'].mean()
width_mean = df['Width'].mean()
df.fillna({'Length': length_mean, 'Width': width_mean}, inplace=True)
df.isnull().sum()
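To make the effect of the two fill steps concrete, here is a minimal sketch on a toy frame. The rows are invented for illustration and only mirror the column names used above; `toy` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (values are made up)
toy = pd.DataFrame({
    "Material": ["GRP", None, "PVC"],
    "Length": [4.0, np.nan, 6.0],
    "Width": [2.0, 1.0, np.nan],
})

# Categorical gaps get a placeholder label
toy = toy.fillna({"Material": "Unknown"})

# Numeric gaps get the column mean; .mean() skips missing values by default
toy = toy.fillna({"Length": toy["Length"].mean(), "Width": toy["Width"].mean()})

print(toy)
print(toy.isnull().sum())
```

After both steps no nulls remain in these columns: the missing Length becomes 5.0 (the mean of 4.0 and 6.0) and the missing Width becomes 1.5 (the mean of 2.0 and 1.0).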
