Predicting Electric Moped Reviews

Company Background

EMO is a manufacturer of motorcycles. The company successfully launched its first electric moped in India in 2019. The product team knows how valuable owner reviews are in making improvements to their mopeds.

Unfortunately they often get reviews from people who never owned the moped. They don’t want to consider this feedback, so would like to find a way to identify reviews from these people. They have obtained data from other mopeds, where they know if the reviewer owned the moped or not. They think this is equivalent to their own reviews.

Customer Question

  • Can you predict which reviews come from people who have never owned the moped before?

Dataset

The dataset contains reviews about other mopeds from a local website.

Column Name | Criteria
Used it for | Character, the purpose of the electric moped for the user, one of “Commuting”, “Leisure”.
Owned for | Character, duration of ownership of vehicle, one of “<= 6 months”, “> 6 months”, “Never Owned”. Rows that indicate ownership should be combined into the category “Owned”.
Model name | Character, the name of the electric moped.
Visual Appeal | Numeric, visual appeal rating (on a 5 point scale, replace missing values with 0).
Reliability | Numeric, reliability rating (on a 5 point scale, replace missing values with 0).
Extra Feature | Numeric, extra feature rating (on a 5 point scale, replace missing values with 0).
Comfort | Numeric, comfort rating (on a 5 point scale, replace missing values with 0).
Maintenance cost | Numeric, maintenance cost rating (on a 5 point scale, replace missing values with 0).
Value for money | Numeric, value for money rating (on a 5 point scale, replace missing values with 0).

Data Scientist Associate Practical Exam Submission

#import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

plt.style.use('ggplot')
moped = pd.read_csv('data/moped.csv')
moped.head()
moped.info()

Columns that have null values

  • Extra Features
  • Comfort
  • Maintenance cost
  • Value for Money

The null values in these columns need to be replaced with 0.
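One way to confirm which columns contain nulls is to count them per column. The sketch below is only an illustration: it uses a small made-up frame (`toy`, with hypothetical values) in place of the real `moped` data, and the names `null_counts` and `cols_with_nulls` are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real moped reviews
toy = pd.DataFrame({
    "Visual Appeal": [3.0, 4.0, 5.0],
    "Comfort": [np.nan, 4.0, np.nan],
    "Value for Money": [2.0, np.nan, 5.0],
})

# Count nulls per column, then keep only the columns that have any
null_counts = toy.isna().sum()
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)
```

Running the same two lines against `moped` instead of `toy` should list the columns named above.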

All the 5-point scale variables are currently float64 and need to be changed to int64.

Further, the "Owned for" column needs to be converted into an "Ownership" column that indicates only whether or not the reviewer owned the moped.

Finally, since the "Used it for" column holds only two values, "Leisure" or "Commuting", it will be changed to a binary column named "Commuting", with 1 for Commuting and 0 for Leisure.

#make a copy of the moped df and store it in df
df = moped.copy()

#change the categories of the 'Owned for' column of the dataset to 'Owned' & 'Never owned'
df.loc[(df['Owned for'] == '> 6 months') | (df['Owned for'] == '<= 6 months'),'Owned for']='Owned'

#rename column to more accurately reflect what the column contains
df.rename(columns = {'Owned for':'Ownership'}, inplace = True)

#change to a binary column with 1 indicating owned and 0 indicating never owned
#(cast explicitly, since newer pandas versions return boolean dummy columns)
df['Ownership'] = pd.get_dummies(df['Ownership'])['Owned'].astype('int64')

#checks
print(f'Ownership column dtype: {df["Ownership"].dtype}')
print(df['Ownership'].value_counts())
#change to a binary column with 1 indicating commuting and 0 indicating leisure
df['Used it for'] = pd.get_dummies(df['Used it for'])['Commuting'].astype('int64')
df.rename(columns = {'Used it for':'Commuting'}, inplace = True)
df['Commuting']
df.fillna(0, inplace=True) # replace all missing values identified earlier with 0
df.isna().sum() == 0 # every column should show True, i.e. no missing values remain

#cast the six 5-point rating columns from float64 to int64
rating_cols = ['Visual Appeal', 'Reliability', 'Extra Features',
               'Comfort', 'Maintenance cost', 'Value for Money']
df = df.astype({col: 'int64' for col in rating_cols})
df.info()

Data Validation Summary

First we looked at a sample of the dataset to validate the criteria provided by the customer.
All columns and column names were present, but the 5-point rating columns were stored as float64 rather than int64.
Next, we checked to see if there was any missing data in the dataset. Four columns had missing data:

  • Extra Features
  • Comfort
  • Maintenance cost
  • Value for Money

The null values in these columns were replaced with 0, as specified in the data criteria.

Additionally, the 'Owned for' column showed duration of ownership of vehicle with the following categories

  • <= 6 months
  • > 6 months
  • Never Owned

For our purposes we only needed to know whether or not the reviewer owned the vehicle, so the two ownership durations were combined, leaving two categories:

  • Owned
  • Never Owned

One further change in this part of the analysis concerned the "Ownership" column, which is to be our response variable in the machine learning models. We used pd.get_dummies() and then selected the "Owned" column to give a binary variable (1 for owned, 0 for never owned) that can be used in the model fitting process.

The same transformation was applied to the "Used it for" feature, which became the "Commuting" column, where 1 means the moped was used for commuting and 0 means it was used for leisure.
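To make the encoding step concrete, here is a minimal sketch on three made-up "Owned for" values. The series `owned_for` and the intermediate names are hypothetical. The direct comparison on the last line is an alternative shown only for contrast, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical toy values for the "Owned for" column
owned_for = pd.Series(["<= 6 months", "Never Owned", "> 6 months"])

# Combine both ownership durations into the single category "Owned"
collapsed = owned_for.replace({"<= 6 months": "Owned", "> 6 months": "Owned"})

# Approach described above: dummy-encode, keep the "Owned" column, cast to int
via_dummies = pd.get_dummies(collapsed)["Owned"].astype("int64")

# Alternative with the same result: compare directly and cast to int
via_compare = (collapsed == "Owned").astype("int64")

print(via_dummies.tolist())  # [1, 0, 1]
```

Both approaches give the same 1/0 values. The explicit cast matters because, depending on the pandas version, `pd.get_dummies` may return boolean or small-integer columns.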

Each change was followed by a check for consistency in the data, to make sure that the dataset was ready for further analysis and processing.
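The kind of consistency check described above could be collected into a single helper. The sketch below is an assumption about what such checks might look like, not the notebook's actual code: `validate_cleaned`, `RATING_COLS` and the two-row frame `toy_clean` are hypothetical names and data.

```python
import pandas as pd

RATING_COLS = ["Visual Appeal", "Reliability", "Extra Features",
               "Comfort", "Maintenance cost", "Value for Money"]

def validate_cleaned(frame: pd.DataFrame) -> None:
    """Raise AssertionError if the cleaned frame violates the stated criteria."""
    # No missing values should remain after fillna(0)
    assert frame.isna().sum().sum() == 0, "missing values remain"
    # Response and usage columns must be binary
    for col in ["Ownership", "Commuting"]:
        assert set(frame[col].unique()) <= {0, 1}, f"{col} is not binary"
    # Ratings must be integers between 0 (missing) and 5
    for col in RATING_COLS:
        assert pd.api.types.is_integer_dtype(frame[col]), f"{col} is not integer"
        assert frame[col].between(0, 5).all(), f"{col} outside 0-5"

# Hypothetical two-row example in the cleaned layout
toy_clean = pd.DataFrame({
    "Commuting": [1, 0],
    "Ownership": [0, 1],
    "Model name": ["Model A", "Model B"],
    **{col: [0, 5] for col in RATING_COLS},
})
validate_cleaned(toy_clean)  # passes silently when every check holds
```

Calling `validate_cleaned(df)` after the cleaning cells would then fail loudly if any of the criteria were missed.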

Exploratory Analysis

df.describe() # overview of the numerical features of the dataset
# histograms of the six rating features (binary columns excluded)
df.drop(['Ownership', 'Commuting'], axis=1).hist(layout=(2, 3), alpha=0.8);