Project: What Makes a Good Book?
Identifying popular products is incredibly important for e-commerce companies! Popular products generate more revenue and, therefore, play a key role in stock control.
You've been asked to support an online bookstore by building a model to predict whether a book will be popular or not. They've supplied you with an extensive dataset containing information about all books they've sold, including:
price
popularity (target variable)
review/summary
review/text
review/helpfulness
authors
categories
You'll need to build a model that predicts whether a book will be rated as popular or not.
They have high expectations of you, so they have set a target of at least 70% accuracy! You are free to use as many features as you like, and you will need to engineer new features to reach this level of performance.
# Import some required packages
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
# Read in the dataset
books = pd.read_csv("data/books.csv")
# Preview the first five rows
books.head()
books.info()
# Check how many categories there are and how many books fall in each
books['categories'].value_counts()
#counting values in the popularity column
books['popularity'].value_counts()
#printing descriptive statistics of the dataframe numeric columns
books.describe()
#printing descriptive statistics of the data frame non-numeric columns
books.describe(include=[object])
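The popularity counts printed earlier are worth reading against the 70% accuracy target: a model that always predicts the most common class already scores that class's share of the data. Below is a minimal sketch of that baseline, using made-up labels rather than the real `books` data; `toy_labels` and `baseline_accuracy` are illustrative names only.

```python
import pandas as pd

# Made-up labels standing in for books['popularity'] (assumption, not real data).
toy_labels = pd.Series(["Popular"] * 6 + ["Unpopular"] * 4)

# Share of the most frequent class = accuracy of always predicting that class.
baseline_accuracy = toy_labels.value_counts(normalize=True).max()

print(f"Majority-class baseline: {baseline_accuracy:.0%}")
```

Any model built later should be judged against this number, not against 50%.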
Feature engineering
#removing quotes from the categories column
books.categories = books.categories.str.replace("'",'')
# Keep only categories with more than 100 books; rarer categories are dropped
books = books.groupby('categories').filter(lambda x:len(x)>100)
# One-hot encode the 'categories' column
categories_df = pd.get_dummies(books['categories'],prefix='cat')
categories_df
#concatenating the encoded categories_df to books dataframe
books = pd.concat([books, categories_df], axis=1)
books.drop(columns=['categories'],inplace=True)
books.head()
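To make the category steps easier to follow, here is a small sketch of the same filter-then-encode pattern on toy data. The DataFrame values and the `MIN_BOOKS` threshold are invented for illustration; the notebook itself uses a threshold of 100 on the full dataset.

```python
import pandas as pd

# Toy data standing in for the real books dataset (values are made up).
toy = pd.DataFrame({
    "categories": ["Fiction", "Fiction", "Fiction", "History", "History", "Poetry"],
    "price": [10.0, 12.5, 8.0, 20.0, 18.0, 5.0],
})

# Hypothetical threshold for the toy data; the notebook uses 100.
MIN_BOOKS = 1

# Keep only rows whose category appears more than MIN_BOOKS times.
kept = toy.groupby("categories").filter(lambda g: len(g) > MIN_BOOKS)

# One-hot encode the surviving categories and swap them in for the original column.
dummies = pd.get_dummies(kept["categories"], prefix="cat")
encoded = pd.concat([kept.drop(columns=["categories"]), dummies], axis=1)

print(list(encoded.columns))
```

Filtering before encoding keeps the number of dummy columns small, since a category that is removed never gets a `cat_` column.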
#encoding the popularity column to popular = 1 and unpopular = 0
books['popularity'] = books.popularity.str.replace('Popular','1').str.replace('Unpopular','0')
books['popularity'] = books['popularity'].astype(int)
books['popularity'].value_counts()
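The chained `str.replace` calls work only because the match is case-sensitive: "Unpopular" contains "popular" with a lowercase "p", so the first replacement leaves it untouched. A possible alternative, sketched here on made-up labels, is an explicit dictionary with `Series.map`, which does not depend on that detail. The names `label_map` and `labels` are illustrative.

```python
import pandas as pd

# Made-up labels standing in for books['popularity'] (assumption, not real data).
labels = pd.Series(["Popular", "Unpopular", "Unpopular", "Popular"])

# Explicit mapping from label text to the integer target.
label_map = {"Popular": 1, "Unpopular": 0}
encoded = labels.map(label_map)

# map() turns any label missing from the dictionary into NaN,
# so check for that before casting to int.
if encoded.isna().any():
    raise ValueError("Found a popularity label that is not in label_map")
encoded = encoded.astype(int)

print(encoded.tolist())
```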
#engineering the author column
# removing the quotes from the author column
books.authors = books.authors.str.replace("'",'')
# Split the review/helpfulness column on '/'
# The part after the '/' is the total number of reviews
books['num_reviews'] = books['review/helpfulness'].str.split('/',expand=True)[1]
# The part before the '/' is the number of helpful reviews
books['num_helpful'] = books['review/helpfulness'].str.split('/',expand=True)[0]
#converting to int type
books.num_reviews = books.num_reviews.astype(int)
books.num_helpful = books.num_helpful.astype(int)
# Fraction of helpful reviews (a value between 0 and 1)
books['perc_helpful'] = books.num_helpful/books.num_reviews
# 0/0 produces NaN, so rows with no reviews are set to 0
books['perc_helpful'] = books['perc_helpful'].fillna(0)
#removing the original column
books.drop(columns=['review/helpfulness'],inplace=True)
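The helpfulness steps can be checked on a few hand-written values. This is a minimal sketch with invented `"x/y"` strings, assuming every entry has exactly that format; it splits the column once rather than twice and shows what happens to a `0/0` row.

```python
import pandas as pd

# Invented helpfulness strings in the same "x/y" format (not real data).
toy = pd.DataFrame({"review/helpfulness": ["3/4", "0/0", "10/10", "1/5"]})

# Split once into two integer columns: part before '/' and part after '/'.
parts = toy["review/helpfulness"].str.split("/", expand=True).astype(int)
toy["num_helpful"] = parts[0]
toy["num_reviews"] = parts[1]

# 0/0 evaluates to NaN in pandas, so those rows are filled with 0.
toy["perc_helpful"] = (toy["num_helpful"] / toy["num_reviews"]).fillna(0)

# Drop the original text column now that it has been parsed.
toy = toy.drop(columns=["review/helpfulness"])

print(toy)
```

Note that `perc_helpful` is a fraction between 0 and 1, not a value out of 100.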