Project: What Makes a Good Book?

Identifying popular products is incredibly important for e-commerce companies! Popular products generate more revenue and, therefore, play a key role in stock control.

You've been asked to support an online bookstore by building a model to predict whether a book will be popular or not. They've supplied you with an extensive dataset containing information about all books they've sold, including:

price
popularity (target variable)
review/summary
review/text
review/helpfulness
authors
categories

You'll need to build a model that predicts whether a book will be rated as popular or not.

They have high expectations of you, so they have set a target of at least 70% accuracy! You are free to use as many features as you like, and you will need to engineer new features to reach this level of performance.
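As a rough illustration of what an engineered text feature could look like, the sketch below turns a few made-up review summaries into word counts with scikit-learn's `CountVectorizer`. The sample strings and the names `toy_summaries`, `word_counts` and `dense_counts` are assumptions for illustration only; they are not taken from the bookstore dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical review summaries standing in for the review/summary column
toy_summaries = [
    "great book loved it",
    "boring book",
    "great story great characters",
]

# Learn the vocabulary and count how often each word appears in each summary
vectorizer = CountVectorizer()
word_counts = vectorizer.fit_transform(toy_summaries)  # sparse matrix: rows = summaries, columns = words

# Convert to a dense array only because the toy example is tiny
dense_counts = word_counts.toarray()

# vocabulary_ maps each word to its column index
great_column = vectorizer.vocabulary_["great"]
print(dense_counts[:, great_column])
```

Each column of the resulting matrix could be used as a numeric feature alongside columns such as price.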

# Import some required packages
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Read in the dataset
books = pd.read_csv("data/books.csv")

# Preview the first five rows
books.head()

books.info()

# Check how many categories there are and how many books fall in each
books['categories'].value_counts()

#counting values in the popularity column
books['popularity'].value_counts()
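The class counts also indicate how hard the 70% target is: always predicting the most common class already scores that class's share of the data. A minimal sketch on made-up labels (the `toy_popularity` values are assumptions, not the real distribution):

```python
import pandas as pd

# Made-up labels standing in for books['popularity']
toy_popularity = pd.Series(
    ["Unpopular", "Popular", "Unpopular", "Unpopular", "Popular"]
)

# Share of each class; normalize=True returns proportions instead of counts
class_share = toy_popularity.value_counts(normalize=True)

# Accuracy of a model that always predicts the most frequent class
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()
print(majority_class, baseline_accuracy)
```

A trained model is only useful to the extent that it beats this majority-class baseline.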

#printing descriptive statistics of the dataframe numeric columns
books.describe()

#printing descriptive statistics of the data frame non-numeric columns
books.describe(include=[object])

Feature engineering

# Remove single quotes from the categories column
books['categories'] = books['categories'].str.replace("'", '')
# Keep only categories with more than 100 books
books = books.groupby('categories').filter(lambda x: len(x) > 100)
# One-hot encode the 'categories' column
categories_df = pd.get_dummies(books['categories'], prefix='cat')
categories_df
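To see what the filter and the encoding do, here is a small sketch on a toy frame. The data and the threshold `min_books = 1` are assumptions chosen so the effect is visible on six rows; the notebook itself uses 100.

```python
import pandas as pd

# Toy data: three Fiction books, two History books, one Poetry book
toy = pd.DataFrame({
    "categories": ["Fiction", "Fiction", "Fiction", "History", "History", "Poetry"],
    "price": [10.0, 12.5, 8.0, 20.0, 15.0, 9.0],
})

min_books = 1  # hypothetical threshold for the toy example

# Keep only rows whose category has more than min_books books
kept = toy.groupby("categories").filter(lambda g: len(g) > min_books)

# One indicator column per remaining category, prefixed with "cat_"
dummies = pd.get_dummies(kept["categories"], prefix="cat")
print(dummies.columns.tolist())
```

Poetry has a single book, so its row is dropped before encoding and no `cat_Poetry` column is created.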

#concatenating the encoded categories_df to books dataframe
books = pd.concat([books, categories_df], axis=1)
books.drop(columns=['categories'],inplace=True)
books.head()

# Encode the popularity column: Popular = 1, Unpopular = 0
# An explicit mapping avoids relying on case-sensitive substring replacement
books['popularity'] = books['popularity'].map({'Popular': 1, 'Unpopular': 0})
books['popularity'] = books['popularity'].astype(int)
books['popularity'].value_counts()

# Engineering the authors column
# Remove single quotes from the authors column
books['authors'] = books['authors'].str.replace("'", '')

#using

# Split the review/helpfulness column on '/'
# Denominator: total number of reviews
books['num_reviews'] = books['review/helpfulness'].str.split('/', expand=True)[1]

# Numerator: number of helpful reviews
books['num_helpful'] = books['review/helpfulness'].str.split('/', expand=True)[0]

# Convert both counts to int
books['num_reviews'] = books['num_reviews'].astype(int)
books['num_helpful'] = books['num_helpful'].astype(int)

# Fraction of helpful reviews (a ratio, not a percentage)
books['perc_helpful'] = books['num_helpful'] / books['num_reviews']

# Fill null values (e.g. from 0/0) with 0; assign back instead of using inplace on a column selection
books['perc_helpful'] = books['perc_helpful'].fillna(0)

#removing the original column
books.drop(columns=['review/helpfulness'],inplace=True)
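The helpfulness steps above can be checked on a toy frame. This sketch splits the string once and reuses both parts; the sample values are assumptions, including a "0/0" row to show where the null values come from.

```python
import pandas as pd

# Toy values in the same "helpful/total" format as review/helpfulness
toy = pd.DataFrame({"review/helpfulness": ["3/4", "0/0", "10/12"]})

# Split once: column 0 is the numerator, column 1 the denominator
parts = toy["review/helpfulness"].str.split("/", expand=True).astype(int)
toy["num_helpful"] = parts[0]
toy["num_reviews"] = parts[1]

# 0/0 produces NaN, which is then replaced by 0
toy["perc_helpful"] = (toy["num_helpful"] / toy["num_reviews"]).fillna(0)
print(toy)
```

Splitting once avoids parsing the same column twice, which matters more on a large dataset than on this toy example.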

