Project: What Makes a Good Book?
Identifying popular products is incredibly important for e-commerce companies! Popular products generate more revenue and, therefore, play a key role in stock control.
You've been asked to support an online bookstore by building a model to predict whether a book will be popular or not. They've supplied you with an extensive dataset containing information about all books they've sold, including:
- price
- popularity (target variable)
- review/summary
- review/text
- review/helpfulness
- authors
- categories
You'll need to build a model that predicts whether a book will be rated as popular or not.
They have high expectations of you, so they have set a target of at least 70% accuracy! You are free to use as many features as you like, and you will need to engineer new features to achieve this level of performance.
# Import some required packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import os
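# Read the tokenizers setting if it exists; .get() returns None instead of raising KeyError when it is unset
tokenizers_parallelism = os.environ.get("TOKENIZERS_PARALLELISM")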
# Read in the dataset
books = pd.read_csv("data/books.csv")
# Preview the first five rows
books.head()
books["categories"] = books["categories"].str.replace("'", "", regex=False).str.strip()
books["authors"] = books["authors"].str.replace("'", "", regex=False).str.strip()
books.head()
# Get number of total reviews
books["num_reviews"] = books["review/helpfulness"].str.split("/", expand=True)[1]
# Get number of helpful reviews
books["num_helpful"] = books["review/helpfulness"].str.split("/", expand=True)[0]
# Convert to integer datatype
for col in ["num_reviews", "num_helpful"]:
    books[col] = books[col].astype(int)
# Add percentage of helpful reviews as a column to normalize the data
books["pct_helpful_rev"] = books["num_helpful"] / books["num_reviews"]
# Fill null values
books["pct_helpful_rev"] = books["pct_helpful_rev"].fillna(0)
# Drop original column
books.drop(columns=["review/helpfulness", "num_helpful", "num_reviews"], inplace=True)
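To make the helpfulness feature concrete, here is a minimal sketch on a toy frame. The sample values are invented for illustration, and it assumes, as the code above does, that `review/helpfulness` holds strings of the form `helpful/total`.

```python
import pandas as pd

# Toy data (hypothetical values) in the assumed "helpful/total" format
toy = pd.DataFrame({"review/helpfulness": ["7/10", "0/0", "3/4"]})

# Split once and reuse both parts: column 0 is helpful votes, column 1 is total votes
parts = toy["review/helpfulness"].str.split("/", expand=True).astype(int)
num_helpful, num_reviews = parts[0], parts[1]

# 0/0 produces NaN, which is then treated as "no helpful votes"
toy["pct_helpful_rev"] = (num_helpful / num_reviews).fillna(0)

print(toy["pct_helpful_rev"].tolist())
```

Splitting once and keeping both columns avoids running the same string split twice, which may matter on a large review table.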
books.head(2)
books["review/summary"][0]
books["review/text"][0]
books["description"][0]
books.info()
books[books.duplicated()].shape[0]
books = books.drop_duplicates()
books.shape[0]
books["n_author"] = books["authors"].apply(lambda x: len(str(x).split(',')) if pd.notnull(x) else 0)
books[books["n_author"]>1].shape[0]
books.nunique()
books.loc[books.popularity.isna(), "popularity"] = pd.NA
books.popularity = books.popularity.map({"Unpopular":0, "Popular":1}).astype("Int64")
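One thing worth checking with this kind of mapping is that any label not present in the dictionary silently becomes missing. The sketch below uses invented labels (including a lowercase "popular", which is purely hypothetical) to show the effect.

```python
import pandas as pd

# Toy labels (hypothetical): one missing value and one unexpected spelling
labels = pd.Series(["Popular", "Unpopular", None, "popular"])

# Labels not found in the dictionary map to NaN; "Int64" is pandas' nullable
# integer dtype, so the column stays integer while holding <NA>
encoded = labels.map({"Unpopular": 0, "Popular": 1}).astype("Int64")

# Count rows that ended up without a usable target
n_missing = int(encoded.isna().sum())
print(n_missing)
```

Comparing this count before and after the mapping is a quick way to spot label variants that the dictionary does not cover.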
books.head()
# Visualize popularity frequencies
sns.countplot(data=books, x="popularity")
plt.show()
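The class frequencies also give a reference point for the 70% accuracy target: a model that always predicts the most common class already reaches an accuracy equal to that class's share. A minimal sketch with invented target values (not taken from the bookstore data):

```python
import pandas as pd

# Toy target (hypothetical values): 1 = popular, 0 = unpopular
popularity = pd.Series([1, 1, 1, 0, 0], dtype="Int64")

# Share of each class; the largest share is the accuracy of always
# predicting the majority class
class_share = popularity.value_counts(normalize=True)
baseline_accuracy = float(class_share.max())

print(round(baseline_accuracy, 2))
```

Applied to `books["popularity"]`, the same two lines would show how far above the majority-class baseline a 70% accuracy score actually is.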