
Identifying popular products is incredibly important for e-commerce companies! Popular products generate more revenue and, therefore, play a key role in stock control.

You've been asked to support an online bookstore by building a model to predict whether a book will be popular or not. They've supplied you with an extensive dataset containing information about all books they've sold, including:

  • price
  • popularity (target variable)
  • review/summary
  • review/text
  • review/helpfulness
  • authors
  • categories

You'll need to build a model that predicts whether a book will be rated as popular or not.

The bookstore has high expectations and has set a target of at least 70% accuracy. You are free to use as many features as you like, and you will need to engineer new features to reach this level of performance.
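To make the 70% target concrete, here is a minimal sketch of how accuracy is computed and how it compares with a majority-class baseline (always predicting the most frequent label). The helper names `accuracy` and `majority_baseline` and the toy labels are hypothetical, invented for illustration; they are not part of the supplied dataset or of any library.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of labels")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_label, _ = Counter(y_true).most_common(1)[0]
    return accuracy(y_true, [most_common_label] * len(y_true))


# Hypothetical labels, invented for illustration (not taken from books.csv)
toy_true = ["Popular", "Popular", "Unpopular", "Popular", "Unpopular"]
toy_pred = ["Popular", "Unpopular", "Unpopular", "Popular", "Popular"]

TARGET = 0.70
score = accuracy(toy_true, toy_pred)
baseline = majority_baseline(toy_true)
print(f"accuracy={score:.2f}, baseline={baseline:.2f}, meets target: {score >= TARGET}")
```

A model is only useful if it beats the baseline as well as the target, so it may be worth checking both numbers once the class balance of `popularity` is known.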

# Import some required packages
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import os

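# Read TOKENIZERS_PARALLELISM if it is set; .get() returns None instead of raising KeyError when it is not
tokenizers_parallelism = os.environ.get("TOKENIZERS_PARALLELISM")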

# Read in the dataset
books = pd.read_csv("data/books.csv")

# Preview the first five rows
books.head()
# Strip single quotes and surrounding whitespace from categories and authors
books["categories"] = books["categories"].str.replace("'", "", regex=False).str.strip()
books["authors"] = books["authors"].str.replace("'", "", regex=False).str.strip()
books.head()
# Split the "helpful/total" strings once: the part before the slash is the
# helpful count, the part after it is the total count
helpfulness = books["review/helpfulness"].str.split("/", expand=True)
books["num_helpful"] = helpfulness[0]
books["num_reviews"] = helpfulness[1]

# Convert to integer datatype
for col in ["num_reviews", "num_helpful"]:
    books[col] = books[col].astype(int)
    
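# Add the fraction of helpful reviews as a column to normalize the data
books["pct_helpful_rev"] = books["num_helpful"] / books["num_reviews"]

# Fill null values (0/0 yields NaN). Assign the result back: in-place fillna on a
# selected column triggers a chained-assignment warning in recent pandas versions
books["pct_helpful_rev"] = books["pct_helpful_rev"].fillna(0)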

# Drop original column
books.drop(columns=["review/helpfulness", "num_helpful", "num_reviews"], inplace=True)
books.head(2)
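The helpfulness feature above assumes every value parses cleanly as two integers. As a hedged alternative, the sketch below shows one way to make the same calculation tolerate missing values and zero denominators. The `toy` frame is hypothetical sample data in the same "helpful/total" format, not rows from `books.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the same "helpful/total" format as review/helpfulness
toy = pd.DataFrame({"review/helpfulness": ["3/4", "0/0", "10/10", None]})

# Split once into two columns: helpful count and total count
parts = toy["review/helpfulness"].str.split("/", expand=True)

# errors="coerce" turns missing or malformed pieces into NaN instead of raising
num_helpful = pd.to_numeric(parts[0], errors="coerce")
num_total = pd.to_numeric(parts[1], errors="coerce")

# Replace a zero denominator with NaN so the ratio is undefined rather than inf,
# then treat every undefined ratio as 0
toy["pct_helpful_rev"] = (num_helpful / num_total.replace(0, np.nan)).fillna(0.0)

print(toy)
```

Filling undefined ratios with 0 follows the choice made in the original cell. It treats a review with no votes the same as one with only unhelpful votes, which may or may not suit the model.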
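# Inspect the first row's review summary, review text, and description
books["review/summary"][0]
books["review/text"][0]
books["description"][0]
# Summarize column dtypes and non-null counts
books.info()
# Count fully duplicated rows, drop them, and check the remaining row count
books[books.duplicated()].shape[0]
books = books.drop_duplicates()
books.shape[0]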
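# Count authors per book by splitting on commas (0 when authors is missing)
books["n_author"] = books["authors"].apply(lambda x: len(str(x).split(",")) if pd.notnull(x) else 0)
# Number of books with more than one author
books[books["n_author"] > 1].shape[0]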
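The author count can also be computed without `apply`, using vectorized string methods. This is a minimal sketch on hypothetical data: the bracketed format of `toy_authors` is an assumption about how the `authors` column looks after the quotes are stripped. Like the comma split above, it would miscount any name that itself contains a comma.

```python
import pandas as pd

# Hypothetical author strings; the bracketed format is an assumption
toy_authors = pd.Series(["[Jane Doe]", "[Jane Doe, John Roe]", None])

# Number of authors = number of commas + 1; missing values count as 0 authors
n_author = toy_authors.str.count(",").add(1).fillna(0).astype(int)

print(n_author.tolist())
```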
books.nunique()
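# Encode the target as a nullable integer: Unpopular -> 0, Popular -> 1.
# Missing values come out of map() as NaN and become <NA> under the Int64 dtype
books["popularity"] = books["popularity"].map({"Unpopular": 0, "Popular": 1}).astype("Int64")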
books.head()
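One caveat of encoding with `map` is that any label missing from the dictionary silently becomes missing. The sketch below, on hypothetical labels (including a deliberately mis-cased one), shows a way to surface such values. `toy_labels`, `label_map`, and `unmapped` are illustrative names only.

```python
import pandas as pd

# Hypothetical labels; "popular" is deliberately mis-cased to show the pitfall
toy_labels = pd.Series(["Popular", "Unpopular", None, "popular"])
label_map = {"Unpopular": 0, "Popular": 1}

# Labels absent from the mapping (and missing labels) become <NA>
encoded = toy_labels.map(label_map).astype("Int64")

# Values that were present in the data but had no entry in the mapping
unmapped = toy_labels[encoded.isna() & toy_labels.notna()].unique().tolist()

print(encoded.tolist())
print("unmapped labels:", unmapped)
```

If `unmapped` is non-empty on the real data, the mapping or the label cleaning probably needs another look before modelling.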
# Visualize popularity frequencies
sns.countplot(data=books, x="popularity")
plt.show()