
Identifying popular products is incredibly important for e-commerce companies! Popular products generate more revenue and, therefore, play a key role in stock control.

You've been asked to support an online bookstore by building a model to predict whether a book will be popular or not. They've supplied you with an extensive dataset containing information about all books they've sold, including:

  • price
  • popularity (target variable)
  • review/summary
  • review/text
  • review/helpfulness
  • authors
  • categories

You'll need to build a model that predicts whether a book will be rated as popular or not.

They have high expectations and have set a target of at least 70% accuracy! You are free to use as many features as you like, and you will need to engineer new features to reach this level of performance.
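As a minimal sketch of what engineering new features might look like, the snippet below derives two simple length-based features from the text columns. Only the `review/summary` and `review/text` column names come from the dataset description above; the toy DataFrame, the helper name `add_text_features`, and the new column names are hypothetical and used purely for illustration.

```python
import pandas as pd

def add_text_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with simple length-based features (illustrative only)."""
    out = df.copy()
    # Treat missing text as empty so the length features are always defined
    summary = out['review/summary'].fillna('')
    text = out['review/text'].fillna('')
    # Number of characters in the review summary
    out['summary_n_chars'] = summary.str.len()
    # Number of whitespace-separated words in the review text
    out['text_n_words'] = text.str.split().str.len()
    return out

# Hypothetical toy data standing in for the real books dataset
toy = pd.DataFrame({
    'review/summary': ['Great read', None],
    'review/text': ['I could not put it down', 'Too long'],
})
toy_features = add_text_features(toy)
print(toy_features[['summary_n_chars', 'text_n_words']])
```

Whether features like these actually help reach the accuracy target is something to check on the real data.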

# Import some required packages
import pandas as pd

# Read in the dataset
books = pd.read_csv('data/books.csv')

# Preview the first five rows
books.head()

1 - Perform EDA

# Inspecting a DataFrame
books.info()
# Understanding distributions and frequencies
import matplotlib.pyplot as plt
import seaborn as sns
# Visualize popularity frequencies
sns.countplot(data=books, x='popularity')
plt.show()
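The count plot shows the class balance visually; it can also help to put a number on it, because the share of the most common class is the accuracy a model gets by always predicting that class, which puts the 70% target in context. The sketch below uses a hypothetical toy Series in place of `books['popularity']`, and the label values `'Popular'` and `'Unpopular'` are assumptions about how the target is encoded.

```python
import pandas as pd

# Hypothetical labels standing in for books['popularity']
toy_popularity = pd.Series(['Popular', 'Unpopular', 'Popular', 'Popular'])

# Share of each class in the data
class_shares = toy_popularity.value_counts(normalize=True)

# Accuracy obtained by always predicting the most common class
baseline_accuracy = class_shares.max()

print(class_shares)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```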
# Visualize price distribution
sns.histplot(data=books, x='price')
plt.show()
# Check frequencies
print(books['categories'].value_counts())
print(books['categories'].value_counts().values)
# Store the number of books in each category
category_counts = books['categories'].value_counts().values
# Total number of books in categories with fewer than 100 books
category_counts[category_counts < 100].sum()
# Total number of books in categories with more than 100 books
category_counts[category_counts > 100].sum()
books.groupby('categories').agg({'title': 'count'})
# Keep only categories with more than 100 books; rare categories make overfitting more likely
books = books.groupby('categories').filter(lambda x: len(x) > 100)
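The same filtering step can also be written with `value_counts` and `isin`, which avoids calling a Python lambda once per group. This is a sketch on a hypothetical toy DataFrame with a small threshold (`min_count = 2`) so the effect is easy to see; the code above uses 100 on the real data.

```python
import pandas as pd

# Hypothetical toy data: 3 Fiction, 1 Poetry, 2 History
toy = pd.DataFrame({
    'categories': ['Fiction'] * 3 + ['Poetry'] * 1 + ['History'] * 2
})
min_count = 2  # hypothetical threshold for this toy example

# Count rows per category, then keep categories strictly above the threshold
counts = toy['categories'].value_counts()
common = counts[counts > min_count].index

# Keep only rows whose category is in the retained set
filtered = toy[toy['categories'].isin(common)]
print(filtered['categories'].value_counts())
```

As in the original filter, the comparison is strict, so a category with exactly `min_count` rows is dropped.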