Project: What Makes a Good Book?
Identifying popular products is incredibly important for e-commerce companies! Popular products generate more revenue and, therefore, play a key role in stock control.
You've been asked to support an online bookstore by building a model to predict whether a book will be popular or not. They've supplied you with an extensive dataset containing information about all books they've sold, including:
price
popularity (target variable)
review/summary
review/text
review/helpfulness
authors
categories
You'll need to build a model that predicts whether a book will be rated as popular or not.
They have high expectations of you, so they have set a target of at least 70% accuracy! You are free to use as many features as you like, and you will need to engineer new features to achieve this level of performance.
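As a rough illustration of what engineering new features could look like, the sketch below derives numeric columns from two of the text columns listed above. The toy rows are made up, and the `helpful/total` string format assumed for `review/helpfulness` is a guess that should be checked against the real data. The new column names (`helpful_votes`, `total_votes`, `helpful_ratio`, `summary_words`) are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows. The column names come from the project brief, but the
# values and the "helpful/total" helpfulness format are assumptions.
toy = pd.DataFrame({
    "review/summary": ["Great read", "Not for me", "A classic worth revisiting"],
    "review/helpfulness": ["3/4", "0/0", "10/20"],
})

# Split "helpful/total" into two integer columns.
votes = toy["review/helpfulness"].str.split("/", expand=True).astype(int)
toy["helpful_votes"] = votes[0]
toy["total_votes"] = votes[1]

# Share of helpful votes; reviews with no votes (0/0) get 0.0 instead of NaN.
toy["helpful_ratio"] = (toy["helpful_votes"] / toy["total_votes"]).fillna(0.0)

# Simple text-length feature: number of words in the review summary.
toy["summary_words"] = toy["review/summary"].str.split().str.len()

print(toy[["helpful_ratio", "summary_words"]])
```

Whether features like these actually help reach the 70% target can only be judged by fitting a model on the real dataset.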
# Import some required packages
import pandas as pd
# Read in the dataset
books = pd.read_csv('data/books.csv')
# Preview the first five rows
books.head()

1 - Perform EDA
# Inspecting a DataFrame
books.info()

# Understanding distributions and frequencies
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize popularity frequencies
sns.countplot(data=books, x='popularity')
plt.show()

# Visualize price distribution
sns.histplot(data=books, x='price')
plt.show()

# Check frequencies
print(books['categories'].value_counts())

print(books['categories'].value_counts().values)

less = books['categories'].value_counts().values

# Total number of books that fall in categories with fewer than 100 books
less[less < 100].sum()

# Total number of books that fall in categories with more than 100 books
less[less > 100].sum()

books.groupby('categories').agg({'title': 'count'})

# Filter out rare categories to avoid overfitting
books = books.groupby('categories').filter(lambda x: len(x) > 100)
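The category counts and the rare-category filter can be checked on a small example. The sketch below uses a made-up DataFrame and a threshold of 3 in place of 100. It shows the difference between summing the selected counts (number of books) and summing the boolean mask (number of categories), then compares the `groupby(...).filter(...)` pattern with a mask built from `value_counts()`. The names `toy`, `min_count`, `kept_filter` and `kept_mask` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: three categories with 4, 2, and 1 books.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e", "f", "g"],
    "categories": ["Fiction", "Fiction", "Fiction", "Fiction",
                   "History", "History", "Poetry"],
})

min_count = 3  # stand-in for the notebook's threshold of 100

counts = toy["categories"].value_counts()

# Summing the selected counts gives the number of BOOKS in rare categories...
books_in_rare = counts[counts < min_count].sum()
# ...while summing the boolean mask gives the number of rare CATEGORIES.
rare_categories = (counts < min_count).sum()

# Same pattern as the notebook: keep groups with more than min_count rows.
kept_filter = toy.groupby("categories").filter(lambda x: len(x) > min_count)

# Alternative without a Python lambda: map each row to its category count.
kept_mask = toy[toy["categories"].map(counts) > min_count]

print(books_in_rare, rare_categories)
print(kept_filter["categories"].unique())
```

Because the comparison is strict, a category with exactly `min_count` rows is dropped. Use `>=` if such categories should be kept.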