
What makes a good book?

📚 Background

As reading trends keep changing, an ambitious online bookstore has seen a boost in popularity following an intense marketing campaign. Excited to keep the momentum going, the bookstore has kicked off a challenge among their data scientists.

They've equipped the team with a comprehensive dataset featuring book prices, reviews, author details, and categories.

The team is all set to tackle this challenge, aiming to build a model that accurately predicts book popularity, which will help the bookstore manage their stock better and tweak their marketing plans to suit what their readers love the most.

So, we are here to help them make predictions.

📊 The Data

The data is provided as a single dataset. A summary and preview are given below.

books.csv

Column | Description
'title' | Book title.
'price' | Book price.
'review/helpfulness' | The number of helpful reviews over the total number of reviews.
'review/summary' | The summary of the review.
'review/text' | The review's full text.
'description' | The book's description.
'authors' | Author.
'categories' | Book categories.
'popularity' | Whether the book was popular or unpopular.

Solution Approach Explanation

To help the online bookstore keep its momentum, we will build an NLP pipeline that combines sentiment modeling with classifiers such as Random Forest and Logistic Regression to predict book popularity.
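
As a rough preview of that pipeline, here is a minimal sketch (not the final model: it only scores each review summary with VADER's compound polarity and fits a plain Logistic Regression on that single feature):

import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Sketch only: assumes the vader_lexicon resource is downloaded (section 2 below).
books = pd.read_csv("data/books.csv").dropna(subset=["review/summary", "popularity"])
sia = SentimentIntensityAnalyzer()
books["sentiment"] = books["review/summary"].map(lambda t: sia.polarity_scores(str(t))["compound"])
y = LabelEncoder().fit_transform(books["popularity"])
clf = LogisticRegression().fit(books[["sentiment"]], y)
print(clf.score(books[["sentiment"]], y))  # in-sample accuracy, for illustration only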

1. Importing Necessary Libraries

from warnings import filterwarnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
from nltk.sentiment import SentimentIntensityAnalyzer   # VADER sentiment scoring
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV, cross_validate
from sklearn.preprocessing import LabelEncoder
from textblob import Word, TextBlob                      # tokenization and lemmatization
from wordcloud import WordCloud

filterwarnings("ignore")  # suppress library warnings for cleaner output

books = pd.read_csv("data/books.csv")

books.head(5)                    # preview the first rows
books["popularity"].nunique()    # number of distinct target classes

2. Downloading Necessary NLTK Resources

import nltk

nltk.download('stopwords')      # stopword lists for the filtering step below
nltk.download('punkt')          # tokenizer models
nltk.download('wordnet')        # lexical database used for lemmatization
nltk.download('vader_lexicon')  # lexicon behind SentimentIntensityAnalyzer

3. Text Data Preprocessing

Normalize the text into a standard form by case folding, then strip punctuation and digits with regular expressions.

from nltk.corpus import stopwords

books["review/summary"] = books["review/summary"] = books["review/summary"].str.lower()
books["review/summary"] = books["review/summary"].str.replace("[^\w\s]", "")
books["review/summary"] = books["review/summary"].str.replace("[\d]", "")

sw = set(stopwords.words("english"))  # set for fast membership checks

# Drop English stopwords from each summary
books["review/summary"] = books["review/summary"].apply(lambda x: " ".join(word for word in str(x).split() if word not in sw))

3.1 Removing Rare Words

temp_df = pd.Series(" ".join(books["review/summary"]).split()).value_counts()

drops = temp_df[temp_df <= 10].index  # words appearing 10 times or fewer; threshold can differ

books["review/summary"] = books["review/summary"].apply(lambda x: " ".join(word for word in str(x).split() if word not in drops))

3.2 Tokenization

books["review/summary"].apply(lambda x: TextBlob(x).words).head()

3.3 Lemmatization
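
A minimal sketch of this step, using TextBlob's Word.lemmatize (backed by the wordnet resource downloaded in section 2):

books["review/summary"] = books["review/summary"].apply(
    lambda x: " ".join(Word(word).lemmatize() for word in str(x).split())
)
books["review/summary"].head()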
