What makes a good book?
📚 Background
As reading trends keep changing, an ambitious online bookstore has seen a boost in popularity following an intense marketing campaign. Excited to keep the momentum going, the bookstore has kicked off a challenge among their data scientists.
They've equipped the team with a comprehensive dataset featuring book prices, reviews, author details, and categories.
The team is all set to tackle this challenge, aiming to build a model that accurately predicts book popularity, which will help the bookstore manage their stock better and tweak their marketing plans to suit what their readers love the most.
So, we are here to help them make predictions.
📊 The Data
The data is provided as a single dataset. A summary and preview are given below.
books.csv
| Column | Description |
|---|---|
| 'title' | Book title. |
| 'price' | Book price. |
| 'review/helpfulness' | The number of helpful reviews over the total number of reviews. |
| 'review/summary' | The summary of the review. |
| 'review/text' | The review's full text. |
| 'description' | The book's description. |
| 'authors' | Author. |
| 'categories' | Book categories. |
| 'popularity' | Whether the book was popular or unpopular. |
Solution Approach Explanation
To predict which books the online bookstore should stock to keep its momentum going, we will use an NLP approach that combines sentiment modeling of the review text with classifiers such as random forest and logistic regression.
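As a rough preview of where the notebook is heading, here is a minimal sketch of that idea: score each review with VADER sentiment and fit a logistic regression on the result. The feature set and the label value "Popular" are assumptions for illustration, not the final pipeline.

import nltk
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

nltk.download("vader_lexicon", quiet=True)  # lexicon required by VADER

books = pd.read_csv("data/books.csv")

# Compound score in [-1, 1] summarizing the polarity of each review.
sia = SentimentIntensityAnalyzer()
books["sentiment"] = books["review/text"].astype(str).apply(
    lambda t: sia.polarity_scores(t)["compound"]
)

# Illustrative feature set; "Popular" is an assumed label value.
X = books[["sentiment", "price"]].fillna(0)
y = (books["popularity"] == "Popular").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression().fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")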
1. Importing Necessary Libraries
from warnings import filterwarnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer
from PIL import Image
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, cross_validate
from sklearn.preprocessing import LabelEncoder
from textblob import TextBlob, Word
from wordcloud import WordCloud

filterwarnings("ignore")  # keep the notebook output free of library warnings

# Load the dataset and preview the first rows.
books = pd.read_csv("data/books.csv")
books.head(5)

# Number of distinct popularity labels (expected: 2, popular/unpopular).
books["popularity"].nunique()
2. Downloading Necessary Packages
import nltk
nltk.download('stopwords')      # stopword lists used in preprocessing
nltk.download('punkt')          # tokenizer models
nltk.download('wordnet')        # lexical database used for lemmatization
nltk.download('vader_lexicon')  # lexicon for the VADER sentiment analyzer
3. Text Data Preprocessing
Normalizing the text into a standard form: case folding, then removing punctuation and numbers with regex, and filtering out English stopwords.
from nltk.corpus import stopwords

# Case folding, then strip punctuation and digits via regex.
books["review/summary"] = books["review/summary"].str.lower()
books["review/summary"] = books["review/summary"].str.replace(r"[^\w\s]", "", regex=True)
books["review/summary"] = books["review/summary"].str.replace(r"\d", "", regex=True)

# Filter out English stopwords.
sw = set(stopwords.words("english"))
books["review/summary"] = books["review/summary"].apply(
    lambda x: " ".join(word for word in str(x).split() if word not in sw)
)
3.1 Deleting Rare Words
# Count word frequencies across all summaries and drop words that
# appear 10 times or fewer (threshold can differ).
temp_df = pd.Series(" ".join(books["review/summary"]).split()).value_counts()
drops = set(temp_df[temp_df <= 10].index)
books["review/summary"] = books["review/summary"].apply(
    lambda x: " ".join(word for word in str(x).split() if word not in drops)
)
3.2 Tokenization
books["review/summary"].apply(lambda x: TextBlob(x).words).head()
3.3 Lemmatization
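Reducing each word to its dictionary base form. A minimal sketch using TextBlob's Word.lemmatize (which relies on the 'wordnet' resource downloaded above), applied in place to the summaries like the earlier steps:

# Lemmatize each word in the summaries, e.g. "books" -> "book".
books["review/summary"] = books["review/summary"].apply(
    lambda x: " ".join(Word(word).lemmatize() for word in str(x).split())
)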