What makes a good book?
Table of Contents
Each entry follows the format: section title - brief description.
- Problem Statement & Outcome - identifying the project goals and establishing success parameters
- Library & Data Imports - preparing the tools used for the analysis
- Preliminary Data Cleaning - performing minimal data cleaning to prepare for model building
- Baseline Model Building - building a quick ML model on the dataset prepared in the previous step to get a first read on our goal
- Comprehensive Preprocessing & EDA - cleaning of each column thoroughly, adding new features, identifying feature patterns, and understanding our data more deeply
- Finalize Dataset for Model Building - merging the cleaned dataframes into the data used to train our final model; two dataframes are created: one with one row per unique book and another with one row per book review
- Model Training & Evaluation - evaluating model performance with a cleaned dataset to find the model that best achieves our goal
- Model Deployment - testing our best model against a test set sample
- Which Features Make a Book Popular? - answering the fundamental question of this project
Problem Statement & Outcome
Goal
We work as data scientists for an online bookstore. The company would like to predict whether a book in its inventory will be popular or unpopular. Our model needs to achieve at least 70% accuracy on a cross-validated test set.
We define accuracy as the following:
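$$\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN are the true positive, true negative, false positive, and false negative counts; this is the standard definition of classification accuracy.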
Tools
We have been provided a single dataset comprising nine columns, including the target feature ("popularity").
| Column | Description |
|---|---|
| 'title' | Book title |
| 'price' | Book price |
| 'review/helpfulness' | The number of helpful reviews over the total number of reviews |
| 'review/summary' | The summary of the review |
| 'review/text' | The review's full text |
| 'description' | The book's description |
| 'authors' | Book author(s) |
| 'categories' | Book categories |
| 'popularity' | Whether the book was popular or unpopular (target) |
Outcome
A good prediction model will help the online bookstore better manage its inventory and tailor its marketing strategies to its clients' interests. As data scientists at the firm, delivering a quality model is also in our own interest: building such models is presumably part of our job description, and a strong deliverable may lead to better job security or an internal promotion. In any case, we were hired to deliver best-in-class machine learning, so let's achieve that at minimum.
Success Parameters
I define these to be:
- Deliver a model that achieves greater than 70% accuracy on the test set during model validation. Ideally, this test set accuracy is as high as possible. I will use the XGBoost Classifier as my baseline machine learning algorithm (a minimal sketch of this check follows this list).
- Answer the question at the heart of the project: "What makes a good book?" Here I treat 'popular' as synonymous with good, so the question becomes which features most account for a book's classification as 'popular.'
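As a minimal sketch of the first success parameter, assuming XGBoost's scikit-learn interface; the feature matrix and target below are random stand-ins for the dataset we have not yet prepared:
import numpy as np
from sklearn.model_selection import cross_validate
from xgboost import XGBClassifier

# Random stand-ins for the prepared feature matrix and binary "popularity" target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)

# Five-fold cross-validated accuracy; the success criterion is a mean above 0.70
model = XGBClassifier(n_estimators=50, random_state=42)
scores = cross_validate(model, X, y, cv=5, scoring="accuracy")
print(scores["test_score"].mean())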
Library & Data Imports
# Essential imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Text processing: string constants and regular expressions
import string
import re
# sklearn
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, RobustScaler, FunctionTransformer
from sklearn.decomposition import LatentDirichletAllocation, PCA
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline, FeatureUnion
from sklearn.compose import make_column_transformer, make_column_selector, ColumnTransformer
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_recall_curve, roc_curve
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
#from wordcloud import WordCloud
from collections import Counter
from textblob import TextBlob
# Custom libraries
import maps as m
import textcleaner as tc
import authorcleaner as ac
import helpers as helper
# Custom transformer library
import customtransformers as ct
import warnings
warnings.filterwarnings("ignore")
sns.set_style("whitegrid")
# IPython magics: auto-reload custom modules when they are edited
%load_ext autoreload
%autoreload 2
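With the libraries loaded, we read in the provided dataset (the file name below is a hypothetical stand-in):
df = pd.read_csv("books.csv")  # hypothetical file name for the provided dataset
df.head()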
A simple inspection of the dataset reveals that the column content appears to be formatted correctly, with the exception of one column: review/helpfulness. That column should be parsed into two distinct numeric columns. Other columns, such as authors and the text columns (review/summary, review/text, and description), will need further parsing and cleaning at a later stage.
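As a sketch of that parse, assuming the column stores strings like "7/10" (helpful votes over total votes) and with hypothetical names for the two new columns:
import pandas as pd

# Toy frame standing in for the real data; the "x/y" string format is an assumption
df = pd.DataFrame({"review/helpfulness": ["7/10", "0/2", "15/15"]})

# Split "helpful/total" into two numeric columns (hypothetical names)
df[["review_helpful", "review_total"]] = (
    df["review/helpfulness"].str.split("/", expand=True).astype(int)
)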
Preliminary Data Cleaning
Drop Duplicates
Of the 15,719 rows in the original dataframe, 3,294 are duplicates. This is a problem: duplicate rows can lead to overfitting, where the model learns the duplicated examples too well and performs poorly on new, unseen data. The concern is that the model might give undue weight to these duplicates, especially when they are numerous relative to the unique examples.
Moreover, duplicates skew the distribution of classes or values, particularly when they are unevenly distributed. For classification problems, this can result in biased predictions, especially if certain classes are overrepresented because of duplicates.
After dropping the duplicate entries, the dataset contains 12,425 rows.
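A minimal sketch of this step, using a toy frame in place of the real 15,719-row dataset:
import pandas as pd

# Toy frame with one exact-duplicate row
df = pd.DataFrame({"title": ["A", "B", "B"], "popularity": [1, 0, 0]})

print(len(df), df.duplicated().sum())  # 3 rows, 1 duplicate
df = df.drop_duplicates()
print(len(df))  # 2 unique rows remain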