
What makes a good book?

📚 1) Background

As reading trends keep changing, an ambitious online bookstore has seen a boost in popularity following an intense marketing campaign. Excited to keep the momentum going, the bookstore has kicked off a challenge among their data scientists.

They've equipped the team with a comprehensive dataset featuring book prices, reviews, author details, and categories.

The team is all set to tackle this challenge, aiming to build a model that accurately predicts book popularity, which will help the bookstore manage their stock better and tweak their marketing plans to suit what their readers love the most.

Help them get the best predictions!

You are free to use any methodologies that you like in order to produce your insights.

🔖 2) Data Summary

The bookstore holds over 7,000 unique titles in its vault, written by over 6,400 authors and spread across more than 300 categories. Fiction is the bookstore's favourite category, with over 1,100 titles, but you can also find categories such as Religion, Biography & Autobiography, and even Juvenile Fiction. Some authors are more loved than others: our readers especially love reviewing Charles Dickens, Christopher Paolini and Thomas Harris for their unmatched masterpieces. We will investigate whether the categories with a high number of reviews also have good reviews. The most important insight, however, will be which feature carries the most weight in predicting whether a book is popular.
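Counts like the ones above can be derived with a few pandas aggregations. The sketch below uses a tiny made-up frame that borrows the column names of books.csv; the rows are placeholders (and the assumption that a title can appear on more than one row is mine), so the numbers it prints are not the bookstore's figures.

```python
import pandas as pd

# Made-up rows that mimic the books.csv column names (illustrative only)
toy = pd.DataFrame({
    'title': ['A', 'A', 'B', 'C', 'D'],
    'authors': ['X', 'X', 'Y', 'Y', 'Z'],
    'categories': ['Fiction', 'Fiction', 'Fiction', 'Religion', 'Fiction'],
})

# Number of distinct titles, authors and categories
n_titles = toy['title'].nunique()
n_authors = toy['authors'].nunique()
n_categories = toy['categories'].nunique()

# Titles per category: drop duplicate titles first so each title counts once
titles_per_category = toy.drop_duplicates('title')['categories'].value_counts()

print(n_titles, n_authors, n_categories)
print(titles_per_category)
```

On the real data, the same calls would be applied to the loaded `books` frame instead of `toy`.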

📊 3) The Data

They have provided you with a single dataset to use. A summary and preview are provided below.

books.csv

Column                  Description
'title'                 Book title.
'price'                 Book price.
'review/helpfulness'    The number of helpful reviews over the total number of reviews.
'review/summary'        The summary of the review.
'review/text'           The review's full text.
'description'           The book's description.
'authors'               Author.
'categories'            Book categories.
'popularity'            Whether the book was popular or unpopular.
💪 The Challenge
  • Use your skills to find the most popular books.
  • You can use any predictive model to solve the problem of categorizing books as popular or unpopular.
  • Use the accuracy score as your metric to optimize, aiming for at least a 70% accuracy on a test set.
  • You may also wish to use feature engineering to pre-process the data.
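As a rough illustration of the accuracy target, the sketch below scores a majority-class baseline on a held-out test set. The data is synthetic, the label strings are assumed, and the names `X`, `y` and `baseline` are placeholders rather than anything from books.csv. The point is only that a useful model should beat this kind of baseline as well as the 70% threshold.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and labels (assumption: binary target)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.array(['Popular'] * 120 + ['Unpopular'] * 80)

# Hold out a stratified test set so class proportions match the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Majority-class baseline: always predicts the most frequent training label
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(f'Baseline accuracy: {baseline_accuracy:.2f}')
```

If the classes in the real data are imbalanced, a baseline like this can already sit close to the target, which is worth checking before reading too much into a raw accuracy score.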
✍️ Judging criteria

This competition is for helping to understand how competitions work. This competition will not be judged.

✅ Checklist before publishing
  • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
  • Remove redundant cells like the judging criteria, so the workbook is focused on your work.
  • Check that all the cells run without error.
⌛️ Time is ticking. Good luck!

# Install the missing packages
!pip install vaderSentiment
!pip install category_encoders
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from scipy.sparse import hstack
from category_encoders import TargetEncoder

sns.set_style('darkgrid')
sns.set_palette('deep')
3.1) Load the data
# Load in the data and preview a random sample
books = pd.read_csv('data/books.csv')
books.sample(15, random_state = 45)
3.2) Data cleaning and pre-processing
# Check for null values
books.isna().sum()
# Convert review/helpfulness ('helpful/total') into a single ratio column: review_score
books['review/helpfulness'] = books['review/helpfulness'].str.split(pat = '/')
books['review_score'] = books['review/helpfulness'].str[0].astype('int') / books['review/helpfulness'].str[1].astype('int')
# Entries with zero total votes (0/0) divide to NaN; treat them as a score of 0
books['review_score'] = books['review_score'].fillna(0).round(2)
books.drop(columns = 'review/helpfulness', inplace = True)

# Check the resulting data types (review_score is already float after the division)
print(books.dtypes)
books.head()
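To make the edge cases of the helpfulness step concrete, here is a small sketch on made-up values (not taken from books.csv). It assumes every entry has the form 'helpful/total'. A '0/0' entry produces NaN on division, which is then mapped to 0.

```python
import pandas as pd

# Toy values in the same 'helpful/total' format (illustrative, not real data)
toy = pd.DataFrame({'review/helpfulness': ['3/4', '0/0', '10/10', '1/3']})

# Split into two integer columns, then divide helpful votes by total votes
parts = toy['review/helpfulness'].str.split('/', expand=True).astype('int')
toy['review_score'] = (parts[0] / parts[1]).fillna(0).round(2)

print(toy)
```

One consequence of filling with 0 is that a review nobody voted on becomes indistinguishable from one that every voter found unhelpful. Keeping the total vote count as a separate feature would be one way to preserve that difference.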
# Rename the columns
books.rename(columns = {'review/summary': 'summary_review',
                        'review/text': 'text_review'}, 
                        inplace = True)
books.head()