What makes a good book?
📚 Background
As reading trends keep changing, an ambitious online bookstore has seen a boost in popularity following an intense marketing campaign. Excited to keep the momentum going, the bookstore has kicked off a challenge among their data scientists.
They've equipped the team with a comprehensive dataset featuring book prices, reviews, author details, and categories.
The team is all set to tackle this challenge, aiming to build a model that accurately predicts book popularity, which will help the bookstore manage their stock better and tweak their marketing plans to suit what their readers love the most.
Help them get the best predictions!
You are free to use any methodologies that you like in order to produce your insights.
📊 The Data
They have provided you with a single dataset to use. A summary and preview are provided below.
books.csv
| Column | Description |
|---|---|
| 'title' | Book title. |
| 'price' | Book price. |
| 'review/helpfulness' | The number of helpful reviews over the total number of reviews. |
| 'review/summary' | The summary of the review. |
| 'review/text' | The review's full text. |
| 'description' | The book's description. |
| 'authors' | Author. |
| 'categories' | Book categories. |
| 'popularity' | Whether the book was popular or unpopular. |
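
The `review/helpfulness` column is described as helpful reviews over total reviews, which suggests it could be turned into a numeric ratio. The following is a minimal sketch, assuming the values are strings of the form `helpful/total` (the exact format is not confirmed by the table above); `parse_helpfulness` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def parse_helpfulness(value):
    """Turn a 'helpful/total' string into a ratio; return 0.0 when unusable."""
    try:
        helpful, total = str(value).split("/")
        helpful, total = float(helpful), float(total)
    except ValueError:
        # Covers empty strings, NaN, and anything not shaped like 'a/b'
        return 0.0
    return helpful / total if total > 0 else 0.0

# Toy values only; the real column format is an assumption
sample = pd.Series(["3/4", "0/0", "", "10/10"])
helpfulness_ratio = sample.apply(parse_helpfulness)
print(helpfulness_ratio.tolist())
```

Applied to the real data, the result could be stored in a new numeric column and scaled alongside `price`, if the format assumption holds.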
💪 The Challenge
- Use your skills to find the most popular books.
- You can use any predictive model to solve the problem of categorizing books as popular or unpopular.
- Use the accuracy score as your metric to optimize, aiming for at least 70% accuracy on a test set.
- You may also wish to use feature engineering to pre-process the data.
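
A 70% accuracy target is easier to interpret next to a trivial baseline: if one class dominates, always predicting it may already score close to the target. The sketch below uses made-up labels with a 3:1 imbalance (not taken from `books.csv`) to show how such a majority-class baseline could be measured with scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels with a 3:1 class imbalance (illustrative, not from books.csv)
y_toy = np.array([1, 1, 1, 0] * 25)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

A real model is only informative to the extent that it beats this number on the same split.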
# Import some required packages
import pandas as pd
# Read in the dataset
books_df = pd.read_csv("data/books.csv")
# Preview the first five rows
books_df.head()

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# Preprocessing and Feature Engineering
# Handling missing values
books_df.fillna('', inplace=True)
# Convert 'price' to numeric
books_df['price'] = pd.to_numeric(books_df['price'], errors='coerce').fillna(0)
# Extracting target variable and encoding it
le = LabelEncoder()
books_df['popularity'] = le.fit_transform(books_df['popularity'])
# Features and target variable
X = books_df.drop(['popularity'], axis=1)
y = books_df['popularity']
# Text columns, each vectorized with its own TF-IDF transformer below
text_columns = ['title', 'review/summary', 'review/text', 'description']
# Categorical columns to be one-hot encoded
categorical_columns = ['authors', 'categories']
# Preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['price']),
        # A single column name (not a list) hands 1-D text to TfidfVectorizer
        ('title', TfidfVectorizer(max_features=1000), 'title'),
        ('review_summary', TfidfVectorizer(max_features=1000), 'review/summary'),
        ('review_text', TfidfVectorizer(max_features=1000), 'review/text'),
        ('description', TfidfVectorizer(max_features=1000), 'description'),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)
    ],
    # Unlisted columns, including 'review/helpfulness', are discarded
    remainder='drop'
)
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Building the pipeline
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
# Training the model
model.fit(X_train, y_train)
# Predictions and Evaluation
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
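accuracy

Most popular categories
# Group by 'categories' and sum the encoded 'popularity' for each category.
# LabelEncoder assigns codes in sorted label order, so check le.classes_ to
# confirm that 1 means popular; otherwise this sum counts the other class.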
category_popularity = books_df.groupby('categories')['popularity'].sum()
# Sort the categories by popularity in descending order
most_popular_categories = category_popularity.sort_values(ascending=False)
# Display the top 10 most popular categories
most_popular_categories.head(10)

Most popular authors
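
Both the category ranking above and the author ranking that follows sum the label-encoded `popularity` column, so their meaning depends on which integer the popular class received. `LabelEncoder` assigns codes in sorted label order. The sketch below uses a toy frame and assumes the label strings are `Popular` and `Unpopular` (the actual strings in `books.csv` are not shown in this notebook) to illustrate looking up the code rather than assuming it is 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; the label strings 'Popular' / 'Unpopular' are assumed
toy = pd.DataFrame({
    "categories": ["Fiction", "Fiction", "History", "History", "History"],
    "popularity": ["Popular", "Popular", "Unpopular", "Popular", "Unpopular"],
})

le_toy = LabelEncoder()
encoded = le_toy.fit_transform(toy["popularity"])

# Show which integer each label was mapped to
print(dict(zip(le_toy.classes_, le_toy.transform(le_toy.classes_))))

# Look up the code for the popular class instead of assuming it is 1
popular_code = le_toy.transform(["Popular"])[0]

# Count popular books per category using an explicit boolean flag
popular_counts = (
    toy.assign(is_popular=(encoded == popular_code))
       .groupby("categories")["is_popular"]
       .sum()
       .sort_values(ascending=False)
)
print(popular_counts)
```

With these assumed labels, the popular class is encoded as 0, so summing the raw encoded column would count unpopular books instead.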
# Group by 'authors' and sum the 'popularity' for each author
author_popularity = books_df.groupby('authors')['popularity'].sum()
# Sort the authors by popularity in descending order
most_popular_authors = author_popularity.sort_values(ascending=False)
# Display the most popular authors
most_popular_authors

Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
# Define the parameter grid for the Random Forest
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}
# Initialize GridSearchCV over the full pipeline (preprocessing + Random Forest)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2, scoring='accuracy')
# Fit GridSearchCV
grid_search.fit(X_train, y_train)
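# Best parameters and mean cross-validated accuracy from GridSearchCV
# (best_score_ is measured on the training folds, not on X_test)
best_params = grid_search.best_params_
best_accuracy = grid_search.best_score_
print("Best Parameters:", best_params)
print("Best CV Accuracy:", best_accuracy)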
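
The score reported by the grid search is a cross-validated average over the training data, so it may differ from accuracy on the held-out test set. The sketch below, on synthetic data with a much smaller grid to keep it fast, first counts how many fits the grid above implies and then scores the refit `best_estimator_` on a test split. The names `full_grid`, `small_grid`, `X_demo`, and `y_demo` are illustrative, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, ParameterGrid, train_test_split

# Same grid shape as the search above, without the pipeline prefix
full_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
n_candidates = len(ParameterGrid(full_grid))
print(f"{n_candidates} candidates x 3 folds = {n_candidates * 3} fits")

# Synthetic data and a much smaller grid keep this sketch fast
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

small_grid = {"n_estimators": [10, 20], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), small_grid, cv=3, scoring="accuracy"
)
search.fit(X_tr, y_tr)

# best_estimator_ is refit on all training data (refit=True is the default)
test_accuracy = accuracy_score(y_te, search.best_estimator_.predict(X_te))
print("Cross-validated accuracy:", search.best_score_)
print("Held-out test accuracy:", test_accuracy)
```

In the notebook itself, the equivalent check would be to call `grid_search.best_estimator_.predict(X_test)` and compare against `y_test`, which is the number the 70% target refers to.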