
Introduction

This project delves into the literary landscape of the early 21st century by analyzing the Top Goodreads Books Collection dataset from Kaggle.

Focusing on the 14 years from 2000 through 2013, this analysis aims to uncover the genres favored by Goodreads users and to explore the patterns of reader engagement across these genres. By examining the trends in genre representation among top-rated books and the corresponding levels of ratings and votes, this project seeks to provide insights into the evolving "tastes" of the Goodreads community during this dynamic period in online book discovery and discussion.

The core analysis is conducted using Python in this Jupyter Notebook, with key findings and overarching trends visually represented through interactive dashboards and charts created in Power BI. Screenshots of these visualizations are integrated within this notebook to provide immediate insights, and a link to the full, interactive Power BI report is also provided for further exploration.

The data

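import pandas as pd

# Load the dataset; the CSV file is expected in the notebook's working directory.
goodreads = pd.read_csv("goodreads_top100.csv")

# display() is provided by IPython/Jupyter and renders the DataFrame as a table.
display(goodreads)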

The Top Goodreads Books Collection dataset from Kaggle offers a rich array of information for each book, providing a multifaceted view of literary works and reader engagement.

Key features include 'ISBN' codes for identification, the 'Title' of each book, and details on 'Series' and 'Release Number' for books belonging to a collection.

The dataset also specifies the 'Publisher', the 'Language' of the book, and the 'Author(s)'. Crucially for this project, the 'Genres' column offers insights into thematic categorization, while 'Publication Date' provides historical context.

Reader reception is captured through the 'Rating' and 'Number of Voters', indicating average sentiment and engagement volume, respectively.

Additional columns such as 'Current Readers', 'Want to Read', and 'Price' offer further context on the book's popularity and market information.

Considering the objectives of this project, which aim to explore genre trends, key authors, language diversity, and the role of price within the top Goodreads books from 2000 to 2013, certain columns are more relevant than others.

The 'URL' and 'Description' columns do not directly contribute to these analytical goals and can be excluded to streamline the dataset.

Similarly, 'ISBN', 'Series', 'Release Number', 'Publisher', 'Num Pages', and 'Format' are less pertinent to understanding the overarching trends in genre, author prominence, language popularity, and the influence of price on reader reception.

Therefore, these columns will be considered for removal to focus the analysis on 'Title', 'Publication Date', 'Genres', 'Rating', 'Number of Voters', 'Author', 'Language', and 'Price', which are crucial for addressing the project's objectives.
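The column selection described above can be sketched in pandas as follows. This is a minimal illustration on a tiny made-up frame, not the notebook's actual cleaning step (which was carried out in Excel); the column names are taken from the prose and may not match the CSV headers exactly, and the sample rows are hypothetical.

```python
import pandas as pd

# Toy stand-in for the Kaggle file; column names are assumed from the prose,
# and the rows are invented purely for illustration.
sample = pd.DataFrame({
    "Title": ["Book A", "Book B"],
    "URL": ["https://example.com/a", "https://example.com/b"],
    "Description": ["First blurb", "Second blurb"],
    "ISBN": ["111", "222"],
    "Genres": ["Fiction, Fantasy", "Romance"],
    "Rating": [4.2, 3.9],
})

# Columns the text marks as not relevant to the analysis.
columns_to_drop = [
    "URL", "Description", "ISBN", "Series",
    "Release Number", "Publisher", "Num Pages", "Format",
]

# errors="ignore" skips any listed column that is absent from the frame.
trimmed = sample.drop(columns=columns_to_drop, errors="ignore")

print(trimmed.columns.tolist())
```

Using `errors="ignore"` keeps the sketch tolerant of header differences between dataset versions; dropping it would make a misspelled column name fail loudly instead.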

Data Cleaning

To prepare the book dataset for analysis in Power BI, a two-phase cleaning process was employed. In the first phase, carried out in Microsoft Excel, unnecessary columns were removed to streamline the data.

The language column was excluded due to the overwhelming prevalence of English, and the price column was removed to address a significant number of missing values.

Additionally, the publication_date column's data type was corrected to ensure proper date formatting, and the data was filtered to include only books published between 2000 and 2013, inclusive.
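The date correction and filtering were done in Excel; an equivalent step in pandas might look like the sketch below. The `publication_date` column name comes from the text, while the `title` column and the sample rows are assumptions made only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical sample rows; only the publication_date column name comes from the text.
books = pd.DataFrame({
    "title": ["Old", "Start", "End", "New", "Bad"],
    "publication_date": [
        "1999-12-31", "2000-01-01", "2013-12-31", "2014-01-01", "unknown",
    ],
})

# Convert strings to datetimes; values that cannot be parsed become NaT.
books["publication_date"] = pd.to_datetime(
    books["publication_date"], errors="coerce"
)

# Keep books published from 2000 through 2013, inclusive.
# Rows with NaT have no year and are excluded by the comparison.
year = books["publication_date"].dt.year
in_range = books[year.between(2000, 2013)]

print(in_range["title"].tolist())
```

`Series.between` is inclusive on both ends by default, which matches the "between 2000 and 2013, inclusive" rule stated above.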

Subsequently, the dataset was loaded into Power Query within Excel for further transformation. Specifically, the genres column, which contained genre lists in a string format, was split into multiple columns based on the comma delimiter.

These newly created genre columns were then unpivoted, converting the data into a long format where each row represents a book and a single genre, thus facilitating accurate genre-based analysis in Power BI.
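The split-and-unpivot transformation was performed in Power Query; the same long format can be approximated in pandas with `str.split` and `explode`. The sketch below assumes a comma-separated `genres` string column as described above; the `title` column and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample; "genres" holds comma-separated genre lists as strings.
books = pd.DataFrame({
    "title": ["Book A", "Book B"],
    "genres": ["Fiction, Fantasy, Young Adult", "Romance"],
})

# Split each genre string into a list, then give every genre its own row.
long_format = (
    books.assign(genre=books["genres"].str.split(","))
    .explode("genre")
    .drop(columns="genres")
    .reset_index(drop=True)
)

# Remove the whitespace left over from ", " separators.
long_format["genre"] = long_format["genre"].str.strip()

print(long_format)
```

Each book now appears once per genre, so any count over this table measures genre tags rather than distinct books.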

Analysis

Overview Page of the Analysis

The primary book genres represented in Goodreads' top-rated books between 2000 and 2013 and how their prevalence changed year-over-year.

The analysis reveals that Fiction (973 books), Fantasy (661), Romance (576), Young Adult (488), and Contemporary (339) are the primary genres represented. Fiction, Fantasy, and Romance maintain a consistent presence throughout the period, each occupying a substantial portion of the top-rated selections every year, while Young Adult shows a noticeable increase in prevalence, particularly in the later years. Contemporary, in contrast, has a smaller representation than the other top genres. Although the proportion of each genre fluctuates somewhat from year to year, these general trends highlight the evolving genre landscape within highly rated books during this timeframe.
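The year-over-year view was built in Power BI; as a rough cross-check, the same kind of summary could be computed in pandas from a long-format table. The sketch below uses a made-up table with assumed `year` and `genre` columns, so its numbers are illustrative only and do not reproduce the counts reported above.

```python
import pandas as pd

# Made-up long-format data: one row per (book, genre) pair.
long_format = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001],
    "genre": ["Fiction", "Fantasy", "Fiction", "Fiction", "Young Adult"],
})

# Count genre tags per year, with one column per genre.
counts = long_format.groupby(["year", "genre"]).size().unstack(fill_value=0)

# Convert counts to each genre's share of that year's tags.
share = counts.div(counts.sum(axis=1), axis=0)

print(counts)
print(share.round(2))
```

Because a book with several genres contributes several rows, these shares describe the mix of genre tags in a year, not the fraction of books.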

The most frequently appearing authors in the top-rated books and the genres they are most associated with.