Data Cleaning: An audiobooks dataset
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Description of the dataset
A dataset of audiobooks downloaded from audible.in, with release dates ranging from 1998 to 2025 (the latest being pre-planned releases). Source
It has the following columns:
- "name"
- "author"
- "narrator"
- "time"
- "releasedate"
- "language"
- "stars"
- "price" - price in Indian Rupee.
As you will see, the data quality is rather poor: wrong data types, missing values, inconsistent spelling, and so on.
The goal of this project is to clean the data and make it ready for analysis.
# Load the dataset and have a quick look at the first few rows and the shape
df = pd.read_csv('data/audible_raw.csv')
df.head()
df.shape
# Examine the columns' data types
df.info()
# The author column starts with 'Writtenby:'.
# The narrator column has a similar problem with 'Narratedby:'. Fix both.
df['author'] = df['author'].str.replace('Writtenby:','')
df['narrator'] = df['narrator'].str.replace('Narratedby:','')
# Check the results
df.head()
# Take a look at the stars column
df['stars'].sample(n=100,random_state=1)
# 'Not rated yet' is a very frequent value. Look at a sample of this column that doesn't contain these values
df[df['stars'] != "Not rated yet"]['stars'].sample(n=100,random_state=1)
# Replace 'Not rated yet' with NaN
# replace() returns a new Series, so assign the result back to the column
df['stars'] = df['stars'].replace('Not rated yet', np.nan)
# A typical value of the stars column looks as follows:
# "5 out of 5 stars7 ratings".
# First, extract the rating and the number of ratings into new columns.
# Second, assign appropriate data types to these columns.
# Finally, drop the stars column
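df['rating_stars'] = df['stars'].str.extract(r'([0-9.]+)', expand=False).astype(float)
# A raw string avoids the invalid-escape warning for \d; 'ratings?' also matches a singular "rating"
df['n_ratings'] = df['stars'].str.extract(r'(\d+) ratings?', expand=False).astype(float)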
df = df.drop('stars',axis=1)
# Examine the new rating_stars and n_ratings columns
df[['rating_stars','n_ratings']].head(10)
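To see what the two extraction patterns pick up, here is a minimal sketch on a few made-up strings. The sample values are hypothetical and only mirror the format described above; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values that mimic the format of the stars column
sample = pd.Series([
    "5 out of 5 stars7 ratings",
    "4.5 out of 5 stars41 ratings",
    "Not rated yet",
])

# Treat the placeholder as missing before extracting any numbers
sample = sample.replace("Not rated yet", np.nan)

parsed = pd.DataFrame({
    # The first number in the string is the star rating
    "rating_stars": sample.str.extract(r"([0-9.]+)", expand=False).astype(float),
    # The digits directly before " rating(s)" are the number of ratings
    "n_ratings": sample.str.extract(r"(\d+) ratings?", expand=False).astype(float),
})
print(parsed)
```

Missing input stays missing in both new columns, which is why they are stored as floats rather than integers.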
# Explore the price column
df['price'].sample(n=100,random_state=1)
# The random sample reveals that some values have ',' sign
# What about letters?
df[df['price'].str.contains('[a-z]')] # Shows that some values are 'Free'
# Fix these problems and assign an appropriate data type
df['price'] = df['price'].str.replace(',', '', regex=False)
df['price'] = df['price'].str.replace('Free', '0', regex=False)
df['price'] = df['price'].astype(float)
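The same price cleanup can be checked on a small example before running it on the full column. The price strings below are invented for illustration.

```python
import pandas as pd

# Hypothetical price strings: plain, with a thousands separator, and 'Free'
prices = pd.Series(["468.00", "1,256.00", "Free"])

cleaned = (
    prices.str.replace(",", "", regex=False)    # drop thousands separators
          .str.replace("Free", "0", regex=False)  # free titles cost 0
          .astype(float)
)
print(cleaned.tolist())
```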
# Look at the unique values in the rating_stars column
df['rating_stars'].unique()
# Turn rating_stars to category
df['rating_stars'] = df['rating_stars'].astype('category')
# Convert releasedate to datetime
df['releasedate'] = pd.to_datetime(df['releasedate'])
# Check the time column
df['time'].sample(n=200,random_state=1)
# A typical value of the time column looks like this:
# "2 hrs and 20 mins". It is better to represent this as 140 minutes and store it as an integer.
# Also, "hrs" and "mins" may be spelled in different ways. Finally, there is a "Less than 1 minute" value.
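One possible way to do this conversion is sketched below. The helper `to_minutes` is a hypothetical name, the sample strings are made up, and the spelling variants it accepts ("hr", "hrs", "min", "mins" and their long forms) are assumptions about what the column may contain. Counting "Less than 1 minute" as 1 minute is also an assumption.

```python
import re

import pandas as pd


def to_minutes(text):
    """Convert a string such as '2 hrs and 20 mins' to whole minutes."""
    if pd.isna(text):
        return pd.NA
    text = text.strip().lower()
    if text.startswith("less than"):
        return 1  # assumption: count "Less than 1 minute" as 1 minute
    # Accept several spellings of the units (assumed variants)
    hours = re.search(r"(\d+)\s*(?:hrs?|hours?)", text)
    minutes = re.search(r"(\d+)\s*(?:mins?|minutes?)", text)
    if hours is None and minutes is None:
        return pd.NA  # unrecognized format: leave as missing
    total = 0
    if hours:
        total += 60 * int(hours.group(1))
    if minutes:
        total += int(minutes.group(1))
    return total


# Hypothetical sample values that mimic the format of the time column
sample = pd.Series([
    "2 hrs and 20 mins",
    "1 hr and 1 min",
    "45 mins",
    "3 hrs",
    "Less than 1 minute",
])

# The nullable Int64 dtype keeps integers even if some values are missing
time_minutes = sample.map(to_minutes).astype("Int64")
print(time_minutes.tolist())
```

Before applying something like this to the real column, it is worth listing the distinct unit spellings that actually occur, so that no variant silently becomes a missing value.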