Data Cleaning: An audiobooks dataset

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Description of the dataset

A dataset of audiobooks downloaded from audible.in, with release dates ranging from 1998 to 2025 (including pre-planned releases). Source

It has the following columns:

"name"
"author"
"narrator"
"time"
"releasedate"
"language"
"stars"
"price" - price in Indian Rupee.

As you will see, the data are of rather poor quality: wrong data types, missing values, inconsistent spelling, and so on.

The goal of this project is to clean the data and make it ready for analysis.

# Load the dataset and have a quick look at the first few rows and the shape
df = pd.read_csv('data/audible_raw.csv')
df.head()
df.shape

# Examine the columns' data types
df.info()

# The author column starts with 'Writtenby:'.
# The narrator column has a similar problem ('Narratedby:'). Fix both.
df['author'] = df['author'].str.replace('Writtenby:','')
df['narrator'] = df['narrator'].str.replace('Narratedby:','')
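
The prefix removal above can be checked on a few toy values. This is a minimal sketch: the names in the Series are made up for illustration and are not taken from the dataset. Passing `regex=False` makes pandas treat the pattern as a literal string, which is all that is needed here.

```python
import pandas as pd

# Hypothetical values that only mimic the prefix pattern described above
authors = pd.Series(['Writtenby:Jane Doe', 'Writtenby:John Smith'])

# regex=False: match the prefix literally rather than as a regular expression
cleaned = authors.str.replace('Writtenby:', '', regex=False)

# Sanity check: no value should still start with the prefix
print(cleaned.str.startswith('Writtenby:').any())
print(cleaned.tolist())
```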

# Check the results
df.head()

# Take a look at the stars column
df['stars'].sample(n=100,random_state=1)

# 'Not rated yet' is a very frequent value. Look at a sample of this column that doesn't contain these values
df[df['stars'] != "Not rated yet"]['stars'].sample(n=100,random_state=1)

# Replace 'Not rated yet' with NaN
# replace() returns a new Series, so assign the result back to the column
df['stars'] = df['stars'].replace('Not rated yet', np.nan)

# A typical value of the stars column looks as follows:
# "5 out of 5 stars7 ratings".
# First, extract the rating and the number of ratings into new columns.
# Second, assign appropriate data types to these columns.
# Finally, drop the stars column

df['rating_stars'] = df['stars'].str.extract(pat = '([0-9.]+)').astype(float)

# Raw string avoids the invalid '\d' escape warning; 'ratings?' also matches a singular "1 rating"
df['n_ratings'] = df['stars'].str.extract(pat=r'(\d+) ratings?').astype(float)

df = df.drop('stars',axis=1)

# Examine the new rating_stars and n_ratings columns
df[['rating_stars','n_ratings']].head(10)
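
To see what the two patterns capture, here is a small sketch on toy strings. The first value is the typical one quoted above; the second and third are assumed variants, not rows copied from the dataset. Both results stay `float` because a missing value (NaN) cannot be stored in a plain integer column.

```python
import numpy as np
import pandas as pd

# Toy values; only the first is quoted in the text, the others are assumptions
stars = pd.Series([
    '5 out of 5 stars7 ratings',
    '4.5 out of 5 stars34 ratings',
    'Not rated yet',
])
stars = stars.replace('Not rated yet', np.nan)

# The first run of digits and dots is the star rating
rating = stars.str.extract(r'([0-9.]+)', expand=False).astype(float)

# The digits directly before " rating(s)" are the number of ratings
n_ratings = stars.str.extract(r'(\d+) ratings?', expand=False).astype(float)

print(pd.DataFrame({'rating_stars': rating, 'n_ratings': n_ratings}))
```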

# Explore the price column
df['price'].sample(n=100,random_state=1)

# The random sample reveals that some values contain a ',' character

# What about letters?
df[df['price'].str.contains('[a-z]')] # Shows that some values are 'Free'

# Fix these problems and assign an appropriate data type
df['price'] = df['price'].str.replace(',','')
df['price'] = df['price'].str.replace('Free', '0')
df['price'] = df['price'].astype(float)
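
The same three steps can be traced on a handful of toy prices. The values below are made up to cover the two problems found above (a ',' inside the number and the word 'Free'); they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical price strings covering the cases described above
prices = pd.Series(['468.00', '1,256.00', 'Free'])

cleaned = (
    prices
    .str.replace(',', '', regex=False)     # drop the comma
    .str.replace('Free', '0', regex=False) # a free title costs 0 rupees
    .astype(float)
)

print(cleaned.tolist())  # [468.0, 1256.0, 0.0]
```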

# Look at the unique values in the rating_stars column
df['rating_stars'].unique()

# Turn rating_stars to category
df['rating_stars'] = df['rating_stars'].astype('category')

# Convert releasedate to datetime
df['releasedate'] = pd.to_datetime(df['releasedate'])

# Check the time column
df['time'].sample(n=200,random_state=1)

# A typical value of the time column looks like this:
# "2 hrs and 20 mins". It is better to represent this as 140 minutes and store it as an integer.
# Also, "hrs" and "mins" may be spelled differently in some rows. Finally, there is a "Less than 1 minute" value.
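
One possible way to do this conversion is sketched below on toy values. The spelling variants ("hr", "min") are assumptions about what the column might contain, and counting "Less than 1 minute" as 1 minute is a choice made here for illustration, not something the source prescribes.

```python
import pandas as pd

# Toy values; all but the first and last are assumed spelling variants
times = pd.Series([
    '2 hrs and 20 mins',
    '1 hr and 1 min',
    '45 mins',
    '3 hrs',
    'Less than 1 minute',
])

# Digits before "hr"/"hrs"; rows without an hour part become 0
hours = times.str.extract(r'(\d+)\s*hrs?', expand=False).astype(float).fillna(0)

# Digits before "min"/"mins"/"minute"; "Less than 1 minute" yields 1 here
minutes = times.str.extract(r'(\d+)\s*min', expand=False).astype(float).fillna(0)

total_minutes = (hours * 60 + minutes).astype(int)
print(total_minutes.tolist())  # [140, 61, 45, 180, 1]
```

If the real column contains missing values, the final `astype(int)` would need a decision first, since a row with neither hours nor minutes ends up as 0 in this sketch.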

