Cleaning Data in Python
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Task 1: Loading and Inspecting the Data

We will be working with a dataset of audiobooks listed on audible.in, with release dates ranging from 1998 to 2025 (including pre-planned releases).

The first thing we will do is load the raw audible data.

Instructions:

  • Using pandas, read the audible_raw.csv file that is located inside the data folder in our local directory. Assign to audible.
  • Show the first few rows of the audible data frame.
import pandas as pd

# Load the audible_raw.csv file
audible = pd.read_csv('data/audible_raw.csv')

# View the first rows of the dataframe
audible.head()

💾 The data

  • "name" - The name of the audiobook.
  • "author" - The audiobook's author.
  • "narrator" - The audiobook's narrator.
  • "time" - The audiobook's duration, in hours and minutes.
  • "releasedate" - The date the audiobook was published.
  • "language" - The audiobook's language.
  • "stars" - The average number of stars (out of 5) and the number of ratings (if available).
  • "price" - The audiobook's price in INR (Indian Rupee).

We can use the .info() method to inspect the data types of the columns.

# Inspect the columns' data types (.info() prints its summary directly and returns None)
audible.info()

Task 2: Clean text data in Author and Narrator columns

We will start cleaning some of the text columns like author and narrator. We can remove the Writtenby: and Narratedby: portions of the text in those columns.

For this, we will use the .str.replace() method.

Instructions:

  • Remove 'Writtenby:' from the author column
  • Remove 'Narratedby:' from the narrator column
  • Check the results
# Remove Writtenby: from the author column (regex=False treats the pattern as a literal string)
audible['author'] = audible['author'].str.replace('Writtenby:', '', regex=False)

# Remove Narratedby: from the narrator column
audible['narrator'] = audible['narrator'].str.replace('Narratedby:', '', regex=False)

# Check the results
audible

Task 3: Extract number of stars and ratings from Stars column

The stars column combines the number of stars and the number of ratings. Let's turn this into numbers and split it into two columns: rating_stars and n_ratings.

First we will use the .sample() method to get a glimpse at the type of entries in that column.

# Get a glimpse of the stars column
audible.stars.sample(n=10)

# Alternative: sample whole rows first, then select the column
# sampled = audible.sample(n=10)
# print(sampled['stars'])

Since there are many instances of Not rated yet, let's filter them out and sample again:

# Explore the values of the stars column that are not 'Not rated yet'
audible[audible.stars != 'Not rated yet'].stars.sample(n=10)

# Alternative (not equivalent: filtering after sampling can return fewer than 10 rows)
# sampled = audible.sample(n=10)
# filtered = sampled[sampled['stars'] != 'Not rated yet']
# print(filtered['stars'])

As a first step, we can replace the instances of Not rated yet with NaN.

# Replace 'Not rated yet' with NaN
audible['stars'] = audible['stars'].replace('Not rated yet', np.nan)

print(audible['stars'])
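
To confirm that a replacement like this worked, one option is to count the missing values afterwards. The sketch below uses a small made-up Series (toy, with placeholder text standing in for real entries) rather than the audible data, so the names and values are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the stars column; the values are placeholders, not real data
toy = pd.Series(['Not rated yet', 'some rating text', 'Not rated yet'])

# Same replacement as above, applied to the toy Series
cleaned = toy.replace('Not rated yet', np.nan)

# Count how many entries are now missing
n_missing = cleaned.isna().sum()
print(n_missing)
```

On the real data, audible['stars'].isna().sum() would give the number of audiobooks without a rating.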

We can use .str.extract() to get the number of stars and the number of ratings into their own columns.

Instructions:

  • Extract the number of stars into the rating_stars column
  • Extract the number of ratings into the n_ratings column
  • Convert both new columns to float
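
The steps above can be sketched on a small toy data frame. The exact layout of the stars text is not shown in this section, so the format used here ("4.5 out of 5 stars1,234 ratings") is an assumption, as are the demo data frame and both regular expressions; they would need adjusting to whatever the .sample() output actually shows.

```python
import numpy as np
import pandas as pd

# Toy data; the "X out of 5 stars<N> ratings" layout is an assumed format
demo = pd.DataFrame({
    'stars': [
        '5 out of 5 stars34 ratings',
        '4.5 out of 5 stars1,234 ratings',
        np.nan,  # a book that was 'Not rated yet'
        '3 out of 5 stars1 rating',
    ]
})

# The leading number is taken as the star rating
demo['rating_stars'] = (
    demo['stars']
    .str.extract(r'^([\d.]+)', expand=False)
    .astype(float)
)

# The number just before "rating(s)" is taken as the rating count;
# thousands separators are stripped before converting to float
demo['n_ratings'] = (
    demo['stars']
    .str.extract(r'stars\s*([\d,]+)\s*rating', expand=False)
    .str.replace(',', '', regex=False)
    .astype(float)
)

print(demo)
```

Converting to float rather than int keeps the NaN rows intact, since a plain integer column cannot hold missing values.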