Skip to content

Cleaning Data in Python

Run the hidden code cell below to import the data used in this course.


1 hidden cell

Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

Add your notes here

# Add your code snippets here

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • For each DataFrame, inspect the data types of each column and, where needed, clean and convert columns into the correct data type. You should also rename any columns to have more descriptive titles.
  • Identify and remove all the duplicate rows in ride_sharing.
  • Inspect the unique values of all the columns in airlines and clean any inconsistencies.
  • For the airlines DataFrame, create a new column called International from dest_region, where values representing US regions map to False and all other regions map to True.
  • The banking DataFrame contains out of date ages. Update the Age column using today's date and the birth_date column.
  • Clean the restaurants_new DataFrame so that it better matches the categories in the city and type column of the restaurants DataFrame. Afterward, given typos in restaurant names, use record linkage to generate possible pairs of rows between restaurants and restaurants_new using criteria you think is best.
# drop values with filtering
df[df['x'] <= 5]

# drop value using .drop()
df.drop(df[df['x'] > 5].index, inplace = True)

# replace values with certain number
df.loc[df['x'] > 5, 'x'] = 5

# convert to date
import datetime as dt
pd.to_datetime(df['']).dt.date


today_date = dt.date.today()
# drop values with filtering
df[df['date'] <= today_date]
# drop value using .drop()
df.drop(df[df['date'] > today_date].index, inplace = True)
# replace values with certain number
df.loc[df['date'] > today_date, 'date'] = today_date
# find duplicates
duplicates = df.duplicated(subset = ['', '', ''], keep = False) # return True or False # keep = first/last/False
df[duplicates]

# df.drop_duplicates(subset = ['', '', ''], keep = first (default)/last, inplace = True)

Text and Categorical data

# inconsistent categories
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
# rows with inconsistent categories
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
study_data[inconsistent_rows]
# value consistency
# .str.lower()/.str.upper()
# .str.strip()

# collapsing data into categories
ranges = [0, 20000, 50000, np.inf]
group_names = ['0-200k', '200k-500k', '500k+']
df['income_group'] = pd.qcut(df['household_income'], q = 3, labels = group_names)
# or
df['income_group'] = pd.cut(df['household_income'], bins = ranges, labels = group_names)

df[['income_group', 'household_income']]

# map categories to fewer
mapping = {'Microsoft': 'DesktopOS', '':'', '':''}
df['Operating_system'] = df['Opearting_system'].replace(mapping)
df['Operating_system'].unique()
# Text
df[''].str.replace(r'\D+', '') # replace text with nothing

Advanced

#Uniformity

# converting date
df['date'] = pd.to_datetime(df['date'], infer_datetime_format = True, errors = 'coerce')
# or
df['date'] = df['date'].dt.strftime("%d-%m-%Y")
# Cross field validation
sum_classes = flights[['', '', '']].sum(axis = 1)
passenger_ = sum_classes == flights['total']

inconsistent_pass = flights[~passenger_]
consistent_pass = flights[passenger_]