Skip to content
Cleaning Data in Python
Cleaning Data in Python
Run the hidden code cell below to import the data used in this course.
# Import the course packages
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import missingno as msno
import fuzzywuzzy
import recordlinkage
# Import the course datasets
ride_sharing = pd.read_csv('datasets/ride_sharing_new.csv', index_col = 'Unnamed: 0')
airlines = pd.read_csv('datasets/airlines_final.csv', index_col = 'Unnamed: 0')
banking = pd.read_csv('datasets/banking_dirty.csv', index_col = 'Unnamed: 0')
restaurants = pd.read_csv('datasets/restaurants_L2.csv', index_col = 'Unnamed: 0')
restaurants_new = pd.read_csv('datasets/restaurants_L2_dirty.csv', index_col = 'Unnamed: 0')
Take Notes
Add notes about the concepts you've learned and code cells with code you want to keep.
Add your notes here
# Print the information of ride_sharing
print(ride_sharing.info())
# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())
# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')
# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'
# Print new summary statistics
print(ride_sharing['user_type_cat'].describe())
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes')
# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')
# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'
# Print formed columns and calculate average ride duration
print(ride_sharing[['duration','duration_trim','duration_time']])
print(ride_sharing['duration_time'].mean())
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')
# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27
# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')
# Print tire size description
print(ride_sharing['tire_sizes'].describe())
# Convert ride_date to date
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date
# Save today's date
today = dt.date.today()
# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today
# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())
# Find duplicates
duplicates = ride_sharing.duplicated(['ride_id'], keep = False)
# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')
# Print relevant columns of duplicated_rides
print(duplicated_rides[['ride_id','duration','user_birth_year']])
# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()
# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}
# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()
# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates == True]
# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0
Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- For each DataFrame, inspect the data types of each column and, where needed, clean and convert columns into the correct data type. You should also rename any columns to have more descriptive titles.
- Identify and remove all the duplicate rows in
ride_sharing
. - Inspect the unique values of all the columns in
airlines
and clean any inconsistencies. - For the
airlines
DataFrame, create a new column calledInternational
fromdest_region
, where values representing US regions map toFalse
and all other regions map toTrue
. - The
banking
DataFrame contains out of date ages. Update theAge
column using today's date and thebirth_date
column. - Clean the
restaurants_new
DataFrame so that it better matches the categories in thecity
andtype
column of therestaurants
DataFrame. Afterward, given typos in restaurant names, use record linkage to generate possible pairs of rows betweenrestaurants
andrestaurants_new
using criteria you think is best.
# Print categories DataFrame
print(categories)
# Print unique values of survey columns in airlines
print('Cleanliness: ', airlines['cleanliness'].unique(), "\n")
print('Safety: ', airlines['safety'].unique(), "\n")
print('Satisfaction: ', airlines['satisfaction'].unique(), "\n")
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])
# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)
# Print rows with inconsistent category
print(airlines[cat_clean_rows])
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])
# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)
# Print rows with inconsistent category
print(airlines[cat_clean_rows])
# Print rows with consistent categories only
print(airlines[~cat_clean_rows])
# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())
# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower()
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})
# Remove white spaces from `dest_size`
airlines['dest_size'] = airlines['dest_size'].str.strip()
# Verify changes have been effected
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())