Cleaning Data in Python

    Run the hidden code cell below to import the data used in this course.

    # Import the course packages
    import pandas as pd
    import numpy as np
    import datetime as dt
    import matplotlib.pyplot as plt
    import missingno as msno
    import fuzzywuzzy
    import recordlinkage 
    
    # Import the course datasets
    ride_sharing = pd.read_csv('datasets/ride_sharing_new.csv', index_col = 'Unnamed: 0')
    airlines = pd.read_csv('datasets/airlines_final.csv',  index_col = 'Unnamed: 0')
    banking = pd.read_csv('datasets/banking_dirty.csv', index_col = 'Unnamed: 0')
    restaurants = pd.read_csv('datasets/restaurants_L2.csv', index_col = 'Unnamed: 0')
    restaurants_new = pd.read_csv('datasets/restaurants_L2_dirty.csv', index_col = 'Unnamed: 0')

    Explore Datasets

    Use the DataFrames imported in the first cell to explore the data and practice your skills!

    • For each DataFrame, inspect the data types of each column and, where needed, clean and convert columns into the correct data type. You should also rename any columns to have more descriptive titles.
    • Identify and remove all the duplicate rows in ride_sharing.
    • Inspect the unique values of all the columns in airlines and clean any inconsistencies.
    • For the airlines DataFrame, create a new column called International from dest_region, where values representing US regions map to False and all other regions map to True.
    • The banking DataFrame contains out of date ages. Update the Age column using today's date and the birth_date column.
    • Clean the restaurants_new DataFrame so that it better matches the categories in the city and type columns of the restaurants DataFrame. Afterward, given typos in restaurant names, use record linkage to generate possible pairs of rows between restaurants and restaurants_new using criteria you think are best.

    Course Outline

    • Understanding how to diagnose different problems in data and their occurrence during the workflow.
    • Learning the side effects of not treating data correctly.
    • Exploring various ways to address different types of dirty data.
    • Discussing common data problems and solutions.

    Why Clean Data?

    • Cleaning data is a crucial step in the data science workflow, which involves accessing raw data, exploring and processing it, developing insights through visualizations or predictive models, and reporting those insights.
    • Dirty data can arise from duplicate values, misspellings, data type parsing errors, and legacy systems, compromising the insights and reports generated.

    Data Type Constraints

    • Working with various data types (text, integers, decimals, dates, zip codes, etc.) is common, and Python offers specific data type objects for ease of manipulation.
    • Ensuring variables have the correct data types is essential to avoid compromising analysis.

    Strings to Integers

    • Example: Calculating total revenue from a DataFrame where the Revenue column contains dollar signs and is stored as strings.
    • Use .dtypes to inspect data types and .info() to check data types and missing values.
    • To sum revenues, first remove the $ sign using .str.strip('$'), then convert the column to integers with .astype('int'). For decimal values, convert to float instead.
    • Verify the conversion with an assert statement to ensure the Revenue column is now an integer.
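A minimal sketch of this conversion, using a made-up Revenue column (the values here are illustrative, not the course data):

```python
import pandas as pd

# Hypothetical sales data: Revenue stored as strings with a $ sign
sales = pd.DataFrame({'Revenue': ['$1200', '$950', '$2300']})

# Summing now would concatenate strings, so first strip the $ sign...
sales['Revenue'] = sales['Revenue'].str.strip('$')

# ...then convert the column to integer (use 'float' for decimal values)
sales['Revenue'] = sales['Revenue'].astype('int')

# Verify the conversion before relying on the sum
assert sales['Revenue'].dtype == 'int'
print(sales['Revenue'].sum())  # 4450
```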

    The Assert Statement

    • The assert statement tests a condition: it returns nothing if the condition holds and raises an AssertionError if it does not, which makes it useful for validating data types or values.
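For example:

```python
# A passing assert returns nothing and execution continues
assert 1 + 1 == 2

# A failing assert raises AssertionError, optionally with a message
try:
    assert 1 + 1 == 3, 'arithmetic is broken'
except AssertionError as error:
    caught = str(error)

print(caught)  # arithmetic is broken
```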

    Numeric or Categorical?

    • Some data appears numeric but represents categories (categorical data), like a marriage status column with integers representing different statuses.
    • Misinterpreting these as numeric can lead to misleading results. Convert them to category type using .astype() to get more accurate summary statistics.
    ## Numeric Data or ...?

    This exercise works with the `ride_sharing` dataset, which includes San Francisco bicycle ride sharing information such as start and end stations, trip duration, and user details. The `user_type` column indicates:

    - 1: Free riders
    - 2: Pay per ride
    - 3: Monthly subscribers

    Use `.info()` on `ride_sharing` to examine potential data type issues. `pandas` is imported as `pd`.
    # Print the information of ride_sharing
    print(ride_sharing.info())
    
    # Print summary statistics of user_type column
    print(ride_sharing['user_type'].describe())
    
    # Convert user_type from integer to category
    ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')
    
    # Write an assert statement confirming the change
    assert ride_sharing['user_type_cat'].dtype == 'category'
    
    # Print new summary statistics 
    print(ride_sharing['user_type_cat'].describe())
    ## Fixing Data Types: String to Integer

    Previously, we corrected the `user_type` column's data type. Now, we'll focus on converting the `duration` column from string to integer. This involves removing "minutes" from its values to enable numerical operations. `pandas` is imported as `pd`.
    # Strip duration of minutes
    ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes')
    
    # Convert duration to integer
    ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')
    
    # Write an assert statement making sure of conversion
    assert ride_sharing['duration_time'].dtype == 'int'
    
    # Print formed columns and calculate average ride duration 
    print(ride_sharing[['duration','duration_trim','duration_time']])
    print(ride_sharing['duration_time'].mean())

    Handling Out of Range Data

    When working with datasets, it's common to encounter data that falls outside the expected range. This can occur due to errors in data collection or parsing. For instance, a dataset of movies might include ratings beyond the maximum allowable value, or a list of subscription dates might contain dates in the future, which logically isn't possible.

    Strategies for Dealing with Out of Range Data

    There are several approaches to handle out-of-range data:

    • Dropping Data: This is the simplest approach but should be used cautiously. It's suitable when only a small portion of the dataset is affected. Dropping data can be done by filtering the dataset to exclude out-of-range values or using the drop method with specific row indices.

    • Setting Limits: For values that exceed a logical range, setting them to a predefined maximum or minimum can be a practical solution. This approach depends on the underlying business assumptions of the dataset.

    • Imputing Data: Setting out-of-range values as missing and then imputing them with statistical methods can also be an option, though it's more commonly used for dealing with missing data.
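The three strategies can be sketched with a hypothetical movies DataFrame whose ratings should fall on a 1-5 scale (the data and column names are illustrative):

```python
import pandas as pd

# Hypothetical movies data: one rating exceeds the 5-star maximum
movies = pd.DataFrame({'name': ['A', 'B', 'C'], 'rating': [4, 6, 5]})

# Dropping: keep only rows within the valid range
movies_dropped = movies[movies['rating'] <= 5]

# Setting limits: cap out-of-range values at the maximum with .loc
movies.loc[movies['rating'] > 5, 'rating'] = 5

# Imputing: mark out-of-range values as missing, then fill statistically
movies_imputed = pd.DataFrame({'rating': [4.0, 6.0, 5.0]})
movies_imputed.loc[movies_imputed['rating'] > 5, 'rating'] = None
movies_imputed['rating'] = movies_imputed['rating'].fillna(
    movies_imputed['rating'].mean())

# All three treatments leave no rating above 5
assert movies_dropped['rating'].max() <= 5
assert movies['rating'].max() <= 5
assert movies_imputed['rating'].max() <= 5
```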

    Practical Examples

    • Movies Rating: For movie ratings that exceed the maximum value (e.g., 5), options include dropping these records or capping the ratings at 5 using the .loc method. Ensuring the changes are applied can be done with assert statements to check the maximum rating doesn't exceed the allowable limit.

    • Subscription Dates: If subscription dates are in the future, the data type of the column should first be confirmed and possibly converted to a datetime format for accurate comparison. Future dates can either be dropped or replaced with today's date, using the datetime.date.today() function for the current date. Assert statements can verify the treatment by ensuring no subscription dates are in the future.

    Handling out-of-range data is crucial for maintaining the accuracy and reliability of analyses. The chosen method depends on the dataset's specifics and the underlying business logic.

    ## Reducing Tire Size Costs

    We'll refine the `ride_sharing` DataFrame by adjusting the `tire_sizes` column. Initially, tire sizes are categorized as 26″, 27″, or 29″. To reduce maintenance costs, we'll cap tire sizes at 27″. This involves converting `tire_sizes` to integers, applying the cap, and verifying the adjustment.
    # Convert tire_sizes to integer
    ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')
    
    # Set all values above 27 to 27
    ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27
    
    # Reconvert tire_sizes back to categorical
    ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')
    
    # Print tire size description
    print(ride_sharing['tire_sizes'].describe())
    # Convert ride_date to datetime
    ride_sharing['ride_date'] = pd.to_datetime(ride_sharing['ride_date'])
    
    # Fix future dates to today's date
    ride_sharing.loc[ride_sharing['ride_date'] > pd.Timestamp.today(), 'ride_date'] = pd.Timestamp.today()
    
    # Verify the changes
    print(ride_sharing['ride_date'].describe())
    # Convert ride_date to date
    ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date
    
    # Save today's date
    today = dt.date.today()
    
    # Set all in the future to today's date
    ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today
    
    # Print maximum of ride_dt column
    print(ride_sharing['ride_dt'].max())

    Uniqueness Constraints and Handling Duplicate Values

    Introduction to Duplicate Values

    Duplicate values in a DataFrame occur when identical information is repeated across multiple rows, either for some or all columns. These can result from data entry errors, bugs in data pipelines, or the process of joining data from various sources.

    Identifying Duplicate Values

    To find duplicates, the .duplicated() method is used, which returns a boolean Series indicating duplicate rows. By default, all columns are checked, and all duplicates except the first occurrence are marked as True. Adjusting the method's arguments allows for more nuanced identification of duplicates.

    Customizing Duplicate Identification

    The subset argument of .duplicated() specifies which columns to check for duplication, while the keep argument determines which duplicates to mark (first, last, or all). This helps in accurately identifying and understanding the nature of duplicate data.
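A minimal sketch of subset and keep, using made-up ride data (the column names are illustrative, not the course dataset):

```python
import pandas as pd

# Hypothetical rides: rows 0-1 are complete duplicates;
# rows 2-3 share a ride_id but differ in duration
rides = pd.DataFrame({
    'ride_id':  [1, 1, 2, 2],
    'duration': [12, 12, 30, 31],
})

# Default: all columns checked, first occurrence not flagged
print(rides.duplicated().sum())  # 1

# subset restricts the check; keep=False flags every duplicate
flags = rides.duplicated(subset=['ride_id'], keep=False)
print(flags.sum())  # 4

# Sorting the flagged rows groups each duplicate pair together
print(rides[flags].sort_values('ride_id'))
```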

    Sorting Duplicate Rows

    Sorting the DataFrame using .sort_values() by specific columns can provide a clearer view of the duplicate rows, revealing complete and incomplete duplicates.

    Handling Complete Duplicates

    Complete duplicates can be removed using the .drop_duplicates() method, which also supports the subset and keep arguments for targeted removal. The inplace argument allows for direct changes to the DataFrame.
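For instance, assuming a small made-up DataFrame with one complete duplicate:

```python
import pandas as pd

# Hypothetical data where row 1 fully repeats row 0
rides = pd.DataFrame({'ride_id': [1, 1, 2], 'duration': [12, 12, 30]})

# Drop complete duplicates, keeping the first occurrence of each;
# inplace=True modifies the DataFrame directly
rides.drop_duplicates(inplace=True)

print(len(rides))  # 2
```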

    Addressing Incomplete Duplicates

    For incomplete duplicates, statistical measures like mean or maximum can be applied to consolidate rows. This is done using the .groupby() method combined with .agg(), specifying the desired statistical functions for each column involved in the duplication.
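A sketch of this consolidation, with hypothetical columns and aggregation choices:

```python
import pandas as pd

# Two rows share a ride_id but disagree on duration and user_age
rides = pd.DataFrame({
    'ride_id':  [1, 1, 2],
    'duration': [12, 14, 30],
    'user_age': [25, 26, 40],
})

# Consolidate incomplete duplicates: average the durations,
# keep the maximum age per ride_id
statistics = {'duration': 'mean', 'user_age': 'max'}
rides = rides.groupby('ride_id').agg(statistics).reset_index()

# Verify no duplicated ride_ids remain
assert rides['ride_id'].duplicated().sum() == 0
print(rides)
```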

    Verifying Results

    After processing, running .duplicated() again can confirm the elimination of duplicate values, ensuring a clean and accurate dataset.

    This approach to handling duplicates balances the need for data integrity with the practicalities of real-world data management, allowing for both the removal of exact duplicates and the thoughtful consolidation of near-duplicates.

    # Check for complete duplicates in ride_sharing
    duplicates = ride_sharing.duplicated()
    
    # Count duplicates
    print(duplicates.sum())
    
    # Check for incomplete duplicates under 'ride_id'
    incomplete_duplicates = ride_sharing.duplicated(subset='ride_id', keep=False)
    
    # Count incomplete duplicates
    print(incomplete_duplicates.sum())