
Cleaning Data in Python

Run the hidden code cell below to import the data used in this course.



Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

To Remove Something from a string USE the .str.strip() METHOD Ex - sales['Revenue'] = sales['Revenue'].str.strip('$') -- this strips the '$' from all rows in this column (IN PANDAS)

To Change DataType use the .astype('type') METHOD Ex - sales['Revenue'].astype('int') -- This changes all the rows in this column to int
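
For instance, a minimal sketch chaining the two notes above, assuming a hypothetical sales DataFrame whose Revenue column holds strings like '$1500':

import pandas as pd

# Hypothetical data: Revenue stored as strings with a '$' sign
sales = pd.DataFrame({'Revenue': ['$1500', '$2300', '$900']})

# Strip the '$' from all rows, then convert the column to int
sales['Revenue'] = sales['Revenue'].str.strip('$').astype('int')
print(sales['Revenue'].dtype)  # now an integer dtype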

To convert OBJECT to DATE in Pandas use pd.to_datetime().dt.date Ex - movies['subscription_date'] = pd.to_datetime(movies['subscription_date']).dt.date

^^ in this piece of code, we're converting to pandas' datetime type FIRST, then to the datetime library's date type

REMEMBER: DATES ARE DIFFERENT FROM TIMESTAMPS

To extract a certain part of the date out of a datetime value, use .dt.strftime() with a format code, e.g. .dt.strftime('%Y') for the year. The name means 'string format time': you specify which string you want to pull out of the date.
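
A quick sketch of the above (the movies frame here is made up for illustration):

import pandas as pd

movies = pd.DataFrame({'subscription_date': pd.to_datetime(['2021-01-15', '2022-06-30'])})

# Pull the year out of each datetime as a string
print(movies['subscription_date'].dt.strftime('%Y'))  # '2021', '2022'

# .dt.year gives the same information as an integer instead
print(movies['subscription_date'].dt.year)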

DATETIME LIBRARY

import datetime as dt

TO GET TODAY'S DATE USE: today = dt.date.today()
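
One common cleaning use of today's date is flagging impossible future dates; a minimal sketch (the movies data is made up):

import datetime as dt
import pandas as pd

today = dt.date.today()
movies = pd.DataFrame({'subscription_date': pd.to_datetime(['2020-03-01', '2099-01-01']).date})

# Any subscription date after today must be a data error
print(movies[movies['subscription_date'] > today])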

USE .DROP() METHOD TO DROP VALUES Ex- movies.drop(movies[movies['avg_rating']> 5].index, inplace = True)

inplace = True makes sure we are dropping the rows in place (the DataFrame itself is modified instead of a copy being returned)

USE .loc[] to select and change things (it's an indexer, so use square brackets) Ex- movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5
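
The two notes above are alternatives: .drop() removes the out-of-range rows, while .loc[] caps them. A minimal sketch with made-up ratings:

import pandas as pd

movies = pd.DataFrame({'avg_rating': [4.5, 5.0, 6.2, 3.8]})  # 6.2 is out of range

# Option 1: drop the offending rows in place
# movies.drop(movies[movies['avg_rating'] > 5].index, inplace=True)

# Option 2: keep the rows but cap the value at 5
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5
print(movies)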

FINDING DUPLICATE VALUES We can use the method .duplicated() to get a boolean output.

If we pass this into movies[movies.duplicated()], we will be able to see which rows are duplicates.

We can pass arguments into the .duplicated() method, including subset= COLUMN_NAME (which columns to check) and keep= 'first', 'last', or False (False marks ALL copies of the duplicate values)

TO REMOVE DUPLICATES WE CAN USE THE .drop_duplicates() METHOD. It takes the same subset and keep arguments as the previous method. HOWEVER, it also takes the inplace = True/False argument, similar to .drop()
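
A small sketch of both methods on made-up ride data:

import pandas as pd

rides = pd.DataFrame({'ride_id': [1, 1, 2, 3, 3],
                      'duration': [10, 10, 25, 7, 7]})

# Boolean mask: True for each duplicate after its first occurrence
print(rides[rides.duplicated()])

# keep=False marks ALL copies of a duplicate, checking only ride_id
print(rides[rides.duplicated(subset='ride_id', keep=False)])

# Drop complete duplicates, keeping the first occurrence, in place
rides.drop_duplicates(inplace=True)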

YOU CAN SORT YOUR OBSERVATIONS WITH THE .sort_values(by= COLUMN_NAME) METHOD

YOU CAN USE .groupby(by = ) METHOD to Group

YOU CAN USE .agg({'column': 'mean', ...}) -- a dictionary mapping each column to an aggregate function -- to find aggregate values

An Example:

statistics = {'user_birth_year': 'min', 'duration': 'mean'}

ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()

DEALING WITH NA VALUES

We can use .dropna(subset = [...]) to DROP rows that have NA values in the listed columns

We can use .fillna({ : }) to fill the NA values with another value (for example the mean). In the dictionary portion of this method, match each column with its desired fill value.

To visualize NA values, use msno.matrix(df) (msno is the missingno package) and follow that by plt.show(). It will show where your NAs are.
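
A minimal sketch of all three notes (the frame is made up; the missingno package must be installed for the plot):

import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt

df = pd.DataFrame({'duration': [10.0, np.nan, 25.0],
                   'user_birth_year': [1990, 1985, np.nan]})

# Drop rows with NA in the duration column
complete = df.dropna(subset=['duration'])

# Or fill each column's NAs with a chosen value, e.g. the column mean
filled = df.fillna({'duration': df['duration'].mean()})

# Visualize where the NAs sit
msno.matrix(df)
plt.show()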

COMPARING STRINGS USING THEFUZZ package

from thefuzz import fuzz

to compare two strings use fuzz.WRatio(string1, string2) --> your output will be a value from 0-100. 100: an exact match, 0: not similar at all
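
For example (exact scores depend on the thefuzz version, but the contrast holds):

from thefuzz import fuzz

print(fuzz.WRatio('Reading', 'Reeding'))  # high score: likely a typo
print(fuzz.WRatio('Houston', 'Rome'))     # low score: genuinely different strings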

You can use string similarity to match typos to the correct value (comes in handy when collapsing categorical data where many typos are present) --> This is done using a for loop

Here's the code for that (process.extract comes from thefuzz as well):

from thefuzz import process

for cuisine in categories:
    # Find every match for this cuisine across the cuisine_type column
    matches = process.extract(cuisine, restaurants['cuisine_type'],
                              limit=len(restaurants.cuisine_type))
    for match in matches:
        # If the similarity score is high enough, treat it as a typo:
        # select all rows where cuisine_type is spelled this way,
        # and set them to the correct cuisine
        if match[1] >= 80:
            restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine

print(restaurants['cuisine_type'].unique())

If we're trying to create pairs with two datasets, use the package: recordlinkage

import recordlinkage

indexer = recordlinkage.Index()
indexer.block('column of interest')
pairs = indexer.index(df1, df2)

To create a 'Compare Object' - this is responsible for describing how we want to compare the contents of the two dfs:

compare_cl = recordlinkage.Compare()

compare_cl.exact('column in df1', 'column in df2', label='labelname')

compare_cl.exact('another column in df1', 'another column in df2', label='another_labelname') -- each comparison needs its own label

To find approx. string similarity use: compare_cl.string('column in df1', 'column in df2', threshold= 0.85, label='labelname')

To find the matches, use: potential_matches = compare_cl.compute(pairs, df1, df2)

This will give you a df of what matches with what... not exactly the dataframe we want to see.

To find only the pairs that we want, use: potential_matches[potential_matches.sum(axis=1) >= 2]
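
Putting the whole pipeline together; a sketch on tiny made-up census-style frames (all column names here are assumptions):

import pandas as pd
import recordlinkage

census_A = pd.DataFrame({'state': ['NY', 'CA'],
                         'surname': ['Smith', 'Jones'],
                         'dob': ['1990-01-01', '1985-05-05']})
census_B = pd.DataFrame({'state': ['NY', 'CA'],
                         'surname': ['Smyth', 'Jones'],
                         'dob': ['1990-01-01', '1985-05-05']})

# Block on state so only rows sharing a state get paired
indexer = recordlinkage.Index()
indexer.block('state')
pairs = indexer.index(census_A, census_B)

# Describe the comparisons: exact on dob, fuzzy on surname
compare_cl = recordlinkage.Compare()
compare_cl.exact('dob', 'dob', label='dob')
compare_cl.string('surname', 'surname', threshold=0.85, label='surname')

potential_matches = compare_cl.compute(pairs, census_A, census_B)

# Keep only pairs that agree on at least 2 criteria
print(potential_matches[potential_matches.sum(axis=1) >= 2])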

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • For each DataFrame, inspect the data types of each column and, where needed, clean and convert columns into the correct data type. You should also rename any columns to have more descriptive titles.
  • Identify and remove all the duplicate rows in ride_sharing.
  • Inspect the unique values of all the columns in airlines and clean any inconsistencies.
  • For the airlines DataFrame, create a new column called International from dest_region, where values representing US regions map to False and all other regions map to True.
  • The banking DataFrame contains out of date ages. Update the Age column using today's date and the birth_date column.
  • Clean the restaurants_new DataFrame so that it better matches the categories in the city and type column of the restaurants DataFrame. Afterward, given typos in restaurant names, use record linkage to generate possible pairs of rows between restaurants and restaurants_new using criteria you think is best.
import pandas as pd
banking_dirty = pd.read_csv('datasets/banking_dirty.csv')
banking_dirty

import datetime as dt
today = dt.date.today()  # note the parentheses: dt.date.today alone is the function object, not the date

banking_dirty['birth_date'] = pd.to_datetime(banking_dirty['birth_date'])

# Recompute Age in whole years from today's date (vectorized; no loop needed)
banking_dirty['Age'] = (pd.to_datetime('today') - banking_dirty['birth_date']).dt.days // 365

banking_dirty
import pandas as pd
restaurants_l_2_dirty = pd.read_csv('datasets/restaurants_L2_dirty.csv')
restaurants_l_2_dirty