Cleaning Data in Python
Run the hidden code cell below to import the data used in this course.
Take Notes
Add notes about the concepts you've learned and code cells with code you want to keep.
To REMOVE characters from a string column USE the .str.strip() METHOD (it strips from both ends).
Ex - sales['Revenue'] = sales['Revenue'].str.strip('$') (here '$' is just an example character to strip)
To Change DataType use the .astype('type') METHOD. Ex - sales['Revenue'] = sales['Revenue'].astype('int') -- This converts all the rows in this column to int (assign the result back; .astype() returns a new Series).
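A quick runnable sketch tying these two together (the sales DataFrame and the '$' prefix are made up for illustration):

import pandas as pd

sales = pd.DataFrame({'Revenue': ['$1200', '$3400', '$560']})
sales['Revenue'] = sales['Revenue'].str.strip('$')   # remove the currency symbol from both ends
sales['Revenue'] = sales['Revenue'].astype('int')    # now the cleaned strings can become integers
print(sales['Revenue'].dtype)                        # int64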
To convert OBJECT to DATE in Pandas use pd.to_datetime(...).dt.date Ex - movies['subscription_date'] = pd.to_datetime(movies['subscription_date']).dt.date
^^ in this piece of code, we first convert to pandas' datetime type, THEN to the datetime library's date type
REMEMBER: DATES ARE DIFFERENT FROM TIMESTAMPS
To extract a certain part of the date out of a datetime value, use .dt.strftime() with a format code, e.g. .dt.strftime('%Y') for the year. strftime stands for 'string format time': you pass the format code for the piece of the date you want back as a string.
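A small sketch of pulling the year out (toy data; the column name just follows the notes above):

import pandas as pd

movies = pd.DataFrame({'subscription_date': ['2021-04-01', '2020-12-15']})
movies['subscription_date'] = pd.to_datetime(movies['subscription_date'])
movies['sub_year'] = movies['subscription_date'].dt.strftime('%Y')   # '2021', '2020'
# Other format codes: '%m' = month, '%d' = day, '%H' = hour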
DATE TIME LIBRARY
import datetime as dt
TO GET TODAY'S DATE USE: today = dt.date.today()
USE .DROP() METHOD TO DROP VALUES Ex- movies.drop(movies[movies['avg_rating']> 5].index, inplace = True)
inplace = True makes the drop happen in place, modifying the DataFrame directly instead of returning a new copy
USE the .loc[] indexer to select and change things Ex- movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5
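A toy sketch contrasting the two fixes for ratings above 5 (the data here is invented):

import pandas as pd

movies = pd.DataFrame({'avg_rating': [4.5, 5.0, 6.2]})
# Option 1: drop the offending rows entirely
# movies.drop(movies[movies['avg_rating'] > 5].index, inplace=True)
# Option 2: cap them at the maximum allowed value instead
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5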
FINDING DUPLICATE VALUES We can use the method .duplicated() to get a boolean Series.
If we pass this as a mask, movies[movies.duplicated()], we will be able to see which rows are duplicates.
We can pass arguments to .duplicated(), including subset=COLUMN_NAME (which columns to check for duplication) and keep='first', 'last', or False (False marks ALL duplicate values).
TO REMOVE DUPLICATES WE CAN USE THE .drop_duplicates() METHOD. It takes the same additional arguments as .duplicated(). HOWEVER, it also takes the inplace = True/False argument, similar to .drop()
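A minimal sketch of both methods on made-up data:

import pandas as pd

rides = pd.DataFrame({'ride_id': [1, 1, 2], 'duration': [10, 10, 25]})
print(rides[rides.duplicated()])                                      # show complete duplicate rows
rides.drop_duplicates(subset='ride_id', keep='first', inplace=True)   # keep first occurrence per ride_id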
YOU CAN SORT YOUR OBSERVATIONS WITH THE .sort_values(by= COLUMN_NAME) METHOD
YOU CAN USE .groupby(by = ) METHOD to Group
YOU CAN USE .agg({'COLUMN_NAME': 'mean', ...}) to find aggregate values per group
An Example:
statistics = {'user_birth_year': 'min', 'duration': 'mean'}
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()
DEALING WITH NA VALUES
We can use .dropna(subset = [...]) to DROP the rows that have NA in the listed columns
We can use .fillna({COLUMN_NAME: value}) to fill the NA values with another value (for example the column mean): in the dictionary, map each column to its desired fill value
To visualize NA values, use msno.matrix(df) and follow that by plt.show() (needs import missingno as msno and import matplotlib.pyplot as plt). It will show where your NA's are.
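A sketch of the NA workflow on a toy DataFrame (msno comes from the missingno package, which must be installed):

import numpy as np
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt

df = pd.DataFrame({'duration': [10, np.nan, 25], 'user_birth_year': [1990, 1985, np.nan]})
msno.matrix(df)                                       # visualize where the NAs sit
plt.show()
df = df.fillna({'duration': df['duration'].mean()})   # fill duration NAs with the column mean
df = df.dropna(subset=['user_birth_year'])            # drop rows still missing a birth year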
COMPARING STRINGS USING THEFUZZ package
from thefuzz import fuzz
to compare two strings use fuzz.WRatio(string1, string2) --> your output will be a score from 0 to 100. 100: most similar, 0: not similar at all
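A two-line check of the score range (the example strings are made up):

from thefuzz import fuzz
print(fuzz.WRatio('Reeding', 'Reading'))   # high score: one-letter typo
print(fuzz.WRatio('Houston', 'Dallas'))    # low score: unrelated strings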
You can use string similarity to match typos to the correct value (comes in handy when collapsing categorical data when many typos are present) --> This is done using a for loop
Here's the code for that (note: it uses process.extract, so import process from thefuzz):
from thefuzz import process
for cuisine in categories:
    # find all potential matches for this cuisine in the cuisine_type column
    matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants['cuisine_type']))
    for match in matches:
        if match[1] >= 80:  # similar enough: set all rows spelled this way to the correct cuisine
            restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine
print(restaurants['cuisine_type'].unique())
If we're trying to create pairs between two datasets, use the recordlinkage package:
import recordlinkage
indexer = recordlinkage.Index()
indexer.block('insert column of interest')   # only generate pairs that agree on this column
pairs = indexer.index(df1, df2)
Next, create a 'Compare object' - this is responsible for describing how we want to compare the DataFrames' contents:
compare_cl = recordlinkage.Compare()
compare_cl.exact('column in df1', 'column in df2', label='labelname')
compare_cl.exact('another column in df1', 'another column in df2', label='labelname')
To find approx. string similarity use: compare_cl.string('column in df1', 'column in df2', threshold= 0.85, label='labelname')
To find the matches, use: potential_matches = compare_cl.compute(pairs, df1, df2)
This will give you a DataFrame of 0/1 comparison scores for each candidate pair... not yet exactly the matches we want to see.
To find only the pairs that we want, use: potential_matches[potential_matches.sum(axis=1) >= 2] (i.e., pairs that agreed on at least 2 of the compared columns)
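A sketch pulling all the record-linkage steps together (the toy DataFrames and the 'state'/'name' columns are invented for illustration):

import pandas as pd
import recordlinkage

df1 = pd.DataFrame({'state': ['TX', 'CA'], 'name': ['Big Burger', 'Taco Town']})
df2 = pd.DataFrame({'state': ['TX', 'CA'], 'name': ['Big Burgr', 'Pasta Place']})

indexer = recordlinkage.Index()
indexer.block('state')                       # only pair rows that share a state
pairs = indexer.index(df1, df2)

compare_cl = recordlinkage.Compare()
compare_cl.exact('state', 'state', label='state')
compare_cl.string('name', 'name', threshold=0.85, label='name')
potential_matches = compare_cl.compute(pairs, df1, df2)

# Keep only pairs that agree on at least 2 of the compared columns
matches = potential_matches[potential_matches.sum(axis=1) >= 2]
print(matches)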
Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- For each DataFrame, inspect the data types of each column and, where needed, clean and convert columns into the correct data type. You should also rename any columns to have more descriptive titles.
- Identify and remove all the duplicate rows in ride_sharing.
- Inspect the unique values of all the columns in airlines and clean any inconsistencies.
- For the airlines DataFrame, create a new column called International from dest_region, where values representing US regions map to False and all other regions map to True.
- The banking DataFrame contains out of date ages. Update the Age column using today's date and the birth_date column.
- Clean the restaurants_new DataFrame so that it better matches the categories in the city and type column of the restaurants DataFrame. Afterward, given typos in restaurant names, use record linkage to generate possible pairs of rows between restaurants and restaurants_new using criteria you think is best.
import pandas as pd
banking_dirty = pd.read_csv('datasets/banking_dirty.csv')
banking_dirty
import datetime as dt
today = dt.date.today()   # note the parentheses - without them this is the function itself, not today's date
banking_dirty['birth_date'] = pd.to_datetime(banking_dirty['birth_date'])
# No loop needed: subtract birth years from the current year in one vectorized step
banking_dirty['Age'] = today.year - banking_dirty['birth_date'].dt.year
banking_dirty
import pandas as pd
restaurants_l_2_dirty = pd.read_csv('datasets/restaurants_L2_dirty.csv')
restaurants_l_2_dirty