Importing files

With pandas

  • data = pd.read_csv("filename", nrows, header, sep, comment, na_values)
  • data_array = data.values to transform to a NumPy array
  • data.to_csv('newfilename.csv') to write back out

With NumPy

  • np.loadtxt("filename", delimiter=',' or '\t', skiprows=1, usecols=[0, 2], dtype=str) # skip the first row if it contains strings, import only the 1st and 3rd column, import values as strings with dtype
  • np.genfromtxt("filename", delimiter=',', names=True, dtype=None) # names=True tells NumPy the file has headers (np.loadtxt has no names argument)

    ! ls : lists files accessible in the working directory in the IPython shell
    import os
    wd = os.getcwd()
    os.listdir(wd)
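A minimal sketch tying the loaders above together; the file names, columns and keyword values are hypothetical:

import numpy as np
import pandas as pd

# Read a CSV with pandas, skipping comment lines and treating 'NA' as missing (hypothetical file)
data = pd.read_csv("filename.csv", nrows=100, header=0, sep=',', comment='#', na_values='NA')
data_array = data.values          # transform to a NumPy array
data.to_csv('newfilename.csv')    # write back out

# Read only the 1st and 3rd (numeric) columns with NumPy, skipping a header row of strings
num_data = np.loadtxt("filename.csv", delimiter=',', skiprows=1, usecols=[0, 2])

# genfromtxt can use the header row as field names
named_data = np.genfromtxt("filename.csv", delimiter=',', names=True, dtype=None, encoding=None)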

    file = open("filename", mode ='r') # r for read, w for write

    text = file.read()
    file.close()

    Use a context manager (with) to keep the file open while you execute commands; once you're out of the context the file is no longer open:
    with open('filename', 'r') as file:
        print(file.read())

    Pickling
    Pickling files is serializing data = converting an object to a bytestream.
    import pickle
    with open('text', 'rb') as file: # rb is read-only and binary
        data = pickle.load(file)
Excel files
data = pd.ExcelFile('text')
data.sheet_names
data.parse(0, skiprows=[], usecols=[0, 1], names=['1', '2']) # 0 = sheet name or index

SAS/Stata files
from sas7bdat import SAS7BDAT
with SAS7BDAT('text.sas7bdat') as file:
    df_sas = file.to_data_frame()

data = pd.read_stata('text.dta')

HDF5 data files
Used for large data files, so becoming more common
import h5py
data = h5py.File('filename', 'r')
Explore its structure by:
for key in data.keys():
    print(key)

for key in data['key1'].keys():
    print(key) # will give you the contents of the first key

Access the data within:
print(np.array(data['key1']['subkey1']), np.array(data['key1']['subkey 2']))

MATLAB files
import scipy.io
scipy.io.loadmat('filename') to load a .mat file and scipy.io.savemat() to save one
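A minimal sketch, assuming a hypothetical data.mat file that contains a MATLAB variable named x:

import numpy as np
import scipy.io

# loadmat returns a dict: MATLAB variable names are keys, values are NumPy arrays
mat_contents = scipy.io.loadmat('data.mat')
print(mat_contents.keys())   # also contains metadata keys such as '__header__'
x = mat_contents['x']        # hypothetical variable stored in the .mat file

# savemat writes a dict of arrays back out to a .mat file
scipy.io.savemat('data_out.mat', {'x': np.asarray(x)})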

Importing DateTime data

To import DateTime data use:
pd.read_csv("df.csv", parse_dates =['date column'])

Or, after importing, use:
df['date column'] = pd.to_datetime(df['date column'])

Or combine three separate columns into one datetime column (the columns must be named month, day, year):
df['date column'] = pd.to_datetime(df[['month', 'day', 'year']])

Extracting parts of a DateTime column using:
df['month column'] = df['date column'].dt.month (or .dt.day, .dt.year, .dt.weekday, .dt.hour)
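A minimal sketch combining these options; the CSV name and column names are hypothetical:

import pandas as pd

# Parse a date column while reading (hypothetical file and column names)
df = pd.read_csv("sales.csv", parse_dates=['order_date'])

# Or convert after importing
df['order_date'] = pd.to_datetime(df['order_date'])

# Extract parts of the datetime column
df['order_month'] = df['order_date'].dt.month
df['order_weekday'] = df['order_date'].dt.weekday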

Relational databases

Each row needs to contain a unique identifier.
Tables are linked through corresponding columns and identifiers.
Interacting with databases is done through SQL.
SQLAlchemy is a Python package that can interact with SQL databases:

from sqlalchemy import create_engine
engine = create_engine('sqlite:///filename.sqlite')
table_names = engine.table_names()

Querying
SELECT * FROM Table_Name # returns all columns of all rows of the table
Open a database table and store the results in a pandas df.
Connect to the database first:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///filename.sqlite')
con = engine.connect()
rs = con.execute("SELECT * FROM Table_name")
df = pd.DataFrame(rs.fetchall())
df.columns = rs.keys()
con.close()

To circumvent having to close the connection, use a with block:
with engine.connect() as con:
    rs = con.execute("SELECT c1, c2 FROM Table_name WHERE c1 >= crit1")
    df = pd.DataFrame(rs.fetchmany(size=5))
    df.columns = rs.keys()
You can also do this in one line:
engine = create_engine('sqlite:///filename.sqlite')
df = pd.read_sql_query("SELECT * FROM Orders ORDER BY c1", engine)

Joining tables
Do the join inside the query:
pd.read_sql_query("SELECT c1, c2 FROM Table1 INNER JOIN Table2 ON Table1.Table1ID = Table2.Table2ID", engine)
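A minimal sketch of such a join, assuming a SQLite file with hypothetical Orders and Customers tables that share a CustomerID column:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical database, table and column names
engine = create_engine('sqlite:///filename.sqlite')

query = """
SELECT Orders.OrderID, Customers.Name
FROM Orders
INNER JOIN Customers ON Orders.CustomerID = Customers.CustomerID
"""
df = pd.read_sql_query(query, engine)
print(df.head())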

Cleaning data

Getting an overview of a dataframe

  • .head() -> header, first few rows
  • .shape -> nr of rows and columns
  • .info() -> column names, non-null count, datatype
  • .describe() -> some statistics (count, mean, stdev, percentiles)
  • .count() -> number of non-missing values per column
  • .values -> the data as a NumPy array
  • .columns -> column names
  • .index -> row numbers(/names) (start=0, stop = x, step = 1)

df.info() to check datatypes and nulls
Remove strings from integer columns: df['c'] = df['c'].str.strip('string') followed by df['c'] = df['c'].astype('int')
Test using assert:
assert df['c'].dtype == 'int' # if not true it will throw an AssertionError
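A minimal sketch of this cleanup, assuming a hypothetical 'revenue' column stored as strings like '100$':

import pandas as pd

# Hypothetical column of revenue values stored as strings, e.g. '100$'
df = pd.DataFrame({'revenue': ['100$', '250$', '75$']})

# Strip the unit character and convert to integer
df['revenue'] = df['revenue'].str.strip('$')
df['revenue'] = df['revenue'].astype('int')

# Verify the conversion; raises an AssertionError if it failed
assert df['revenue'].dtype == 'int'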

Data range constraints

import datetime as dt
today_date = dt.date.today()
df[df['c'] > dt.date.today()] # selects rows with dates after today (e.g. impossible future dates)
or use drop:
df.drop(df[df['c'] > criteria].index, inplace = True)
assert df['c'].max() <= criteria
You can also set the values that do not meet the criteria to a value:
df.loc[df['c'] > criteria, 'c'] = criteria

Convert to Date:
df['c'] = pd.to_datetime(df['c']).dt.date
df = df[df['c'] < today_date]
df.loc[df['c'] > today_date, 'c'] = today_date # sets dates past today to today's date

Uniqueness constraints:

duplicates = df.duplicated(subset (=list), keep ('first', 'last', False))
df[duplicates].sort_values(by= 'c')

df.drop_duplicates(subset=["column_name", "column_name2"], keep, inplace = True)

After checking for duplicates with the first method, you can also combine duplicate rows into aggregates, e.g. an average:
df= df.groupby(by= ['c1', 'c2']).agg({'height': 'max', 'weight': 'mean'}).reset_index()
Check if aggregation is done by repeating duplicated method:
duplicates = df.duplicated(subset = column_names, keep = False)
df[duplicates].sort_values(by='c')
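A minimal sketch of this flow on a hypothetical DataFrame with name, height and weight columns:

import pandas as pd

# Hypothetical data: the same person measured twice
df = pd.DataFrame({
    'name':   ['Ann', 'Ann', 'Bob'],
    'height': [170, 168, 180],
    'weight': [60.0, 62.0, 80.0],
})

# Flag all copies of duplicated names
duplicates = df.duplicated(subset=['name'], keep=False)
print(df[duplicates].sort_values(by='name'))

# Collapse duplicates: keep the max height and the mean weight per name
df = df.groupby(by=['name']).agg({'height': 'max', 'weight': 'mean'}).reset_index()

# Verify that no duplicates remain
assert not df.duplicated(subset=['name'], keep=False).any()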

Missing values

  • my_array.isna().any() Shows which columns contain missing values
  • my_array.isna().sum() Counts the number of missing values per column
  • my_array.dropna() Removes rows with missing values
  • my_array.fillna(0) Replaces missing values with 0
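A short sketch of these checks on a hypothetical DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 6.0]})

print(df.isna().any())   # which columns contain missing values
print(df.isna().sum())   # number of missing values per column
print(df.dropna())       # rows with missing values removed
print(df.fillna(0))      # missing values replaced with 0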

Missingno visualization

import missingno as msno
msno.matrix(df)

missing = df[df['c'].isna()]
missing.describe() to get more insight into the missing values.

visualized by:
sorted_df = df.sort_values(by ='c')
msno.matrix(sorted_df)

There are three types of missing data:

  1. Missing completely at random
  2. Missing at random (systematic relationship between missing data and observed values)
  3. Missing not at random (systematic relationship between missing data and unobserved values, like data missing at high temperatures due to a sensor error; the sensor error itself is unobserved)

Imputing

One can impute missing values using various statistics. Rows with missing values are usually only dropped in columns where no more than 5% of the values are missing:
threshold = len(df) * 0.05
cols_to_drop = df.columns[df.isna().sum() <= threshold]
df.dropna(subset=cols_to_drop, inplace=True) # inplace updates the DataFrame in place

Imputation goes as follows:
cols_with_missing_values = df.columns[df.isna().sum() > 0]
for col in cols_with_missing_values[:-1]:
    df[col] = df[col].fillna(df[col].mode()[0]) # [0] indexes the first mode; use .mean() or .median() instead for numeric columns
Or use:
c_mean = df['c'].mean()
df_imputed = df.fillna({'c': c_mean})

Imputing by sub-group:
df_dict = df.groupby('column to group on')['column to impute'].median().to_dict()
df['column to impute'] = df['column to impute'].fillna(df['column to group on'].map(df_dict))
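A minimal sketch of group-wise imputation, using hypothetical city and temperature columns:

import numpy as np
import pandas as pd

# Hypothetical data: temperature missing for one Paris row
df = pd.DataFrame({
    'city': ['Paris', 'Paris', 'Oslo', 'Oslo'],
    'temperature': [21.0, np.nan, 8.0, 10.0],
})

# Median temperature per city
city_medians = df.groupby('city')['temperature'].median().to_dict()

# Fill each missing temperature with its city's median
df['temperature'] = df['temperature'].fillna(df['city'].map(city_medians))
print(df)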

Categorical membership constraints

One should keep a log of the possible categories that values can fall into.
Using an anti-join you can then view which observations did not fall into your category list.
Or use an inner join, for observations that are found in both dataframes.

inconsistent_categories = set(df['c']).difference(categories['c'])
inconsistent_rows = df['c'].isin(inconsistent_categories)
inconsistent_data = df[inconsistent_rows]

Or drop data:
consistent_data = df[~inconsistent_rows]

Check categories alternatively by: df.groupby('c').count()

Fixing issues:

Wrong capitalization:
df['c'] = df['c'].str.upper() or df['c'].str.lower()

Trailing spaces:
df['c'] = df['c'].str.strip()

Collapsing data is possible in two ways:

1 group_names = ['0-200K', '200K-500K']
df['new category column'] = pd.qcut(df['c'], q=2, labels=group_names)
df[['new category column', 'c']]

2 ranges = [0, 200000, 500000, np.inf]
group_names = ['0-200K', '200K-500K', '500K+']
df['new column'] = pd.cut(df['c'], bins=ranges, labels=group_names)

Reduce the amount of categories by mapping:

mapping = {'category to collapse 1': 'category to keep 1', 'category to collapse 2': 'category to keep 2'}
df['c']=df['c'].replace(mapping)
df['c'].unique()

Replacing text fields:

  • Use .str.replace()
  • Limit the amount of characters: df.loc[df['c'].str.len() > 10, 'c'] = np.nan
  • Check if a field is of a specific length: assert df['c'].str.len().min() >= 10
  • Check if it contains certain characters: assert df['c'].str.contains(r"\+|-").any() == False # escape the + so it is not read as a regex quantifier

Regex

df['c'].str.replace(r'\D+', '', regex=True) --> replaces anything that is not a digit with ''
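For example, stripping everything but digits from a hypothetical phone number column:

import pandas as pd

df = pd.DataFrame({'phone': ['+31 6-1234-5678', '(020) 555 0199']})

# Remove every non-digit character (regex=True enables the regular expression)
df['phone'] = df['phone'].str.replace(r'\D+', '', regex=True)
print(df['phone'])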

Comparing strings using fuzzy

Check for typos using: df['c'].unique()

Minimum edit distance
We will use the Damerau-Levenshtein algorithm to determine this, using the thefuzz package.

from thefuzz import fuzz
fuzz.WRatio("text", "txt") --> returns a number 0-100 indicating how closely the strings match.

Or use process to compare a string against an array of strings:
from thefuzz import process
string = "long text multiple words"
choices = pd.Series(['text words', 'words text', 'long multiple', 'multiple long'])
process.extract(string, choices, limit=2) # limit is the number of possible matches to return, ranked from highest to lowest
It returns a list of tuples with 3 elements: matching string returned, score, index in the array
Instead of specifying choices, it's easier to use unique_types = df['c'].unique() and limit=len(unique_types) to check against all unique types.

You can collapse categories with replace, but this can become complicated, so you may want to use string matching instead:

for state in categories['state']:
    # Find potential matches in states with typos
    matches = process.extract(state, survey['state'], limit=survey.shape[0])
    # For each potential match
    for potential_match in matches:
        # If the similarity score is high
        if potential_match[1] >= 80:
            # Replace the typo with the correct category
            survey.loc[survey['state'] == potential_match[0], 'state'] = state

Record linkage (preventing duplication)

Blocking creates pairs based on matching columns to keep the comparison scalable

import recordlinkage
indexer = recordlinkage.Index()
indexer.block('c to block on')
pairs = indexer.index(df1, df2) # returns the possible pairs of indices on which the DataFrames can be subsetted

Finding potential matches follows up on the previous code:
compare_cl = recordlinkage.Compare()

# Find exact matches for pairs of date_of_birth and state columns
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('state', 'state', label='state')

# Find similar matches for pairs of surname using string similarity
compare_cl.string('surname', 'surname', threshold=0.85, label='surname')

# Find the matches
potential_matches = compare_cl.compute(pairs, df1, df2)
potential_matches[potential_matches.sum(axis=1) >= 2]
Linking dataframes
Now we want to remove the duplicates before appending them together.
Get indices from df2 only:
duplicate_rows = matches.index.get_level_values(1)
df2_duplicates = df2[df2.index.isin(duplicate_rows)]
df2_new = df2[~df2.index.isin(duplicate_rows)]
full_df = df1.append(df2_new) # in newer pandas use pd.concat([df1, df2_new])

Unit Uniformity

Take temperature as example:

temp_fah = tempdf.loc[tempdf['c'] > 40, 'c']
temp_cels = (temp_fah -32) * (5/9)
tempdf.loc[tempdf['c'] > 40, 'c'] = temp_cels
assert tempdf['c'].max() < 40

Date data:

datedf['c'] = pd.to_datetime(datedf['c'], infer_datetime_format=True, errors = 'coerce')
or datedf['c'] = datedf['c'].dt.strftime("%d-%m-%y")

Cross field validation

Validate a total by row-wise summing:

sum_classes = df[['c1', 'c2']].sum(axis=1)
df_equ = sum_classes == df['total column']
Filter out inconsistent data:
inconsistent_df = df[~df_equ]
consistent_df = df[df_equ]

Working with date time data

from datetime import date, datetime, timedelta, timezone


date_series = [date(2016, 3, 15), date(2017, 10, 21)] # creates date objects
date_series[0].year or date_series[0].weekday() # accesses the year/weekday of the first entry; use -1 for the last entry
delta = d1 - d2 --> delta.days gives you the number of days between the two
td = timedelta(days=29) # creates a 29-day timedelta
x = x + 1 is the same as x += 1
d = date(YYYY, MM, DD) --> print the date as a string using d.isoformat()
d.strftime() method is for every other format (d.strftime("Year is %Y-%m or %B - %d or %j and %H:%M:%S") for example)
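A short sketch of these basics:

from datetime import date, timedelta

d1 = date(2016, 3, 15)
d2 = date(2017, 10, 21)

print(d1.year, d1.weekday())    # parts of a date: year and weekday (0 = Monday)
print((d2 - d1).days)           # number of days between the two dates
print(d2 + timedelta(days=29))  # add a 29-day timedelta
print(d2.isoformat())           # '2017-10-21'
print(d2.strftime("Year is %Y, month %B, day %j of the year"))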

Time

dt.replace(minute=0, second=0, microsecond=0) will round a parsed datetime down to the hour
dt = datetime.strptime("date time text", "format of said date time") # to parse dates from text
A unix timestamp (ts) is the number of seconds since 01-01-1970; to read it, use datetime.fromtimestamp(ts)
Example:

# Initialize a list for holding the pairs of datetime objects
onebike_datetimes = []

# Loop over all trips
for (start, end) in onebike_datetime_strings:
    trip = {'start': datetime.strptime(start, fmt),
            'end': datetime.strptime(end, fmt)}

    # Append the trip
    onebike_datetimes.append(trip)

Working with durations

duration = end-start --> duration.total_seconds()
Create a timedelta by hand:
delta1 = timedelta(days=1, seconds=1)

UTC offsets: the UK is at offset 0 (UTC)

ET = timezone(timedelta(hours=-5))
dt = datetime(2017, 12, 30, 15, 9, 3, tzinfo= ET)

Converting timezones:
IST = timezone(timedelta(hours=5, minutes=30))
dt.astimezone(IST)
.replace(tzinfo=...) only changes the timezone label, whereas .astimezone() actually shifts the hours and days:
dt.replace(tzinfo=timezone.utc)

Retrieve timezones using dateutil package:
from dateutil import tz
et = tz.gettz('America/New_York')

Daylight saving time
The dateutil package deals with this; when calculating durations you can always convert back to UTC.
tz.datetime_ambiguous(dt) shows if a time could occur at two different UTC moments (the clock was set back one hour)
tz.enfold(second_1am) marks a datetime as belonging to the second time the clock showed 1 AM that day (it sets fold=1)

Cleaning daylight saving data with fold
As we've just discovered, there is a ride in our data set which is being messed up by a Daylight Savings shift. Let's clean up the data set so we actually have a correct minimum ride length. We can use the fact that we know the end of the ride happened after the beginning to fix up the duration messed up by the shift out of Daylight Savings.

Since Python does not handle tz.enfold() when doing arithmetic, we must put our datetime objects into UTC, where ambiguities have been resolved.

onebike_datetimes is already loaded and in the right timezone. tz and timezone have been imported. Use tz.UTC for the timezone.
\
trip_durations = []
for trip in onebike_datetimes:
  # When the start is later than the end, set the fold to be 1
  if trip['start'] > trip['end']:
    trip['end'] = tz.enfold(trip['end'])
  # Convert to UTC
  start = trip['start'].astimezone(tz.UTC)
  end = trip['end'].astimezone(tz.UTC)

  # Subtract the difference
  trip_length_seconds = (end-start).total_seconds()
  trip_durations.append(trip_length_seconds)

# Take the shortest trip duration
print("Shortest trip: " + str(min(trip_durations)))

Reading date and time

Use the argument parse_dates (a list of column names) in pd.read_csv to indicate which columns should be read as datetime data.
Or manually convert using pd.to_datetime(df['c'], format = "%Y-%m-%d %H:%M:%S")
You can do calculations with these columns, e.g. subtract two datetime columns and convert the result with .dt.total_seconds().

You can use groupby-style aggregation on datetime columns:
df.resample('M', on='date column')['c to calculate'].mean() # 'M' = month; gives the mean of 'c' for each month

Timezones in pandas
To add timezones to a column in a pandas df use:
df['c'].dt.tz_localize('Continent/City')
If pandas encounters a datetime that is ambiguous because of a daylight saving transition, it will raise an error.
You can set the ambiguous argument of tz_localize to 'NaT', which marks the ambiguous time as "Not a Time"; pandas will skip over it when using other methods.
Use .dt.tz_convert() to convert to a different timezone.
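A minimal sketch, using a hypothetical London timestamp column:

import pandas as pd

df = pd.DataFrame({'ts': pd.to_datetime(['2017-10-29 01:30:00', '2017-10-29 03:00:00'])})

# Attach a timezone; the ambiguous time (clocks set back) becomes NaT instead of raising an error
df['ts'] = df['ts'].dt.tz_localize('Europe/London', ambiguous='NaT')

# Convert to another timezone
df['ts_ny'] = df['ts'].dt.tz_convert('America/New_York')
print(df)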

Shift rows to line up end times with the start times of the previous row:
df['End date'].shift(1)

# Example
# Shift the index of the end date up one; now subtract it from the start date
rides['Time since'] = rides['Start date'] - (rides['End date'].shift(1))

# Move from a timedelta to a number of seconds, which is easier to work with
rides['Time since'] = rides['Time since'].dt.total_seconds()

# Resample to the month
monthly = rides.resample('M', on ='Start date')

# Print the average hours between rides each month
print(monthly['Time since'].mean()/(60*60))
As a consultant working for a real estate start-up, you have collected Airbnb listing data from various sources to investigate the short-term rental market in New York. You'll analyze this data to provide insights on private rooms to the real estate company.

There are three files in the data folder: airbnb_price.csv, airbnb_room_type.xlsx, airbnb_last_review.tsv.

    What are the dates of the earliest and most recent reviews? Store these values as two separate variables with your preferred names.
    How many of the listings are private rooms? Save this into any variable.
    What is the average listing price? Round to the nearest penny and save into a variable.
    Combine the new variables into one DataFrame called review_dates with four columns in the following order: first_reviewed, last_reviewed, nb_private_rooms, and avg_price. The DataFrame should only contain one row of values.
    
# We've loaded your first package for you! You can add as many cells as you need.
import numpy as np
import pandas as pd
from datetime import date, datetime, timedelta, timezone

# Begin coding here ...
prices = pd.read_csv('data/airbnb_price.csv')
room_type = pd.read_excel('data/airbnb_room_type.xlsx')
last_review = pd.read_csv('data/airbnb_last_review.tsv', sep ='\t')

print(prices.head())
print(room_type.head())
print(last_review.head())   

# Merging the dataframes
airbnb_merged = prices.merge(room_type, on='listing_id')
airbnb_merged = airbnb_merged.merge(last_review, on='listing_id')
airbnb_merged.head()

# Retrieving the dates of the most recent and earliest reviews
airbnb_merged['last_review'] = pd.to_datetime(airbnb_merged['last_review'], format= "%B %d %Y")
recent = max(airbnb_merged['last_review'])
earliest = min(airbnb_merged['last_review'])
print(recent)
print(earliest)

airbnb_merged['room_type'] = airbnb_merged['room_type'].str.lower()
private_count = airbnb_merged[airbnb_merged['room_type'] == "private room"]['room_type'].count()
private_count

airbnb_merged['price_clean'] = airbnb_merged['price'].str.replace(' dollars', "").astype(float)
price_avg = airbnb_merged['price_clean'].mean()
price_avg

review_dates = {'first_reviewed': [earliest], 
                'last_reviewed': [recent], 
                'nb_private_rooms': [private_count], 
                'avg_price': [round(price_avg, 2)]
               }
review_dates = pd.DataFrame(review_dates)
review_dates

Sorting and Subsetting

To print a subset of a NumPy array:
under_21 = my_array < 21 # create a boolean array
print(my_array[under_21]) # print all values that satisfy the condition

Using loc/iloc

Only use .loc with indexed (labelled) arrays! loc = names, iloc = index number
Pivot tables are indexed tables

Subsetting a 2D array is formatted as follows:

  • my_array.iloc[row, column]
    Subsetting the first full column would be:
  • my_array.iloc[:,0] or
  • my_array['Column_name'] or my_array[['column1', 'column2']]

with numpy arrays:
my_np_array[:,0]
Subsetting rows in a pandas df:

  • my_array['column_name'].iloc[row1:row2]

    Subsetting indexed arrays on rows and columns
  • my_array_ind_srt.loc['Rowname1':'Rowname2', 'columnname1':'columnname2']
    Two-level index and column subsetting:
  • my_array_ind_srt.loc[('index1.subset1', 'index2.subset1'):('index1.subset2', 'index2.subset2'), 'columnname1':'columnname2']
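A short sketch of loc/iloc subsetting on a small hypothetical DataFrame with a two-level index:

import pandas as pd

# Hypothetical data indexed by country and city
df = pd.DataFrame(
    {'population': [2.1, 0.5, 0.8, 8.9], 'area': [105, 48, 552, 1572]},
    index=pd.MultiIndex.from_tuples(
        [('France', 'Paris'), ('France', 'Lyon'), ('UK', 'Leeds'), ('UK', 'London')],
        names=['country', 'city'],
    ),
)
df = df.sort_index()  # .loc slicing on a MultiIndex needs a sorted index

print(df.iloc[:, 0])       # first full column by position
print(df['population'])    # the same column by name
# Label-based slicing on both index levels and on the columns
print(df.loc[('France', 'Lyon'):('UK', 'Leeds'), 'population':'area'])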