NOTES

# Importing numpy and pandas
import numpy as np
import pandas as pd

# Importing the course datasets
deals = pd.read_csv("datasets/amir_deals.csv")
happiness = pd.read_csv("datasets/world_happiness.csv")
food = pd.read_csv("datasets/food_consumption.csv")

Notes

P(event) = (# of ways an event can happen)/(total # of possible outcomes)

Discrete Distributions (discrete outcomes)

  • Probability Distribution - describes the probability of each possible outcome in a scenario.
  • Expected Value - the mean of a distribution.
    • multiply each value by its probability, then sum
  • Probability = area, when it comes to bar charts of a discrete distribution.
    • P(die_roll ≤ 2) = (1 * 1/6) + (1 * 1/6) = 1/3
  • Law of large numbers - as your sample size increases, the sample mean approaches the Expected Value (see the sketch below).
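
A minimal sketch (not course code) of the expected value and the law of large numbers for a fair die:

# Expected value of a fair die: multiply each value by its probability, then sum
import numpy as np
die_vals = np.array([1, 2, 3, 4, 5, 6])
probs = np.full(6, 1/6)
print(np.sum(die_vals * probs))   # 3.5

# Law of large numbers: the mean of many simulated rolls approaches 3.5
rolls = np.random.choice(die_vals, size=100000)
print(rolls.mean())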

Continuous Distributions

  • Continuous uniform distribution - infinitely many possible outcomes, all equally likely within a range
  • Probability still = area
    • P(wait time <= 7) = ?
      • from scipy.stats import uniform
      • uniform.cdf(7, 0, 12) → arguments are (value, loc, scale); loc is the lower limit and scale is the width of the interval (here 0 and 12)
    • P(wait time >= 7) = 1-P(wait time <= 7)
  • Generating random numbers
    • from scipy.stats import uniform
    • uniform.rvs(0, 5, size=10)
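
A runnable sketch of the uniform examples above (a 0-12 minute wait window and a 0-5 minute window, as in the bullets):

from scipy.stats import uniform

# P(wait time <= 7) for a wait uniformly distributed on [0, 12] minutes
print(uniform.cdf(7, loc=0, scale=12))

# P(wait time >= 7) is the complement
print(1 - uniform.cdf(7, loc=0, scale=12))

# 10 simulated wait times between 0 and 5 minutes
print(uniform.rvs(loc=0, scale=5, size=10))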

The binomial distribution

  • An outcome with two possible values (ie. Heads/Tails, Pass/Fail, 1/0, etc.)
    • from scipy.stats import binom
    • binom.rvs(# of coins per trial, probability of success = p, size = # of trials)
  • Probability distribution of the number of successes in a sequence of independent trials.
  • Expected Value = n * p
  • Each trial must be independent: one outcome cannot affect the next (see the sketch below).
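
A minimal sketch of simulating a binomial and checking its expected value, assuming 10 fair coin flips per trial:

from scipy.stats import binom

# Number of heads in 10 fair coin flips, simulated 1000 times
flips = binom.rvs(10, 0.5, size=1000)

# Expected Value = n * p = 10 * 0.5 = 5; the simulated mean should be close
print(flips.mean())

# P(exactly 7 heads) and P(7 or fewer heads)
print(binom.pmf(7, 10, 0.5), binom.cdf(7, 10, 0.5))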

The normal distribution

  • Area under curve = 1
  • The probability density never reaches 0 (the tails extend infinitely)
  • Mean = 0, Std = 1 is the Standard Normal Distribution
  • 68% of area falls w/in 1 std of mean
  • 95% of area falls w/in 2 std of mean
  • 99.7% of area falls w/in 3 std of mean
    • from scipy.stats import norm
    • norm.cdf(value, mean, std) → probability of being less than or equal to value (see the sketch below)
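
A quick check of the 68/95/99.7 rule with norm.cdf on the standard normal:

from scipy.stats import norm

# Fraction of the area within 1, 2, and 3 standard deviations of the mean
for k in [1, 2, 3]:
    print(k, norm.cdf(k, 0, 1) - norm.cdf(-k, 0, 1))
# ~0.68, ~0.95, ~0.997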

The central limit theorem

  • The sampling distribution of a statistic becomes closer to the normal distribution as the number of trials increases.
  • Lets you estimate characteristics of an unknown underlying distribution.
  • More easily estimate characteristics of large populations (see the sketch below).
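
A minimal sketch (not course code): means of repeated uniform samples pile up into a roughly normal shape, with sample sizes chosen for illustration.

import numpy as np
from scipy.stats import uniform

# 1000 sample means, each computed from 50 uniform draws on [0, 12]
sample_means = [uniform.rvs(0, 12, size=50).mean() for _ in range(1000)]

# The distribution of these means is approximately normal
print(np.mean(sample_means), np.std(sample_means))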

The Poisson distribution

  • Events happen at a certain rate, but completely at random.
    • Ex. # of earthquakes/yr in CA, # of dogs adopted from shelter/week, # of people arriving at restaurant/hr
  • Probability of some # of events happening over fixed period of time.
    • Lambda (λ) = avg. # of events per time interval
      • Peak of Poisson distribution is always Lambda
    • from scipy.stats import poisson
    • poisson.pmf(# of events, Lambda)
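
A small sketch, assuming an illustrative average rate of Lambda = 8 adoptions per week:

from scipy.stats import poisson

# P(exactly 5 adoptions in a week) when Lambda = 8
print(poisson.pmf(5, 8))

# P(5 or fewer adoptions) and P(more than 5)
print(poisson.cdf(5, 8), 1 - poisson.cdf(5, 8))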

Exponential distribution

  • The probability of time between Poisson events
    • Ex. Probability of > 1 day passing between adoptions
    • Lambda means rate for exponential distributions
  • Expected value of Exponential distribution
    • In terms of rate (Poisson)
      • Lambda = 0.5 requests per minute
    • In terms of time (Exponential)
      • 1/Lambda = 2 minutes per request (1 request every 2 minutes)
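
A small sketch using scipy's expon, whose scale argument is 1/Lambda (so Lambda = 0.5 requests per minute gives scale = 2, matching the example above):

from scipy.stats import expon

# P(waiting 1 minute or less for the next request) when Lambda = 0.5/min
print(expon.cdf(1, scale=2))        # scale = 1 / Lambda

# P(waiting more than 4 minutes)
print(1 - expon.cdf(4, scale=2))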

(Student's) t-distribution

  • Similar shape to the normal distribution.
  • Has a parameter called degrees of freedom (df) which affects the thickness of the tails.
    • Lower df = thicker tails, higher std
    • Higher df = looks closer to normal dist
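
A sketch comparing a tail probability of the t-distribution against the normal; the df values are illustrative:

from scipy.stats import t, norm

# P(X > 2): thicker tails at low df, approaching the normal as df grows
for df in [3, 30, 300]:
    print(df, 1 - t.cdf(2, df))
print("normal", 1 - norm.cdf(2))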

Log-normal distribution

  • Variable whose logarithm is normally distributed
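
A tiny sketch (parameters illustrative): the log of log-normal draws should look normally distributed.

import numpy as np

# Log-normal draws whose underlying normal has mean 0 and std 1
samples = np.random.lognormal(mean=0, sigma=1, size=10000)

# np.log(samples) should have mean ~0 and std ~1
print(np.log(samples).mean(), np.log(samples).std())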

# Count the deals for each product
counts = amir_deals['product'].value_counts()

# Calculate probability of picking a deal with each product
probs = counts / amir_deals.shape[0]

# Create a histogram of restaurant_groups and show plot
restaurant_groups['group_size'].hist(bins=np.linspace(2,6,5))
plt.show()

# Create probability distribution
size_dist = restaurant_groups['group_size'].value_counts() / restaurant_groups.shape[0]

# Reset index and rename columns
size_dist = size_dist.reset_index()
size_dist.columns = ['group_size', 'prob']

#Subset columns
print(df[['alignment', 'character']])

# Expected value
expected_value = np.sum(size_dist['group_size'] * size_dist['prob'])
print(expected_value)

# Subset groups of size 4 or more
groups_4_or_more = size_dist[size_dist['group_size'] >= 4]

# Sum the probabilities of groups_4_or_more
prob_4_or_more = np.sum(groups_4_or_more['prob'])
print(prob_4_or_more)

# Subset groups of size 4 or more
groups_4_or_more = size_dist[size_dist['group_size'] >= 4]

# Sum the probabilities of groups_4_or_more
prob_4_or_more = np.sum(groups_4_or_more['prob'])
print(prob_4_or_more)

# Calculate probability of waiting 10-20 mins
prob_between_10_and_20 = uniform.cdf(20, min_time, max_time) - uniform.cdf(10, min_time, max_time)
print(prob_between_10_and_20)

# Set random seed to 334
np.random.seed(334)

# Import uniform
from scipy.stats import uniform

# Generate 1000 wait times between 0 and 30 mins
wait_times = uniform.rvs(0, 30, size=1000)

# Create a histogram of simulated times and show plot
plt.hist(wait_times)
plt.show()

# binom.pmf(num heads, num trials, prob of heads), exact probability
binom.pmf(7, 10, 0.5)

# binom.cdf(num heads, num trials, prob of heads), less than or equal to probability
binom.cdf(7, 10, 0.5)

# 1-binom.cdf to get probability of greater than a certain number
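# For example (same fair-coin setup as above; a minimal illustration, not course code):
from scipy.stats import binom
print(1 - binom.cdf(7, 10, 0.5))   # P(more than 7 heads in 10 flips)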

# Normal Distribution - What height are 90% of the women shorter than?
norm.ppf(.9,161,7)

#Generate 10 random heights
norm.rvs(161, 7, size=10)

# Histogram of amount with 10 bins and show plot
amir_deals['amount'].hist(bins=10)
plt.show()

# Rolling the dice 5 times, take the mean, repeat 10 times
sample_means=[]
for i in range(10):
    samp_5 = die.sample(5, replace=True)
    sample_means.append(np.mean(samp_5))
print(sample_means) #Sampling distribution of sampling mean

# Create a histogram of num_users and show
amir_deals['num_users'].hist()
plt.show()

# Sample 20 num_users with replacement from amir_deals
samp_20 = amir_deals['num_users'].sample(20, replace=True)

# Take mean of samp_20
print(np.mean(samp_20))

# Loop 100 times
for i in range(100):
  # Take sample of 20 num_users
  samp_20 = amir_deals['num_users'].sample(20, replace=True)
  # Calculate mean of samp_20
  samp_20_mean = np.mean(samp_20)
  # Append samp_20_mean to sample_means
  sample_means.append(samp_20_mean)
    
# Convert to Series and plot histogram
sample_means_series = pd.Series(sample_means)
sample_means_series.hist()
# Show plot
plt.show()

# Import poisson from scipy.stats
from scipy.stats import poisson

# Probability of > 10 responses
prob_over_10 = 1-poisson.cdf(10,4)

# Exponential - probability of waiting 1 time unit or less until the next request
from scipy.stats import expon
expon.cdf(1, scale=0.5)  # note: scipy's scale parameter is 1/Lambda

#Quantiles
lower = np.quantile(bootstraps, 0.025)
upper = np.quantile(bootstraps, 0.975)

print(f"""Lower bound: {round(lower, 2)}\n
Upper bound: {round(upper, 2)}""")

#Sample with replacement
data = np.array([13, 28, 56, 31, 63])

sample = np.random.choice(data, 5)

#Sample without replacement
purchases = np.random.choice(chocolate, 3, replace=False)

#boxplot
import matplotlib.pyplot as plt
import seaborn as sns

sns.boxplot(x="Type", y="Si", data=glass, order=["virginica", "versicolor", "setosa"]) 

plt.show()

#Normal Dist samples
samples=np.random.normal(0, 1, 10000)

#Poisson samples
samples = np.random.poisson(10, 100)

#Random sample
sample = np.random.choice(data, 5)

#Create an array of 100 numbers sampled from a Poisson distribution where the observed mean (lam) is equal to 5
sample = np.random.poisson(5, 100)

#Statistical test with p-value
results = stats.ttest_ind(males, females)

#Type I error
#The null hypothesis is true and is rejected by the testing

#Exponential sample with rate parameter = 56
sample = np.random.exponential(1/56,100000)

#Want to prove your fund has performed 10% better than last year
#One-tailed test

#Chi-square distribution
#Most useful in deciding whether two variables (e.g. hot dog type and age) are independent or not
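
# A hedged sketch (not course code) of a chi-square test of independence using
# scipy.stats.chi2_contingency; the contingency table below is made up.
import numpy as np
from scipy import stats

# Made-up counts: rows = hot dog type, columns = age group
table = np.array([[30, 20, 10],
                  [20, 25, 25]])

chi2, pval, dof, expected = stats.chi2_contingency(table)
print(chi2, pval)   # small p-value -> evidence the variables are not independent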

#Paired
#hypothesis testing method is best to use when two samples are not considered independent

#F-distribution
#calculate how varied your samples are

#Parameter estimates
#Used when the population is too large to measure every member

#Test that can show if 1 value is greater than another
#one-tailed

#Randomized Block design
#Subjects are grouped into blocks (like age) and then randomized to treatment or control within each block

#Confidence Interval
#A range in which the true population mean is likely to be found with a certain probability

#Maximum Likelihood Estimate
#It is used to find the parameters of a distribution that most likely generated the observed data.

#Null Hypothesis
#proposes there is not a significant difference between Group A and Group B.

#F-value
#test statistic computed as the ratio of between-group to within-group variance estimates, used to conclude whether the population means are different
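
# A hedged sketch (not course code) of a one-way ANOVA F-test using
# scipy.stats.f_oneway; the three groups below are made up.
from scipy import stats

group_a = [23, 25, 21, 24, 26]
group_b = [30, 28, 27, 31, 29]
group_c = [22, 24, 23, 25, 21]

fstat, pval = stats.f_oneway(group_a, group_b, group_c)
print(fstat, pval)   # small p-value -> at least one group mean differs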

#Box-plots
#Use continuous numerical data

#Empirical Cumulative Distribution Function (ECDF)
from statsmodels.distributions.empirical_distribution import ECDF

ecdf = ECDF(x)

#Hypothesis Test to prove it is not 40
tstat, pval = stats.ttest_1samp(ages, 40)

print(round(pval,3))

#Hypothesis Test to prove it is not 8
test=stats.ttest_1samp(spend, 8)

#Create an array of one million numbers sampled from a Normal distribution with mean equal to 30 and standard deviation equal to 5
sample = np.random.normal(30, 5, 1000000)

#qqplot
from statsmodels.api import qqplot
qqplot(data=steam["usage"])

#Random choice
sample = np.random.choice(["Win", "Lose"], 10, p=(0.03, 0.97))

#Draw a random sample
sample = np.random.choice(professions)

#Random binomial
sample = np.random.binomial(10,0.5,100)

#T-test
tstat, pval = stats.ttest_ind(new_batteries, old_batteries)

test= stats.ttest_ind(group_a, group_b)

#pair plot
sns.pairplot(song_metrics)

#Without replacement
purchases = np.random.choice(chocolate, 3, replace=False)

#sample of 20,000
sample = np.random.standard_normal(20000)

#Violin plot
ax = sns.violinplot(x="Type", y="Si", data=glass)

#Apply square root
print(df.apply(np.sqrt))

#Summary statistics, describe
print(food.describe())

#Create DataFrame
print(pd.DataFrame({
    "x": [1], 
    "y": [3],
}))

#Sort values
result = df.sort_values('salary', ascending = True)

#Line plot
sns.lineplot(x = 'day', y = 'order', data=df)

#Scatterplot color
sns.scatterplot(x = "age", y = "value", hue = "emissions", data = valuation)

ax = sns.scatterplot(x="GDP per capita", y="Score", hue="Generosity", data=happiness)

#Sort column = to 
sales_2019 = sales[sales['Year'] == 2019]

#Correlation coefficient
cc = np.corrcoef(lum, rad)[0, 1]
print(round(cc, 3))

#Print random sample of 5 with random seed
print(chess.sample(n=5, random_state=42))

#IQR
iqr_age = stats.iqr(age)

#Print the name of the columns
print(chess.columns)

# Reading a text file
filename = 'huck_finn.txt'
file = open(filename, mode='r') #'r' is for read
text = file.read()
file.close()
print(text)

# Context manager 'with'
with open('huck_finn.txt','r') as file:
    print(file.read())
    
# Open a file: file
file = open('moby_dick.txt', 'r')

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check whether file is closed
print(file.closed)

# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())

# Import package
import numpy as np

# Assign filename to variable: file
file = 'digits.csv'

# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')

# Print datatype of digits
print(type(digits))

# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))

# Plot reshaped data (matplotlib.pyplot already loaded as plt)
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()

# Import numpy
import numpy as np

# Assign the filename: file
file = 'digits_header.txt'

# Load the data: data
data = np.loadtxt(file, delimiter="\t", skiprows= 1, usecols=[0,2])

# Print data
print(data)

# Assign filename: file
file = 'seaslug.txt'

# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)

# Print the first element of data
print(data[0])

# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)

# Print the 10th element of data_float
print(data_float[9])

# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()

# Assign the filename: file
file = 'titanic.csv'

# Import file using np.recfromcsv: d
d = np.recfromcsv(file)

# Print out first three entries of d
print(d[:3])

# Assign the filename: file
file = 'digits.csv'

# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)

# Build a numpy array from the DataFrame: data_array
data_array = data.values

# Print the datatype of data_array to the shell
print(type(data_array))

# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')

# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()

# Import pickle package
import pickle

# Load pickle file and see what type it is
x = pd.read_pickle('data.pkl')
print(type(x))

# Open pickle file and load data
with open('data.pkl', 'rb') as file:
    d = pickle.load(file)
    
#Beautiful Soup
from bs4 import BeautifulSoup

s = BeautifulSoup(html_doc)

# Print data
print(d)

# Print datatype
print(type(d))

# Import pandas
import pandas as pd

# Assign spreadsheet filename: file
file = 'battledeath.xlsx'

# Load spreadsheet: xls
xls = pd.ExcelFile(file)

#Read excel file
food_df = pd.read_excel(food_file, names=['Country', 'Regular Coffee', 'Instant Coffee', 'Tea'])

# Print sheet names
print(xls.sheet_names)

# Load a sheet into a DataFrame by name: df1
df1 = xls.parse('2004')

# Print the head of the DataFrame df1
print(df1.head())

# Load a sheet into a DataFrame by index: df2
df2 = xls.parse(0)

# Print the head of the DataFrame df2
print(df2.head())

# Parse the first sheet and rename the columns: df1
df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])

#Contains
print(data.str.contains("2019"))

# Print the head of the DataFrame df1
print(df1.head())

#Separate email @ from column
print(contact.email.str.split('@', expand = True))

#Set_index
game = game.set_index('name')

# Parse the first column of the second sheet and rename the column: df2
df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country'])

# Print the head of the DataFrame df2
print(df2.head())

# Import sas7bdat package
from sas7bdat import SAS7BDAT

# Save file to a DataFrame: df_sas
with SAS7BDAT('sales.sas7bdat') as file:
    df_sas = file.to_data_frame()

# Print head of DataFrame
print(df_sas.head())

# Plot histograms of a DataFrame feature (pandas and pyplot already imported)
pd.DataFrame.hist(df_sas[['P']])
plt.ylabel('count')
plt.show()

# Import pandas
import pandas as pd

# Load Stata file into a pandas DataFrame: df
df = pd.read_stata('disarea.dta')

# Print the head of the DataFrame df
print(df.head())

# Plot histogram of one column of the DataFrame
pd.DataFrame.hist(df[['disa10']])
plt.xlabel('Extent of disease')
plt.ylabel('Number of countries')
plt.show()

# Import packages
import numpy as np
import h5py

# Assign filename: file
file = 'LIGO_data.hdf5'

# Load file: data
data = h5py.File(file, 'r')

# Print the datatype of the loaded file
print(type(data))

# Print the keys of the file
for key in data.keys():
    print(key)
    
# Get the HDF5 group: group
group = data['strain']

# Check out keys of group
for key in group.keys():
    print(key)

# Set variable equal to time series data: strain
strain = data['strain']['Strain']

# Set number of time points to sample: num_samples
num_samples = 10000

# Set time vector
time = np.arange(0, 1, 1/num_samples)

# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()

# Import package
import scipy.io

# Load MATLAB file: mat
mat = scipy.io.loadmat('albeck_gene_expression.mat')

# Print the datatype type of mat
print(type(mat))

# Print the keys of the MATLAB dictionary
print(mat.keys())

# Print the type of the value corresponding to the key 'CYratioCyt'
print(type(mat['CYratioCyt']))

# Print the shape of the value corresponding to the key 'CYratioCyt'
print(np.shape(mat['CYratioCyt']))

# Subset the array and plot it
data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()

# Import necessary module
from sqlalchemy import create_engine

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Import necessary module
from sqlalchemy import create_engine

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Save the table names to a list: table_names
table_names = engine.table_names()

# Print the table names to the shell
print(table_names)

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT LastName, Title FROM Employee")
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()

# Print the length of the DataFrame df
print(len(df))

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee WHERE EmployeeId >= 6")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print the head of the DataFrame df
print(df.head())

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Open engine in context manager
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee ORDER BY BirthDate")
    df = pd.DataFrame(rs.fetchall())

    # Set the DataFrame's column names
    df.columns = rs.keys()

# Print head of DataFrame
print(df.head())

# Import packages
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM Album", engine)

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT Title, Name FROM Album INNER JOIN Artist on Album.ArtistID = Artist.ArtistID")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print head of DataFrame df
print(df.head())

# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000", engine)

# Print head of DataFrame
print(df.head())
# Print the information of ride_sharing
print(ride_sharing.info())

# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())

# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')

# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'

# Print new summary statistics 
print(ride_sharing['user_type_cat'].describe())

# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes') 

# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')

# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'

# Print formed columns and calculate average ride duration 
print(ride_sharing[['duration','duration_trim','duration_time']])
print(ride_sharing['duration_time'].mean())

# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27

# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')

# Print tire size description
print(ride_sharing['tire_sizes'].head())

# Convert ride_date to date
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date

# Save today's date
today = dt.date.today()

# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today

# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())

# Find duplicates
duplicates = ride_sharing.duplicated(subset = 'ride_id', keep = False)

# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')

# Print relevant columns
print(duplicated_rides[['ride_id','duration','user_birth_year']])

# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()

# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}

# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()

# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates == True]

# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0

# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])

# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)

# Print rows with inconsistent category
print(airlines[cat_clean_rows])

# Print rows with consistent categories only
print(airlines[~cat_clean_rows])

# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())

# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower() 
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})

# Remove white spaces from `dest_size`
airlines['dest_size'] = airlines['dest_size'].str.strip()

# Verify changes have been effected
print(airlines['dest_size'].unique())
print(airlines['dest_region'].unique())

# Create ranges for categories
label_ranges = [0, 60, 180, np.inf]
label_names = ['short', 'medium', 'long']

# Create wait_type column
airlines['wait_type'] = pd.cut(airlines['wait_min'], bins = label_ranges, 
                               labels = label_names)

# Create mappings and replace
mappings = {'Monday':'weekday', 'Tuesday':'weekday', 'Wednesday': 'weekday', 
            'Thursday': 'weekday', 'Friday': 'weekday', 
            'Saturday': 'weekend', 'Sunday': 'weekend'}

airlines['day_week'] = airlines['day'].replace(mappings)

# Replace "Dr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Dr.","")

# Replace "Mr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Mr.","")

# Replace "Miss" with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Miss","")

# Replace "Ms." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Ms.","")

# Assert that full_name has no honorifics
assert airlines['full_name'].str.contains('Ms.|Mr.|Miss|Dr.').any() == False

# Store length of each row in survey_response column
resp_length = airlines['survey_response'].str.len()

# Find rows in airlines where resp_length > 40
airlines_survey = airlines[resp_length > 40]

# Assert minimum survey_response length is > 40
assert airlines_survey['survey_response'].str.len().min() > 40

# Print new survey_response column
print(airlines_survey['survey_response'])

# Find values of acct_cur that are equal to 'euro'
acct_eu = banking['acct_cur'] == 'euro'

# Convert acct_amount where it is in euro to dollars
banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1 

# Unify acct_cur column by changing 'euro' values to 'dollar'
banking.loc[acct_eu, 'acct_cur'] = 'dollar'

# Assert that only dollar currency remains
assert banking['acct_cur'].unique() == 'dollar'

# Print the header of account_opened
print(banking['account_opened'].head())

# Convert account_opened to datetime
banking['account_opened'] = pd.to_datetime(banking['account_opened'],
                                           # Infer datetime format
                                           infer_datetime_format = True,
                                           # Return missing value for error
                                           errors = 'coerce') 
# Get year of account opened
banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')

# Print acct_year
print(banking['acct_year'])

# Store fund columns to sum against
fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']

# Find rows where fund_columns row sum == inv_amount
inv_equ = banking[fund_columns].sum(axis=1) == banking['inv_amount']

# Store consistent and inconsistent data
consistent_inv = banking[inv_equ]
inconsistent_inv = banking[~inv_equ]

# Store consistent and inconsistent data
print("Number of inconsistent investments: ", inconsistent_inv.shape[0])

# Store today's date and find ages
today = dt.date.today()
ages_manual = today.year - banking['birth_date'].dt.year

# Find rows where age column == ages_manual
age_equ = banking['age'] == ages_manual

# Store consistent and inconsistent data
consistent_ages = banking[age_equ]
inconsistent_ages = banking[~age_equ]

# Store consistent and inconsistent data
print("Number of inconsistent ages: ", inconsistent_ages.shape[0])

# Print number of missing values in banking
print(banking.isna().sum())

# Visualize missingness matrix
msno.matrix(banking)
plt.show()

# Print number of missing values in banking
print(banking.isna().sum())

# Visualize missingness matrix
msno.matrix(banking)
plt.show()

# Isolate missing and non missing values of inv_amount
missing_investors = banking[banking['inv_amount'].isna()]
investors = banking[~banking['inv_amount'].isna()]

# Sort banking by age and visualize
banking_sorted = banking.sort_values(by = 'age')
msno.matrix(banking_sorted)
plt.show()

# Drop missing values of cust_id
banking_fullid = banking.dropna(subset = ['cust_id'])

# Compute estimated acct_amount
acct_imp = banking_fullid['inv_amount'] * 5

# Impute missing acct_amount with corresponding acct_imp
banking_imputed = banking_fullid.fillna({'acct_amount':acct_imp})

# Print number of missing values
print(banking_imputed.isna().sum())

# Import process from thefuzz
from thefuzz import process

# Store the unique values of cuisine_type in unique_types
unique_types = restaurants['cuisine_type'].unique()

# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit = len(unique_types)))

# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit = len(unique_types)))

# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', unique_types, limit = len(unique_types)))

# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

# Iterate through the list of matches to italian
for match in matches:
  # Check whether the similarity score is greater than or equal to 80
  if match[1] >= 80:
    # Select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
    restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = 'italian'
    
# Iterate through categories
for cuisine in categories:  
  # Create a list of matches, comparing cuisine with the cuisine_type column
  matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))

  # Iterate through the list of matches
  for match in matches:
    # Check whether the similarity score is greater than or equal to 80
    if match[1] >= 80:
      # If it is, select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
      restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine
      
# Inspect the final result
print(restaurants['cuisine_type'].unique())

# Create an indexer and object and find possible pairs
indexer = recordlinkage.Index()

# Block pairing on cuisine_type
indexer.block('cuisine_type')

# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)

# Create a comparison object
comp_cl = recordlinkage.Compare()

# Find exact matches on city, cuisine_types 
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('cuisine_type', 'cuisine_type', label='cuisine_type')

# Find similar matches of rest_name
comp_cl.string('rest_name', 'rest_name', label='name', threshold = 0.8) 

# Get potential matches and print
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new) 
print(potential_matches)

# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)

# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]

# Append non_dup to restaurants
full_restaurants = restaurants.append(non_dup)
print(full_restaurants)

#Convert columns from wide to long format
df = pd.melt(df, id_vars = 'id', value_vars = ['math', 'chemistry'])

print(df)

#Drop rows where all values are NaN
print(score.dropna(how = 'all'))

#To lower case
jobs['roles'] = jobs['roles'].str.lower()

#Pivot Tables
df = pd.pivot_table(
    restaurant,
    values = ['price', 'rating'],
    index = 'cuisine',
    aggfunc = np.mean
)

#Print strings that contains "d"
print(s.str.contains("d"))
    
#Determine data types in df columns
print(books.dtypes)
    
#Combine columns into single df
df = pd.concat([restaurant, location], axis = 1)