NOTES
# Importing numpy and pandas
import numpy as np
import pandas as pd
# Importing the course datasets
deals = pd.read_csv("datasets/amir_deals.csv")
happiness = pd.read_csv("datasets/world_happiness.csv")
food = pd.read_csv("datasets/food_consumption.csv")
Notes
P(event) = (# of ways an event can happen)/(total # of possible outcomes)
Discrete Distributions (discrete outcomes)
- Probability Distribution - describes the probability of each possible outcome in a scenario.
- Expected Value - mean of distribution.
- multiply each value by its probability
- Probability = area, when it comes to bar charts.
- P(die roll <= 2) = (1 * 1/6) + (1 * 1/6) = 1/3
- Law of large numbers - as your sample size increases, the sample mean will approach the Expected Value.
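A minimal sketch of the expected value and law of large numbers ideas above, using a fair six-sided die (values are illustrative, not from the course datasets):
import numpy as np
# Expected value: multiply each value by its probability and sum
die_vals = np.array([1, 2, 3, 4, 5, 6])
expected_value = np.sum(die_vals * (1/6))   # 3.5
# Law of large numbers: the mean of many rolls approaches the expected value
rolls = np.random.choice(die_vals, size=10000)
print(expected_value, rolls.mean())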
Continuous Distributions
- Continuous uniform distribution - infinitely many possible outcomes within a range
- Probability still = area
- P(wait time <= 7) = ?
- from scipy.stats import uniform
- uniform.cdf(7, 0, 12), arguments are (x, loc, scale): loc = lower limit, scale = upper - lower (equals the upper limit here because the lower limit is 0)
- P(wait time >= 7) = 1-P(wait time <= 7)
- Generating random numbers
- from scipy.stats import uniform
- uniform.rvs(0, 5, size=10)
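A minimal sketch putting the uniform notes together, assuming a 0-12 minute wait (illustrative numbers):
from scipy.stats import uniform
# P(wait <= 7): loc = lower limit, scale = upper - lower
p_less_7 = uniform.cdf(7, loc=0, scale=12)
# P(wait >= 7) via the complement
p_more_7 = 1 - p_less_7
# 10 simulated wait times between 0 and 12 minutes
waits = uniform.rvs(loc=0, scale=12, size=10)
print(p_less_7, p_more_7)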
The binomial distribution
- An outcome with two possible values (e.g. Heads/Tails, Pass/Fail, 1/0, etc.)
- from scipy.stats import binom
- binom.rvs(n, p, size=...): n = # of coins flipped per trial, p = probability of success, size = # of trials
- Probability distribution of the number of successes in a sequence of independent trials.
- Expected Value = n * p
- Each trial must be independent: one outcome cannot affect the next.
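A minimal sketch of the binomial calls above (10 fair coins; numbers are illustrative):
from scipy.stats import binom
flips = binom.rvs(10, 0.5, size=5)   # 5 trials of 10 coin flips each
expected_heads = 10 * 0.5            # Expected value = n * p = 5
print(flips, expected_heads)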
The normal distribution
- Area under curve = 1
- The probability density never reaches 0 (the tails extend forever)
- Mean = 0, Std = 1 is the Standard Normal Distribution
- 68% of area falls w/in 1 std of mean
- 95% of area falls w/in 2 std of mean
- 99.7% of area falls w/in 3 std of mean
- from scipy.stats import norm
- norm.cdf(x, mean, std) gives P(value <= x)
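A small sketch of the 68-95-99.7 rule above using norm.cdf on the standard normal (mean 0, std 1):
from scipy.stats import norm
# Area within 1, 2, and 3 standard deviations of the mean
for k in [1, 2, 3]:
    print(k, norm.cdf(k, 0, 1) - norm.cdf(-k, 0, 1))   # ~0.68, ~0.95, ~0.997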
The central limit theorem
- The sampling distribution of a statistic becomes closer to the normal distribution as the sample size increases.
- Lets you estimate characteristics of an unknown underlying distribution.
- More easily estimate characteristics of large populations.
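A rough CLT sketch: means of many small samples from a non-normal (uniform) distribution pile up in a roughly bell-shaped histogram (numbers are illustrative):
import numpy as np
# 1000 samples of size 30, keeping each sample's mean
sample_means = [np.mean(np.random.uniform(0, 12, size=30)) for i in range(1000)]
# plt.hist(sample_means) would look approximately normal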
The Poisson distribution
- Events happen at a certain rate, but completely at random.
- Ex. # of earthquakes/yr in CA, # of dogs adopted from shelter/week, # of people arriving at restaurant/hr
- Probability of some # of events happening over fixed period of time.
- Lambda (Looks like "A") = avg. # of events per time interval
- Peak of Poisson distribution is always Lambda
- from scipy.stats import poisson
- poisson.pmf(# of events, Lambda)
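A minimal sketch of the Poisson calls, assuming an illustrative average of 8 adoptions per week:
from scipy.stats import poisson
lam = 8                          # illustrative lambda: average events per interval
print(poisson.pmf(5, lam))       # P(exactly 5 events)
print(poisson.cdf(5, lam))       # P(5 or fewer events)
print(1 - poisson.cdf(5, lam))   # P(more than 5 events)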
Exponential distribution
- The probability of time between Poisson events
- Ex. Probability of > 1 day passing between adoptions
- Lambda means rate for exponential distributions
- Expected value of Exponential distribution
- In terms of rate (Poisson)
- Lambda = 0.5 requests per minute
- In terms of time (Exponential)
- 1/Lambda = 1 request per 2 minutes
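A minimal sketch tying rate and time together: with lambda = 0.5 requests per minute, scipy's scale argument is 1/lambda = 2 minutes.
from scipy.stats import expon
lam = 0.5                              # rate: requests per minute (Poisson view)
print(expon.cdf(1, scale=1/lam))       # P(wait for next request <= 1 minute)
print(1 - expon.cdf(4, scale=1/lam))   # P(wait > 4 minutes)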
(Student's) t-distribution
- Similar shape to the normal distribution.
- Has a parameter called degrees of freedom (df) which affects the thickness of the tails.
- Lower df = thicker tails, higher std
- Higher df = looks closer to normal dist
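A small comparison of tail thickness using scipy's t distribution (df values are illustrative):
from scipy.stats import t, norm
# Right-tail area beyond 2 is larger under the t and shrinks toward the normal as df grows
print(1 - norm.cdf(2))        # ~0.023
print(1 - t.cdf(2, df=3))     # ~0.070, thicker tails
print(1 - t.cdf(2, df=100))   # ~0.024, close to normal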
Log-normal distribution
- Variable whose logarithm is normally distributed
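A minimal check of the log-normal idea (parameters are illustrative): the log of log-normal samples is approximately normal.
import numpy as np
samples = np.random.lognormal(mean=0, sigma=1, size=10000)
logged = np.log(samples)
print(logged.mean(), logged.std())   # roughly 0 and 1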
# Count the deals for each product
counts = amir_deals['product'].value_counts()
# Calculate probability of picking a deal with each product
probs = counts / amir_deals.shape[0]
# Create a histogram of restaurant_groups and show plot
restaurant_groups['group_size'].hist(bins=np.linspace(2,6,5))
plt.show()
# Create probability distribution
size_dist = restaurant_groups['group_size'].value_counts() / restaurant_groups.shape[0]
# Reset index and rename columns
size_dist = size_dist.reset_index()
size_dist.columns = ['group_size', 'prob']
#Subset columns
print(df[['alignment', 'character']])
# Expected value
expected_value = np.sum(size_dist['group_size'] * size_dist['prob'])
print(expected_value)
# Subset groups of size 4 or more
groups_4_or_more = size_dist[size_dist['group_size'] >= 4]
# Sum the probabilities of groups_4_or_more
prob_4_or_more = np.sum(groups_4_or_more['prob'])
print(prob_4_or_more)
# Calculate probability of waiting 10-20 mins
prob_between_10_and_20 = uniform.cdf(20, min_time,max_time) - uniform.cdf(10, min_time,max_time)
print(prob_between_10_and_20)
# Set random seed to 334
np.random.seed(334)
# Import uniform
from scipy.stats import uniform
# Generate 1000 wait times between 0 and 30 mins
wait_times = uniform.rvs(0, 30, size=1000)
# Create a histogram of simulated times and show plot
plt.hist(wait_times)
plt.show()
# binom.pmf(num heads, num trials, prob of heads), exact probability
binom.pmf(7, 10, 0.5)
# binom.cdf(num heads, num trials, prob of heads), less than or equal to probability
binom.cdf(7, 10, 0.5)
# 1-binom.cdf to get probability of greater than a certain number
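# For example (illustrative, following the calls above): P(more than 7 heads in 10 flips)
print(1 - binom.cdf(7, 10, 0.5))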
# Normal Distribution - What height are 90% of the women shorter than?
norm.ppf(.9,161,7)
#Generate 10 random heights
norm.rvs(161, 7,size=10)
# Histogram of amount with 10 bins and show plot
amir_deals['amount'].hist(bins=10)
plt.show()
# Rolling the dice 5 times, take the mean, repeat 10 times
sample_means=[]
for i in range(10):
    samp_5 = die.sample(5, replace=True)
    sample_means.append(np.mean(samp_5))
print(sample_means)  # Sampling distribution of the sample mean
# Create a histogram of num_users and show
amir_deals['num_users'].hist()
plt.show()
# Sample 20 num_users with replacement from amir_deals
samp_20 = amir_deals['num_users'].sample(20, replace=True)
# Take mean of samp_20
print(np.mean(samp_20))
# Create an empty list to hold the sample means
sample_means = []
# Loop 100 times
for i in range(100):
    # Take sample of 20 num_users
    samp_20 = amir_deals['num_users'].sample(20, replace=True)
    # Calculate mean of samp_20
    samp_20_mean = np.mean(samp_20)
    # Append samp_20_mean to sample_means
    sample_means.append(samp_20_mean)
# Convert to Series and plot histogram
sample_means_series = pd.Series(sample_means)
sample_means_series.hist()
# Show plot
plt.show()
# Import poisson from scipy.stats
from scipy.stats import poisson
# Probability of > 10 responses
prob_over_10 = 1-poisson.cdf(10,4)
# Exponential - P(time until the next request <= 1)
from scipy.stats import expon
expon.cdf(1, scale=0.5)  # scale = expected time between events = 1/lambda
#Quantiles
lower = np.quantile(bootstraps, 0.025)
upper = np.quantile(bootstraps, 0.975)
print(f"""Lower bound: {round(lower, 2)}\n
Upper bound: {round(upper, 2)}""")
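# Note: `bootstraps` above isn't built in these notes; a minimal sketch of how such an
# array of bootstrap means might be generated (sample values are illustrative):
data = np.array([13, 28, 56, 31, 63])
bootstraps = [np.mean(np.random.choice(data, size=len(data), replace=True))
              for i in range(10000)]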
#Sample with replacement
data = np.array([13, 28, 56, 31, 63])
sample = np.random.choice(data, 5)
#Sample without replacement
purchases = np.random.choice(chocolate, 3, replace=False)
#boxplot
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(x="Type", y="Si", data=glass, order=["virginica", "versicolor", "setosa"])
plt.show()
#Normal Dist samples
samples=np.random.normal(0, 1, 10000)
#Poisson samples
samples = np.random.poisson(10, 100)
#Random sample
sample = np.random.choice(data, 5)
#Create an array of 100 numbers sampled from a Poisson distribution where the observed interval is equal to 5
sample = np.random.poisson(5, 100)
#Statistical test with p-value
results = stats.ttest_ind(males, females)
#Type I error
#The null hypothesis is true but is rejected by the test
#Exponential sample with rate parameter = 56
sample = np.random.exponential(1/56,100000)
#Want to prove your fund has performed 10% better than last year
#one-tailed test
#Chi-square distribution
#Most useful in deciding whether the two variables of hot dog type and age are independent or not
#Paired
#hypothesis testing method is best to use when two samples are not considered independent
#F-distribution
#calculate how varied your samples are
#Parameter estimates
#used when the population is too large to measure every member
#Test that can show if 1 value is greater than another
#one-tailed
#Randomized Block design
#When you split subjects into blocks (like age groups) and then randomize whether they receive treatment or control within each block
#Confidence Interval
#A range in which the true population mean is likely to be found with a certain probability
#Maximum Likelihood Estimate
#It is used to find the parameters of a distribution that most likely generated the observed data.
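# A hedged illustration of MLE: scipy's fit() returns maximum likelihood estimates of a
# distribution's parameters (the simulated data below are illustrative):
from scipy.stats import norm
obs = np.random.normal(30, 5, size=1000)   # pretend observed data
mu_hat, sigma_hat = norm.fit(obs)          # MLE of mean and std
print(mu_hat, sigma_hat)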
#Null Hypothesis
#proposes there is not a significant difference between Group A and Group B.
#F-value
#test statistic equal to the ratio of between-group and within-group variance estimates, used to conclude whether the population means are different
#Box-plots
#Use continuous numerical data
#Empirical Cumulative Distribution Function (ECDF)
from statsmodels.distributions.empirical_distribution import ECDF
ecdf = ECDF(x)
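# A minimal usage sketch: the ECDF object exposes sorted values (.x) and cumulative
# proportions (.y), which can be plotted (x here is whatever sample was passed in)
import matplotlib.pyplot as plt
plt.plot(ecdf.x, ecdf.y, marker='.', linestyle='none')
plt.show()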
#Hypothesis Test to prove it is not 40
tstat, pval = stats.ttest_1samp(ages, 40)
print(round(pval,3))
#Hypothesis Test to prove it is not 8
test=stats.ttest_1samp(spend, 8)
#Create an array of one million numbers sampled from a Normal distribution with mean equal to 30 and standard deviation equal to 5
sample = np.random.normal(30, 5, 1000000)
#qqplot
from statsmodels.api import qqplot
qqplot(data=steam["usage"])
#Random choice
sample = np.random.choice(["Win", "Lose"], 10, p=(0.03, 0.97))
#Draw a random sample
sample = np.random.choice(professions)
#Random binomial
sample = np.random.binomial(10,0.5,100)
#T-test
tstat, pval = stats.ttest_ind(new_batteries, old_batteries)
test= stats.ttest_ind(group_a, group_b)
#pair plot
sns.pairplot(song_metrics)
#Without replacement
purchases = np.random.choice(chocolate, 3, replace=False)
#sample of 20,000
sample = np.random.standard_normal(20000)
#Violin plot
ax = sns.violinplot(x="Type", y="Si", data=glass)
#Apply square root
print(df.apply(np.sqrt))
#Summary statistics, describe
print(food.describe())
#Create DataFrame
print(pd.DataFrame({
"x": [1],
"y": [3],
}))
#Sort values
result = df.sort_values('salary', ascending = True)
#Line plot
sns.lineplot(x = 'day', y = 'order', data=df)
#Scatterplot color
sns.scatterplot(x = "age", y = "value", hue = "emissions", data = valuation)
ax = sns.scatterplot(x="GDP per capita", y="Score", hue="Generosity", data=happiness)
#Sort column = to
sales_2019 = sales[sales['Year'] == 2019]
#Correlation coefficient
cc = np.corrcoef(lum, rad)[0, 1]
print(round(cc, 3))
#Print random sample of 5 with random seed
print(chess.sample(n=5, random_state=42))
#IQR
iqr_age = stats.iqr(age)
#Print the name of the columns
print(chess.columns)
# Reading a text file
filename = 'huck_finn.txt'
file = open(filename, mode='r') #'r' is for read
text = file.read()
file.close()
print(text)
# Context manager 'with'
with open('huck_finn.txt', 'r') as file:
    print(file.read())
# Open a file: file
file = open('moby_dick.txt', 'r')
# Print it
print(file.read())
# Check whether file is closed
print(file.closed)
# Close file
file.close()
# Check whether file is closed
print(file.closed)
# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
# Import package
import numpy as np
# Assign filename to variable: file
file = 'digits.csv'
# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')
# Print datatype of digits
print(type(digits))
# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))
# Plot reshaped data (matplotlib.pyplot already loaded as plt)
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()
# Import numpy
import numpy as np
# Assign the filename: file
file = 'digits_header.txt'
# Load the data: data
data = np.loadtxt(file, delimiter="\t", skiprows= 1, usecols=[0,2])
# Print data
print(data)
# Assign filename: file
file = 'seaslug.txt'
# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)
# Print the first element of data
print(data[0])
# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)
# Print the 10th element of data_float
print(data_float[9])
# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()
# Assign the filename: file
file = 'titanic.csv'
# Import file using np.recfromcsv: d
d = np.recfromcsv(file)
# Print out first three entries of d
print(d[:3])
# Assign the filename: file
file = 'digits.csv'
# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)
# Build a numpy array from the DataFrame: data_array
data_array = data.values
# Print the datatype of data_array to the shell
print(type(data_array))
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
# Import pickle package
import pickle
# Load pickle file and see what type it is
x = pd.read_pickle('data.pkl')
print(type(x))
# Open pickle file and load data
with open('data.pkl', 'rb') as file:
    d = pickle.load(file)
#Beautiful Soup
from bs4 import BeautifulSoup
s = BeautifulSoup(html_doc)
# Print data
print(d)
# Print datatype
print(type(d))
# Import pandas
import pandas as pd
# Assign spreadsheet filename: file
file = 'battledeath.xlsx'
# Load spreadsheet: xls
xls = pd.ExcelFile(file)
#Read excel file
food_df = pd.read_excel(food_file, names=['Country', 'Regular Coffee', 'Instant Coffee', 'Tea'])
# Print sheet names
print(xls.sheet_names)
# Load a sheet into a DataFrame by name: df1
df1 = xls.parse('2004')
# Print the head of the DataFrame df1
print(df1.head())
# Load a sheet into a DataFrame by index: df2
df2 = xls.parse(0)
# Print the head of the DataFrame df2
print(df2.head())
# Parse the first sheet and rename the columns: df1
df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])
#Contains
print(data.str.contains("2019"))
# Print the head of the DataFrame df1
print(df1.head())
#Separate email @ from column
print(contact.email.str.split('@', expand = True))
#Set_index
game = game.set_index('name')
# Parse the first column of the second sheet and rename the column: df2
df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country'])
# Print the head of the DataFrame df2
print(df2.head())
# Import sas7bdat package
from sas7bdat import SAS7BDAT
# Save file to a DataFrame: df_sas
with SAS7BDAT('sales.sas7bdat') as file:
    df_sas = file.to_data_frame()
# Print head of DataFrame
print(df_sas.head())
# Plot histograms of a DataFrame feature (pandas and pyplot already imported)
pd.DataFrame.hist(df_sas[['P']])
plt.ylabel('count')
plt.show()
# Import pandas
import pandas as pd
# Load Stata file into a pandas DataFrame: df
df = pd.read_stata('disarea.dta')
# Print the head of the DataFrame df
print(df.head())
# Plot histogram of one column of the DataFrame
pd.DataFrame.hist(df[['disa10']])
plt.xlabel('Extent of disease')
plt.ylabel('Number of countries')
plt.show()
# Import packages
import numpy as np
import h5py
# Assign filename: file
file = 'LIGO_data.hdf5'
# Load file: data
data = h5py.File(file, 'r')
# Print the datatype of the loaded file
print(type(data))
# Print the keys of the file
for key in data.keys():
    print(key)
# Get the HDF5 group: group
group = data['strain']
# Check out keys of group
for key in group.keys():
    print(key)
# Set variable equal to time series data: strain
strain = data['strain']['Strain']
# Set number of time points to sample: num_samples
num_samples = 10000
# Set time vector
time = np.arange(0, 1, 1/num_samples)
# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()
# Import package
import scipy.io
# Load MATLAB file: mat
mat = scipy.io.loadmat('albeck_gene_expression.mat')
# Print the datatype type of mat
print(type(mat))
# Print the keys of the MATLAB dictionary
print(mat.keys())
# Print the type of the value corresponding to the key 'CYratioCyt'
print(type(mat['CYratioCyt']))
# Print the shape of the value corresponding to the key 'CYratioCyt'
print(np.shape(mat['CYratioCyt']))
# Subset the array and plot it
data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()
# Import necessary module
from sqlalchemy import create_engine
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Import necessary module
from sqlalchemy import create_engine
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Save the table names to a list: table_names
table_names = engine.table_names()
# Print the table names to the shell
print(table_names)
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT LastName, Title FROM Employee")
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()
# Print the length of the DataFrame df
print(len(df))
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee WHERE EmployeeId >= 6")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
# Print the head of the DataFrame df
print(df.head())
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Open engine in context manager
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee ORDER BY BirthDate")
    df = pd.DataFrame(rs.fetchall())
    # Set the DataFrame's column names
    df.columns = rs.keys()
# Print head of DataFrame
print(df.head())
# Import packages
from sqlalchemy import create_engine
import pandas as pd
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM Album", engine)
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT Title, Name FROM Album INNER JOIN Artist on Album.ArtistID = Artist.ArtistID")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
# Print head of DataFrame df
print(df.head())
# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000", engine)
# Print head of DataFrame
print(df.head())
# Print the information of ride_sharing
print(ride_sharing.info())
# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())
# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')
# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'
# Print new summary statistics
print(ride_sharing['user_type_cat'].describe())
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes')
# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')
# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'
# Print formed columns and calculate average ride duration
print(ride_sharing[['duration','duration_trim','duration_time']])
print(ride_sharing['duration_time'].mean())
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')
# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27
# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')
# Print tire size description
print(ride_sharing['tire_sizes'].head())
# Convert ride_date to date
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date
# Save today's date
today = dt.date.today()
# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today
# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())
# Find duplicates
duplicates = ride_sharing.duplicated(subset = 'ride_id', keep = False)
# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')
# Print relevant columns
print(duplicated_rides[['ride_id','duration','user_birth_year']])
# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()
# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}
# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()
# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates == True]
# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])
# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)
# Print rows with inconsistent category
print(airlines[cat_clean_rows])
# Print rows with consistent categories only
print(airlines[~cat_clean_rows])
# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())
# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower()
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})
# Remove white spaces from `dest_size`
airlines['dest_size'] = airlines['dest_size'].str.strip()
# Verify changes have been effected
print(airlines['dest_size'].unique())
print(airlines['dest_region'].unique())
# Create ranges for categories
label_ranges = [0, 60, 180, np.inf]
label_names = ['short', 'medium', 'long']
# Create wait_type column
airlines['wait_type'] = pd.cut(airlines['wait_min'], bins = label_ranges,
labels = label_names)
# Create mappings and replace
mappings = {'Monday':'weekday', 'Tuesday':'weekday', 'Wednesday': 'weekday',
'Thursday': 'weekday', 'Friday': 'weekday',
'Saturday': 'weekend', 'Sunday': 'weekend'}
airlines['day_week'] = airlines['day'].replace(mappings)
# Replace "Dr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Dr.","")
# Replace "Mr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Mr.","")
# Replace "Miss" with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Miss","")
# Replace "Ms." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Ms.","")
# Assert that full_name has no honorifics
assert airlines['full_name'].str.contains('Ms.|Mr.|Miss|Dr.').any() == False
# Store length of each row in survey_response column
resp_length = airlines['survey_response'].str.len()
# Find rows in airlines where resp_length > 40
airlines_survey = airlines[resp_length > 40]
# Assert minimum survey_response length is > 40
assert airlines_survey['survey_response'].str.len().min() > 40
# Print new survey_response column
print(airlines_survey['survey_response'])
# Find values of acct_cur that are equal to 'euro'
acct_eu = banking['acct_cur'] == 'euro'
# Convert acct_amount where it is in euro to dollars
banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1
# Unify acct_cur column by changing 'euro' values to 'dollar'
banking.loc[acct_eu, 'acct_cur'] = 'dollar'
# Assert that only dollar currency remains
assert banking['acct_cur'].unique() == 'dollar'
# Print the header of account_opened
print(banking['account_opened'].head())
# Convert account_opened to datetime
banking['account_opened'] = pd.to_datetime(banking['account_opened'],
# Infer datetime format
infer_datetime_format = True,
# Return missing value for error
errors = 'coerce')
# Get year of account opened
banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')
# Print acct_year
print(banking['acct_year'])
# Store fund columns to sum against
fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']
# Find rows where fund_columns row sum == inv_amount
inv_equ = banking[fund_columns].sum(axis=1) == banking['inv_amount']
# Store consistent and inconsistent data
consistent_inv = banking[inv_equ]
inconsistent_inv = banking[~inv_equ]
# Store consistent and inconsistent data
print("Number of inconsistent investments: ", inconsistent_inv.shape[0])
# Store today's date and find ages
today = dt.date.today()
ages_manual = today.year - banking['birth_date'].dt.year
# Find rows where age column == ages_manual
age_equ = banking['age'] == ages_manual
# Store consistent and inconsistent data
consistent_ages = banking[age_equ]
inconsistent_ages = banking[~age_equ]
# Store consistent and inconsistent data
print("Number of inconsistent ages: ", inconsistent_ages.shape[0])
# Print number of missing values in banking
print(banking.isna().sum())
# Visualize missingness matrix
msno.matrix(banking)
plt.show()
# Print number of missing values in banking
print(banking.isna().sum())
# Visualize missingness matrix
msno.matrix(banking)
plt.show()
# Isolate missing and non missing values of inv_amount
missing_investors = banking[banking['inv_amount'].isna()]
investors = banking[~banking['inv_amount'].isna()]
# Sort banking by age and visualize
banking_sorted = banking.sort_values(by = 'age')
msno.matrix(banking_sorted)
plt.show()
# Drop missing values of cust_id
banking_fullid = banking.dropna(subset = ['cust_id'])
# Compute estimated acct_amount
acct_imp = banking_fullid['inv_amount'] * 5
# Impute missing acct_amount with corresponding acct_imp
banking_imputed = banking_fullid.fillna({'acct_amount':acct_imp})
# Print number of missing values
print(banking_imputed.isna().sum())
# Import process from thefuzz
from thefuzz import process
# Store the unique values of cuisine_type in unique_types
unique_types = restaurants['cuisine_type'].unique()
# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit = len(unique_types)))
# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit = len(unique_types)))
# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', unique_types, limit = len(unique_types)))
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))
# Iterate through the list of matches to italian
for match in matches:
    # Check whether the similarity score is greater than or equal to 80
    if match[1] >= 80:
        # Select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
        restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = 'italian'
# Iterate through categories
for cuisine in categories:
    # Create a list of matches, comparing cuisine with the cuisine_type column
    matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))
    # Iterate through the list of matches
    for match in matches:
        # Check whether the similarity score is greater than or equal to 80
        if match[1] >= 80:
            # If it is, select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
            restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine
# Inspect the final result
print(restaurants['cuisine_type'].unique())
# Create an indexer and object and find possible pairs
indexer = recordlinkage.Index()
# Block pairing on cuisine_type
indexer.block('cuisine_type')
# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)
# Create a comparison object
comp_cl = recordlinkage.Compare()
# Find exact matches on city, cuisine_types
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('cuisine_type', 'cuisine_type', label='cuisine_type')
# Find similar matches of rest_name
comp_cl.string('rest_name', 'rest_name', label='name', threshold = 0.8)
# Get potential matches and print
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new)
print(potential_matches)
# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]
# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)
# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]
# Append non_dup to restaurants
full_restaurants = restaurants.append(non_dup)
print(full_restaurants)
#Convert columns from wide to long format
df = pd.melt(df, id_vars = 'id', value_vars = ['math', 'chemistry'])
print(df)
#Drop rows where all values are NaN
print(score.dropna(how = 'all'))
#To lower case
jobs['roles'] = jobs['roles'].str.lower()
#Pivot Tables
df = pd.pivot_table(
    restaurant,
    values = ['price', 'rating'],
    index = 'cuisine',
    aggfunc = np.mean)
#Print strings that contain "d"
print(s.str.contains("d"))
#Determine data types in df columns
print(books.dtypes)
#Combine columns into single df
df = pd.concat([restaurant, location], axis = 1)