NOTES
# Importing numpy and pandas
import numpy as np
import pandas as pd
# Importing the course datasets
deals = pd.read_csv("datasets/amir_deals.csv")
happiness = pd.read_csv("datasets/world_happiness.csv")
food = pd.read_csv("datasets/food_consumption.csv")
Notes
P(event) = (# of ways an event can happen)/(total # of possible outcomes)
Discrete Distributions (discrete outcomes)
- Probability Distribution - describes the probability of each possible outcome in a scenario.
- Expected Value - mean of distribution.
- multiply each value by its probability
- Probability = area, when it comes to bar charts.
- P(die roll <= 2) = (1 * 1/6) + (1 * 1/6) = 1/3
- Law of large numbers - as your sample size increases, the sample mean will approach the Expected Value.
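A minimal sketch of the expected value and law of large numbers ideas above, using a fair six-sided die (values are illustrative, not from the course datasets):
import numpy as np
# Expected value: multiply each value by its probability and sum
die_vals = np.array([1, 2, 3, 4, 5, 6])
expected_value = np.sum(die_vals * (1/6))   # 3.5
# Law of large numbers: the mean of many rolls approaches the expected value
rolls = np.random.choice(die_vals, size=10000)
print(expected_value, rolls.mean())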
Continuous Distributions
- Continuous uniform distribution - infinitely many possible outcomes within a range
- Probability still = area
- P(wait time <= 7) = ?
- from scipy.stats import uniform
- uniform.cdf(7, 0, 12), arguments are (x, loc, scale): loc = lower limit, scale = upper - lower (equals the upper limit here because the lower limit is 0)
- P(wait time >= 7) = 1-P(wait time <= 7)
- Generating random numbers
- from scipy.stats import uniform
- uniform.rvs(0, 5, size=10)
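A minimal sketch putting the uniform notes together, assuming a 0-12 minute wait (illustrative numbers):
from scipy.stats import uniform
# P(wait <= 7): loc = lower limit, scale = upper - lower
p_less_7 = uniform.cdf(7, loc=0, scale=12)
# P(wait >= 7) via the complement
p_more_7 = 1 - p_less_7
# 10 simulated wait times between 0 and 12 minutes
waits = uniform.rvs(loc=0, scale=12, size=10)
print(p_less_7, p_more_7)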
The binomial distribution
- An outcome with two possible values (e.g. Heads/Tails, Pass/Fail, 1/0, etc.)
- from scipy.stats import binom
- binom.rvs(n, p, size=...): n = # of coins flipped per trial, p = probability of success, size = # of trials
- Probability distribution of the number of successes in a sequence of independent trials.
- Expected Value = n * p
- Each trial must be independent: one outcome cannot affect the next.
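A minimal sketch of the binomial calls above (10 fair coins; numbers are illustrative):
from scipy.stats import binom
flips = binom.rvs(10, 0.5, size=5)   # 5 trials of 10 coin flips each
expected_heads = 10 * 0.5            # Expected value = n * p = 5
print(flips, expected_heads)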
The normal distribution
- Area under curve = 1
- The probability density never reaches 0 (the tails extend forever)
- Mean = 0, Std = 1 is the Standard Normal Distribution
- 68% of area falls w/in 1 std of mean
- 95% of area falls w/in 2 std of mean
- 99.7% of area falls w/in 3 std of mean
- from scipy.stats import norm
- norm.cdf(x, mean, std) gives P(value <= x)
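A small sketch of the 68-95-99.7 rule above using norm.cdf on the standard normal (mean 0, std 1):
from scipy.stats import norm
# Area within 1, 2, and 3 standard deviations of the mean
for k in [1, 2, 3]:
    print(k, norm.cdf(k, 0, 1) - norm.cdf(-k, 0, 1))   # ~0.68, ~0.95, ~0.997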
The central limit theorem
- The sampling distribution of a statistic becomes closer to the normal distribution as the sample size increases.
- Lets you estimate characteristics of an unknown underlying distribution.
- More easily estimate characteristics of large populations.
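A rough CLT sketch: means of many small samples from a non-normal (uniform) distribution pile up in a roughly bell-shaped histogram (numbers are illustrative):
import numpy as np
# 1000 samples of size 30, keeping each sample's mean
sample_means = [np.mean(np.random.uniform(0, 12, size=30)) for i in range(1000)]
# plt.hist(sample_means) would look approximately normal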
The Poisson distribution
- Events happen at a certain rate, but completely at random.
- Ex. # of earthquakes/yr in CA, # of dogs adopted from shelter/week, # of people arriving at restaurant/hr
- Probability of some # of events happening over fixed period of time.
- Lambda (Looks like "A") = avg. # of events per time interval
- Peak of Poisson distribution is always Lambda
- from scipy.stats import poisson
- poisson.pmf(# of events, Lambda)
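A minimal sketch of the Poisson calls, assuming an illustrative average of 8 adoptions per week:
from scipy.stats import poisson
lam = 8                          # illustrative lambda: average events per interval
print(poisson.pmf(5, lam))       # P(exactly 5 events)
print(poisson.cdf(5, lam))       # P(5 or fewer events)
print(1 - poisson.cdf(5, lam))   # P(more than 5 events)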
Exponential distribution
- The probability of time between Poisson events
- Ex. Probability of > 1 day passing between adoptions
- Lambda means rate for exponential distributions
- Expected value of Exponential distribution
- In terms of rate (Poisson)
- Lambda = 0.5 requests per minute
- In terms of time (Exponential)
- 1/Lambda = 1 request per 2 minutes
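A minimal sketch tying rate and time together: with lambda = 0.5 requests per minute, scipy's scale argument is 1/lambda = 2 minutes.
from scipy.stats import expon
lam = 0.5                              # rate: requests per minute (Poisson view)
print(expon.cdf(1, scale=1/lam))       # P(wait for next request <= 1 minute)
print(1 - expon.cdf(4, scale=1/lam))   # P(wait > 4 minutes)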
(Student's) t-distribution
- Similar shape to the normal distribution.
- Has a parameter called degrees of freedom (df) which affects the thickness of the tails.
- Lower df = thicker tails, higher std
- Higher df = looks closer to normal dist
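A small comparison of tail thickness using scipy's t distribution (df values are illustrative):
from scipy.stats import t, norm
# Right-tail area beyond 2 is larger under the t and shrinks toward the normal as df grows
print(1 - norm.cdf(2))        # ~0.023
print(1 - t.cdf(2, df=3))     # ~0.070, thicker tails
print(1 - t.cdf(2, df=100))   # ~0.024, close to normal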
Log-normal distribution
- Variable whose logarithm is normally distributed
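A minimal check of the log-normal idea (parameters are illustrative): the log of log-normal samples is approximately normal.
import numpy as np
samples = np.random.lognormal(mean=0, sigma=1, size=10000)
logged = np.log(samples)
print(logged.mean(), logged.std())   # roughly 0 and 1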
# Count the deals for each product
counts = amir_deals['product'].value_counts()
# Calculate probability of picking a deal with each product
probs = counts / amir_deals.shape[0]
# Create a histogram of restaurant_groups and show plot
restaurant_groups['group_size'].hist(bins=np.linspace(2,6,5))
plt.show()
# Create probability distribution
size_dist = restaurant_groups['group_size'].value_counts() / restaurant_groups.shape[0]
# Reset index and rename columns
size_dist = size_dist.reset_index()
size_dist.columns = ['group_size', 'prob']
#Subset columns
print(df[['alignment', 'character']])
# Expected value
expected_value = np.sum(size_dist['group_size'] * size_dist['prob'])
print(expected_value)
# Subset groups of size 4 or more
groups_4_or_more = size_dist[size_dist['group_size'] >= 4]
# Sum the probabilities of groups_4_or_more
prob_4_or_more = np.sum(groups_4_or_more['prob'])
print(prob_4_or_more)
# Calculate probability of waiting 10-20 mins
prob_between_10_and_20 = uniform.cdf(20, min_time,max_time) - uniform.cdf(10, min_time,max_time)
print(prob_between_10_and_20)
# Set random seed to 334
np.random.seed(334)
# Import uniform
from scipy.stats import uniform
# Generate 1000 wait times between 0 and 30 mins
wait_times = uniform.rvs(0, 30, size=1000)
# Create a histogram of simulated times and show plot
plt.hist(wait_times)
plt.show()
# binom.pmf(num heads, num trials, prob of heads), exact probability
binom.pmf(7, 10, 0.5)
# binom.cdf(num heads, num trials, prob of heads), less than or equal to probability
binom.cdf(7, 10, 0.5)
# 1-binom.cdf to get probability of greater than a certain number
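# For example (illustrative, following the calls above): P(more than 7 heads in 10 flips)
print(1 - binom.cdf(7, 10, 0.5))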
# Normal Distribution - What height are 90% of the women shorter than?
norm.ppf(.9,161,7)
#Generate 10 random heights
norm.rvs(161, 7,size=10)
# Histogram of amount with 10 bins and show plot
amir_deals['amount'].hist(bins=10)
plt.show()
# Rolling the dice 5 times, take the mean, repeat 10 times
sample_means=[]
for i in range(10):
    samp_5 = die.sample(5, replace=True)
    sample_means.append(np.mean(samp_5))
print(sample_means)  # Sampling distribution of the sample mean
# Create a histogram of num_users and show
amir_deals['num_users'].hist()
plt.show()
# Sample 20 num_users with replacement from amir_deals
samp_20 = amir_deals['num_users'].sample(20, replace=True)
# Take mean of samp_20
print(np.mean(samp_20))
# Create an empty list to hold the sample means
sample_means = []
# Loop 100 times
for i in range(100):
    # Take sample of 20 num_users
    samp_20 = amir_deals['num_users'].sample(20, replace=True)
    # Calculate mean of samp_20
    samp_20_mean = np.mean(samp_20)
    # Append samp_20_mean to sample_means
    sample_means.append(samp_20_mean)
# Convert to Series and plot histogram
sample_means_series = pd.Series(sample_means)
sample_means_series.hist()
# Show plot
plt.show()
# Import poisson from scipy.stats
from scipy.stats import poisson
# Probability of > 10 responses
prob_over_10 = 1-poisson.cdf(10,4)
# Exponential - P(time until the next request <= 1)
from scipy.stats import expon
expon.cdf(1, scale=0.5)  # scale = expected time between events = 1/lambda
#Quantiles
lower = np.quantile(bootstraps, 0.025)
upper = np.quantile(bootstraps, 0.975)
print(f"""Lower bound: {round(lower, 2)}\n
Upper bound: {round(upper, 2)}""")
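# Note: `bootstraps` above isn't built in these notes; a minimal sketch of how such an
# array of bootstrap means might be generated (sample values are illustrative):
data = np.array([13, 28, 56, 31, 63])
bootstraps = [np.mean(np.random.choice(data, size=len(data), replace=True))
              for i in range(10000)]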
#Sample with replacement
data = np.array([13, 28, 56, 31, 63])
sample = np.random.choice(data, 5)
#Sample without replacement
purchases = np.random.choice(chocolate, 3, replace=False)
#boxplot
import matplotlib.pyplot as plt
import seaborn as sns
sns.boxplot(x="Type", y="Si", data=glass, order=["virginica", "versicolor", "setosa"])
plt.show()
#Normal Dist samples
samples=np.random.normal(0, 1, 10000)
#Poisson samples
samples = np.random.poisson(10, 100)
#Random sample
sample = np.random.choice(data, 5)
#Create an array of 100 numbers sampled from a Poisson distribution where the observed interval is equal to 5
sample = np.random.poisson(5, 100)
#Statistical test with p-value
results = stats.ttest_ind(males, females)
#Type I error
#The null hypothesis is true but is rejected by the test
#Exponential sample with rate parameter = 56
sample = np.random.exponential(1/56,100000)
#Want to prove your fund has performed 10% better than last year
#one-tailed test
#Chi-square distribution
#Most useful in deciding whether the two variables of hot dog type and age are independent or not
#Paired
#hypothesis testing method is best to use when two samples are not considered independent
#F-distribution
#calculate how varied your samples are
#Parameter estimates
#used when the population is too large to measure every member
#Test that can show if 1 value is greater than another
#one-tailed
#Randomized Block design
#When you split subjects into blocks (like age groups) and then randomize whether they receive treatment or control within each block
#Confidence Interval
#A range in which the true population mean is likely to be found with a certain probability
#Maximum Likelihood Estimate
#It is used to find the parameters of a distribution that most likely generated the observed data.
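# A hedged illustration of MLE: scipy's fit() returns maximum likelihood estimates of a
# distribution's parameters (the simulated data below are illustrative):
from scipy.stats import norm
obs = np.random.normal(30, 5, size=1000)   # pretend observed data
mu_hat, sigma_hat = norm.fit(obs)          # MLE of mean and std
print(mu_hat, sigma_hat)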
#Null Hypothesis
#proposes there is not a significant difference between Group A and Group B.
#F-value
#test statistic equal to the ratio of between-group and within-group variance estimates, used to conclude whether the population means are different
#Box-plots
#Use continuous numerical data
#Empirical Cumulative Distribution Function (ECDF)
from statsmodels.distributions.empirical_distribution import ECDF
ecdf = ECDF(x)
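# A minimal usage sketch: the ECDF object exposes sorted values (.x) and cumulative
# proportions (.y), which can be plotted (x here is whatever sample was passed in)
import matplotlib.pyplot as plt
plt.plot(ecdf.x, ecdf.y, marker='.', linestyle='none')
plt.show()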
#Hypothesis Test to prove it is not 40
tstat, pval = stats.ttest_1samp(ages, 40)
print(round(pval,3))
#Hypothesis Test to prove it is not 8
test=stats.ttest_1samp(spend, 8)
#Create an array of one million numbers sampled from a Normal distribution with mean equal to 30 and standard deviation equal to 5
sample = np.random.normal(30, 5, 1000000)
#qqplot
from statsmodels.api import qqplot
qqplot(data=steam["usage"])
#Random choice
sample = np.random.choice(["Win", "Lose"], 10, p=(0.03, 0.97))
#Draw a random sample
sample = np.random.choice(professions)
#Random binomial
sample = np.random.binomial(10,0.5,100)
#T-test
tstat, pval = stats.ttest_ind(new_batteries, old_batteries)
test= stats.ttest_ind(group_a, group_b)
#pair plot
sns.pairplot(song_metrics)
#Without replacement
purchases = np.random.choice(chocolate, 3, replace=False)
#sample of 20,000
sample = np.random.standard_normal(20000)
#Violin plot
ax = sns.violinplot(x="Type", y="Si", data=glass)
#Apply square root
print(df.apply(np.sqrt))
#Summary statistics, describe
print(food.describe())
#Create DataFrame
print(pd.DataFrame({
"x": [1],
"y": [3],
}))
#Sort values
result = df.sort_values('salary', ascending = True)
#Line plot
sns.lineplot(x = 'day', y = 'order', data=df)
#Scatterplot color
sns.scatterplot(x = "age", y = "value", hue = "emissions", data = valuation)
ax = sns.scatterplot(x="GDP per capita", y="Score", hue="Generosity", data=happiness)
#Sort column = to
sales_2019 = sales[sales['Year'] == 2019]
#Correlation coefficient
cc = np.corrcoef(lum, rad)[0, 1]
print(round(cc, 3))
#Print random sample of 5 with random seed
print(chess.sample(n=5, random_state=42))
#IQR
iqr_age = stats.iqr(age)
#Print the name of the columns
print(chess.columns)
# Reading a text file
filename = 'huck_finn.txt'
file = open(filename, mode='r') #'r' is for read
text = file.read()
file.close()
print(text)
# Context manager 'with'
with open('huck_finn.txt', 'r') as file:
    print(file.read())
# Open a file: file
file = open('moby_dick.txt', 'r')
# Print it
print(file.read())
# Check whether file is closed
print(file.closed)
# Close file
file.close()
# Check whether file is closed
print(file.closed)
# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
# Import package
import numpy as np
# Assign filename to variable: file
file = 'digits.csv'
# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')
# Print datatype of digits
print(type(digits))
# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))
# Plot reshaped data (matplotlib.pyplot already loaded as plt)
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()
# Import numpy
import numpy as np
# Assign the filename: file
file = 'digits_header.txt'
# Load the data: data
data = np.loadtxt(file, delimiter="\t", skiprows= 1, usecols=[0,2])
# Print data
print(data)
# Assign filename: file
file = 'seaslug.txt'
# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)
# Print the first element of data
print(data[0])
# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)
# Print the 10th element of data_float
print(data_float[9])
# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()
# Assign the filename: file
file = 'titanic.csv'
# Import file using np.recfromcsv: d
d = np.recfromcsv(file)
# Print out first three entries of d
print(d[:3])
# Assign the filename: file
file = 'digits.csv'
# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)
# Build a numpy array from the DataFrame: data_array
data_array = data.values
# Print the datatype of data_array to the shell
print(type(data_array))
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
# Import pickle package
import pickle
# Load pickle file and see what type it is
x = pd.read_pickle('data.pkl')
print(type(x))
# Open pickle file and load data
with open('data.pkl', 'rb') as file:
    d = pickle.load(file)
#Beautiful Soup
from bs4 import BeautifulSoup
s = BeautifulSoup(html_doc)
# Print data
print(d)
# Print datatype
print(type(d))
# Import pandas
import pandas as pd
# Assign spreadsheet filename: file
file = 'battledeath.xlsx'
# Load spreadsheet: xls
xls = pd.ExcelFile(file)
#Read excel file
food_df = pd.read_excel(food_file, names=['Country', 'Regular Coffee', 'Instant Coffee', 'Tea'])
# Print sheet names
print(xls.sheet_names)
# Load a sheet into a DataFrame by name: df1
df1 = xls.parse('2004')
# Print the head of the DataFrame df1
print(df1.head())
# Load a sheet into a DataFrame by index: df2
df2 = xls.parse(0)
# Print the head of the DataFrame df2
print(df2.head())
# Parse the first sheet and rename the columns: df1
df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])
#Contains
print(data.str.contains("2019"))
# Print the head of the DataFrame df1
print(df1.head())
#Separate email @ from column
print(contact.email.str.split('@', expand = True))
#Set_index
game = game.set_index('name')
# Parse the first column of the second sheet and rename the column: df2
df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country'])
# Print the head of the DataFrame df2
print(df2.head())
# Import sas7bdat package
from sas7bdat import SAS7BDAT
# Save file to a DataFrame: df_sas
with SAS7BDAT('sales.sas7bdat') as file:
    df_sas = file.to_data_frame()
# Print head of DataFrame
print(df_sas.head())
# Plot histograms of a DataFrame feature (pandas and pyplot already imported)
pd.DataFrame.hist(df_sas[['P']])
plt.ylabel('count')
plt.show()
# Import pandas
import pandas as pd
# Load Stata file into a pandas DataFrame: df
df = pd.read_stata('disarea.dta')
# Print the head of the DataFrame df
print(df.head())
# Plot histogram of one column of the DataFrame
pd.DataFrame.hist(df[['disa10']])
plt.xlabel('Extent of disease')
plt.ylabel('Number of countries')
plt.show()
# Import packages
import numpy as np
import h5py
# Assign filename: file
file = 'LIGO_data.hdf5'
# Load file: data
data = h5py.File(file, 'r')
# Print the datatype of the loaded file
print(type(data))
# Print the keys of the file
for key in data.keys():
    print(key)
# Get the HDF5 group: group
group = data['strain']
# Check out keys of group
for key in group.keys():
    print(key)
# Set variable equal to time series data: strain
strain = data['strain']['Strain']
# Set number of time points to sample: num_samples
num_samples = 10000
# Set time vector
time = np.arange(0, 1, 1/num_samples)
# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()
# Import package
import scipy.io
# Load MATLAB file: mat
mat = scipy.io.loadmat('albeck_gene_expression.mat')
# Print the datatype type of mat
print(type(mat))
# Print the keys of the MATLAB dictionary
print(mat.keys())
# Print the type of the value corresponding to the key 'CYratioCyt'
print(type(mat['CYratioCyt']))
# Print the shape of the value corresponding to the key 'CYratioCyt'
print(np.shape(mat['CYratioCyt']))
# Subset the array and plot it
data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()
# Import necessary module
from sqlalchemy import create_engine
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Import necessary module
from sqlalchemy import create_engine
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Save the table names to a list: table_names
table_names = engine.table_names()
# Print the table names to the shell
print(table_names)
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT LastName, Title FROM Employee")
    df = pd.DataFrame(rs.fetchmany(size=3))
    df.columns = rs.keys()
# Print the length of the DataFrame df
print(len(df))
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee WHERE EmployeeId >= 6")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
# Print the head of the DataFrame df
print(df.head())
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Open engine in context manager
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee ORDER BY BirthDate")
    df = pd.DataFrame(rs.fetchall())
    # Set the DataFrame's column names
    df.columns = rs.keys()
# Print head of DataFrame
print(df.head())
# Import packages
from sqlalchemy import create_engine
import pandas as pd
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM Album", engine)
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT Title, Name FROM Album INNER JOIN Artist on Album.ArtistID = Artist.ArtistID")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()
# Print head of DataFrame df
print(df.head())
# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000", engine)
# Print head of DataFrame
print(df.head())
# Print the information of ride_sharing
print(ride_sharing.info())
# Print summary statistics of user_type column
print(ride_sharing['user_type'].describe())
# Convert user_type from integer to category
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')
# Write an assert statement confirming the change
assert ride_sharing['user_type_cat'].dtype == 'category'
# Print new summary statistics
print(ride_sharing['user_type_cat'].describe())
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes')
# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')
# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'
# Print formed columns and calculate average ride duration
print(ride_sharing[['duration','duration_trim','duration_time']])
print(ride_sharing['duration_time'].mean())
# Convert tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')
# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27
# Reconvert tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')
# Print tire size description
print(ride_sharing['tire_sizes'].head())
# Convert ride_date to date
ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date
# Save today's date
today = dt.date.today()
# Set all in the future to today's date
ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today
# Print maximum of ride_dt column
print(ride_sharing['ride_dt'].max())
# Find duplicates
duplicates = ride_sharing.duplicated(subset = 'ride_id', keep = False)
# Sort your duplicated rides
duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')
# Print relevant columns
print(duplicated_rides[['ride_id','duration','user_birth_year']])
# Drop complete duplicates from ride_sharing
ride_dup = ride_sharing.drop_duplicates()
# Create statistics dictionary for aggregation function
statistics = {'user_birth_year': 'min', 'duration': 'mean'}
# Group by ride_id and compute new statistics
ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()
# Find duplicated values again
duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
duplicated_rides = ride_unique[duplicates == True]
# Assert duplicates are processed
assert duplicated_rides.shape[0] == 0
# Find the cleanliness category in airlines not in categories
cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])
# Find rows with that category
cat_clean_rows = airlines['cleanliness'].isin(cat_clean)
# Print rows with inconsistent category
print(airlines[cat_clean_rows])
# Print rows with consistent categories only
print(airlines[~cat_clean_rows])
# Print unique values of both columns
print(airlines['dest_region'].unique())
print(airlines['dest_size'].unique())
# Lower dest_region column and then replace "eur" with "europe"
airlines['dest_region'] = airlines['dest_region'].str.lower()
airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})
# Remove white spaces from `dest_size`
airlines['dest_size'] = airlines['dest_size'].str.strip()
# Verify changes have been effected
print(airlines['dest_size'].unique())
print(airlines['dest_region'].unique())
# Create ranges for categories
label_ranges = [0, 60, 180, np.inf]
label_names = ['short', 'medium', 'long']
# Create wait_type column
airlines['wait_type'] = pd.cut(airlines['wait_min'], bins = label_ranges,
labels = label_names)
# Create mappings and replace
mappings = {'Monday':'weekday', 'Tuesday':'weekday', 'Wednesday': 'weekday',
'Thursday': 'weekday', 'Friday': 'weekday',
'Saturday': 'weekend', 'Sunday': 'weekend'}
airlines['day_week'] = airlines['day'].replace(mappings)
# Replace "Dr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Dr.","")
# Replace "Mr." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Mr.","")
# Replace "Miss" with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Miss","")
# Replace "Ms." with empty string ""
airlines['full_name'] = airlines['full_name'].str.replace("Ms.","")
# Assert that full_name has no honorifics
assert airlines['full_name'].str.contains('Ms.|Mr.|Miss|Dr.').any() == False
# Store length of each row in survey_response column
resp_length = airlines['survey_response'].str.len()
# Find rows in airlines where resp_length > 40
airlines_survey = airlines[resp_length > 40]
# Assert minimum survey_response length is > 40
assert airlines_survey['survey_response'].str.len().min() > 40
# Print new survey_response column
print(airlines_survey['survey_response'])
# Find values of acct_cur that are equal to 'euro'
acct_eu = banking['acct_cur'] == 'euro'
# Convert acct_amount where it is in euro to dollars
banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1
# Unify acct_cur column by changing 'euro' values to 'dollar'
banking.loc[acct_eu, 'acct_cur'] = 'dollar'
# Assert that only dollar currency remains
assert banking['acct_cur'].unique() == 'dollar'
# Print the header of account_opened
print(banking['account_opened'].head())
# Convert account_opened to datetime
banking['account_opened'] = pd.to_datetime(banking['account_opened'],
# Infer datetime format
infer_datetime_format = True,
# Return missing value for error
errors = 'coerce')
# Get year of account opened
banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')
# Print acct_year
print(banking['acct_year'])
# Store fund columns to sum against
fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']
# Find rows where fund_columns row sum == inv_amount
inv_equ = banking[fund_columns].sum(axis=1) == banking['inv_amount']
# Store consistent and inconsistent data
consistent_inv = banking[inv_equ]
inconsistent_inv = banking[~inv_equ]
# Store consistent and inconsistent data
print("Number of inconsistent investments: ", inconsistent_inv.shape[0])
# Store today's date and find ages
today = dt.date.today()
ages_manual = today.year - banking['birth_date'].dt.year
# Find rows where age column == ages_manual
age_equ = banking['age'] == ages_manual
# Store consistent and inconsistent data
consistent_ages = banking[age_equ]
inconsistent_ages = banking[~age_equ]
# Store consistent and inconsistent data
print("Number of inconsistent ages: ", inconsistent_ages.shape[0])
# Print number of missing values in banking
print(banking.isna().sum())
# Visualize missingness matrix
msno.matrix(banking)
plt.show()
# Print number of missing values in banking
print(banking.isna().sum())
# Visualize missingness matrix
msno.matrix(banking)
plt.show()
# Isolate missing and non missing values of inv_amount
missing_investors = banking[banking['inv_amount'].isna()]
investors = banking[~banking['inv_amount'].isna()]
# Sort banking by age and visualize
banking_sorted = banking.sort_values(by = 'age')
msno.matrix(banking_sorted)
plt.show()
# Drop missing values of cust_id
banking_fullid = banking.dropna(subset = ['cust_id'])
# Compute estimated acct_amount
acct_imp = banking_fullid['inv_amount'] * 5
# Impute missing acct_amount with corresponding acct_imp
banking_imputed = banking_fullid.fillna({'acct_amount':acct_imp})
# Print number of missing values
print(banking_imputed.isna().sum())
# Import process from thefuzz
from thefuzz import process
# Store the unique values of cuisine_type in unique_types
unique_types = restaurants['cuisine_type'].unique()
# Calculate similarity of 'asian' to all values of unique_types
print(process.extract('asian', unique_types, limit = len(unique_types)))
# Calculate similarity of 'american' to all values of unique_types
print(process.extract('american', unique_types, limit = len(unique_types)))
# Calculate similarity of 'italian' to all values of unique_types
print(process.extract('italian', unique_types, limit = len(unique_types)))
# Create a list of matches, comparing 'italian' with the cuisine_type column
matches = process.extract('italian', restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))
# Iterate through the list of matches to italian
for match in matches:
    # Check whether the similarity score is greater than or equal to 80
    if match[1] >= 80:
        # Select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
        restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = 'italian'
# Iterate through categories
for cuisine in categories:
    # Create a list of matches, comparing cuisine with the cuisine_type column
    matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))
    # Iterate through the list of matches
    for match in matches:
        # Check whether the similarity score is greater than or equal to 80
        if match[1] >= 80:
            # If it is, select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
            restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine
# Inspect the final result
print(restaurants['cuisine_type'].unique())
# Create an indexer and object and find possible pairs
indexer = recordlinkage.Index()
# Block pairing on cuisine_type
indexer.block('cuisine_type')
# Generate pairs
pairs = indexer.index(restaurants, restaurants_new)
# Create a comparison object
comp_cl = recordlinkage.Compare()
# Find exact matches on city, cuisine_types
comp_cl.exact('city', 'city', label='city')
comp_cl.exact('cuisine_type', 'cuisine_type', label='cuisine_type')
# Find similar matches of rest_name
comp_cl.string('rest_name', 'rest_name', label='name', threshold = 0.8)
# Get potential matches and print
potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new)
print(potential_matches)
# Isolate potential matches with row sum >=3
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]
# Get values of second column index of matches
matching_indices = matches.index.get_level_values(1)
# Subset restaurants_new based on non-duplicate values
non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]
# Append non_dup to restaurants
full_restaurants = restaurants.append(non_dup)
print(full_restaurants)
#Convert columns from wide to long format
df = pd.melt(df, id_vars = 'id', value_vars = ['math', 'chemistry'])
print(df)
#Drop rows where all values are NaN
print(score.dropna(how = 'all'))
#To lower case
jobs['roles'] = jobs['roles'].str.lower()
#Pivot Tables
df = pd.pivot_table(
    restaurant,
    values = ['price', 'rating'],
    index = 'cuisine',
    aggfunc = np.mean)
#Print strings that contain "d"
print(s.str.contains("d"))
#Determine data types in df columns
print(books.dtypes)
#Combine columns into single df
df = pd.concat([restaurant, location], axis = 1)