NOTES

    # Importing numpy, pandas, and matplotlib
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    
    # Importing the course datasets
    deals = pd.read_csv("datasets/amir_deals.csv")
    happiness = pd.read_csv("datasets/world_happiness.csv")
    food = pd.read_csv("datasets/food_consumption.csv")

    Notes

    P(event) = (# of ways an event can happen)/(total # of possible outcomes)

    Discrete Distributions (discrete outcomes)

    • Probability Distribution - describes the probability of each possible outcome in a scenario.
    • Expected Value - the mean of a distribution.
      • multiply each value by its probability and sum the results
    • Probability = area, when outcomes are shown as a bar chart.
      • P(die_roll <= 2) = (1 * 1/6) + (1 * 1/6) = 1/3
    • Law of large numbers - as the sample size increases, the sample mean approaches the expected value (see the sketch below).
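
    A minimal sketch (not course code) tying these bullets together: the expected value of a fair die, and simulated sample means approaching it as the sample size grows.

    import numpy as np

    outcomes = np.array([1, 2, 3, 4, 5, 6])
    probs = np.full(6, 1/6)

    # Expected value: multiply each value by its probability and sum
    print(np.sum(outcomes * probs))   # 3.5

    # Law of large numbers: the sample mean approaches 3.5 as n grows
    rng = np.random.default_rng(42)
    for n in [10, 1000, 100000]:
        rolls = rng.choice(outcomes, size=n)
        print(n, rolls.mean())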

    Continuous Distributions

    • Continuous uniform distribution - infinitely many possible outcomes, all equally likely within a range
    • Probability still = area under the curve
      • P(wait time <= 7) = ?
        • from scipy.stats import uniform
        • uniform.cdf(7, 0, 12) - arguments are (value, loc, scale): loc is the lower limit and scale is the width of the interval (upper - lower), so with loc = 0 the scale equals the upper limit
      • P(wait time >= 7) = 1 - P(wait time <= 7)
    • Generating random numbers (see the sketch below)
      • from scipy.stats import uniform
      • uniform.rvs(0, 5, size=10) generates 10 values between 0 and 5
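
    A small sketch of the wait-time example above, using the 0-12 minute interval from the cdf call (scipy's uniform takes loc and scale, with scale = upper - lower).

    from scipy.stats import uniform

    lower, upper = 0, 12                              # interval from the note above
    p_leq_7 = uniform.cdf(7, lower, upper - lower)    # P(wait time <= 7)
    p_geq_7 = 1 - p_leq_7                             # P(wait time >= 7)
    print(p_leq_7, p_geq_7)

    # 10 simulated wait times between 0 and 5 (loc=0, scale=5)
    print(uniform.rvs(0, 5, size=10))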

    The binomial distribution

    • Models an outcome with two possible values (e.g. Heads/Tails, Pass/Fail, 1/0).
      • from scipy.stats import binom
      • binom.rvs(# of coins per trial, probability of success p, size=# of trials)
    • Probability distribution of the number of successes in a sequence of independent trials.
    • Expected Value = n * p
    • Each trial must be independent of the others: one outcome cannot affect the next (see the sketch below).
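
    A hedged sketch of the binomial tools, assuming 10 coin flips per trial with p = 0.5 (example numbers, not from the course).

    from scipy.stats import binom

    n, p = 10, 0.5
    flips = binom.rvs(n, p, size=1000)   # 1000 trials of 10 flips each
    print(flips.mean())                  # close to the expected value n * p = 5
    print(binom.pmf(7, n, p))            # P(exactly 7 successes)
    print(binom.cdf(7, n, p))            # P(7 or fewer successes)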

    The normal distribution

    • Area under curve = 1
    • The curve never reaches 0 - the probability density gets arbitrarily small but never hits 0
    • Mean = 0, Std = 1 is the Standard Normal Distribution
    • 68% of area falls w/in 1 std of mean
    • 95% of area falls w/in 2 std of mean
    • 99.7% of area falls w/in 3 std of mean (checked in the sketch below)
      • from scipy.stats import norm
      • norm.cdf(x, mean, std) = P(value <= x)
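
    A quick sketch checking the 68-95-99.7 rule against norm.cdf on the standard normal.

    from scipy.stats import norm

    for k in [1, 2, 3]:
        area = norm.cdf(k, 0, 1) - norm.cdf(-k, 0, 1)
        print(f"within {k} std: {area:.3f}")   # ~0.683, 0.954, 0.997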

    The central limit theorem

    • The sampling distribution of a statistic becomes closer to the normal distribution as the number of samples increases (see the sketch below).
    • Lets you estimate characteristics of an unknown underlying distribution.
    • Makes it easier to estimate characteristics of large populations.
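
    A rough sketch of the theorem using a skewed exponential population (an assumed example): the means of repeated samples cluster around the population mean and look roughly normal even though the raw data do not.

    import numpy as np

    rng = np.random.default_rng(0)
    population = rng.exponential(scale=2, size=100000)   # skewed population

    # Sampling distribution of the sample mean (1000 samples of size 50)
    sample_means = [rng.choice(population, size=50).mean() for _ in range(1000)]
    print(np.mean(sample_means), population.mean())      # both near 2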

    The Poisson distribution

    • Events happen at a certain rate, but completely at random.
      • Ex. # of earthquakes/yr in CA, # of dogs adopted from shelter/week, # of people arriving at restaurant/hr
    • Probability of some # of events happening over fixed period of time.
      • Lambda (λ) = avg. # of events per time interval
        • The distribution peaks around lambda, which is also its mean
      • from scipy.stats import poisson
      • poisson.pmf(# of events, lambda) - see the sketch below
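
    A small sketch assuming an average of 8 adoptions per week (lambda = 8 is an assumed example value).

    from scipy.stats import poisson

    lam = 8
    print(poisson.pmf(5, lam))        # P(exactly 5 adoptions in a week)
    print(poisson.cdf(5, lam))        # P(5 or fewer)
    print(1 - poisson.cdf(10, lam))   # P(more than 10)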

    Exponential distribution

    • Models the probability of the time between Poisson events
      • Ex. probability of > 1 day passing between adoptions
      • Lambda is the rate for exponential distributions as well
    • Expected value of the exponential distribution
      • In terms of rate (Poisson)
        • lambda = 0.5 requests per minute
      • In terms of time (exponential)
        • 1/lambda = 2 minutes, i.e. one request every 2 minutes on average (see the sketch below)
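
    A sketch using the rate from the bullets above: lambda = 0.5 requests per minute, so scipy's scale parameter is 1/lambda = 2 minutes.

    from scipy.stats import expon

    lam = 0.5
    print(expon.cdf(1, scale=1/lam))       # P(next request arrives within 1 minute)
    print(1 - expon.cdf(4, scale=1/lam))   # P(more than 4 minutes between requests)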

    (Student's) t-distribution

    • Similar shape to the normal distribution.
    • Has a parameter called degrees of freedom (df) which affects the thickness of the tails.
      • Lower df = thicker tails, higher std
      • Higher df = looks closer to the normal distribution (see the sketch below)
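
    A sketch comparing tail probabilities: the lower the degrees of freedom, the thicker the tail relative to the normal distribution.

    from scipy.stats import norm, t

    for df in [2, 30]:
        print(df, t.cdf(-3, df))    # P(T < -3), larger for small df
    print("normal", norm.cdf(-3))   # thinnest tail of the three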

    Log-normal distribution

    • Variable whose logarithm is normally distributed
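
    A sketch of the definition, using assumed parameters (the log of the variable has mean 0 and std 1): the log of log-normal samples is normally distributed.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.lognormal(mean=0, sigma=1, size=100000)   # log-normal samples
    logs = np.log(x)
    print(logs.mean(), logs.std())                    # close to 0 and 1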
    
    # Count the deals for each product
    counts = amir_deals['product'].value_counts()
    
    # Calculate probability of picking a deal with each product
    probs = counts / amir_deals.shape[0]
    
    # Create a histogram of restaurant_groups and show plot
    restaurant_groups['group_size'].hist(bins=np.linspace(2,6,5))
    plt.show()
    
    # Create probability distribution
    size_dist = restaurant_groups['group_size'].value_counts() / restaurant_groups.shape[0]
    
    # Reset index and rename columns
    size_dist = size_dist.reset_index()
    size_dist.columns = ['group_size', 'prob']
    
    #Subset columns
    print(df[['alignment', 'character']])
    
    # Expected value
    expected_value = np.sum(size_dist['group_size'] * size_dist['prob'])
    print(expected_value)
    
    # Subset groups of size 4 or more
    groups_4_or_more = size_dist[size_dist['group_size'] >= 4]
    
    # Sum the probabilities of groups_4_or_more
    prob_4_or_more = np.sum(groups_4_or_more['prob'])
    print(prob_4_or_more)
    
    # Calculate probability of waiting 10-20 mins
    prob_between_10_and_20 = uniform.cdf(20, min_time, max_time) - uniform.cdf(10, min_time, max_time)
    print(prob_between_10_and_20)
    
    # Set random seed to 334
    np.random.seed(334)
    
    # Import uniform
    from scipy.stats import uniform
    
    # Generate 1000 wait times between 0 and 30 mins
    wait_times = uniform.rvs(0, 30, size=1000)
    
    # Create a histogram of simulated times and show plot
    plt.hist(wait_times)
    plt.show()
    
    # binom.pmf(num heads, num trials, prob of heads), exact probability
    binom.pmf(7, 10, 0.5)
    
    # binom.cdf(num heads, num trials, prob of heads), less than or equal to probability
    binom.cdf(7, 10, 0.5)
    
    # 1-binom.cdf to get probability of greater than a certain number
    
    # Normal Distribution - What height are 90% of the women shorter than?
    norm.ppf(0.9, 161, 7)
    
    # Generate 10 random heights
    norm.rvs(161, 7, size=10)
    
    # Histogram of amount with 10 bins and show plot
    amir_deals['amount'].hist(bins=10)
    plt.show()
    
    # Rolling the dice 5 times, take the mean, repeat 10 times
    sample_means=[]
    for i in range(10):
        samp_5 = die.sample(5, replace=True)
        sample_means.append(np.mean(samp_5))
    print(sample_means) # Sampling distribution of the sample mean
    
    # Create a histogram of num_users and show
    amir_deals['num_users'].hist()
    plt.show()
    
    # Sample 20 num_users with replacement from amir_deals
    samp_20 = amir_deals['num_users'].sample(20, replace=True)
    
    # Take mean of samp_20
    print(np.mean(samp_20))
    
    sample_means = []
    
    # Loop 100 times
    for i in range(100):
      # Take sample of 20 num_users
      samp_20 = amir_deals['num_users'].sample(20, replace=True)
      # Calculate mean of samp_20
      samp_20_mean = np.mean(samp_20)
      # Append samp_20_mean to sample_means
      sample_means.append(samp_20_mean)
        
    # Convert to Series and plot histogram
    sample_means_series = pd.Series(sample_means)
    sample_means_series.hist()
    # Show plot
    plt.show()
    
    # Import poisson from scipy.stats
    from scipy.stats import poisson
    
    # Probability of > 10 responses
    prob_over_10 = 1 - poisson.cdf(10, 4)
    
    # Exponential - P(waiting 1 time unit or less for the next request), with scale = 1/lambda (mean time between events)
    from scipy.stats import expon
    expon.cdf(1, scale=0.5)
    
    #Quantiles
    lower = np.quantile(bootstraps, 0.025)
    upper = np.quantile(bootstraps, 0.975)
    
    print(f"""Lower bound: {round(lower, 2)}\n
    Upper bound: {round(upper, 2)}""")
    
    #Sample with replacement
    data = np.array([13, 28, 56, 31, 63])
    
    sample = np.random.choice(data, 5)
    
    #Sample without replacement
    purchases = np.random.choice(chocolate, 3, replace=False)
    
    #boxplot
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    sns.boxplot(x="Type", y="Si", data=glass, order=["virginica", "versicolor", "setosa"]) 
    
    plt.show()
    
    #Normal Dist samples
    samples=np.random.normal(0, 1, 10000)
    
    #Poisson samples
    samples = np.random.poisson(10, 100)
    
    #Random sample
    sample = np.random.choice(data, 5)
    
    #Create an array of 100 numbers sampled from a Poisson distribution where the average number of events per interval (lam) is equal to 5
    sample = np.random.poisson(5, 100)
    
    #Statistical test with p-value
    results = stats.ttest_ind(males, females)
    
    #Type I error
    #The null hypothesis is true and is rejected by the testing
    
    #Exponential sample with rate parameter = 56
    sample = np.random.exponential(1/56,100000)
    
    #Want to show your fund performed 10% better than last year
    #one-tailed test
    
    #Chi-square distribution
    #Most useful in deciding whether two variables (e.g. hot dog type and age) are independent or not
    
    #Paired
    #The hypothesis testing method that is best to use when the two samples are not considered independent
    
    #F-distribution
    #calculate how varied your samples are
    
    #Parameter estimates
    #Used when the population is too large to measure every individual
    
    #Test that can show if 1 value is greater than another
    #one-tailed
    
    #Randomized Block design
    #When you group subjects into blocks (like age) and then randomize whether each receives treatment or control within the block
    
    #Confidence Interval
    #A range in which the true population mean is likely to be found with a certain probability
    
    #Maximum Likelihood Estimate
    #It is used to find the parameters of a distribution that most likely generated the observed data.
    
    #Null Hypothesis
    #proposes there is not a significant difference between Group A and Group B.
    
    #F-value
    #test statistic equal to the ratio of between-group and within-group variance estimates, used to conclude whether the population means are different
    
    #Box-plots
    #Use continuous numerical data
    
    #Empirical Cumulative Distribution Function (ECDF)
    from statsmodels.distributions.empirical_distribution import ECDF
    
    ecdf = ECDF(x)
    
    #Hypothesis Test to prove it is not 40
    tstat, pval = stats.ttest_1samp(ages, 40)
    
    print(round(pval,3))
    
    #Hypothesis Test to prove it is not 8
    test=stats.ttest_1samp(spend, 8)
    
    #Create an array of one million numbers sampled from a Normal distribution with mean equal to 30 and standard deviation equal to 5
    sample = np.random.normal(30, 5, 1000000)
    
    #qqplot
    from statsmodels.api import qqplot
    qqplot(data=steam["usage"])
    
    #Random choice
    sample = np.random.choice(["Win", "Lose"], 10, p=(0.03, 0.97))
    
    #Draw a random sample
    sample = np.random.choice(professions)
    
    #Random binomial
    sample = np.random.binomial(10,0.5,100)
    
    #T-test
    tstat, pval = stats.ttest_ind(new_batteries, old_batteries)
    
    test= stats.ttest_ind(group_a, group_b)
    
    #pair plot
    sns.pairplot(song_metrics)
    
    #Without replacement
    purchases = np.random.choice(chocolate, 3, replace=False)
    
    #sample of 20,000
    sample = np.random.standard_normal(20000)
    
    #Violin plot
    ax = sns.violinplot(x="Type", y="Si", data=glass)
    
    #Apply square root
    print(df.apply(np.sqrt))
    
    #Summary statistics, describe
    print(food.describe())
    
    #Create DataFrame
    print(pd.DataFrame({
        "x": [1], 
        "y": [3],
    }))
    
    #Sort values
    result = df.sort_values('salary', ascending = True)
    
    #Line plot
    sns.lineplot(x = 'day', y = 'order', data=df)
    
    #Scatterplot color
    sns.scatterplot(x = "age", y = "value", hue = "emissions", data = valuation)
    
    ax = sns.scatterplot(x="GDP per capita", y="Score", hue="Generosity", data=happiness)
    
    #Subset rows where the Year column equals 2019
    sales_2019 = sales[sales['Year'] == 2019]
    
    #Correlation coefficient
    cc = np.corrcoef(lum, rad)[0, 1]
    print(round(cc, 3))
    
    #Print random sample of 5 with random seed
    print(chess.sample(n=5, random_state=42))
    
    #IQR
    iqr_age = stats.iqr(age)
    
    #Print the name of the columns
    print(chess.columns)
    
    # Reading a text file
    filename = 'huck_finn.txt'
    file = open(filename, mode='r') #'r' is for read
    text = file.read()
    file.close()
    print(text)
    
    # Context manager 'with'
    with open('huck_finn.txt','r') as file:
        print(file.read())
        
    # Open a file: file
    file = open('moby_dick.txt', 'r')
    
    # Print it
    print(file.read())
    
    # Check whether file is closed
    print(file.closed)
    
    # Close file
    file.close()
    
    # Check whether file is closed
    print(file.closed)
    
    # Read & print the first 3 lines
    with open('moby_dick.txt') as file:
        print(file.readline())
        print(file.readline())
        print(file.readline())
    
    # Import package
    import numpy as np
    
    # Assign filename to variable: file
    file = 'digits.csv'
    
    # Load file as array: digits
    digits = np.loadtxt(file, delimiter=',')
    
    # Print datatype of digits
    print(type(digits))
    
    # Select and reshape a row
    im = digits[21, 1:]
    im_sq = np.reshape(im, (28, 28))
    
    # Plot reshaped data (matplotlib.pyplot already loaded as plt)
    plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
    plt.show()
    
    # Import numpy
    import numpy as np
    
    # Assign the filename: file
    file = 'digits_header.txt'
    
    # Load the data: data
    data = np.loadtxt(file, delimiter="\t", skiprows= 1, usecols=[0,2])
    
    # Print data
    print(data)
    
    # Assign filename: file
    file = 'seaslug.txt'
    
    # Import file: data
    data = np.loadtxt(file, delimiter='\t', dtype=str)
    
    # Print the first element of data
    print(data[0])
    
    # Import data as floats and skip the first row: data_float
    data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)
    
    # Print the 10th element of data_float
    print(data_float[9])
    
    # Plot a scatterplot of the data
    plt.scatter(data_float[:, 0], data_float[:, 1])
    plt.xlabel('time (min.)')
    plt.ylabel('percentage of larvae')
    plt.show()
    
    # Assign the filename: file
    file = 'titanic.csv'
    
    # Import file using np.recfromcsv: d
    d = np.recfromcsv(file)
    
    # Print out first three entries of d
    print(d[:3])
    
    # Assign the filename: file
    file = 'digits.csv'
    
    # Read the first 5 rows of the file into a DataFrame: data
    data = pd.read_csv(file, nrows=5, header=None)
    
    # Build a numpy array from the DataFrame: data_array
    data_array = data.values
    
    # Print the datatype of data_array to the shell
    print(type(data_array))
    
    # Import matplotlib.pyplot as plt
    import matplotlib.pyplot as plt
    
    # Assign filename: file
    file = 'titanic_corrupt.txt'
    
    # Import file: data
    data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
    
    # Print the head of the DataFrame
    print(data.head())
    
    # Plot 'Age' variable in a histogram
    pd.DataFrame.hist(data[['Age']])
    plt.xlabel('Age (years)')
    plt.ylabel('count')
    plt.show()
    
    # Import pickle package
    import pickle
    
    # Load pickle file and see what type it is
    x = pd.read_pickle('data.pkl')
    print(type(x))
    
    # Open pickle file and load data
    with open('data.pkl', 'rb') as file:
        d = pickle.load(file)
        
    #Beautiful Soup
    from bs4 import BeautifulSoup
    
    s = BeautifulSoup(html_doc, 'html.parser')
    
    # Print data
    print(d)
    
    # Print datatype
    print(type(d))
    
    # Import pandas
    import pandas as pd
    
    # Assign spreadsheet filename: file
    file = 'battledeath.xlsx'
    
    # Load spreadsheet: xls
    xls = pd.ExcelFile(file)
    
    #Read excel file
    food_df = pd.read_excel(food_file, names=['Country', 'Regular Coffee', 'Instant Coffee', 'Tea'])
    
    # Print sheet names
    print(xls.sheet_names)
    
    # Load a sheet into a DataFrame by name: df1
    df1 = xls.parse('2004')
    
    # Print the head of the DataFrame df1
    print(df1.head())
    
    # Load a sheet into a DataFrame by index: df2
    df2 = xls.parse(0)
    
    # Print the head of the DataFrame df2
    print(df2.head())
    
    # Parse the first sheet and rename the columns: df1
    df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])
    
    #Contains
    print(data.str.contains("2019"))
    
    # Print the head of the DataFrame df1
    print(df1.head())
    
    #Separate email @ from column
    print(contact.email.str.split('@', expand = True))
    
    #Set_index
    game = game.set_index('name')
    
    # Parse the first column of the second sheet and rename the column: df2
    df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country'])
    
    # Print the head of the DataFrame df2
    print(df2.head())
    
    # Import sas7bdat package
    from sas7bdat import SAS7BDAT
    
    # Save file to a DataFrame: df_sas
    with SAS7BDAT('sales.sas7bdat') as file:
        df_sas = file.to_data_frame()
    
    # Print head of DataFrame
    print(df_sas.head())
    
    # Plot histograms of a DataFrame feature (pandas and pyplot already imported)
    pd.DataFrame.hist(df_sas[['P']])
    plt.ylabel('count')
    plt.show()
    
    # Import pandas
    import pandas as pd
    
    # Load Stata file into a pandas DataFrame: df
    df = pd.read_stata('disarea.dta')
    
    # Print the head of the DataFrame df
    print(df.head())
    
    # Plot histogram of one column of the DataFrame
    pd.DataFrame.hist(df[['disa10']])
    plt.xlabel('Extent of disease')
    plt.ylabel('Number of countries')
    plt.show()
    
    # Import packages
    import numpy as np
    import h5py
    
    # Assign filename: file
    file = 'LIGO_data.hdf5'
    
    # Load file: data
    data = h5py.File(file, 'r')
    
    # Print the datatype of the loaded file
    print(type(data))
    
    # Print the keys of the file
    for key in data.keys():
        print(key)
        
    # Get the HDF5 group: group
    group = data['strain']
    
    # Check out keys of group
    for key in group.keys():
        print(key)
    
    # Set variable equal to time series data: strain
    strain = data['strain']['Strain']
    
    # Set number of time points to sample: num_samples
    num_samples = 10000
    
    # Set time vector
    time = np.arange(0, 1, 1/num_samples)
    
    # Plot data
    plt.plot(time, strain[:num_samples])
    plt.xlabel('GPS Time (s)')
    plt.ylabel('strain')
    plt.show()
    
    # Import package
    import scipy.io
    
    # Load MATLAB file: mat
    mat = scipy.io.loadmat('albeck_gene_expression.mat')
    
    # Print the datatype type of mat
    print(type(mat))
    
    # Print the keys of the MATLAB dictionary
    print(mat.keys())
    
    # Print the type of the value corresponding to the key 'CYratioCyt'
    print(type(mat['CYratioCyt']))
    
    # Print the shape of the value corresponding to the key 'CYratioCyt'
    print(np.shape(mat['CYratioCyt']))
    
    # Subset the array and plot it
    data = mat['CYratioCyt'][25, 5:]
    fig = plt.figure()
    plt.plot(data)
    plt.xlabel('time (min.)')
    plt.ylabel('normalized fluorescence (measure of expression)')
    plt.show()
    
    # Import necessary module
    from sqlalchemy import create_engine
    
    # Create engine: engine
    engine = create_engine('sqlite:///Chinook.sqlite')
    
    # Import necessary module
    from sqlalchemy import create_engine
    
    # Create engine: engine
    engine = create_engine('sqlite:///Chinook.sqlite')
    
    # Save the table names to a list: table_names
    table_names = engine.table_names()
    
    # Print the table names to the shell
    print(table_names)
    
    # Open engine in context manager
    # Perform query and save results to DataFrame: df
    with engine.connect() as con:
        rs = con.execute("SELECT LastName, Title FROM Employee")
        df = pd.DataFrame(rs.fetchmany(size=3))
        df.columns = rs.keys()
    
    # Print the length of the DataFrame df
    print(len(df))
    
    # Create engine: engine
    engine = create_engine('sqlite:///Chinook.sqlite')
    
    # Open engine in context manager
    # Perform query and save results to DataFrame: df
    with engine.connect() as con:
        rs = con.execute("SELECT * FROM Employee WHERE EmployeeId >= 6")
        df = pd.DataFrame(rs.fetchall())
        df.columns = rs.keys()
    
    # Print the head of the DataFrame df
    print(df.head())
    
    # Create engine: engine
    engine = create_engine('sqlite:///Chinook.sqlite')
    
    # Open engine in context manager
    with engine.connect() as con:
        rs = con.execute("SELECT * FROM Employee ORDER BY BirthDate")
        df = pd.DataFrame(rs.fetchall())
    
        # Set the DataFrame's column names
        df.columns = rs.keys()
    
    # Print head of DataFrame
    print(df.head())
    
    # Import packages
    from sqlalchemy import create_engine
    import pandas as pd
    
    # Create engine: engine
    engine = create_engine('sqlite:///Chinook.sqlite')
    
    # Execute query and store records in DataFrame: df
    df = pd.read_sql_query("SELECT * FROM Album", engine)
    
    # Open engine in context manager
    # Perform query and save results to DataFrame: df
    with engine.connect() as con:
        rs = con.execute("SELECT Title, Name FROM Album INNER JOIN Artist on Album.ArtistID = Artist.ArtistID")
        df = pd.DataFrame(rs.fetchall())
        df.columns = rs.keys()
    
    # Print head of DataFrame df
    print(df.head())
    
    # Execute query and store records in DataFrame: df
    df = pd.read_sql_query("SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000", engine)
    
    # Print head of DataFrame
    print(df.head())
    # Print the information of ride_sharing
    print(ride_sharing.info())
    
    # Print summary statistics of user_type column
    print(ride_sharing['user_type'].describe())
    
    # Convert user_type from integer to category
    ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')
    
    # Write an assert statement confirming the change
    assert ride_sharing['user_type_cat'].dtype == 'category'
    
    # Print new summary statistics 
    print(ride_sharing['user_type_cat'].describe())
    
    # Strip duration of minutes
    ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes') 
    
    # Convert duration to integer
    ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')
    
    # Write an assert statement making sure of conversion
    assert ride_sharing['duration_time'].dtype == 'int'
    
    # Print formed columns and calculate average ride duration 
    print(ride_sharing[['duration','duration_trim','duration_time']])
    print(ride_sharing['duration_time'].mean())
    
    # Convert tire_sizes to integer
    ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')
    
    # Set all values above 27 to 27
    ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27
    
    # Reconvert tire_sizes back to categorical
    ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')
    
    # Print tire size description
    print(ride_sharing['tire_sizes'].head())
    
    # Convert ride_date to date
    ride_sharing['ride_dt'] = pd.to_datetime(ride_sharing['ride_date']).dt.date
    
    # Save today's date
    today = dt.date.today()
    
    # Set all in the future to today's date
    ride_sharing.loc[ride_sharing['ride_dt'] > today, 'ride_dt'] = today
    
    # Print maximum of ride_dt column
    print(ride_sharing['ride_dt'].max())
    
    # Find duplicates
    duplicates = ride_sharing.duplicated(subset = 'ride_id', keep = False)
    
    # Sort your duplicated rides
    duplicated_rides = ride_sharing[duplicates].sort_values('ride_id')
    
    # Print relevant columns
    print(duplicated_rides[['ride_id','duration','user_birth_year']])
    
    # Drop complete duplicates from ride_sharing
    ride_dup = ride_sharing.drop_duplicates()
    
    # Create statistics dictionary for aggregation function
    statistics = {'user_birth_year': 'min', 'duration': 'mean'}
    
    # Group by ride_id and compute new statistics
    ride_unique = ride_dup.groupby('ride_id').agg(statistics).reset_index()
    
    # Find duplicated values again
    duplicates = ride_unique.duplicated(subset = 'ride_id', keep = False)
    duplicated_rides = ride_unique[duplicates == True]
    
    # Assert duplicates are processed
    assert duplicated_rides.shape[0] == 0
    
    # Find the cleanliness category in airlines not in categories
    cat_clean = set(airlines['cleanliness']).difference(categories['cleanliness'])
    
    # Find rows with that category
    cat_clean_rows = airlines['cleanliness'].isin(cat_clean)
    
    # Print rows with inconsistent category
    print(airlines[cat_clean_rows])
    
    # Print rows with consistent categories only
    print(airlines[~cat_clean_rows])
    
    # Print unique values of both columns
    print(airlines['dest_region'].unique())
    print(airlines['dest_size'].unique())
    
    # Lower dest_region column and then replace "eur" with "europe"
    airlines['dest_region'] = airlines['dest_region'].str.lower() 
    airlines['dest_region'] = airlines['dest_region'].replace({'eur':'europe'})
    
    # Remove white spaces from `dest_size`
    airlines['dest_size'] = airlines['dest_size'].str.strip()
    
    # Verify changes have been effected
    print(airlines['dest_size'].unique())
    print(airlines['dest_region'].unique())
    
    # Create ranges for categories
    label_ranges = [0, 60, 180, np.inf]
    label_names = ['short', 'medium', 'long']
    
    # Create wait_type column
    airlines['wait_type'] = pd.cut(airlines['wait_min'], bins = label_ranges, 
                                   labels = label_names)
    
    # Create mappings and replace
    mappings = {'Monday':'weekday', 'Tuesday':'weekday', 'Wednesday': 'weekday', 
                'Thursday': 'weekday', 'Friday': 'weekday', 
                'Saturday': 'weekend', 'Sunday': 'weekend'}
    
    airlines['day_week'] = airlines['day'].replace(mappings)
    
    # Replace "Dr." with empty string ""
    airlines['full_name'] = airlines['full_name'].str.replace("Dr.","")
    
    # Replace "Mr." with empty string ""
    airlines['full_name'] = airlines['full_name'].str.replace("Mr.","")
    
    # Replace "Miss" with empty string ""
    airlines['full_name'] = airlines['full_name'].str.replace("Miss","")
    
    # Replace "Ms." with empty string ""
    airlines['full_name'] = airlines['full_name'].str.replace("Ms.","")
    
    # Assert that full_name has no honorifics
    assert airlines['full_name'].str.contains('Ms.|Mr.|Miss|Dr.').any() == False
    
    # Store length of each row in survey_response column
    resp_length = airlines['survey_response'].str.len()
    
    # Find rows in airlines where resp_length > 40
    airlines_survey = airlines[resp_length > 40]
    
    # Assert minimum survey_response length is > 40
    assert airlines_survey['survey_response'].str.len().min() > 40
    
    # Print new survey_response column
    print(airlines_survey['survey_response'])
    
    # Find values of acct_cur that are equal to 'euro'
    acct_eu = banking['acct_cur'] == 'euro'
    
    # Convert acct_amount where it is in euro to dollars
    banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1 
    
    # Unify acct_cur column by changing 'euro' values to 'dollar'
    banking.loc[acct_eu, 'acct_cur'] = 'dollar'
    
    # Assert that only dollar currency remains
    assert banking['acct_cur'].unique() == 'dollar'
    
    # Print the header of account_opened
    print(banking['account_opened'].head())
    
    # Convert account_opened to datetime
    banking['account_opened'] = pd.to_datetime(banking['account_opened'],
                                               # Infer datetime format
                                               infer_datetime_format = True,
                                               # Return missing value for error
                                               errors = 'coerce') 
    # Get year of account opened
    banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')
    
    # Print acct_year
    print(banking['acct_year'])
    
    # Store fund columns to sum against
    fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']
    
    # Find rows where fund_columns row sum == inv_amount
    inv_equ = banking[fund_columns].sum(axis=1) == banking['inv_amount']
    
    # Store consistent and inconsistent data
    consistent_inv = banking[inv_equ]
    inconsistent_inv = banking[~inv_equ]
    
    # Print the number of inconsistent investments
    print("Number of inconsistent investments: ", inconsistent_inv.shape[0])
    
    # Store today's date and find ages
    today = dt.date.today()
    ages_manual = today.year - banking['birth_date'].dt.year
    
    # Find rows where age column == ages_manual
    age_equ = banking['age'] == ages_manual
    
    # Store consistent and inconsistent data
    consistent_ages = banking[age_equ]
    inconsistent_ages = banking[~age_equ]
    
    # Print the number of inconsistent ages
    print("Number of inconsistent ages: ", inconsistent_ages.shape[0])
    
    # Print number of missing values in banking
    print(banking.isna().sum())
    
    # Visualize missingness matrix
    msno.matrix(banking)
    plt.show()
    
    # Isolate missing and non missing values of inv_amount
    missing_investors = banking[banking['inv_amount'].isna()]
    investors = banking[~banking['inv_amount'].isna()]
    
    # Sort banking by age and visualize
    banking_sorted = banking.sort_values(by = 'age')
    msno.matrix(banking_sorted)
    plt.show()
    
    # Drop missing values of cust_id
    banking_fullid = banking.dropna(subset = ['cust_id'])
    
    # Compute estimated acct_amount
    acct_imp = banking_fullid['inv_amount'] * 5
    
    # Impute missing acct_amount with corresponding acct_imp
    banking_imputed = banking_fullid.fillna({'acct_amount':acct_imp})
    
    # Print number of missing values
    print(banking_imputed.isna().sum())
    
    # Import process from thefuzz
    from thefuzz import process
    
    # Store the unique values of cuisine_type in unique_types
    unique_types = restaurants['cuisine_type'].unique()
    
    # Calculate similarity of 'asian' to all values of unique_types
    print(process.extract('asian', unique_types, limit = len(unique_types)))
    
    # Calculate similarity of 'american' to all values of unique_types
    print(process.extract('american', unique_types, limit = len(unique_types)))
    
    # Calculate similarity of 'italian' to all values of unique_types
    print(process.extract('italian', unique_types, limit = len(unique_types)))
    
    # Create a list of matches, comparing 'italian' with the cuisine_type column
    matches = process.extract('italian', restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))
    
    # Iterate through the list of matches to italian
    for match in matches:
      # Check whether the similarity score is greater than or equal to 80
      if match[1] >= 80:
        # Select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
        restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = 'italian'
        
    # Iterate through categories
    for cuisine in categories:  
      # Create a list of matches, comparing cuisine with the cuisine_type column
      matches = process.extract(cuisine, restaurants['cuisine_type'], limit=len(restaurants.cuisine_type))
    
      # Iterate through the list of matches
      for match in matches:
         # Check whether the similarity score is greater than or equal to 80
        if match[1] >= 80:
          # If it is, select all rows where the cuisine_type is spelled this way, and set them to the correct cuisine
          restaurants.loc[restaurants['cuisine_type'] == match[0], 'cuisine_type'] = cuisine
          
    # Inspect the final result
    print(restaurants['cuisine_type'].unique())
    
    # Create an indexer and object and find possible pairs
    indexer = recordlinkage.Index()
    
    # Block pairing on cuisine_type
    indexer.block('cuisine_type')
    
    # Generate pairs
    pairs = indexer.index(restaurants, restaurants_new)
    
    # Create a comparison object
    comp_cl = recordlinkage.Compare()
    
    # Find exact matches on city, cuisine_types 
    comp_cl.exact('city', 'city', label='city')
    comp_cl.exact('cuisine_type', 'cuisine_type', label='cuisine_type')
    
    # Find similar matches of rest_name
    comp_cl.string('rest_name', 'rest_name', label='name', threshold = 0.8) 
    
    # Get potential matches and print
    potential_matches = comp_cl.compute(pairs, restaurants, restaurants_new) 
    print(potential_matches)
    
    # Isolate potential matches with row sum >=3
    matches = potential_matches[potential_matches.sum(axis = 1) >= 3]
    
    # Get values of second column index of matches
    matching_indices = matches.index.get_level_values(1)
    
    # Subset restaurants_new based on non-duplicate values
    non_dup = restaurants_new[~restaurants_new.index.isin(matching_indices)]
    
    # Append non_dup to restaurants (pd.concat, since DataFrame.append was removed in pandas 2.0)
    full_restaurants = pd.concat([restaurants, non_dup])
    print(full_restaurants)
    
    #Convert columns from wide to long format
    df = pd.melt(df, id_vars = 'id', value_vars = ['math', 'chemistry'])
    
    print(df)
    
    #Drop rows where all values are NaN
    print(score.dropna(how = 'all'))
    
    #To lower case
    jobs['roles'] = jobs['roles'].str.lower()
    
    #Pivot Tables
    df = pd.pivot_table(
        restaurant,
        values = ['price', 'rating'],
        index = 'cuisine',
        aggfunc = np.mean
    )
    
    #Print strings that contain "d"
    print(s.str.contains("d"))
        
    #Determine data types in df columns
    print(books.dtypes)
        
    #Combine columns into single df
    df = pd.concat([restaurant, location], axis = 1)