Introduction to Importing Data in Python

    1. Import the course packages

    # Import the course packages
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.io
    import h5py
    from sas7bdat import SAS7BDAT
    from sqlalchemy import create_engine, text
    import pickle
    
    # Import the course datasets
    titanic = pd.read_csv("datasets/titanic_sub.csv")
    battledeath_2002 = pd.ExcelFile("datasets/battledeath.xlsx").parse("2002")
    engine = create_engine('sqlite:///Chinook.sqlite')
    con = engine.connect()
    # Wrap raw SQL in text(); SQLAlchemy 2.0 no longer executes plain strings
    rs = con.execute(text('SELECT * FROM Album'))
    chinook = pd.DataFrame(rs.fetchall())
    seaslug = np.loadtxt("seaslug.txt", delimiter="\t", dtype=str)

    1.1.2 Importing entire text files

    • moby_dick.txt is a text file that contains the opening sentences of Moby Dick.
    # Open a file: file
    file = open('moby_dick.txt','r')
    
    # Print it
    print(file.read())
    
    # Check whether file is closed
    print(file.closed)
    
    # Close file
    file.close()
    
    # Check whether file is closed
    print(file.closed)
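
As a hedged alternative to calling close() by hand, the same check can be sketched with a context manager, which closes the file when the with block exits. The temporary file and its one-line content below are hypothetical stand-ins for moby_dick.txt so that the sketch runs anywhere.

```python
import os
import tempfile

# Hypothetical stand-in for moby_dick.txt so the sketch is self-contained
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt")
with open(path, "w") as f:
    f.write("Call me Ishmael.\n")

# The context manager closes the file automatically when the block exits
with open(path, "r") as file:
    text = file.read()
    closed_inside = file.closed  # still open inside the block

closed_after = file.closed  # closed once the block has exited

print(text)
print(closed_inside, closed_after)
```

With this pattern the file is closed even if an exception is raised while reading, which the manual open()/close() pair does not guarantee.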
    
    

    1.1.3 Importing text files line by line

    • For large files, you may not want to print all of the content to the shell; you may wish to print only the first few lines.
    • The readline() method does this: each call reads the next line of the file.
    # Read & print the first 3 lines
    with open('moby_dick.txt') as file:
        print(file.readline())
        print(file.readline())
        print(file.readline())
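
When the number of lines is a variable rather than a fixed three, repeated readline() calls become awkward. One possible sketch uses itertools.islice over the file object. The in-memory file and its content are hypothetical, chosen only so that the example runs without moby_dick.txt.

```python
import io
from itertools import islice

# In-memory stand-in (hypothetical content) for a large text file
fake_file = io.StringIO("line 1\nline 2\nline 3\nline 4\nline 5\n")

# islice stops iterating after n lines; later lines are not consumed
n = 3
first_lines = [line.rstrip("\n") for line in islice(fake_file, n)]

print(first_lines)
```

The rstrip("\n") call removes the trailing newline that each line keeps; without it, print() would show a blank line after every line, as the readline() version above does.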

    1.3.1 Using NumPy to import flat files

    • We'll load the MNIST digit recognition dataset using the NumPy function loadtxt().
    # Import packages
    import numpy as np
    import matplotlib.pyplot as plt
    
    # Assign filename to variable: file
    file = 'digits.csv'
    
    # Load file as array: digits
    digits = np.loadtxt(file, delimiter=',', skiprows=1)
    
    # Print datatype of digits
    print(type(digits))
    
    # Select and reshape a row
    im = digits[21, 1:]
    im_sq = np.reshape(im, (28, 28))
    
    # Plot reshaped data (matplotlib.pyplot already loaded as plt)
    plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
    plt.show()
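
The select-and-reshape step can be checked on synthetic data. This is a minimal sketch that assumes, as the slice digits[21, 1:] suggests, that the first column holds the label and the remaining 784 columns hold the pixels. The row values are made up for illustration.

```python
import numpy as np

# Hypothetical row in the assumed layout: label first, then 784 pixel values
row = np.concatenate(([7.0], np.arange(784, dtype=float)))

label = row[0]
pixels = row[1:]                      # drop the label column
image = np.reshape(pixels, (28, 28))  # row-major: the first 28 values form the top row

print(label, image.shape)
print(image[1, 0])  # first value of the second image row, i.e. pixels[28]
```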
    

    1.3.2 Customizing your NumPy import

    • np.loadtxt() takes several useful arguments:
    • delimiter sets the delimiter that loadtxt() expects:
        • ',' for comma-delimited files.
        • '\t' for tab-delimited files.
    • skiprows specifies how many rows (a count, not a list of indices) to skip from the top of the file.
    • usecols takes a list of the indices of the columns to keep.
    # Import numpy
    import numpy as np
    
    # Assign the filename: file
    file = 'digits_header.txt'
    
    # Load the data: data
    data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,3])
    
    # Print data
    print(data)
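
To see the effect of the three arguments without digits_header.txt, the same call can be sketched on a small in-memory table. The header names and values are hypothetical; np.loadtxt() accepts a file-like object such as io.StringIO in place of a filename.

```python
import io
import numpy as np

# Hypothetical tab-delimited content with one header row and four columns
raw = "a\tb\tc\td\n1\t2\t3\t4\n5\t6\t7\t8\n"

# Skip the header row and keep only the first and fourth columns
data = np.loadtxt(io.StringIO(raw), delimiter="\t", skiprows=1, usecols=[0, 3])

print(data)
print(data.shape)
```

The result has one row per remaining line and one column per index in usecols, so the shape here is (2, 2).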
    

    1.3.3 Importing different datatypes

    • These data consist of the percentage of sea slug larvae that had metamorphosed in a given time period.

    • The file has a text header, so importing it with np.loadtxt() using the default settings makes Python throw a ValueError saying that it could not convert a string to float.

    • There are two ways to deal with this:

    • First, you can set the data type argument dtype equal to str (for string).

    • Alternatively, you can skip the first row using the skiprows argument.

    import numpy as np
    import matplotlib.pyplot as plt
    
    # Assign filename: file
    file = 'seaslug.txt'
    
    # Import file: data
    data = np.loadtxt(file, delimiter='\t', dtype=str)
    
    # Print the first element of data
    print(data[0])
    
    # Import data as floats and skip the first row: data_float
    data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1) 
    
    # Print the 10th element of data_float
    print(data_float[9])
    
    # Plot a scatterplot of the data
    plt.scatter(data_float[:, 0], data_float[:, 1])
    plt.xlabel('time (min.)')
    plt.ylabel('percentage of larvae')
    plt.show()
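
The header problem and both workarounds can be reproduced on a tiny in-memory file. This is only a sketch: the column names and numbers are hypothetical and are not taken from seaslug.txt.

```python
import io
import numpy as np

# Hypothetical tab-delimited content with a text header
raw = "Time\tPercent\n99\t0.067\n99\t0.133\n"

# With default settings the header cannot be converted to float
try:
    np.loadtxt(io.StringIO(raw), delimiter="\t")
    header_error = None
except ValueError as err:
    header_error = err

# Option 1: read everything as strings, header included
as_str = np.loadtxt(io.StringIO(raw), delimiter="\t", dtype=str)

# Option 2: skip the header row and read the rest as floats
as_float = np.loadtxt(io.StringIO(raw), delimiter="\t", skiprows=1)

print(type(header_error).__name__)
print(as_str[0])
print(as_float)
```

Option 1 keeps the header but leaves every value as a string, so numeric work needs a later conversion. Option 2 gives a float array that is ready to plot.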