Introduction to Importing Data in Python
Explore Datasets
Try importing the remaining files to explore the data and practice your skills!
- datasets/disarea.dta
- datasets/ja_data2.mat
- datasets/L-L1_LOSC_4_V1-1126259446-32.hdf5
- datasets/mnist_kaggle_some_rows.csv
- datasets/sales.sas7bdat
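As a starting point, here is a minimal sketch of functions that can read each of these formats, assuming pandas, scipy, and h5py are installed; later chapters cover these file types in depth.

# Sketch: one reader per file format (paths from the list above)
import pandas as pd
import scipy.io
import h5py

df_stata = pd.read_stata('datasets/disarea.dta')                    # Stata file
mat = scipy.io.loadmat('datasets/ja_data2.mat')                     # MATLAB file
h5 = h5py.File('datasets/L-L1_LOSC_4_V1-1126259446-32.hdf5', 'r')   # HDF5 file
df_csv = pd.read_csv('datasets/mnist_kaggle_some_rows.csv')         # CSV file
df_sas = pd.read_sas('datasets/sales.sas7bdat')                     # SAS file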
Importing Entire Text Files
You can use the built-in open() function to open and read files.
# Open a file: file
filename = 'moby_dick.txt'
file = open(filename, mode='r')

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check whether file is closed
print(file.closed)
Context Managers and the with Statement
A context manager is a Python object that defines the runtime context to be established when executing a with statement. The with statement wraps the execution of a block of code with methods defined by the context manager, ensuring that the context is entered before the block executes and exited afterwards, even if an exception is raised. For files, this means the file is closed automatically.
Here is an example of using the with statement to read and print the first 3 lines of a text file:
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
Flat files
Flat files are basic text files containing records, that is, table data, without structured relationships. This is in contrast to a relational database, for example, in which columns of distinct tables can be related. We'll get to these later. To be more precise, flat files consist of records, where a record is a row of fields or attributes, each of which contains at most one item of information. In the flat file titanic.csv, each row is a single record and each column is a field or attribute of that record.
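As a minimal sketch (assuming titanic.csv is in the working directory, as in the examples below), you can peek at this structure directly: the first line holds the field names and each subsequent line holds one record.

# Peek at a flat file's structure: header row, then one record per line
with open('titanic.csv') as f:
    print(f.readline())  # field names, separated by commas
    print(f.readline())  # the first record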
Importing flat files using NumPy
NumPy arrays are the standard for storing and importing numerical data in Python.
# Import numpy
import numpy as np

# Assign the filename: file
file = 'digits_header.txt'

# Load the data: data
# The delimiter of this file is a tab; skip the first row (column names)
# and use only the first and third columns
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0, 2])

# Print data
print(data)
Another example is shown below. Due to the header, if you tried to import the flat file as-is using np.loadtxt(), Python would raise a ValueError telling you it could not convert string to float. There are two ways to deal with this: first, you can set the data type argument dtype equal to str (for string).
Alternatively, you can skip the first row as we have seen before, using the skiprows argument.
# Assign filename: file
file = 'seaslug.txt'

# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)

# Print the first element of data
print(data[0])

# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)

# Print the 10th element of data_float
print(data_float[9])
Working with mixed datatypes in NumPy
Much of the time you will need to import datasets which have different datatypes in different columns; one column may contain strings and another floats, for example. The function np.loadtxt() will fail on this. There is another function, np.genfromtxt(), which can handle such structures. If we pass dtype=None to it, it will figure out what type each column should be.
Import 'titanic.csv' using the function np.genfromtxt() as follows:
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
Here, the first argument is the filename, the second specifies the delimiter ',', and the third argument, names, tells the function that there is a header. Because the data are of different types, data is an object called a structured array. Because NumPy arrays have to contain elements that are all the same type, the structured array solves this by being a 1D array, where each element of the array is a row of the flat file imported. You can test this by checking out the array's shape in the shell by executing np.shape(data).
Accessing rows and columns of structured arrays is super-intuitive: to get the ith row, merely execute data[i] and to get the column with name 'Fare', execute data['Fare'].
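For instance, a minimal sketch using the data array built with the np.genfromtxt() call above:

# data from: np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
print(np.shape(data))   # 1D: one element per row of the file
print(data[0])          # the first row, as one structured element
print(data['Fare'])     # every value in the 'Fare' column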
There is also another function np.recfromcsv() that behaves similarly to np.genfromtxt(), except that its default dtype is None. In this exercise, you'll practice using this to achieve the same result.
You'll only need to pass file to np.recfromcsv() because it has the defaults delimiter=',' and names=True in addition to dtype=None!
Here is an example:
# Assign the filename: file
file = 'titanic.csv'

# Import file using np.recfromcsv: d
d = np.recfromcsv(file)

# Print out first three entries of d
print(d[:3])
Importing flat files using pandas
The DataFrame object in pandas is a more appropriate structure in which to store such data and, thankfully, we can easily import files of mixed data types as DataFrames using the pandas functions read_csv() and read_table(). A generic example is the following:
# Import pandas as pd
import pandas as pd

# Assign the filename: file
file = 'titanic.csv'

# Read the file into a DataFrame: df
df = pd.read_csv(file)
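A quick sanity check on the import, using the df from the snippet above:

# View the first five rows of the DataFrame
print(df.head())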
For the pd.read_csv() function:

- The nrows argument is an optional parameter that specifies the number of rows to read from the CSV file. By default, it is set to None, which means it reads all the rows from the file. If you want to read only a specific number of rows, you can set nrows to the desired value. This can be useful when working with large CSV files and you want to read only a subset of the data.
- The header argument is an optional parameter that indicates which row to use as the column names (header) for the DataFrame. By default, it is set to 'infer', which means pandas will try to automatically detect the header row. If you pass an integer value to the header argument, it specifies the row index (0-based) to be used as the header. If you set header=None, it indicates that there are no column names in the file, and pandas will generate default column names (0, 1, 2, etc.).
Example:

# Assign the filename: file
file = 'digits.csv'

# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)

# Build a numpy array from the DataFrame: data_array
data_array = np.array(data)

# Print the datatype of data_array to the shell
print(type(data_array))
Introduction to other file types
Introduction to Pickled Files in Python
There are a number of datatypes that cannot be saved easily to flat files, such as lists and dictionaries. If you want your files to be human readable, you may want to save them as text files in a clever manner. JSONs, which you will see in a later chapter, are appropriate for Python dictionaries.
However, if you merely want to be able to import them into Python, you can serialize them. All this means is converting the object into a sequence of bytes, or a bytestream.
Pickling is the process of converting a Python object into a byte stream to store it in a file or memory. This byte stream can be used later to reconstruct the original object. Pickling is useful when you want to save the state of your program or transfer data over a network.
Python provides the pickle module to perform pickling and unpickling. The pickle module can handle almost any Python object, including lists, dictionaries, functions, and classes.
Here's an example of how to pickle an object:
import pickle

# create an object to pickle
data = {'name': 'John', 'age': 30, 'city': 'New York'}

# open a file to write the pickled data
with open('data.pickle', 'wb') as f:
    pickle.dump(data, f)
Here is an example of how to unpickle an object (note the second argument to open(), 'rb', which opens the file for reading in binary mode):
# Import pickle package
import pickle

# Open pickle file and load data: d
with open('data.pkl', 'rb') as file:
    d = pickle.load(file)
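To confirm the round trip worked, assuming data.pkl contains a pickled dictionary like the one written above, you can print the loaded object and its type:

# Print the unpickled data and confirm it is a dict
print(d)
print(type(d))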