Introduction to Importing Data in Python

    Run the hidden code cell below to import the data used in this course.

    # Import the course packages
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.io
    import h5py
    from sas7bdat import SAS7BDAT
    from sqlalchemy import create_engine, text
    import pickle
    
    # Import the course datasets
    titanic = pd.read_csv("datasets/titanic_sub.csv")
    battledeath_2002 = pd.ExcelFile("datasets/battledeath.xlsx").parse("2002")
    engine = create_engine('sqlite:///datasets/Chinook.sqlite')
    con = engine.connect()
    rs = con.execute(text('SELECT * FROM Album'))  # text() is required in SQLAlchemy 1.4+
    chinook = pd.DataFrame(rs.fetchall(), columns=rs.keys())
    seaslug = np.loadtxt("datasets/seaslug.txt", delimiter="\t", dtype=str)

    Explore Datasets

    Try importing the remaining files to explore the data and practice your skills!

    • datasets/disarea.dta
    • datasets/ja_data2.mat
    • datasets/L-L1_LOSC_4_V1-1126259446-32.hdf5
    • datasets/mnist_kaggle_some_rows.csv
    • datasets/sales.sas7bdat

    Learn to import data into Python from various sources, including flat files, software-native files (Excel, Stata, SAS, MATLAB), and relational databases.

    Introduction and flat files

    Course Overview

    Importing Data Sources

    • Flat Files: .txt, .csv
    • Software-Native Files: Excel, Stata, SAS, MATLAB
    • Relational Databases: SQLite, PostgreSQL

    Learning Objectives

    • Plain Text Files: Understand how to import and read plain text and table data (e.g., titanic.csv).
    • Reading and Writing Files: Learn to use Python's open function for reading ('r') and writing ('w') files. Emphasize the importance of closing files or using a context manager (with statement) for automatic file handling.
    • Practical Exercises: Practice printing files to the console, reading specific lines, and using NumPy for handling numerical data in flat files.

    Key Concepts

    • Flat Files: Characterized by tabular data where each row is a record and each column a feature.
    • Context Manager: Recommended for file operations to ensure proper resource management without manual closure.
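As a small illustration of these concepts, here is a minimal, self-contained sketch of reading and writing with a context manager (the filename example.txt is hypothetical):

```python
# Write a small file; the with statement closes it automatically
with open('example.txt', 'w') as file:
    file.write('100 200\n300 400\n')

# Read it back the same way
with open('example.txt', 'r') as file:
    contents = file.read()

print(contents)
print(file.closed)  # True: the context manager closed the file for us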

    Exploring the Working Directory

    To import data into Python, it helps to know which files are in your working directory. IPython provides access to the system shell: prefix a command with !, for example ! ls, to list the contents of the current directory.
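If you are working outside IPython, the standard library offers a portable equivalent; this is a small sketch using os, not part of the course code:

```python
import os

# Print the current working directory and the files it contains,
# a portable alternative to IPython's ! pwd and ! ls
print(os.getcwd())
print(os.listdir('.'))
```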

    Importing Text Files

    This exercise involves moby_dick.txt, which contains the opening sentences of Moby Dick; in this workspace, seaslug.txt stands in for it. The goal is to open, print, and close a text file.

    # Open a file: file
    file = open('datasets/seaslug.txt', 'r')
    
    # Print it
    print(file.read())
    
    # Check whether file is closed
    print(file.closed) 
    
    # Close file
    file.close()
    
    # Check whether file is closed
    print(file.closed)

    Importing Text Files Line by Line

    For large files, it is more memory-efficient to read and print them line by line with file.readline(). Inside a context manager (a with statement), you can read lines without worrying about closing the file:
    # Read & print the first 3 lines
    with open('datasets/seaslug.txt') as file:
        print(file.readline())
        print(file.readline())
        print(file.readline())
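Beyond readline(), a file object is itself an iterator over its lines, so a for loop reads a file of any size without loading it all into memory. A minimal sketch, where tiny.txt is a hypothetical stand-in for seaslug.txt:

```python
# Create a small stand-in file so the sketch is self-contained
with open('tiny.txt', 'w') as file:
    file.write('Time Percent\n99 0.067\n99 0.133\n')

# Iterate over the file object directly: one line per loop, low memory use
with open('tiny.txt') as file:
    for line in file:
        print(line, end='')
```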
    1. Introduction to Flat Files

      • Introduction to importing plain text files.
      • Overview of flat files like 'titanic.csv'.
      • Flat files structure: rows represent unique entities (e.g., passengers), columns represent attributes (e.g., gender, cabin).
    2. Understanding Flat Files

      • Definition: Flat files are simple text files without structured relationships, containing records (rows) with fields (columns) that hold data.
      • Example: In 'titanic.csv', each row is a unique passenger, columns include name, gender, cabin.
    3. Headers and File Extensions

      • Headers describe column contents and are crucial for data import.
      • File extensions like .csv (comma-separated values) and .txt (text files) indicate the delimiter used.
    4. Importing Flat Files

      • Methods: Use numpy for numerical data arrays, pandas for dataframes. Covers importing files with both numerical and string data, such as 'titanic.csv'.
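To contrast the two methods, here is a hedged sketch; mini.csv is a hypothetical file created on the fly, not a course dataset:

```python
import numpy as np
import pandas as pd

# A tiny CSV with a header and mixed columns, standing in for titanic.csv
with open('mini.csv', 'w') as file:
    file.write('name,age\nAlice,29\nBob,35\n')

# pandas handles headers and mixed types, returning a DataFrame
df = pd.read_csv('mini.csv')
print(df)

# NumPy suits purely numerical data: here, just the age column
ages = np.loadtxt('mini.csv', delimiter=',', skiprows=1, usecols=[1], dtype=int)
print(ages)
```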

    Flat Files & The Zen of Python

    Python has many Python Enhancement Proposals (PEPs), including PEP 8, the coding style guide, and PEP 20, known as the Zen of Python, which states Python's design principles as 20 aphorisms (only 19 are written down). To see them, including the fifth, "Flat is better than nested," run import this in your shell.

    import this
    1. Importing Flat Files Using NumPy

      • To import numerical data as a numpy array, you can use NumPy, which is efficient and the standard for numerical data in Python.
    2. Why NumPy?

      • Numpy arrays are fast and clean, making them ideal for storing numerical data. They are also required by many other Python packages, such as scikit-learn for Machine Learning.
    3. NumPy Functions: loadtxt and genfromtxt

      • To import data, use NumPy's loadtxt or genfromtxt functions. You'll need to import NumPy and then call loadtxt, specifying the filename and the delimiter (default is white space).
    4. Customizing Your NumPy Import

      • You can customize imports by skipping rows (e.g., skiprows=1 for headers) or selecting specific columns with usecols=[0, 2]. To import data as strings, set dtype='str'.
    5. Handling Mixed Datatypes

      • While loadtxt is suitable for basic cases, it struggles with mixed datatypes, such as datasets with both floats and strings like the Titanic dataset.
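The loadtxt options and the mixed-datatype limitation above can be sketched in one self-contained example; data.txt and mixed.csv are hypothetical files created on the fly:

```python
import numpy as np

# Hypothetical tab-delimited file with a header row and three numeric columns
with open('data.txt', 'w') as file:
    file.write('a\tb\tc\n1\t2\t3\n4\t5\t6\n')

# loadtxt: skip the header row and keep only columns 0 and 2
arr = np.loadtxt('data.txt', delimiter='\t', skiprows=1, usecols=[0, 2])
print(arr)

# Importing everything as strings keeps the header but loses numeric types
strs = np.loadtxt('data.txt', delimiter='\t', dtype=str)
print(strs)

# Hypothetical CSV with mixed string/float columns, like the Titanic data
with open('mixed.csv', 'w') as file:
    file.write('name,fare\nBraund,7.25\nCumings,71.28\n')

# genfromtxt handles mixed datatypes: dtype=None infers a type per column,
# names=True reads the header, and the result is a structured array
data = np.genfromtxt('mixed.csv', delimiter=',', dtype=None,
                     names=True, encoding='utf-8')
print(data['name'], data['fare'])
```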