Introduction to Importing Data in Python

Run the hidden code cell below to import the data used in this course.

# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import h5py
from sas7bdat import SAS7BDAT
from sqlalchemy import create_engine, text
import pickle

# Import the course datasets
titanic = pd.read_csv("datasets/titanic_sub.csv")
battledeath_2002 = pd.ExcelFile("datasets/battledeath.xlsx").parse("2002")
engine = create_engine('sqlite:///datasets/Chinook.sqlite')
con = engine.connect()
rs = con.execute(text('SELECT * FROM Album'))
chinook = pd.DataFrame(rs.fetchall(), columns=rs.keys())
seaslug = np.loadtxt("datasets/seaslug.txt", delimiter="\t", dtype=str)

Explore Datasets

Try importing the remaining files to explore the data and practice your skills!

  • datasets/disarea.dta
  • datasets/ja_data2.mat
  • datasets/L-L1_LOSC_4_V1-1126259446-32.hdf5
  • datasets/mnist_kaggle_some_rows.csv
  • datasets/sales.sas7bdat
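
One way to read each of the files listed above is sketched below. This is a minimal, hedged sketch that assumes the setup cell above has already been run (so pandas, scipy.io, h5py, and SAS7BDAT are imported); the variable names are illustrative.

# Stata file: pandas reads .dta directly
disarea = pd.read_stata('datasets/disarea.dta')

# MATLAB file: loadmat returns a dict mapping variable names to arrays
mat = scipy.io.loadmat('datasets/ja_data2.mat')

# HDF5 file: h5py.File exposes groups and datasets
ligo = h5py.File('datasets/L-L1_LOSC_4_V1-1126259446-32.hdf5', 'r')

# Plain CSV of MNIST rows: pandas handles it like any flat file
mnist = pd.read_csv('datasets/mnist_kaggle_some_rows.csv')

# SAS file: the SAS7BDAT reader works as a context manager
with SAS7BDAT('datasets/sales.sas7bdat') as f:
    sales = f.to_data_frame()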

Learn to import data into Python from various sources, including flat files, software-native files (Excel, Stata, SAS, MATLAB), and relational databases. This course covers reading these formats with Python's built-in file handling, NumPy, pandas, and SQLAlchemy.

Introduction and flat files

Course Overview

Importing Data Sources

  • Flat Files: .txt, .csv
  • Software-Native Files: Excel, Stata, SAS, MATLAB
  • Relational Databases: SQLite, PostgreSQL

Learning Objectives

  • Plain Text Files: Understand how to import and read plain text and table data (e.g., titanic.csv).
  • Reading and Writing Files: Learn to use Python's open() function for reading ('r') and writing ('w') files. Emphasize the importance of closing files, or use a context manager (the with statement) for automatic file handling (see the sketch after this list).
  • Practical Exercises: Practice printing files to the console, reading specific lines, and using NumPy for handling numerical data in flat files.
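
As a quick illustration of the open() modes and the context manager mentioned above, here is a minimal sketch; the file name example.txt is made up for demonstration.

# Write a small file ('w' mode creates or overwrites it)
with open('example.txt', 'w') as file:
    file.write('Hello, flat files!\n')

# Read it back; the context manager closes the file automatically
with open('example.txt', 'r') as file:
    print(file.read())
    print(file.closed)   # False while inside the with block

print(file.closed)       # True once the block has exited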

Key Concepts

  • Flat Files: Characterized by tabular data where each row is a record and each column a feature.
  • Context Manager: Recommended for file operations to ensure proper resource management without manual closure.

Exploring the Working Directory

To import data into Python, it is crucial to know which files are in your working directory. IPython offers system shell access by prefixing commands with !, such as !ls to list directory contents. Use !ls to identify the files in your current directory.
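
For example, you can run !ls datasets inside IPython or a Jupyter notebook; in plain Python, os.listdir offers an equivalent, as in this small sketch (the datasets folder name comes from this workspace).

import os

# Equivalent to `!ls datasets` in an IPython shell
print(os.listdir('datasets'))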

Importing Text Files

The course exercise uses moby_dick.txt, which contains the opening sentences of Moby Dick; here the same steps are applied to datasets/seaslug.txt from this workspace. Learn to open, print, and close a text file.

# Open a file: file
file = open('datasets/seaslug.txt', 'r')

# Print it
print(file.read())

# Check whether file is closed
print(file.closed) 

# Close file
file.close()

# Check whether file is closed
print(file.closed)

Importing Text Files Line by Line

For handling large files, it's efficient to read and print them line by line. Use file.readline() to achieve this. Within a context manager, you can easily read lines without worrying about closing the file:

with open('moby_dick.txt') as file:
    print(file.readline())

# Read & print the first 3 lines
with open('datasets/seaslug.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
  1. Introduction to Flat Files

    • Introduction to importing plain text files.
    • Overview of flat files like 'titanic.csv'.
    • Flat files structure: rows represent unique entities (e.g., passengers), columns represent attributes (e.g., gender, cabin).
  2. Understanding Flat Files

    • Definition: Flat files are simple text files without structured relationships between tables (unlike relational databases); they contain records (rows) with fields (columns) that hold data.
    • Example: In 'titanic.csv', each row is a unique passenger, columns include name, gender, cabin.
  3. Headers and File Extensions

    • Headers describe column contents and are crucial for data import.
    • File extensions like .csv (comma-separated values) and .txt (text files) indicate the delimiter used.
  4. Importing Flat Files

    • Methods: Use NumPy for numerical data arrays and pandas for DataFrames. This covers importing files with both numerical and string data, such as 'titanic.csv' (see the sketch after this list).
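
A brief sketch of both approaches, assuming the setup cell above has been run so pandas and NumPy are available; the printed output depends on the files themselves.

# Mixed numerical and string data: use a pandas DataFrame
titanic = pd.read_csv('datasets/titanic_sub.csv')
print(titanic.head())

# Data of a single type: a NumPy array is enough
seaslug = np.loadtxt('datasets/seaslug.txt', delimiter='\t', dtype=str)
print(seaslug[:5])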

Flat Files & The Zen of Python

Python features many Python Enhancement Proposals (PEPs), including PEP8 for coding style and PEP20, known as the Zen of Python, which outlines Python's design principles as 20 aphorisms (only 19 of them written down). To see these principles, including the fifth aphorism, 'Flat is better than nested', run import this in your shell.

import this
  1. Importing Flat Files Using NumPy

    • NumPy is the standard for numerical data in Python and provides an efficient way to import numerical data as arrays.
  2. Why NumPy?

    • NumPy arrays are fast and clean, making them ideal for storing numerical data. They are also required by many other Python packages, such as scikit-learn for Machine Learning.
  3. NumPy Functions: loadtxt and genfromtxt

    • To import data, use NumPy's loadtxt or genfromtxt functions. You'll need to import NumPy and then call loadtxt, specifying the filename and the delimiter (the default is whitespace).
  4. Customizing Your NumPy Import

    • You can customize imports by skipping rows (e.g., skiprows=1 for headers) or selecting specific columns with usecols=[0, 2]. To import data as strings, set dtype='str'.
  5. Handling Mixed Datatypes

    • While loadtxt is suitable for basic cases, it struggles with mixed datatypes, such as datasets containing both floats and strings like the Titanic dataset; genfromtxt handles these cases (see the sketch below).
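
A hedged sketch of these options, assuming seaslug.txt has a one-row header with a numeric first column and titanic_sub.csv is comma-delimited with a header row, and that the setup cell above has been run.

# Skip the header row and keep only the first column, as floats
data_float = np.loadtxt('datasets/seaslug.txt', delimiter='\t',
                        skiprows=1, usecols=[0], dtype=float)
print(data_float[:5])

# Mixed floats and strings: genfromtxt with dtype=None infers a type
# per column and returns a structured array
titanic_arr = np.genfromtxt('datasets/titanic_sub.csv', delimiter=',',
                            names=True, dtype=None, encoding='utf-8')
print(titanic_arr.dtype.names)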