Skip to content

Introduction to Importing Data in Python

Run the hidden code cell below to import the data used in this course.


1 hidden cell

Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

Working with tabs in Excel

Parse the first sheet and rename the columns: df1

df1 = xls.parse(0, skiprows=[1], names=['Country', 'AAM due to War (2002)'])

Print the head of the DataFrame df1

print(df1.head())

Parse the first column of the second sheet and rename the column: df2

df2 = xls.parse(1, usecols=[0], skiprows=[1], names=['Country'])

Print the head of the DataFrame df2

print(df2.head())

skiprows, usecols and names should all be list type

Importing and visualizing stata

Import pandas

import pandas as pd

Load Stata file into a pandas DataFrame: df

df = pd.read_stata('disarea.dta')

Print the head of the DataFrame df

print(df.head())

Plot histogram of one column of the DataFrame

pd.DataFrame.hist(df[['disa10']]) plt.xlabel('Extent of disease') plt.ylabel('Number of countries') plt.show()

HDF5 files

Get the HDF5 group: group

group = data['strain']

Check out keys of group

for key in group.keys(): print(key)

Set variable equal to time series data: strain

strain = np.array(data['strain']['Strain'])

Set number of time points to sample: num_samples

num_samples = 10000

Set time vector

time = np.arange(0, 1, 1/num_samples)

Plot data

plt.plot(time, strain[:num_samples]) plt.xlabel('GPS Time (s)') plt.ylabel('strain') plt.show()

MATLAB

Print the keys of the MATLAB dictionary

print(mat.keys())

Print the type of the value corresponding to the key 'CYratioCyt'

print(type(mat['CYratioCyt']))

Print the shape of the value corresponding to the key 'CYratioCyt'

print(mat['CYratioCyt'].shape)

Subset the array and plot it

data = mat['CYratioCyt'][25, 5:] fig = plt.figure() plt.plot(data) plt.xlabel('time (min.)') plt.ylabel('normalized fluorescence (measure of expression)') plt.show()

Keys are names of matlab variables and values are the values of each variable

SQL languages are used to work with relational databases.

Import packages

from sqlalchemy import create_engine import pandas as pd

Create engine: engine

engine = create_engine('sqlite:///Chinook.sqlite')

Open engine connection: con

con = engine.connect()

Perform query: rs

rs = con.execute("SELECT * FROM Album")

Save results of the query to DataFrame: df

df = pd.DataFrame(rs.fetchall())

Close connection

con.close()

Print head of DataFrame df

print(df.head())

Open engine in context manager

Perform query and save results to DataFrame: df

with engine.connect() as con: rs = con.execute("SELECT LastName, Title FROM Employee") df = pd.DataFrame(rs.fetchmany(size=3)) df.columns = rs.keys()

Print the length of the DataFrame df

print(len(df))

Print the head of the DataFrame df

print(df.head())

Create engine: engine

engine = create_engine('sqlite:///Chinook.sqlite')

Open engine in context manager

Perform query and save results to DataFrame: df

with engine.connect() as con: rs = con.execute("SELECT * FROM Employee WHERE EmployeeId >= 6") df = pd.DataFrame(rs.fetchall()) df.columns = rs.keys()

Print the head of the DataFrame df

print(df.head())

Create engine: engine

engine = create_engine('sqlite:///Chinook.sqlite')

Open engine in context manager

with engine.connect() as con: rs = con.execute("SELECT * FROM Employee ORDER BY BirthDate") df = pd.DataFrame(rs.fetchall())

# Set the DataFrame's column names df.columns = rs.keys()

Print head of DataFrame

print(df.head())

Import packages

from sqlalchemy import create_engine import pandas as pd

Create engine: engine

engine = create_engine('sqlite:///Chinook.sqlite')

Execute query and store records in DataFrame: df

df = pd.read_sql_query("SELECT * FROM Album", engine)

Print head of DataFrame

print(df.head())

Open engine in context manager and store query result in df1

with engine.connect() as con: rs = con.execute("SELECT * FROM Album") df1 = pd.DataFrame(rs.fetchall()) df1.columns = rs.keys()

Confirm that both methods yield the same result

print(df.equals(df1))

Import packages

from sqlalchemy import create_engine import pandas as pd

Create engine: engine

engine = create_engine('sqlite:///Chinook.sqlite')

Execute query and store records in DataFrame: df

df = pd.read_sql_query("SELECT * FROM Employee WHERE EmployeeId >=6 ORDER BY BirthDate", engine)

Print head of DataFrame

print(df.head())

The power of SQL lies in relationships between tables: INNER JOIN Here, you'll perform your first INNER JOIN! You'll be working with your favourite SQLite database, Chinook.sqlite. For each record in the Album table, you'll extract the Title along with the Name of the Artist. The latter will come from the Artist table and so you will need to INNER JOIN these two tables on the ArtistID column of both.

Recall that to INNER JOIN the Orders and Customers tables from the Northwind database, Hugo executed the following SQL query:

"SELECT OrderID, CompanyName FROM Orders INNER JOIN Customers on Orders.CustomerID = Customers.CustomerID" The following code has already been executed to import the necessary packages and to create the engine:

import pandas as pd from sqlalchemy import create_engine engine = create_engine('sqlite:///Chinook.sqlite') Instructions 100 XP Assign to rs the results from the following query: select all the records, extracting the Title of the record and Name of the artist of each record from the Album table and the Artist table, respectively. To do so, INNER JOIN these two tables on the ArtistID column of both. In a call to pd.DataFrame(), apply the method fetchall() to rs in order to fetch all records in rs. Store them in the DataFrame df. Set the DataFrame's column names to the corresponding names of the table columns.

Open engine in context manager

Perform query and save results to DataFrame: df

with engine.connect() as con: rs = con.execute("SELECT Title, Name FROM Album INNER JOIN Artist on Album.ArtistID = Artist.ArtistID") df = pd.DataFrame(rs.fetchall()) df.columns = rs.keys()

Print head of DataFrame df

print(df.head())

Execute query and store records in DataFrame: df

df = pd.read_sql_query("SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000", engine)

Print head of DataFrame

print(df.head())

# Add your code snippets here

Explore Datasets

Try importing the remaining files to explore the data and practice your skills!

  • datasets/disarea.dta
  • datasets/ja_data2.mat
  • datasets/L-L1_LOSC_4_V1-1126259446-32.hdf5
  • datasets/mnist_kaggle_some_rows.csv
  • datasets/sales.sas7bdat