Introduction to Importing Data in Python
Run the hidden code cell below to import the data used in this course.
1 hidden cell
Take Notes
Add notes about the concepts you've learned and code cells with code you want to keep.
Working with tabs in Excel
Parse the first sheet and rename the columns: df1
df1 = xls.parse(0, skiprows=[1], names=['Country', 'AAM due to War (2002)'])
Print the head of the DataFrame df1
print(df1.head())
Parse the first column of the second sheet and rename the column: df2
df2 = xls.parse(1, usecols=[0], skiprows=[1], names=['Country'])
Print the head of the DataFrame df2
print(df2.head())
skiprows, usecols and names should all be list type
Importing and visualizing stata
Import pandas
import pandas as pd
Load Stata file into a pandas DataFrame: df
df = pd.read_stata('disarea.dta')
Print the head of the DataFrame df
print(df.head())
Plot histogram of one column of the DataFrame
pd.DataFrame.hist(df[['disa10']]) plt.xlabel('Extent of disease') plt.ylabel('Number of countries') plt.show()
HDF5 files
Get the HDF5 group: group
group = data['strain']
Check out keys of group
for key in group.keys(): print(key)
Set variable equal to time series data: strain
strain = np.array(data['strain']['Strain'])
Set number of time points to sample: num_samples
num_samples = 10000
Set time vector
time = np.arange(0, 1, 1/num_samples)
Plot data
plt.plot(time, strain[:num_samples]) plt.xlabel('GPS Time (s)') plt.ylabel('strain') plt.show()
MATLAB
Print the keys of the MATLAB dictionary
print(mat.keys())
Print the type of the value corresponding to the key 'CYratioCyt'
print(type(mat['CYratioCyt']))
Print the shape of the value corresponding to the key 'CYratioCyt'
print(mat['CYratioCyt'].shape)
Subset the array and plot it
data = mat['CYratioCyt'][25, 5:] fig = plt.figure() plt.plot(data) plt.xlabel('time (min.)') plt.ylabel('normalized fluorescence (measure of expression)') plt.show()
Keys are names of matlab variables and values are the values of each variable
SQL languages are used to work with relational databases.
Import packages
from sqlalchemy import create_engine import pandas as pd
Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
Open engine connection: con
con = engine.connect()
Perform query: rs
rs = con.execute("SELECT * FROM Album")
Save results of the query to DataFrame: df
df = pd.DataFrame(rs.fetchall())
Close connection
con.close()
Print head of DataFrame df
print(df.head())
Open engine in context manager
Perform query and save results to DataFrame: df
with engine.connect() as con: rs = con.execute("SELECT LastName, Title FROM Employee") df = pd.DataFrame(rs.fetchmany(size=3)) df.columns = rs.keys()
Print the length of the DataFrame df
print(len(df))
Print the head of the DataFrame df
print(df.head())
Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
Open engine in context manager
Perform query and save results to DataFrame: df
with engine.connect() as con: rs = con.execute("SELECT * FROM Employee WHERE EmployeeId >= 6") df = pd.DataFrame(rs.fetchall()) df.columns = rs.keys()
Print the head of the DataFrame df
print(df.head())
Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
Open engine in context manager
with engine.connect() as con: rs = con.execute("SELECT * FROM Employee ORDER BY BirthDate") df = pd.DataFrame(rs.fetchall())
# Set the DataFrame's column names df.columns = rs.keys()
Print head of DataFrame
print(df.head())
Import packages
from sqlalchemy import create_engine import pandas as pd
Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM Album", engine)
Print head of DataFrame
print(df.head())
Open engine in context manager and store query result in df1
with engine.connect() as con: rs = con.execute("SELECT * FROM Album") df1 = pd.DataFrame(rs.fetchall()) df1.columns = rs.keys()
Confirm that both methods yield the same result
print(df.equals(df1))
Import packages
from sqlalchemy import create_engine import pandas as pd
Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM Employee WHERE EmployeeId >=6 ORDER BY BirthDate", engine)
Print head of DataFrame
print(df.head())
The power of SQL lies in relationships between tables: INNER JOIN Here, you'll perform your first INNER JOIN! You'll be working with your favourite SQLite database, Chinook.sqlite. For each record in the Album table, you'll extract the Title along with the Name of the Artist. The latter will come from the Artist table and so you will need to INNER JOIN these two tables on the ArtistID column of both.
Recall that to INNER JOIN the Orders and Customers tables from the Northwind database, Hugo executed the following SQL query:
"SELECT OrderID, CompanyName FROM Orders INNER JOIN Customers on Orders.CustomerID = Customers.CustomerID" The following code has already been executed to import the necessary packages and to create the engine:
import pandas as pd from sqlalchemy import create_engine engine = create_engine('sqlite:///Chinook.sqlite') Instructions 100 XP Assign to rs the results from the following query: select all the records, extracting the Title of the record and Name of the artist of each record from the Album table and the Artist table, respectively. To do so, INNER JOIN these two tables on the ArtistID column of both. In a call to pd.DataFrame(), apply the method fetchall() to rs in order to fetch all records in rs. Store them in the DataFrame df. Set the DataFrame's column names to the corresponding names of the table columns.
Open engine in context manager
Perform query and save results to DataFrame: df
with engine.connect() as con: rs = con.execute("SELECT Title, Name FROM Album INNER JOIN Artist on Album.ArtistID = Artist.ArtistID") df = pd.DataFrame(rs.fetchall()) df.columns = rs.keys()
Print head of DataFrame df
print(df.head())
Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000", engine)
Print head of DataFrame
print(df.head())
# Add your code snippets hereExplore Datasets
Try importing the remaining files to explore the data and practice your skills!
datasets/disarea.dtadatasets/ja_data2.matdatasets/L-L1_LOSC_4_V1-1126259446-32.hdf5datasets/mnist_kaggle_some_rows.csvdatasets/sales.sas7bdat