Streamlined Data Ingestion with pandas
Run the hidden code cell below to import the packages used in this course.
# Importing the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas import json_normalize  # pandas.io.json.json_normalize was deprecated in pandas 1.0 and later removed
from sqlalchemy import create_engine
Take Notes
Add notes about the concepts you've learned and code cells with code you want to keep.
Set data types at CSV read time by passing a dictionary of column names to types via the dtype argument.
# Create dict specifying data types for agi_stub and zipcode
data_types = {'zipcode': 'str',
              'agi_stub': 'category'}

# Load csv using dtype to set correct data types
data = pd.read_csv("vt_tax_data_2016.csv", dtype=data_types)

# Print data types of resulting frame
print(data.dtypes.head())
If our Excel workbook has multiple sheets of similar format, we can import them all into a dictionary where the keys are the sheet names and the values are the dataframes (by passing sheet_name=None to read_excel). We can then extract the values into a single dataframe with the code below and use that dataframe for analysis.
# Read every sheet into a dict of dataframes keyed by sheet name
responses = pd.read_excel("fcc-new-coder-survey.xlsx", sheet_name=None)
print(type(responses))

# Create an empty dataframe
all_responses = pd.DataFrame()

# Set up for loop to iterate through values in responses
for df in responses.values():
    # Print the number of rows being added
    print("Adding {} rows".format(df.shape[0]))
    # Concatenate df onto all_responses (DataFrame.append was removed in pandas 2.0)
    all_responses = pd.concat([all_responses, df])
Explore Datasets
Try using the prompt below to explore the data and practice your skills!
There are three data files of different kinds in the datasets/ directory: data.db (NYC weather and 311 housing complaints), fcc-new-coder-survey.xlsx (a subset of FreeCodeCamp New Developer Survey responses), and vt_tax_data_2016.csv (Vermont tax return data by ZIP code).
Import each of these files into a form useful for data analysis in Python: specify data types, handle bad or missing data, and parse dates. You can also practice importing a specific portion of a file (e.g., certain sheets of an Excel workbook, or using SQL to filter a database on conditions).
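For the database file, a minimal sketch using SQLAlchemy and read_sql; the table name hpd311calls and the borough filter are assumptions about the database's contents.
# Create a SQLite engine pointing at the local database file
engine = create_engine("sqlite:///datasets/data.db")

# Filter rows in SQL before loading them into pandas;
# table and column names here are assumptions
query = """
SELECT *
  FROM hpd311calls
 WHERE borough = 'BROOKLYN';
"""
brooklyn_calls = pd.read_sql(query, engine)
print(brooklyn_calls.shape)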
We can parse standard-format datetime columns at read time with the parse_dates argument. For columns holding non-standard datetime strings, we load them as text and convert them with pd.to_datetime(), supplying a format string that describes the values.
import pandas as pd

# Convert a non-standard datetime column with an explicit format string
survey_data["Part2EndTime"] = pd.to_datetime(survey_data["Part2EndTime"],
                                             format="%m%d%Y %H:%M:%S")
We can set up the parameters and headers of an API request per the specific API's requirements, convert the response to JSON, and load it into a dataframe.
import requests

# api_url, params, and api_key are defined earlier per the API's documentation

# Create dictionary that passes Authorization and key string
headers = {'Authorization': "Bearer {}".format(api_key)}

# Query the Yelp API with headers and params set
response = requests.get(api_url, params=params, headers=headers)

# Extract JSON data from response
data = response.json()

# Load "businesses" values to a dataframe and print names
cafes = pd.DataFrame(data['businesses'])
print(cafes.name)
We can flatten nested JSON with json_normalize (formerly imported from pandas.io.json; available as pandas.json_normalize since pandas 1.0). For nested structures, we can specify the record_path, the separator type, and meta to fetch key-value pairs from the enclosing levels.
# Load other business attributes and set meta prefix
flat_cafes = json_normalize(data["businesses"],
                            sep="_",
                            record_path="categories",
                            meta=['name',
                                  'alias',
                                  'rating',
                                  ['coordinates', 'latitude'],
                                  ['coordinates', 'longitude']],
                            meta_prefix='biz_')

# View the data
print(flat_cafes.head())
APIs often return only a limited number of records per call, so we set an offset keyword in the params dict and request the records repeatedly over a range. We can also merge dataframes on certain columns; if the key columns are named differently in the two dataframes, we must specify both names with left_on and right_on. A sketch of both follows.
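As a sketch, paging through the Yelp API with offset and then merging on differently named key columns; the limit/offset values, the crosswalk dataframe, and the column names location_zip_code and zipcode are assumptions.
# Request 100 results in pages of 20 via the offset parameter
params = {"term": "cafe", "location": "NYC", "limit": 20}

pages = []
for offset in range(0, 100, 20):
    params["offset"] = offset
    response = requests.get(api_url, params=params, headers=headers)
    pages.append(pd.DataFrame(response.json()["businesses"]))

cafes = pd.concat(pages, ignore_index=True)

# Merge on differently named key columns with left_on/right_on;
# crosswalk and its column names are hypothetical
merged = cafes.merge(crosswalk,
                     left_on="location_zip_code",
                     right_on="zipcode")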