
Streamlined Data Ingestion with pandas

Run the hidden code cell below to import the packages used in this course.

# Importing the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.io.json import json_normalize
from sqlalchemy import create_engine

Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

Setting data types at CSV read time by passing a dictionary to the dtype argument.

# Create dict specifying data types for agi_stub and zipcode
data_types = {'zipcode': 'str',
              'agi_stub': 'category'}

# Load csv using dtype to set correct data types
data = pd.read_csv("vt_tax_data_2016.csv", dtype=data_types)

# Print data types of resulting frame
print(data.dtypes.head())

If our Excel workbook has multiple sheets of similar format, we can import them all into a dictionary (by passing sheet_name=None to read_excel) where the keys are the sheet names and the values are the dataframes. We can then combine those values into a single dataframe using the code below and use that df for analysis.


# Read every sheet into a dict of dataframes keyed by sheet name
responses = pd.read_excel("fcc-new-coder-survey.xlsx", sheet_name=None)
print(type(responses))  # dict

# Create an empty dataframe
all_responses = pd.DataFrame()

# Iterate through the dataframes in responses
for df in responses.values():
  # Print the number of rows being added
  print("Adding {} rows".format(df.shape[0]))
  # Append df to all_responses, assign result
  all_responses = all_responses.append(df)

Explore Datasets

Try using the prompt below to explore the data and practice your skills!

There are three data files of varying kinds in the datasets/ directory: data.db (NYC weather and 311 housing complaints), fcc-new-coder-survey.xlsx (FreeCodeCamp New Developer Survey response subset), and vt_tax_data_2016.csv (Vermont tax return data by ZIP code).

Import each of these files into a format useful for data analysis in Python: specify data types, handle bad or missing data, and parse dates. You can also practice importing a specific portion of a data file (e.g., certain sheet(s) of an Excel workbook, or using SQL to filter a database on conditions).
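
For the SQLite database, here is a minimal sketch that reuses the create_engine import from the hidden cell above; the table name, column name, and filter value are assumptions about what data.db contains.

# Connect to the SQLite database (adjust the path if the file lives in datasets/)
engine = create_engine("sqlite:///data.db")

# Filter rows with a SQL query (table and column names assumed)
query = """
SELECT *
  FROM hpd311calls
 WHERE borough = 'BROOKLYN';
"""
brooklyn_calls = pd.read_sql(query, engine)
print(brooklyn_calls.shape)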

We can parse columns that are already in a standard datetime format at read time using parse_dates. For columns with a non-standard datetime format, we can pass an explicit format string to pd.to_datetime().

import pandas as pd
# Convert a non-standard datetime column using an explicit format string
survey_data["Part2EndTime"] = pd.to_datetime(survey_data["Part2EndTime"],
                                             format="%m%d%Y %H:%M:%S")
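
For columns already in a standard format, a minimal sketch of parse_dates at read time (the column names here are assumptions about the survey workbook, not taken from these notes):

# Parse standard-format datetime columns while reading the file
# (column names are assumed)
survey_data = pd.read_excel("fcc-new-coder-survey.xlsx",
                            parse_dates=["Part1StartTime", "Part1EndTime"])
print(survey_data["Part1StartTime"].dtype)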

We can set up the parameters and headers of an API request as required by the specific API, convert the response to JSON, and load it into a dataframe.
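
The request below assumes api_url, api_key, and params have already been defined. A hedged setup sketch (the endpoint URL and parameter names are assumptions based on Yelp's public API, not taken from these notes):

import requests

# Yelp business search endpoint and example search parameters (assumed)
api_url = "https://api.yelp.com/v3/businesses/search"
params = {"term": "cafe",
          "location": "NYC"}
# api_key holds your Yelp API key string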

# Create dictionary that passes Authorization and key string
headers = {'Authorization': "Bearer {}".format(api_key)}

# Query the Yelp API with headers and params set
response = requests.get(api_url, params=params, headers=headers)

# Extract JSON data from response
data = response.json()

# Load "businesses" values to a dataframe and print names
cafes = pd.DataFrame(data['businesses'])
print(cafes.name)

We can flatten nested JSON with json_normalize, imported from pandas.io.json (in newer pandas versions it is imported directly from pandas). For nested structures we can specify the record_path, the separator (sep), and meta to pull key-value pairs from higher levels of the JSON into the flattened table.

# Load other business attributes and set meta prefix
flat_cafes = json_normalize(data["businesses"],
                            sep="_",
                            record_path="categories",
                            meta=['name',
                                  'alias',
                                  'rating',
                                  ['coordinates', 'latitude'],
                                  ['coordinates', 'longitude']],
                            meta_prefix='biz_')

# View the data
print(flat_cafes.head())

APIs typically return only a limited number of records per call, so we set an offset keyword in the params dict and repeat the request over a range of offsets to fetch all the records. We can also merge dataframes on certain columns; if the column names differ between the two dataframes, we have to specify each name explicitly.
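
A minimal sketch of offset pagination and a merge on differently named columns; the page size, the 'offset' parameter name, pop_data, and the column names are assumptions for illustration, not from these notes.

# Fetch results in pages of 50 using an offset parameter (page size assumed)
pages = []
for offset in range(0, 200, 50):
    params["offset"] = offset
    response = requests.get(api_url, params=params, headers=headers)
    pages.append(pd.DataFrame(response.json()["businesses"]))

all_cafes = pd.concat(pages, ignore_index=True)

# Merge on columns whose names differ in each dataframe
# (pop_data and both column names are hypothetical)
cafes_with_pop = all_cafes.merge(pop_data,
                                 left_on="location_zip_code",
                                 right_on="zipcode")
print(cafes_with_pop.head())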