How to Import JSON and HTML Data into pandas

To be an adept data scientist, one must know how to deal with many different kinds of data. Learn to read various formats of data like JSON and HTML using pandas.

31 mai 2019 · 13 min de lecture

Introduction

Importing data is one of the most essential and very first steps in any data related problem. The ability to import the data correctly is a must-have skill for every aspiring data scientist.

Data exists in many different forms, and not only should you know how to import various data formats but also how to analyze and manipulate the data to gain useful insights.

pandas is an open source Python library which is easy-to-use, provides high-performance, and a data analysis tool for various data formats.

It gives you the capability to read various types of data formats like CSV, JSON, Excel, Pickle, etc. It allows you to represent your data in a row and column tabular fashion, which makes the data readable and presentable.

pandas represent the data in a DataFrame form and provide you with extensive usage for data analysis and data manipulation. Once you start making sense out of the data using the various functionalities in pandas, you can then use this data for analyzing, forecasting, classifying, and much more!

pandas has an input and output API which has a set of top-level reader and writer functions. The reader function is accessed with pandas.read_json() that returns a pandas object, and the writer function is accessed with pandas.to_json() which is an object method.

DataFrame has a Reader and a Writer function. The Reader function allows you to read the different data formats, while the Writer function enables you to save the data in a particular format.

Below are data formats that DataFrame supports, which means if your data is in any of the below forms, you can use pandas to load that data format and even write into a particular format.

In today's tutorial, you will be working on a few of the above format types like JSON, HTML, and Pickle.

Note: Check out this tutorial to learn how to read CSV files using pandas.

Loading JSON Data

JSON, also known as JavaScript Object Notation, is a data-interchange text-serialization format. JSON is easy to read and write. It is based on a subset of the JavaScript Programming Language but uses conventions from Python, and many other languages outside of Python.

JSON is mostly used to store unstructured data, and SQL databases have a tough time saving it. JSON makes the data accessible for the machines to read.

JSON is mainly built on two structures:

A collection of key/value pairs. In Python, a key/value pair is referred to as a Dictionary, and a key is a unique attribute, whereas values are not.
An ordered list of values. The ordered list can also sometimes be a list of lists. Lists in Python are a set of values which can be a string, integer, etc.

So let's now see how you can load the JSON data in multiple ways.

The first JSON dataset is from this link. The data is in a key-value dictionary format. There are a total of three keys: namely integer, datetime, and category.

First, you will import the pandas library and then pass the URL to the pd.read_json() which will return a dataframe. The columns of the dataframes represent the keys, and the rows are the values of the JSON.

import pandas as pd

json = pd.read_json('https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json')

Let's quickly print the last few rows of the JSON that you read using the .tail() function.

json.tail(6)

	integer	datetime
94	5	2015-01-01 00:01:34
95	9	2015-01-01 00:01:35
96	8	2015-01-01 00:01:36
97	6	2015-01-01 00:01:37
98	8	2015-01-01 00:01:38
99	1	2015-01-01 00:01:39

json.shape

(100, 3)

From the above output, you can see that there are three total columns: integer, datetime, and category. Also, there are 100 samples in the dataset as verified from the .shape method which returned a 100 x 3 output.

Writing a JSON

Writing the JSON data is as simple as reading and is one line of code. Instead of read_json() you will use to_json() with a filename and that's all!

json.to_json('dataframe.json')

Parsing Nested JSON as a String

Next, you will use another type of JSON dataset, which is not as simple. It is a nested JSON structure. Nested JSON structure means that each key can have more keys associated with it.

Let's see the example dataset to understand it better.

nested_json = """{
   "article": [

      {
         "id":"01",
         "language": "JSON",
         "edition": "first",
         "author": "Allen"
      },

      {
         "id":"02",
         "language": "Python",
         "edition": "second",
         "author": "Aditya Sharma"
      }
   ],

   "blog":[
   {
       "name": "Datacamp",
       "URL":"datacamp.com"
   }
   ]
}"""

In the above dataset, you can observe that article and blog are two primary keys under which there are values of these keys. These values have their own key-value pair combination.

Note that the above dataset is enclosed with double-quotes and is in the form of a string.

Reading a nested JSON can be done in multiple ways.

First, you will use the json.loads function to read a JSON string by passing the data variable as a parameter to it. Then, you will use the json_normalize function to flatten the nested JSON data into a table.

You will import the json_normalize function from the pandas.io.json library.

import json  
from pandas.io.json import json_normalize  

nested = json.loads(nested_json)

nested

{'article': [{'id': '01',
   'language': 'JSON',
   'edition': 'first',
   'author': 'Allen'},
  {'id': '02',
   'language': 'Python',
   'edition': 'second',
   'author': 'Aditya Sharma'}],
 'blog': [{'name': 'Datacamp', 'URL': 'datacamp.com'}]}

Next, you will flatten the JSON using the normalize function. You will pass in the complete data for now and see how it looks.

nested_full = json_normalize(nested)

nested_full

	article	blog
0	[{'id': '01', 'language': 'JSON', 'edition': '...	[{'name': 'Datacamp', 'URL': 'datacamp.com'}]

Great! So as you can see from the above output, the primary keys are the columns of the dataframe while the keys represent the row.

However, the output looks a little clumsy, right? Let's see how you can further divide it into multiple dataframes.

You will again pass in the json.loads output to json_normalize but this time you will specify an extra element called record_path.

blog = json_normalize(nested,record_path ='blog')

blog

	URL	name
0	datacamp.com	Datacamp

article = json_normalize(nested,record_path ='article')

article

	author	edition	id	language
0	Allen	first	01	JSON
1	Aditya Sharma	second	02	Python

From the above outputs, you can observe that the primary keys are now divided into two different dataframes where the column names have now become the keys that were nested inside the primary keys as a key-value pair. Amazing, right?

Lastly, to wrap up JSON data loading, let's quickly find out how you can read JSON as a file and not as a string.

Reading a JSON as a File

nested_json = {
   "article": [

      {
         "id":"01",
         "language": "JSON",
         "edition": "first",
         "author": "Allen"
      },

      {
         "id":"02",
         "language": "Python",
         "edition": "second",
         "author": "Aditya Sharma"
      }
   ],

   "blog":[
   {
       "name": "Datacamp",
       "URL":"datacamp.com"
   }
   ]
}

If you notice in the above cell, there are no double-quotes in the start and end of the data, which means you will be treating this data as a file and read it.

You can achieve this with just one line of code. Let's find out how!

json_file = pd.DataFrame.from_dict(json_normalize(nested_json))

json_file

	article	blog
0	[{'id': '01', 'language': 'JSON', 'edition': '...	[{'name': 'Datacamp', 'URL': 'datacamp.com'}]

From the above output, you can see that the output is similar to when you read JSON as a string. The only change here is that you use pandas to both parse and flatten the JSON. Since the JSON is a dictionary you use the .from_dict() function.

Loading HTML Data

HTML is a Hypertext Markup Language that is mainly used for created web applications and pages. It tries to describe the structure of the web page semantically. The web browser receives an HTML document from a web server and renders it to a multimedia web page.

For web applications, HTML is used with cascading style sheets (CSS) while at the server end it collaborates with various web server frameworks like Flask, Django, etc.

HTML uses tags to define each block of code like a <p></p> tag for the start and end of a paragraph, <image></image> tag for adding content to the web page as an image and similarly there are many tags that together collate to form an HTML web page.

To read an HTML file, pandas dataframe looks for a tag. That tag is called a <td></td> tag. This tag is used for defining a table in HTML.

pandas uses read_html() to read the HTML document.

So, whenever you pass an HTML to pandas and expect it to output a nice looking dataframe, make sure the HTML page has a table in it!

About the Data: You will be using a Cryptocurrency website as an HTML dataset that has various crypto coins on it and has various details about each coin like:
- Last price of the coin (Last price)
- Whether the coin price has increased or decreased (in percentage %)
- The volume of 24 hours (how many coins were traded 24 volume)
- Total number of coins (# Coins)

So, let's get started!

You will first import requests library which will help you in sending a request to the URL from where you want to fetch the HTML content.

import requests

url = 'https://www.worldcoinindex.com/'

crypto_url = requests.get(url)

crypto_url

<Response [200]>

So, until now, you defined the URL and then using requests.get() you sent a request to that URL and received a response as an acknowledgement [200] OK which means that you were able to connect with that web server.

Now, to read the content of the HTML web page, all you need to do is call crypto_url.text which will give you the HTML code of that cryptocurrency web page.

Feel free to run the command crypto_url.text to see the output.

Finally, you will pass crypto_url.text to the pd.read_html() function which will return you a list of dataframes where each element in that list is a table (dataframe) the cryptocurrency webpage has in it.

crypto_data = pd.read_html(crypto_url.text)

Let's print the length and the type of the dataframe. The type should be a list.

len(crypto_data), type(crypto_data)

(1, list)

From the above output, it is clear that there is only 1 table with a type list.

crypto_data = crypto_data[0]

Let's remove the first and second column since they do not have any useful information in them and keep all the rows.

crypto_final = crypto_data.iloc[:,2:]

Finally, it's time to print the cryptocurrency dataframe!

crypto_final.head()

	Name	Ticker	Last price	%	24 high	24 low	Price Charts 7d	24 volume	# Coins	Market cap
0	bitcoin	BTC	$ 8,008.027	+1.83%	$ 8,056.630	$ 7,812.784	NaN	$ 12.04B	17.71M	$ 141.89B
1	ethereum	ETH	$ 251.72335	+2.68%	$ 253.84721	$ 242.95687	NaN	$ 6.93B	106.20M	$ 26.73B
2	litecoin	LTC	$ 98.633851	+11.08%	$ 99.946324	$ 88.618815	NaN	$ 3.46B	61.91M	$ 6.10B
3	bitcoincash	BCH	$ 411.94075	+2.31%	$ 418.60850	$ 394.18927	NaN	$ 2.10B	17.79M	$ 7.33B
4	eos	EOS	$ 6.4230717	+6.05%	$ 6.4765695	$ 6.0264079	NaN	$ 2.06B	1.01B	$ 6.50B

From the above table, you can observe that Bitcoin has the most Market Cap.

Dropping NaN (Not a Number)

There are some values in the dataframe that are not real values, so let's quickly remove them from the table.

However, first, let's completely drop (delete) the Price Charts 7d column since it is entirely NaN and has zero information in it.

del crypto_final['Price Charts 7d']

crypto_final.head()

	Name	Ticker	Last price	%	24 high	24 low	24 volume	# Coins	Market cap
0	bitcoin	BTC	$ 8,008.027	+1.83%	$ 8,056.630	$ 7,812.784	$ 12.04B	17.71M	$ 141.89B
1	ethereum	ETH	$ 251.72335	+2.68%	$ 253.84721	$ 242.95687	$ 6.93B	106.20M	$ 26.73B
2	litecoin	LTC	$ 98.633851	+11.08%	$ 99.946324	$ 88.618815	$ 3.46B	61.91M	$ 6.10B
3	bitcoincash	BCH	$ 411.94075	+2.31%	$ 418.60850	$ 394.18927	$ 2.10B	17.79M	$ 7.33B
4	eos	EOS	$ 6.4230717	+6.05%	$ 6.4765695	$ 6.0264079	$ 2.06B	1.01B	$ 6.50B

Now let's remove the NaN's.

crypto_final = crypto_final.dropna()

Visualizing the Crypto Coins (Ticker) Vs. Percentage Increase/Decrease in Price (%)

Let's first import matplotlib for plotting the graph.

import matplotlib.pyplot as plt
%matplotlib inline

The % column has a type string, and you need to convert the type string to float. First, you will remove the % sign from the column and then convert the type of the % column to float.

crypto_final['%'] = crypto_final['%'].apply(lambda x: x.strip('%'))

crypto_final['%'] = crypto_final['%'].astype('float')

Let's now plot the data.

plt.figure()
plt.figure(figsize=(16,10))
x = crypto_final.iloc[:20]['Ticker']
y = crypto_final.iloc[:20]['%']
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Percentage Increase/Decrease in Price',fontsize=20)
plt.ylabel('Ticker',fontsize=20)

plt.plot(x,y,label='% Increase/Decrease in Price')
plt.legend(loc='lower left',prop={'size': 15})

<matplotlib.legend.Legend at 0x1298a9630>

<Figure size 432x288 with 0 Axes>

From the above graph, you can observe that the Litecoin (LTC) and Huobitoken (HT) have the maximum increase in the price and on the other hand, Maticnetwork (MATIC) and Waves have the maximum decrease in the price of the crypto coins.

Loading Pickle Data

Pickle is a Python-specific binary serialization format which is not human-readable, unlike JSON. It is used for serializing and deserializing an object structure of Python. It serializes the object and Pickles it to save it on a disk. It converts the object like DataFrame, list, dictionary, etc. into a character stream.

The best part about Pickle is that it can store various kinds of Python data types.

Pickle is widely used for storing trained machine learning algorithm instances. Like JSON, Pickle also has handy functions like pickle.load() for loading a Pickle format file, and pickle.dump() for saving a Pickle or any other format in Pickle format.

Another import advantage of using Pickle is that Saving the dataframe as a Pickle file required less space on the disk and keeps the type of the data intact when reloaded.

So, let's quickly pickle the cryptocurrency dataframe you constructed earlier, and then you will read that pickled object using pandas.

import pickle

You will use pickle.dump to save the dataframe as a Pickle object and use the protocol as HIGHEST_PROTOCOL, which is backward compatible with Python 2 as well.

with open('crypto_final.pickle', 'wb') as sub_data:
    pickle.dump(crypto_final, sub_data, protocol=pickle.HIGHEST_PROTOCOL)

Finally, you will use the pandas read_pickle function to convert the pickle object into a dataframe.

crypto_final = pd.read_pickle('crypto_final.pickle')

crypto_final.head()

	Name	Ticker	Last price	%	24 high	24 low	24 volume	# Coins	Market cap
0	bitcoin	BTC	$ 7,776.567	+1.81%	$ 7,870.205	$ 7,506.518	$ 12.70B	17.71M	$ 137.78B
1	ethereum	ETH	$ 241.81372	-0.97%	$ 245.57293	$ 232.62523	$ 7.47B	106.18M	$ 25.67B
2	litecoin	LTC	$ 88.062811	+0.33%	$ 88.946501	$ 85.248641	$ 2.59B	61.90M	$ 5.45B
3	bitcoincash	BCH	$ 388.76089	+0.06%	$ 398.27697	$ 370.29831	$ 2.33B	17.79M	$ 6.91B
4	eos	EOS	$ 5.9327948	-0.41%	$ 6.0163442	$ 5.7789154	$ 1.99B	1.01B	$ 6.00B

Wow! Isn't that amazing? How straightforward that was. Moreover, with Pickle, you do not have to worry about the datatype conversion errors as the pickle serialization can handle that for you!

Go Further

Congratulations on finishing the tutorial.

This tutorial was a good starting point on how you can load different data formats in Python with the help of pandas.

There are still many data formats like Excel, SQL, HDF5, etc., that fall under the pandas data import umbrella. You should get your hands dirty with them by finding some interesting public datasets in these data formats and try experimenting with them.

If you would like to learn more about DataFrames in pandas, take DataCamp's data manipulation with pandas interactive course. DataCamp also has several other handy pandas tutorials covering joining dataframes, implementing moving averages and using the apply method. Or, if you're completely new to pandas, start with this beginners guide to pandas tutorial.

References:

Please feel free to ask any questions related to this tutorial in the comments section below.

Sujets

Python

Learn more about Python and pandas

Cours

Pandas Joins for Spreadsheet Users

4 h

4.2K

Learn how to effectively and efficiently join datasets in tabular format using the Python Pandas library.

Afficher les détails

Commencer le cours

Cours

Manipulation de données avec pandas

4 h

502.1K

Apprenez à importer et nettoyer des données, calculer des statistiques et créer des visualisations avec pandas.

Afficher les détails

Commencer le cours

Cours

Joindre des données avec pandas

4 h

198.5K

Apprenez à joindre des données issues de plusieurs tables avec pandas pour des analyses plus riches et précises.

Afficher les détails

Commencer le cours

Apparenté

cheat-sheet

Pandas Cheat Sheet for Data Science in Python

A quick guide to the basics of the Python data analysis library Pandas, including code samples.

Karlijn Willems

cheat-sheet

Importing Data in Python Cheat Sheet

With this Python cheat sheet, you'll have a handy reference guide to importing your data, from flat files to files native to other software and relational databases.

Karlijn Willems

Didacticiel

pandas read_csv() Tutorial: Importing Data

Importing data is the first step in any data science project. Learn why today's data scientists prefer the pandas read_csv() function to do this.

Kurtis Pykes

Didacticiel

Pandas Tutorial: DataFrames in Python

Explore data analysis with Python. Pandas DataFrames make manipulating your data easy, from selecting or replacing columns and indices to reshaping your data.

Karlijn Willems

Didacticiel

Data Preparation with pandas

In this tutorial, you will learn why it is important to pre-process data and how to do it in pandas.

Sejal Jaiswal

Didacticiel

How to Import Data Into R

Find out how to import data into R, including CSV, JSON, Excel, HTML, databases, SAS, SPSS, Matlab, and other files using the popular R packages.

Abid Ali Awan

Voir plus Voir plus

Introduction

Loading JSON Data

Loading HTML Data

Dropping NaN (Not a Number)

Visualizing the Crypto Coins (Ticker) Vs. Percentage Increase/Decrease in Price (%)

Loading Pickle Data

Pandas Cheat Sheet for Data Science in Python

Importing Data in Python Cheat Sheet

pandas read_csv() Tutorial: Importing Data

Pandas Tutorial: DataFrames in Python

Data Preparation with pandas

How to Import Data Into R

.css-1531qan{-webkit-text-decoration:none;text-decoration:none;color:inherit;}Pandas Joins for Spreadsheet Users

Manipulation de données avec pandas

Joindre des données avec pandas

Pandas Cheat Sheet for Data Science in Python

Importing Data in Python Cheat Sheet

pandas read_csv() Tutorial: Importing Data

Pandas Tutorial: DataFrames in Python

Data Preparation with pandas

How to Import Data Into R

Pandas Joins for Spreadsheet Users