Accéder au contenu principal

How to Import JSON and HTML Data into pandas

To be an adept data scientist, one must know how to deal with many different kinds of data. Learn to read various formats of data like JSON and HTML using pandas.
31 mai 2019  · 13 min de lecture

Introduction

Importing data is one of the most essential and very first steps in any data related problem. The ability to import the data correctly is a must-have skill for every aspiring data scientist.

Data exists in many different forms, and not only should you know how to import various data formats but also how to analyze and manipulate the data to gain useful insights.

pandas is an open source Python library which is easy-to-use, provides high-performance, and a data analysis tool for various data formats.

It gives you the capability to read various types of data formats like CSV, JSON, Excel, Pickle, etc. It allows you to represent your data in a row and column tabular fashion, which makes the data readable and presentable.

pandas represent the data in a DataFrame form and provide you with extensive usage for data analysis and data manipulation. Once you start making sense out of the data using the various functionalities in pandas, you can then use this data for analyzing, forecasting, classifying, and much more!

pandas has an input and output API which has a set of top-level reader and writer functions. The reader function is accessed with pandas.read_json() that returns a pandas object, and the writer function is accessed with pandas.to_json() which is an object method.

DataFrame has a Reader and a Writer function. The Reader function allows you to read the different data formats, while the Writer function enables you to save the data in a particular format.

Below are data formats that DataFrame supports, which means if your data is in any of the below forms, you can use pandas to load that data format and even write into a particular format.

In today's tutorial, you will be working on a few of the above format types like JSON, HTML, and Pickle.

Note: Check out this tutorial to learn how to read CSV files using pandas.

Loading JSON Data

JSON, also known as JavaScript Object Notation, is a data-interchange text-serialization format. JSON is easy to read and write. It is based on a subset of the JavaScript Programming Language but uses conventions from Python, and many other languages outside of Python.

JSON is mostly used to store unstructured data, and SQL databases have a tough time saving it. JSON makes the data accessible for the machines to read.

JSON is mainly built on two structures:

  • A collection of key/value pairs. In Python, a key/value pair is referred to as a Dictionary, and a key is a unique attribute, whereas values are not.

  • An ordered list of values. The ordered list can also sometimes be a list of lists. Lists in Python are a set of values which can be a string, integer, etc.

So let's now see how you can load the JSON data in multiple ways.

The first JSON dataset is from this link. The data is in a key-value dictionary format. There are a total of three keys: namely integer, datetime, and category.

  • First, you will import the pandas library and then pass the URL to the pd.read_json() which will return a dataframe. The columns of the dataframes represent the keys, and the rows are the values of the JSON.
import pandas as pd

json = pd.read_json('https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/data.json')

Let's quickly print the last few rows of the JSON that you read using the .tail() function.

json.tail(6)
  integer datetime category
94 5 2015-01-01 00:01:34 0
95 9 2015-01-01 00:01:35 0
96 8 2015-01-01 00:01:36 0
97 6 2015-01-01 00:01:37 0
98 8 2015-01-01 00:01:38 0
99 1 2015-01-01 00:01:39 0
json.shape
(100, 3)

From the above output, you can see that there are three total columns: integer, datetime, and category. Also, there are 100 samples in the dataset as verified from the .shape method which returned a 100 x 3 output.

  • Writing a JSON

Writing the JSON data is as simple as reading and is one line of code. Instead of read_json() you will use to_json() with a filename and that's all!

json.to_json('dataframe.json')
  • Parsing Nested JSON as a String

Next, you will use another type of JSON dataset, which is not as simple. It is a nested JSON structure. Nested JSON structure means that each key can have more keys associated with it.

Let's see the example dataset to understand it better.

nested_json = """{
   "article": [

      {
         "id":"01",
         "language": "JSON",
         "edition": "first",
         "author": "Allen"
      },

      {
         "id":"02",
         "language": "Python",
         "edition": "second",
         "author": "Aditya Sharma"
      }
   ],

   "blog":[
   {
       "name": "Datacamp",
       "URL":"datacamp.com"
   }
   ]
}"""

In the above dataset, you can observe that article and blog are two primary keys under which there are values of these keys. These values have their own key-value pair combination.

Note that the above dataset is enclosed with double-quotes and is in the form of a string.

Reading a nested JSON can be done in multiple ways.

First, you will use the json.loads function to read a JSON string by passing the data variable as a parameter to it. Then, you will use the json_normalize function to flatten the nested JSON data into a table.

You will import the json_normalize function from the pandas.io.json library.

import json  
from pandas.io.json import json_normalize  

nested = json.loads(nested_json)
nested
{'article': [{'id': '01',
   'language': 'JSON',
   'edition': 'first',
   'author': 'Allen'},
  {'id': '02',
   'language': 'Python',
   'edition': 'second',
   'author': 'Aditya Sharma'}],
 'blog': [{'name': 'Datacamp', 'URL': 'datacamp.com'}]}

Next, you will flatten the JSON using the normalize function. You will pass in the complete data for now and see how it looks.

nested_full = json_normalize(nested)
nested_full
  article blog
0 [{'id': '01', 'language': 'JSON', 'edition': '... [{'name': 'Datacamp', 'URL': 'datacamp.com'}]

Great! So as you can see from the above output, the primary keys are the columns of the dataframe while the keys represent the row.

However, the output looks a little clumsy, right? Let's see how you can further divide it into multiple dataframes.

You will again pass in the json.loads output to json_normalize but this time you will specify an extra element called record_path.

blog = json_normalize(nested,record_path ='blog')
blog
  URL name
0 datacamp.com Datacamp
article = json_normalize(nested,record_path ='article')
article
  author edition id language
0 Allen first 01 JSON
1 Aditya Sharma second 02 Python

From the above outputs, you can observe that the primary keys are now divided into two different dataframes where the column names have now become the keys that were nested inside the primary keys as a key-value pair. Amazing, right?

Lastly, to wrap up JSON data loading, let's quickly find out how you can read JSON as a file and not as a string.

  • Reading a JSON as a File
nested_json = {
   "article": [

      {
         "id":"01",
         "language": "JSON",
         "edition": "first",
         "author": "Allen"
      },

      {
         "id":"02",
         "language": "Python",
         "edition": "second",
         "author": "Aditya Sharma"
      }
   ],

   "blog":[
   {
       "name": "Datacamp",
       "URL":"datacamp.com"
   }
   ]
}

If you notice in the above cell, there are no double-quotes in the start and end of the data, which means you will be treating this data as a file and read it.

You can achieve this with just one line of code. Let's find out how!

json_file = pd.DataFrame.from_dict(json_normalize(nested_json))
json_file
  article blog
0 [{'id': '01', 'language': 'JSON', 'edition': '... [{'name': 'Datacamp', 'URL': 'datacamp.com'}]

From the above output, you can see that the output is similar to when you read JSON as a string. The only change here is that you use pandas to both parse and flatten the JSON. Since the JSON is a dictionary you use the .from_dict() function.

Loading HTML Data

HTML is a Hypertext Markup Language that is mainly used for created web applications and pages. It tries to describe the structure of the web page semantically. The web browser receives an HTML document from a web server and renders it to a multimedia web page.

For web applications, HTML is used with cascading style sheets (CSS) while at the server end it collaborates with various web server frameworks like Flask, Django, etc.

HTML uses tags to define each block of code like a <p></p> tag for the start and end of a paragraph, <image></image> tag for adding content to the web page as an image and similarly there are many tags that together collate to form an HTML web page.

To read an HTML file, pandas dataframe looks for a tag. That tag is called a <td></td> tag. This tag is used for defining a table in HTML.

pandas uses read_html() to read the HTML document.

So, whenever you pass an HTML to pandas and expect it to output a nice looking dataframe, make sure the HTML page has a table in it!

  • About the Data: You will be using a Cryptocurrency website as an HTML dataset that has various crypto coins on it and has various details about each coin like:
    • Last price of the coin (Last price)
    • Whether the coin price has increased or decreased (in percentage %)
    • The volume of 24 hours (how many coins were traded 24 volume)
    • Total number of coins (# Coins)

So, let's get started!

You will first import requests library which will help you in sending a request to the URL from where you want to fetch the HTML content.

import requests
url = 'https://www.worldcoinindex.com/'
crypto_url = requests.get(url)
crypto_url
<Response [200]>

So, until now, you defined the URL and then using requests.get() you sent a request to that URL and received a response as an acknowledgement [200] OK which means that you were able to connect with that web server.

Now, to read the content of the HTML web page, all you need to do is call crypto_url.text which will give you the HTML code of that cryptocurrency web page.

Feel free to run the command crypto_url.text to see the output.

Finally, you will pass crypto_url.text to the pd.read_html() function which will return you a list of dataframes where each element in that list is a table (dataframe) the cryptocurrency webpage has in it.

crypto_data = pd.read_html(crypto_url.text)

Let's print the length and the type of the dataframe. The type should be a list.

len(crypto_data), type(crypto_data)
(1, list)

From the above output, it is clear that there is only 1 table with a type list.

crypto_data = crypto_data[0]

Let's remove the first and second column since they do not have any useful information in them and keep all the rows.

crypto_final = crypto_data.iloc[:,2:]

Finally, it's time to print the cryptocurrency dataframe!

crypto_final.head()
  Name Ticker Last price % 24 high 24 low Price Charts 7d 24 volume # Coins Market cap
0 bitcoin BTC $ 8,008.027 +1.83% $ 8,056.630 $ 7,812.784 NaN $ 12.04B 17.71M $ 141.89B
1 ethereum ETH $ 251.72335 +2.68% $ 253.84721 $ 242.95687 NaN $ 6.93B 106.20M $ 26.73B
2 litecoin LTC $ 98.633851 +11.08% $ 99.946324 $ 88.618815 NaN $ 3.46B 61.91M $ 6.10B
3 bitcoincash BCH $ 411.94075 +2.31% $ 418.60850 $ 394.18927 NaN $ 2.10B 17.79M $ 7.33B
4 eos EOS $ 6.4230717 +6.05% $ 6.4765695 $ 6.0264079 NaN $ 2.06B 1.01B $ 6.50B

From the above table, you can observe that Bitcoin has the most Market Cap.

Dropping NaN (Not a Number)

There are some values in the dataframe that are not real values, so let's quickly remove them from the table.

However, first, let's completely drop (delete) the Price Charts 7d column since it is entirely NaN and has zero information in it.

del crypto_final['Price Charts 7d']
crypto_final.head()
  Name Ticker Last price % 24 high 24 low 24 volume # Coins Market cap
0 bitcoin BTC $ 8,008.027 +1.83% $ 8,056.630 $ 7,812.784 $ 12.04B 17.71M $ 141.89B
1 ethereum ETH $ 251.72335 +2.68% $ 253.84721 $ 242.95687 $ 6.93B 106.20M $ 26.73B
2 litecoin LTC $ 98.633851 +11.08% $ 99.946324 $ 88.618815 $ 3.46B 61.91M $ 6.10B
3 bitcoincash BCH $ 411.94075 +2.31% $ 418.60850 $ 394.18927 $ 2.10B 17.79M $ 7.33B
4 eos EOS $ 6.4230717 +6.05% $ 6.4765695 $ 6.0264079 $ 2.06B 1.01B $ 6.50B

Now let's remove the NaN's.

crypto_final = crypto_final.dropna()

Visualizing the Crypto Coins (Ticker) Vs. Percentage Increase/Decrease in Price (%)

Let's first import matplotlib for plotting the graph.

import matplotlib.pyplot as plt
%matplotlib inline

The % column has a type string, and you need to convert the type string to float. First, you will remove the % sign from the column and then convert the type of the % column to float.

crypto_final['%'] = crypto_final['%'].apply(lambda x: x.strip('%'))
crypto_final['%'] = crypto_final['%'].astype('float')

Let's now plot the data.

plt.figure()
plt.figure(figsize=(16,10))
x = crypto_final.iloc[:20]['Ticker']
y = crypto_final.iloc[:20]['%']
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Percentage Increase/Decrease in Price',fontsize=20)
plt.ylabel('Ticker',fontsize=20)

plt.plot(x,y,label='% Increase/Decrease in Price')
plt.legend(loc='lower left',prop={'size': 15})
<matplotlib.legend.Legend at 0x1298a9630>

<Figure size 432x288 with 0 Axes>

From the above graph, you can observe that the Litecoin (LTC) and Huobitoken (HT) have the maximum increase in the price and on the other hand, Maticnetwork (MATIC) and Waves have the maximum decrease in the price of the crypto coins.

Loading Pickle Data

Pickle is a Python-specific binary serialization format which is not human-readable, unlike JSON. It is used for serializing and deserializing an object structure of Python. It serializes the object and Pickles it to save it on a disk. It converts the object like DataFrame, list, dictionary, etc. into a character stream.

The best part about Pickle is that it can store various kinds of Python data types.

Pickle is widely used for storing trained machine learning algorithm instances. Like JSON, Pickle also has handy functions like pickle.load() for loading a Pickle format file, and pickle.dump() for saving a Pickle or any other format in Pickle format.

Another import advantage of using Pickle is that Saving the dataframe as a Pickle file required less space on the disk and keeps the type of the data intact when reloaded.

So, let's quickly pickle the cryptocurrency dataframe you constructed earlier, and then you will read that pickled object using pandas.

import pickle

You will use pickle.dump to save the dataframe as a Pickle object and use the protocol as HIGHEST_PROTOCOL, which is backward compatible with Python 2 as well.

with open('crypto_final.pickle', 'wb') as sub_data:
    pickle.dump(crypto_final, sub_data, protocol=pickle.HIGHEST_PROTOCOL)

Finally, you will use the pandas read_pickle function to convert the pickle object into a dataframe.

crypto_final = pd.read_pickle('crypto_final.pickle')
crypto_final.head()
  Name Ticker Last price % 24 high 24 low 24 volume # Coins Market cap
0 bitcoin BTC $ 7,776.567 +1.81% $ 7,870.205 $ 7,506.518 $ 12.70B 17.71M $ 137.78B
1 ethereum ETH $ 241.81372 -0.97% $ 245.57293 $ 232.62523 $ 7.47B 106.18M $ 25.67B
2 litecoin LTC $ 88.062811 +0.33% $ 88.946501 $ 85.248641 $ 2.59B 61.90M $ 5.45B
3 bitcoincash BCH $ 388.76089 +0.06% $ 398.27697 $ 370.29831 $ 2.33B 17.79M $ 6.91B
4 eos EOS $ 5.9327948 -0.41% $ 6.0163442 $ 5.7789154 $ 1.99B 1.01B $ 6.00B

Wow! Isn't that amazing? How straightforward that was. Moreover, with Pickle, you do not have to worry about the datatype conversion errors as the pickle serialization can handle that for you!

Go Further

Congratulations on finishing the tutorial.

This tutorial was a good starting point on how you can load different data formats in Python with the help of pandas.

There are still many data formats like Excel, SQL, HDF5, etc., that fall under the pandas data import umbrella. You should get your hands dirty with them by finding some interesting public datasets in these data formats and try experimenting with them.

If you would like to learn more about DataFrames in pandas, take DataCamp's data manipulation with pandas interactive course. DataCamp also has several other handy pandas tutorials covering joining dataframes, implementing moving averages and using the apply method. Or, if you're completely new to pandas, start with this beginners guide to pandas tutorial

References:

Please feel free to ask any questions related to this tutorial in the comments section below.

Sujets

Learn more about Python and pandas

cours

Pandas Joins for Spreadsheet Users

4 hr
3.7K
Learn how to effectively and efficiently join datasets in tabular format using the Python Pandas library.
Afficher les détailsRight Arrow
Commencer Le Cours
Voir plusRight Arrow
Apparenté

cheat-sheet

Pandas Cheat Sheet for Data Science in Python

A quick guide to the basics of the Python data analysis library Pandas, including code samples.
Karlijn Willems's photo

Karlijn Willems

4 min

cheat-sheet

Importing Data in Python Cheat Sheet

With this Python cheat sheet, you'll have a handy reference guide to importing your data, from flat files to files native to other software and relational databases.
Karlijn Willems's photo

Karlijn Willems

5 min

didacticiel

pandas read_csv() Tutorial: Importing Data

Importing data is the first step in any data science project. Learn why today's data scientists prefer the pandas read_csv() function to do this.
Kurtis Pykes 's photo

Kurtis Pykes

9 min

didacticiel

Pandas Tutorial: DataFrames in Python

Explore data analysis with Python. Pandas DataFrames make manipulating your data easy, from selecting or replacing columns and indices to reshaping your data.
Karlijn Willems's photo

Karlijn Willems

20 min

didacticiel

Data Preparation with pandas

In this tutorial, you will learn why it is important to pre-process data and how to do it in pandas.
Sejal Jaiswal's photo

Sejal Jaiswal

15 min

didacticiel

How to Import Data Into R: A Tutorial

Find out how to import data into R, including CSV, JSON, Excel, HTML, databases, SAS, SPSS, Matlab, and other files using the popular R packages.
Abid Ali Awan's photo

Abid Ali Awan

10 min

See MoreSee More