
Intermediate Importing Data in Python


Importing flat files from the web

The urllib package

The urllib package is a standard library module that provides an interface for fetching data across the web. It is a collection of modules for working with URLs (Uniform Resource Locators), offering functions and classes for tasks such as sending HTTP requests, handling URLs, and working with network resources. The urllib.request module provides classes and functions for opening and reading URLs: it lets you send HTTP/HTTPS requests, handle cookies and redirects, and perform basic authentication, and it is commonly used for web scraping, downloading files, and interacting with web APIs. Its urlopen() function accepts URLs instead of file names.

Example:

# Import package
from urllib.request import urlretrieve

# Import pandas
import pandas as pd

# Assign url of file: url
url = 'https://assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally using urlretrieve function
urlretrieve(url, 'winequality-red.csv')

# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())

If you just want to load a file from the web into a DataFrame without first saving it locally, you can do so easily using pandas. In particular, you can use the function pd.read_csv() with the URL as the first argument and the separator sep as the second argument:

# Assign url of file: url
url = 'https://assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Read file into a DataFrame: df
df = pd.read_csv(url, sep=';')

HTTP requests to import files from the web

URL stands for Uniform (or Universal) Resource Locator; a URL is simply a reference to a web resource. The vast majority of URLs are web addresses, but they can refer to a few other things, such as file transfer protocol (FTP) servers and database access. For now we'll focus on URLs that are web addresses, that is, the locations of websites. Such a URL consists of two parts: a protocol identifier, http or https, and a resource name, such as datacamp.com. Together, the protocol identifier and resource name uniquely specify the web address!
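As an aside, the standard-library urllib.parse module can split a URL into these parts. This is a minimal sketch for illustration only; the example URL is made up and is not one of the course datasets:

# Minimal sketch: inspecting the parts of a URL with the standard library
from urllib.parse import urlparse

# Hypothetical URL, used only to show the pieces
parts = urlparse('https://www.datacamp.com/courses?page=2')

print(parts.scheme)   # protocol identifier: 'https'
print(parts.netloc)   # resource name: 'www.datacamp.com'
print(parts.path)     # path on the server: '/courses'
print(parts.query)    # query string: 'page=2'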

The Hypertext Transfer Protocol (HTTP) is an application protocol for distributed, collaborative, hypermedia information systems. HTTP is the foundation of data communication for the World Wide Web. Note that HTTPS is a more secure form of HTTP. Each time you go to a website, you are actually sending an HTTP request to a server. This request is known as a GET request, by far the most common type of HTTP request. We were actually performing a GET request when using the function urlretrieve(). The ingenuity of urlretrieve() also lies in the fact that it not only makes a GET request but also saves the relevant data locally.

Performing HTTP requests in Python using urllib

# Import packages
from urllib.request import urlopen, Request

# Specify the url
url = "https://campus.datacamp.com/courses/1606/4135?ex=2"

# This packages the request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Extract the response: html
html = response.read()

# Print the datatype of response
print(type(response))

# Print the html
print(html)

# Be polite and close the response!
response.close()

Using the package requests

You can package the request to the URL, send the request and catch the response with a single function requests.get(). Note that unlike in the previous exercises using urllib, you don't have to close the connection when using requests!

Example:

# Import package
import requests

# Specify the url: url
url = "http://www.datacamp.com/teach/documentation"

# Packages the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response: text
text = r.text

# Print the html
print(text)

Scraping the web in Python

You've got the HTML of your page of interest, but generally HTML is a jumbled mix of both unstructured and structured data. A word on these terms: structured data is data that has a pre-defined data model or that is organized in a defined manner; unstructured data is data that possesses neither of these properties. HTML is interesting because, although much of it is unstructured text, it does contain tags that determine where, for example, headings and hyperlinks can be found. In general, to turn HTML that you have scraped from the world wide web into useful data, you'll need to parse it and extract structured data from it.

The BeautifulSoup() function is a core component of the bs4 (Beautiful Soup 4) package in Python, which is a widely used library for web scraping and parsing HTML or XML documents. It allows you to extract data from web pages by navigating and searching the document's structure.

Syntax:

bs_object = BeautifulSoup(markup, parser)

Parameters:

markup: The markup or document you want to parse. It can be a string containing HTML or XML code, or a file-like object.

parser: The parser to be used for parsing the document. Beautiful Soup supports different parsers, such as Python's built-in html.parser, lxml, and html5lib. If not specified, the default parser, html.parser, is used.
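As a quick illustration of the parser argument, here is a minimal sketch; note that lxml and html5lib are third-party packages that must be installed separately, which is why only the built-in parser is used here:

from bs4 import BeautifulSoup

# A tiny, made-up markup string for illustration
markup = '<html><body><p>Hello, world!</p></body></html>'

# Built-in parser (no extra installation required)
soup_builtin = BeautifulSoup(markup, 'html.parser')

# Third-party parser (would require: pip install lxml)
# soup_lxml = BeautifulSoup(markup, 'lxml')

print(soup_builtin.p.text)   # Hello, world!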

Here is an example:

# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML, using the built-in parser: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(pretty_soup)

The prettify() function is a method provided by the BeautifulSoup object. It helps in formatting the parsed document's HTML or XML structure, making it more readable. It adds indentation and line breaks to the document, making it easier to understand and work with.

soup.title: Represents the title of the document, accessed as an attribute.

soup.get_text(separator): Retrieves the text content of an element and its descendants, and returns it as a single string. The separator parameter specifies the string used to join the individual text parts.

soup.find_all(name, attrs, recursive, string, limit, **kwargs): Searches the document for all occurrences of elements with the given name and optional attrs or string, and returns a ResultSet object containing the matching elements.
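For instance, get_text() can pull all of the text out of a page. This is a minimal sketch that continues from the soup object created in the example above:

# Extract all of the text from the page, joining the pieces with newlines
page_text = soup.get_text(separator='\n')
print(page_text)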

Getting the Hyperlinks

# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML, using the built-in parser: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the title of Guido's webpage
print(soup.title)

# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))

Introduction to APIs and JSONs

APIs

API stands for Application Programming Interface. It is a set of protocols, routines, and tools for building software applications. APIs specify how software components should interact, and they are also used when programming graphical user interface (GUI) components.

JSON

JSON stands for JavaScript Object Notation. It is a lightweight data interchange format that is easy for humans to read and write and easy for machines to parse and generate. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others.
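To see what this text format looks like, here is a minimal sketch using Python's json module; the movie dictionary is made up for illustration:

import json

# A Python dictionary...
movie = {'Title': 'Hackers', 'Year': 1995, 'Genres': ['Crime', 'Drama']}

# ...serialized to JSON: plain, human-readable text
print(json.dumps(movie, indent=2))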

Loading and exploring a JSON

Example:

# Import package
import json

# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])

The json.load() function returns a Python object that corresponds to the JSON data structure. The exact type of the returned object depends on the JSON data. For example, JSON arrays become Python lists, JSON objects become Python dictionaries, and JSON strings become Python strings.
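As a minimal sketch of this mapping (using json.loads() on a made-up string rather than json.load() on a file):

import json

json_text = '{"title": "Hackers", "year": 1995, "genres": ["Crime", "Drama"], "sequel": null}'
data = json.loads(json_text)

print(type(data))              # <class 'dict'>     (JSON object -> Python dict)
print(type(data['title']))     # <class 'str'>      (JSON string -> Python str)
print(type(data['year']))      # <class 'int'>      (JSON number -> Python int)
print(type(data['genres']))    # <class 'list'>     (JSON array  -> Python list)
print(type(data['sequel']))    # <class 'NoneType'> (JSON null   -> Python None)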

APIs and interacting with the world wide web

Remember that in the URL we have http, which defines an HTTP request; what comes after it defines the website/API we are trying to query. In the example 'http://www.omdbapi.com/?t=hackers', the question mark begins a query string, and in this case it is querying for title = hackers. Other arguments can be added to the query string using an ampersand. For example:

# Import package
import requests

# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=72bc447a&t=social+network'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
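Alternatively, requests can build the query string for you from a dictionary via its params argument. This is a minimal sketch equivalent to the hard-coded URL above:

# Import package
import requests

# requests joins the parameters with '?' and '&' automatically
params = {'apikey': '72bc447a', 't': 'social network'}
r = requests.get('http://www.omdbapi.com/', params=params)

# Print the full URL that was actually requested and the decoded JSON response
print(r.url)
print(r.json())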