
Scrape Text Data from Webpages

The web is full of untapped data! In this template, you specify the URL of the page you want to scrape, and the code turns that page into analyzable text data using an HTTP request and Beautiful Soup.

%%capture
# Install Beautiful Soup, https://pypi.org/project/bs4/
!pip install bs4 
# Load packages
import pandas
import requests
from bs4 import BeautifulSoup

# Specify the url you want to scrape
url = "https://workspace-docs.datacamp.com/work/working-in-the-workspace"

# Package the request, send the request and catch the response
r = requests.get(url)

# Extract the response as HTML
html_doc = r.text

# Create a BeautifulSoup object from the HTML, naming the parser explicitly
soup = BeautifulSoup(html_doc, "html.parser")
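
The cell above parses whatever the server returns, including error pages. A minimal sketch of a more defensive version follows. The helper names `parse_html` and `fetch_soup`, the 10-second timeout, and the choice of Python's built-in `html.parser` are assumptions made for illustration, not part of the template.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(html_doc):
    """Parse an HTML string with Python's built-in parser (hypothetical helper)."""
    return BeautifulSoup(html_doc, "html.parser")


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object (hypothetical helper)."""
    # The timeout value is an assumption; without one, requests can wait indefinitely
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return parse_html(response.text)


# Offline check of the parsing step on a tiny inline document
demo = parse_html("<html><head><title>Demo</title></head><body><p>Hi</p></body></html>")
print(demo.title.string)
```

With these helpers, `soup = fetch_soup(url)` would stand in for the request, extraction, and parsing steps above.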

Extracting text elements of the webpage

The soup object contains the HTML of the webpage, which will likely require more pre-processing to be useful to you. The code below extracts specific elements of the webpage: the title, the text, and the links. This is useful for natural language processing projects.

# Get the title of the webpage
soup.title.string
# Get the text of the webpage
soup.text
# Get and print the link of all 'a' HTML tags
for link in soup.find_all("a"):
    print(link.get("href"))
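
The same calls can be tried without network access on a small inline page. The sketch below is illustrative only: `sample_html` and `base_url` are made up, and it assumes you want absolute URLs, which it builds with the standard library's `urljoin`. It also skips `a` tags that have no `href`, which the loop above would print as `None`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample page and base URL, made up for illustration
base_url = "https://example.com/docs/"
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p>Hello, <b>world</b>!</p>
    <a href="https://example.com/a">First</a>
    <a href="/b">Second</a>
    <a>No href</a>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Title of the page
page_title = sample_soup.title.string

# Visible text, with each text fragment stripped and joined by single spaces
page_text = sample_soup.get_text(separator=" ", strip=True)

# Keep only <a> tags that carry an href, then resolve relative links
links = [urljoin(base_url, a["href"]) for a in sample_soup.find_all("a", href=True)]

print(page_title)
print(page_text)
print(links)
```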

For more information on how to extract other elements of a webpage, see the Beautiful Soup documentation.