
Scraping Reddit with Python and BeautifulSoup 4

In this tutorial, you'll learn how to get web pages using requests, analyze web pages in the browser, and extract information from raw HTML with BeautifulSoup.

You can find a finished working example of the script we will write here.

What's Web Scraping?

Right, so what exactly is web scraping? As the name implies, it's a method of 'scraping' or extracting data from webpages. Anything you can see on the internet with your browser, including this tutorial, can be scraped onto your local hard drive.

There are many uses for web scraping. For any data analysis, the first step is data acquisition. The internet is a vast repository of all of mankind's history and knowledge, and you have the means of extracting anything you want and doing with that information what you will.

In our tutorial, we'll be using Python and the BeautifulSoup 4 package to get information from a subreddit. We're interested in the datascience subreddit. We want to get the first 1000 posts on the subreddit and export them to a CSV file: for each post, who posted it and how many likes and comments it has.

What we’ll be covering in the tutorial:

  • Getting web pages using requests
  • Analyzing web pages in browser for information
  • Extracting information from raw HTML with BeautifulSoup

Note: We'll be using the older version of Reddit's website because it is more lightweight to load, and hence less strenuous on your machine.

Pre-requisites

This tutorial assumes you know the following things:

  • Running Python scripts on your computer
  • A basic knowledge of HTML structure

You can learn the skills above in DataCamp's Python beginner course. That being said, the concepts used here are very minimal, and you can get away with very little Python know-how.

Now that that's done with, we can move onto the first part of making our web scraper. In fact, the first part of writing any Python script: imports.

In our scraper, we will be using the following packages:

  • requests
  • beautifulsoup4

You can install these packages with pip of course, like so:

pip install requests beautifulsoup4

After you're done downloading the packages, go ahead and import them into your code.

import requests
import csv
import time
from bs4 import BeautifulSoup

We will be using Python's built-in csv module to write our results to a CSV file. You may have noticed something quirky in the snippet above: we installed a package called beautifulsoup4, but we import from a module called bs4. A package's name on PyPI doesn't have to match the name of the module it installs; this can be confusing, but it's perfectly legal in Python.

First Steps

So we have our environment set up and ready. Next, we need the url for the webpage that we want to scrape. For our tutorial, we're using Reddit's 'datascience' subreddit. Before we start writing the script, there’s some field work we need to do. Open a web browser, and go to the subreddit in question.

Note that we’ll be using the older version of the subreddit for our scraper. The newer version hides away some crucial information in the underbelly of the webpage. It’s possible to extract this information from the new site as well, but for the sake of simplicity, we’ll be using the older version which lays out everything bare.

Upon opening the link, you’re met with a flux of information overload. What exactly do we need in all of this? Well, upon probing all of the links, you’ll find that Reddit posts are of two types:

  • Inbound
  • Outbound

Inbound links point to content on Reddit itself, and outbound links are the exact opposite. This matters because we only want the inbound links: posts that contain text written by users, not just links to other websites.

So, we know what we want; how do we go about extracting it? If you look at the titles of the posts, you can see that each is followed by some text in brackets. The posts we're interested in are followed by ‘(self.datascience)’. Here, ‘self’ marks a self-post, i.e. content hosted on Reddit itself, and ‘.datascience’ refers to the subreddit.

Great, so we have a way of identifying which posts are inbound and which are outbound. We now need to identify it in the DOM structure. Is there a way we can find these identifiers by searching the tags, classes, or ids in the DOM structure? Of course!

In your web browser, right click on any ‘self.datascience’ link and click on ‘inspect element’. Most modern web browsers have this great tool that lets web developers traverse the meaningful parts of the source code of the webpage dynamically. If you’re on Safari, make sure you have developer tools enabled in Safari preferences. You’ll be greeted by a pane that is scrolled down to the part of the source code that is responsible for that ‘(self.datascience)’ identifier.

In the pane, you can see the identifier is actually loaded by an anchor tag.

<a href="/r/datascience">self.datascience</a>

But this is enclosed by a span tag.

<span class="domain">
"("
<a href="/r/datascience/">self.datascience</a>
")"
</span>

This span tag is what contains the text that we see on the page, i.e. ‘(self.datascience)’. As you can see, it’s marked with the ‘domain’ class. You can check all the other links to see if they follow the same format, and sure enough; they do.
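You can verify this structure without leaving Python. Here's a minimal sketch that feeds just that span (without the inspector's quoting of the text nodes) to BeautifulSoup and pulls out the pieces:

```python
from bs4 import BeautifulSoup

# The markup from the inspector, with the "(" and ")" as plain text nodes.
snippet = '<span class="domain">(<a href="/r/datascience/">self.datascience</a>)</span>'

soup = BeautifulSoup(snippet, 'html.parser')
domain = soup.find('span', class_='domain')

print(domain.text)        # (self.datascience)
print(domain.a['href'])   # /r/datascience/
```

The `.text` property concatenates all the text nodes inside the span, which is exactly the ‘(self.datascience)’ string we see rendered on the page.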

Getting The Page

We know what we want on the page, and that’s well and all, but how do we use Python to read the contents of the page? Well, it works pretty much the same way a human would read the contents of the page off of a web browser.

First, we need to request the web page using the ‘requests’ library.

url = "https://old.reddit.com/r/datascience/"
# Headers to mimic a browser visit
headers = {'User-Agent': 'Mozilla/5.0'}

# Returns a requests.models.Response object
page = requests.get(url, headers=headers)

Now we have a Response object which contains the raw text of the webpage. As of yet, we can’t do anything with this. All we have is a vast string that contains the entire source code of the HTML file. To make sense of this, we need to use BeautifulSoup 4.

The headers will allow us to mimic a browser visit. Since a response to a bot is different from the response to a browser, and our point of reference is a browser, it’s better to get the browser’s response.
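One thing the snippet above glosses over: requests happily returns error pages too, and parsing one as if it were data leads to confusing failures later. A small defensive wrapper (a sketch; `fetch_page` is our own name, not part of requests):

```python
import requests

def fetch_page(url, headers=None):
    """Fetch a page, failing loudly instead of returning an error page."""
    headers = headers or {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    # raise_for_status() turns 4xx/5xx responses into an HTTPError,
    # so we never hand an error page to BeautifulSoup by accident.
    page.raise_for_status()
    return page
```

A 429 status here would usually mean Reddit is rate-limiting us, which is another reason to slow the scraper down (more on that later).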

BeautifulSoup will allow us to find specific tags, by searching for any combination of classes, ids, or tag names. This is done by creating a syntax tree, but the details of that are irrelevant to our goal (and out of the scope of this tutorial).

So let’s go ahead and create that syntax tree.

soup = BeautifulSoup(page.text, 'html.parser')

The soup is just a BeautifulSoup object that is created by taking a string of raw source code. Keep in mind that we need to specify the html parser. This is because BeautifulSoup can also create soup out of XML.

Finding Our Tags

We know what tags we want (the span tags with the ‘domain’ class), and we have the soup. What comes next is traversing the soup and finding all instances of these tags. You may laugh at how simple this is with BeautifulSoup.

domains = soup.find_all("span", class_="domain")

What did we just do? We called the ‘find_all’ method on the soup object, which searches its whole tree for all the span tags matching the parameters passed in. We only passed in one constraint (the class has to be ‘domain’), but we can combine this with many other filters, or even just use an id.

We use ‘class_=’ because ‘class’ is a keyword reserved by Python for defining classes.

If you wanted to pass in more than one parameter, all you have to do is make the second parameter a dictionary of the arguments you want to include, like so:

soup.find_all("span", {"class": "domain", "height": "100px"})

We have all the span tags in our ‘domains’ list, but what we want are the ‘(self.datascience)’ domains.

for domain in domains:
    if domain.text != "(self.datascience)":
        continue

    print(domain.text)

Right, now you should see a list of the types of the posts on the page, excluding ones that aren’t ‘(self.datascience)’. With that, we basically have everything we want: references to all the posts that are inbound.

Finding Our Information

It’s great and all that we’re printing a couple of lines of ‘(self.datascience)’, but what next? Well, think back to our initial aim. Get the post title, author, likes, and the number of comments.

For this, we have to go back to the browser. If you bring up the inspector on our identifier again, you’ll see that our span tag is nested inside a div tag… which is itself nested inside another div tag, and so on. Keep going up, and you’ll get to the parent div for the entire post.

<div class=" thing id-t3_8qccuv even  link self" ...>
    ...
    <div class="entry unvoted">
        <div class="top-matter">
            <p class="title">
                ...
                <span class="domain">
                ...
            </p>
        </div>
    </div>
</div>

As you can see, the common parent is 4 levels above them in the DOM structure. The ellipses (…) represent unnecessary chatter. What we need now is a reference to the post div for each post in our domains list. We can find the parent of any element in our soup using BeautifulSoup’s methods.

for domain in soup.find_all("span", class_="domain"):
    if domain.text != "(self.datascience)":
        continue

    parent_div = domain.parent.parent.parent.parent
    print(parent_div.text)

Running the script will print all the text of all the inbound posts. But you may think that this looks like bad code. You’d be right. It is never safe to rely solely on the structural integrity of the DOM. It’s the kind of hack and slash solution that needs constant updating and is the code equivalent of a ticking time bomb.

Instead, let’s take a look at the parent div for each post and see if we can segregate inbound and outbound links from the parent itself. Each parent div has an attribute called ‘data-domain’, whose value is exactly what we want! All the inbound posts have the data-domain set to ‘self.datascience’.

As we have mentioned before, we can search for tags with a combination of attributes with BeautifulSoup. Lucky for us, Reddit chose to class the parent div of each post with ‘thing’.

attrs = {'class': 'thing', 'data-domain': 'self.datascience'}

for post in soup.find_all('div', attrs=attrs):
    print(post.attrs['data-domain'])

The ‘attrs’ variable is a dictionary of the attributes and values we want to search for. This is a much safer approach to finding the posts because it involves fewer moving parts. And since we’re already passing ‘self.datascience’ as a search argument, we no longer need the if statement to skip iterations; we’re guaranteed to receive only posts with the ‘self.datascience’ data-domain attribute.
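As an aside, BeautifulSoup also understands CSS selectors through its ‘select’ method, so the same filter can be written as an attribute selector. Here's a sketch against a tiny stand-in document, since the technique doesn't depend on Reddit's full markup:

```python
from bs4 import BeautifulSoup

# A stand-in for the page: one inbound post and one outbound post.
html = '''
<div class="thing" data-domain="self.datascience">inbound post</div>
<div class="thing" data-domain="example.com">outbound post</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# CSS attribute selector: class "thing" AND an exact data-domain match.
posts = soup.select('div.thing[data-domain="self.datascience"]')

print(len(posts))      # 1
print(posts[0].text)   # inbound post
```

Whether you prefer ‘find_all’ with an attrs dictionary or ‘select’ with a selector string is largely a matter of taste; they find the same tags.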

Now that we have all the information we want, what’s left is just extracting the information from the children. Which is as simple as you think it is, by the way.

Extracting Our Information

For each post, we need 4 pieces of information.

  • Title
  • Author
  • Likes
  • Comments

If we take a look at the structure of the post div, we can find all of this information.

The Title

This is the simplest one yet. In each post div, there exists a paragraph tag nested under a couple of layers of divs. It’s particularly easy to find because it has the ‘title’ class attached to it.

title = post.find('p', class_="title").text

The post object is within our for loop from earlier. It is also a BeautifulSoup object, but it contains only the DOM structure within the post div. The ‘find’ method returns a single object, as opposed to 'find_all', which returns a list of objects meeting the criteria. After finding the tag, we only want the string containing the title, so we use its text property to read it. We can also read other attributes using ‘object.attrs[‘attribute’]’.
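To make the ‘.text’ versus ‘.attrs’ distinction concrete, here's a toy example (the markup and the href in it are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like a post title; the href is invented.
html = '<p class="title"><a href="/r/datascience/comments/abc123/">My post</a></p>'

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('p', class_='title').a

print(tag.text)            # My post  (the visible string)
print(tag.attrs['href'])   # /r/datascience/comments/abc123/  (an attribute)
print(tag['href'])         # same lookup, shorthand subscript form
```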

The Author

This one is also relatively simple. Open the inspector on any author’s name under a post title; you’ll see that the author name is in an anchor tag classed with ‘author’.

author = post.find('a', class_='author').text

The Comments

This one requires some extra work but is still simple. We can find the comments the same way we found the title and author.

comments = post.find('a', class_='comments').text

When you run this, you’ll get something like “49 comments”. This is okay, but it would be better if we only got the number. To do this, we need to use some more Python.

If we use Python’s ‘str.split()’ method, it returns a list of the whitespace-separated elements of the string. In our case, we get the list ‘["49", "comments"]’. Great! Now all we have to do is take the first element and store it, which means appending the calls to our line.

comments = post.find('a', class_='comments').text.split()[0]

But we’re still not done, because sometimes we don’t get a number; we get ‘comment’. This happens when there are no comments on a post. Since we know this, all we have to do is check whether we got ‘comment’ as the result and replace it with 0.

if comments == "comment":
    comments = 0

The Likes

Finding the number of likes is a piece of cake, and falls in line with the logic we used above.

likes = post.find("div", attrs={"class": "score likes"}).text

Here though, you may notice we’ve used a combination of two classes. This is simply because there are multiple other divs with the class ‘score’ or ‘likes’, but only one with the combination of ‘score’ and ‘likes’.

If the number of likes is 0, we get 0. But something weird happens if the post is so new that it doesn’t have any likes. We get “•”. Let’s replace this with “None” so it doesn’t cause confusion in our final results.

if likes == "•":
    likes = "None"
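Both clean-ups can be collected into small helpers so the loop body stays readable. A sketch; ‘parse_comments’ and ‘parse_likes’ are our own names, and they encode exactly the two special cases described above:

```python
def parse_comments(text):
    """Turn '49 comments' into '49', and a bare 'comment' link into '0'."""
    first = text.split()[0]
    return "0" if first == "comment" else first

def parse_likes(text):
    """Reddit shows '•' for posts too new to have a score yet."""
    return "None" if text == "•" else text

print(parse_comments("49 comments"))  # 49
print(parse_comments("comment"))      # 0
print(parse_likes("•"))               # None
print(parse_likes("12"))              # 12
```

Keeping the values as strings is fine here, since they're headed straight for a CSV file anyway.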

Writing Our Results to CSV

So far, we have a loop that extracts the title, author, likes, and comments for each post on the webpage. Python has a great built-in module for reading and writing CSV files, named ‘csv’ in keeping with the pythonic way: keep it simple. All we have to do is add a few lines at the end of the loop body that append the details of the post to a CSV file.

counter = 1
posts = soup.find_all('div', attrs=attrs)

for post in posts:
    ...
    post_line = [counter, title, author, likes, comments]
    # newline='' stops csv from writing blank lines between rows on Windows
    with open('output.csv', 'a', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(post_line)

    counter += 1

Pretty straightforward, right? post_line is a list holding the fields that will be written out separated by commas, and the counter variable keeps track of how many posts we've recorded.
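An alternative worth considering is opening the file once, writing a header row, and reusing the writer, rather than reopening the file for every post. A sketch, with hypothetical rows standing in for the scraped values (the column names are our choice, not something Reddit dictates):

```python
import csv

# Hypothetical rows standing in for the values scraped in the loop.
rows = [
    [1, "Example title", "some_author", "12", "3"],
    [2, "Another title", "other_author", "None", "0"],
]

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["number", "title", "author", "likes", "comments"])
    for row in rows:
        writer.writerow(row)
```

The header row makes the output self-describing, which pays off the moment you load it into pandas or a spreadsheet.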

Moving to The Next Page

Since we’re counting the number of posts we’re storing, all we have to do is enclose our entire logic in another loop that requests new pages until a certain amount of posts are recorded.

counter = 1
posts = soup.find_all('div', attrs=attrs)

while counter <= 100:
    for post in posts:
        # Extract the title, author, likes, and comments,
        # write them to the CSV, and increment the counter.
        ...

    next_button = soup.find("span", class_="next-button")
    next_page_link = next_button.find("a").attrs['href']
    time.sleep(2)
    page = requests.get(next_page_link, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    posts = soup.find_all('div', attrs=attrs)

We just did a number of things, didn’t we? Let’s look at it one line at a time. First off, we moved the counter variable up a block, so that the while loop can use it.

Next, we know what the for loop does, but what about the others? The ‘next_button' variable is used to find and store the 'next' button. After that, we can find the anchor tag within and get the 'href' attribute, which we store in 'next_page_link’.

We can use this link to request the next page and store it back in ‘page’ and make another soup with BeautifulSoup.

The snippet above stops after 100 posts, but you can raise the limit to any number you like (say, the 1000 we set out to collect).

Scraping Responsibly

The line we didn’t talk about above is ‘time.sleep(2)’, and that’s because it deserves its own section. Any web server has a finite amount of resources, so it’s up to us to make sure we don’t use them all up. Not only will requesting hundreds of pages in a matter of seconds get you banned, it also just isn’t nice.

It’s important to keep in mind that it’s pretty nice of the website to allow you to scrape at all, because if they wanted to, they could detect bots within the first 10 to 20 requests, or even flag you based on the request signature Python sends. It’s nice of them to let you scrape without the need of rotating your IP, so you should do them the service of slowing your bot down.

You can find a website's crawl policy by looking at its robots.txt file, usually found at the root of the site (e.g., http://www.reddit.com/robots.txt).
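Python's standard library can even read these files for you. Here's a sketch using ‘urllib.robotparser’ on a made-up robots.txt, parsed from a string so we don't need the network:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: everything under /private/ is off limits.
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/public/page"))   # True
```

In a real scraper you'd point ‘rp.set_url()’ at the site's live robots.txt and call ‘rp.read()’ before fetching anything else.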

What Next?

Well, whatever you want, really. As we’ve just shown, anything you can see on the web can be scraped and stored locally. You can use the information we just acquired for a multitude of purposes. With just the four pieces of information we found, we can draw a lot of conclusions.

By analyzing our information further, we can figure out which kinds of posts receive the most likes, what words they contain, and who posts them. Or, conversely, we could make a controversy calculator by analyzing the ratio of likes to comments.

Consider a post that receives a lot of comments but no likes. This most likely means that the post contains something that strikes a chord in people, but not necessarily something that they like.

There are a lot of possibilities, and it’s up to you to choose how you will use the information.

If you want to learn more about Python, check out the following courses from DataCamp:

Intro to Python for Data Science

Importing Data in Python (Part 1)
