Web Scraping in R

4 hours
3,600 XP

Course Description

Have you ever come across a website that displays a lot of data, such as statistics, product reviews, or prices, in a format that's not ready for analysis? Authorities and other data providers often publish their data in neatly formatted tables, but not every site includes a download button. Don't despair: in this course, you'll learn how to efficiently collect and download data from almost any website using R. You'll learn how to automate the scraping and parsing of Wikipedia pages using the rvest and httr packages. Through hands-on exercises, you'll also deepen your understanding of HTML and CSS, the building blocks of web pages, as you make your data harvesting workflows less error-prone and more efficient.
  1.

    Introduction to HTML and Web Scraping


    In this chapter, you'll be introduced to HyperText Markup Language (HTML), the declarative language used to structure modern websites. Using the rvest package, you'll learn how to query simple HTML elements and scrape your first table.
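The first steps described above can be sketched with rvest. This is a minimal, self-contained example that parses an inline snippet via minimal_html() instead of a live page; the HTML content itself is invented for illustration.

```r
# Minimal sketch of the chapter's first steps: read in HTML and
# query a simple element. minimal_html() stands in for a live page;
# the snippet below is invented for illustration.
library(rvest)

html <- minimal_html('
  <h1>Web scraping</h1>
  <p class="intro">Scraping automates the collection of web data.</p>
')

# Query the <h1> element by tag name and extract its text
heading <- html %>% html_element("h1") %>% html_text()
```

With a real page you would replace minimal_html() with read_html("https://..."), which downloads and parses the document in one step.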

    Introduction to HTML
    50 XP
    Read in HTML
    100 XP
    Beware of syntax errors!
    50 XP
    Navigating HTML
    50 XP
    Select all children of a list
    100 XP
    Parse hyperlinks into a data frame
    100 XP
    Scrape your first table
    50 XP
    The right order of table elements
    100 XP
    Turn a table into a data frame with html_table()
    100 XP
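The final lessons in the list above can be sketched as follows, again on an invented HTML snippet: hyperlinks are collected into a data frame from their link text and href attributes, and html_table() converts an HTML table directly into a data frame.

```r
# Sketch of the chapter's closing lessons: parse hyperlinks into a
# data frame, then turn a table into a data frame with html_table().
# The HTML snippet is invented for illustration.
library(rvest)

html <- minimal_html('
  <a href="https://www.r-project.org">R</a>
  <a href="https://rvest.tidyverse.org">rvest</a>
  <table>
    <tr><th>package</th><th>downloads</th></tr>
    <tr><td>rvest</td><td>100</td></tr>
    <tr><td>httr</td><td>200</td></tr>
  </table>')

# Parse hyperlinks into a data frame: one column for the link text,
# one for the URL stored in the href attribute
links <- html %>% html_elements("a")
link_df <- data.frame(
  text = links %>% html_text(),
  url  = links %>% html_attr("href")
)

# Turn the table into a data frame; html_table() uses the <th> row
# as column headers
pkg_df <- html %>% html_element("table") %>% html_table()
```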
  4.

    Scraping Best Practices

    Now that you know how to extract content from web pages, it's time to look behind the curtain. In this final chapter, you'll learn why HTTP requests are the foundation of every scraping action and how to customize them to comply with web scraping best practices.

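As a sketch of the practices this chapter covers, the request below identifies the scraper through a custom User-Agent header via httr, checks the status code before parsing, and pauses between requests. The URL and contact address are placeholders, not part of the course.

```r
# Hedged sketch: customizing an HTTP request with httr.
library(httr)
library(rvest)

# Identify yourself so the site operator can contact you if needed
response <- GET(
  "https://example.com",
  user_agent("polite-scraper/0.1; contact: you@example.com")
)

# Only parse the body if the request succeeded
if (status_code(response) == 200) {
  page <- read_html(content(response, as = "text", encoding = "UTF-8"))
}

# Throttle: pause between successive requests to avoid overloading
# the server
Sys.sleep(1)
```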

In the following tracks

R Developer


Amy Peterson
Maggie Matsui
Timo Grossenbacher

Head of Newsroom Automation at Tamedia

Timo Grossenbacher is Head of Newsroom Automation at the Swiss publisher Tamedia. Before that, he was a data journalist at the Swiss public broadcaster SRF, where he used scripting and databases for almost every data-driven story he published. He also teaches data journalism at the University of Zurich and has created resources for doing data journalism with R. Follow him at @grssnbchr on Twitter or visit his personal website.
