
Web Scraping with R and PhantomJS

Short tutorial on scraping JavaScript-generated data with R using PhantomJS.
Mar 2015  · 5 min read

When you need to do web scraping, you would normally reach for Hadley Wickham's rvest package. This package provides an easy-to-use, out-of-the-box solution for fetching the HTML code that generates a webpage. However, when the website or webpage uses JavaScript to display the data you're interested in, the rvest package lacks the required functionality. One solution is to use PhantomJS.
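For a static page, the standard rvest flow takes only a couple of lines. Here is a minimal sketch, written against the rvest version current at the time of writing (newer versions call this function read_html()) and using a placeholder URL:

# the usual rvest flow for a static page (http://example.com is a placeholder)
page <- html("http://example.com")
page %>%
  html_nodes("table") %>%
  html_table()

On a JavaScript-heavy page, the tables you see in the browser simply aren't in the HTML this returns, which is where PhantomJS comes in.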

(Want to practice importing more data into R? Try this tutorial.)

Load the Necessary Packages

As a first step, you'll have to load all the packages needed for this analysis into the workspace (if you haven't installed these packages on your local system yet, use install.packages() to make them available):

library(rvest)
library(stringr)
library(plyr)
library(dplyr)
library(ggvis)
library(knitr)
options(digits = 4)
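If some of these packages are not installed yet, a one-off call along these lines makes them available (a minimal sketch, using exactly the package names loaded above):

# install any of the required packages that are still missing
install.packages(c("rvest", "stringr", "plyr", "dplyr", "ggvis", "knitr"))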

Scraping JavaScript-Generated Data with R

The next step is collecting the TechStars data using PhantomJS. Check out the following basic .js file:

// scrape_techstars.js

// create a PhantomJS page object
var webPage = require('webpage');
var page = webPage.create();

// file system module and output path for the rendered HTML
var fs = require('fs');
var path = 'techstars.html';

page.open('http://www.techstars.com/companies/stats/', function (status) {
  // grab the page content after the JavaScript has run and write it to disk
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});

The script renders the HTML page after the underlying JavaScript code has done its work, allowing you to fetch the HTML page with all the tables in it. To stay in R for the rest of this analysis, we suggest you use the system() function to invoke PhantomJS (you'll have to download and install PhantomJS and put it in your working directory):

# Let PhantomJS scrape TechStars; output is written to techstars.html
system("./phantomjs scrape_techstars.js")

After this small detour, you finally have an HTML file, techstars.html, on your local system that can be scraped with rvest. An inspection of the TechStars webpage reveals that the tables we're interested in are located in divs with the CSS class batch:

batches <- html("techstars.html") %>%
  html_nodes(".batch")

class(batches)
[1] "XMLNodeSet"

You now have an XMLNodeSet: each node contains the data for a single TechStars batch. In there, you can find information on the batch location, the year and the season, but also on the companies, their current headquarters, their current status, and the total amount of funding they raised. We will not go into detail on the data collection and cleaning steps below; you can execute the code yourself and inspect what it accomplishes. You'll see that some custom cleaning is going on to make sure that each bit of information is nicely formatted:

batch_titles <- batches %>%
  html_nodes(".batch_class") %>%
  html_text()

batch_season <- str_extract(batch_titles, "(Fall|Spring|Winter|Summer)")
batch_year <- str_extract(batch_titles, "([[:digit:]]{4})")
# location info is everything in the batch title that is not year info or season info
batch_location <- sub("\\s+$", "",
                      sub("([[:digit:]]{4})", "",
                          sub("(Fall|Spring|Winter|Summer)","",batch_titles)))

# create data frame with batch info.
batch_info <- data.frame(location = batch_location,
                         year = batch_year,
                         season = batch_season)

breakdown <- lapply(batches, function(x) {
  # each ".parent" node holds the row for one company
  company_info <- x %>% html_nodes(".parent")
  companies_single_batch <- lapply(company_info, function(y){
    # extract the cell texts and drop the literal "[+][-] " prefix
    as.list(gsub("\\[\\+\\]\\[\\-\\]\\s", "", y %>%
       html_nodes("td") %>%
       html_text()))
  })
  # one data frame per batch: company, funding, status, hq
  df <- data.frame(matrix(unlist(companies_single_batch),
                   nrow = length(companies_single_batch),
                   byrow = TRUE,
                   dimnames = list(NULL, c("company","funding","status","hq"))))
  return(df)
})

# Add batch info to breakdown
# repeat each batch's info once for every company in that batch
batch_info_extended <- batch_info[rep(seq_len(nrow(batch_info)),
                                  sapply(breakdown, nrow)),]
breakdown_merged <- rbind.fill(breakdown)

# Merge all information
techstars <- tbl_df(cbind(breakdown_merged, batch_info_extended)) %>%
  mutate(funding = as.numeric(gsub(",","",gsub("\\$","",funding))))
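The final mutate() turns the funding strings into numbers by stripping dollar signs and thousands separators; on a made-up value, the transformation looks like this:

# "$1,180,000" becomes 1180000 (made-up example value)
as.numeric(gsub(",", "", gsub("\\$", "", "$1,180,000")))
## [1] 1180000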

With a combination of base R, rvest, plyr and dplyr functions, you now have the techstars data frame: a data set of all TechStars companies, with all publicly available information nicely formatted:

techstars
## Source: local data frame [535 x 7]
##
##          company funding   status                hq location year season
## 1    Accountable  110000   Active    Fort Worth, TX   Austin 2013   Fall
## 2          Atlas 1180000   Active        Austin, TX   Austin 2013   Fall
## 3        Embrace  110000   Failed        Austin, TX   Austin 2013   Fall
## 4  Filament Labs 1490000   Active        Austin, TX   Austin 2013   Fall
## 5        Fosbury  300000   Active        Austin, TX   Austin 2013   Fall
## 6          Gone!  840000   Active San Francisco, CA   Austin 2013   Fall
## 7     MarketVibe  110000 Acquired        Austin, TX   Austin 2013   Fall
## 8           Plum 1630000   Active        Austin, TX   Austin 2013   Fall
## 9  ProtoExchange  110000   Active        Austin, TX   Austin 2013   Fall
## 10       Testlio 1020000   Active        Austin, TX   Austin 2013   Fall
## ..           ...     ...      ...               ...      ...  ...    ...
names(techstars)
## [1] "company"  "funding"  "status"   "hq"       "location" "year"
## [7] "season"
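From here on, the usual dplyr verbs apply. As an illustration (a sketch only; the exact figures depend on the scraped data), you could rank batch locations by the total funding their companies raised:

# total funding per batch location, highest first (illustrative sketch)
techstars %>%
  group_by(location) %>%
  summarise(total_funding = sum(funding, na.rm = TRUE)) %>%
  arrange(desc(total_funding))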

Conclusion

Using the above combination of tools and code, we managed to scrape data from a website that uses JavaScript to generate its data. As you can see, this is a very structured process that can easily be repeated once the initial code is in place. Want to learn R, and looking for data science and R tutorials and courses? Check out DataCamp.
