Scalable Data Processing in R

Learn how to write scalable code for working with big data in R using the bigmemory and iotools packages.

4 Hours | 15 Videos | 49 Exercises | 4,876 Learners | 3950 XP


Course Description

Data sets are often larger than available RAM, which causes problems for R programmers because R stores all variables in memory by default. You'll learn tools for processing, exploring, and analyzing data directly from disk. You'll also implement the split-apply-combine approach and learn how to write scalable code using the bigmemory and iotools packages. In this course, you'll use the Federal Housing Finance Agency's data, a publicly available data set chronicling all mortgages that were held or securitized by both the Federal National Mortgage Association (Fannie Mae) and the Federal Home Loan Mortgage Corporation (Freddie Mac) from 2009 to 2015.
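As a rough illustration of processing data directly from disk, the sketch below computes a column mean by reading a CSV file in fixed-size chunks, so only one chunk is in memory at a time. It uses base R only and is not the bigmemory or iotools approach taught in the course; the helper name `chunked_mean`, the demo file, and its `id` and `amount` columns are all made up for this example.

```r
# Write a small demo file; a real data set would be far larger than this.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, amount = (1:1000) * 2), path,
          row.names = FALSE)

# Hypothetical helper: mean of one column, computed chunk by chunk.
chunked_mean <- function(path, column, chunk_rows = 100) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Read the header line once so every chunk can reuse the column names.
  col_names <- strsplit(gsub('"', "", readLines(con, n = 1)), ",")[[1]]

  total <- 0
  n <- 0
  repeat {
    # The open connection remembers its position between reads.
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break

    chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)

    # Keep only running totals; the chunk itself is discarded afterwards.
    total <- total + sum(chunk[[column]])
    n <- n + nrow(chunk)
  }
  total / n
}

result <- chunked_mean(path, "amount")
```

Because only the running sum and row count are kept between chunks, peak memory use depends on `chunk_rows` rather than on the size of the file.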

  1. Working with increasingly large data sets

    Free

    In this chapter, we cover why new techniques are needed when data sets are larger than available RAM. We show that importing and exporting data with the base R functions can be slow, and we present some easy ways to remedy this. Finally, we introduce the bigmemory package.

    What is Scalable Data Processing? (50 xp)
    Why is your code slow? (50 xp)
    How does processing time vary by data size? (100 xp)
    Working with "Out-of-Core" Objects using the Bigmemory Project (50 xp)
    Reading a big.matrix object (100 xp)
    Attaching a big.matrix object (100 xp)
    Creating tables with big.matrix objects (100 xp)
    Data summary using bigsummary (100 xp)
    References vs. Copies (50 xp)
    Copying matrices and big matrices (100 xp)
  2. Processing and Analyzing Data with bigmemory

    Now that you have some experience using bigmemory, we'll walk through some simple data exploration and analysis techniques. In particular, we'll see how to create tables and implement the split-apply-combine approach.
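To make the table and split-apply-combine ideas concrete, here is a minimal base R sketch on an ordinary in-memory matrix. It is a stand-in under stated assumptions, not the course's bigmemory code: the `loans` matrix, its `year` and `amount` columns, and all values are invented for illustration.

```r
# A small in-memory matrix standing in for a larger data set; values are made up.
loans <- cbind(
  year   = c(2009, 2009, 2010, 2010, 2010, 2011),
  amount = c(100, 200, 150, 250, 50, 300)
)

# Table: number of records per year.
year_counts <- table(loans[, "year"])

# Split: group row indices (not the data itself) by year.
row_groups <- split(seq_len(nrow(loans)), loans[, "year"])

# Apply: compute the mean amount within each group of rows.
group_means <- lapply(row_groups, function(rows) mean(loans[rows, "amount"]))

# Combine: collapse the per-group results into one named vector.
mean_by_year <- unlist(group_means)
```

Splitting row indices instead of the rows themselves keeps the split step cheap, since only integer positions are copied and each group's data is pulled out one group at a time during the apply step.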


In the following tracks

Big Data

Collaborators

Sumedh Panchadhar, Richie Cotton

Michael Kane

Assistant Professor at Yale University

Michael Kane is an Assistant Professor at Yale University. His research is in the area of scalable statistical/machine learning and applied probability.

Simon Urbanek

Member of the R-Core; Lead Inventive Scientist at AT&T Labs Research

Simon Urbanek is a member of the R-Core and Lead Inventive Scientist at AT&T Labs Research. His research is in the areas of R, statistical computing, visualization, and interactive graphics.
