Datasets are often larger than available RAM, which causes problems for R programmers because, by default, R stores all variables in memory. You'll learn tools for processing, exploring, and analyzing data directly from disk. You'll also implement the split-apply-combine approach and learn how to write scalable code using the bigmemory and iotools packages. In this course, you'll use the Federal Housing Finance Agency's data, a publicly available data set chronicling all mortgages that were held or securitized by both the Federal National Mortgage Association (Fannie Mae) and the Federal Home Loan Mortgage Corporation (Freddie Mac) from 2009 to 2015.
Working with increasingly large data sets
In this chapter, we cover the reasons you need to apply new techniques when data sets are larger than available RAM. We show that importing and exporting data with the base R functions can be slow, and we present some easy ways to remedy this. Finally, we introduce the bigmemory package.
- What is Scalable Data Processing?
- Why is your code slow?
- How does processing time vary by data size?
- Working with "Out-of-Core" Objects using the Bigmemory Project
- Reading a big.matrix object
- Attaching a big.matrix object
- Creating tables with big.matrix objects
- Data summary using bigsummary
- References vs. Copies
- Copying matrices and big matrices
Processing and Analyzing Data with bigmemory
Now that you've got some experience using bigmemory, we're going to go through some simple data exploration and analysis techniques. In particular, we'll see how to create tables and implement the split-apply-combine approach.
- The Bigmemory Suite of Packages
- Tabulating using bigtable
- Borrower Race and Ethnicity by Year (I)
- Split-Apply-Combine
- Female Proportion Borrowing
- Split
- Apply
- Combine
- Visualize your results using the tidyverse
- Visualizing Female Proportion Borrowing
- The Borrower Income Ratio
- Tidy Big Tables
- Limitations of bigmemory
- Where should you use bigmemory?
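The split-apply-combine approach named in this chapter can be sketched in base R on a small in-memory table. This is only an illustration, not the course's own code: the data frame, its column names (year, borrower_gender), and the gender coding are hypothetical stand-ins for the mortgage data, and the course itself works with big.matrix objects rather than a data.frame.

```r
# Toy stand-in for the mortgage data; column names and values are hypothetical.
mort <- data.frame(
  year = c(2009, 2009, 2009, 2010, 2010, 2010, 2010),
  borrower_gender = c(1, 2, 2, 1, 1, 2, 1)  # assumed coding: 1 = male, 2 = female
)

# Split: group the row indices by year.
rows_by_year <- split(seq_len(nrow(mort)), mort$year)

# Apply: compute the female proportion within each group of rows.
female_prop <- function(rows) {
  mean(mort$borrower_gender[rows] == 2)
}
props <- lapply(rows_by_year, female_prop)

# Combine: collect the per-year results into one named vector.
female_by_year <- unlist(props)
print(female_by_year)
```

Splitting row indices rather than the rows themselves keeps the split step cheap, which matters once the table is too large to copy freely.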
Working with iotools
We'll use the iotools package, which can process both numeric and string data, and introduce the concept of chunk-wise processing.
- Introduction to chunk-wise processing
- Can you split-compute-combine it?
- Foldable operations (I)
- Foldable operations (II)
- A first look at iotools: Importing data
- Compare read.delim() and read.delim.raw()
- Reading raw data and turning it into a data structure
- chunk.apply
- Reading chunks in as a matrix
- Reading chunks in as a data.frame
- Parallelizing calls to chunk.apply
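As a rough illustration of chunk-wise processing with a foldable operation, the sketch below computes a mean by reading a few lines at a time and keeping only a sum and a count per chunk. It deliberately uses base R connections instead of iotools, so it is not how chunk.apply works internally; the input values and the chunk size are made up for the example.

```r
# Hypothetical data: one numeric value per line, standing in for a large file.
values <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
con <- textConnection(as.character(values))

chunk_size <- 4     # number of lines read per chunk (arbitrary choice)
partials <- list()  # per-chunk partial results

repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break
  x <- as.numeric(lines)
  # Keep only what is needed to combine later: the chunk's sum and count.
  partials[[length(partials) + 1]] <- c(sum = sum(x), n = length(x))
}
close(con)

# Combine (fold) the partial results into the overall mean.
totals <- Reduce(`+`, partials)
chunked_mean <- unname(totals["sum"] / totals["n"])
print(chunked_mean)
```

The mean is foldable in this sense because sums and counts from separate chunks can simply be added; a statistic such as the median cannot be combined this way from per-chunk results.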
Case Study: A Preliminary Analysis of the Housing Data
In the previous chapters, we've introduced the housing data and shown how to compute with data that is about as big as, or bigger than, the amount of RAM on a single machine. In this chapter, we'll go through a preliminary analysis of the data, comparing various trends over time.
- Overview of types of analysis for this chapter
- Race and Ethnic Representation in the Mortgage Data
- Comparing the Borrower Race/Ethnicity and their Proportions
- Are the data missing at random?
- Looking for Predictable Missingness
- A little more about missingness
- Analyzing the Housing Data
- Borrower Race and Ethnicity by Year (II)
- Visualizing the Adjusted Demographic Trends
- Relative change in demographic trend
- Borrower Lending Trends: City vs. Rural
- Borrower Region by Year
- Who is securing federally guaranteed loans?
- Congratulations!
In the following tracks: Big Data
Datasets: Mortgage data (sample)
Prerequisites: Writing Efficient R Code
Assistant Professor at Yale University
Michael Kane is an Assistant Professor at Yale University. His research is in the area of scalable statistical/machine learning and applied probability.
Member of the R-Core; Lead Inventive Scientist at AT&T Labs Research
Simon Urbanek is a member of the R-Core and Lead Inventive Scientist at AT&T Labs Research. His research is in the areas of R, statistical computing, visualization, and interactive graphics.