With the increasing amount of data and ever more complex algorithms available to scientists and practitioners today, parallel processing is almost always a must, and is in fact expected of packages implementing time-consuming methods. This course introduces you to the concepts and tools available in R for parallel computing, and provides solutions to a few important, non-trivial issues in parallel processing: reproducibility, random number generation, and load balancing.
Can I Run My Application in Parallel? (Free)
In order to take advantage of a parallel environment, the application needs to be split into independent pieces. In this introductory chapter, you will learn about different ways of partitioning a problem and how they fit different hardware configurations. You will also be introduced to various R packages that support parallel programming.

- Partitioning problems into independent pieces (50 xp)
- Partitioning demographic model (50 xp)
- Partitioning probabilistic demographic model (50 xp)
- Find the most frequent words in a text (100 xp)
- Models of parallel computing (50 xp)
- A simple embarrassingly parallel application (100 xp)
- Probabilistic projection of migration (setup) (50 xp)
- Probabilistic projection of migration (100 xp)
- R packages for parallel computing (50 xp)
- Passing arguments via clusterApply() (100 xp)
- Sum in parallel (100 xp)
- More tasks than workers (100 xp)
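To give a flavor of what an embarrassingly parallel application looks like, here is a minimal sketch (not taken from the course materials) using the base-R parallel package and its clusterApply() function, which the chapter introduces:

```r
library(parallel)

# Start a small cluster of worker processes
# (the socket backend works on all platforms)
cl <- makeCluster(2)

# An embarrassingly parallel task: each element is
# processed independently, so workers need not communicate
squares <- clusterApply(cl, 1:4, function(x) x^2)

stopCluster(cl)

unlist(squares)  # 1 4 9 16
```

clusterApply() returns a list with one element per input value; unlist() flattens it back into a vector.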
The parallel Package
This chapter dives deeper into the parallel package. You'll learn about the various backends and their differences, and gain a deep understanding of the workhorse of the package, the clusterApply() function. Strategies for task segmentation, including their pitfalls, will also be discussed.

- Cluster basics (50 xp)
- Exploring the cluster object (100 xp)
- Socket vs. Fork (100 xp)
- The core of parallel (50 xp)
- Benchmarking setup (100 xp)
- Task size matters (100 xp)
- Initialization of nodes (50 xp)
- Loading package on nodes (100 xp)
- Setting global variables (100 xp)
- Exporting global objects (100 xp)
- Subsetting data (50 xp)
- Passing data as arguments (100 xp)
- Chunking migration application on worker's side (100 xp)
- Alternative chunking (100 xp)
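Node initialization, mentioned above, matters because each socket worker is a fresh R session that knows nothing about your global environment. A minimal sketch (the variable name base_rate is illustrative, not from the course) of exporting an object and loading a package on every node:

```r
library(parallel)

cl <- makeCluster(2)

# Each worker is a fresh R session, so global objects
# must be explicitly exported before they can be used
base_rate <- 0.05
clusterExport(cl, "base_rate")

# Likewise, packages must be loaded on every node
clusterEvalQ(cl, library(stats))

res <- clusterApply(cl, c(100, 200), function(x) x * base_rate)

stopCluster(cl)

unlist(res)  # 5 10
```

Forgetting clusterExport() is a classic source of "object not found" errors on socket clusters; fork clusters inherit the parent environment instead, which is one of the backend differences this chapter covers.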
foreach, future.apply and Load Balancing
In this chapter, you will look at two user-contributed packages, foreach and future.apply, which make parallel programming in R even easier. They are built on top of the parallel and future packages. In the last lesson of this chapter, you will learn about the advantages and pitfalls of load balancing and scheduling.

- foreach (50 xp)
- Combining results (50 xp)
- Word frequency with foreach (100 xp)
- Multiple iterators in word frequency (100 xp)
- foreach and parallel backends (50 xp)
- Using doParallel (100 xp)
- Word frequency with doParallel (100 xp)
- Word frequency with doFuture and benchmarking (100 xp)
- future and future.apply (50 xp)
- Word frequency with future.apply (100 xp)
- Planning future (100 xp)
- Benchmark future (100 xp)
- Load balancing and scheduling (50 xp)
- Load balancing (100 xp)
- Scheduling (100 xp)
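The foreach idiom separates what to iterate over from where it runs: you register a parallel backend once, and %dopar% dispatches iterations to it. A minimal sketch (not from the course exercises) using the doParallel backend:

```r
library(doParallel)  # loads foreach and parallel as well

# Register a parallel backend so %dopar% runs on worker processes
cl <- makeCluster(2)
registerDoParallel(cl)

# .combine specifies how per-iteration results are merged;
# here they are concatenated into a single vector
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i^2
}

stopCluster(cl)

squares  # 1 4 9 16
```

With future.apply the same computation would be written as future_sapply(1:4, function(i) i^2) after calling plan(multisession), letting you switch backends without touching the loop body.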
Random Numbers and Reproducibility
Now you might ask: can I reproduce my results if the application uses random numbers? Can I generate the same results regardless of whether the code runs sequentially or in parallel? This chapter answers these questions. You will learn about a random number generator well suited to a parallel environment, and how the various packages make use of it.

- Are my results reproducible? (50 xp)
- Reproducibility (I) (50 xp)
- SOCK vs. FORK (100 xp)
- SOCK vs. FORK & random numbers (0 xp)
- Parallel random number generators (50 xp)
- Setting an RNG (100 xp)
- Reproducible results in parallel (100 xp)
- Non-reproducible results in parallel (100 xp)
- Reproducibility in foreach and future.apply (50 xp)
- Reproducibility (II) (50 xp)
- Reproducing migration app with foreach (100 xp)
- Reproducing migration app with future.apply (100 xp)
- Next steps (50 xp)
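The parallel-friendly generator in question is L'Ecuyer-CMRG, which gives each worker its own independent, reproducible random number stream. A minimal sketch (not from the course exercises) of reproducible parallel draws with clusterSetRNGStream():

```r
library(parallel)

cl <- makeCluster(2)

# Switch all workers to the L'Ecuyer-CMRG generator with a fixed
# seed; each worker receives an independent random number stream
clusterSetRNGStream(cl, iseed = 42)
run1 <- clusterApply(cl, rep(3, 2), rnorm)

# Resetting the streams with the same seed reproduces
# exactly the same draws
clusterSetRNGStream(cl, iseed = 42)
run2 <- clusterApply(cl, rep(3, 2), rnorm)

stopCluster(cl)

identical(run1, run2)  # TRUE
```

Note that an ordinary set.seed() on the master is not enough: without per-stream seeding, each worker would generate numbers from its own unrelated RNG state.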
Prerequisites: Writing Efficient R Code
Senior Research Scientist, University of Washington
Hana works as a senior research scientist at the University of Washington's Center for Statistics and the Social Sciences, in the area of statistical computing. She also works as a data scientist at the Puget Sound Regional Council in Seattle. She has been involved in projects advancing parallel computing in statistics. She contributed to the snow package, which became the R core package parallel. She developed a methodology for fault-tolerant and reproducible parallel computing, implemented in snowFT, as well as the first interface to L'Ecuyer's random number generator, the rlecuyer package. Currently, she works in the area of statistical demography, developing R packages used by the United Nations to project demographic indicators for all countries of the world.