Parallel Programming with Dask in Python

Learn how to take the Python workflows you currently have and easily scale them up to large datasets without the need for distributed computing environments.
Start Course for Free
4 Hours17 Videos58 Exercises12,916 Learners
4650 XP

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA. You confirm you are at least 16 years old (13 if you are an authorized Classrooms user).

Loved by learners at thousands of companies

Course Description

Python is now well established as a major platform for data analysis and data science. For many data scientists, the largest limitation of Python is that all data must fit into the resident memory of the available workstation. Further, traditionally, Python has only been able to utilize one CPU. Data scientists constantly ask, "How can I read and process large amounts of data?" and "How can I make use of more computational processing resources?" This course will introduce you to Dask, a flexible parallel computing library for analytic computing. With Dask, you will be able to take the Python workflows you currently have and easily scale them up to large datasets on your workstation without the need to migrate to a distributed computing environment.

  1. 1

    Working with Big Data

    In this chapter you'll learn how to leverage traditional Python techniques for reading and processing large datasets stored in either a single file or in multiple files. Finally, you'll learn how the Dask library can be used to execute a pipeline of Python functions in parallel with the added goal of being able to process large amounts of data on modest computational resources. For this course, the data set sizes have been reduced so that the exercises can be completed rapidly. Many of these data sets were originally several Gigabytes in size.
    Play Chapter Now
  2. 2

    Working with Dask Arrays

    In this chapter we'll explore how we can use `dask.array` to read multiple data sources and perform computations with them as a single data array. We'll learn some advanced uses of NumPy arrays when dealing with high dimensional data that also work on Dask arrays. Finally, we'll examine climate patterns in the US from monthly weather data in the US.
    Play Chapter Now
  3. 3

    Working with Dask DataFrames

    The Dask DataFrame is built upon the Pandas DataFrame. Dask provides the ability to scale your Pandas workflows to large data sets stored in either a single file or separated across multiple files. In this chapter you'll learn how to build a pipeline of delayed computation with Dask DataFrame, and you'll use these skills to study how much NYC taxi riders tip their drivers.
    Play Chapter Now
  4. 4

    Working with Dask Bags for Unstructured Data

    Datasets that have not already been standardized and provided as CSV files can be challenging to work with. In this chapter you'll use the Dask Bag to read raw text files and perform simple text processing workflows over large datasets in parallel. Conceptually, the Dask Bag is a parallel list that can store any Python datatype with convenient functions that map over all of the elements.
    Play Chapter Now
  5. 5

    Case Study: Analyzing Flight Delays

    Now that you've learned how to utilize Dask to read and process large data sets in parallel, you'll put these skills together to search for correlations between flight delays and reported weather events at selected airports. You'll read files in multiple directories containing flight statistics for selected airports from the Bureau of Transportation Statistics and merge them with daily weather data from into a single Dask DataFrame.
    Play Chapter Now
In the following tracks
Python Programmer
Team Anaconda Headshot

Team Anaconda

Data Science Training
This course was created in collaboration with Anaconda. With over 6 million users, the open source Anaconda Distribution is the fastest and easiest way to do Python data science and machine learning. It's the industry standard for developing, testing, and training on a single machine.
See More

What do other learners have to say?

I've used other sites—Coursera, Udacity, things like that—but DataCamp's been the one that I've stuck with.

Devon Edwards Joseph
Lloyds Banking Group

DataCamp is the top resource I recommend for learning data science.

Louis Maiden
Harvard Business School

DataCamp is by far my favorite website to learn from.

Ronald Bowers
Decision Science Analytics, USAA