
Course Description
Python is well established as a major platform for data analysis and data science. For many data scientists, the biggest limitation of Python is that all data must fit into the memory of a single workstation; traditionally, Python has also been able to use only one CPU. Data scientists constantly ask, "How can I read and process large amounts of data?" and "How can I make use of more computational resources?" This course introduces Dask, a flexible parallel computing library for analytic computing. With Dask, you can take the Python workflows you currently have and easily scale them up to large datasets on your workstation, without migrating to a distributed computing environment.
Chapter 1: Working with Big Data
In this chapter you'll learn how to leverage traditional Python techniques for reading and processing large datasets stored either in a single file or across multiple files. You'll then learn how the Dask library can execute a pipeline of Python functions in parallel, with the added goal of processing large amounts of data on modest computational resources. For this course, the dataset sizes have been reduced so that the exercises can be completed rapidly; many of these datasets were originally several gigabytes in size.
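The chunk-at-a-time reading technique this chapter covers can be sketched with plain pandas. The CSV contents and the DEP_DELAY column below are invented for illustration, not the course's actual data:

```python
import io
import pandas as pd

# Stand-in for a large flight-delay CSV; DEP_DELAY is a hypothetical column.
csv_data = io.StringIO(
    "FLIGHT,DEP_DELAY\n"
    "A1,5\nA2,-3\nA3,42\nA4,0\nA5,17\nA6,-1\n"
)

n_delayed = 0
n_total = 0
# Read the file a chunk at a time instead of loading it all into memory;
# each chunk is an ordinary DataFrame.
for chunk in pd.read_csv(csv_data, chunksize=2):
    n_delayed += (chunk["DEP_DELAY"] > 0).sum()
    n_total += len(chunk)

print(n_delayed / n_total)  # fraction of delayed flights: 0.5
```

Because only one chunk is resident at a time, the same loop works unchanged on files far larger than memory.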
- Understanding Computer Storage & Big Data
- How big is my DataFrame?
- NumPy transformations
- Thinking about Data in Chunks
- Filtering WDI data in chunks
- Concatenating & plotting WDI data
- Managing Data with Generators
- Computing percentage of delayed flights
- Generating & plotting delayed flights
- Delaying Computation with Dask
- Building a pipeline with delayed
- Computing pipelined results
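The delayed-pipeline idea in the lessons above can be sketched with dask.delayed. The functions load, clean, and total are hypothetical stand-ins for real pipeline stages:

```python
from dask import delayed

# Decorating each stage makes calls lazy: they build a task graph
# instead of executing immediately.
@delayed
def load(n):
    return list(range(n))

@delayed
def clean(data):
    return [v for v in data if v % 2 == 0]

@delayed
def total(data):
    return sum(data)

# Compose the pipeline; nothing has run yet.
result = total(clean(load(10)))

# Execute the whole graph at once, potentially in parallel.
print(result.compute())  # 0 + 2 + 4 + 6 + 8 = 20
```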
Chapter 2: Working with Dask Arrays
In this chapter we'll explore how dask.array can read multiple data sources and perform computations on them as a single data array. We'll learn some advanced uses of NumPy arrays for high-dimensional data that also work on Dask arrays. Finally, we'll examine climate patterns in the US from monthly weather data.

- Chunking Arrays in Dask
- Inspecting a Dask array
- Chunking a NumPy array
- Timing Dask array computations
- Computing with Multidimensional Arrays
- Predicting result of broadcasting
- Subtracting & broadcasting
- Computing aggregations
- Analyzing Weather Data
- Reading the data
- Stacking data & reading climatology
- Transforming, aggregating, and plotting
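The chunking idea behind Dask arrays can be sketched in a few lines; the array contents here are arbitrary:

```python
import numpy as np
import dask.array as da

# Wrap a NumPy array as a Dask array split into chunks of 1000 elements;
# each chunk can be processed independently, on a separate core.
x = np.arange(4000)
dx = da.from_array(x, chunks=1000)

print(dx.chunks)  # ((1000, 1000, 1000, 1000),)

# Aggregations are computed per chunk and then combined.
mean = dx.mean().compute()
print(mean)  # 1999.5
```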
Chapter 3: Working with Dask DataFrames
The Dask DataFrame is built on the pandas DataFrame. Dask lets you scale your pandas workflows to large datasets stored either in a single file or across multiple files. In this chapter you'll learn how to build a pipeline of delayed computations with the Dask DataFrame, and you'll use these skills to study how much NYC taxi riders tip their drivers.
- Using Dask DataFrames
- Inspecting a large DataFrame
- Building a pipeline of delayed tasks
- Grouping & aggregating by year
- Timing Dask DataFrame Operations
- Preparing the pipeline
- Comparing Dask & pandas execution times
- Analyzing NYC Taxi Rides
- Reading & cleaning files
- Filtering & grouping data
- Computing & plotting
Chapter 4: Working with Dask Bags for Unstructured Data
Datasets that have not been standardized and provided as CSV files can be challenging to work with. In this chapter you'll use the Dask Bag to read raw text files and perform simple text-processing workflows over large datasets in parallel. Conceptually, a Dask Bag is a parallel list that can store any Python datatype, with convenient functions that map over all of its elements.
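The map/filter style of Bag processing can be sketched with an in-memory sequence; in the course you would build the bag from files with db.read_text instead, and these example strings are made up:

```python
import dask.bag as db

# A bag built from a list standing in for lines of raw text files.
lines = db.from_sequence([
    "the quick brown fox",
    "jumped over the lazy dog",
    "the end",
])

# map and filter operate lazily over every element, in parallel.
word_counts = lines.map(str.split).map(len)
n_the = lines.filter(lambda s: "the" in s).count()

print(word_counts.compute())  # [4, 5, 2]
print(n_the.compute())        # 3
```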
- Building Dask Bags & Globbing
- Inspecting Dask Bags
- Reading & counting
- Taking one element
- Functional Approaches using Dask Bags
- What is the preferred way to convert to uppercase?
- Splitting by word & count
- Filtering on a phrase
- Analyzing Congressional Legislation
- Loading & mapping from JSON
- Filtering vetoed bills
- Computing the average bill's lifespan
Chapter 5: Case Study - Analyzing Flight Delays
Now that you've learned how to use Dask to read and process large datasets in parallel, you'll put these skills together to search for correlations between flight delays and reported weather events at selected airports. You'll read files in multiple directories containing flight statistics for selected airports from the Bureau of Transportation Statistics, and merge them with daily weather data from wunderground.com into a single Dask DataFrame.
- Preparing Flight Delay Data
- Delaying reading & cleaning
- Reading all flight data
- Preparing Weather Data
- Deferring reading weather data
- Building a weather DataFrame
- Which city gets the most snow?
- Merging & Persisting DataFrames
- Persisting merged DataFrame
- Finding sources of weather delays
- Final thoughts
Datasets

- Congressional bills
- Flight delays
- NYC taxi rides
- Presidents (JSON)
- State of the Union addresses
- Texas electricity consumption (HDF5)
- WDI (World Development Indicators)
- Weather

Prerequisites

- Data Manipulation with pandas