
Course Description
Python is well established as a major platform for data analysis and data science. For many data scientists, the biggest limitation of Python is that all data must fit into the memory of a single workstation; traditionally, Python has also been able to use only one CPU. Data scientists constantly ask, "How can I read and process large amounts of data?" and "How can I make use of more computational resources?" This course introduces Dask, a flexible parallel computing library for analytic computing. With Dask, you can take the Python workflows you currently have and easily scale them up to large datasets on your workstation, without migrating to a distributed computing environment.
Chapter 1: Working with Big Data
In this chapter you'll learn how to leverage traditional Python techniques for reading and processing large datasets stored either in a single file or across multiple files. You'll then learn how the Dask library can execute a pipeline of Python functions in parallel, with the added goal of processing large amounts of data on modest computational resources. For this course, the dataset sizes have been reduced so that the exercises can be completed rapidly; many of these datasets were originally several gigabytes in size.
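The chunk-at-a-time reading technique this chapter covers can be sketched with plain pandas. The CSV contents and the DEP_DELAY column below are invented for illustration, not the course's actual data:

```python
import io
import pandas as pd

# Stand-in for a large flight-delay CSV; DEP_DELAY is a hypothetical column.
csv_data = io.StringIO(
    "FLIGHT,DEP_DELAY\n"
    "A1,5\nA2,-3\nA3,42\nA4,0\nA5,17\nA6,-1\n"
)

n_delayed = 0
n_total = 0
# Read the file a chunk at a time instead of loading it all into memory;
# each chunk is an ordinary DataFrame.
for chunk in pd.read_csv(csv_data, chunksize=2):
    n_delayed += (chunk["DEP_DELAY"] > 0).sum()
    n_total += len(chunk)

print(n_delayed / n_total)  # fraction of delayed flights: 0.5
```

Because only one chunk is resident at a time, the same loop works unchanged on files far larger than memory.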
- Understanding Computer Storage & Big Data
- How big is my DataFrame?
- NumPy transformations
- Thinking about Data in Chunks
- Filtering WDI data in chunks
- Concatenating & plotting WDI data
- Managing Data with Generators
- Computing percentage of delayed flights
- Generating & plotting delayed flights
- Delaying Computation with Dask
- Building a pipeline with delayed
- Computing pipelined results
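The delayed-pipeline idea in the lessons above can be sketched with dask.delayed. The functions load, clean, and total are hypothetical stand-ins for real pipeline stages:

```python
from dask import delayed

# Decorating each stage makes calls lazy: they build a task graph
# instead of executing immediately.
@delayed
def load(n):
    return list(range(n))

@delayed
def clean(data):
    return [v for v in data if v % 2 == 0]

@delayed
def total(data):
    return sum(data)

# Compose the pipeline; nothing has run yet.
result = total(clean(load(10)))

# Execute the whole graph at once, potentially in parallel.
print(result.compute())  # 0 + 2 + 4 + 6 + 8 = 20
```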
Chapter 2: Working with Dask Arrays
In this chapter we'll explore how dask.array can read multiple data sources and perform computations on them as a single data array. We'll learn some advanced uses of NumPy arrays for high-dimensional data that also work on Dask arrays. Finally, we'll examine climate patterns in the US from monthly weather data.

- Chunking Arrays in Dask
- Inspecting a Dask array
- Chunking a NumPy array
- Timing Dask array computations
- Computing with Multidimensional Arrays
- Predicting result of broadcasting
- Subtracting & broadcasting
- Computing aggregations
- Analyzing Weather Data
- Reading the data
- Stacking data & reading climatology
- Transforming, aggregating, and plotting
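The chunking idea behind Dask arrays can be sketched in a few lines; the array contents here are arbitrary:

```python
import numpy as np
import dask.array as da

# Wrap a NumPy array as a Dask array split into chunks of 1000 elements;
# each chunk can be processed independently, on a separate core.
x = np.arange(4000)
dx = da.from_array(x, chunks=1000)

print(dx.chunks)  # ((1000, 1000, 1000, 1000),)

# Aggregations are computed per chunk and then combined.
mean = dx.mean().compute()
print(mean)  # 1999.5
```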
Chapter 3: Working with Dask DataFrames
The Dask DataFrame is built on the pandas DataFrame. Dask lets you scale your pandas workflows to large datasets stored either in a single file or across multiple files. In this chapter you'll learn how to build a pipeline of delayed computations with the Dask DataFrame, and you'll use these skills to study how much NYC taxi riders tip their drivers.
- Using Dask DataFrames
- Inspecting a large DataFrame
- Building a pipeline of delayed tasks
- Grouping & aggregating by year
- Timing Dask DataFrame Operations
- Preparing the pipeline
- Comparing Dask & pandas execution times
- Analyzing NYC Taxi Rides
- Reading & cleaning files
- Filtering & grouping data
- Computing & plotting
Chapter 4: Working with Dask Bags for Unstructured Data
Datasets that have not been standardized and provided as CSV files can be challenging to work with. In this chapter you'll use the Dask Bag to read raw text files and perform simple text-processing workflows over large datasets in parallel. Conceptually, a Dask Bag is a parallel list that can store any Python datatype, with convenient functions that map over all of its elements.
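The map/filter style of Bag processing can be sketched with an in-memory sequence; in the course you would build the bag from files with db.read_text instead, and these example strings are made up:

```python
import dask.bag as db

# A bag built from a list standing in for lines of raw text files.
lines = db.from_sequence([
    "the quick brown fox",
    "jumped over the lazy dog",
    "the end",
])

# map and filter operate lazily over every element, in parallel.
word_counts = lines.map(str.split).map(len)
n_the = lines.filter(lambda s: "the" in s).count()

print(word_counts.compute())  # [4, 5, 2]
print(n_the.compute())        # 3
```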
- Building Dask Bags & Globbing
- Inspecting Dask Bags
- Reading & counting
- Taking one element
- Functional Approaches using Dask Bags
- What is the preferred way to convert to uppercase?
- Splitting by word & count
- Filtering on a phrase
- Analyzing Congressional Legislation
- Loading & mapping from JSON
- Filtering vetoed bills
- Computing the average bill's lifespan
Chapter 5: Case Study - Analyzing Flight Delays
Now that you've learned how to use Dask to read and process large datasets in parallel, you'll put these skills together to search for correlations between flight delays and reported weather events at selected airports. You'll read files in multiple directories containing flight statistics for selected airports from the Bureau of Transportation Statistics, and merge them with daily weather data from wunderground.com into a single Dask DataFrame.
- Preparing Flight Delay Data
- Delaying reading & cleaning
- Reading all flight data
- Preparing Weather Data
- Deferring reading weather data
- Building a weather DataFrame
- Which city gets the most snow?
- Merging & Persisting DataFrames
- Persisting merged DataFrame
- Finding sources of weather delays
- Final thoughts
Datasets

- Congressional bills
- Flight delays
- NYC taxi rides
- Presidents (JSON)
- State of the Union addresses
- Texas electricity consumption (HDF5)
- WDI (World Development Indicators)
- Weather

Prerequisites

- Data Manipulation with pandas