Apache Spark in Python: Beginner's Guide

March 28th, 2017 in Python

You might already know Apache Spark as a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. It’s well known for its speed, ease of use, generality and ability to run virtually everywhere. And even though Spark is one of the most sought-after tools for data engineers, data scientists can also benefit from it when doing exploratory data analysis, feature extraction, supervised learning and model evaluation.

Today’s post will introduce you to some basic Spark in Python topics, based on nine of the most frequently asked questions.


A tutorial on Spark for Python would be my highest priority to attend.
03/30/17 4:44 PM |
Hi Dennis! Thanks for your comment :) A practical tutorial on Spark for Python will be in the making soon!
04/12/17 3:49 PM |
Would you be able to give a tutorial on using sparklyr with R and Spark? I'd be curious to know what the key differences are, and how easily you can integrate machine learning tools like h2o into it.
03/29/17 7:35 PM |
An excellent idea! So good in fact, that the course is being created right now. Be patient!
03/31/17 2:50 PM |
Avidly waiting for a course on Spark R. Any tentative date?
This will really help.
04/11/17 8:07 AM |
Hi, Karlijn, thanks for this very nice and useful post! Do you know if DataCamp plans to open a tutorial on Spark for Python?
03/29/17 10:35 AM |
Hi pdemeulenaer, thanks for the kind words! I can't speak for the course curriculum, but if the demand is there from you and other DataCamp users, I might write a tutorial on Spark for Python that can appear on the community! :)
03/29/17 12:23 PM |
You have a typo on the third bullet: 'and when you should which structure'. The word 'use' is missing
03/29/17 9:33 AM |
Another typo: 'But streaming data is nog the only performance consideration' - nog instead of not
And: 'By calling collect() on any RDD, you drags data back into'- drags instead of drag
And: 'When you have two daasets that are grouped by key' - daasets instead of datasets
03/29/17 9:57 AM |
Hi Steunenberg! Thanks for your feedback, I will fix this straight away!
03/29/17 11:59 AM |