Apache Spark in Python: Beginner's Guide

Karlijn Willems
March 28, 2017 (30 min read)
A beginner's guide to Spark in Python based on 9 popular questions, such as how to install PySpark in Jupyter Notebook, best practices,...

You might already know Apache Spark as a fast, general-purpose engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. It’s well known for its speed, ease of use, generality and ability to run virtually everywhere. And even though Spark is one of the most asked-for tools among data engineers, data scientists can also benefit from it when doing exploratory data analysis, feature extraction, supervised learning and model evaluation.

Today’s post will introduce you to some basic Spark in Python topics, based on 9 of the most frequently asked questions, such as how to install PySpark in Jupyter Notebook and which best practices to follow.

If you are interested in learning more about PySpark, consider taking DataCamp’s Introduction to PySpark course.

Check out our Apache Spark Tutorial: ML with PySpark.