PySpark Cheat Sheet: Spark DataFrames in Python

June 15th, 2017 in Python

You probably already know about Apache Spark, the fast, general, open-source engine for big data processing; it has built-in modules for streaming, SQL, machine learning, and graph processing. According to the Spark project, analytic applications can run up to 100 times faster than on Hadoop MapReduce when the data is processed in memory. Interfacing Spark with Python is easy with PySpark: this Spark Python API exposes the Spark programming model to Python.

The PySpark Basics cheat sheet already showed you how to work with the most basic building blocks, RDDs.

Now it's time to tackle the Spark SQL module, which is meant for structured data processing, and the DataFrame API, which is available not only in Python but also in Scala, Java, and R. If you want to know more about the differences between RDDs, DataFrames, and Datasets, consider taking a look at Apache Spark in Python: Beginner's Guide.

Without further ado, here's the cheat sheet:

This PySpark SQL cheat sheet covers the basics of working with Apache Spark DataFrames in Python: from initializing the SparkSession to creating DataFrames, inspecting the data, handling duplicate values, querying, adding, updating or removing columns, and grouping, filtering or sorting data. The cheat sheet also shows how to run SQL queries programmatically, how to save your data to Parquet and JSON files, and how to stop your SparkSession.

Make sure to check out our other Python cheat sheets for data science, which cover topics such as Python basics, NumPy, Pandas, Pandas Data Wrangling, and much more!

Comments

disha-b-shah
This cheat sheet is fantastic and exactly what I need. I would love to see a course on working with pySpark Dataframes in Python.
07/09/17 5:56 PM |
karlijn
Hi disha-b-shah, that's great to hear! We're currently working on adding courses on PySpark to the curriculum, so stay tuned!! :)
07/26/17 7:46 AM |
pk3291
Great post, thanks! But you should also provide the data sets used in the cheat sheet. That would make it very easy to follow!
06/21/17 4:22 PM |
karlijn
Hi pk3291! Thanks for the feedback! I agree with you and I tried to fix it by printing out some intermediary results, but I understand that this is not enough; I'm working hard on getting a Jupyter Notebook out that should give some additional information about the data.
06/23/17 9:39 AM |