Installation of PySpark (All operating systems)

This tutorial demonstrates how to install PySpark and how to manage the environment variables on Windows, Linux, and Mac operating systems.
Aug 2020  · 8 min read

PySpark = Python + Apache Spark

Apache Spark is an open-source framework used in the big data industry for both real-time and batch processing. It supports several languages, including Python, Scala, Java, and R.

Apache Spark was originally written in Scala, a language that runs on the Java Virtual Machine (JVM). PySpark is the Python API for Spark; it relies on a library called Py4J, which allows Python code to interact dynamically with JVM objects.

Windows Installation

This section covers installation on the Windows operating system. It consists of installing Java and Apache Spark and setting the environment variables for each.

Python is a recommended prerequisite and should be installed first.

Java installation

  1. Go to Download Java JDK.
    Visit Oracle's website to download the Java Development Kit (JDK).

  2. Move to the download section for the Windows operating system; in my case, it's Windows Offline (64-bit). The installer file will be downloaded.

  3. Open the installer file, and the installation begins.

  4. Go to "Command Prompt" and type "java -version" to check whether Java is installed and which version you have.

  5. Add the Java path to the environment variables, as described in the remaining steps.

  6. Go to the search bar and open "Edit the environment variables".
  7. Click "Environment Variables".
  8. Click "New" to create a new environment variable.
  9. Use "JAVA_HOME" as the Variable Name and 'C:\Program Files (x86)\Java\jdk1.8.0_251' as the Variable Value. This is the location of your Java installation. Click 'OK' when you have finished.
  10. Under the User variables, select 'Path' and click 'New' to add an entry.
  11. Add 'C:\Program Files (x86)\Java\jdk1.8.0_251\bin', which is the location of your Java bin folder, as the new 'PATH' value. Click 'OK' when you have finished.

Note: You can locate your Java folder on the C drive; it is 'C:\Program Files (x86)\Java\jdk1.8.0_251' if you did not change the location during installation.
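The Java checks described above can also be scripted. The following is a minimal sketch, not part of the original tutorial, that reports whether JAVA_HOME is set, whether it points to an existing folder, and whether a 'java' executable can be found on the PATH. The function name check_java and the layout of the returned dictionary are hypothetical and chosen only for illustration.

```python
import os
import shutil


def check_java(env=None):
    """Report whether JAVA_HOME is set and whether 'java' is found on PATH.

    `env` is a mapping of environment variables; it defaults to os.environ.
    The function name and the returned dictionary layout are illustrative only.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    # shutil.which searches the directories listed in `path` for the executable
    # and returns None when nothing is found.
    java_exe = shutil.which("java", path=env.get("PATH", ""))
    return {
        "java_home": java_home,
        "java_home_exists": bool(java_home) and os.path.isdir(java_home),
        "java_on_path": java_exe,
    }
```

Calling check_java() with no arguments inspects the current environment. Open a new Command Prompt after editing the variables, because a prompt that was already open does not pick up the new values.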

Installing PySpark

  1. Head over to the Spark homepage.

  2. Select the Spark release and package type, then download the .tgz file.

Create a new folder called 'spark' on the C drive and extract the downloaded file into it using 'WinRAR'. This location is used later when setting the environment variables.

Download and setup winutils.exe

Go to the Winutils repository, choose the Hadoop version that matches your previously downloaded package, then open its 'bin' folder and download the winutils.exe file. The link to my Hadoop version is: https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe

Make a new folder called 'winutils' and, inside it, create another folder called 'bin'. Then put the recently downloaded 'winutils.exe' file inside 'bin'.

Environment variables

  1. Create a new environment variable with the variable name "hadoop_home" and, as the variable value, the location of winutils, which is "C:\winutils". Click "OK".
  2. For Spark, create another environment variable with the variable name "Spark_home" and, as the variable value, the location of Spark, which is "C:\spark". Click "OK".
  3. Finally, double-click the 'Path' variable, add a new entry "%Spark_Home%\bin", and click "OK".
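A quick way to confirm that the variables point at the right folders is a small script. The sketch below is an illustration, not part of the original tutorial. It assumes the layout described above: winutils.exe inside the 'bin' folder under the Hadoop location, and the Spark 'bin' folder listed on the PATH. The function name check_spark_env is hypothetical. The variable names are written in uppercase because Windows treats them case-insensitively and Python's os.environ reports them in uppercase on Windows.

```python
import os


def check_spark_env(env=None):
    """Return a list of problems found in the Spark-related environment variables.

    Assumes winutils.exe lives in HADOOP_HOME/bin and that SPARK_HOME/bin
    should appear on PATH. An empty list means no problem was detected.
    The function name is illustrative only.
    """
    env = os.environ if env is None else env
    problems = []

    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        problems.append("HADOOP_HOME is not set")
    elif not os.path.isfile(os.path.join(hadoop_home, "bin", "winutils.exe")):
        problems.append("winutils.exe not found in HADOOP_HOME/bin")

    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    else:
        # Normalise both sides so that slash style and letter case (on Windows)
        # do not cause a false mismatch.
        spark_bin = os.path.normcase(os.path.normpath(os.path.join(spark_home, "bin")))
        path_entries = [
            os.path.normcase(os.path.normpath(entry))
            for entry in env.get("PATH", "").split(os.pathsep)
            if entry
        ]
        if spark_bin not in path_entries:
            problems.append("SPARK_HOME/bin is not on PATH")

    return problems
```

Running check_spark_env() in a freshly opened Command Prompt and getting an empty list back suggests the three steps above were applied as intended.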

Finalizing PySpark Installation

  1. Open Command Prompt and type the 'pyspark' command.
  2. If everything is set up correctly, the PySpark shell starts and prints its welcome message.

Linux Installation

This section covers installation on the Linux operating system. It consists of installing Java and Apache Spark and setting the environment variables for each.

Python is a recommended prerequisite and should be installed first.

Java Installation

  1. Go to Download Java JDK.
    Visit Oracle's website for the download of the Java Development Kit (JDK).

  2. Move to the download section for the Linux operating system and download the package that matches your system requirements.
  3. Click "OK" to save the file to your local machine.
  4. Go to your terminal and check for the recently downloaded file using the 'ls' command.
  5. Install the recently downloaded Debian package of Java from the terminal.
  6. Check your Java version using the 'java -version' command ('java --version' also works on JDK 9 and later).
  7. To configure the environment variables, open the 'gedit' text editor from the terminal.
  8. Add the lines that specify the path to your Java installation.
  9. Finally, apply the change from the terminal so that it takes effect.

Installing Spark

  1. Head over to the Spark homepage.
  2. Select the Spark release and package type, then download the .tgz file.
  3. Click 'OK' to save the file to your local machine.
  4. Open your terminal and go to the directory that contains the recently downloaded file.
  5. Extract the file using the 'tar -xzf' command followed by the name of the .tgz file.
  6. After extraction, a new folder is created; you can confirm it with the list ('ls') command.

Configuring Environment Variable in Linux

  1. Open the 'bashrc' file in the vim editor with the command 'vim ~/.bashrc'.
  2. Add the Spark location, Python path, and Java path according to where they are on your computer. Then press 'Esc' and type ":wq" to save and exit vim.
  3. After saving and exiting, apply the change. The 'pyspark' command is then accessible from any directory.
  4. Start PySpark with the 'pyspark' command; the shell prints its welcome message on startup.
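The exact lines to add depend on where Spark and Java live on your machine. As a rough illustration, the sketch below builds a set of export lines from the paths you pass in. It is an assumption-laden example rather than the author's exact configuration: the function name bashrc_exports is hypothetical, the variable names follow the ones used elsewhere in this tutorial (JAVA_HOME, SPARK_HOME, PATH, PYSPARK_PYTHON), and the paths in the usage note are placeholders.

```python
def bashrc_exports(spark_home, java_home, python="python3"):
    """Build export lines that could be appended to ~/.bashrc.

    All arguments are placeholders; pass the paths used on your own machine.
    The function name is illustrative only.
    """
    return "\n".join([
        f'export JAVA_HOME="{java_home}"',
        f'export SPARK_HOME="{spark_home}"',
        # Put the Spark and Java launchers ahead of the existing PATH entries.
        'export PATH="$SPARK_HOME/bin:$JAVA_HOME/bin:$PATH"',
        f"export PYSPARK_PYTHON={python}",
    ])
```

For example, print(bashrc_exports("/opt/spark", "/usr/lib/jvm/java")) prints four lines for two placeholder paths. After adding lines like these to ~/.bashrc, running 'source ~/.bashrc' or opening a new terminal applies them.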

Mac Installation

This section covers installation on the Mac operating system. It consists of installing Java and Apache Spark and setting the environment variables for each.

Python is a recommended prerequisite and should be installed first.

Java Installation

  1. Go to Download Java JDK.
    Visit Oracle's website for the download of the Java Development Kit (JDK).

  2. Move to the download section for the macOS operating system and download the package that matches your system requirements.
  3. Confirm the Java installation by running 'java -version' in the Terminal.

Installing Apache Spark

  1. Head over to the Spark homepage.
  2. Select the Spark release and package type, then download the .tgz file.
  3. Click 'OK' to save the file to your local machine.
  4. Extract the file using the following command.
    $ tar -xzf spark-2.4.6-bin-hadoop2.7.tgz
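The same extraction can be done from Python with the standard library's tarfile module, which may be convenient where the 'tar' command is not available. This is a minimal sketch, not part of the original tutorial. The function name extract_archive is hypothetical, and the archive path and destination folder are whatever you choose to pass in.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive (.tgz) into dest_dir.

    Returns the sorted top-level names found in the archive.
    The function name is illustrative only.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        top_level = sorted({name.split("/")[0] for name in archive.getnames()})
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer extraction filters; "data" blocks
            # unsafe members such as paths that escape the destination.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return top_level
```

For the file used in this section, the call would look like extract_archive("spark-2.4.6-bin-hadoop2.7.tgz", "."), run from the folder that contains the download.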

Configuring Environment Variable for Apache Spark and Python

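Open the ~/.bashrc or ~/.zshrc file, depending on which shell your version of macOS uses (zsh is the default from macOS Catalina onward), and add the following lines, adjusting the Spark path to match your machine.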

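# Assumes Spark was extracted to (or renamed as) the 'spark' folder in your Downloads directory; adjust as needed
export SPARK_HOME="$HOME/Downloads/spark"
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3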

Start PySpark with the 'pyspark' command; the shell prints its welcome message on startup.

Congratulations

Congratulations, you have made it to the end of this tutorial!

In this tutorial, you've learned how to install PySpark, starting with the installation of Java and Apache Spark, and how to manage the environment variables on Windows, Linux, and Mac operating systems.

If you would like to learn more about PySpark, take DataCamp's Introduction to PySpark.

Check out our Apache Spark Tutorial: ML with PySpark.
