Installation of PySpark (All operating systems)

This tutorial demonstrates how to install PySpark and how to manage the environment variables on Windows, Linux, and Mac operating systems.
Aug 2020  · 8 min read

Pyspark = Python + Apache Spark

Apache Spark is an open-source framework used in the big data industry for both real-time and batch processing. It supports several languages, including Python, Scala, Java, and R.

Apache Spark is written in Scala, a language that runs on the Java Virtual Machine (JVM). PySpark is the Python API for Spark; it relies on a library called Py4J, which lets Python code interact dynamically with JVM objects.

Windows Installation

This section covers the installation on the Windows operating system. It consists of installing Java and Apache Spark and setting the environment variables for each.

Python is the recommended prerequisite; install it before you begin.

Java installation

  1. Go to Download Java JDK.
    Visit Oracle's website to download the Java Development Kit (JDK).

  2. Go to the download section for Windows and pick the installer for your system; in my case, it's Windows Offline (64-bit). The installer file will be downloaded.

  3. Open the installer file to begin the installation.

  4. Open "Command Prompt" and type "java -version" to check whether Java is installed and which version you have.

  5. Add the Java path.

  6. Go to the search bar and search for "Edit the environment variables".
  7. Click "Environment Variables".
  8. Click "New" to create a new environment variable.
  9. Use "JAVA_HOME" as the Variable Name and 'C:\Program Files (x86)\Java\jdk1.8.0_251' as the Variable Value. This is the location of your Java folder. Click 'OK' when you're done.
  10. Under the User variables, select 'Path' and click 'New' to add an entry.
  11. Use 'PATH' as the Variable name and 'C:\Program Files (x86)\Java\jdk1.8.0_251\bin', the location of your Java bin folder, as the value. Click 'OK' when you're done.

Note: If you did not change the location during installation, you can find your Java folder on the C drive at 'C:\Program Files (x86)\Java\jdk1.8.0_251'.
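
As a complement to the "java -version" check in step 4, the sketch below runs the same command from Python and extracts the version string. It is a minimal example and not part of the original steps: the function names parse_java_version and installed_java_version are made up for illustration, and the regular expression assumes the usual version "..." banner that Oracle and OpenJDK builds print.

```python
import re
import shutil
import subprocess


def parse_java_version(banner):
    """Return the quoted version from a `java -version` banner, or None."""
    match = re.search(r'version "([^"]+)"', banner)
    return match.group(1) if match else None


def installed_java_version():
    """Run `java -version`; return the version string, or None if Java is not on PATH."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` prints its banner to stderr, so read both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_version(result.stderr + result.stdout)


if __name__ == "__main__":
    version = installed_java_version()
    print("Java version:", version if version else "not found on PATH")
```

If the script reports that Java is not found, the 'PATH' entry from step 11 is the first thing to re-check.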

Installing Pyspark

  1. Head over to the Spark homepage.

  2. Select the Spark release and package type, and download the .tgz file.

Create a new folder called 'spark' on the C drive and extract the downloaded file into it with a tool such as WinRAR. This folder is used later when setting the environment variables.

Download and setup winutils.exe

Go to the Winutils repository, choose the Hadoop version that matches the Spark package you downloaded, then open its 'bin' folder and download the winutils.exe file. The link for my Hadoop version is: https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe

Create a new folder called 'winutils' and, inside it, another folder called 'bin'. Then put the downloaded winutils.exe file inside 'bin'.
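
The folder layout described above can also be created with a few lines of Python. This is a sketch under assumptions: install_winutils is a hypothetical helper, the download path in the usage comment is a placeholder, and 'C:\winutils' is the root folder used for the environment variable in the next section.

```python
import shutil
from pathlib import Path


def install_winutils(downloaded_exe, root):
    """Create <root>/bin and copy the downloaded winutils.exe into it."""
    bin_dir = Path(root) / "bin"
    bin_dir.mkdir(parents=True, exist_ok=True)
    target = bin_dir / "winutils.exe"
    shutil.copy2(downloaded_exe, target)
    return target


# Example call on Windows (the download path is a placeholder, adjust it):
# install_winutils(r"C:\Users\me\Downloads\winutils.exe", r"C:\winutils")
```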

Environment variables

  1. Create a new environment variable with the variable name "HADOOP_HOME" and, as the variable value, the location of winutils, which is "C:\winutils". Click "OK".
  2. For Spark, create another environment variable with the variable name "SPARK_HOME" and, as the variable value, the location of Spark, which is "C:\spark". Click "OK".
  3. Finally, double-click 'Path', add the new entry "%SPARK_HOME%\bin", and click "OK".
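
Before moving on, it can help to confirm that the variables point at real folders. The sketch below is one way to do that from Python; find_problems and REQUIRED_VARIABLES are illustrative names, not part of Spark or of the original steps. Windows treats variable names case-insensitively, so the upper-case names used here match however the variables were typed.

```python
import os
from pathlib import Path

# The three variables created in the Windows steps of this tutorial.
REQUIRED_VARIABLES = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")


def find_problems(env):
    """Return a list of human-readable problems with the given environment mapping."""
    problems = []
    for name in REQUIRED_VARIABLES:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing folder: {value}")
    return problems


if __name__ == "__main__":
    # On Windows, os.environ looks names up case-insensitively,
    # so "hadoop_home" and "HADOOP_HOME" refer to the same variable.
    for line in find_problems(os.environ) or ["All three variables look fine"]:
        print(line)
```

Run it from a newly opened Command Prompt, since windows that were already open do not see the new variables.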

Finalizing Pyspark Installation

  1. Open Command Prompt and run the 'pyspark' command.
  2. If everything is set up correctly, the PySpark welcome message is displayed.

Linux Installation

This section covers the installation on the Linux operating system. It consists of installing Java and Apache Spark and setting the environment variables for each.

Python is the recommended prerequisite; install it before you begin.

Java Installation

  1. Go to Download Java JDK.
    Visit Oracle's website to download the Java Development Kit (JDK).

  2. Go to the download section for Linux and download the package that matches your system.
  3. Save the file to your local machine and click "OK".
  4. Open your terminal and confirm that the downloaded file is there using the 'ls' command.
  5. Install the downloaded Debian package of Java from the terminal.
  6. Finally, check your Java version using the 'java --version' command (on Java 8, use 'java -version' instead).
  7. To configure the environment variables, open a text editor such as 'gedit' from the terminal.
  8. Add the entry that specifies the 'Java' path.
  9. Finally, apply the change from the terminal.

Installing Spark

  1. Head over to the Spark homepage.
  2. Select the Spark release and package type, and download the .tgz file.
  3. Save the file to your local machine and click 'OK'.
  4. Open your terminal and go to the folder that contains the downloaded file.
  5. Extract the file using the 'tar' command, for example 'tar -xzf' followed by the name of the downloaded file.
  6. After extraction, a new folder is created; you can see it with the 'ls' command.
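
The extraction step can also be done from Python with the standard-library tarfile module, which does the same job as running 'tar -xzf' on the archive. The helper name extract_spark is hypothetical, and the archive name in the usage comment is the one that appears in the Mac section of this tutorial; substitute the file you actually downloaded.

```python
import tarfile
from pathlib import Path


def extract_spark(archive, dest):
    """Extract a .tgz archive into dest and return the top-level paths it created."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        # Collect the first path component of every member, e.g. the Spark folder name.
        top_level = sorted(
            {Path(m.name).parts[0] for m in tar.getmembers() if Path(m.name).parts}
        )
        tar.extractall(dest)
    return [dest / name for name in top_level]


# Example call (file name taken from the Mac section, adjust to your download):
# extract_spark("spark-2.4.6-bin-hadoop2.7.tgz", Path.home() / "spark")
```

Only extract archives that come from a source you trust, such as the official Spark download page.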

Configuring Environment Variable in Linux

  1. Open the '.bashrc' file in the vim editor with the command 'vim ~/.bashrc'.
  2. Add the paths that match your computer. In my case, these were the paths to my Spark location, Python, and Java. Then press 'Esc' and type ":wq" to save and exit vim.
  3. Save and exit to make the change final. After the file is reloaded (for example, in a new terminal), the 'pyspark' command can be run from any directory.
  4. Open PySpark using the 'pyspark' command, and the final message will be shown.
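
If the 'pyspark' command is not found, a quick check from Python can show whether the shell can actually locate the launcher. The sketch below uses the standard-library function shutil.which; find_launcher is an illustrative name, and the messages it prints are only suggestions.

```python
import os
import shutil


def find_launcher(name="pyspark", search_path=None):
    """Return the full path of `name` if it is on the search path, else None.

    With search_path=None, the PATH of the current environment is used.
    """
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    launcher = find_launcher()
    if launcher:
        print("pyspark found at", launcher)
    else:
        print("pyspark is not on PATH; check the entries added to ~/.bashrc")
    print("SPARK_HOME =", os.environ.get("SPARK_HOME"))
```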

Mac Installation

This section covers the installation on the Mac operating system. It consists of installing Java and Apache Spark and setting the environment variables for each.

Python is the recommended prerequisite; install it before you begin.

Java Installation

  1. Go to Download Java JDK.
    Visit Oracle's website for the download of the Java Development Kit (JDK).

  2. Go to the download section for macOS and download the package that matches your system.
  3. Confirm the Java installation by running 'java -version' in the Terminal.

Installing Apache Spark

  1. Head over to the Spark homepage.
  2. Select the Spark release and package type, and download the .tgz file.
  3. Save the file to your local machine and click 'OK'.
  4. Let's extract the file using the following command.
    $ tar -xzf spark-2.4.6-bin-hadoop2.7.tgz

Configuring Environment Variable for Apache Spark and Python

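Open the ~/.bashrc or ~/.zshrc file, depending on the shell your Mac uses (zsh is the default from macOS Catalina onward), and add the following lines. Adjust SPARK_HOME to the folder where you extracted Spark.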

export SPARK_HOME="/Downloads/spark"
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3

Open PySpark using the 'pyspark' command, and the final message will be shown.
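
If the shell does not pick up the new settings, the three export lines can be checked from Python. The sketch below tests whether the 'bin' folder under SPARK_HOME is one of the PATH entries and prints the value of PYSPARK_PYTHON; spark_bin_on_path is a hypothetical helper written for this illustration.

```python
import os


def spark_bin_on_path(spark_home, path_value, sep=os.pathsep):
    """Return True if <spark_home>/bin is one of the entries in a PATH-style string."""
    wanted = os.path.normpath(os.path.join(spark_home, "bin"))
    entries = [os.path.normpath(p) for p in path_value.split(sep) if p]
    return wanted in entries


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME", "")
    path_value = os.environ.get("PATH", "")
    print("SPARK_HOME:", spark_home or "(not set)")
    print("bin on PATH:", bool(spark_home) and spark_bin_on_path(spark_home, path_value))
    print("PYSPARK_PYTHON:", os.environ.get("PYSPARK_PYTHON", "(not set)"))
```

Run it in a new Terminal window so that the edited ~/.bashrc or ~/.zshrc file has been read.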

Congratulations

Congratulations, you have made it to the end of this tutorial!

In this tutorial, you've learned how to install PySpark, starting with the installation of Java and Apache Spark and then managing the environment variables on Windows, Linux, and Mac operating systems.

If you would like to learn more about Pyspark, take DataCamp's Introduction to Pyspark.

Check out our Apache Spark Tutorial: ML with PySpark.
