
Installation of PySpark (All operating systems)

This tutorial demonstrates how to install PySpark and how to manage the environment variables on Windows, Linux, and macOS.
Aug 2020  · 8 min read

PySpark = Python + Apache Spark

Apache Spark is an open-source framework used in the big data industry for both real-time and batch processing. It supports several languages, including Python, Scala, Java, and R.

Apache Spark is written in Scala, a language that runs on the Java Virtual Machine (JVM). PySpark is Spark's Python API; it relies on a library called Py4J, which lets Python code interact dynamically with JVM objects. This is why Java has to be installed before Spark.

Windows Installation

This section covers installation on Windows. It consists of installing Java and Apache Spark and setting the environment variables for each.

Python is a recommended prerequisite and should be installed first.

Java installation

  1. Go to Download Java JDK.
    Visit Oracle's website to download the Java Development Kit (JDK).

  2. Go to the download section for Windows. In my case, it's Windows Offline (64-bit). The installer file will be downloaded.

  3. Open the installer file, and the installation begins.

  4. Open "Command Prompt" and type "java -version" to check whether Java is installed and which version you have.

  5. Add the Java path.

  6. Go to the search bar and type "Edit the environment variables".
  7. Click "Environment Variables".
  8. Click "New" to create a new environment variable.
  9. Use "JAVA_HOME" as the Variable Name and 'C:\Program Files (x86)\Java\jdk1.8.0_251' as the Variable Value. This is the location of your Java installation. Click 'OK' when you've finished.
  10. Under the user variables, select 'Path' and click 'New' to add an entry.
  11. For the variable 'PATH', add the value 'C:\Program Files (x86)\Java\jdk1.8.0_251\bin', which is the location of the Java bin folder. Click 'OK' when you've finished.

Note: You can locate your Java installation on the C drive at 'C:\Program Files (x86)\Java\jdk1.8.0_251' if you did not change the location during installation.
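
The JAVA_HOME and Path settings described above can be sanity-checked from Python. The following is a minimal sketch rather than part of the original setup: the helper name `java_env_problems` is hypothetical, and it only inspects a mapping of environment variables that you pass in.

```python
import ntpath


def java_env_problems(env):
    """Return a list of human-readable problems with the Java settings in env.

    env is any mapping of environment variable names to values,
    for example os.environ. The function name is illustrative only.
    """
    problems = []
    java_home = env.get("JAVA_HOME", "")
    if not java_home:
        problems.append("JAVA_HOME is not set")
        return problems

    # Windows separates Path entries with ';' and ignores case in paths.
    expected_bin = ntpath.normcase(ntpath.join(java_home, "bin"))
    path_value = env.get("PATH", env.get("Path", ""))
    entries = [ntpath.normcase(p.rstrip("\\")) for p in path_value.split(";") if p]
    if expected_bin not in entries:
        problems.append("Path does not contain " + ntpath.join(java_home, "bin"))
    return problems
```

Calling `java_env_problems(os.environ)` in a freshly opened Command Prompt session should return an empty list once both variables are set; already-open windows keep the old values.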

Installing PySpark

  1. Head over to the Spark homepage.

  2. Select the Spark release and package type as follows and download the .tgz file.

[Screenshots: Spark release and package type selection]

Create a new folder called 'spark' on the C drive and extract the downloaded file into it with a tool such as WinRAR. The later steps assume that Spark is located at C:\spark.

Download and setup winutils.exe

Go to the Winutils repository, choose the Hadoop version that matches the Spark package you downloaded, then download the winutils.exe file from its 'bin' folder. The link for my Hadoop version is: https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe

Make a new folder called 'winutils' and, inside it, create another folder called 'bin'. Then put the downloaded winutils.exe file inside 'bin'.
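
The folder layout described above can be expressed in a few lines of Python. This is a sketch under assumptions: the helper names `winutils_target` and `make_bin_folder` are hypothetical, and the root folder is whatever you choose (the next section uses C:\winutils).

```python
from pathlib import Path, PureWindowsPath


def winutils_target(hadoop_home):
    """Return where winutils.exe is expected for a given winutils root folder."""
    return PureWindowsPath(hadoop_home) / "bin" / "winutils.exe"


def make_bin_folder(root):
    """Create <root>/bin if it does not exist yet and return its path."""
    bin_dir = Path(root) / "bin"
    bin_dir.mkdir(parents=True, exist_ok=True)
    return bin_dir
```

For example, `winutils_target(r"C:\winutils")` gives the full path at which the downloaded file should end up.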

Environment variables

  1. Create a new environment variable with the name "HADOOP_HOME" and the value set to the location of winutils, which is "C:\winutils", then click "OK".
  2. For Spark, create another environment variable with the name "SPARK_HOME" and the value set to the location of Spark, which is "C:\spark", then click "OK".
  3. Finally, double-click 'Path', add a new entry "%SPARK_HOME%\bin", and click "OK".
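
Windows expands a reference such as %SPARK_HOME% when it reads Path, and it treats variable names case-insensitively. The sketch below imitates that expansion on a plain dictionary so you can see what the final entry resolves to; the function names are hypothetical and this is not how Windows itself is implemented.

```python
import re


def expand_percent_vars(value, env):
    """Expand %NAME% references using the mapping env.

    Names are matched case-insensitively; unknown names are left untouched.
    """
    upper_env = {key.upper(): val for key, val in env.items()}

    def replace(match):
        return upper_env.get(match.group(1).upper(), match.group(0))

    return re.sub(r"%([^%]+)%", replace, value)


def resolved_path_entries(env):
    """Split the Path variable on ';' and expand each entry."""
    raw = {key.upper(): val for key, val in env.items()}.get("PATH", "")
    return [expand_percent_vars(entry, env) for entry in raw.split(";") if entry]
```

With SPARK_HOME set to C:\spark, the entry %SPARK_HOME%\bin resolves to C:\spark\bin, which is the folder that holds the pyspark launcher.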

Finalizing PySpark Installation

  1. Open Command Prompt and type the following command.
    [Screenshot: command]
  2. Once everything is set up correctly, the following message is shown.
    [Screenshot: startup message]
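
One way to confirm that the launchers are reachable after editing Path is to look them up from Python. This is a minimal sketch, not part of the original steps; the function name `find_launchers` is hypothetical, while `java`, `pyspark`, and `spark-submit` are the launcher names shipped with Java and in Spark's bin folder.

```python
import shutil


def find_launchers(names=("java", "pyspark", "spark-submit")):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Print one line per launcher so missing entries are easy to spot.
for name, location in find_launchers().items():
    print(f"{name}: {location or 'not found on PATH'}")
```

Any launcher reported as not found usually points to a missing or mistyped Path entry. Run the check in a new Command Prompt window, since windows opened earlier do not see the updated variables.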

Linux Installation

This section covers installation on Linux. It consists of installing Java and Apache Spark and setting the environment variables for each.

Python is a recommended prerequisite and should be installed first.

Java Installation

  1. Go to Download Java JDK.
    Visit Oracle's website to download the Java Development Kit (JDK).

  2. Go to the download section for Linux and download the package that matches your system.
  3. Save the file to your local machine by clicking "OK".
  4. Go to your terminal and check for the downloaded file using the 'ls' command.
  5. Install the downloaded Debian package of Java using the following command.
    [Screenshot: command]
  6. Finally, check your Java version using the 'java -version' command, which works on every Java release.
  7. To configure the environment variables, open the 'gedit' text editor using the following command.
    [Screenshot: command]
  8. Make the change by providing the following information, where the Java path is specified.
    [Screenshot: file contents]
  9. To apply the change, type the following command.
    [Screenshot: command]
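
The version check in step 6 can also be scripted. The sketch below parses the text that `java -version` prints; the function names are hypothetical, and the sample strings in the comments are typical of Oracle and OpenJDK builds rather than taken from this tutorial's machine.

```python
import re


def parse_java_version(output):
    """Extract the quoted version string from `java -version` output.

    Returns None when no version string is found.
    """
    match = re.search(r'version "([^"]+)"', output)
    return match.group(1) if match else None


def java_major_version(version):
    """Return the major release: 8 for '1.8.0_251', 11 for '11.0.8'."""
    numbers = re.findall(r"\d+", version)
    # Releases up to Java 8 use the legacy '1.x' numbering scheme.
    if numbers[0] == "1" and len(numbers) > 1:
        return int(numbers[1])
    return int(numbers[0])
```

Note that `java -version` writes to standard error, so when capturing it with `subprocess.run(["java", "-version"], capture_output=True, text=True)` the text to parse is in the `stderr` attribute of the result.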

Installing Spark

  1. Head over to the Spark homepage.
  2. Select the Spark release and package type as follows and download the .tgz file.
    [Screenshots: Spark release and package type selection]
  3. Save the file to your local machine and click 'OK'.
  4. Open your terminal and go to the directory that contains the downloaded file.
  5. Extract the file using the following command.
    [Screenshot: command]
  6. After extraction, a new directory is created, which you can see with the list ('ls') command.
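
The extraction step can also be done with Python's standard library instead of the terminal. This is a sketch under assumptions: `extract_tgz` is a hypothetical helper, and the archive and destination paths are whatever you pass in. Only extract archives from a source you trust.

```python
import tarfile
from pathlib import Path


def extract_tgz(archive, destination):
    """Extract a .tgz archive into destination.

    Returns the sorted top-level names found in the archive, which for a
    Spark download is normally a single directory.
    """
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        top_level = sorted({member.name.split("/")[0] for member in tar.getmembers()})
        tar.extractall(destination)
    return top_level
```

The returned names correspond to what the 'ls' command shows in the destination after extraction.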

Configuring Environment Variable in Linux

  1. Open the 'bashrc' file in the vim editor with the command 'vim ~/.bashrc'.
  2. Provide the following information, using the paths that apply to your computer. In my case, these were the paths to my Spark location, Python, and Java. When you are done, press 'Esc' and then type ":wq" to save and exit vim.
    [Screenshot: file contents]
  3. Saving and exiting finalizes the change. Once the updated file has been loaded (for example, in a new terminal session), the pyspark command can be run from any directory.
  4. Run the 'pyspark' command to open PySpark; the startup message is then displayed.
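
As a rough illustration of what the bashrc entries look like, the sketch below builds the export lines from paths you supply. It mirrors the variables named in step 2 (Spark location, Python, and Java); the function name `bashrc_exports` and every path passed to it are assumptions, not values from the original machine.

```python
def bashrc_exports(spark_home, python="python3", java_home=None):
    """Build the export lines to append to ~/.bashrc.

    All arguments are placeholders supplied by the caller.
    """
    lines = []
    if java_home:
        lines.append(f'export JAVA_HOME="{java_home}"')
    lines.append(f'export SPARK_HOME="{spark_home}"')
    # Prepend Spark's bin folder so the pyspark launcher is found first.
    lines.append('export PATH="$SPARK_HOME/bin:$PATH"')
    lines.append(f"export PYSPARK_PYTHON={python}")
    return lines
```

Printing `"\n".join(bashrc_exports("/opt/spark"))`, where /opt/spark is only an example location, gives text that can be pasted at the end of the file.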

Mac Installation

This section covers installation on macOS. It consists of installing Java and Apache Spark and setting the environment variables for each.

Python is a recommended prerequisite and should be installed first.

Java Installation

  1. Go to Download Java JDK.
    Visit Oracle's website to download the Java Development Kit (JDK).

  2. Go to the download section for macOS and download the package that matches your system.
  3. Confirm the Java installation by running 'java -version' in the Terminal.

Installing Apache Spark

  1. Head over to the Spark homepage.
  2. Select the Spark release and package type as follows and download the .tgz file.
    [Screenshots: Spark release and package type selection]
  3. Save the file to your local machine and click 'OK'.
  4. Extract the file using the following command.
    $ tar -xzf spark-2.4.6-bin-hadoop2.7.tgz

Configuring Environment Variable for Apache Spark and Python

Open the ~/.bashrc or ~/.zshrc file, depending on your shell, and add the lines below. macOS 10.15 (Catalina) and later use zsh by default, while earlier versions use bash.

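# Assumes the archive was extracted in ~/Downloads; adjust to your own location.
export SPARK_HOME="$HOME/Downloads/spark-2.4.6-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"
export PYSPARK_PYTHON=python3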
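
If you are unsure which of the two startup files applies, the choice follows from your login shell. The sketch below shows that mapping; `rc_file_for` is a hypothetical helper and it covers only the two shells discussed here.

```python
from pathlib import PurePosixPath


def rc_file_for(shell_path):
    """Return the startup file to edit for a given login shell path."""
    shell = PurePosixPath(shell_path).name
    if shell == "zsh":
        return "~/.zshrc"
    if shell == "bash":
        return "~/.bashrc"
    raise ValueError(f"unsupported shell: {shell_path}")
```

The login shell is normally available in the SHELL environment variable, so `rc_file_for(os.environ["SHELL"])` tells you which file to open.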

Run the 'pyspark' command to open PySpark; the startup message is then displayed.

Congratulations

Congratulations, you have made it to the end of this tutorial!

In this tutorial, you've learned how to install PySpark, starting with the installation of Java and Apache Spark and then managing the environment variables on Windows, Linux, and macOS.

If you would like to learn more about PySpark, take DataCamp's Introduction to PySpark.

Check out our Apache Spark Tutorial: ML with PySpark.
