How to install Spark (PySpark) on Windows

Apache Spark is a general-purpose cluster computing engine aimed mainly at distributed data processing. In this tutorial, we will walk you through the step-by-step process of setting up Apache Spark on Windows.

Spark supports a number of programming languages, including Java, Python, Scala, and R. In this tutorial, we will set up Spark with a Python development environment using the Spark Python API (PySpark), which exposes the Spark programming model to Python.

Required Tools and Technologies:

– Python Development Environment
– Apache Spark
– Java Development Kit (Java 8)
– Hadoop winutils.exe

Pointers for smooth installation:

– As of the writing of this blog, Spark is not compatible with Java version 9 or later. Please ensure that you install Java 8 to avoid installation errors (a quick check is sketched right after this list).
– Apache Spark version 2.4.0 has a reported bug that breaks worker.py and makes it incompatible with Windows. Any other version above 2.0 will do fine.
– If you are using a Python 3 development environment, ensure that a separate Python 2.7 installation is not also present on your system.
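
Before starting, you can sanity-check the Java and Python pointers above with a short Python snippet such as the one below. This is only a convenience sketch: it assumes java is reachable from your command prompt, and the exact version string varies between JDK builds.

import subprocess
import sys

# Report the Python interpreter version (should be 3.x for this guide).
print("Python:", sys.version.split()[0])

# `java -version` prints to stderr on most JDKs, so redirect stderr to stdout.
try:
    out = subprocess.check_output(["java", "-version"], stderr=subprocess.STDOUT)
    print("Java:", out.decode().splitlines()[0])  # expect something like "1.8.0_..."
except OSError:
    print("Java was not found on the PATH - install JDK 8 first (see step 2).")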

Steps to set up Spark with Python

1. Install Python Development Environment

Enthought Canopy is a Python development environment, much like Anaconda. If you are already using one, you are covered as long as it is a Python 3 (or higher) environment. (You can also install Python 3 manually and set up the environment variables yourself if you prefer not to use a development environment.)

Download the Windows version of Enthought Canopy (version 2.1.9) that matches your system.

(If you have a pre-installed Python 2.7, it may conflict with the Python 3 installation made by the development environment.)

Follow the installation wizard to complete the installation.

Once done, right-click the Canopy icon and select Properties. Inside the Compatibility tab, ensure that Run as Administrator is checked.
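
As a quick check that the environment Canopy provides is indeed Python 3, you can run a couple of lines like these from its Python prompt (a simple sketch, not part of Canopy's own setup):

import sys

# The rest of this guide assumes a Python 3 interpreter.
print(sys.version)
assert sys.version_info[0] >= 3, "Expected a Python 3 environment"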

2. Install Java Development Kit

Java 8 is a prerequisite for working with Apache Spark. Spark is built on Scala, and Scala requires the Java Virtual Machine to run.

Download JDK 8 based on your system requirements and run the installer. Make sure to install Java to a path that does not contain spaces. For the purpose of this blog, we change the default installation location to c:\jdk (earlier versions of Spark have trouble with spaces in paths such as Program Files). The same applies when the installer proceeds to install the JRE: change its default installation location to c:\jre.

Important note: If you have a previous installation of Java, please ensure that you remove it from your system path. Spark won’t work if Java lives in a directory whose path contains a space.
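
Because a space anywhere in the Java path is the most common source of trouble, a small check like the following can be handy. It assumes Java was installed to c:\jdk as described above; JAVA_HOME itself is only set later, in step 5.

import os

# Warn if the Java installation path contains a space.
java_home = os.environ.get("JAVA_HOME", r"c:\jdk")  # falls back to the path used in this guide
if " " in java_home:
    print("Warning: the Java path contains a space:", java_home)
else:
    print("Java path looks fine:", java_home)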

3. Install Apache Spark

Download the pre-built version of Apache Spark 2.3.0. The downloaded package will be a .tgz file. Extract it using any utility such as WinRAR.

Once unpacked, copy all the contents of the unpacked folder and paste them into a new location: c:\spark.

Now, inside the new directory c:\spark, go to the conf directory and rename the log4j.properties.template file to log4j.properties.

It is advisable to change the log4j log level from ‘INFO’ to ‘ERROR’ to avoid unnecessary console clutter in spark-shell. To do this, open log4j.properties in an editor and replace ‘INFO’ with ‘ERROR’ on line number 19.
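
If you prefer to script this edit instead of doing it by hand, a rough sketch is shown below. It copies (rather than renames) the template, which has the same effect, and assumes Spark was extracted to c:\spark and that the template’s root logger line starts with log4j.rootCategory=INFO, as the 2.x templates do.

import shutil

conf_dir = r"c:\spark\conf"
template = conf_dir + r"\log4j.properties.template"
target = conf_dir + r"\log4j.properties"

# Copy the template to log4j.properties ...
shutil.copyfile(template, target)

# ... and lower the root logger level from INFO to ERROR.
with open(target) as f:
    contents = f.read()
with open(target, "w") as f:
    f.write(contents.replace("log4j.rootCategory=INFO", "log4j.rootCategory=ERROR"))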

4. Install winutils.exe

Spark uses Hadoop internally for file system access. Even if you are not working with Hadoop (or are only using Spark for local development), Windows still needs the Hadoop binaries to initialize the “Hive” context; otherwise Java will throw a java.io.IOException. This can be fixed by adding a dummy Hadoop installation that tricks the system into believing that Hadoop is actually installed.

Download the Hadoop 2.7 winutils.exe. Create a winutils directory with a bin subdirectory and copy the downloaded winutils.exe into it, so that its path becomes c:\winutils\bin\winutils.exe.

Spark SQL supports Apache Hive through HiveContext. Apache Hive is data warehouse software for analyzing and querying large datasets, principally stored on Hadoop, using SQL-like queries. HiveContext is a specialized SQLContext for working with Hive in Spark. The next step is to change the access permissions of the c:\tmp\hive directory using winutils.exe.

– Create a tmp directory containing a hive subdirectory, if it does not already exist, so that its path becomes c:\tmp\hive.
– Run the command prompt as administrator.
– Change directory to winutils\bin by executing: cd c:\winutils\bin.
– Change the access permissions using winutils.exe: winutils.exe chmod 777 \tmp\hive (these steps can also be scripted; see the sketch below).
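
For reference, the same steps can be scripted from Python. The sketch below assumes winutils.exe sits at c:\winutils\bin\winutils.exe and that the script is run from an administrator prompt.

import os
import subprocess

# Create c:\tmp\hive if it does not exist, then relax its permissions via winutils.exe.
os.makedirs(r"c:\tmp\hive", exist_ok=True)
subprocess.check_call([r"c:\winutils\bin\winutils.exe", "chmod", "777", r"\tmp\hive"])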

5. Set up Environment Variables

The final step is to set up some environment variables.

From the Start menu, go to Control Panel > System > Advanced System Settings and click the Environment Variables button in the dialog box.

Under the user variables, add three new variables:

JAVA_HOME: c:\jdk

SPARK_HOME: c:\spark

HADOOP_HOME: c:\winutils


Finally, edit the PATH user variable by adding two more paths to it:

%JAVA_HOME%\bin

%SPARK_HOME%\bin
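
After saving the variables, open a new command prompt (existing command prompts will not pick up the changes) and verify that they are visible. One quick way to check from Python is sketched below; it simply reads back the three variables defined above.

import os

for name in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
    print(name, "=", os.environ.get(name, "<not set>"))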


That’s it. You are all ready to create your own Spark applications with Python on Windows.

Testing your Spark installation 

Before diving into Spark basics, let’s first test whether our Spark installation works with Python by writing a simple Spark program that generates the squares of a given list of numbers.

– Open Enthought Canopy.
– Go to the Tools menu and select Canopy Command Prompt. This will open a command-line interface with all the environment variables and permissions needed to run Python already set up by Enthought Canopy.
– Kick off the Spark interpreter with the command pyspark. At this point, there should be no ERROR messages showing on the console. Now, run the following code:
> nums = sc.parallelize([2,4,6,8])
> nums.map(lambda x: x*x).collect()

The first command creates a resilient distributed dataset (RDD) by parallelizing the Python list [2, 4, 6, 8] given as an input argument, and stores it as ‘nums’. The second command uses the familiar map function to transform the ‘nums’ RDD into a new RDD containing the squares of those numbers. Finally, the ‘collect’ action is called on the new RDD to return a classic Python list. By executing the second command, you should see the resulting list of squared numbers:

[4, 16, 36, 64]
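
If you would rather run the same test as a standalone script via spark-submit instead of from the interactive shell, a minimal sketch looks like this (the pyspark shell creates sc for you, but a script has to create its own SparkContext):

# squares.py - run with: spark-submit squares.py
from pyspark import SparkContext

sc = SparkContext("local", "SquaresExample")

nums = sc.parallelize([2, 4, 6, 8])
print(nums.map(lambda x: x * x).collect())  # [4, 16, 36, 64]

sc.stop()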

Congratulations! You have successfully set up PySpark on Windows.
