How to install Spark (PySpark) on Windows

Mariam Jamal
Software Engineer

Owais Akbani
Senior Software Engineer

Feb 22, 2019

Apache Spark is a general-purpose cluster computing engine aimed mainly at distributed data processing. In this tutorial, we will walk you through the step-by-step process of setting up Apache Spark on Windows.

Spark supports a number of programming languages, including Java, Python, Scala, and R. In this tutorial, we will set up Spark with a Python development environment using the Spark Python API (PySpark), which exposes the Spark programming model to Python.

Required Tools and Technologies:

- Python Development Environment
- Apache Spark
- Java Development Kit (Java 8)
- Hadoop winutils.exe

Pointers for smooth installation:

- As of the writing of this blog, Spark is not compatible with Java version 9 or above. Please ensure that you install Java 8 to avoid installation errors.
- Apache Spark version 2.4.0 has a reported bug that breaks worker.py on Windows; any other version above 2.0 will do fine.
- Ensure Python 2.7 is not installed independently if you are using a Python 3 development environment.

 

Steps to set up Spark with Python

 

1. Install Python Development Environment

Enthought Canopy is a Python development environment, much like Anaconda. If you are already using one, you are covered as long as it is a Python 3 (or higher) environment. (You can also install Python 3 manually and set up the environment variables yourself if you prefer not to use a development environment.)

Download the version 2.1.9 installer compatible with your system from the Enthought Canopy website.

(If you have a pre-installed Python 2.7, it may conflict with the new Python 3 installation made by the development environment.)

Follow the installation wizard to complete the installation.


Once done, right-click the Canopy icon and select Properties. On the Compatibility tab, ensure "Run this program as an administrator" is checked.


 

2. Install Java Development Kit

Java 8 is a prerequisite for working with Apache Spark. Spark runs on top of Scala, and Scala requires the Java Virtual Machine to execute.

Download JDK 8 for your system and run the installer. Make sure to install Java to a path that doesn't contain spaces. For the purposes of this blog, we change the default installation location to c:\jdk (earlier versions of Spark have trouble with spaces in paths such as Program Files). The same applies when the installer proceeds to install the JRE: change its default installation location to c:\jre.


Important note: if you have a previous installation of Java, please ensure that you remove it from your system path. Spark won't work if Java lives in a directory whose path contains a space.

 

3. Install Apache Spark


Download the pre-built version of Apache Spark 2.3.0. The downloaded package is a .tgz archive; extract it using any utility such as WinRAR.

Once unpacked, copy all the contents of the unpacked folder and paste them into a new location: c:\spark.

Now, inside the new directory c:\spark, go to the conf directory and rename the log4j.properties.template file to log4j.properties.


It is advisable to change the log4j log level from 'INFO' to 'ERROR' to avoid unnecessary console clutter in spark-shell. To do this, open log4j.properties in an editor and, on line 19, change the line that reads log4j.rootCategory=INFO, console to log4j.rootCategory=ERROR, console.


 

4. Install winutils.exe

Spark uses Hadoop internally for file system access. Even if you are not working with Hadoop (or are only using Spark for local development), on Windows Spark still needs the Hadoop binaries to initialize the Hive context; otherwise Java throws a java.io.IOException. This can be fixed by adding a dummy Hadoop installation that tricks Spark into believing that Hadoop is actually installed.

Download the Hadoop 2.7 winutils.exe. Create a directory winutils with a subdirectory bin, and copy the downloaded winutils.exe into it so that its path becomes c:\winutils\bin\winutils.exe.

Spark SQL supports Apache Hive through HiveContext. Apache Hive is data warehouse software for analyzing and querying large datasets, principally stored on Hadoop, using SQL-like queries. HiveContext is a specialized SQLContext for working with Hive in Spark. The next step is to change the access permissions on the c:\tmp\hive directory using winutils.exe, as listed below.

- Create a tmp directory containing a hive subdirectory, if it does not already exist, so that its path becomes c:\tmp\hive.
- Run the command prompt as administrator.
- Change directory to winutils\bin by executing: cd c:\winutils\bin
- Change the access permissions using winutils.exe: winutils.exe chmod 777 \tmp\hive
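
Once these permissions are in place (and the environment variables from the next step are set), Hive-backed Spark SQL should initialize without the IOException mentioned above. As a minimal sketch, you could later verify this with a small script run via spark-submit (the app name here is arbitrary):

from pyspark.sql import SparkSession

# Building a Hive-enabled session is what requires the writable c:\tmp\hive directory
spark = SparkSession.builder \
    .appName("HiveCheck") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("SHOW DATABASES").show()   # should list at least the 'default' database
spark.stop()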


 

5. Setting up Environment Variables

The final step is to set up some environment variables.

From the Start menu, go to Control Panel > System > Advanced System Settings and click the Environment Variables button in the dialog box.

Under the user variables, add three new variables:

JAVA_HOME: c:\jdk

SPARK_HOME: c:\spark

HADOOP_HOME: c:\winutils


Finally, edit the PATH user variable by adding two more paths in it:

%JAVA_HOME%\bin

%SPARK_HOME%\bin

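Once the variables are saved, you can optionally confirm from a fresh Python prompt that they are visible. A minimal check (open a new console so the changes are picked up):

import os

# Each of these should print the path configured above; None indicates a missing variable
for name in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
    print(name, "=", os.environ.get(name))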

That’s it. You are all ready to create your own Spark applications with Python on Windows.

 

Testing your Spark installation 

Before diving into Spark basics, let's first test whether our Spark installation works with Python by writing a simple Spark program that generates the squares of a given list of numbers.

- Open Enthought Canopy.
- Go to the Tools menu and select Canopy Command Prompt. This opens a command-line interface with all the environment variables and permissions needed to run Python already set up by Enthought Canopy.


- Kick off the Spark interpreter with the pyspark command. At this point, there should be no ERROR messages on the console. Now, run the following code:
> nums = sc.parallelize([2,4,6,8])
> nums.map(lambda x: x*x).collect()      


The first command creates a Resilient Distributed Dataset (RDD) by parallelizing the Python list [2, 4, 6, 8] given as the input argument, and stores it as 'nums'. The second command uses the familiar map function to transform the 'nums' RDD into a new RDD containing the squares of the numbers. Finally, the 'collect' action is called on the new RDD to return a plain Python list. After executing the second command, you should see the resulting list of squared numbers:

[4, 16, 36, 64]
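
The same test can also be run as a standalone script with spark-submit instead of the interactive shell. A minimal sketch (the file name squares.py is just an example):

# squares.py - create the SparkContext explicitly, since spark-submit does not provide 'sc' for you
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("SquaresTest")
sc = SparkContext(conf=conf)

nums = sc.parallelize([2, 4, 6, 8])
print(nums.map(lambda x: x * x).collect())   # expected output: [4, 16, 36, 64]

sc.stop()

Run it from the Canopy Command Prompt with spark-submit squares.py.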

Congratulations! You have successfully set up PySpark on Windows.

Please feel free to reach out to us if you have any questions, or if you need any help with the development, installation, integration, upgrade, or customization of your business solutions.

We have expertise in Deep learning, Computer Vision, Predictive learning, CNN, HOG and NLP.

Connect with us for more information at Contact@184.169.241.188


A look into ETA Problem using Regression in Python - Machine Learning

Farman Shah
Senior Software Engineer
Feb 22, 2019

The term "ETA" usually means "Estimated Time of Arrival", but in the technology realm it generally refers to the "estimated completion time" of a computational process. In particular, this problem is about estimating the completion time of a batch of long-running scripts executing in parallel.

Problem

A number of campaigns run together in parallel to process data and prepare some lists. The running time for each campaign varies from 10 minutes to maybe 10 hours or more, depending on the data. A batch of campaigns is considered complete when the execution of all campaigns has finished and the results have been re-sorted to hold mutually exclusive data.

What we will do is provide a solution that can accurately estimate the completion time of campaigns based on some past data.

Data

We have very limited data available per campaign, from past executions of these campaigns:

campaign_id | start_time          | end_time            | uses_recommendations | query_count
698         | 2018-07-31 10:20:02 | 2018-08-01 02:05:48 | 0                    | 48147
698         | 2018-07-24 11:10:02 | 2018-07-25 05:42:37 | 0                    | 45223
699         | 2018-07-31 11:05:03 | 2018-08-01 07:23:16 | 0                    | 121898
699         | 2018-07-24 12:00:04 | 2018-07-25 10:21:48 | 0                    | 116721
700         | 2018-07-31 10:50:03 | 2018-08-01 06:54:53 | 0                    | 400325
700         | 2018-07-24 11:45:03 | 2018-07-25 09:53:03 | 0                    | 353497
811         | 2018-07-31 15:20:03 | 2018-08-01 01:54:51 | 1                    | 2601500
811         | 2018-07-24 11:00:02 | 2018-07-25 05:36:30 | 1                    | 2609112

 

Feature Engineering / Preprocessing

These are the campaigns for which past data is available. A batch can consist of one or many campaigns from the above list. The uses_recommendations feature resulted from feature engineering: it helps the machine differentiate between campaigns that depend on an over-the-network API and those that do not, so the model can implicitly account for network lag.

Is this a Time Series Forecasting problem?

It could have been, but our analysis showed that the time of year doesn't impact the data that much. So this problem can be tackled as a regression problem instead of a time series forecasting problem.

How did it turn into a regression problem?

The absolute difference in seconds between the start time and end time of a campaign is a numeric variable that can be made the target variable. This is what we are going to estimate using regression.

Regression

Our input X is the available data, and the output y is the time difference between the start time and the end time. Now let's import the dataset and start processing.

import pandas as pd

dataset = pd.read_csv('batch_data.csv')

As can be seen, our data has no missing entries right now, but since this may become an automated process, we had better handle NA values anyway. The following command fills NA values with the column mean.

dataset = dataset.fillna(dataset.mean())

 

The output y is the difference between start time and end time of the campaign. Let’s set up our output variable.

start = pd.to_datetime(dataset['start_time'])
process_end = pd.to_datetime(dataset['end_time'])
y = (process_end - start).dt.total_seconds()  # total_seconds() also handles durations longer than a day

 

y has been computed from the dataset, and we won't be needing the start_time and end_time columns in X.

X = dataset.drop(['start_time', 'end_time'], axis=1)

 

You might be wondering how the machine would differentiate between the campaign_ids here in particular, or any such categorical data in general.

A quick recap in case you already know the concept of one-hot encoding: it is a method of creating a toggle variable for each categorical value, so that the variable is 1 for rows belonging to that category and all the other toggle variables are 0.
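
As a quick, hypothetical illustration of the idea (using pandas get_dummies on a toy column, not our actual pipeline):

import pandas as pd

toy = pd.DataFrame({'campaign_id': [698, 699, 700, 811]})
# Each row gets a 1 in the column for its own campaign_id and 0 everywhere else
print(pd.get_dummies(toy['campaign_id'], prefix='campaign'))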

If you don't already know the one-hot encoding concept, it is highly recommended to read more about it online and then come back. We'll use OneHotEncoder from the sklearn library.

from sklearn.preprocessing import OneHotEncoder

# Note: categorical_features works in older scikit-learn versions (< 0.22);
# newer versions use ColumnTransformer to select the column instead.
onehotencoder = OneHotEncoder(categorical_features=[0], handle_unknown='ignore')
X = onehotencoder.fit_transform(X).toarray()

# Avoiding the dummy variable trap: drop one of the one-hot columns
X = X[:, 1:]

 

Now that the input data is ready, one final thing to do is separate out some data so we can later test how well our algorithm performs. We are separating out 20% of the data at random.

 

# Splitting the dataset into the training set and test set
# (train_test_split lives in sklearn.model_selection; the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

 

After trying linear regression, SVR (Support Vector Regression), XGBoost, and random forests on the data, it turned out that the linear and SVR models don't fit the data well. The other finding was that the performance of XGBoost and random forest was close for this data. Since the difference is slight, let's move forward with random forest.
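
If you want to reproduce that comparison yourself, a rough sketch along these lines works (the model choices and parameters here are illustrative, not the exact ones used for the findings above):

from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Fit each candidate on the training split and compare test RMSE (converted to minutes)
models = {
    'Linear': LinearRegression(),
    'SVR': SVR(),
    'RandomForest': RandomForestRegressor(n_estimators=300, random_state=1),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = sqrt(mean_squared_error(y_test, model.predict(X_test))) / 60
    print(name, round(rmse, 2))

XGBoost can be added to the dictionary in the same way via its sklearn-compatible XGBRegressor, if the xgboost package is installed.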

 

# Fitting the random forest regressor to the dataset
from sklearn.ensemble import RandomForestRegressor

# Note: in scikit-learn 1.0+ the 'mae' criterion is named 'absolute_error'
regressor = RandomForestRegressor(n_estimators=300, random_state=1, criterion='mae')
regressor.fit(X_train, y_train)

 

The regression has been performed and a regressor has been fitted to our data. What follows next is checking how good the fit is.

 

Performance Measure

We'll use Root Mean Square Error (RMSE) as our measure of goodness. The lower the RMSE, the better the regression.

Let's ask our regressor to make predictions on our training data, that is, 80% of the total data we had. This gives a glimpse of the training accuracy. Later we'll make predictions on the test data, the remaining 20%, which tells us about the performance of this regressor on unseen data.

If the performance on the training data is very good but the performance on unseen data is poor, then our model is overfitting. Ideally, the performance on unseen data should be close to that on the training data.

from sklearn.metrics import mean_squared_error
from math import sqrt
training_predictions = regressor.predict(X_train)

training_mse = mean_squared_error(y_train, training_predictions)
training_rmse = sqrt(training_mse) / 60 # Divide by 60 to turn it into minutes

 

We now have the training RMSE; print it and see by how many minutes, on average, the predictions deviate from the actual values.

Now, let’s get the test RMSE.

 

test_predictions = regressor.predict(X_test)
test_mse = mean_squared_error(y_test, test_predictions)

test_rmse = sqrt(test_mse) / 60  # again in minutes

 

Compare test_rmse with training_rmse to see how well the regression performs on seen and unseen data.
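
To turn a prediction into an actual ETA, add the predicted duration back to the campaign's start time. A minimal, hypothetical sketch (new_campaign_features stands for a single row encoded exactly like the training data):

import pandas as pd

# new_campaign_features: a 1-row, already-encoded feature array (hypothetical)
predicted_seconds = regressor.predict(new_campaign_features)[0]

start_time = pd.Timestamp('2018-08-07 11:00:00')   # hypothetical start time of the new run
eta = start_time + pd.Timedelta(seconds=predicted_seconds)
print('Estimated completion time:', eta)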

What's next for you is to try fitting XGBoost, SVR, and any other regression models that you think should fit this data well, and see how the performance of the different models differs.

Please feel free to reach out to us if you have any questions, or if you need any help with the development, installation, integration, upgrade, or customization of your business solutions.

We have expertise in Deep learning, Computer Vision, Predictive learning, CNN, HOG and NLP.

Connect with us for more information at Contact@184.169.241.188
