Apache Spark in Big Data and Data Science Ecosystem

Some time ago, Kaggle launched a very interesting pair of image recognition challenges for landmarks, which come with an enormous dataset of landmark images.

Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions, are referred to as “Big Data”.

This blog post is dedicated to one of the most popular Big Data tools: Apache Spark, which as of today has over 500 contributors from more than 200 organizations. Technically speaking, Apache Spark is a cluster computing framework built for general-purpose, high-speed processing of large distributed datasets. However, to truly appreciate and utilize the potential of this fast-growing Big Data framework, it is vital to understand the concept behind Spark’s creation and the reason for its crowned place in Data Science.

Data Science itself is an interdisciplinary field that deals with the processes and systems used to extract inferences and insights from the large volumes of data referred to as Big Data. Some of the methods used in data science for analysing data include machine learning, data mining, visualization, and computer programming, among others.

Before Spark

With the boom of Big Data came great challenges of data discovery, storage, and analytics on gigantic amounts of data. Apache Hadoop appeared in the picture as one of the most comprehensive frameworks for addressing these challenges: its Hadoop Distributed File System (HDFS) handles storage and spreads data across a cluster, YARN manages application runtimes, and MapReduce is the programming model that makes seamless parallel processing of batch data possible. Hadoop MapReduce set the Big Data wheel rolling by taking care of batch data processing.

However, other use cases for data analysis, such as visualizing and streaming big data, still needed a practical solution. Additionally, the constant I/O operations against HDFS made latency another major concern in data processing with Hadoop.

To support other methods of data analysis, the Apache Software Foundation took a leading role and introduced various purpose-built frameworks, for example Hive for SQL-style querying, Pig for data-flow scripting, Storm for stream processing, and Mahout for machine learning.

However, while the industry now had multiple robust frameworks to aid in data processing, there was no unified, powerful engine able to process multiple types of data. There was also vast room for improvement in the I/O latency of Hadoop’s batch processing.

Enter Spark

The Apache Spark project entered the Big Data world as a unified, general-purpose data processing engine addressing real-time processing, interactive processing, graph processing, in-memory processing, and batch processing, all under one umbrella.
Spark aims at speed, ease of use, extensibility, and interactive analytics, with the flexibility to run standalone or within an existing cluster environment. At its core it offers high-level APIs in Java, Scala, Python, and R, along with four other components empowering big data processing.

Components of Apache Spark

Spark Core API

Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built atop. It contains the basic functionality of Spark, including job scheduling, fault recovery, memory management, and storage system interactions. Spark Core is also where the API for resilient distributed datasets (RDDs) is defined, which serves as Spark’s basic programming abstraction. An RDD in Spark is an immutable, distributed collection of records.
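
As a minimal sketch of the RDD abstraction (assuming a local Spark installation; the data is made up), the snippet below builds an RDD from a collection, applies lazy transformations, and triggers computation with an action:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark on all local cores; in a cluster this would point at the cluster manager
    val sc = new SparkContext(new SparkConf().setAppName("RddSketch").setMaster("local[*]"))

    // parallelize turns a local collection into a distributed, immutable RDD
    val numbers = sc.parallelize(1 to 10)

    // transformations (filter, map) are lazy; nothing runs until an action is called
    val squaresOfEvens = numbers.filter(_ % 2 == 0).map(n => n * n)

    // collect is an action: it triggers the computation and returns the results to the driver
    println(squaresOfEvens.collect().mkString(", "))   // 4, 16, 36, 64, 100

    sc.stop()
  }
}
```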

Spark SQL and DataFrames

Keeping in mind the heavy reliance of technical users on SQL queries for data manipulation, Spark includes a module, Spark SQL, for structured data processing, which supports many data sources including Hive tables, Parquet, and JSON.

It also provides a programming abstraction called DataFrames and can act as a distributed SQL query engine. Along with the SQL interface, Spark allows SQL queries to be intermixed with programmatic data manipulations on its RDDs and DataFrames in Python, Java, Scala, and R.
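
As a hedged illustration (the people.json file and its fields are hypothetical), the sketch below loads JSON into a DataFrame and then runs the same query both as plain SQL and through the DataFrame API:

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SqlSketch").master("local[*]").getOrCreate()

    // hypothetical JSON file with records like {"name": "Alice", "age": 34}
    val people = spark.read.json("people.json")

    // register the DataFrame as a temporary view so it can be queried with SQL
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    // the same query expressed with the programmatic DataFrame API
    people.filter(people("age") > 30).select("name", "age").show()

    spark.stop()
  }
}
```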

Spark Streaming

Spark Streaming is the component that brings the power of real-time data processing to the Spark framework, enabling programmers to build applications that deal with data stored in memory, on disk, or arriving in real time. Running atop Spark Core, Spark Streaming inherits its ease of use and fault-tolerance characteristics.
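
A minimal sketch of the classic DStream-based streaming word count, assuming a text source on localhost:9999 (for example started with `nc -lk 9999`); the host, port, and batch interval are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // at least two local threads: one to receive the stream, one to process it
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

    // hypothetical source: a TCP text stream on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()   // print a sample of each batch's counts

    ssc.start()
    ssc.awaitTermination()
  }
}
```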

MLlib

Machine learning has emerged as a pivotal element in mining Big Data for actionable insights. Spark ships with a library of common machine learning (ML) functionality, called MLlib, delivering both high-quality algorithms (e.g., multiple iterations to increase accuracy) and blazing speed (much faster than MapReduce). It also provides some lower-level ML primitives, including a generic gradient descent optimization algorithm, all with the ability to scale out across multiple nodes.
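
As an illustrative sketch (the toy feature vectors and labels below are made up), MLlib’s DataFrame-based API can fit a logistic regression model in a few lines:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MLlibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MLlibSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // made-up training data: (label, feature vector)
    val training = Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    ).toDF("label", "features")

    // logistic regression with a handful of iterations and light regularization
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val model = lr.fit(training)

    // score the training data itself, just to show the round trip
    model.transform(training).select("features", "label", "prediction").show()

    spark.stop()
  }
}
```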

GraphX

GraphX is a graph manipulation engine built on top of Spark that enables users to interactively build and transform graph data at scale, as well as to perform graph-parallel computations. GraphX comes with a complete library of common graph algorithms (e.g. PageRank and triangle counting) and various operators for graph manipulation (e.g. subgraph and mapVertices).
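
A small hedged sketch of GraphX (the three-vertex toy graph is invented): build a graph from vertex and edge RDDs, then run the built-in PageRank algorithm:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object GraphxSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GraphxSketch").setMaster("local[*]"))

    // made-up toy graph: three users following each other in a cycle
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges: RDD[Edge[Int]] =
      sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

    val graph = Graph(vertices, edges)

    // built-in PageRank, iterated until it converges to the given tolerance
    val ranks = graph.pageRank(0.0001).vertices
    ranks.join(vertices).collect().foreach {
      case (_, (rank, name)) => println(f"$name%-6s $rank%.3f")
    }

    sc.stop()
  }
}
```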

How has Spark sped up data processing?

Beyond providing a general-purpose unified engine, the main innovation of Spark over classical big data tools is its capability for ‘in-memory data processing’. Unlike Hadoop MapReduce, which persists the entire dataset to disk after running each job, Spark takes a more holistic view of the job pipeline, feeding the output of one operation directly into the next without writing it to persistent storage. Along with in-memory data processing, Spark introduced an ‘in-memory caching abstraction’ that allows multiple operations to work with the same dataset, so that it does not need to be re-read from disk for every single operation. Hence it is billed as a ‘lightning-fast’ analytics engine on its official site.
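
A small sketch of the caching idea (the log file path and its contents are hypothetical): once an RDD is marked with cache(), the first action materialises it in memory and later actions reuse it instead of re-reading from disk:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheSketch").setMaster("local[*]"))

    // hypothetical log file; cache() asks Spark to keep the filtered RDD in memory
    val errors = sc.textFile("app.log").filter(_.contains("ERROR")).cache()

    // first action: reads the file, filters it, and caches the result
    println(s"error count: ${errors.count()}")

    // second action: served from the in-memory cache, with no re-read from disk
    errors.filter(_.contains("timeout")).take(5).foreach(println)

    sc.stop()
  }
}
```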

What filesystem does Spark use?

Apache Spark entered the Big Data ecosystem as a tool that enhances existing frameworks rather than reinventing the wheel. Unlike Hadoop, Spark does not come with its own file system; instead, it can be integrated with many storage systems, including Hadoop’s HDFS, Amazon’s S3, and data stores such as MongoDB. By providing a common platform for multiple types of data processing, and by replacing MapReduce with in-memory processing that supports iterative programming, Spark has gained considerable momentum in data analytics.
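
As a hedged sketch (the paths, bucket name, and namenode address are invented), the same read API works against different storage backends, provided the matching connector is on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object StorageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StorageSketch").master("local[*]").getOrCreate()

    // local file system
    val local = spark.read.textFile("/data/events.txt")

    // HDFS (hypothetical namenode address); requires a reachable Hadoop cluster
    val fromHdfs = spark.read.textFile("hdfs://namenode:8020/data/events.txt")

    // Amazon S3 (hypothetical bucket); requires the hadoop-aws connector and credentials
    val fromS3 = spark.read.textFile("s3a://my-bucket/data/events.txt")

    println(local.count() + fromHdfs.count() + fromS3.count())

    spark.stop()
  }
}
```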

Has Spark replaced Hadoop?

Lastly, a common misconception worth addressing is that Apache Spark is a replacement for Hadoop. It is not. Although Spark provides many features beyond what Hadoop offers, even such a comprehensive framework is not necessarily the best choice for every use case. Because of its in-memory data processing, Spark demands a lot of RAM, which can become a bottleneck when cost-efficient processing of big data is the priority. Furthermore, Spark is not designed for multi-user environments and lacks strong support for concurrent execution out of the box. It is therefore important to be fully familiar with the use case at hand before deciding which big data tool to work with.

This was a brief introduction to Apache Spark’s place in the Big Data and Data Science ecosystem. For a deeper understanding of Apache Spark programming and Big Data analytics, follow the blogs on folio3.ai.

Please feel free to reach out to us if you have any questions, or if you need any help with development, installation, integration, upgrades, or customization for your business solutions.

We have expertise in Deep learning, Computer Vision, Predictive learning, CNN, HOG and NLP.

Connect with us for more information at [email protected]
