Folio3 AI Meetup – Press Release

Folio3 AI and Lahore School of AI hosted a meetup in Lahore on Artificial Intelligence

PRESS RELEASE

[April 8th, 2019, Lahore, Pakistan] Folio3 AI and Lahore School of AI teamed up to host a meetup for AI enthusiasts hailing from enterprises, startups and academic institutions at Folio3’s Lahore office.

The AI meetup was the inaugural meet-up hosted by Folio3 AI to discuss the latest trends, challenges and opportunities in the realm of Artificial Intelligence. It was organized in collaboration with the Lahore Chapter of the School of AI, which has been conducting similar sessions in the past to raise awareness of the subject and build a community within the field.

The venue for the AI Meetup was the new Lahore office of Folio3 in Gulberg III, and the event took place on 5th April, 2019. AI enthusiasts from diverse backgrounds attended, including students, professionals, professors and entrepreneurs.

Ahmed Hassan, who heads the Lahore office of Folio3, welcomed the audience in his opening speech, introduced the guest speakers and spoke about Folio3 and its vision for innovation.

The introduction was followed by the keynote sessions, the first of which was delivered by Kshitiz Rimal, who joined remotely from Nepal. Mr. Rimal is a Google Developers Expert and Head of Research at AID Nepal. He presented “Transfer Learning to Save the World” to the attendees and answered their queries, drawing on his vast experience with the subject.

Later on, Dr. Kashif Zafar, a distinguished researcher and former Head of Department and Professor in the CS Department at FAST Lahore, contributed to the event with his informative presentation on “Evolution of AI and Industry 4.0” and also spoke about processes and various initiatives for aspiring AI professionals.

Last but certainly not least, Muhammad Usman, currently a Data Scientist at IBM, presented and showcased the significance of Big Data with his talk “Domesticating your Big Data,” giving the attendees a new perspective on the subject.

The event concluded with the presentation of mementos to the speakers, refreshments for the attendees, and a lively, continuous Q/A session focusing on the need for innovation and future possibilities within AI.


About Lahore School of AI:  It is a community of AI practitioners spread across the globe working in tandem to solve real-world problems. It is part of the global “School of AI” community, whose mission is to offer a world-class AI education to anyone on Earth for free. https://www.facebook.com/groups/LahoreAICommunity/

About Folio3 AI: It is the innovation wing of Folio3 Software Inc., a Silicon Valley based software development and technology solutions provider with a global presence in over 5 countries and a worldwide workforce of more than 250 professionals.

Folio3 AI has a team of dedicated Data Scientists and Consultants that have delivered end-to-end projects related to machine learning, natural language processing, computer vision and predictive analysis. https://www.folio3.ai


For Press Inquiries – Please Contact:

Bakhtiar Shah

Marketing Department – Folio3 AI

sabakhtiar@folio3.com

+1 (408) 365 4638

Understanding MapReduce with Hadoop

Mariam Jamal
Software Engineer

 

Owais Akbani
Senior Software Engineer


April 06, 2019

To understand the MapReduce algorithm, it is vital to understand the challenge it attempts to solve. With the rise of the digital age and the capability of capturing and storing data, there has been an explosion in the amount of data at our disposal. Businesses and corporations were intuitive enough to realize the true potential of this data for gaining insights about customer needs and making predictions to support informed decisions; yet, within only a few years, managing this gigantic amount of data posed a serious challenge for organizations. This is where Big Data comes into the picture.

Big data refers to the gigantic volumes of structured and unstructured data and the ways of dealing with it to aid in strategic business planning, reduction in production costs, and smart decision making. However, with Big Data came great challenges of capturing, storing, analyzing and sharing this data with traditional database servers. As a major breakthrough in processing of immense data, Google came up with the MapReduce algorithm inspired by the classic technique: Divide and Conquer.

MapReduce Algorithm

MapReduce, when combined with the Hadoop Distributed File System, plays a crucial role in Big Data Analytics. It introduces a way of performing multiple operations on large volumes of data in parallel, in batch mode, using the ‘key-value’ pair as the basic unit of data for processing.

The MapReduce algorithm involves two major components: Map and Reduce.

The Map component (aka Mapper) is responsible for splitting large data into equal-sized chunks of information, which are then distributed among a number of nodes (computers) in such a way that the load is balanced, and faults and failures are managed by rollbacks.

The Reduce component (aka Reducer) comes into play once the distributed computation is completed and acts as an accumulator to aggregate the results as final output.
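
To make the division of labour concrete, here is a minimal pure-Python sketch (plain Python, not Hadoop code; the sample records are illustrative) of the same idea: a map step emits key-value pairs, a grouping step collects the values per key, and a reduce step aggregates them.

from collections import defaultdict

records = ["Folio3 introduces ML", "BigData facilitates ML"]

# Map: emit a (key, value) pair for every word, here (word, 1)
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle: group the values belonging to the same key together
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each key's values into the final result
word_counts = {key: sum(values) for key, values in groups.items()}
print(word_counts)  # {'Folio3': 1, 'introduces': 1, 'ML': 2, 'BigData': 1, 'facilitates': 1}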

Hadoop MapReduce

Hadoop MapReduce is an implementation of MapReduce algorithm by Apache Hadoop project to run applications where data is processed in a parallel way, in batches, across multiple CPU nodes.

The entire process of MapReduce includes four stages.

 

1. Input Split

In the first phase, the input file is located and transformed for processing by the Mapper. The file gets split up into fixed-sized chunks on the Hadoop Distributed File System. The input file format decides how to split up the data using a function called InputSplit. The intuition behind splitting data is simply that the time taken to process a split is always smaller than the time to process the whole dataset, and that splits allow the load to be balanced evenly across multiple nodes within the cluster.

2. Mapping

Once all the data has been transformed into an acceptable form, each input split is passed to a distinct instance of the mapper to perform computations that result in key-value pairs from the dataset. All the nodes participating in the Hadoop cluster perform the same map computations on their respective local datasets simultaneously. Once mapping is completed, each node outputs a list of key-value pairs, which are written to the local disk of the respective node rather than to HDFS. These outputs are then fed as inputs to the Reducer.

3. Shuffling and Sorting

Before the reducer runs, the intermediate results of the mapper are gathered together by a Partitioner to be shuffled and sorted, preparing them for optimal processing by the reducer.
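
Conceptually, shuffling and sorting simply bring all values for the same key next to each other. A rough pure-Python analogue (not Hadoop's actual partitioner; the sample pairs are made up) is to sort the mapper output by key and then group it, which is also why the local test run later in this post pipes the mapper output through sort before the reducer:

from itertools import groupby
from operator import itemgetter

# Intermediate (word, count) pairs as a mapper might emit them
mapper_output = [("Folio3", 1), ("ML", 1), ("Folio3", 1), ("BigData", 1), ("ML", 1)]

# Sorting by key makes equal keys adjacent, so they can be grouped for the reducer
for key, pairs in groupby(sorted(mapper_output, key=itemgetter(0)), key=itemgetter(0)):
    print(key, [value for _, value in pairs])
# BigData [1]
# Folio3 [1, 1]
# ML [1, 1]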

4. Reducing

For each output, reduce is called to perform its task. The reduce function is user-defined. The Reducer takes the intermediate shuffled output as input and aggregates all these results into the desired result set. The output of the reduce stage is also a key-value pair but can be transformed in accordance with application requirements by making use of OutputFormat, a feature provided by Hadoop.

It is clear from the order of the stages that MapReduce is a sequential algorithm: the Reducer cannot start its operation until the Mapper has completed its execution. Despite being sequential and prone to I/O latency, MapReduce is regarded as the heart of Big Data Analytics owing to its parallelism and fault tolerance.

Now that we are familiar with the gist of the MapReduce algorithm, we will translate the classic word-count example into Python code.

MapReduce in Python 

We aim to write a simple MapReduce program for Hadoop in Python that counts the occurrences of each word in a given input file.

We will make use of the Hadoop Streaming API to pass data between the different phases of MapReduce through STDIN (standard input) and STDOUT (standard output).

1. First of all, we need to create an example input file.

Create a text file named dummytext.txt and copy the following text into it:

            Folio3 introduces ML.

            Folio3 introduces BigData.

            BigData facilitates ML.

2. Now, create mapper.py to be executed in the Map phase.

Mapper.py will read data from standard input and print to standard output a list of tuples for each word occurring in the input file.

            “mapper.py”

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # split the line into words
    words = line.split()
   
    for word in words:
        # write the results to STDOUT (standard output)
        # tab-delimited words with default count 1
        print '%s\t%s' % (word, 1)

3. Next, create a file named reducer.py to be executed in the Reduce phase. Reducer.py will take the output of mapper.py as its input and sum the occurrences of each word into a final count.

                       

“reducer.py”

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop (or the local
    # sort command) sorts the mapper output by key before it
    # reaches the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# to output the last word if needed
if current_word == word:
    print '%s\t%s' % (current_word, current_count)

4. Make sure you make the two programs executable by using the following commands:

           

          > chmod +x mapper.py

          > chmod +x reducer.py

 

You can find the full code at the Folio3 AI repository.

Running MapReduce Locally

> cat dummytext.txt | python mapper.py | sort -k1 | python reducer.py

Running MapReduce on Hadoop Cluster 

We assume that the default user created in Hadoop is f3user.

1. Firstly, we copy the local dummy file to the Hadoop Distributed File System by running:

           

> hdfs dfs -put /src/dummytext.txt /user/f3user

2. Finally, we run our MapReduce job on the Hadoop cluster, leveraging the Streaming API to support standard I/O.

 

> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -file /src/mapper.py -mapper "python mapper.py" \
    -file /src/reducer.py -reducer "python reducer.py" \
    -input /user/f3user/dummytext.txt -output /user/f3user/wordcount

 

The job will take its input from ‘/user/f3user/dummytext.txt’ and write its output to ‘/user/f3user/wordcount’.

Running this job will produce the word counts as its output.
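
For the three-line dummytext.txt created above, the counts should come out roughly as below; the exact ordering of keys may vary with how they are sorted, and punctuation stays attached to words because mapper.py splits only on whitespace:

            BigData	1
            BigData.	1
            Folio3	2
            ML.	2
            facilitates	1
            introduces	2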

Congratulations, you just completed your first MapReduce application on Hadoop with Python!

Please feel free to reach out to us if you have any questions, or in case you need any help with development, installation, integration, upgrades or customization for your business solutions.

We have expertise in Deep learning, Computer Vision, Predictive learning, CNN, HOG and NLP.

Connect with us for more information at Contact@folio3.ai


Apache Spark in Big Data and Data Science Ecosystem

Mariam Jamal 
Software Engineer
April 04, 2019

Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions, are referred to as “Big Data”.

This blog post is dedicated to one of the most popular Big Data tools: Apache Spark, which as of today has over 500 contributors from more than 200 organizations. Technically speaking, Apache Spark is a cluster computing framework meant for large distributed datasets to promote general-purpose computing with high speed. However, to truly appreciate and utilize the potential of this fast-growing Big Data framework, it is vital to understand the concept behind Spark’s creation and the reason for its crowned place in Data Science.

Data Science itself is an interdisciplinary field that deals with processes and systems that extract inferences and insights from large volumes of data referred to as Big Data. Some of the methods used in data science for the purpose of analysis of data include machine learning, data mining, visualizations, and computer programming among others.

 

Before Spark

With the boom of Big Data came great challenges of data discovery, storage and analytics on gigantic amounts of data. Apache Hadoop appeared as one of the most comprehensive frameworks for addressing these challenges: its Hadoop Distributed File System (HDFS) handles storage and parallelizes data across a cluster, YARN manages application runtimes, and MapReduce is the algorithm that makes seamless parallel processing of batch data possible. Hadoop MapReduce set the Big Data wheel rolling by taking care of batch data processing.

However, other use cases for data analysis, such as visualizing and streaming big data, still needed a practical solution. Additionally, the constant I/O operations going on in HDFS made latency another major concern in data processing with Hadoop.

To support other methods of data analysis, Apache took a leading role and introduced various frameworks, each targeting a particular processing need.

However, while the industry had multiple robust frameworks to aid in data processing, there was no unified, powerful engine able to process multiple types of data. There was also vast room for improvement in the I/O latency of dealing with data in batch mode in Hadoop.

 

Enter Spark

The Apache Spark project entered the Big Data world as a unified, general-purpose data processing engine addressing real-time processing, interactive processing, graph processing, in-memory processing as well as batch processing, all under one umbrella.
Spark aims at speed, ease of use, extensibility and interactive analytics, with the flexibility to run alone or in an existing cluster environment. Its core exposes high-level APIs in Java, Scala, Python, and R, along with four other components empowering big data processing.

 

 

Components of Apache Spark 

Spark Core API

Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built atop. It contains the basic functionality of Spark, including job scheduling, fault recovery, memory management and storage system interactions. Spark Core is also where the API for resilient distributed datasets (RDDs), Spark’s basic programming abstraction, is defined. An RDD in Spark is an immutable distributed collection of records.
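
As a quick illustration of the RDD abstraction, here is a minimal PySpark sketch (assuming a local PySpark installation; the sample lines and app name are purely illustrative) that builds an RDD from an in-memory list and runs a small word count through map-style transformations:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Build an RDD from an in-memory collection and run a small word count
lines = sc.parallelize(["Folio3 introduces ML", "BigData facilitates ML"])
counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # emit key-value pairs
               .reduceByKey(lambda a, b: a + b))     # aggregate counts per key

print(counts.collect())   # e.g. [('Folio3', 1), ('ML', 2), ...]
sc.stop()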

Spark SQL and Dataframes

Keeping in mind a major reliance of technical users on SQL queries for data manipulation, Spark introduces a module, Spark SQL, for structured data processing which supports many sources of data including Hive tables, Parquet, and JSON.

It also provides a programming abstraction called DataFrames and can act as a distributed SQL query engine. Along with the SQL interface, Spark allows SQL queries to be intermixed with programmatic data manipulations on its RDDs in Python, Java, Scala and R.
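
A minimal sketch of the DataFrame and SQL interfaces might look like the following; the file name people.json and its name/age columns are hypothetical, and JSON stands in here for the other supported sources such as Hive tables and Parquet:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Read structured data (JSON here; Hive tables and Parquet work similarly)
df = spark.read.json("people.json")

# The DataFrame API and plain SQL can be mixed freely
df.filter(df.age > 30).show()
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

spark.stop()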

  

Spark Streaming

Spark Streaming is the component that brings the power of real-time data processing to the Spark framework, enabling programmers to work with applications that deal with data stored in memory, on disk or arriving in real time. Running atop Spark, Spark Streaming inherits Spark Core’s ease of use and fault tolerance characteristics.
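
As a sketch of the classic DStream API, the snippet below counts words arriving on a local TCP socket in 5-second micro-batches; the host, port and batch interval are assumptions, and the stream can be fed for testing with a tool such as netcat (nc -lk 9999):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-demo")   # at least 2 threads: one receiver, one worker
ssc = StreamingContext(sc, batchDuration=5)       # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print each batch's counts

ssc.start()
ssc.awaitTermination()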

  

MLlib

Machine learning has emerged as a critical element in mining Big Data for actionable insights. Spark comes with a library of common machine learning (ML) functionality, called MLlib, delivering both high-quality algorithms (e.g., multiple iterations to increase accuracy) and blazing speed (much faster than MapReduce). It also provides some lower-level ML primitives, including a generic gradient descent optimization algorithm, all with the ability to scale out across multiple nodes.
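
A minimal sketch of the DataFrame-based spark.ml API is shown below: logistic regression trained on a tiny hand-made dataset (the data points and column names are purely illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# A tiny labelled dataset: (label, feature vector)
train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)),
     (1.0, Vectors.dense(2.0, 1.0)),
     (0.0, Vectors.dense(0.1, 1.3)),
     (1.0, Vectors.dense(1.9, 0.8))],
    ["label", "features"])

model = LogisticRegression(maxIter=10).fit(train)   # iterative optimization under the hood
model.transform(train).select("label", "prediction").show()

spark.stop()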

  

GraphX

GraphX is a graph manipulation engine built on top of Spark, enabling users to interactively build and transform graph data at scale as well as to perform graph-parallel computations. GraphX comes with a complete library of common graph algorithms (e.g. PageRank and triangle counting) and various operators for graph manipulation (e.g. subgraph and mapVertices).
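
GraphX itself exposes Scala and Java APIs; from Python, graph work on Spark is usually done through the separate GraphFrames package, so the sketch below assumes graphframes is installed and available (e.g. via the --packages option) and uses a tiny made-up graph:

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graph-demo").getOrCreate()

# Vertices need an "id" column; edges need "src" and "dst" columns
vertices = spark.createDataFrame([("a", "Alice"), ("b", "Bob"), ("c", "Carol")],
                                 ["id", "name"])
edges = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")],
                              ["src", "dst"])

g = GraphFrame(vertices, edges)
g.pageRank(resetProbability=0.15, maxIter=10).vertices.show()   # PageRank, as mentioned above
g.triangleCount().show()                                         # triangle counting

spark.stop()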

 

How has Spark sped up data processing?

Beyond providing a general-purpose unified engine, the main innovation of Spark over classical big data tools is its capability for ‘in-memory data processing’. Unlike Hadoop MapReduce, which persists the entire dataset to disk after running each job, Spark takes a more holistic view of the job pipeline, feeding the output of one operation directly into the next without writing it to persistent storage. Along with in-memory processing, Spark introduced an ‘in-memory caching abstraction’ that allows multiple operations to work with the same dataset without re-reading it from disk for every single operation. Hence it is billed as a ‘lightning-fast’ analytics engine on its official site.
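
A small sketch of the caching idea: persist an RDD in memory after its first use and then run several actions over it without recomputing it from the source (the HDFS path here is hypothetical):

from pyspark import SparkContext

sc = SparkContext("local[*]", "cache-demo")

logs = sc.textFile("hdfs:///data/app.log").cache()   # keep the dataset in memory after first use

errors = logs.filter(lambda line: "ERROR" in line).count()   # first action materialises the cache
warnings = logs.filter(lambda line: "WARN" in line).count()  # reuses the cached partitions
print(errors, warnings)

sc.stop()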

 

What filesystem does Spark use?

Apache Spark entered the Big Data ecosystem as a tool that enhances existing frameworks without reinventing the wheel. Unlike Hadoop, Spark does not come with its own file system; instead, it can be integrated with many storage systems, including Hadoop’s HDFS, MongoDB, and Amazon’s S3. By providing a common platform for multiple types of data processing and replacing MapReduce with in-memory processing that supports iterative programming, Spark is gaining considerable momentum in data analytics.
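
In practice, switching storage back-ends is mostly a matter of changing the URI scheme in the path, as in the sketch below; the paths and bucket names are made up, and reading from S3 additionally requires the hadoop-aws connector and credentials to be configured:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fs-demo").getOrCreate()

df_hdfs = spark.read.text("hdfs:///user/f3user/dummytext.txt")   # Hadoop HDFS
df_s3 = spark.read.text("s3a://my-bucket/dummytext.txt")         # Amazon S3 via the s3a connector
df_local = spark.read.text("file:///tmp/dummytext.txt")          # plain local file system

spark.stop()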

 

Has Spark replaced Hadoop?

Lastly, a common misconception worth addressing is that Apache Spark is a replacement for Hadoop; it is not. Although Spark provides many features that Hadoop does not, such a comprehensive framework is not necessarily the best choice for every use case. Due to its in-memory data processing, Spark demands a lot of RAM, which can become a bottleneck when cost-efficient processing of big data is the goal. Furthermore, Spark is not designed for a multi-user environment and hence lacks the capability of concurrent execution. It is therefore important to be fully familiar with the use case at hand before deciding which big data tool to work with.

 

This was a brief introduction to Apache Spark’s place in the Big Data and Data Science ecosystem. For a deeper understanding of Apache Spark programming and Big Data analytics, follow the blogs on folio3.ai.

 

Please feel free to reach out to us if you have any questions, or in case you need any help with development, installation, integration, upgrades or customization for your business solutions.

We have expertise in Deep learning, Computer Vision, Predictive learning, CNN, HOG and NLP.

Connect with us for more information at Contact@folio3.ai
