Up & Running with Cloudera Quickstart on Docker

Mariam Jamal
Software Engineer

 

Owais Akbani
Senior Software Engineer


May 08, 2019

Apache Hadoop is a heavyweight of the Big Data world, able to scale from a single server to thousands, yet it is often used alongside additional tools to accomplish diverse Big Data tasks. Taking advantage of Hadoop’s modular architecture and open framework, many organizations have packaged Hadoop together with various big data tools to cater to a wide range of enterprise needs. One of the most widely used distributions is Cloudera’s CDH, which bundles the Hadoop core with the required software and administrative UIs in a single package.

In this blogpost we will set up Cloudera on Docker and run a simple MapReduce job on Hadoop. We will point out the nitty-gritty aspects of the installation and the issues that can otherwise consume a lot of troubleshooting time.

Setting up Cloudera Quickstart

In the first part, we will focus on setting up Cloudera Quickstart image to run inside Docker. As a prerequisite, Docker must be set up on your machine. 

1. Make sure Docker is running on your system and pull the Cloudera Quickstart image using: docker pull cloudera/quickstart:latest

As of now, the cloudera/quickstart image is about 4.4 GB, so the download may take a while depending on your connection bandwidth. Once done, verify it by running sudo docker images; you will see a cloudera/quickstart image among those listed.

2. Once the image has been pulled, we will go ahead and run the image using the following command:

docker run --hostname=quickstart.cloudera --privileged=true -t -i \
 -v /home/mariam/project:/src -p 8888:8888 -p 80:80 -p 7180:7180 cloudera/quickstart /usr/bin/docker-quickstart

The command contains various flags explained below:

--hostname=quickstart.cloudera: Hostname for the pseudo-distributed configuration. It should not be changed, otherwise some services might not start correctly.
--privileged=true: Gives extended privileges to the container for HBase, the Hive metastore, etc.
-t: Starts a terminal emulator for running services.
-i: Keeps standard input open so the terminal can be used.
-v: Mounts a local volume to a directory inside the Cloudera container.
-p: Publishes the container’s ports to the host. Cloudera exposes different services on different ports:
● 8888: Hue
● 7180: Cloudera Manager
● 80: Cloudera Tutorial

Credentials for cloudera quickstart administrative services are:
Username: cloudera
Password: cloudera

Running the container will start various services exposed by Cloudera.


Upon successful execution, the mounted volume with files is now available in /src directory inside Cloudera container.

Sometimes HUE (Hadoop User Interface) fails to start during container startup while the other services come up fine.

Along with the various other packages that come bundled with CDH, HUE is one of the most widely used. It provides a web-based query editor to analyze, visualize and share data across the Hadoop stack.

Even though a Failed message is sometimes shown for the HUE server, it actually starts a while later. To start HUE independently, or to check whether it is actually running, use the command:

sudo service hue start

The command will either start the service if it is not already up or will show the running status of the service. 

3. Now we access HUE from the web browser on port 8888, using the default credentials:

Username: cloudera 

Password: cloudera

Logging in will take you to the HUE home page with the Quick Start tab active by default. Go through the quick start by following the wizard using the Next button.

Once you have gone through all the steps, check the ‘Skip Wizard Next Time’ option and click Done.


Being a web platform for analyzing and visualizing data, HUE provides a graphical interface to manage the Hadoop Distributed File System. Use the first icon at the top right of the title bar to go to the HDFS manager.

 


Running MapReduce job with Cloudera Quickstart

Once Cloudera Quickstart is running smoothly, it is time to move ahead and run a simple MapReduce job on Hadoop with Cloudera Quickstart. If you would like to understand exactly what MapReduce is, please check here: Understanding MapReduce with Hadoop.

The job counts, by value, all the words appearing in an example text file. The full code for the exercise can be accessed from the Folio3 AI Github repo.

1. We will use dummytext.txt file as input text, mapper.py as Python script for Map Phase and reducer.py as Python script for Reduce Phase.

dummytext.txt

Folio3 introduces ML.
Folio3 introduces BigData.
BigData facilitates ML.

mapper.py

"""mapper.py"""

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output)
        # tab-delimited words with default count 1
        print('%s\t%s' % (word, 1))

reducer.py

"""reducer.py"""

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this if-switch works because Hadoop sorts the map output
    # by key before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# output the last word if needed
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

Save these files in the mounted volume.

2. The first step to run the job with Hadoop is to make the input file available to HDFS. Achieve this by using command:

hdfs dfs -put /src/dummytext.txt /user/cloudera

You can view the file in HDFS by accessing HUE and going to the HDFS manager.

3. Finally, it is time to actually start the job using command:

sudo hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
 -file /src/mapper.py -mapper "python mapper.py" \
 -file /src/reducer.py -reducer "python reducer.py" \
 -input /user/cloudera/dummytext.txt -output /user/cloudera/wordcount

Seems like a lot is going on in that command! Let’s break it down and understand what it entails.

hadoop jar /path/to/streaming/jar
We will be using STDIN and STDOUT to pass data between the phases of the MapReduce job. To achieve this, we make use of the Hadoop Streaming API. In this part of the command we specify the path of the streaming jar in CDH (given your Cloudera version, the path to the streaming jar may differ).

-file /src/mapper.py -mapper "python mapper.py"
In this part, we provide the essential details for the mapper in the format: -file <local path> -mapper <command>.

-file /src/reducer.py -reducer "python reducer.py"
Next, the details for the reducer are given in the format: -file <local path> -reducer <command>.

-input /user/cloudera/dummytext.txt
This part specifies the path of the input file whose words we are counting by value, in the format: -input /path/to/input/file/in/hdfs.

-output /user/cloudera/wordcount
Finally, we specify where the resulting file will be saved, in the format: -output /path/to/output/directory/in/hdfs.

Once the job completes successfully, the logs will show the path of output directory.

It is important to note that we provided the executor statement for the mapper as:

-file /src/mapper.py -mapper "python mapper.py"

It is also possible to provide only the filename, without the leading python, like:

-file /src/mapper.py -mapper mapper.py

But for that, you need to include #!/usr/bin/python (or the path where your Python installation lies) at the top of all your Python scripts. The #! is called a shebang and allows your script to be executed like a standalone executable, without typing python in front of it.
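To see the shebang in action, here is a self-contained sketch: it writes a hypothetical standalone mapper to a temporary file, marks it executable (the equivalent of chmod +x), and runs it directly. The /usr/bin/env python3 shebang is an assumption about where Python lives on the host running this sketch; inside the quickstart container you would point it at the container's own Python.

```python
import os
import stat
import subprocess
import tempfile

# a hypothetical standalone mapper: the first line is the shebang
script = """#!/usr/bin/env python3
import sys
for line in sys.stdin:
    for word in line.split():
        print('%s\\t%s' % (word, 1))
"""

with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
    f.write(script)
    path = f.name

# the equivalent of: chmod +x mapper.py
os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)

# thanks to the shebang, the script runs without "python" in front
out = subprocess.run([path], input='Folio3 introduces ML.\n',
                     capture_output=True, text=True).stdout
print(out)
os.unlink(path)
```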

4. Go to HUE interface once again and check the HDFS manager. A new directory named ‘wordcount’ will appear.

Go to the ‘wordcount’ directory, a file named ‘part-00000’ will be there.

Open the file and you will see the final output.

And that’s it! In this blogpost we learnt how to set up Cloudera Quickstart and run a simple MapReduce job in it. Fiddle around with the code and get your hands dirty with Cloudera Quickstart. Happy programming!

Please feel free to reach out to us if you have any questions, or in case you need any help with the development, installation, integration, upgrading or customization of your Business Solutions.

We have expertise in Deep learning, Computer Vision, Predictive learning, CNN, HOG and NLP.

Connect with us for more information at Contact@184.169.241.188


Folio3 AI Meetup – Press Release

Folio3 AI and Lahore School of AI hosted a meetup in Lahore on Artificial Intelligence

PRESS RELEASE

[April 8th, 2019, Lahore, Pakistan] Folio3 AI and Lahore School of AI teamed up to host a meetup for AI enthusiasts hailing from enterprises, startups and academic institutions at the Lahore office of Folio3.

The AI meetup was the inaugural meet-up hosted by Folio3 AI to discuss the latest trends, challenges and opportunities in the realm of Artificial Intelligence, in collaboration with the Lahore Chapter of School of AI, which has conducted similar sessions in the past for awareness and community building within this field.

The venue for the AI Meetup was the new Lahore office of Folio3 in Gulberg III. The event took place on 5th April, 2019. AI enthusiasts from diverse backgrounds attended, ranging from students and professionals to professors and entrepreneurs.

Ahmed Hassan, who heads the Lahore office of Folio3, welcomed the audience in his opening speech, introduced the guest speakers, and spoke about Folio3 and its vision for innovation.

The introduction was followed by the key speaker sessions, the first delivered by Kshitiz Rimal, who joined us remotely from Nepal. Mr. Rimal is a Google Developers Expert and Head of Research at AID Nepal. He presented "Transfer Learning to Save the World" to the attendees and answered their queries, drawing on his vast experience of the subject.

Later on, Dr. Kashif Zafar, former Head of Department and Professor of the CS Department at FAST Lahore, contributed to the event with an informative presentation on "Evolution of AI and Industry 4.0," and also spoke about the processes and initiatives available to aspiring AI professionals.

Last but certainly not least, Muhammad Usman, currently a Data Scientist at IBM, presented on the significance of Big Data with his topic "Domesticating your Big Data," addressing the attendees with a new perspective on Big Data.

The event concluded with mementos presented to the speakers, refreshments for the attendees, and a lively Q/A session focusing on the need for innovation and future possibilities within AI.


About Lahore School of AI: It is a community of AI practitioners spread across the globe working in tandem to solve real-world problems. It is part of the global community "School of AI", whose mission is to offer a world-class AI education to anyone on Earth for free. https://www.facebook.com/groups/LahoreAICommunity/

About Folio3 AI: It is the innovation wing of Folio3 Software Inc., a Silicon Valley based software development and technology solution provider with a global presence in over 5 countries and a worldwide workforce of more than 250 professionals.

Folio3 AI has a team of dedicated Data Scientists and Consultants that have delivered end-to-end projects related to machine learning, natural language processing, computer vision and predictive analysis. https://www.folio3.ai


For Press Inquiries – Please Contact:

Bakhtiar Shah

Marketing Department – Folio3 AI

sabakhtiar@folio3.com

+1 (408) 365 4638

Understanding MapReduce with Hadoop

Mariam Jamal
Software Engineer

 

Owais Akbani
Senior Software Engineer


April 06, 2019

To understand the MapReduce algorithm, it is vital to understand the challenge it attempts to solve. With the rise of the digital age and the ability to capture and store data, there has been an explosion in the amount of data at our disposal. Businesses and corporations were intuitive enough to realize the true potential of this data for gaining insight into customer needs and making informed, predictive decisions; yet, within only a few years, managing this gigantic amount of data posed a serious challenge for organizations. This is where Big Data comes into the picture.

Big data refers to gigantic volumes of structured and unstructured data, and the ways of dealing with it to aid strategic business planning, reduce production costs, and enable smart decision making. However, with Big Data came the great challenge of capturing, storing, analyzing and sharing this data with traditional database servers. As a major breakthrough in processing immense data, Google came up with the MapReduce algorithm, inspired by the classic Divide and Conquer technique.

MapReduce Algorithm

MapReduce, when combined with the Hadoop Distributed File System, plays a crucial role in Big Data Analytics. It introduces a way of performing multiple operations on large volumes of data in parallel, in batch mode, using the ‘key-value’ pair as the basic unit of data for processing.

MapReduce algorithm involves two major components; Map and Reduce.

The Map component (aka Mapper) is responsible for processing large data split into equal-sized chunks of information, which are distributed among a number of nodes (computers) in such a way that the load is balanced, while faults and failures are managed by rollbacks.

The Reduce component (aka Reducer) comes into play once the distributed computation is completed and acts as an accumulator to aggregate the results as final output.

Hadoop MapReduce

Hadoop MapReduce is an implementation of the MapReduce algorithm by the Apache Hadoop project, used to run applications where data is processed in parallel, in batches, across multiple CPU nodes.

The entire process of MapReduce includes four stages.

 

1. Input Split

In the first phase, the input file is located and transformed for processing by the Mapper. The file gets split into fixed-size chunks on the Hadoop Distributed File System. The input file format decides how to split up the data, using a function called InputSplit. The intuition behind splitting data is simply that the time taken to process a split is always smaller than the time to process the whole dataset, and that it balances the load evenly across multiple nodes within the cluster.

2. Mapping

Once all the data has been transformed into an acceptable form, each input split is passed to a distinct instance of the mapper to perform computations that result in key-value pairs. All the nodes participating in the Hadoop cluster perform the same map computations on their respective local datasets simultaneously. Once mapping is completed, each node outputs a list of key-value pairs, which are written to the local disk of the respective node rather than HDFS. These outputs are then fed as inputs to the Reducer.

3. Shuffling and Sorting

Before the reducer runs, the intermediate results of the mappers are gathered together by a Partitioner to be shuffled and sorted, so as to prepare them for optimal processing by the reducer.

4. Reducing

For each key, reduce is called to perform its task. The reduce function is user-defined. The Reducer takes the intermediate shuffled output as input and aggregates these results into the desired result set. The output of the reduce stage is also a key-value pair, but it can be transformed according to application requirements using OutputFormat, a feature provided by Hadoop.

It is clear from the order of the stages that MapReduce is a sequential algorithm: the Reducer cannot start its operation until the Mapper has completed its execution. Despite being sequential and prone to I/O latency, MapReduce is thought of as the heart of Big Data Analytics owing to its parallelism and fault tolerance.
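The four stages above can be sketched in a few lines of plain Python. This is only a toy, single-process illustration of the data flow (using the sample text from the next section), not how Hadoop actually executes it:

```python
from itertools import groupby

lines = ["Folio3 introduces ML.", "Folio3 introduces BigData.", "BigData facilitates ML."]

# 1. Input split: here, each line acts as one split
# 2. Map: emit a (word, 1) key-value pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# 3. Shuffle & sort: bring identical keys together
mapped.sort(key=lambda kv: kv[0])

# 4. Reduce: aggregate the counts for each key
counts = {key: sum(c for _, c in group)
          for key, group in groupby(mapped, key=lambda kv: kv[0])}
print(counts)
```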

After getting familiar with the gist of the MapReduce algorithm, we will now move ahead and translate the Word Count example shown in the figure into Python code.

MapReduce in Python 

We aim to write a simple MapReduce program for Hadoop in Python that counts words by value in a given input file.

We will make use of the Hadoop Streaming API to pass data between the different phases of MapReduce through STDIN (Standard Input) and STDOUT (Standard Output).

1. First of all, we need to create an example input file.

Create a text file named dummytext.txt and copy this simple text into it:

Folio3 introduces ML.
Folio3 introduces BigData.
BigData facilitates ML.

2. Now, create mapper.py to be executed in the Map phase.

mapper.py will read data from standard input and print to standard output a list of tuples for each word occurring in the input file.

"""mapper.py"""

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # split the line into words
    words = line.split()

    for word in words:
        # write the results to STDOUT (standard output)
        # tab-delimited words with default count 1
        print('%s\t%s' % (word, 1))

3. Next, create a file named reducer.py to be executed in the Reduce phase. reducer.py will take the output of mapper.py as its input and sum the occurrences of each word into a final count.

"reducer.py"

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this if-switch works because Hadoop sorts the map output
    # by key before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# output the last word if needed
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

4. Make sure the two programs are executable by using the following commands:

> chmod +x mapper.py
> chmod +x reducer.py

You can find the full code in the Folio3 AI repository.

Running MapReduce Locally

> cat dummytext.txt | python mapper.py | sort -k1 | python reducer.py
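With the sample dummytext.txt above, that shell pipeline can also be simulated end-to-end in Python, which is a handy way to check the expected word counts. Note that Python's sort below is byte-wise; the shell's sort may order punctuation slightly differently depending on locale:

```python
lines = ["Folio3 introduces ML.", "Folio3 introduces BigData.", "BigData facilitates ML."]

# mapper.py: emit "word\t1" for every word
mapped = ['%s\t%s' % (w, 1) for line in lines for w in line.split()]

# sort -k1: order the pairs by key
mapped.sort()

# reducer.py: sum consecutive counts per word
result = []
current_word, current_count = None, 0
for pair in mapped:
    word, count = pair.split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            result.append('%s\t%s' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    result.append('%s\t%s' % (current_word, current_count))

print('\n'.join(result))
```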

Running MapReduce on Hadoop Cluster 

We assume that the default user created in Hadoop is f3user.

1. First, we copy the local dummy file to the Hadoop Distributed File System by running:

> hdfs dfs -put /src/dummytext.txt /user/f3user

2. Finally, we run our MapReduce job on the Hadoop cluster, leveraging the streaming API to support standard I/O.

 

> hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
  -file /src/mapper.py -mapper "python mapper.py" \
  -file /src/reducer.py -reducer "python reducer.py" \
  -input /user/f3user/dummytext.txt -output /user/f3user/wordcount

 

The job will take its input from ‘/user/f3user/dummytext.txt’ and write its output to ‘/user/f3user/wordcount’.

Running this job will produce the count of each word as output.

Congratulations, you just completed your first MapReduce application on Hadoop with Python!


How to install Spark (PySpark) on Windows

Mariam Jamal
Software Engineer

 

Owais Akbani
Senior Software Engineer

 

Feb 22, 2019

 

Apache Spark is a general-purpose cluster computing engine aimed mainly at distributed data processing. In this tutorial, we will walk you through the step-by-step process of setting up Apache Spark on Windows.

Spark supports a number of programming languages including Java, Python, Scala, and R. In this tutorial, we will set up Spark with Python Development Environment by making use of Spark Python API (PySpark) which exposes the Spark programming model to Python.

Required Tools and Technologies:

  • Python Development Environment
  • Apache Spark
  • Java Development Kit (Java 8)
  • Hadoop winutils.exe

Pointers for smooth installation:

- As of the writing of this blog, Spark is not compatible with Java version >= 9. Please ensure that you install Java 8 to avoid installation errors.
- Apache Spark version 2.4.0 has a reported bug that makes Spark incompatible with Windows, as it breaks worker.py. Any other version > 2.0 will do fine.
- Ensure Python 2.7 is not independently pre-installed if you are using a Python 3 Development Environment.

 

Steps to setup Spark with Python

 

1. Install Python Development Environment

Enthought Canopy is one of the Python Development Environments, just like Anaconda. If you are already using one, as long as it is a Python 3 or higher development environment, you are covered. (You can also install Python 3 manually and set up the environment variables yourself if you prefer not to use a development environment.)

Download the system-compatible version 2.1.9 for Windows from Enthought Canopy.

(If you have Python 2.7 pre-installed, it may conflict with the development environment's new Python 3 installation.)

Follow the installation wizard to complete the installation.

                                   

                                      

Once done, right click on the Canopy icon and select Properties. Inside the Compatibility tab, ensure Run as Administrator is checked.

                                                  

                                                              

 

2. Install Java Development Kit

Java 8 is a prerequisite for working with Apache Spark. Spark runs on top of Scala, and Scala requires the Java Virtual Machine to execute.

Download JDK 8 based on your system requirements and run the installer. Ensure you install Java to a path that doesn’t contain spaces; for the purpose of this blog, we change the default installation location to c:\jdk (earlier versions of Spark have trouble with spaces in the paths of program files). The same applies when the installer proceeds to install the JRE: change its default installation location to c:\jre.

                                    

Important note: if you have a previous installation of Java, please ensure that you remove it from your system path. Spark won’t work if Java lives in a directory whose path contains a space.

 

3. Install Apache Spark

                                      

Download the pre-built version of Apache Spark 2.3.0. The package comes as a tgz file; extract it using any utility such as WinRar.

Once unpacked, copy all the contents of the unpacked folder and paste them to a new location: c:\spark.

Now, inside the new directory c:\spark, go to the conf directory and rename the log4j.properties.template file to log4j.properties.

         

It is advised to change the log4j log level from ‘INFO’ to ‘ERROR’ to avoid unnecessary console clutter in spark-shell. To achieve this, open log4j.properties in an editor and replace ‘INFO’ with ‘ERROR’ on line number 19.

          

 

4. Install winutils.exe

Spark uses Hadoop internally for file system access. Even if you are not working with Hadoop (or are only using Spark for local development), Windows still needs Hadoop to initialize the “Hive” context, otherwise Java will throw java.io.IOException. This can be fixed by adding a dummy Hadoop installation that tricks Windows into believing that Hadoop is actually installed.

Download Hadoop 2.7 winutils.exe. Create a directory winutils with subdirectory bin and copy downloaded winutils.exe into it such that its path becomes: c:\winutils\bin\winutils.exe. 

Spark SQL supports Apache Hive using HiveContext. Apache Hive is a data warehouse software meant for analyzing and querying large datasets, which are principally stored on Hadoop Files using SQL-like queries. HiveContext is a specialized SQLContext to work with Hive in Spark. The next step is to change access permissions to c:\tmp\hive directory using winutils.exe.

- Create a tmp directory containing a hive subdirectory, if it does not already exist, so that its path becomes c:\tmp\hive.
- Run Command Prompt as administrator.
- Change directory to winutils\bin by executing: cd c:\winutils\bin
- Change the access permissions using winutils.exe: winutils.exe chmod 777 \tmp\hive

                           

 

5. Setting up Environment Variables

The final step is to set up some environment variables.

From start menu, go to Control Panel > System > Advanced System Settings and click on Environment variables button from the dialog box.

Under the user variables, add three new variables:

JAVA_HOME: c:\jdk

SPARK_HOME: c:\spark

HADOOP_HOME: c:\winutils

                         

Finally, edit the PATH user variable by adding two more paths in it:

%JAVA_HOME%\bin

%SPARK_HOME%\bin

                                      

That’s it. You are all ready to create your own Spark applications with Python on Windows.

 

Testing your Spark installation 

Before diving into Spark basics, let’s first test whether our Spark installation runs with Python by writing a simple Spark program to generate the squares of a given list of numbers.

- Open Enthought Canopy.
- Go to Tools menu and select Canopy Command Prompt. This will open a command line interface with all the environment variables and permissions set up by Enthought Canopy already to run Python.

                                                      

- Kick off the Spark interpreter with the command pyspark. At this point, there should be no ERROR messages showing on the console. Now, run the following code:
> nums = sc.parallelize([2, 4, 6, 8])
> nums.map(lambda x: x*x).collect()

         

The first command creates a resilient distributed dataset (RDD) by parallelizing the Python list given as input argument [2, 4, 6, 8] and stores it as ‘nums’. The second command uses the famous map function to transform the ‘nums’ RDD into a new RDD containing the list of squares. Finally, the ‘collect’ action is called on the new RDD to return a classic Python list. By executing the second command, you should see the resulting list of squared numbers:

[4, 16, 36, 64]
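As a quick sanity check of the semantics, the same transformation in plain Python (no Spark required) should give the same list:

```python
nums = [2, 4, 6, 8]

# map + collect on an RDD mirrors mapping over a local list
squares = [x * x for x in nums]
print(squares)  # [4, 16, 36, 64]
```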

Congratulations! You have successfully set up PySpark on Windows.


A look into ETA Problem using Regression in Python – Machine Learning


Farman Shah
Senior Software Engineer
Feb 22, 2019

The term "ETA" usually means "Estimated Time of Arrival", but in the technology realm it generally refers to the "Estimated Completion Time" of a computational process. In particular, this problem is specific to estimating the completion time of a batch of long-running scripts executing in parallel with each other.

Problem

A number of campaigns run together in parallel to process data and prepare some lists. The running time for each campaign varies from 10 minutes to maybe 10 hours or more, depending on the data. A batch of campaigns is considered complete when the execution of all campaigns has finished and the results have then been resorted to contain mutually exclusive data.

What we will do is provide a solution that can accurately estimate the completion time of campaigns based on some past data.

Data

We have very limited data available per campaign from past executions of these campaigns:

campaign_id | start_time          | end_time            | uses_recommendations | query_count
698 | 2018-07-31 10:20:02 | 2018-08-01 02:05:48 | 0 | 48147
698 | 2018-07-24 11:10:02 | 2018-07-25 05:42:37 | 0 | 45223
699 | 2018-07-31 11:05:03 | 2018-08-01 07:23:16 | 0 | 121898
699 | 2018-07-24 12:00:04 | 2018-07-25 10:21:48 | 0 | 116721
700 | 2018-07-31 10:50:03 | 2018-08-01 06:54:53 | 0 | 400325
700 | 2018-07-24 11:45:03 | 2018-07-25 09:53:03 | 0 | 353497
811 | 2018-07-31 15:20:03 | 2018-08-01 01:54:51 | 1 | 2601500
811 | 2018-07-24 11:00:02 | 2018-07-25 05:36:30 | 1 | 2609112

 

Feature Engineering / Preprocessing

These are the campaigns for which past data is available. A batch can consist of one or many campaigns from the above list. The uses_recommendations feature resulted from feature engineering: it helps the machine differentiate between campaigns that depend on an over-the-network API and those that do not, so that the model can keep a variable that implicitly caters for network lag.

Is this a Time Series Forecasting problem?

It could have been, but the analysis shows that the time of year doesn’t impact the data much, so this problem can be tackled as a regression problem instead of a time series forecasting problem.

How did it turn into a regression problem?

The absolute difference in seconds between the start time and end time of a campaign is a numeric variable that can be made the target variable. This is what we are going to estimate using regression.

Regression

Our input X is the data available, and output y is the time difference between start time and the end time. Now let’s import the dataset and start processing. 

import pandas as pd

dataset = pd.read_csv('batch_data.csv')

As can be seen from our data, there are no missing entries as of now; but since this may be an automated process, we had better handle NA values. The following command fills NA values with the column mean.

dataset = dataset.fillna(dataset.mean())

 

The output y is the difference between start time and end time of the campaign. Let’s set up our output variable.

start = pd.to_datetime(dataset['start_time'])
process_end = pd.to_datetime(dataset['end_time'])
# total_seconds() counts whole days too; .dt.seconds would drop them
y = (process_end - start).dt.total_seconds()
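One subtlety with durations is worth calling out: on a timedelta, .seconds only counts the sub-day remainder and silently drops whole days, while .total_seconds() returns the full duration. Since a campaign can run past 24 hours, the difference matters for the target variable. A quick stdlib check:

```python
from datetime import timedelta

delta = timedelta(days=1, hours=2, seconds=30)

# .seconds only covers the time-of-day component and drops whole days
print(delta.seconds)          # 7230
# .total_seconds() is the full duration
print(delta.total_seconds())  # 93630.0
```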

 

y is taken out of the dataset, and we won’t be needing the start_time and end_time columns in X.

X = dataset.drop(['start_time', 'end_time'], 1)

 

You might wonder how the machine would differentiate between the campaign_ids here in particular, or any such categorical data in general.

A quick recap if you already know the concept of One-Hot Encoding: it is a method that creates a toggle variable for each categorical instance in the data, so that the variable is 1 for the rows belonging to that categorical value and 0 for all others.

If you don't already know the One-Hot Encoding concept, it is highly recommended to read more about it online and then come back. We'll use OneHotEncoder from the sklearn library.
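As a quick illustration of what one-hot encoding produces, here is a toy sketch using pandas' get_dummies with hypothetical campaign ids (the values are illustrative, not from the real dataset):

```python
import pandas as pd

# Toy frame with a categorical campaign_id column (hypothetical ids)
df = pd.DataFrame({'campaign_id': [101, 102, 101, 103]})

# One indicator column per distinct campaign id;
# each row has exactly one 1 (the category it belongs to)
encoded = pd.get_dummies(df['campaign_id'], prefix='campaign')
print(encoded)
```

Each row toggles on exactly one of the campaign_101 / campaign_102 / campaign_103 columns, which is what the sklearn encoder below does for the real data.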

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the campaign_id column (column 0), pass the rest through
ct = ColumnTransformer(
    [('campaign', OneHotEncoder(handle_unknown='ignore'), [0])],
    remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X)

# Avoiding the Dummy Variable Trap is part of one hot encoding
X = X[:, 1:]

 

Now that the input data is ready, one final thing to do is to set aside some data to later test how well our algorithm performs. We are separating out 20% of the data at random.

 

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

 

After trying Linear Regression, SVR (Support Vector Regressor), XGBoost, and Random Forest on the data, it turned out that the Linear and SVR models did not fit the data well. The other finding was that XGBoost and Random Forest performed very close to each other on this data. With only a slight difference between the two, let's move forward with Random Forest.

 

# Fitting Random Forest Regression to the dataset
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=300, random_state=1,
                                  criterion='absolute_error')
regressor.fit(X_train, y_train)

 

The regression has been performed and a regressor has been fitted to our data. What follows next is checking how good the fit is.

 

Performance Measure

We'll use Root Mean Square Error (RMSE) as our measure of goodness. The lower the RMSE, the better the regression.
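As a reminder, RMSE is the square root of the mean of the squared differences between actual and predicted values. A tiny worked sketch with hypothetical durations (in seconds):

```python
import numpy as np

# Hypothetical actual vs predicted campaign durations, in seconds
y_true = np.array([600.0, 720.0, 540.0])
y_pred = np.array([630.0, 690.0, 570.0])

# RMSE = sqrt(mean((y_true - y_pred)^2))
# errors are -30, +30, -30 -> squared mean 900 -> sqrt 30
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)  # 30.0
```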

Let's ask our regressor to make predictions on the training data, i.e. the 80% of the total data it was trained on. This gives a glimpse of the training accuracy. Later we'll make predictions on the test data, the remaining 20%, which tells us how this regressor performs on unseen data.

If the performance on training data is very good but the performance on unseen data is poor, our model is overfitting. So, ideally, the performance on unseen data should be close to that on the training data.

from sklearn.metrics import mean_squared_error
from math import sqrt
training_predictions = regressor.predict(X_train)

training_mse = mean_squared_error(y_train, training_predictions)
training_rmse = sqrt(training_mse) / 60 # Divide by 60 to turn it into minutes

 

We got the training RMSE; print it to see by how many minutes, on average, the predictions deviate from the actual durations.

Now, let’s get the test RMSE.

 

test_predictions = regressor.predict(X_test)
test_mse = mean_squared_error(y_test, test_predictions)

test_rmse = sqrt(test_mse) / 60

 

Compare test_rmse with training_rmse to see how well the regression performs on seen versus unseen data.

What's next for you is to try fitting XGBoost, SVR, and any other regression models you think should fit this data well, and see how the performance of the different models compares.
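Such a comparison can be sketched with a simple loop over candidate models. The snippet below uses synthetic stand-in data and sklearn's GradientBoostingRegressor in place of XGBoost (an assumption, to keep the sketch dependency-free); in practice you would reuse the X and y prepared above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the batch data (replace with the real X, y)
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 4.0]) + rng.normal(0, 0.1, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    'Linear': LinearRegression(),
    'SVR': SVR(),
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=0),
    'GradientBoosting': GradientBoostingRegressor(random_state=0),
}

# Fit each model and record its test RMSE for a side-by-side comparison
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = np.sqrt(
        mean_squared_error(y_test, model.predict(X_test)))
    print(f'{name}: test RMSE = {results[name]:.3f}')
```

On the real batch data the ranking may differ from this toy example; the point is the shape of the comparison, not these particular numbers.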

Please feel free to reach out to us if you have any questions, or if you need help with the development, installation, integration, upgrade, or customization of your business solutions.

We have expertise in Deep learning, Computer Vision, Predictive learning, CNN, HOG and NLP.

Connect with us for more information at Contact@184.169.241.188


HOW TO USE MACHINE LEARNING FOR ANOMALY DETECTION

Farman Shah
Senior Software Engineer
Dec 19

Anomaly Detection is widely used in Machine Learning as a service to find abnormalities in a system. The idea is to model the system under a probability distribution; in our case, we will be dealing with the Normal (Gaussian) distribution. When a system works normally, its features reside under the middle of the normal curve; when it behaves abnormally, its features move to the far ends of the curve. The middle area shows the distribution of normal behavior, and the red areas at the far ends show the distribution of abnormal behavior. If you don't already know them, you should first read up on the concepts of Mean, Variance, and Standard Deviation.

In the next paragraphs I'll address how we create a distribution curve for our system. The system I work on generates a file daily, with a different number of lines each day, and there is no defined range for how many lines it should have. So my problem was how to auto-detect whether today's file had too few or too many lines.

 

Now that I had data for two weeks, I could find the mean (average) number of lines. On the distribution curve in Figure 1, this would be the horizontal middle of the curve, i.e. 0 on the x-axis. But in the list of line counts above, it can be seen that actual values deviate from the mean, which is 55728.722222 in this case; for example, 68336 is reasonably far from it. I had the valid data, but no false examples, that is, examples that would gauge the accuracy of my anomaly detection system. So I added a few examples that I consider anomalous, to see whether my system learns and predicts correctly.

 

It can be seen that our original data follows a pattern, whereas the false examples we added later are scattered away from it. Those are the outliers we want to catch! Let's do some calculations to get the mean and variance of our training dataset. What we do here is use the mean and variance to model a normal (Gaussian) distribution like the one shown in Figure 1, and then compute the F1 score to find a value (epsilon) that we can set as the best decision threshold between normal and abnormal values.
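The whole pipeline can be sketched in a few lines. The daily line counts below are synthetic stand-ins for the real file sizes, and the epsilon here is simply placed at three standard deviations rather than tuned via the F1 score as described above:

```python
import numpy as np

# Two weeks of daily line counts (synthetic stand-ins for the real data)
counts = np.array([55210, 56102, 54987, 55830, 56411, 55092, 55678,
                   56230, 54870, 55510, 56050, 55340, 55920, 55760],
                  dtype=float)

# Fit the Gaussian: mean and standard deviation of the training data
mu = counts.mean()
sigma = counts.std()

def gaussian_pdf(x, mu, sigma):
    """Probability density of x under N(mu, sigma^2)."""
    return (np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
            / (sigma * np.sqrt(2 * np.pi)))

# Decision threshold: here fixed at ~3 standard deviations from the mean;
# in the post's method, epsilon would be chosen by maximizing the F1 score
# on the labelled anomalous examples
epsilon = gaussian_pdf(mu + 3 * sigma, mu, sigma)

def is_anomaly(x):
    # A point is anomalous if its density falls below the threshold
    return gaussian_pdf(x, mu, sigma) < epsilon

print(is_anomaly(55500))   # a typical daily count -> False
print(is_anomaly(90000))   # far from the mean -> True
```

In practice you would sweep candidate epsilon values, score each against the labelled false examples with the F1 score, and keep the epsilon that separates normal from abnormal best.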
