HOW TO USE MACHINE LEARNING FOR ANOMALY DETECTION

Farman Shah
Senior Software Engineer
Dec 19

Anomaly detection is a widely used application of machine learning for finding abnormalities in a system. The idea is to model the system's behavior with a probability distribution. In our case, we will be dealing with the Normal (Gaussian) distribution. When a system works normally, its features fall under the middle of the normal curve. When it behaves abnormally, its features move to the far ends of the curve. On the curve, the middle area shows the distribution of normal behavior, and the red areas on the far ends show the distribution of abnormal behavior. If you are not already familiar with them, you should read up on the concepts of mean, variance and standard deviation first.

In the next paragraphs I'll address how to create a distribution curve for our system. The system I work on generates a file every day, with a different number of lines each day. There is no defined range for the number of lines it should have. So my problem was how to automatically detect whether today's file had too few or too many lines.

Now that I had two weeks of data, I could find the mean (average) number of lines. On the distribution curve in Figure 1, this would be the horizontal middle of the curve, i.e. 0 on the x axis. But in the list of line counts above, it can be seen that actual values deviate from the mean, which is 55728.722222 in this case. For example, take 68336, which is reasonably far from the mean. I had the valid data, but I had no false examples, that is, examples that would gauge the accuracy of my anomaly detection system. So I added a few examples that I consider anomalous, to see whether my system learns and predicts correctly.
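
To make the idea of "deviation from the mean" concrete, here is a minimal sketch in Python. The line counts below are hypothetical values chosen for illustration, not the data from my system, and z_score is a helper name introduced only for this example. It measures how many standard deviations a given line count sits from the mean.

```python
import statistics

# Hypothetical daily line counts, for illustration only (not the real data).
line_counts = [52000, 54500, 56000, 57500, 55000, 58000, 53500, 56500]

# Population mean and standard deviation of the "normal" days.
mean = statistics.mean(line_counts)
std_dev = statistics.pstdev(line_counts)


def z_score(value, mean, std_dev):
    """Number of standard deviations between value and the mean."""
    return (value - mean) / std_dev


# A count close to the mean gives a small z-score; a distant one gives a large one.
for count in (55500, 68336):
    print(count, round(z_score(count, mean, std_dev), 2))
```

A value with a z-score near 0 sits in the middle of the curve, while a large positive or negative z-score places it toward the far ends.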

It can be seen that our original data follows a pattern, whereas the false examples we added later are scattered far from it. Those are the outliers we want to catch! Let's do some calculations to get the mean and variance of our training dataset. We use the mean and variance to model a normal (Gaussian) distribution like the one shown in Figure 1. Then we calculate the F1 score to find a value (epsilon) that we can set as the best decision threshold between our normal and abnormal values.
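
The steps described here can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the exact code behind my system: the training and validation values are hypothetical, and the function names (gaussian_pdf, f1_score, select_epsilon) as well as the number of threshold steps are choices made only for this example. A value is flagged as anomalous when its probability density falls below epsilon.

```python
import math
import statistics


def gaussian_pdf(x, mean, variance):
    """Density of the normal distribution N(mean, variance) at x."""
    exponent = -((x - mean) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)


def f1_score(predicted, actual):
    """F1 score for the anomalous class; 0.0 when there are no true positives."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_epsilon(densities, labels, steps=1000):
    """Scan candidate thresholds and keep the first one with the best F1."""
    low, high = min(densities), max(densities)
    best_eps, best_f1 = low, 0.0
    for i in range(1, steps + 1):
        eps = low + (high - low) * i / steps
        predicted = [d < eps for d in densities]  # low density => anomaly
        score = f1_score(predicted, labels)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1


# Hypothetical data: normal days for training, plus a labeled validation set
# that mixes normal days with hand-made anomalous examples (True = anomaly).
train = [52000, 54500, 56000, 57500, 55000, 58000, 53500, 56500]
validation = [55500, 54000, 57000, 12000, 98000, 56200]
labels = [False, False, False, True, True, False]

mean = statistics.mean(train)
variance = statistics.pvariance(train)
densities = [gaussian_pdf(x, mean, variance) for x in validation]

epsilon, best_f1 = select_epsilon(densities, labels)
flagged = [x for x, d in zip(validation, densities) if d < epsilon]
print("epsilon:", epsilon, "F1:", best_f1, "flagged:", flagged)
```

Once epsilon is chosen, checking a new day's file is just a matter of computing its density with the same mean and variance and comparing it against epsilon.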

Motasim Rasikh
Dec 03, 2018

Background

A while back, Kaggle launched a very interesting landmark retrieval challenge. This challenge had a large dataset of landmark images. Given a query image, the program should find similar landmark images in the dataset. For example, when an image of the Alhambra Palace in Spain is given as the query image, the program should find and return other images of the same landmark from the dataset. For further details please visit Kaggle.

Motivation

Google's Inception, popularly known as GoogLeNet, is a Deep Learning Convolutional Neural Network (CNN) architecture. It was originally designed for the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). It has been battle-tested, delivering world-class results in that challenge.

Pre-requisites

We are going to use Python 3.6 and TensorFlow. Once Python is installed, make sure it is accessible from the command prompt. If it is not, we need to add it to the system path.

We also need to install TensorFlow. The installation guide for TensorFlow can be found here.

Preparing Training set

Kaggle has provided an index.csv file containing images of around 1600 different landmarks, which can be found here. For simplicity, we are going to hand-pick around 20 landmarks from this list of 1600. This idea can be extended to Kaggle’s complete dataset as well. Our simplified version of the training dataset can be found here. It’s a CSV file which contains URLs of images of different landmarks. Let’s first download the images using the CSV file provided above. We have provided a Python script which helps with downloading this large set of images. This script can be found here. To run this script, we need to install a dependency via pip.

> pip install tqdm

After installing tqdm, we need to run this script with the following arguments.
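
For readers who want to see what such a download step involves, here is a rough sketch using only the Python standard library. It is not the script linked above: the column names "landmark" and "url", the function names, and the file naming scheme are all assumptions made for illustration. The one thing it deliberately mirrors is the folder layout, since retrain.py expects one sub-folder per label inside the image directory.

```python
import csv
import io
import os
import urllib.request


def read_image_rows(csv_text):
    """Parse CSV text into (landmark, url) pairs, skipping rows without a URL.

    Assumes hypothetical column names "landmark" and "url".
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        url = (row.get("url") or "").strip()
        if url and url.lower() != "none":
            rows.append((row["landmark"].strip(), url))
    return rows


def target_path(output_dir, landmark, index):
    """Build a path with one sub-folder per landmark (one folder per label)."""
    folder = landmark.replace(" ", "_")
    return os.path.join(output_dir, folder, "{}.jpg".format(index))


def download_all(rows, output_dir, timeout=10):
    """Download every image; failed downloads are counted and skipped."""
    failed = 0
    for index, (landmark, url) in enumerate(rows):
        path = target_path(output_dir, landmark, index)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                data = response.read()
            with open(path, "wb") as out:
                out.write(data)
        except (OSError, ValueError):
            # Dead links and malformed URLs are common in large image lists.
            failed += 1
    return failed
```

Calling download_all(read_image_rows(csv_text), "./training_images/") would fill the training folder one landmark at a time. A progress bar library such as tqdm can be wrapped around the loop to track long downloads.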

Training Inception Model

We need to install TensorFlow Hub. TensorFlow Hub is a library for reusable parts of machine learning models. We will train our Inception model on top of a pre-trained module from TensorFlow Hub.

> pip install tensorflow-hub

> python ./retrain.py --image_dir ./training_images/ --output_graph ./retrained_graph.pb --intermediate_output_graphs_dir ./intermediate_graph/ --output_labels ./retrained_labels.txt --how_many_training_steps 500 --summaries_dir ./summaries/ --bottleneck_dir ./bottleneck_data/ --final_tensor_name "final_result" --saved_model_dir ./model/

This command will take approximately 30 minutes to run, depending on system configuration. We are running it with a low value of how_many_training_steps for quick training. For production purposes, increase it to 4000 or above. Once this command finishes, our Inception model is ready for testing.

Testing Inception Trained Model

Our testing images can be found here. These pictures are slightly different from the ones we trained our model on, and we have to test how closely our Inception model can recognize the resemblance. Let’s download a script from the TensorFlow examples which will feed our test image to our trained Inception model and predict the result. This script can be found here. Create a folder called “test_images”, put all testing images there, and run the following command.

> python.exe .\label_image.py --graph .\retrained_graph.pb --labels .\retrained_labels.txt --input_layer "Placeholder" --output_layer "final_result" --image .\test_images\1.jpg

It will give the result as probabilities showing how likely the given image is to match each of the dataset categories. Below is an example result.

Alhambra Spain 0.73449653

Atlantis Hotel 0.053302728

Karnak 0.052239873

Peace Palace 0.03108669

Flinders Street Station 0.020717677
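
The output above lists the best-matching labels first. A minimal sketch of that ranking step, assuming the model returns one score per label, is shown below. The top_k helper is a name introduced for this example and is not part of the TensorFlow script; the scores are the ones from the example result.

```python
def top_k(labels, scores, k=5):
    """Return the k (label, score) pairs with the highest scores, best first."""
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Labels in arbitrary order, paired with the scores from the example result.
labels = ["Karnak", "Alhambra Spain", "Peace Palace",
          "Atlantis Hotel", "Flinders Street Station"]
scores = [0.052239873, 0.73449653, 0.03108669, 0.053302728, 0.020717677]

for label, score in top_k(labels, scores, k=3):
    print(label, score)
```

The first entry is the model's best guess, and the gap between it and the second entry gives a rough sense of how confident the prediction is.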

Conclusion

When tested on a dataset containing pictures of 40-50 landmarks, with every landmark having sufficient images to train on, Inception gives very high accuracy. You can use it off the shelf just by training it on a dataset of landmark images.
