How to use Machine Learning for Anomaly Detection

Anomaly Detection is a widely used application of Machine Learning for finding abnormalities in a system. The idea is to model the system’s behavior with a probability distribution. In our case, we will be dealing with the Normal (Gaussian) distribution. When a system works normally, its features fall under the middle of the normal curve. When it behaves abnormally, its features move to the far ends of the normal distribution curve.


Figure 1

In Figure 1 above, the middle area shows the distribution of normal behavior and the red areas on the far ends show the distribution of abnormal behavior. If you are not already familiar with them, you should read up on the concepts of Mean, Variance and Standard Deviation first. In the next paragraphs I’ll show how to create a distribution curve for our system.
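
To make the picture in Figure 1 concrete, here is a minimal sketch that evaluates a standard normal curve (mean 0, standard deviation 1) at a few points. The choice of points is mine and purely illustrative. It shows how quickly the density drops as you move toward the far ends of the curve.

```python
from scipy.stats import norm

# Density of a standard normal (mean 0, standard deviation 1) at a few points.
for x in (0.0, 1.0, 2.0, 3.0):
    print(f"{x:.0f} standard deviations from the mean -> density {norm.pdf(x):.4f}")

# Share of the distribution lying more than 3 standard deviations
# away from the mean, both tails together.
tail_mass = 2 * norm.sf(3.0)
print(f"mass beyond 3 standard deviations: {tail_mass:.4f}")
```

Values that land in those thin tails are the ones we will want to flag as abnormal.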

The system I work on generates a file daily, with a different number of lines in it every day. There is no defined range for the number of lines it should have. So my problem was how to auto-detect whether today’s file had too few or too many lines.

I collected the number of lines in the file for 14 days and used them as my training data (the complete train.csv is listed further down, so you can copy it from there).

Now that I had data for two weeks, I could find the mean (average) number of lines. On the distribution curve in Figure 1, this would be the middle of the curve horizontally, i.e. 0 on the x axis. But the actual line counts deviate from the mean, which is 55728.722222 for the complete train.csv (the 14 daily counts plus the anomalous examples added below). For example, take 68336, which is reasonably far from the mean.
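
If you want to check these numbers yourself, the sketch below computes the mean, variance and standard deviation with NumPy. The variable names are my own. Note that the 55728.72 figure corresponds to all 18 rows of the final train.csv, while the 14 normal days on their own average slightly lower.

```python
import numpy as np

# The 14 daily line counts (normal days only).
normal_counts = np.array([
    55991, 62434, 57203, 67110, 50265, 60579, 50517,
    43081, 47147, 68336, 59376, 50273, 46045, 59760,
])
# The same counts plus the four hand-made anomalous examples.
all_counts = np.append(normal_counts, [10000, 100000, 5000, 110000])

normal_mean = normal_counts.mean()
all_mean = all_counts.mean()
variance = normal_counts.var(ddof=1)  # sample variance (divides by n - 1)
std_dev = np.sqrt(variance)

print(f"mean of the 14 normal days: {normal_mean:.2f}")
print(f"mean of all 18 rows:        {all_mean:.2f}")
print(f"standard deviation (14 days): {std_dev:.2f}")

# How far is 68336 from the 14-day mean, measured in standard deviations?
z = (68336 - normal_mean) / std_dev
print(f"68336 sits {z:.2f} standard deviations above the mean")
```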

I had the valid data, but I had no false examples, that is, examples that would gauge the accuracy of my anomaly detection system. So I added a few examples that I consider anomalous, to see whether my system learns and predicts correctly. These go at the bottom of train.csv: 10000, 100000, 5000 and 110000.

Now that we have the examples, we can mark the target variable as correct/incorrect so that our model can learn which entries are correct and which are incorrect. Edit the file to add the “truths” column, with 0 for a normal entry and 1 for an anomalous one:

lines,truths
55991,0
62434,0
57203,0
67110,0
50265,0
60579,0
50517,0
43081,0
47147,0
68336,0
59376,0
50273,0
46045,0
59760,0
10000,1
100000,1
5000,1
110000,1
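
If you would rather generate the file than copy it by hand, the following sketch builds the same table with pandas and writes it to train.csv in the current directory. The list names are my own. The values and the 0/1 labels are exactly the ones listed above.

```python
import pandas as pd

# Line counts from the 14 normal days.
normal_counts = [
    55991, 62434, 57203, 67110, 50265, 60579, 50517,
    43081, 47147, 68336, 59376, 50273, 46045, 59760,
]
# Hand-made examples that we consider anomalous.
anomalous_counts = [10000, 100000, 5000, 110000]

data = pd.DataFrame({
    "lines": normal_counts + anomalous_counts,
    # 0 marks a normal entry, 1 marks an anomalous one.
    "truths": [0] * len(normal_counts) + [1] * len(anomalous_counts),
})

# index=False keeps the file to the two columns shown above.
data.to_csv("train.csv", index=False)
print(data.tail())
```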

The training data is ready. Let’s begin building our Anomaly Detection system. First we’ll import the libraries that we are going to need:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from scipy.stats import multivariate_normal
from sklearn.metrics import f1_score

Now, let’s load our training dataset using pandas and plot it to see what our data looks like:

# Load data
data = pd.read_csv('train.csv')
train_data = data['lines']
truth_data = data['truths']
# Let’s Plot the loaded data
plt.figure()
plt.ylabel("Line Count")
plt.plot(train_data, "bx")
plt.show()

This would plot Figure 2, with the row number on the x axis and the line count on the y axis.

In Figure 2, it can be seen that our original data follows a pattern, whereas the false examples we added later are scattered far away from it. Those are the outliers we want to catch!

Let’s do some calculations to get the mean and variance of our training dataset. We use the mean and variance to model a normal (Gaussian) distribution like the one shown in Figure 1. Then we calculate the F1 score for a range of candidate values to find the one (Epsilon) that works best as the decision threshold between our normal and abnormal values.

# Fit a Gaussian to the line counts
mu = np.mean(train_data, axis=0)
sigma = np.cov(train_data.T)  # variance of the line counts (.T is a no-op on a 1-D Series)
curve = multivariate_normal(mean=mu, cov=sigma)
curve_pdf = curve.pdf(train_data)  # Probability Density Function value of each example

# Finding the threshold: try evenly spaced candidates between the lowest
# and highest density and keep the one with the best F1 score
step = (max(curve_pdf) - min(curve_pdf)) / 10
epsilons = np.arange(min(curve_pdf), max(curve_pdf), step)
Ep = fscore = temp_fscore = 0
for e in epsilons:
    temp_fscore = f1_score(truth_data, curve_pdf < e)
    if fscore < temp_fscore:
        Ep = e
        fscore = temp_fscore
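
In case the F1 score is new to you, here is a small sketch of what f1_score measures. The two label arrays are made up for illustration and are not taken from train.csv. It computes precision and recall by hand and shows that their harmonic mean matches what scikit-learn returns.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels, for illustration only: 1 = anomalous, 0 = normal.
truths = np.array([0, 0, 0, 0, 1, 1, 1, 1])
flagged = np.array([0, 0, 1, 0, 1, 1, 1, 0])

true_pos = np.sum((flagged == 1) & (truths == 1))   # anomalies we caught
false_pos = np.sum((flagged == 1) & (truths == 0))  # normal entries flagged by mistake
false_neg = np.sum((flagged == 0) & (truths == 1))  # anomalies we missed

precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"sklearn f1_score: {f1_score(truths, flagged):.2f}")
```

A threshold that flags too much hurts precision, and one that flags too little hurts recall, so picking the candidate with the highest F1 score balances the two.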

Now that we have found the threshold, we can spot the anomalies by checking the probability density of each example under the normal curve from Figure 1. Epsilon (Ep) is the threshold such that if P(X) < Ep, then it’s highly likely that X is anomalous. We already had anomalous examples pre-defined in our training set. Let’s spot those outliers:

anomalies = np.asarray(np.where(curve_pdf < Ep))  # row positions whose density falls below the threshold
# And plot the anomalies
plt.figure()
plt.ylabel("Line Count")
plt.plot(train_data,"bx")
plt.plot(train_data.loc[anomalies[0]],"ro")
plt.show()

This would plot the following figure:

Bingo! All four anomalies are captured correctly and marked in red.
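
To come back to the original problem of checking today’s file, here is one way the fitted curve and threshold could be applied to a new line count. This is a sketch under my own assumptions: the helper name is_anomalous and the three test counts are hypothetical, and I use scipy.stats.norm with the standard deviation, which is equivalent to the one-dimensional multivariate_normal fit above.

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import f1_score

# Contents of train.csv: 14 normal days followed by 4 anomalous examples.
lines = np.array([
    55991, 62434, 57203, 67110, 50265, 60579, 50517,
    43081, 47147, 68336, 59376, 50273, 46045, 59760,
    10000, 100000, 5000, 110000,
])
truths = np.array([0] * 14 + [1] * 4)

mu = lines.mean()
sigma = lines.std(ddof=1)  # standard deviation; np.cov returns its square
pdf = norm.pdf(lines, loc=mu, scale=sigma)

# Same threshold search as above: keep the first candidate with the best F1 score.
step = (pdf.max() - pdf.min()) / 10
best_epsilon, best_f1 = 0.0, 0.0
for e in np.arange(pdf.min(), pdf.max(), step):
    score = f1_score(truths, (pdf < e).astype(int), zero_division=0)
    if score > best_f1:
        best_epsilon, best_f1 = e, score


def is_anomalous(line_count):
    """Hypothetical helper: True if the density of line_count is below the threshold."""
    return bool(norm.pdf(line_count, loc=mu, scale=sigma) < best_epsilon)


# Made-up counts for illustration.
for count in (56000, 3000, 120000):
    print(count, "anomalous" if is_anomalous(count) else "normal")
```

In a daily job you would count the lines of the new file and pass that number to the helper, retraining from time to time as more days of data come in.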


Posted by: Blog Post April 11, 2018
