Anomaly detection is widely used in machine learning as a service to find abnormalities in a system. The idea is to model the system’s behavior with a probability distribution. In our case, we will be dealing with the normal (Gaussian) distribution. When a system works normally, its features sit near the middle of the normal curve. When it behaves abnormally, its features move toward the far ends of the curve.
Figure 1
In Figure 1 above, the middle area shows the distribution of normal behavior, and the red areas at the far ends show the distribution of abnormal behavior. If you are not already familiar with them, you should read up on the concepts of mean, variance, and standard deviation first. In the next paragraphs I’ll address how we create a distribution curve for our system.
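To get a feel for the shape in Figure 1, here is a minimal sketch using Python’s standard-library `statistics.NormalDist` (my choice for a dependency-free illustration; it is not the library the article uses later). It assumes a standard normal curve with mean 0 and standard deviation 1, and shows that the density is highest at the mean and that almost all values fall within a few standard deviations of it.

```python
from statistics import NormalDist

# Assumption for illustration: a standard normal curve (mean 0, std dev 1).
curve = NormalDist(mu=0.0, sigma=1.0)

# Density is highest at the mean and shrinks toward the far ends.
density_at_mean = curve.pdf(0.0)
density_far_out = curve.pdf(3.0)

# Share of values expected within 1, 2 and 3 standard deviations of the mean.
within = {k: curve.cdf(k) - curve.cdf(-k) for k in (1, 2, 3)}

print(f"density at the mean:        {density_at_mean:.4f}")
print(f"density 3 std devs away:    {density_far_out:.4f}")
for k, share in within.items():
    print(f"share within {k} std dev(s): {share:.4f}")
```

Values that land in the thin tails are rare under normal behavior, which is what makes them candidates for anomalies.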
The system I work on generates a file daily, with a different number of lines in it every day. There is no defined range for the number of lines it should have. So my problem was how to auto-detect whether today’s file had too few or too many lines.
I collected the number of lines in the file for 14 days and created my training data (you can copy it and save it as train.csv):
lines
55991
62434
57203
67110
50265
60579
50517
43081
47147
68336
59376
50273
46045
59760
With two weeks of data, I could find the mean (average) number of lines. On the distribution curve in Figure 1, this is the horizontal middle of the curve, i.e. 0 on the x-axis. The individual line counts above deviate from the mean, which is about 55579.79 for these 14 days (it becomes 55728.722222 once the four extra examples added to train.csv further down are included). For example, 68336 is a fair distance from the mean.
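The arithmetic can be checked with a short sketch. It uses the standard-library `statistics` module (my choice, to keep the example dependency-free) and the variable names are mine. It computes the mean and sample standard deviation of the 14 daily counts, shows that the figure 55728.722222 corresponds to the mean over all 18 rows of the final train.csv, and expresses how far 68336 is from the 14-day mean in standard deviations.

```python
from statistics import mean, stdev

# The 14 daily line counts from train.csv.
daily_counts = [55991, 62434, 57203, 67110, 50265, 60579, 50517,
                43081, 47147, 68336, 59376, 50273, 46045, 59760]
# The four hand-made anomalous examples the article appends to train.csv.
extra_examples = [10000, 100000, 5000, 110000]

mu_14 = mean(daily_counts)
sigma_14 = stdev(daily_counts)  # sample standard deviation (divides by n - 1)
mu_18 = mean(daily_counts + extra_examples)

# How many standard deviations is 68336 from the 14-day mean?
z_68336 = (68336 - mu_14) / sigma_14

print(f"mean of the 14 daily counts:    {mu_14:.2f}")
print(f"std dev of the 14 daily counts: {sigma_14:.2f}")
print(f"mean of all 18 rows:            {mu_18:.6f}")
print(f"68336 is {z_68336:.2f} standard deviations above the 14-day mean")
```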
I had valid data, but no anomalous examples, that is, examples that would gauge the accuracy of my anomaly detection system. So I added a few values that I consider anomalous, to see whether my system learns and predicts correctly. Add these at the bottom of train.csv:
10000
100000
5000
110000
Now that we have the examples, we can mark the target variable so that our model can learn which entries are normal and which are anomalous. Edit the file to add the “truths” column, with 0 for a normal entry and 1 for an anomalous one:
lines,truths
55991,0
62434,0
57203,0
67110,0
50265,0
60579,0
50517,0
43081,0
47147,0
68336,0
59376,0
50273,0
46045,0
59760,0
10000,1
100000,1
5000,1
110000,1
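If you would rather generate the file than edit it by hand, the following sketch writes the same rows with the standard-library `csv` module. The helper name `write_training_csv` is hypothetical (mine, not the author’s), and the script assumes it may create train.csv in the current working directory.

```python
import csv
from pathlib import Path

# Values copied from the listing above.
normal_counts = [55991, 62434, 57203, 67110, 50265, 60579, 50517,
                 43081, 47147, 68336, 59376, 50273, 46045, 59760]
anomalous_counts = [10000, 100000, 5000, 110000]


def write_training_csv(path):
    """Write one row per count: truths is 0 for normal, 1 for anomalous."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["lines", "truths"])
        writer.writerows([count, 0] for count in normal_counts)
        writer.writerows([count, 1] for count in anomalous_counts)


csv_path = Path("train.csv")
write_training_csv(csv_path)

# Read the file back to confirm what was written.
with open(csv_path, newline="") as handle:
    rows = list(csv.DictReader(handle))
anomalous_rows = sum(row["truths"] == "1" for row in rows)
print(len(rows), "rows written,", anomalous_rows, "marked anomalous")
```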
The training data is ready. Let’s begin building our Anomaly Detection system. First we’ll import the libraries that we are going to need:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
from sklearn.metrics import f1_score
Now, let’s load our training dataset using pandas and plot it to see what our data looks like:
# Load data
data = pd.read_csv('train.csv')
train_data = data['lines']
truth_data = data['truths']

# Plot the loaded data
plt.figure()
plt.ylabel("Line Count")
plt.plot(train_data, "bx")
plt.show()
This would plot a figure like:
In Figure 2 above, we can see that our original data follows a pattern, whereas the anomalous examples we added later are scattered far away from it. Those are the outliers we want to catch!
Let’s do some calculations to get the mean and variance of our training dataset. We use the mean and variance to model a normal (Gaussian) distribution like the one shown in Figure 1. Then we calculate the F1 score for a range of candidate values to find the one (epsilon) that works best as the decision threshold between normal and abnormal values.
# Fit a Gaussian to the line counts
mu = np.mean(train_data, axis=0)
sigma = np.cov(train_data.T)  # variance of the line counts; .T takes the transpose
curve = multivariate_normal(mean=mu, cov=sigma)
curve_pdf = curve.pdf(train_data)  # probability density of each line count

# Finding the threshold: try 10 evenly spaced candidates, keep the best F1 score
step = (max(curve_pdf) - min(curve_pdf)) / 10
epsilons = np.arange(min(curve_pdf), max(curve_pdf), step)
Ep = fscore = temp_fscore = 0
for e in epsilons:
    # A row is predicted anomalous when its density falls below the candidate
    temp_fscore = f1_score(truth_data, curve_pdf < e)
    if fscore < temp_fscore:
        Ep = e
        fscore = temp_fscore
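To see why the F1 score is a sensible way to rank candidate thresholds, here is a small hand-rolled sketch. The function `f1_by_hand` is hypothetical (my name), and the labels are toy values made up for illustration, not the article’s data. For binary labels with 1 as the positive class it computes the same quantity as `f1_score`: the harmonic mean of precision and recall.

```python
def f1_by_hand(truths, flagged):
    """F1 score where label 1 (anomalous) is the positive class."""
    tp = sum(1 for t, f in zip(truths, flagged) if t == 1 and f)      # caught anomalies
    fp = sum(1 for t, f in zip(truths, flagged) if t == 0 and f)      # false alarms
    fn = sum(1 for t, f in zip(truths, flagged) if t == 1 and not f)  # missed anomalies
    if tp == 0:
        return 0.0  # nothing correctly flagged, so the score is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up for illustration): four normal rows, then two anomalies.
truths = [0, 0, 0, 0, 1, 1]

# What three different thresholds might flag.
too_strict = [False, False, False, False, True, False]  # misses one anomaly
too_loose = [False, False, True, True, True, True]      # two false alarms
just_right = [False, False, False, False, True, True]   # flags exactly the anomalies

for name, flagged in [("too strict", too_strict),
                      ("too loose", too_loose),
                      ("just right", just_right)]:
    print(f"{name}: F1 = {f1_by_hand(truths, flagged):.3f}")
```

A threshold that is too low misses anomalies and a threshold that is too high raises false alarms; both lower the F1 score, so the scan settles on a value in between.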
Now that we have found the threshold, we can spot the anomalies by checking how likely each example is under the normal curve from Figure 1. Epsilon (Ep) is the threshold such that if the probability density P(X) < Ep, then X is very likely anomalous. We already had anomalous examples predefined in our training set. Let’s spot those outliers:
# Row positions whose density falls below the threshold
anomalies = np.asarray(np.where(curve_pdf < Ep))

# And plot the anomalies
plt.figure()
plt.ylabel("Line Count")
plt.plot(train_data, "bx")
plt.plot(train_data.loc[anomalies[0]], "ro")
plt.show()
This would plot the following figure:
Bingo!! All the anomalies are captured correctly and spotted in red.
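To come back to the original problem of checking today’s file, here is a minimal, self-contained sketch of how the fitted curve and threshold could be applied to a new line count. It is not the author’s code: it swaps scipy and scikit-learn for the standard-library `statistics.NormalDist` and a small F1 helper so that it runs on its own, `is_anomalous` is a hypothetical function name, and the three counts checked at the end are made-up examples. The fit uses the mean and sample variance of all 18 rows and scans 10 evenly spaced thresholds, as in the steps above.

```python
from statistics import NormalDist, mean, stdev

# All 18 rows of train.csv: 14 observed days, then the 4 hand-made anomalies.
counts = [55991, 62434, 57203, 67110, 50265, 60579, 50517,
          43081, 47147, 68336, 59376, 50273, 46045, 59760,
          10000, 100000, 5000, 110000]
truths = [0] * 14 + [1] * 4

# Fit the curve on the mean and sample standard deviation of all rows.
curve = NormalDist(mu=mean(counts), sigma=stdev(counts))
densities = [curve.pdf(c) for c in counts]


def f1(truths, flagged):
    """F1 score with label 1 (anomalous) as the positive class."""
    tp = sum(1 for t, f in zip(truths, flagged) if t == 1 and f)
    if tp == 0:
        return 0.0
    precision = tp / sum(flagged)
    recall = tp / sum(truths)
    return 2 * precision * recall / (precision + recall)


# Scan 10 evenly spaced thresholds and keep the first one with the best F1.
low, high = min(densities), max(densities)
step = (high - low) / 10
epsilon, best_f1 = 0.0, 0.0
for i in range(10):
    candidate = low + i * step
    score = f1(truths, [d < candidate for d in densities])
    if score > best_f1:
        epsilon, best_f1 = candidate, score


def is_anomalous(line_count):
    """Flag a count whose density under the fitted curve is below epsilon."""
    return curve.pdf(line_count) < epsilon


# Hypothetical counts for "today's" file.
for todays_count in (56000, 3000, 120000):
    verdict = "anomalous" if is_anomalous(todays_count) else "normal"
    print(f"{todays_count} lines -> {verdict}")
```

Because the four hand-made anomalies are part of the fit, they widen the curve considerably; fitting on the normal days only and using the labeled examples purely for choosing epsilon would be a reasonable variation to try.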