Introduction
Anomaly or outlier detection has many applications, ranging from preventing credit card fraud to detecting computer network intrusions or uncovering medical problems. It can also be used to flag observations in your machine learning pipeline that are different from the data used to train the algorithms.
Seldon Core is a language- and toolkit-agnostic open source machine learning deployment platform that runs within a Kubernetes cluster. Seldon Core allows DevOps and data scientists to easily manage machine learning models in production. It comes with a suite of components, including outlier detection, which allow you to monitor your models. The anomaly detectors can be run as standalone models or can identify outliers in another model’s observations.
Anomaly Types
Before digging into specific algorithms used to spot anomalies, let us first take a step back and define what constitutes an outlier:
“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.” [1]
This means that normal data, or inliers, follow a different statistical generation process than outliers. We can further classify anomalies into three categories:
- Global or point anomalies
When the value of an observation falls significantly outside the range of the whole data set. For example a 30°C day on Antarctica.
- Contextual or conditional outliers
When the value of an observation deviates significantly from the rest of the data set in the same context. The value might however be seen as normal in a different context. Data with seasonal patterns are sensitive to contextual outliers. For example a high level of credit card spending on gifts can be seen as normal in the run up to Christmas but would be a contextual anomaly in mid-January.
- Collective outliers
When a number of typically sequential observations in a data set deviate significantly from the rest of the data but the individual values of each observation are not seen as an outlier. Figure 1 shows an ECG with a collective outlier (red). While the red values on the ECG are not anomalous on their own, the sequence clearly forms an outlier.
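To make the difference between a global and a contextual outlier concrete, here is a minimal Python sketch based on the credit card example. The spending figures, the `z_score` helper and the cut-off of 3 standard deviations are all illustrative assumptions and are not part of Seldon Core.

```python
import statistics

# Hypothetical daily gift spending, grouped by context (month).
history = {
    "december": [450, 520, 480, 510, 495],
    "january": [100, 120, 90, 110, 105],
}


def z_score(value, sample):
    """Number of sample standard deviations between value and the sample mean."""
    return (value - statistics.mean(sample)) / statistics.stdev(sample)


THRESHOLD = 3.0  # assumed cut-off, to be tuned per use case
new_spend = 500

# Global view: compare against all observations, ignoring the context.
all_values = history["december"] + history["january"]
z_global = z_score(new_spend, all_values)

# Contextual view: compare only against observations from the same month.
z_december = z_score(new_spend, history["december"])
z_january = z_score(new_spend, history["january"])

print(f"global z-score:   {z_global:.2f}")
print(f"december z-score: {z_december:.2f}")
print(f"january z-score:  {z_january:.2f}")
print("contextual outlier in january:", abs(z_january) > THRESHOLD)
```

The same amount stays within the threshold when compared against the whole data set or against December, and is only flagged once it is compared against other January observations.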
Algorithm Types
Each of the anomaly categories can occur in many shapes and forms like tabular data, images or time series. As a result, there is no off-the-shelf, one-size-fits-all solution to all your outlier problems, and there are tonnes of algorithms out there to spot anomalies. Broadly speaking, they fall under the following categories: [2]
- Statistical models
We fit a model to represent the normal data points. Observations are classified as anomalies if they do not fit this model. On a more granular level, we can distinguish probabilistic tests, which explicitly model a distribution and label observations as outliers if they have a low probability of being generated by this distribution. The Mahalanobis distance is an example of a probabilistic test. Depth-based (tree) models like Isolation Forests and deviation-based models like Auto-Encoders also fall under statistical models and will be discussed later.
- Proximity based approaches
These algorithms rely on the spatial proximity between observations. We can further distinguish between distance (e.g. kNN) and density (e.g. Local Outlier Factor) based models. Some proximity based approaches require data to be kept in memory, which is a significant drawback.
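As an illustration of a distance based approach, the sketch below scores every observation by the distance to its k-th nearest neighbour. The `knn_scores` helper, the toy points and the choice of `k=2` are assumptions made for this example. Note that the whole data set is held in memory, which is exactly the drawback mentioned above.

```python
import math


def knn_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores


# Toy data: four points close together and one far away.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (5.0, 5.0)]
scores = knn_scores(points, k=2)

# The observation with the largest score is the most anomalous one.
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print("scores:", [round(s, 3) for s in scores])
print("most anomalous point:", points[most_anomalous])
```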
The choice of algorithm and hyperparameters strongly depends on the use case. Knowledge about the underlying data distribution and domain expertise are often crucial to make an outlier detection system work in practice.
Challenges
Detecting outliers is a tricky task for various reasons. First of all, labelled data is typically scarce, making it an unsupervised or semi-supervised problem. We often don’t know what an outlier looks like before it happens. Or even worse: in the case of malicious attempts like credit card fraud, the anomalies are made to look as normal as possible so that the transactions go unnoticed and slip through the cracks. Detection also needs to happen in real time, before any money disappears from your account. Finally, the behaviour of outliers keeps evolving: fraudsters constantly fine-tune their tactics and hackers will try to find new ways to gain access to your computer network.
Outlier Detectors on Seldon Core
The open source Seldon Core project provides a number of outlier detectors suitable for different use cases. The detectors are part of the inference graph on Seldon Core and can be used either as standalone models or to spot anomalies in another model’s input features. The inference graph can then be deployed using REST or gRPC APIs.
In this section, we will dig a bit deeper into the 4 implemented algorithms before testing them in 2 case studies.
1. Sequence-to-Sequence LSTM
The seq2seq anomaly detection algorithm is suitable for time series data and predicts whether a sequence of input features is an outlier or not, depending on a threshold level set by the user. The algorithm first needs to be pretrained on a batch of, preferably, inliers. Seq2seq models (Figure 4) convert sequences from one domain into sequences in another domain. A typical example would be sentence translation between different languages. The model consists of 2 main building blocks: an encoder and a decoder. The encoder processes the input sequence and initializes the decoder. The decoder then makes sequential predictions for the output sequence. In our case, the decoder aims to reconstruct the input sequence. Both the encoder and decoder are typically implemented with recurrent or 1D convolutional neural networks. Our implementation uses a type of recurrent neural network called a long short-term memory (LSTM) network. The loss function to be minimized with stochastic gradient descent is the mean squared error between the input and output sequence, and is called the reconstruction error. If we train the seq2seq model with inliers, it will be able to replicate new inlier data well with a low reconstruction error. However, if outliers are fed to the seq2seq model, the reconstruction error becomes large and we can classify the sequence as an anomaly.
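The decision rule described above can be sketched in a few lines of Python. This is not the Seldon Core implementation: the hard-coded "reconstructed" sequences stand in for the output of a trained decoder, and the function names and the threshold value are assumptions chosen for illustration.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between an input sequence and its reconstruction."""
    if len(original) != len(reconstructed):
        raise ValueError("sequences must have the same length")
    squared = [(x - y) ** 2 for x, y in zip(original, reconstructed)]
    return sum(squared) / len(squared)


def is_outlier(original, reconstructed, threshold):
    """Flag the sequence when its reconstruction error exceeds the threshold."""
    return reconstruction_error(original, reconstructed) > threshold


THRESHOLD = 0.05  # assumed user-defined threshold

# Hypothetical decoder outputs: the inlier is reproduced closely,
# the outlier is reproduced poorly.
inlier_input = [0.1, 0.5, 0.9, 0.5, 0.1]
inlier_output = [0.12, 0.48, 0.88, 0.52, 0.1]
outlier_input = [0.1, 0.9, 0.1, 0.9, 0.1]
outlier_output = [0.2, 0.5, 0.6, 0.5, 0.2]

inlier_flag = is_outlier(inlier_input, inlier_output, THRESHOLD)
outlier_flag = is_outlier(outlier_input, outlier_output, THRESHOLD)
print("inlier flagged: ", inlier_flag)
print("outlier flagged:", outlier_flag)
```

In practice the threshold would be tuned on held-out data, since it directly trades off missed anomalies against false alarms.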
2. Variational Auto-Encoder (VAE)
Similar to the seq2seq model, an auto-encoder consists of an encoder and decoder. The encoder tries to find a compressed representation of the input data. The compressed data is then fed into the decoder, which aims to replicate the input data. The encoder and decoder are usually defined by neural networks. If the reconstruction error between the input and output of the decoder is above a user-defined threshold, the observation is flagged as an outlier.
A variational auto-encoder (Figure 5) adds constraints to the encoded representations of the input. The encodings are parameters of a probability distribution modeling the data. The decoder can then generate new data by sampling from the learned distribution.
We train the VAE with mainly inliers so it is able to replicate normal data well. The reconstruction error for anomalies on the other hand should be above the threshold. The implemented VAE is mainly suitable for tabular data.
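The sampling step of a VAE is commonly written as the mean plus the standard deviation times standard normal noise. The sketch below shows only that step in plain Python, assuming a 2-dimensional latent space; the `sample_latent` helper and the numeric values are hypothetical and would be outputs of the trained encoder in a real model.

```python
import random


def sample_latent(mean, std, rng):
    """Draw z = mean + std * eps with eps ~ N(0, 1), independently per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


rng = random.Random(0)  # seeded for reproducibility

# Hypothetical encoder outputs for a single observation.
mean = [0.3, -1.2]
std = [0.5, 0.1]

z = sample_latent(mean, std, rng)
print("sampled latent vector:", z)

# With zero standard deviation the sample collapses onto the mean.
z_deterministic = sample_latent(mean, [0.0, 0.0], rng)
print("zero-variance sample: ", z_deterministic)
```

The decoder would then map the sampled latent vector back to the input space, and the reconstruction error is compared to the threshold as before.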
3. Isolation Forest
Like the other algorithms, isolation forests predict whether an observation is an outlier based on a predefined threshold. The algorithm needs to be pre-trained first on a representative batch of data and is mainly suitable for tabular data.
Isolation forests isolate observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. The number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of random trees, is a measure of normality and is used to define an anomaly score. Outliers can typically be isolated quicker, leading to shorter paths.
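The isolation mechanism can be sketched in plain Python. This is a simplified illustration and not the detector shipped with Seldon Core: it recomputes random splits on the fly for one point instead of building and storing trees, it does not subsample the data, and it reports the raw average path length instead of a normalized anomaly score. All names and parameters are assumptions.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=50):
    """Number of random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))      # randomly select a feature
    values = [row[feature] for row in data]
    low, high = min(values), max(values)
    if low == high:
        return depth                         # no split possible on this feature
    split = rng.uniform(low, high)           # random split between min and max
    # Keep only the observations on the same side of the split as `point`.
    if point[feature] < split:
        same_side = [row for row in data if row[feature] < split]
    else:
        same_side = [row for row in data if row[feature] >= split]
    return path_length(point, same_side, rng, depth + 1, max_depth)


def average_path_length(point, data, n_trees=100, seed=0):
    """Average the path length over many independent random 'trees'."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Toy data: a Gaussian cluster, a point at its centre and one distant point.
rng = random.Random(42)
inliers = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
typical = (0.0, 0.0)
outlier = (8.0, 8.0)
data = inliers + [typical, outlier]

typical_avg = average_path_length(typical, data)
outlier_avg = average_path_length(outlier, data)
print(f"average path length, typical point: {typical_avg:.2f}")
print(f"average path length, outlier:       {outlier_avg:.2f}")
```

The distant point is separated from the rest after only a few splits, while the point at the centre of the cluster needs many more.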
4. Mahalanobis Distance
The Mahalanobis anomaly detector calculates an outlier score, which is a measure of distance from the center of the feature distribution. If this distance is high, the observation is likely an outlier. The algorithm is online, which means that it starts without knowledge about the distribution of the features and learns as requests arrive. Consequently, you should expect the output to be bad at the start and improve over time.
As new observations arrive, the algorithm will update the mean and sample covariance matrix of the dataset. Then we apply principal component analysis on the covariance matrix and project the new observations on the first N principal components. This allows us to calculate the Mahalanobis distance from the projections to the projected mean and flag the observation as an outlier if the distance is larger than a threshold level.
To limit the impact of outliers on the estimated mean and covariance matrix, we clip new requests. This can be particularly helpful when outliers arrive in sequences instead of uniformly distributed over time.
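The online part of the algorithm can be illustrated with a running update of the mean and sample covariance. The sketch below is restricted to 2 features so that the covariance matrix can be inverted by hand, and it leaves out the PCA projection and the clipping of requests described above. The class name, the warm-up rule and the threshold are assumptions for this example.

```python
import math
import random


class OnlineMahalanobis2D:
    """Running mean and sample covariance for 2 features, with distance scoring."""

    def __init__(self):
        self.n = 0
        self.mean = [0.0, 0.0]
        # Running sums of products of deviations (co-moments).
        self.m2 = [[0.0, 0.0], [0.0, 0.0]]

    def update(self, x):
        """Welford-style update with a single new observation."""
        self.n += 1
        delta = [x[i] - self.mean[i] for i in range(2)]    # deviation from old mean
        self.mean = [self.mean[i] + delta[i] / self.n for i in range(2)]
        delta2 = [x[i] - self.mean[i] for i in range(2)]   # deviation from new mean
        for i in range(2):
            for j in range(2):
                self.m2[i][j] += delta[i] * delta2[j]

    def distance(self, x):
        """Mahalanobis distance from x to the current mean."""
        if self.n < 3:
            return 0.0  # too little data to estimate a covariance
        cov = [[self.m2[i][j] / (self.n - 1) for j in range(2)] for i in range(2)]
        det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
        if det <= 0.0:
            return 0.0  # degenerate covariance, cannot invert
        dx, dy = x[0] - self.mean[0], x[1] - self.mean[1]
        # d^2 = [dx dy] cov^-1 [dx dy]^T, using the explicit 2x2 inverse.
        d2 = (cov[1][1] * dx * dx - 2 * cov[0][1] * dx * dy + cov[0][0] * dy * dy) / det
        return math.sqrt(max(d2, 0.0))


THRESHOLD = 3.0  # assumed threshold level

detector = OnlineMahalanobis2D()
rng = random.Random(1)
for _ in range(500):
    detector.update((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)))

near = detector.distance((0.1, -0.2))
far = detector.distance((6.0, -6.0))
print(f"distance of a typical point: {near:.2f}")
print(f"distance of a distant point: {far:.2f}, outlier: {far > THRESHOLD}")
```

Returning a distance of 0 during the first few requests is one possible way to handle the cold start; it simply reflects that the detector has not yet learned anything about the feature distribution.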
Now we can apply the detectors to some case studies. The Variational Auto-Encoder will be used to spot computer network intrusions while the seq2seq detector will flag anomalies in ECGs. Demos of these use cases deployed on Seldon Core, as well as examples for Isolation Forests and the Mahalanobis Distance, can be found in our open source repository.
Detecting computer network intrusions with VAEs
We apply an anomaly detector to identify computer network intrusions using TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol. Each connection is labeled as either normal, or as an attack.
There are 4 types of attacks in the dataset:
- DOS: denial-of-service, e.g. syn flood;
- R2L: unauthorized access from a remote machine, e.g. guessing password;
- U2R: unauthorized access to local superuser (root) privileges;
- probing: surveillance and other probing, e.g., port scanning.
There are 3 types of features:
- basic features of individual connections, e.g. duration of connection
- content features within a connection, e.g. number of failed login attempts
- traffic features within a 2 second window, e.g. number of connections to the same host as the current connection
The dataset contains about 5 million connection records. For this problem we only focus on the continuous (18 out of 41) features.
The Variational Auto-Encoder has 1 hidden layer in the encoder and decoder with 9 hidden units, while the mean, standard deviation and latent vector layers each consist of 2 units. The architecture is shown in Figure 7. In our investigation we found that the model performance is not that sensitive to the choice of hyperparameters on this problem.
The Variational Auto-Encoder is trained on 50,000 normal samples. During test time, we set the fraction of outliers at roughly 10%.
We deploy the model on Seldon Core and track the detector’s performance on a Grafana dashboard. The snapshot below shows the number of predicted outliers as well as true labels. The bottom left chart flags whether an observation is classified as an outlier while the chart on the right shows the reconstruction errors compared to the anomaly threshold.
The network intrusion dataset does not prove to be very difficult for the VAE: the standard performance metrics tracked on the dashboard are close to 1 when testing on unseen examples.
Using a seq2seq model to detect anomalies in ECGs
In this case study we use an outlier detector to spot anomalies in electrocardiograms (ECGs). The dataset “ECG5000” contains 5000 ECGs, originally obtained from Physionet under the name “BIDMC Congestive Heart Failure Database (chfdb)”, record “chf07”. The data has been pre-processed in 2 steps: first each heartbeat is extracted, and then each beat is made equal length via interpolation. The data is labeled and contains 5 classes. The first class, which contains almost 60% of the observations, is seen as “normal” while the others are outliers. The seq2seq algorithm with 1 LSTM unit in both the encoder and decoder is trained on some heartbeats from the first class and needs to flag the other classes as anomalies. Figure 10 shows an example ECG for each of the classes.
The plot below shows a typical prediction (red line) of an inlier (class 1) ECG compared to the original (blue line) after training the seq2seq model.
On the other hand, the model is not good at fitting ECGs from the other classes, as illustrated in the following chart:
The predictions in the above charts are made on ECGs the model has not seen before. The differences in scale are due to the sigmoid output layer and do not affect the prediction accuracy.
Similar to the previous use case, we deploy the model on Seldon Core and track it on a Grafana dashboard.
The model is trained on approximately 2600 ECGs. Despite a lack of hyperparameter fine-tuning, the model performs well on the previously unseen ECGs, with the performance metrics around 0.9 in a sample containing 45% anomalies. Try the demo out yourself in our open source repository!
Roadmap
Going forward, we would like to expand our outlier detector offering on Seldon Core across different data types. For time series, convolutional seq2seq models and algorithms taking seasonality more explicitly into account like the Seasonal Hybrid ESD are interesting options. To spot outliers in images, convolutional auto-encoders will be investigated while more baselines for tabular data like Local Outlier Factors or One-Class Neural Networks will be considered as well. The strengths of different models can also be leveraged by ensembling models on Seldon Core using combiners, a predefined type of predictive unit.
Outlier detection is part of a wider model monitoring effort. Once anomalies are identified, we want to find the cause using model explainers. Another application would be to alert the user that model retraining is needed if there are too many outliers, which can be a sign of concept drift.
References
[1] Hawkins, D. (1980), Identification of outliers. Chapman and Hall, London.
[2] Kriegel, Kröger, Zimek. Outlier Detection Techniques. 2010 SIAM International Conference on Data Mining.
[3] https://github.com/farizrahman4u/seq2seq