Outlier detection is a key consideration in the development and deployment of machine learning algorithms. Models are often developed and deployed to perform outlier detection for organisations that rely on large datasets to function. Economic modelling, financial forecasting, scientific research, and ecommerce campaigns are some of the varied areas in which machine learning-driven outlier detection is used.
Identifying and dealing with outliers is an integral part of working with data, and machine learning is no different. Algorithm development usually relies on huge arrays of training data to achieve a high level of accuracy. Once deployed, models will process huge amounts of data, providing insights into trends and patterns. In this data-rich environment, organisations can expect to have to deal with outlier data. Outliers can skew trends and have a serious impact on the accuracy of models. The presence of outliers can be a sign of concept drift, so ongoing outlier analysis in machine learning is needed.
Machine learning models learn from data to understand the trends and relationships between data points. Outliers can skew results, and anomalies in training data can impact overall model effectiveness. Outlier detection is a key tool in safeguarding data quality, as anomalous data and errors can be analysed and removed once identified.
Outlier detection is an important part of each stage of the machine learning process. Accurate data is integral during the development and training of algorithms, and outlier detection is performed after deployment to maintain the effectiveness of models. This guide explores the basics of outlier detection techniques in machine learning, and how they can be applied to identify different types of outlier.
What is an outlier in machine learning?
An outlier is an individual point of data that is distant from other points in the dataset. It is an anomaly in the dataset that may be caused by a range of errors in capturing, processing or manipulating data. Outliers can skew overall data trends, so outlier detection methods are an important part of statistics. Outliers will be a consideration for any area that uses data to make decisions. If an organisation is gaining insight from data, outliers are a real risk.
Outlier detection is particularly important within machine learning. Models are trained on huge arrays of training data. The model understands the relationship between data points to help predict future events or categorise live data. Outliers in the training data may skew the model, lowering its accuracy and overall effectiveness. Outlier analysis and resolution can lengthen the training time too. Outliers can be present in any data or machine learning use case, whether that’s financial modelling or business performance analysis.
What are the different types of outliers?
There are three main types of outliers relevant to machine learning models. Each type differs by how the anomalous data can be observed and what makes the data point stand apart from the rest of the data set. Types are an important consideration for outlier analysis as each has a different pattern to identify.
The three main types of outliers are:
- Point outliers
- Contextual outliers
- Collective outliers
What is a point outlier?
A point outlier is an individual data point that sits outside the range of the rest of the dataset. There may be a clear pattern, trend or grouping within the dataset, and a point outlier will differ significantly from it. Point outliers can often be attributed to an error in the measurement or input of the data.
For example, a point outlier may occur in health sector data if a mistake is made when recording a measurement within patient data. Missing a digit when recording the height of a patient would cause a noticeable point outlier within the dataset. This type of outlier can be relatively straightforward to identify through visual means. Point outliers are visible if the dataset is plotted across two or three dimensions, as the outlying data point would sit far apart from the rest of the dataset.
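Point outliers like the missing-digit height can also be caught with a simple numerical rule. The following is a minimal sketch that uses Tukey's interquartile range (IQR) fences, a common statistical rule of thumb rather than a method named in this guide. The function name `iqr_outliers` and the patient heights are hypothetical and used purely for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying outside Tukey's fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical patient heights in centimetres.
# 17 stands in for a height recorded with a missing digit.
heights_cm = [162, 171, 168, 175, 180, 17, 166, 173, 178, 169]

flagged = iqr_outliers(heights_cm)
print(flagged)  # [17]
```

The multiplier `k=1.5` is a conventional default. A larger value makes the rule more tolerant of extreme but genuine readings.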
What is a contextual outlier?
A contextual outlier is a data point that is significantly different from the rest of the dataset, but only within a specific context. The context of a dataset may change with seasonal weather, economic fluctuations, customer behaviour around key holidays, or even the time of day. A contextual outlier only becomes noticeable when this context is taken into account. For this reason, a contextual outlier may seem like a normal data point in other contexts.
For example, consider a dataset of UK temperatures over time, encompassing different years and seasons. A temperature reading below zero degrees Celsius at noon could be seen as normal during the winter. But this same reading would be deemed a contextual outlier if recorded at the height of summer during a heat wave. The data is contextualised within wider trends that are impacting the dataset.
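One way to make the temperature example concrete is to score each reading only against other readings from the same context. The sketch below groups readings by season and applies a modified z-score based on the median and the median absolute deviation (MAD). The function name `contextual_outliers`, the threshold of 3.5 and the temperature values are all assumptions chosen for illustration.

```python
import statistics
from collections import defaultdict

def contextual_outliers(readings, threshold=3.5):
    """Flag (context, value) pairs that are unusual within their own context.

    Each value is scored with a modified z-score computed from the median
    and median absolute deviation (MAD) of its context group.
    """
    by_context = defaultdict(list)
    for context, value in readings:
        by_context[context].append(value)

    flagged = []
    for context, values in by_context.items():
        median = statistics.median(values)
        mad = statistics.median(abs(v - median) for v in values)
        if mad == 0:
            continue  # no spread in this context, so it cannot be scored
        for v in values:
            score = 0.6745 * abs(v - median) / mad
            if score > threshold:
                flagged.append((context, v))
    return flagged

# Hypothetical noon temperatures in degrees Celsius, labelled by season.
readings = (
    [("winter", t) for t in [-1, 2, 4, 0, 3, -2, 5]]
    + [("summer", t) for t in [21, 24, 19, 23, -1, 22, 25]]
)

flagged = contextual_outliers(readings)
print(flagged)  # [('summer', -1)]
```

The same reading of -1 appears in both seasons, but only the summer reading is flagged, because it is compared against summer data alone.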
What is a collective outlier?
A collective outlier is a series of data points that differs significantly from the trends in the rest of the dataset. The individual data points within a collective outlier may not look like point outliers or contextual outliers. It’s only when the data points are considered as a collection that anomalous patterns are observed. For this reason, collective outliers can be the hardest type of outlier to identify. Detecting collective outliers is an integral part of monitoring for concept drift in machine learning, as they can show that a sequence of data has shifted away from the behaviour the model expects.
For example, consider a time series plotting subscribers to an email marketing list, which normally shows seasonal or daily fluctuations as users subscribe and unsubscribe. A collective outlier could be flagged if the number of subscribed users stayed entirely static for many weeks with no fluctuation. Individual users unsubscribing and new users subscribing is a normal occurrence, so a static count would be flagged as an outlier.
Taken in isolation, each data point is within the expected boundaries of the data so would not be flagged as a contextual or point outlier. But when taken as a series, the data’s behaviour is flagged as anomalous. Once a collective outlier is identified, steps can be taken to investigate any systematic errors in the process.
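The static subscriber count can be sketched as a search for unusually long runs of unchanged values. This is a minimal illustration, not a general collective outlier detector. The function name `flat_runs`, the minimum run length and the weekly counts are hypothetical.

```python
def flat_runs(series, min_length=5, tolerance=0.0):
    """Return (start, end) index pairs for runs in which consecutive values
    change by no more than `tolerance` for at least `min_length` points."""
    runs = []
    start = 0
    for i in range(1, len(series) + 1):
        # Close the current run at the end of the series or when the value moves.
        if i == len(series) or abs(series[i] - series[i - 1]) > tolerance:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = i
    return runs

# Hypothetical weekly subscriber counts. No single value is unusual,
# but the count stays frozen at 1015 from index 3 to index 8.
weekly_subscribers = [1000, 1012, 1007, 1015, 1015, 1015, 1015, 1015, 1015, 1021, 1018]

print(flat_runs(weekly_subscribers))  # [(3, 8)]
```

Each value of 1015 sits comfortably within the range of the series, so a point-based rule would not flag it. Only the length of the run marks the behaviour as anomalous.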
What is outlier detection in machine learning?
Outlier detection is an important consideration in both the development of algorithms and the deployment of machine learning models. The detection of outliers in training datasets is an integral part of ensuring high quality data. Machine learning algorithms rely on large arrays of accurate data to learn trends and spot patterns. High quality training data means a more accurate machine learning model in most cases.
For supervised machine learning models, a data scientist may identify and remove outliers when preparing and labelling training data. For unsupervised machine learning models used to categorise unlabelled datasets, outliers may be identified later on in the process. This can add extra time and resources to the machine learning development process.
Outlier detection also plays an important role in the ongoing monitoring and maintenance of machine learning models. When deployed, machine learning models need to be regularly monitored to ensure ongoing accuracy. Recurring outliers or an increase in anomalous data within predictive models can be a sign of concept drift. The challenge is deciding whether an outlier is a sign of a systemic issue with the model. If this is the case then the model can be recalibrated or retrained to be more effective.
Beyond training and monitoring, outlier detection is often a major part of a machine learning algorithm’s intended use. Algorithms trained to categorise data or identify trends can also flag anomalies through outlier detection. A key example would be the use of machine learning in finance to identify fraudulent purchases. The model will identify activity which is outside of normal account behaviour through outlier detection methods. This information can be used to trigger an account freeze and escalate the issue for human intervention.
Two types of outlier detection methods
Machine learning algorithms exist for a range of different file types and tasks. Whether the model is trained to categorise images into clusters or predict marketing budget based on historic campaigns, the type of data and potential outlier will shift depending on the model. However, there are two broad outlier detection methods that help to explain the basics of how outliers are observed and classified.
The two main types of outlier detection methods are:
- Using distance and density of data points for outlier detection.
- Building a model to predict data point distribution and highlighting outliers which don’t meet a user-defined threshold.
Distance and density
This outlier detection technique uses the proximity of data points within two or three dimensions to identify outliers. This will usually be a contextual or point outlier as explained earlier in this guide. Different outlier detection models will use different approaches. Distance flags a data point as an outlier if it is mapped beyond a certain threshold away from other data points. The density approach groups together data points as clusters, using the distance between each cluster point to set the boundary of the grouping. Outliers will be data points that exist outside of the cluster, beyond the user-defined threshold.
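As a rough illustration of the distance approach, the sketch below scores each point by the distance to its k-th nearest neighbour and flags points whose score exceeds a user-defined threshold. Density-based methods build on the same neighbourhood idea. The function name `knn_distance_scores`, the sample points and the threshold are assumptions made for this example.

```python
import math

def knn_distance_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbour.

    Points in a tight cluster receive small scores, while isolated
    points receive large ones.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical two-dimensional data: a tight cluster plus one distant point.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (1.1, 1.3), (0.8, 0.9), (6.0, 6.5)]
threshold = 2.0  # user-defined cut-off, chosen by eye for this toy data

scores = knn_distance_scores(points, k=3)
outliers = [p for p, s in zip(points, scores) if s > threshold]
print(outliers)  # [(6.0, 6.5)]
```

This brute-force version compares every pair of points, so it is only suitable for small datasets. Production libraries typically use spatial indexes to make the neighbour search faster.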
Predicting data point distribution
This outlier detection technique uses statistical models to estimate the probability distribution of a dataset. Data points are highlighted as outliers if they are improbable under this modelled distribution. All types of outlier can be identified through this technique, but it is particularly useful for identifying collective outliers. Once outliers are identified, the organisation can carry out outlier analysis to understand the underlying issue.
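A minimal sketch of the distribution-based approach is shown below. It assumes the data is roughly normally distributed, fits a normal distribution to reference data, and flags new values that fall in either extreme tail. The function name `fit_and_flag`, the tail probability and the order values are hypothetical.

```python
from statistics import NormalDist

def fit_and_flag(reference, new_values, tail_probability=0.001):
    """Fit a normal distribution to reference data and flag new values
    that fall in either extreme tail of that distribution."""
    dist = NormalDist.from_samples(reference)
    lower = dist.inv_cdf(tail_probability)
    upper = dist.inv_cdf(1 - tail_probability)
    return [v for v in new_values if v < lower or v > upper]

# Hypothetical daily order values assumed to be free of outliers.
reference = [50, 52, 47, 49, 51, 53, 48, 50, 46, 54]
new_values = [51, 45, 70, 49]

print(fit_and_flag(reference, new_values))  # [70]
```

The tail probability plays the role of the user-defined threshold. Real data is often not normally distributed, in which case a different statistical model would be needed.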
Why do we need outlier analysis?
Data and analysis are increasingly becoming an integral part of everyday business decisions and management. Organisations rely on setting and measuring key performance indicators to evaluate business performance. Monitoring datasets is key to maintaining product quality, achieving high-value marketing campaigns, driving sales decisions, or ensuring user experience statistics are positive. With a growing emphasis on data-led decision making across different organisations, trust in the quality of data is vital. Outlier analysis plays an important role in maintaining this trust.
Outliers can skew trends and forecasts modelled from datasets, negatively impacting the quality and accuracy of decisions. Actively monitoring and performing outlier detection can flag errors in datasets and combat concept drift in machine learning models. If outliers are not identified and removed, models can become less accurate and effective.
What are the main causes of outliers?
Outliers can be caused by specific errors in data collection and processing, but may also be caused by an unknown feature within the dataset itself. Understanding the reason for outliers through outlier analysis helps organisations troubleshoot underlying issues in the data.
Different types of machine learning rely on different types of data to train machine learning algorithms and models. Human error can often be a cause of outliers, especially if data needs to be labelled and prepared as in supervised machine learning. But outliers caused by errors in measurement or data extraction can be present in all types of datasets and machine learning use cases.
Common causes of outliers in machine learning include:
- Human error when entering or labelling data.
- Errors in measuring or collecting the data.
- Errors in data extraction, processing or manipulation.
- Man-made outliers for testing outlier detection processes.
- Natural occurrences of outliers that aren’t errors, which can be called dataset novelties.
Outlier detection techniques and algorithms used by Seldon
Seldon Core is a framework for machine learning deployment on Kubernetes. Alongside Seldon Deploy, the platform provides a basis for containerised machine learning deployment and management. The platform has inbuilt control and auditing tools including outlier detection and alerts. It provides access to different outlier detection techniques for each situation and environment by leveraging different algorithms. Other inference features include model explanations and drift detection tools.
Alibi Detect is a library of outlier detection packages created by Seldon. It provides a range of outlier detection algorithms for different use cases, depending on the type of machine learning model and file type.
A sample of the outlier detection algorithms available through Alibi Detect includes:
- Mahalanobis Distance
- Isolation Forest
- Variational Auto-Encoder
- Sequence to Sequence
Mahalanobis Distance
Mahalanobis Distance is an online outlier detection model focusing on tabular data. It works by calculating the distance between a data point and the centre point of the distribution, scaled by the covariance of the features so that correlated or high-variance features do not dominate. This calculation is used to generate an outlier score. There is a user-defined threshold for this score, beyond which the data point is identified as an outlier.
Because it is an online model, it’s a blank slate when it comes to understanding the features of your data. The algorithm learns from the dataset as the data arrives, so the accuracy of the model will improve the more it is used.
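To show the underlying calculation, the sketch below computes the Mahalanobis distance for two-dimensional data. It is a batch calculation written for illustration, not the online variant described above and not the Alibi Detect implementation. The function name `mahalanobis_2d`, the height and weight pairs and the threshold are assumptions.

```python
import math

def mahalanobis_2d(point, data):
    """Mahalanobis distance from `point` to the centre of two-dimensional `data`."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Entries of the sample covariance matrix.
    sxx = sum((x - mean_x) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T, written out for the 2x2 case.
    d_squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d_squared)

# Hypothetical (height in cm, weight in kg) pairs that are strongly correlated.
data = [(160, 55), (165, 60), (170, 66), (175, 70), (180, 77), (185, 82)]
threshold = 3.0  # user-defined outlier score threshold (assumed)

typical = mahalanobis_2d((172, 68), data)   # follows the height/weight trend
unusual = mahalanobis_2d((165, 82), data)   # each value is in range, the pair is not

print(typical > threshold, unusual > threshold)  # False True
```

The second point has a height and a weight that each appear in the dataset, so a per-feature check would pass it. It is the combination that breaks the correlation, which is what the covariance scaling picks up.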
Isolation Forest
Isolation Forest is an outlier detection algorithm which assigns each data point an anomaly score. It does this by repeatedly partitioning the data on randomly selected attributes and split values until the data point is isolated. Normal data may need many random partitions to achieve isolation because of the close grouping with other data. Outlier data which is adrift from other data points will generally need fewer partitions to achieve isolation. The repeated partitioning can be represented as a tree structure, and the algorithm combines many such trees, hence the name Isolation Forest.
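The partitioning idea can be sketched in a few lines. This simplified version counts how many random splits are needed to isolate one point and averages that count over many random subsamples. It omits the score normalisation used by full Isolation Forest implementations, and it is not the Alibi Detect code. The function names, the parameter values and the generated data are assumptions.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=12):
    """Count the random splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))      # pick a random attribute
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth                     # no spread left on this attribute
    split = rng.uniform(lo, hi)          # pick a random split value
    # Keep only the side of the split that contains the point.
    if point[dim] < split:
        side = [row for row in data if row[dim] < split]
    else:
        side = [row for row in data if row[dim] >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def average_depth(point, data, n_trees=200, sample_size=32, seed=0):
    """Average isolation depth of `point` over many random subsamples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        if point not in sample:
            sample.append(point)
        total += isolation_depth(point, sample, rng)
    return total / n_trees

# Hypothetical data: a cluster around the origin plus one distant point.
gen = random.Random(42)
cluster = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(100)]
inlier, outlier = (0.0, 0.0), (8.0, 8.0)
data = cluster + [inlier, outlier]

inlier_depth = average_depth(inlier, data)
outlier_depth = average_depth(outlier, data)
print(outlier_depth < inlier_depth)  # True
```

A shorter average depth means the point was easier to isolate, so it is more likely to be an outlier. A threshold on this value would then separate outliers from normal data.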
Variational Auto-Encoder
Variational Auto-Encoder is an outlier detection algorithm used with images and tabular data. The model works by learning the relationship between data points within the context of the distribution. It can then model a ‘clean’ distribution which can be used to identify anomalous data. Variational Auto-Encoders are a popular outlier detection method because they can work on unlabelled data and in an unsupervised manner.
Sequence to Sequence
The Sequence to Sequence outlier detection algorithm consists of an encoder and a decoder. The encoder processes the input data, before the decoder reconstructs the input sequence. A large error in reconstruction means the data can be identified as an outlier. An acceptable amount of reconstruction error is estimated for normal data. Outliers are flagged when the reconstruction error is over this defined threshold.
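The thresholding step can be illustrated without training a neural network. In the sketch below, a three-point moving average stands in for the trained encoder and decoder, which is a deliberate simplification. A real Sequence to Sequence model would learn how to reconstruct sequences from data. The function names, the example sequences and the rule of setting the threshold at three times the largest normal error are all assumptions.

```python
import statistics

def reconstruct(sequence):
    """Stand-in for a trained encoder-decoder: a three-point moving average.
    A real Sequence to Sequence model would learn this mapping from data."""
    padded = [sequence[0]] + list(sequence) + [sequence[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3 for i in range(len(sequence))]

def reconstruction_error(sequence):
    """Mean squared error between a sequence and its reconstruction."""
    rebuilt = reconstruct(sequence)
    return statistics.fmean((a - b) ** 2 for a, b in zip(sequence, rebuilt))

# Hypothetical sequences assumed to represent normal behaviour.
normal_sequences = [
    [10, 11, 12, 11, 10, 11],
    [20, 21, 21, 22, 21, 20],
    [5, 5, 6, 7, 6, 5],
    [15, 14, 15, 16, 15, 14],
]

# Estimate an acceptable error level from normal data (assumed rule).
normal_errors = [reconstruction_error(s) for s in normal_sequences]
threshold = 3 * max(normal_errors)

candidate = [10, 11, 40, 11, 10, 11]  # contains a sudden spike
is_outlier = reconstruction_error(candidate) > threshold
print(is_outlier)  # True
```

The stand-in reconstructs smooth sequences well and spiky ones badly, which mimics how a trained model reconstructs familiar patterns more accurately than unfamiliar ones.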
Outlier detection and machine learning deployment with Seldon
Seldon moves machine learning from POC to production to scale, reducing time-to-value so models can get to work up to 85% quicker. In this rapidly changing environment, Seldon can give you the edge you need to supercharge your performance.
With Seldon Deploy, your business can efficiently manage and monitor machine learning for outlier detection, minimise risk, and understand how machine learning models impact decisions and business processes. This means you know your team has done its due diligence in creating a more equitable system while boosting performance.
Detect outliers in your machine learning models effectively and efficiently. Talk to our team about machine learning solutions today.