What is Covariate Shift?

Covariate shift is a specific type of dataset shift often encountered in machine learning. It occurs when the distribution of input data shifts between the training environment and the live environment. Although the input distribution changes, the relationship between inputs and outputs stays the same, so the correct label for any given input is unchanged. Covariate shift is also known as covariate drift. Models are usually trained in offline or local environments on a sample of labelled training data. It’s not unusual for the distribution of inputs in a live and dynamic environment to differ from that of the controlled training environment.

Covariate shift can occur gradually over time or suddenly after deployment. In both cases it can have an adverse effect on the accuracy of the model. The algorithms will have been trained to map input to output data, and may not handle input features drawn from a different distribution. This means the model may be less accurate, or completely ineffective. It is a major consideration in machine learning, as a model that performs well on training data may not be accurate once deployed.

Covariate shift is a common problem in supervised machine learning. It occurs when a model has been trained on a dataset whose input distribution is very different from that of new data. Because the distribution of input variables has shifted, the model may misclassify data points in a live environment.

Addressing covariate shift is an important consideration within machine learning, so that models can be refitted for better accuracy. This guide will explore the topic of covariate shift, including what it is, why it happens, and how it impacts machine learning. 

What is covariate shift in machine learning?

Covariate shift in machine learning is a type of model drift which occurs when the distribution of independent variables changes between the training environment and live environment. There are many different types of dataset drift, of which covariate drift is one. Another example is concept shift, which is a change in the relationship between the input and the target output. Like all model drift, covariate drift can happen gradually over time or suddenly. It is a common issue faced to some degree by most machine learning deployments.
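
The distinction between covariate shift and concept shift can be written compactly. The notation below is an assumption introduced here for illustration: x is the input, y is the target, and the subscripts mark the training and live environments.

```latex
% Covariate shift: the input distribution changes,
% while the input-to-output relationship stays fixed.
p_{\text{train}}(x) \neq p_{\text{live}}(x),
\qquad
p_{\text{train}}(y \mid x) = p_{\text{live}}(y \mid x)

% Concept shift: the input-to-output relationship itself changes.
p_{\text{train}}(y \mid x) \neq p_{\text{live}}(y \mid x)
```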

In supervised machine learning a model will learn the relationship between input and output data through training datasets in an offline or local environment. The model can then be used to make predictions or classify new data using the patterns it has learned. Covariate shift occurs when the distribution of variables in the training data is different from that of real-world or testing data. This means that the model may make the wrong predictions once it is deployed, and its accuracy may be significantly lower. Detecting and addressing covariate shift is therefore a key step in the machine learning process.

Covariate shift can be a sign that the model lacks the ability to generalise adequately. Generalisation is the ability of a model to apply itself to new data using features learned from training data. Low levels of generalisation can result from overfitting, an issue where the model is too closely aligned with the training data. This makes it ineffective on new data with a different distribution. Covariate drift affects most machine learning models to some degree, as test data is never going to be exactly the same as training data. The aim is to establish to what degree it impacts the model, and then to take steps to resolve the issues and improve the model’s accuracy.

The risk from covariate drift, as with all dataset drift, is that the model will become less accurate when deployed. The training data may be governed by a different distribution compared to the live data. Although the labels of the model remain the same, the features of the input data may have shifted. This can be an extreme change or a gradual shift. For example, changes in elements like lighting can mean an image-categorisation model performs poorly once deployed in a live environment. The different levels of lighting in live data produce a different distribution from the training data, which can make the model less accurate at its classification task.
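
The effect can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not a model of any real image system: a single numeric feature stands in for something like brightness, the true relationship (`true_label`) is invented for the example and never changes, and a deliberately simple straight-line model is fitted on inputs from one range and then evaluated on inputs from a shifted range. All function and variable names are hypothetical.

```python
import random


def true_label(x):
    # Fixed input-to-output relationship (assumed for this toy example).
    # It is identical in training and in the live environment.
    return x * x


def fit_line(xs, ys):
    # Ordinary least squares fit of y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, xs):
    # Average squared gap between the line's prediction and the true label.
    slope, intercept = model
    return sum((slope * x + intercept - true_label(x)) ** 2 for x in xs) / len(xs)


rng = random.Random(0)

# Training inputs come from the range [0, 1].
train_x = [rng.uniform(0.0, 1.0) for _ in range(2000)]
model = fit_line(train_x, [true_label(x) for x in train_x])

# Live inputs: one batch from the same range, one batch from a shifted range.
same_x = [rng.uniform(0.0, 1.0) for _ in range(2000)]
shifted_x = [rng.uniform(1.0, 2.0) for _ in range(2000)]

mse_same = mean_squared_error(model, same_x)
mse_shifted = mean_squared_error(model, shifted_x)

print(f"error on inputs like the training data: {mse_same:.4f}")
print(f"error on shifted inputs:                {mse_shifted:.4f}")
```

The relationship between input and output is the same in both batches. Only the input distribution differs, yet the error on the shifted batch is far larger because the model was only ever fitted to the original range.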

Why does covariate shift happen?

Covariate shift is a common occurrence when deploying machine learning models. It happens when there is a difference in input distribution between the training data and live or test data, and this can happen for a number of reasons. For example, a facial recognition model may have been trained only on the faces of people aged 20 to 30. When deployed, the model is likely to be less accurate when trying to map the faces of older people with different facial structures.

Supervised machine learning models are trained with labelled training data. It’s often resource-intensive to properly prepare and label this training data, as the majority of data is raw and unlabelled. A data scientist will usually prepare and label the training data, detecting and analyzing outliers to maintain a high level of data quality. This means that the availability and the quantity of training data can be limited. The distribution of input data in this subset of training data is therefore unlikely to exactly mirror the features of data in a real-world environment. 

A key example would be a model designed to distinguish images of dogs from images of other animals. The model will have been trained to recognize features of dogs from labelled training datasets. However, the training data may not be comprehensive and could miss out specific breeds. Once deployed, the model may not accurately recognise breeds of dogs not present in the training data, as the distribution of features will be different.

The training environment and real-world environment are different. Practitioners won’t have direct control of the input data once the model has been deployed. Training data will have been cleaned and prepared by data scientists, but the same level of oversight cannot be expected in a live environment. For example, an image categorization model may process image subjects with different degrees of lighting or coloring when compared with the training data. This means the distribution of input data would be different enough to affect accuracy if not addressed.

Three examples of covariate shift in machine learning

Covariate shift can occur in a range of different machine learning models, used for different tasks. Machine learning models are generally used to either classify data or to predict trends from data. The distribution of input data is integral to the learning process of these models.   

Detection of covariate drift and other types of model drift is an integral part of the machine learning optimization process. If left undetected, covariate drift can have a serious impact on the utility of machine learning models. Covariate shift can occur in most instances of machine learning, used in a range of settings.  
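
As a rough illustration of what detection can look like, the sketch below compares a single feature's training sample with a live sample using a two-sample Kolmogorov-Smirnov statistic. This is one common technique chosen for the example, not a method prescribed by this guide. The data is synthetic, the names are hypothetical, and the threshold uses the standard large-sample approximation for a 5% significance level.

```python
import bisect
import math
import random


def ks_statistic(sample_a, sample_b):
    # Largest vertical gap between the two empirical cumulative distributions.
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    for value in a + b:
        cdf_a = bisect.bisect_right(a, value) / len(a)
        cdf_b = bisect.bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def ks_threshold(n, m, coefficient=1.358):
    # Approximate critical value for large samples; 1.358 corresponds
    # to a significance level of roughly 5%.
    return coefficient * math.sqrt((n + m) / (n * m))


rng = random.Random(42)

# Synthetic feature values: training data, live data from the same
# distribution, and live data whose mean has moved.
train = [rng.gauss(0.0, 1.0) for _ in range(1000)]
live_same = [rng.gauss(0.0, 1.0) for _ in range(1000)]
live_shifted = [rng.gauss(1.0, 1.0) for _ in range(1000)]

threshold = ks_threshold(len(train), len(live_shifted))
stat_same = ks_statistic(train, live_same)
stat_shifted = ks_statistic(train, live_shifted)

print(f"threshold:          {threshold:.3f}")
print(f"same distribution:  {stat_same:.3f} drift flagged: {stat_same > threshold}")
print(f"shifted mean:       {stat_shifted:.3f} drift flagged: {stat_shifted > threshold}")
```

A check like this looks at one feature at a time and says nothing about model accuracy on its own. In practice a library routine such as `scipy.stats.ks_2samp` would usually replace the hand-written statistic.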

Examples of covariate shift in common machine learning use cases include: 

  • Image categorization and facial recognition 
  • Speech recognition and translation software 
  • Healthcare diagnosis and screening 

In image categorization and facial recognition

A popular use for machine learning models is the categorization or classification of objects within a range of different file types. This can be as varied as natural language or text files, but is often applied to the identification of image files. Examples include deep learning algorithms designed to identify human faces or to categorize leaves by tree type. Models may achieve a high degree of accuracy on a labelled training dataset, identifying and categorising the object of an image. However, when deployed with live data, changes to input distribution can have a serious impact on model accuracy.

Something as subtle as a change in lighting could shift the distribution of data points, and thereby lower the accuracy of the model. In the case of facial recognition, training data may lack specific ethnicities or ages of subject. When deployed in a live environment, subjects that aren’t in line with training data may have an unrecognizable feature distribution.  

In speech recognition

Machine learning models are used to recognize human speech, either to improve human-to-system interactions or as part of a translation system. Covariate drift can cause serious issues with speech recognition models because of the diversity of voices, dialects and accents in the spoken word. For example, a model may be trained on English speakers from a specific area with a specific accent. Although the model may achieve a high degree of accuracy with the training data, it may become less accurate when processing spoken language in a live environment. This is because speech with new dialects or accents presents a different input distribution from the training data.

In healthcare diagnosis

Machine learning models can be used to automate the detection of health issues or diseases in patient data. A model could be trained to automatically screen patient data samples against known diseases or health issues. The input data could be image files such as x-rays, or any number of health-related measurements. A model could also be used to flag patients at risk from certain diseases based on recorded lifestyle choices. 

However, if the model is trained on data from a type of patient that isn’t representative of the real-world use case, covariate drift can occur. For example, a model trained on available training data made up of patients in their 20s won’t be as accurate at screening data from patients in their 50s.
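
A very crude first check for this kind of mismatch is to measure how much of the live data falls outside the range the model saw during training. The sketch below assumes a single numeric feature (patient age) and uses invented values. The function name and data are hypothetical.

```python
def out_of_range_share(train_values, live_values):
    # Fraction of live values lying outside the [min, max] range
    # observed in the training data.
    lowest, highest = min(train_values), max(train_values)
    outside = sum(1 for v in live_values if v < lowest or v > highest)
    return outside / len(live_values)


# Invented example data: training patients are all in their 20s,
# live patients are a mix of 20s and 50s.
train_ages = list(range(20, 30)) * 10
live_ages = [22, 27, 51, 55, 58, 24, 53, 29, 56, 50]

share = out_of_range_share(train_ages, live_ages)
print(share)  # prints 0.6
```

A high share suggests the model is being asked to screen patients unlike any it was trained on. A low share does not rule out covariate shift, since the distribution can still change inside the training range.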
