What is Cross Validation in Machine Learning

Cross validation is a family of techniques for evaluating a machine learning model’s ability to generalize to new and unseen data. Generalization is a key aim of machine learning development, as it directly affects how well the model functions in a live environment. Cross validation is therefore a key step in checking that a machine learning model is accurate before it is deployed.

Many machine learning approaches train the model on labelled training data in an offline or local environment. Although a model can achieve high accuracy on this training data, practitioners need to know whether the same accuracy will be achieved with new data. Cross validation is a useful method for discovering overfitting during the machine learning optimization process. Overfitted models are too closely aligned to the training data, so tend to be less accurate with new data once deployed.

Cross validation requires the partitioning of a dataset into a training dataset and a validation (or testing) dataset. The difference in results between the training and validation datasets can then be measured. Different cross validation techniques partition the original dataset in different ways. Often, the testing phase will include multiple iterations of the model with different subsamples of training and testing data.

This guide explores cross validation in machine learning, including what it is, why it’s needed, and examples of the different cross validation techniques available.  

What is cross validation in machine learning?

Cross validation is the process of assessing a machine learning model’s accuracy with new data. It is a technique mostly used with predictive machine learning models, which use an understanding of input and output data to predict trends with new and unseen data. Predictive models are usually trained in an offline or local environment on labelled training data. These datasets will be a sample of labelled data prepared by a data scientist. 

The model may seem to be trained to a high accuracy on the training data, producing accurate predictions of output data. However, the accuracy of the model with new or unseen datasets is difficult to quantify. Cross validation is the process which tests model accuracy on new and unseen data beyond the training dataset. The technique gives an indication of the level of generalization of the model, or its ability to process unseen data.  

Cross validation in machine learning compares a model’s accuracy with training data against its accuracy with testing or validation datasets. The training dataset is used to train the machine learning model, then the testing dataset is used to evaluate the effectiveness of the model with unseen data. Cross validation provides an insight into how well a model will perform on new and unseen datasets. This will highlight whether the model is accurate with new data before deployment. A common issue highlighted by cross validation is a model being overfitted to the training dataset.
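The comparison between training and validation accuracy can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy dataset, the `fit_memorizer` helper, and the deliberately overfitted "model" that simply memorizes its training pairs and falls back to the most common label for anything it has not seen.

```python
from collections import Counter

def fit_memorizer(train):
    """Return a predict function that memorizes the training pairs exactly."""
    lookup = dict(train)
    # Fallback for unseen inputs: the most common training label
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: lookup.get(x, majority)

def accuracy(predict, data):
    """Fraction of (input, label) pairs the model predicts correctly."""
    return sum(predict(x) == y for x, y in data) / len(data)

# Hypothetical toy dataset: the label is 1 when x is divisible by 3
data = [(x, int(x % 3 == 0)) for x in range(30)]
train, validation = data[:20], data[20:]

predict = fit_memorizer(train)
train_acc = accuracy(predict, train)      # perfect, the data was memorized
val_acc = accuracy(predict, validation)   # lower, the pattern was not learned
gap = train_acc - val_acc

print(f"training accuracy:   {train_acc:.2f}")
print(f"validation accuracy: {val_acc:.2f}")
print(f"gap:                 {gap:.2f}")
```

A large gap between the two scores is the signal of overfitting that cross validation is designed to surface.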

Why is cross validation important?

Cross validation for machine learning is important as it helps to assess whether a model is accurate in a real-world environment with new and dynamic data. The amount of available training data is often limited. High quality labelled datasets take time and resources to properly prepare. It can be expected that a model trained on training data in an offline environment may drop in accuracy when deployed to a new live environment. Cross validation is the process of testing a model with new data, to assess predictive accuracy with unseen data. 

Cross validation is therefore an important step in the process of developing a machine learning model. The technique is a useful method for flagging either overfitting or selection bias in the training data. The process of machine learning optimization means an algorithm will have been trained to be as effective as possible with the training data. However, the model may be over-optimized, or overfit, to the training data. This means the model will be less accurate when exposed to a live environment which isn’t as closely controlled as a training environment. Cross validation is a way of highlighting and measuring this issue with generalization. The technique will help organizations estimate the difference in error between a model processing training data and the same model processing unseen data.

Evaluating the accuracy of a machine learning model is an important step before deployment. A model starts to bring value to the organization once deployed, so ensuring the algorithm is functioning efficiently is vital. Although the model may achieve a high accuracy on training data, this is not guaranteed with live data in a dynamic setting. Once deployed, a model could be ineffective with new data if its accuracy has not been cross validated. The results of the cross validation process can be used to signal further optimization of the machine learning model before deployment. 

Four types of cross validation in machine learning

Cross validation is the process of validating a machine learning model against new, unseen data. It’s an important step to take between training a model and deployment in a live setting. A major aim of a machine learning model is to reach a degree of generalization, to accurately predict outcomes with unseen datasets. Models may be overfit to the training data, meaning the model is inaccurate in a live environment.  

Although the aim is the same, there is a range of different cross validation techniques utilized in machine learning. The different cross validation methods can be categorized as either exhaustive methods or non-exhaustive methods. Exhaustive cross validation methods will test all possible combinations of training and validation datasets. Non-exhaustive cross validation methods don’t test all combinations, but instead will test one or more randomized partitions of the original dataset.

Here we explore four common cross validation techniques, two exhaustive methods and two non-exhaustive methods:

  • Leave-P-Out cross validation (exhaustive method) 
  • Leave-one-out cross validation (exhaustive method) 
  • Holdout method (non-exhaustive method) 
  • K-fold cross validation (non-exhaustive method) 
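To give a feel for the difference in cost between exhaustive and non-exhaustive methods, the sketch below counts how many train/test rounds each of the four techniques would run. The function name `iterations_needed` and the example values (100 data points, p = 2, k = 10) are assumptions chosen for illustration.

```python
from math import comb

def iterations_needed(n_samples, p=2, k=10):
    """Number of train/test rounds each technique runs on n_samples data points."""
    return {
        "leave-p-out": comb(n_samples, p),  # every possible set of p held-out points
        "leave-one-out": n_samples,         # each data point held out once
        "holdout": 1,                       # a single random split
        "k-fold": k,                        # one round per subsample
    }

counts = iterations_needed(100, p=2, k=10)
for method, rounds in counts.items():
    print(f"{method}: {rounds}")
```

Under these assumptions the exhaustive methods need far more rounds than the non-exhaustive ones, which is the trade-off between thoroughness and computational resource.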

Leave-P-Out cross validation

Leave-P-Out cross validation is an exhaustive validation method which tests all possible combinations of training and testing datasets from an original dataset. The process takes a user-defined number of data points (p) from the original dataset to populate the validation dataset. The remaining data points make up the training data for the model. This process is repeated until every possible set of p data points has been held out once. For a dataset of n data points this means "n choose p" iterations, so for small values of p the number of tests, and the computational resource required, grows rapidly as p increases. The results of these iterative tests are then averaged to get the final accuracy score.
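The splitting step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation; the helper name `leave_p_out_splits` is hypothetical. Libraries such as scikit-learn provide an equivalent splitter (`LeavePOut`).

```python
from itertools import combinations
from math import comb

def leave_p_out_splits(n_samples, p):
    """Yield (train_indices, test_indices) for every way of holding out p points."""
    indices = range(n_samples)
    for test in combinations(indices, p):
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, list(test)

# Five data points with p = 2 gives "5 choose 2" different splits
splits = list(leave_p_out_splits(5, 2))
print(len(splits), "splits, expected", comb(5, 2))
for train, test in splits[:3]:
    print("train:", train, "test:", test)
```

In a full evaluation, the model would be trained and scored once per split and the scores averaged.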

Leave-one-out cross validation

Leave-one-out cross validation is a streamlined form of leave-p-out cross validation. In this version, p is set to one. Each data point is held out exactly once, so a dataset of n data points needs only n iterations, lowering the computational resource required to run the test. Much like in leave-p-out cross validation, all possible combinations are tested and the results of each iteration are then averaged.
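As a rough sketch of the full loop, the example below holds out one value at a time, "trains" on the rest, and averages the errors. The model is a deliberately trivial stand-in (it predicts the mean of the training values), and the data and the function name `leave_one_out_errors` are assumptions for illustration only.

```python
from statistics import mean

def leave_one_out_errors(values):
    """Hold out each value in turn and predict it with the mean of the others."""
    errors = []
    for i, held_out in enumerate(values):
        train = values[:i] + values[i + 1:]  # every point except the held-out one
        prediction = mean(train)             # stand-in for training a real model
        errors.append(abs(prediction - held_out))
    return errors

values = [2.0, 4.0, 6.0, 8.0]
errors = leave_one_out_errors(values)
loo_error = mean(errors)  # one error per data point, averaged into a final score

print("iterations:", len(errors))
print("average absolute error:", round(loo_error, 3))
```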

Holdout method for cross validation

The holdout method is a non-exhaustive cross validation technique in which data points are randomly assigned to either a training dataset or a test dataset. The available dataset is split once, with the test dataset generally having fewer data points than the training dataset.

The holdout method involves a single test, unlike other methods which include multiple iterations of a model being tested. For this reason the holdout method can give a less reliable estimate of accuracy: the result depends on which data points happen to land in the test dataset, whereas other methods take an average over multiple tests and iterations. However, the holdout method is understood to be the most straightforward cross validation technique, and it is much less resource intensive as a result.
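A single randomized split of this kind can be sketched as follows. The function name `holdout_split`, the 20% test fraction, and the fixed seed are assumptions chosen for illustration; scikit-learn offers a comparable utility in `train_test_split`.

```python
import random

def holdout_split(data, test_fraction=0.2, seed=0):
    """Shuffle the data once and split it into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = holdout_split(range(10), test_fraction=0.2)
print("train size:", len(train), "test size:", len(test))
```

Changing the seed changes which data points end up in the test dataset, which is why a score from a single holdout split can vary from run to run.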

K-fold cross validation

K-fold cross validation is another form of non-exhaustive cross validation, which splits the original dataset into equally sized subsamples. The original dataset is randomized and then divided into a number (k) of equally sized subsamples, often called folds. In each iteration, one subsample makes up the testing dataset, whilst the remaining data points make up the training dataset.

Validation testing is repeated once for each subsample, and the results are averaged for an overall accuracy score. For example, if k is set as 10, the overall dataset is randomized before 10 subsamples of equal size are created. Each of the 10 subsamples is used once as the testing dataset, with the remaining nine subsamples as the training data for that iteration. Cross validation tests would be repeated 10 times.
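The partitioning described above can be sketched with the standard library alone. The helper name `k_fold_splits`, the seed, and the dataset size of 20 are assumptions for illustration; scikit-learn provides an equivalent splitter (`KFold`).

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)         # randomize before splitting
    folds = [indices[i::k] for i in range(k)]    # k subsamples of near-equal size
    for i, test in enumerate(folds):
        # Training data is every subsample except the one being tested
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(n_samples=20, k=10))
print("rounds:", len(splits))
print("test size per round:", len(splits[0][1]))
print("train size per round:", len(splits[0][0]))
```

Each data point appears in exactly one testing subsample, so every point is used for testing once and for training k - 1 times. In a full run, the model would be scored on each round and the k scores averaged.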

