Introduction to Model Validation for Machine Learning / Data Science

Key Points

  • The importance of separate training, validation, and test sets to prevent bias
  • The difference between real and random effects and how they can impact performance
  • How to split data to measure the performance of different models
  • k-Fold cross-validation for training and validation

Introduction

In data science, model validation is the process of measuring how reliably a machine learning model predicts outcomes while guarding against possible biases. The process uses one portion of the data to train the model and separate portions for validation and testing. Splitting the dataset into a distinct portion for each phase (training, validation, and test) is essential for building a good model and for judging its accuracy. It also guards against overly optimistic results, such as high prediction accuracy driven by patterns caused by random effects in the training data.

There are two types of patterns. A real effect comes from the true relationship between the attributes and the response. A random effect looks real but is actually just random chance. When a model is fit to a training set, it captures both: the real effect, which is the same for every dataset, and the random effect, which is different for every dataset. When the model is applied to different data (new or test data), only the real effect carries over; the random effect is different, so the model typically performs worse than it did on the training set.

Example Scenario

Let’s say I am investigating the top 10 session counts logged on my Porta SFTP Server to predict on which day sessions are most likely to occur.

Day          Session Rates
Friday       17
Monday       11
Saturday     8
Sunday       9
Thursday     7
Tuesday      10
Wednesday    4
Friday       13
Monday       9
Saturday     5

So, based on this data, can we conclude that the best prediction is Friday or Monday? Of course not. While the model might show a good prediction rate during training based on Friday/Monday (possibly due to a random effect), it will still be far off when used against different data (new sessions from other clients, for example). The high session rates on Monday or Friday may be there only by random chance, due to some uncertainty.

Imagine that the aggregated data in the table is affected by, for example, business partners or clients urgently needing to log in to the SFTP Server during the week. Perhaps the server was down earlier in the week and different clients could not log in, so session counts spiked the following day purely by chance. This is why the same data cannot be used for both training and testing: holding data out is how we counteract the random effect.
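To make the random effect concrete, here is a minimal simulation sketch in Python. It assumes, purely for illustration, that every day has the same true average session rate of 10; even so, the day with the highest count changes from sample to sample, so no day is truly "best."

```python
# Illustrative sketch: if every day truly has the same average session rate,
# the "top" day in any single sample is decided purely by random chance.
import numpy as np

rng = np.random.default_rng(seed=42)
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

for sample in range(3):
    # One week of counts; the true mean (10) is identical for every day.
    counts = rng.poisson(lam=10, size=len(days))
    top_day = days[int(np.argmax(counts))]
    print(f"sample {sample + 1}: counts={counts.tolist()}, 'best' day={top_day}")
```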

Measuring Model Performance

If you are only measuring one model (not comparing different types of models), you only need two sets: a larger set to fit the model and a smaller set to measure its effectiveness.

  • 80 percent of the data will be used to fit the model; and
  • the remaining 20 percent will be used to validate its effectiveness.

This makes it possible to judge the accuracy or performance of the model. Any uniquely random effect inherited from the training set does not get counted as a benefit, because only the held-out 20% is used for validation.
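As a minimal sketch of this two-set approach, the snippet below uses scikit-learn with a synthetic placeholder dataset (the post has no specific data attached): it fits a model on 80% of the data and measures its effectiveness on the held-out 20%.

```python
# Minimal sketch of the 80/20 split described above (scikit-learn).
# X and y stand in for your feature matrix and response.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 80% to fit the model, 20% held out to measure its effectiveness.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = KNeighborsClassifier()
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```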

Comparing & Choosing Models

Suppose you build several models using two common classification algorithms, Support Vector Machine (SVM) and K-Nearest Neighbors (KNN).

Sample Performance of 4 Models from SVM and KNN

         Model 1    Model 2    Model 3    Model 4
SVM      94/100     98/100     90/100     90/100
KNN      92/100     94/100     87/100     95/100

Do you think it is reasonable to simply pick SVM Model 2 because it shows the highest result at 98/100?

The answer is no. Using training and validation sets alone can be reasonably fine when estimating a single model’s accuracy, but not when comparing different models. The observed accuracy of the best-performing model is unavoidably the sum of its real quality and random effects, so the winning model most likely benefited from above-average random effects, which makes its score too optimistic.

Sample Table of Models with Real and Random Effects

Model      Real Effects    Random Effects    Total
SVM – 1    92              +2                94
SVM – 2    93              +5                98 – Inflated by Random Effects
SVM – 3    83              +1                84
SVM – 4    93              -3                90
KNN – 1    94              -2                92
KNN – 2    91              +3                94
KNN – 3    92              -5                87
KNN – 4    94              +0                94

Note. This is only for visualization to understand what real and random effects mean.

This table shows that the random effect can make a model look better than it really is, and sometimes worse than it really is.

Introducing a Third Set of Data

To circumvent this issue, we need a third set of data to check the performance of the chosen model after it has been selected using the validation set. A code sketch of this flow follows the guide below.

Flow Chart Guide

  • Training Set – for building a model;
  • Validation Set – for choosing the best model; and
  • Test Set – for estimating the performance of the chosen model
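Here is a hedged sketch of that flow in Python with scikit-learn (the dataset and the 60/20/20 proportions are illustrative assumptions): the training set builds each candidate model, the validation set picks the winner, and the untouched test set estimates the chosen model’s performance.

```python
# Sketch of the training / validation / test flow:
# build on the training set, choose on the validation set,
# estimate final performance on the test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off the test set, then split the rest into training and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

candidates = {"SVM": SVC(), "KNN": KNeighborsClassifier()}
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                   # training set: build the model
    val_scores[name] = model.score(X_val, y_val)  # validation set: choose the model

best_name = max(val_scores, key=val_scores.get)
best_model = candidates[best_name]

# Test set: estimate the chosen model's performance on unseen data.
print("validation scores:", val_scores)
print(f"chosen model: {best_name}, test accuracy: {best_model.score(X_test, y_test):.3f}")
```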

Splitting Data

When building a model, the training set should generally be the largest so the model can fit and capture as much of the important data as possible. The test set, on the other hand, should still have enough data, including some of the important data points.

One model

  • 70-90% for the training set and 10-30% for the test set

Comparing models

  • 50-70% for the training set, with the remainder divided between the validation and test sets

When splitting data into different sets, randomization or rotation in selecting data points for each set helps to roughly equalize the distribution of important information across the sets. In other words, the training data should share some (if not most) of the important data points with the validation and/or test data. Be careful, though: like any other technique, randomization or rotation alone can still introduce selection bias, which hurts model accuracy. To reduce that chance, use the techniques in combination.
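As a sketch of how randomization combined with stratification (a scikit-learn option, named here as one way to combine techniques) can keep important data points spread evenly, the snippet below shuffles synthetic, imbalanced placeholder data and stratifies on the class label so both portions see roughly the same class distribution.

```python
# Sketch: shuffling plus stratification keeps the class distribution
# roughly equal across the training and test portions.
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced placeholder data: about 90% of one class, 10% of the other.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, stratify=y, random_state=0
)

print("full data class counts: ", Counter(y.tolist()))
print("training class counts:  ", Counter(y_train.tolist()))
print("test class counts:      ", Counter(y_test.tolist()))
```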

k-Fold Cross-Validation

When splitting data, there is always a chance that the important data ends up badly skewed across the sets. For example, what if the important data points land only in the validation or test sets? Then the model will never be fit to them at all. That is where cross-validation comes into play.

There are several cross-validation methods in machine learning, but this post introduces the k-fold method. For reference, here are the names of a few:

  • k-Fold cross-validation
  • Time series (rolling cross-validation)
  • Hold-out cross-validation
  • Leave-p-out cross-validation
  • Leave-one-out cross-validation
  • Stratified k-fold cross-validation
  • Monte Carlo (shuffle-split)

Remember that all of these share one goal: to combat selection bias, help evaluate how well the model generalizes to independent data, and estimate how the model will perform when it encounters new data points to make predictions.

In k-fold cross-validation, k is the number of equal splits (folds) of randomly selected data from the training set. In each iteration, one fold is held out for validation and the remaining k-1 folds are used for training; the held-out fold rotates until every fold has served as the validation set once. To visualize:

Iteration    Fold 1        Fold 2        Fold 3        Fold 4        Fold 5
1            Validation    Training      Training      Training      Training
2            Training      Validation    Training      Training      Training
3            Training      Training      Validation    Training      Training
4            Training      Training      Training      Validation    Training
5            Training      Training      Training      Training      Validation

Note. The table shows k=5: in each iteration, 5-1=4 folds are used for training and one fold is left out for validation. This is called 5-fold cross-validation.

As the table shows, four folds go to training and one fold to validation, and the roles rotate in order with each iteration.
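Below is a minimal sketch of this 5-fold rotation using scikit-learn’s KFold on a synthetic placeholder dataset; each iteration trains on four folds and validates on the remaining one, just as in the table.

```python
# Sketch of the 5-fold rotation: in each iteration, four folds train
# the model and the remaining fold validates it.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for i, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    model = SVC()
    model.fit(X[train_idx], y[train_idx])        # train on k-1 = 4 folds
    score = model.score(X[val_idx], y[val_idx])  # validate on the held-out fold
    scores.append(score)
    print(f"iteration {i}: validation accuracy = {score:.3f}")

print("average accuracy across folds:", np.mean(scores).round(3))
```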

Possible Questions

Is the best-performing model from k-fold cross-validation good enough to use on its own? The answer is no. Although you can average the performance across the folds to get an idea of how well the model does, that alone does not guarantee good performance on new data. Instead, you should fit the final model on all k folds of the training data.

What is the best way to determine the number of splits (k)? There is no single best way or standard value of k, but the most common choice is 10.
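A sketch of that advice, assuming scikit-learn and placeholder training data: estimate accuracy with 10-fold cross-validation, then refit the model you actually keep on the full training set.

```python
# Sketch: estimate performance with 10-fold cross-validation,
# then refit the final model on all of the training data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_train, y_train = make_classification(n_samples=1000, random_state=0)

scores = cross_val_score(SVC(), X_train, y_train, cv=10)
print("10-fold accuracy estimate:", scores.mean().round(3))

# The cross-validated fits are only for estimation; the model you keep
# is fit on the full training set.
final_model = SVC().fit(X_train, y_train)
```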

To learn more about the math behind cross-validation, please visit An Introduction to Statistical Learning (statlearning.com), Chapter 5.

Conclusion

In data science there are many ways to solve the same problem, and learning is the key to innovation. This post covers only the basic ideas behind why the validation and testing phases matter after a model is fitted on the training data, along with the different data-splitting techniques that minimize bias. These ideas work hand in hand with the bias/variance tradeoff to achieve the best possible prediction accuracy.

I hope you enjoyed reading this post. Please subscribe to the mailing list; upcoming posts will show these different techniques in action with code in R and/or Python.
