Key Points
- The importance of separate training, validation, and test sets in preventing bias
- The difference between real and random effects and how they impact performance
- How to split data to measure and compare model performance
- k-Fold cross-validation for training and validation
Introduction
In data science, model validation is the process of measuring how reliably a machine learning model predicts the outcome while guarding against possible biases. The process uses one portion of the data to train the model and separate portions for validation and testing. Splitting the dataset into a dedicated portion for each of these phases (training, validation, and test) is the rule for building a good model and for judging its accuracy. It also prevents overly optimistic results, such as high prediction accuracy caused by patterns that arise from random effects in the training data.
There are two types of patterns. A real effect derives from the real relationship between the attributes and the response. A random effect looks real but is actually just random chance. It is important to note that when fitting a model to a training set, both effects are picked up: the real effect is the same for all datasets, while the random effect is different for every dataset. This means that when different data (new or test data) is used, only the real effect carries over; the random effect there is different, so the model performs worse than it did on the training set.
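To make this concrete, here is a minimal simulated sketch (my own illustration, not from the post; the data and the 1-nearest-neighbor model are assumptions chosen to exaggerate the effect). The model memorizes the noise in its training data, so its training accuracy is near perfect while its accuracy on fresh data from the same process is much lower.
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)

# Weak real effect (feature 0 nudges the label) plus a lot of noise.
X_train = rng.normal(size=(100, 5))
y_train = (X_train[:, 0] + rng.normal(scale=2.0, size=100) > 0).astype(int)
X_test = rng.normal(size=(100, 5))
y_test = (X_test[:, 0] + rng.normal(scale=2.0, size=100) > 0).astype(int)

# 1-NN happily memorizes the training set, including its random effects.
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

print("Training accuracy:", model.score(X_train, y_train))  # near 1.0
print("Test accuracy:", model.score(X_test, y_test))        # noticeably lower
```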
Example Scenario
Let’s say that I am investigating the top 10 logged session counts on my Porta SFTP Server, to possibly predict on which days sessions occur most often.
| Day | Session Rates |
| --- | --- |
| Friday | 17 |
| Monday | 11 |
| Saturday | 8 |
| Sunday | 9 |
| Thursday | 7 |
| Tuesday | 10 |
| Wednesday | 4 |
| Friday | 13 |
| Monday | 9 |
| Saturday | 5 |
Imagine that the aggregated data in the table is affected by, for example, business partners and associates (clients) urgently needing to log in to the SFTP Server during the week. One reason could be that in a recent week several clients were unable to log in because of SFTP Server downtime, so session counts rose the following day purely by random chance. This is exactly why the same data cannot be used for both training and testing: keeping them separate counteracts the random effect.
Measuring Model Performance
If you are only measuring one model (not measuring or comparing different types of models), you only need two sets: a larger set to fit the model and a smaller set to measure its effectiveness.
- The larger portion (for example, 80 percent) is used to fit the model; and
- The smaller portion (for example, 20 percent) is used to validate its effectiveness.
This makes it possible to judge the accuracy or performance of the model. If a uniquely random effect inherited from the training set is present, the model gets no credit for it, because only the held-out 20% is used to measure its effectiveness.
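As a rough illustration, here is a minimal sketch of such a two-set split, assuming scikit-learn and using its built-in breast-cancer dataset as a stand-in; the 80/20 ratio is only one example from the ranges discussed later.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Larger portion to fit the model, smaller held-out portion to measure it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = SVC().fit(X_train, y_train)
print("Held-out accuracy:", model.score(X_test, y_test))
```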
Comparing & Choosing Models
Suppose you build several models using two common classification algorithms, Support Vector Machine (SVM) and K-Nearest Neighbors (KNN). Is it enough to simply pick the model with the highest accuracy from the table below?
Sample performance of 4 models each from SVM and KNN
|  | Model 1 | Model 2 | Model 3 | Model 4 |
| --- | --- | --- | --- | --- |
| SVM | 94/100 | 98/100 | 90/100 | 90/100 |
| KNN | 92/100 | 94/100 | 87/100 | 95/100 |
The answer is no. While training and validation sets alone can be reasonably fine for estimating a single model’s accuracy, they are not enough for comparing different models. The observed accuracy of the highest-scoring model is unavoidably the sum of its real quality and its random effects, so the winning model is likely the one with an above-average random effect, which makes the estimate too optimistic.
Sample table of the models’ real and random effects
| Model | Real Effect | Random Effect | Total |
| --- | --- | --- | --- |
| SVM – 1 | 92 | +2 | 94 |
| SVM – 2 | 93 | +5 | 98 – inflated by random effects |
| SVM – 3 | 89 | +1 | 90 |
| SVM – 4 | 93 | -3 | 90 |
| KNN – 1 | 94 | -2 | 92 |
| KNN – 2 | 91 | +3 | 94 |
| KNN – 3 | 92 | -5 | 87 |
| KNN – 4 | 95 | +0 | 95 |
This table shows that the random effect can make a model look better than it really is, and sometimes it can make a model look worse than it really is.
Introducing a Third Set of Data
To circumvent this issue, we need another set of data (a third set) to check the performance of the chosen model after it has been selected using the validation set; a short code sketch after the outline below illustrates the full flow.
Flow Chart Guide

- Training Set – for building a model
- Validation Set – for choosing the best model; and
- Test Set – for estimating the performance of the chosen model
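Here is a minimal sketch of that three-set flow, again assuming scikit-learn; the built-in breast-cancer dataset and the default SVM/KNN settings are illustrative stand-ins only.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# One possible split: 60% training, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

candidates = {"SVM": SVC(), "KNN": KNeighborsClassifier()}
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                   # training set: build the model
    val_scores[name] = model.score(X_val, y_val)  # validation set: choose the best model

best_name = max(val_scores, key=val_scores.get)
print("Validation scores:", val_scores)

# Test set: estimate the chosen model's performance on data it has never seen.
print("Test accuracy of", best_name, ":", candidates[best_name].score(X_test, y_test))
```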
Splitting Data
When building a model, the training set should generally be the largest, so that the model can fit and capture as many of the important data points as possible. On the other hand, the test set should still have enough data, including some of the important data points.
One model
- 70-90% for the training set and 10-30% for the test set
Comparing models
- 50-70% for the training set, with the remainder divided between the validation and test sets
When splitting data into different sets, note that randomization or rotation in selecting the data points for each set helps to approximately equalize the distribution of important information across sets. In other words, the training data should share some (if not most) of the important data points with the validation and/or test data. Be careful, though: like any other technique, randomization or rotation alone can still introduce selection bias, which impacts model accuracy. To minimize that chance, use the two techniques in combination.
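One standard way to combine randomization with a more even spread of important information is a stratified split. The sketch below is an assumption-laden illustration (scikit-learn, its built-in breast-cancer dataset, and the class labels standing in for the "important" information): `stratify=y` keeps the class proportions roughly equal across the sets instead of leaving them to chance.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Plain random split: the class balance of each set is left to chance.
_, _, _, y_test_random = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: the class balance is preserved in both sets.
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

print("Positive rate overall:      ", round(y.mean(), 3))
print("Positive rate, random split:", round(y_test_random.mean(), 3))
print("Positive rate, stratified:  ", round(y_test_strat.mean(), 3))
```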
k-Fold Cross Validation
When splitting data, there is always a chance that the distribution of important data ends up heavily skewed. For example, what if the important data points are present only in the validation or test set? Then the model will never be fitted to important data points that it only encounters at validation or test time. That is where cross-validation comes into play.
There are several cross-validation methods in machine learning, but this post will introduce the k-fold cross-validation method. For reference, here are the names of a few:
- k-Fold cross-validation
- Time series (rolling cross-validation)
- Hold-out cross-validation
- Leave-p-out cross-validation
- Leave-one-out cross-validation
- Stratified k-fold cross-validation
- Monte Carlo (shuffle-split)
Remember that all of these share one goal: to combat selection bias, to help evaluate how well the model generalizes to an independent dataset, and to estimate how well the model will perform when it encounters new data points to predict.
In k-fold cross-validation, k is the number of equal, randomly selected splits (folds) of the training set. In each iteration, one fold is held out for validation and the remaining k − 1 folds are used for training. To visualize:
| Iteration | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 |
| --- | --- | --- | --- | --- | --- |
| 1 | Validation | Training | Training | Training | Training |
| 2 | Training | Validation | Training | Training | Training |
| 3 | Training | Training | Validation | Training | Training |
| 4 | Training | Training | Training | Validation | Training |
| 5 | Training | Training | Training | Training | Validation |
Note. The table shows k = 5, with one fold left out for validation in each iteration (5 − 1 = 4 training folds). This is called 5-fold cross-validation.
As you can see in the table above, 4 folds go to training and 1 fold to validation, and the roles rotate in order for each subsequent iteration.
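A minimal sketch of 5-fold cross-validation, assuming scikit-learn; `cross_val_score` performs the fold rotation shown in the table, and the built-in breast-cancer dataset is again just a stand-in.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# cv=5 performs the rotation from the table: 5 fits, each validated on one fold.
scores = cross_val_score(SVC(), X, y, cv=5)
print("Fold scores:", scores)
print("Mean accuracy:", scores.mean())
```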
Possible Questions
Can the best-performing model from k-fold cross-validation be used as-is? The answer is no. Although you can average the performance across the folds to get an idea of how good the model is, that does not guarantee it will perform well on new data. Instead, you should refit the model on all k folds of the training set.
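A minimal sketch of that last step, under the same scikit-learn assumptions: the cross-validated average is used only to judge the model, and the model is then refit on the entire training set.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X_train, y_train = load_breast_cancer(return_X_y=True)  # stands in for the full training set

model = SVC()
cv_scores = cross_val_score(model, X_train, y_train, cv=10)
print("Average cross-validated accuracy:", cv_scores.mean())

# cross_val_score fits clones of the model internally and discards them,
# so the final model is refit on all of the training data.
final_model = model.fit(X_train, y_train)
```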
What is the best way to determine the number of splits (k)? There is no single best way or standard value of k, but the most common choice is 10.
To learn more about the math behind cross-validation, please visit An Introduction to Statistical Learning (statlearning.com), chapter 5.
Conclusion
In the field of data science, there are many ways to solve the same problem, and learning is the key to innovation. This post shows only the basic idea of how important the validation and testing phases are after a model has been fitted to the training set. It also shows different techniques for splitting the dataset to minimize bias, which work in conjunction with the bias/variance tradeoff to achieve optimal prediction accuracy.
I hope you enjoyed reading this post. Please subscribe to the mailing list; future posts will include R and/or Python code for these different techniques so you can see them in action.