Optimal separation of a data set in: Training, Validation and Testing

I would like to know if there is a rule-of-thumb recommendation for dividing a data set into three sets in a Machine Learning problem: Training, Validation and Testing.

If so, what would be the ideal division?

I would also like to better understand the difference between the validation set and the test set, and why it is necessary to have both.

1 answer

In general we randomly split the data: 70% for training, 15% for validation and 15% for testing. But this varies a lot and depends on the problem. For example, when there is a time factor we cannot split randomly, so it is common to take different periods for training, validation and testing. Depending on the size of the dataset, these percentages may not even make sense.
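A minimal sketch of the random 70/15/15 split described above, using only the standard library (the function name and seed are illustrative, not from any particular library):

```python
import random

def split_indices(n, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle the row indices of a dataset of size n and split them
    into training / validation / test sets (70/15/15 by default)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # reproducible shuffle
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (idx[:n_train],                    # training indices
            idx[n_train:n_train + n_val],     # validation indices
            idx[n_train + n_val:])            # test indices (the remainder)

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```

For a time-ordered dataset you would instead skip the shuffle and take contiguous slices, so that the test period comes after the training period.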

On your other question: why do we use both a validation set and a test set?

In general we fit a large number of models, check the prediction error of each on the validation set, and in the end choose the model with the lowest validation error. The problem is that, because we try many models, it is easy to end up with a model that has become too specific (overfitted) to the validation set and does not work on other data. That is why we hold out a test set: to estimate the prediction error of the chosen model and make sure it is not overfitted.
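The model-selection procedure described above can be sketched as follows. This is a toy illustration with made-up data and a hypothetical family of candidate models (lines through the origin with different slopes); the point is only to show that the validation set picks the model and the test set is touched exactly once at the end:

```python
import random

random.seed(0)
# Toy data: y = 2*x + noise
xs = [random.uniform(0, 1) for _ in range(300)]
ys = [2 * x + random.gauss(0, 0.1) for x in xs]

# 70/15/15 split (data is i.i.d. here, so contiguous slices are fine)
train = list(zip(xs[:210], ys[:210]))
val   = list(zip(xs[210:255], ys[210:255]))
test  = list(zip(xs[255:], ys[255:]))

def mse(slope, data):
    """Mean squared error of the model y_hat = slope * x on a data set."""
    return sum((y - slope * x) ** 2 for x, y in data) / len(data)

# "Fit" many candidate models (here, a grid of slopes) and choose
# the one with the lowest error on the VALIDATION set.
candidates = [i / 10 for i in range(0, 41)]
best = min(candidates, key=lambda s: mse(s, val))

# Only now do we look at the TEST set, once, to get an honest
# estimate of the chosen model's prediction error.
print("chosen slope:", best, "test MSE:", round(mse(best, test), 4))
```

Because dozens of candidates competed on the validation set, the validation error of the winner is optimistically biased; the single evaluation on the test set avoids that bias.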
