In general, we randomly split the data: 70% for training, 15% for validation, and 15% for testing. But this varies a lot and depends on the problem. For example, when there is a time factor we cannot split at random, and it is common to use different periods for training, validation, and testing (see the sketch below). Depending on the size of the dataset, these percentages may not even make sense.
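Here is a minimal sketch of both splits, using scikit-learn's `train_test_split`. The data `X`, `y` and all variable names are placeholders for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for a real dataset.
rng = np.random.RandomState(42)
X = rng.randn(1000, 5)   # placeholder feature matrix
y = rng.randn(1000)      # placeholder target

# Random 70/15/15 split: first set aside 70% for training...
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=42
)
# ...then split the remaining 30% in half: 15% validation, 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42
)

# When there is a time factor, split by position instead of at random:
# earliest 70% for training, next 15% for validation, last 15% for testing.
n = len(X)
train_end, val_end = int(0.70 * n), int(0.85 * n)
X_train_t, X_val_t, X_test_t = X[:train_end], X[train_end:val_end], X[val_end:]
```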
On your other question: why do we use a validation set and a test set?
In general, we fit a large number of models and check the prediction error of each one on the validation set; in the end, we choose the model with the lowest validation error. The problem is that, since we fit many models, it is easy to end up with a model that is too specific to the validation set (overfitted to it) and does not work well on other data. That is why we hold out a test set: to estimate the prediction error of the chosen model and make sure it is not overfitted.
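A hedged sketch of that procedure, continuing from the split above: fit several candidate models, pick the one with the lowest validation error, and only then touch the test set once. The candidates here (ridge regressions over a few penalty values) are just illustrative stand-ins for whatever models you are comparing:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data and the same 70/15/15 split as in the earlier sketch.
rng = np.random.RandomState(0)
X, y = rng.randn(1000, 5), rng.randn(1000)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=0)

# Fit many candidate models and keep the one with the lowest validation error.
candidates = [Ridge(alpha=a) for a in (0.01, 0.1, 1.0, 10.0)]
best_model, best_val_error = None, float("inf")
for model in candidates:
    model.fit(X_train, y_train)
    val_error = mean_squared_error(y_val, model.predict(X_val))
    if val_error < best_val_error:
        best_model, best_val_error = model, val_error

# The test set is used exactly once, on the chosen model. The winner's
# validation error is optimistically biased by the selection itself,
# so the test error is the honest estimate of generalization error.
test_error = mean_squared_error(y_test, best_model.predict(X_test))
print(f"validation error of chosen model: {best_val_error:.3f}")
print(f"test error (final estimate):      {test_error:.3f}")
```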
Excellent reply! Thank you very much!
– Antonio Carlos Porto