I understand the reasoning behind splitting the data into a Training set and a Validation set. I also understand that the size of the split depends on the situation, but will generally vary from 50/50 to 90/10.
I built an RNN to correct spelling, starting with a data set of ~5m sentences. I set aside 500k sentences and train on the remaining ~4.5m. When training is done, I take my validation set and compute the accuracy.
The interesting thing is that after only 4% of my validation set I have an accuracy of 69.4%, and this percentage doesn't change by more than 0.1% in either direction. Eventually I just cut the validation short, because the number is stuck at 69.5%.
So why slice off 10% for Validation when I could probably get away with 1%? Does it matter?
Larger validation sets give more accurate estimates of out-of-sample performance. But as you’ve noticed, at some point that estimate might be as accurate as you need it to be, and you can make some rough predictions as to the validation sample size you need to reach that point.
For simple correct/incorrect classification accuracy, you can calculate the standard error of the estimate as √(p(1−p)/n) (the standard deviation of a Bernoulli variable), where p is the probability of a correct classification and n is the size of the validation set. Of course you don't know p, but you may have some idea of its range. E.g. let's say you expect an accuracy between 60% and 80%, and you want your estimate to have a standard error smaller than 0.1%:
How large should n (the size of the validation set) be? Solving √(p(1−p)/n) ≤ 0.001 for n gives n ≥ p(1−p)/0.001². For p=0.6 we get n ≥ 240,000; for p=0.8 we get n ≥ 160,000.
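This arithmetic is easy to reproduce; here is a minimal sketch in Python (the helper name `required_n` is just for illustration):

```python
def required_n(p: float, se: float) -> int:
    """Validation-set size n at which the standard error
    sqrt(p * (1 - p) / n) of an accuracy estimate equals se."""
    return round(p * (1 - p) / se ** 2)

# Expected accuracy between 60% and 80%, target standard error of 0.1%:
print(required_n(0.6, 0.001))  # 240000
print(required_n(0.8, 0.001))  # 160000

# Looser requirement: p = 0.7 with a standard error below 1%:
print(required_n(0.7, 0.01))   # 2100
```

Note that the required n shrinks quadratically as the acceptable standard error grows, which is why relaxing 0.1% to 1% cuts the sample size by a factor of a hundred.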
So this tells us you could get away with using less than 5% of your 5 million samples for validation. This percentage goes down if you expect higher performance, or especially if you are satisfied with a larger standard error for your out-of-sample performance estimate (e.g. with p=0.7 and a standard error below 1%, you need only 2,100 validation samples, or less than a twentieth of a percent of your data).
These calculations also showcase the point made by Tim in his answer, that the accuracy of your estimates depends on the absolute size of your validation set (i.e. on n), rather than its size relative to the training set.
(I should also add that I'm assuming representative sampling here. If your data are very heterogeneous, you might need larger validation sets just to make sure the validation data include all the same conditions as your training and test data.)
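One common way to keep the validation set representative is stratified sampling: draw from each condition in proportion to its frequency. A rough sketch in Python, assuming discrete condition labels (the function name and toy labels are hypothetical):

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.1, seed=0):
    """Pick validation indices so that each label (stratum) is
    represented in proportion to its frequency in the full data."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)
    val_idx = []
    for idx in by_label.values():
        rng.shuffle(idx)
        # Take at least one sample per stratum, even for rare conditions.
        val_idx.extend(idx[: max(1, round(len(idx) * val_fraction))])
    return sorted(val_idx)

# Toy example: 90 samples of condition "a", 10 of condition "b".
labels = ["a"] * 90 + ["b"] * 10
val = stratified_split(labels, val_fraction=0.1)
print(len(val))  # 10 validation samples, including at least one "b"
```

With a plain random split, a rare condition can easily be missing from a small validation set entirely; stratifying guarantees every condition appears.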