I have run into several peculiarities of (and misunderstandings about) Vowpal Wabbit while trying to do online multiple-pass learning.
Specifically, I need to solve a ridge linear regression problem with N = 4e6 points and around K = 2.38e5 features in total. Each point is sparse, usually with between 10 and 100 active features. Since N is too large for the data to fit in my PC's limited memory, I decided to use Vowpal Wabbit's out-of-core SGD. The features are in text representation:

```
vw -d train.dat -c -f train.model --passes 10 --loss_function squared --l2 0.00001 -l 0.05

Num weight bits = 18
learning rate = 0.05
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using l2 regularization = 1e-05
using cache_file = vw/Train_0.cache
ignoring text input in favor of cache input
num sources = 1
average    since        example      example  current  current  current
loss       last         counter       weight    label  predict features
0.028051   0.028051           3          3.0   0.0010   0.0070       24
0.089131   0.150212           6          6.0   0.0036   0.0256       20
0.131556   0.182465          11         11.0   0.0076   0.0297       47
0.102226   0.072896          22         22.0   0.0052   0.0368       16
0.095826   0.089426          44         44.0   0.0539   0.1463       64
0.094730   0.093608          87         87.0   0.0058   0.0605        8
0.084809   0.074887         174        174.0   0.0177   0.0914       13
0.073454   0.062099         348        348.0   0.2019   0.1518       18
0.065183   0.056912         696        696.0   0.0044   0.1866       20
0.061144   0.057106        1392       1392.0   0.8041   0.2107       12
0.060112   0.059079        2784       2784.0   0.0454   0.1215       15
0.054327   0.048543        5568       5568.0   0.0875   0.0016       58
0.052169   0.050009       11135      11135.0   0.0194   0.0785       41
0.048767   0.045365       22269      22269.0   0.6126   0.1835       42
0.046333   0.043900       44537      44537.0   0.0149   0.0000       55
0.045165   0.043997       89073      89073.0   0.4736   0.1643       15
0.043814   0.042463      178146     178146.0   0.0561   0.0189       51
0.042698   0.041581      356291     356291.0   0.0161   0.0412       54
0.042067   0.041436      712582     712582.0   0.0012   0.1215       25
0.041751   0.041435     1425163    1425163.0   0.0540   0.0677       33
0.042095   0.042439     2850326    2850326.0   0.1088   0.1141       41
0.000000   0.000000     5700651    5700651.0   0.0028  -0.0000       43 h
0.000000   0.000000    11401301   11401301.0   0.0006   0.1667       32 h
0.000000   0.000000    22802601   22802601.0   0.0534   0.1617       12 h

finished run
number of examples per pass = 3702833
passes used = 10
weighted example sum = 3.70283e+07
weighted label sum = 5.96004e+06
average loss = 0 h
best constant = 0.160959
total feature number = 1244815410
```
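(For context, each line of `train.dat` uses VW's plain-text input format: a label followed by a `|`-delimited feature section of sparse `name:value` pairs. The feature names and values below are invented purely for illustration.)

```
# hypothetical lines of train.dat (sparse features as name:value pairs)
0.0010 | f12:0.31 f873:1.0 f20451:0.05
0.0036 | f7:0.92 f873:0.44 f11201:1.0 f99004:0.12
```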
The `h` at the right end of the last three rows indicates that the loss is calculated on the held-out validation subset of the data (with multiple passes, VW by default holds out every 10th example for validation instead of training on it). Why am I getting an average loss of 0, when the holdout predictions are clearly imperfect, as indicated, again, by the gap between `current label` and `current predict` in those last three rows? This oddity obscures whether or not my algorithm has converged. I can work around it by ignoring `vw`'s `--passes` option and performing the "passes" manually, with an IO-intensive shuffle of the data between passes:

```
for i in {0..10}
do
    # custom script that shuffles train.dat in place
    ./shuffledata train.dat train.dat
    # warm-start from the previous model (-i) and save the update (-f);
    # -k discards the stale cache so the shuffled data is re-read
    # (assumes train.model already exists from an initial run)
    vw -d train.dat -c -k --l2 0.00001 -l 0.05 -i train.model -f train.model
done
```
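Incidentally, the zero average loss can be sidestepped another way: VW has a `--holdout_off` flag that disables holdout evaluation entirely, so a multi-pass run trains on (and reports progressive loss over) every example; the trade-off is that VW then has no holdout signal for early termination. A minimal sketch under the same settings as above:

```
# multi-pass run with holdout evaluation disabled: all examples are
# trained on, and the reported loss is progressive training loss
vw -d train.dat -c -k -f train.model --passes 10 \
   --loss_function squared --l2 0.00001 -l 0.05 --holdout_off
```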
However, the above two approaches (with the `--passes` option, and without `--passes` but with the shuffling script) perform almost exactly the same as the following single-pass snippet:

```
vw -d train.dat -c --l2 0.00001 -l 0.05 -f train.model
```
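One way to make that comparison concrete (a sketch, assuming a separate file `test.dat` that none of the runs have trained on) is to score each saved model in test-only mode and compare the reported average loss:

```
# -t disables learning, so the printed "average loss" is a clean
# test squared error; -p writes one prediction per line for later checks
vw -t -d test.dat -i train.model -p preds.txt
```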
In other words, it's as if passes are not needed at all for "convergence." These experiences lead me to my questions:
- Why does multiple-pass (shuffled & non-shuffled) give the same result as single-pass in an online SGD setting?
- Why is my holdout error 0? (A manual check is sketched below.)
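For the second question, one way to check whether the holdout loss is genuinely zero or just a reporting artifact is to recompute the squared error by hand from the predictions. A sketch, reusing `preds.txt` from the test-mode run above and assuming the label is the first token on each line of `test.dat` and that examples carry no tags:

```
# pair each prediction with its true label, then average the squared error
paste preds.txt <(cut -d'|' -f1 test.dat | awk '{print $1}') \
  | awk '{se += ($1 - $2)^2; n++} END {print "avg squared loss:", se / n}'
```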
Answer
Attribution
Source: Link, Question Author: richizy, Answer Author: Community