Confusion with Vowpal Wabbit’s multiple-pass behavior when performing ridge regression [closed]

I have run into several peculiarities of, and misunderstandings about, Vowpal Wabbit while trying to do online multiple-pass learning.

Specifically, I need to solve a ridge linear-regression problem with N = 4e6 points and a total of around K = 2.38e5 features. Each point's features are sparsely populated, usually between 10 and 100 per example. Since the data set is too large to fit in my PC's limited memory, I decided to use Vowpal Wabbit's out-of-core SGD. The features are in VW's text representation:
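For concreteness, a couple of lines of train.dat look roughly like this (the labels, feature indices, and values below are made up purely for illustration; the format is VW's standard label | feature:value layout):

    0.0010 | 12:0.5 87:1.2 305:0.033
    0.0036 | 7:1 19:0.25 44:0.8 1023:0.5

I run: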

vw -d train.dat -c -f train.model --passes 10 --loss_function squared \
    --l2 0.00001 -l 0.05

Num weight bits = 18
learning rate = 0.05
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using l2 regularization = 1e-05
using cache_file = vw/Train_0.cache
ignoring text input in favor of cache input
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.028051   0.028051            3         3.0   0.0010   0.0070       24
0.089131   0.150212            6         6.0   0.0036   0.0256       20
0.131556   0.182465           11        11.0   0.0076   0.0297       47
0.102226   0.072896           22        22.0   0.0052   0.0368       16
0.095826   0.089426           44        44.0   0.0539   0.1463       64
0.094730   0.093608           87        87.0   0.0058   0.0605        8
0.084809   0.074887          174       174.0   0.0177   0.0914       13
0.073454   0.062099          348       348.0   0.2019   0.1518       18
0.065183   0.056912          696       696.0   0.0044   0.1866       20
0.061144   0.057106         1392      1392.0   0.8041   0.2107       12
0.060112   0.059079         2784      2784.0   0.0454   0.1215       15
0.054327   0.048543         5568      5568.0   0.0875   0.0016       58
0.052169   0.050009        11135     11135.0   0.0194   0.0785       41
0.048767   0.045365        22269     22269.0   0.6126   0.1835       42
0.046333   0.043900        44537     44537.0   0.0149   0.0000       55
0.045165   0.043997        89073     89073.0   0.4736   0.1643       15
0.043814   0.042463       178146    178146.0   0.0561   0.0189       51
0.042698   0.041581       356291    356291.0   0.0161   0.0412       54
0.042067   0.041436       712582    712582.0   0.0012   0.1215       25
0.041751   0.041435      1425163   1425163.0   0.0540   0.0677       33
0.042095   0.042439      2850326   2850326.0   0.1088   0.1141       41
0.000000   0.000000      5700651   5700651.0   0.0028  -0.0000       43 h
0.000000   0.000000     11401301  11401301.0   0.0006   0.1667       32 h
0.000000   0.000000     22802601  22802601.0   0.0534   0.1617       12 h

finished run
number of examples per pass = 3702833
passes used = 10
weighted example sum = 3.70283e+07
weighted label sum = 5.96004e+06
average loss = 0 h
best constant = 0.160959
total feature number = 1244815410

The h at the right end of the last three rows indicates that the loss is computed on the holdout validation subset of the data. Why am I getting an average loss of 0 when the holdout predictions are evidently not perfect, as shown (again) by the gap between current label and current predict in those last three rows? This oddity obscures whether or not my algorithm has converged…
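(As an aside: if I read the option list correctly, multi-pass holdout evaluation can be disabled with --holdout_off, and the sampling rate, every 10th example by default I believe, can be changed with --holdout_period. So the same run without holdout would be something like:

    vw -d train.dat -c -f train.model --passes 10 --loss_function squared \
        --l2 0.00001 -l 0.05 --holdout_off

but that of course only hides the symptom rather than explaining it.)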

I can bypass this obscurity with a workaround: ignore vw's --passes option and perform the "passes" manually, with an IO-intensive shuffle of the data before each "pass":

for i in {1..10}
do
    # Custom script that shuffles train.dat in place; the -k below then
    # forces vw to rebuild its cache from the freshly shuffled data
    ./shuffledata train.dat train.dat
    if [ "$i" -eq 1 ]; then
        vw -d train.dat -c -k --l2 0.00001 -l 0.05 -f train.model
    else
        vw -d train.dat -c -k --l2 0.00001 -l 0.05 -i train.model -f train.model
    fi
done

However, both of the above approaches (with the --passes option, and without --passes but with the shuffling script) perform almost exactly the same as the following single-pass snippet:

vw -d train.dat -c --l2 0.00001 -l 0.05 -f train.model

In other words, it is as if multiple passes are not needed at all for "convergence." These experiences lead me to my questions:

  1. Why does multiple-pass training (shuffled and non-shuffled) give the same result as a single pass in an online SGD setting?
  2. Why is my holdout error 0?
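
(For anyone wanting to reproduce the comparison: each saved model can be scored in test-only mode on a held-out file, where test.dat stands for a hypothetical held-out split that vw never trained on:

    vw -d test.dat -t -i train.model --loss_function squared

The average loss reported by such a test-only run gives a direct way to compare the models.)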
