comparison of SGD and ALS in collaborative filtering

Matrix factorization is widely applied in collaborative filtering, and briefly speaking, it tries to learn the following parameters:

And we could apply SGD and ALS as the learning algorithm, however, as I read here, they said,

SGD is not practical if the dataset size is huge, instead ALS is better.

I wonder why SGD is not good if the dataset is huge? I thought even if it’s huge, we could use mini-batch SGD, which is the widely adopted way to train large nerual nets, isn’t it?


By SGD, each time we only use one data point r_{ui}, and only optimize one part of the entire loss, i.e. optimize (r_{ui} – q_u^Tp_i)^2, so we update q_u using the gradient of this part, resulting \tilde{q_u}.

Absolutely \tilde{q_u} will optimize (r_{ui} – q_u^Tp_i)^2, but it might worsen other parts of the entire loss, say (r_{uj} – q_u^Tp_j)^2, I mean both r_{ui} and r_{un} involves q_u.

Considering the above case, how could we guarantee that SGD will converge?


Both SGD and ALS are very practical for matrix factorization,

Yehuda Koren, a winner of the Netflix prize (see here) and a pioneer in Matrix factorization techniques for CF, worked at Yahoo labs at the time, and was a part of the development of a CF model for Yahoo.

Reading through Yahoo labs’ publications (for example here and here), it is easy to see that they are using SGD heavily, and we can only assume that the same holds for production systems.

Matrix factorization is often done on a matrix of user_featurexmovie_features (instead of matrices of usersxmovies) because of the cold-start issue, making the argument mentioned in the link less relevant.

SGD also has the upper hand regarding dealing with missing data, which is a fairly common scenario.

To sum up, SGD is a very common method for CF, and I see no reason why it cannot be applied on large data sets.

Source : Link , Question Author : avocado , Answer Author : Uri Goren

Leave a Comment