Friday, January 15, 2010

Stochastic Gradient Descent

Most machine learning papers which evaluate an algorithm do so in a batch learning setting. The data is split into two chunks: the first is used to train the model; the second is used to evaluate it. But the more I sit thinking about how to apply learning algorithms to the real world, the more I find this setting inapplicable. Typically, there is so much data, and the "world" changes sufficiently often, that it is difficult to construct a training set appropriate for batch learning given the time constraints imposed in practice. Of course, there is an entire subdiscipline devoted to the problem of learning in an iterative fashion: online learning.

When text classification was my research topic in grad school, one of my favorite papers was Text categorization based on regularized linear classification methods by Tong Zhang and Frank J. Oles. I felt, and still feel, that this paper showed engineering acumen that is so often missing from research papers, many of which emphasize novel, yet fruitless ideas. The Zhang and Oles paper takes well-known, mathematically simple ideas and shows how they can be applied to achieve state-of-the-art performance.

My search for practical, effective online learning algorithms led me to Léon Bottou and his advocacy of stochastic gradient descent. His style reminds me a bit of the Zhang and Oles paper. SGD has been around for decades, yet here he is showing that SGD trains multiple orders of magnitude faster than batch-style learners, with no loss in test performance. And the theory supports the evidence. Traditionally, we consider two types of error: (1) the error due to limitations of our model class, and (2) the error due to using a sample of data rather than the full joint distribution. Bottou points out a third error we should consider: optimization error, the difference between the empirical loss of our optimized parameter vector and the minimum empirical loss. SGD allows us to quickly find a parameter vector with empirical loss close to the minimum. What SGD cannot do is quickly find the empirical loss minimizer to a high degree of precision. But since we are already making model class and sample approximations, there is no need to expend great effort eliminating as much optimization error as possible.
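The update itself is simple: follow the regularized gradient at a single example rather than over the full training set. Here is a minimal sketch for L2-regularized logistic regression; the function and parameter names are mine, not Bottou's, and this is an illustration rather than his reference implementation:

```python
import math
import random

def sgd(data, lam=1e-4, eta0=0.1, epochs=5):
    """Stochastic gradient descent for L2-regularized logistic regression.

    data: list of (x, y) pairs, x a list of floats, y in {-1, +1}.
    Each update uses the gradient at a single example, rather than the
    gradient over the whole training set as in batch gradient descent.
    """
    dim = len(data[0][0])
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            t += 1
            # Decaying learning rate; algebraically the same form as
            # 1 / (lam * (t + t0)) with t0 = 1 / (lam * eta0).
            eta = eta0 / (1.0 + eta0 * lam * t)
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # Gradient of log(1 + exp(-margin)) wrt the margin, times y.
            g = -y / (1.0 + math.exp(margin))
            # L2 shrinkage plus the per-example loss gradient.
            w = [(1.0 - eta * lam) * wi - eta * g * xi
                 for wi, xi in zip(w, x)]
    return w
```

Each pass touches every example exactly once, so the cost per epoch is linear in the data size, with no full-dataset gradient ever computed.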

The main caveat I see to using stochastic gradient descent for online learning is setting the learning rate schedule. Even for batch-style SGD, setting the learning rate sounds nontrivial. Bottou provides some tips in his Tradeoffs lecture:

  • Set the learning rate to η/(t + t0) where η = 1/λ (λ is the regularization parameter), t is the example index, and t0 is set so that initial updates are of the same magnitude as the final size of the weights.
  • If weight magnitudes are difficult to estimate, he suggests choosing t0 so that the initial learning rate matches the rate one would pick for the first step of gradient descent on a small batch of examples.
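The two tips can be sketched as a pair of one-liners. The function names are my own, and Bottou's lecture does not prescribe these exact helpers:

```python
def learning_rate(t, lam, t0):
    """First tip: eta_t = eta / (t + t0) with eta = 1/lam,
    i.e. eta_t = 1 / (lam * (t + t0))."""
    return 1.0 / (lam * (t + t0))

def calibrate_t0(lam, eta_first):
    """Second tip: pick t0 so the very first step (t = 0) uses eta_first,
    a rate found by trial gradient steps on a small batch of examples."""
    return 1.0 / (lam * eta_first)
```

By construction, `learning_rate(0, lam, calibrate_t0(lam, eta_first))` equals `eta_first`, and the rate then decays as 1/t.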
What the above tips do not address is the online/streaming scenario, where one never wants the learning rate to become so small that the model has difficulty reacting to changes in the underlying distribution. It seems one could simply add a constant to the learning rate. Selecting this constant is not trivial, but it should not be difficult to do by observing the magnitude of the per-example change in the weights relative to the magnitude of the weights themselves.
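That suggestion amounts to adding a floor term to the decaying schedule. A sketch, where `eta_floor` is the hypothetical constant discussed above:

```python
def nonvanishing_rate(t, lam, t0, eta_floor):
    """Decaying schedule plus a constant: as t grows the rate approaches
    eta_floor instead of zero, so the model retains enough plasticity to
    track a drifting distribution."""
    return eta_floor + 1.0 / (lam * (t + t0))
```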


  1. Can you give a reference which discusses the effects of scaling and shifting the input for gradient descent? Something like the one discussed in the following Coursera lecture -

  2. I can say that scaling is intimately related to the learning rate. If you multiply your data by 2, you need to halve your learning rate. I looked at Hinton's intro video, but am not sure what you are referring to. Can you be more specific?