Wikipedia defines overfitting as occurring when a model describes random error or noise instead of the underlying relationship. And, yet, as I write this, the Blogger editor is telling me that "overfitting" is a spelling error. Shouldn't a nontrivial Wikipedia page be enough to confirm that "overfitting" is, in fact, a correct spelling? 'course, I know from experience that this sort of question is easy to state but difficult to resolve properly...
Thursday, May 19, 2011
Someone on MetaOptimize asked an interesting question: "How to use L1 regularization with L-BFGS". One might rephrase this to ask whether L1 regularization can be achieved with generally available optimization software. My impression of the discussion is that the best answer can be found in an ECML paper by Mark Schmidt, Glenn Fung, and Romer Rosales.
One standard implementation of L-BFGS is scipy.optimize.lbfgsb (the fmin_l_bfgs_b routine), which supports only simple bound (box) constraints on individual variables. One of the approaches Schmidt et al. suggest is the classic formulation of L1 regularization, which imposes an L1 constraint on the parameter vector (Equation 6 in the ECML paper). Unfortunately, this cannot be used with scipy.optimize.lbfgsb, because a bound on the L1 norm couples all of the variables and so is not a box constraint. However, by splitting each variable into positive and negative components and using the Lagrangian form, one arrives at a smooth objective whose only constraints are nonnegativity bounds, a formulation for which the scipy.optimize.lbfgsb implementation is well-suited. See Equation 7 of the ECML paper.
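To make the variable-splitting idea concrete, here is a minimal sketch in Python. It is my own illustration in the spirit of that formulation, not code from the paper. It assumes a least-squares loss, and the names l1_least_squares, w_pos, w_neg, and lam are hypothetical. Each weight is written as w = w_pos - w_neg with both parts bounded below by zero, so the L1 penalty becomes the plain sum of the split variables.

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b


def l1_least_squares(X, y, lam):
    """Minimize 0.5*||X w - y||^2 + lam*||w||_1 with L-BFGS-B.

    Assumption for illustration: a least-squares loss. Each weight is
    split as w = w_pos - w_neg with w_pos >= 0 and w_neg >= 0, so the
    penalty lam*||w||_1 becomes the smooth term lam*sum(w_pos + w_neg).
    """
    n_features = X.shape[1]

    def objective(z):
        # z stacks the positive parts first, then the negative parts.
        w_pos, w_neg = z[:n_features], z[n_features:]
        residual = X @ (w_pos - w_neg) - y
        value = 0.5 * residual @ residual + lam * z.sum()
        grad_w = X.T @ residual
        # Chain rule: dw/dw_pos = +1 and dw/dw_neg = -1.
        grad = np.concatenate([grad_w + lam, -grad_w + lam])
        return value, grad

    z0 = np.zeros(2 * n_features)
    bounds = [(0.0, None)] * (2 * n_features)  # nonnegativity only
    z_opt, _, _ = fmin_l_bfgs_b(objective, z0, bounds=bounds)
    return z_opt[:n_features] - z_opt[n_features:]


# With an identity design matrix the exact minimizer is soft-thresholding,
# so the result should be close to [2, 0, -1].
w = l1_least_squares(np.eye(3), np.array([3.0, 0.5, -2.0]), lam=1.0)
print(w)
```

Because the objective returns both the value and the gradient, fmin_l_bfgs_b needs no separate fprime argument. The price of the split is that the number of variables doubles.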
Tuesday, January 25, 2011
Maybe it's bitterness from grad school <g>, but I've long held the belief that gradient-descent-style algorithms are often the most efficient approach for parameter estimation.
The topic of maximum entropy parameter estimation just came up on the scikit-learn mailing list, which I had subscribed to after noticing a relevant cross-post on the scipy mailing list. Scikit-learn provides implementations of a variety of machine learning algorithms built on scipy and numpy. List members were debating whether the maximum entropy estimation code should be preserved. I noted that maximum entropy yields the same models as maximum likelihood estimation of exponential (log-linear) models.
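For readers who have not seen why the two coincide, here is the standard argument in brief. The notation is my own: f_i are feature functions, \tilde{p} is the empirical distribution, and \lambda_i are the Lagrange multipliers, which double as the model weights.

```latex
% Maximum entropy: match empirical feature expectations.
\max_{p}\; -\sum_x p(x)\log p(x)
\quad\text{s.t.}\quad
\sum_x p(x) f_i(x) = \sum_x \tilde{p}(x) f_i(x)\ \ \forall i,
\qquad \sum_x p(x) = 1.

% Stationarity of the Lagrangian gives an exponential-form solution.
p_\lambda(x) = \frac{1}{Z(\lambda)} \exp\Big(\sum_i \lambda_i f_i(x)\Big),
\qquad
Z(\lambda) = \sum_x \exp\Big(\sum_i \lambda_i f_i(x)\Big).

% Log-likelihood of that same family and its gradient.
L(\lambda) = \sum_x \tilde{p}(x) \log p_\lambda(x),
\qquad
\frac{\partial L}{\partial \lambda_i}
= \mathbb{E}_{\tilde{p}}[f_i] - \mathbb{E}_{p_\lambda}[f_i].
```

The gradient of the log-likelihood vanishes exactly when the model's feature expectations match the empirical ones, which is the maximum entropy constraint. So the maximum likelihood weights and the maximum entropy solution pick out the same model.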
Later someone asked why different techniques were used for maximum entropy versus logistic regression. My opinion is that good gradient-type solvers are difficult to implement, and that it's easier to get published if a new approach comes with its own, new estimation method. I cited "Optimization with EM and Expectation-Conjugate-Gradient" as an example. Another list member offered a more relevant example, "A comparison of algorithms for maximum entropy parameter estimation", where improved iterative scaling (IIS) and generalized iterative scaling (GIS) are shown to be less efficient than gradient-based methods.