Wikipedia defines overfitting as occurring when a model describes random error or noise instead of the underlying relationship. And, yet, as I write this, the Blogger editor is telling me that "overfitting" is a spelling error. Shouldn't a nontrivial Wikipedia page be enough to confirm that "overfitting" is, in fact, a correct spelling? 'course, I know from experience that this sort of question is easy to state but difficult to resolve properly...

## Wednesday, July 20, 2011

## Thursday, May 19, 2011

### L1 Regularization

Someone on MetaOptimize asked an interesting question: How to use L1 regularization with L-BFGS. One might rephrase this question to ask whether L1 regularization can be achieved with generally available optimization software. My impression of the discussion is that the best answer can be found in a paper by Mark Schmidt, Glenn Fung, and Romer Rosales.

One standard implementation of L-BFGS is SciPy.optimize.lbfgsb which allows for relatively simple inequality constraints. One of the approaches Schmidt, et al. suggest is the classic formulation of L1 regularization which imposes an L1 constraint on the parameter vector (Equation 6 in the ECML paper). Unfortunately, this cannot be used with SciPy.optimize.lbfgsb. However, by splitting each variable into positive and negative components and using the Lagrangian form, one arrives at a formulation for which the SciPy.optimize.lbfgsb implementation is well-suited. See Equation 7 of the ECML paper.

## Tuesday, January 25, 2011

### Fast Maximum Entropy?

Maybe it's bitterness from grad school <g>, but I've long held the belief that gradient-descent-style algorithms are often the most efficient approach for parameter estimation.

The topic of maximum entropy parameter estimation just came up on the scikit-learn mailing list to which I had just subscribed after noticing a relevant cross-post on the scipy mailing list. Scikit-learn provides a variety of machine learning algorithm implementations based on scipy and numpy. List members were debating whether maximum entropy estimation code should be preserved. I noted that maximum entropy yields the same models as maximum likelihood estimated exponential models.

Later someone asked why different techniques were used for maximum entropy versus logistic regression. My opinion is that good gradient-type solvers are difficult to implement and it's easier to get published if a new approach comes with its own, new estimation method. I cited Optimization with EM and Expectation-Conjugate-Gradient as an example. Another list member offered a more relevant example, A comparison of algorithms for maximum entropy parameter estimation, where IIS and GIS are shown to be less efficient than gradient-based methods.

## Friday, October 22, 2010

### Pareto Distribution

I didn't realize that the standard heavy-tailed distribution is the Pareto Distribution. It's a Heavy-tailed distribution which follows the Power Law. It is commonly used to describe wealth distributions.

**Update (10/25/10)**: Another distribution I didn't know about is the negative multinomial distribution. Seems like I would have come across this with all of my work in text classification. Then again, length never seemed to be that useful in determining classification (esp. compared to words). And, I bet even the single failure negative multinomial had too-light of a tail to model actual document lengths accurately.

## Wednesday, October 20, 2010

### Open Source Machine Learning Software

I've just learned about mlpy which is open source python machine learning software. It includes a good variety of machine learning algorithms including SVM, k-NN, LASSO, ridge regression, k-mean clustering, k-median clustering, wavlet transforms, and resampling methods. It uses numpy as one would expect. But, I see no mention of the use of sparse matrices, so I suspect that the library would not scale well to large feature matrices, which is typically essential in my work.

It looks like a good resource for open source ML software is MLoss, which organized a NIPS workshop.

## Wednesday, October 13, 2010

### A Better Lady Tasting Tea?

I recently learned of the famous Lady Tasting Tea hypothesis test experiment. When I read of Rod Sturdivant's account of the experiment, it struck me that the experiment was not designed to maximize the amount of learned information for a fixed number of cups.

The designer of the experiment, R. A. Fisher, was trying to determine whether a lady could tell whether milk was poured before or after tea was poured into a cup. To test, he presented her with four cups poured milk-first and four cups poured tea-first in random order. This results in 70 possibilities and a 1.4%-chance of correctly determining the order given no ability. A single mistake on the part of the lady would make it impossible to reject the no-ability hypothesis with reasonable (p=0.05) confidence.

If Fisher had instead flipped 8 coins, filling milk first if head and tea first if tails, then there would have been 256 possibilities with a 0.4%-chance (`1/256`) of guessing all 8 correct given no ability and a 3.9%-chance (`9/256`) of guessing 7-or-8 correct with no ability. So, with this method, a single mistake would have still allowed one to reject the no-ability hypothesis.

Of course, the coin-flip test is technically more difficult because of the need to make absolute determinations, rather than merely determining an ordering of the cups, as is the case in the permutation test.

## Friday, August 6, 2010

### Drawing Points from a Uniform Distribution

Everyone know how to draw points from a uniform distribution, right? Well, what if you want to draw points uniformly over a circle? You can draw from the bounding square and discard points outside the circle. But, this doesn't work for the *n*-sphere as *n* grows large (since the expected number of pseudo-draws increases without bound). The question of drawing uniformly from an elipse recently arose on the scipy mailing list and one participant posted an excellent discussion of how to make it work without discarding any draws. One of the key points is that you need (1) a parameterization, and (2) a transformation of that parameterization which yields a constant Jacobian.