Machine Learning Quirks (Jason)

<h3>Is Overfitting a Word? (2011-07-20)</h3>
<p>
Wikipedia defines <a href="http://en.wikipedia.org/wiki/Overfitting">overfitting</a> as occurring when a model describes random error or noise instead of the underlying relationship. And, yet, as I write this, the Blogger editor is telling me that "overfitting" is a spelling error. Shouldn't a nontrivial Wikipedia page be enough to confirm that "overfitting" is, in fact, a correct spelling? 'course, I know from experience that this sort of question is easy to state but difficult to resolve properly...
</p>
<h3>L1 Regularization (2011-05-19)</h3>
<p>Someone on <a href="http://metaoptimize.com/qa">MetaOptimize</a> asked an interesting question: <a href="http://metaoptimize.com/qa/questions/1334/how-to-optimize-l1-regularized-objective-function-with-l-bfgs-can-r-implementation-do-it">How to use L1 regularization with L-BFGS</a>. One might rephrase this question to ask whether L1 regularization can be achieved with generally available optimization software. My impression of the discussion is that the best answer can be found in a <a href="http://pages.cs.wisc.edu/~gfung/GeneralL1/">paper by Mark Schmidt, Glenn Fung, and Romer Rosales</a>.<br />
</p><p>One standard implementation of L-BFGS is <a href="http://www.scipy.org/doc/api_docs/SciPy.optimize.lbfgsb.html">SciPy.optimize.lbfgsb</a>, which supports simple bound constraints on the variables. One of the approaches Schmidt, et al. discuss is the classic formulation of L1 regularization, which imposes an L1 constraint on the parameter vector (Equation 6 in the <a href="http://www.cs.ubc.ca/~schmidtm/Software/L1General/L1General.pdf">ECML paper</a>). Unfortunately, this cannot be expressed with bound constraints alone, so it cannot be used with SciPy.optimize.lbfgsb directly. However, by splitting each variable into positive and negative components and using the Lagrangian form, one arrives at a formulation for which the SciPy.optimize.lbfgsb implementation is well-suited; see Equation 7 of the <a href="http://www.cs.ubc.ca/~schmidtm/Software/L1General/L1General.pdf">ECML paper</a>.<br />
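To make the splitting trick concrete, here is a minimal sketch in plain Python. Two caveats: I substitute a simple projected-gradient loop for the bound-constrained L-BFGS (so it runs without SciPy), and the data and function name are invented for illustration. The point is only how the non-smooth penalty <code>λ|w|</code> becomes the smooth linear term <code>λ(p + n)</code> once you write <code>w = p - n</code> with <code>p, n ≥ 0</code>.

```python
# Sketch of the positive/negative variable-splitting trick for L1-regularized
# least squares: minimize 0.5*||Xw - y||^2 + lam*||w||_1 by writing w = p - n
# with p, n >= 0, so the penalty becomes the smooth term lam*sum(p + n).
# A projected-gradient loop stands in for a bound-constrained L-BFGS here.

def l1_via_splitting(X, y, lam, lr=0.01, steps=5000):
    d = len(X[0])
    p = [0.0] * d  # positive parts of w
    n = [0.0] * d  # negative parts of w
    for _ in range(steps):
        w = [pj - nj for pj, nj in zip(p, n)]
        # residuals of the least-squares data term
        r = [sum(xj * wj for xj, wj in zip(row, w)) - yi
             for row, yi in zip(X, y)]
        for j in range(d):
            g = sum(ri * row[j] for ri, row in zip(r, X))  # d(data term)/dw_j
            # gradient w.r.t. p_j is g + lam; w.r.t. n_j is -g + lam;
            # max(0, ...) projects back onto the bound constraints
            p[j] = max(0.0, p[j] - lr * (g + lam))
            n[j] = max(0.0, n[j] - lr * (-g + lam))
    return [pj - nj for pj, nj in zip(p, n)]

# toy data: y depends only on the first feature; the second is noise
X = [[1.0, 0.3], [2.0, -0.2], [3.0, 0.1], [4.0, -0.4]]
y = [2.0, 4.0, 6.0, 8.0]
w = l1_via_splitting(X, y, lam=0.5)  # the noise weight is driven exactly to zero
```

Note how the projection step produces exact zeros, which is the hallmark of L1 regularization that a smooth approximation of <code>|w|</code> would lose.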
</p>

<h3>Fast Maximum Entropy? (2011-01-25)</h3>
<p>Maybe it's bitterness from grad school &lt;g&gt;, but I've long held the belief that gradient-descent-style algorithms are often the most efficient approach for parameter estimation.<br />
</p><p>The topic of maximum entropy parameter estimation just came up on the <a href="https://lists.sourceforge.net/lists/listinfo/scikit-learn-general">scikit-learn mailing list</a> to which I had just subscribed after noticing a relevant cross-post on the scipy mailing list. <a href="http://scikit-learn.sourceforge.net/">Scikit-learn</a> provides a variety of machine learning algorithm implementations based on scipy and numpy. List members were <a href="https://sourceforge.net/mailarchive/forum.php?thread_name=1295941214.26559.2.camel%40Nokia-N900-51-1&forum_name=scikit-learn-general">debating whether maximum entropy estimation code should be preserved</a>. I noted that <a href="http://www.cs.cmu.edu/afs/cs/user/aberger/www/html/tutorial/node8.html">maximum entropy yields the same models as maximum likelihood estimated exponential models</a>.<br />
</p><p>Later someone asked why different techniques were used for maximum entropy versus logistic regression. My opinion is that good gradient-type solvers are difficult to implement and it's easier to get published if a new approach comes with its own, new estimation method. I cited <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.83.649">Optimization with EM and Expectation-Conjugate-Gradient</a> as an example. Another list member offered a more relevant example, <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.134.7137">A comparison of algorithms for maximum entropy parameter estimation</a>, where IIS and GIS are shown to be less efficient than gradient-based methods.<br />
</p>

<h3>Pareto Distribution (2010-10-22)</h3>
<p>I didn't realize that the standard heavy-tailed distribution is the <a href="http://en.wikipedia.org/wiki/Pareto_distribution">Pareto distribution</a>. It's a <a href="http://en.wikipedia.org/wiki/Heavy-tailed_distribution">heavy-tailed distribution</a> which follows a <a href="http://en.wikipedia.org/wiki/Power_Law">power law</a>, and it is commonly used to describe wealth distributions.<br />
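Incidentally, drawing from a Pareto is a one-liner via the inverse CDF: if <tt>U</tt> is uniform on <tt>(0,1]</tt>, then <tt>x<sub>m</sub>/U<sup>1/α</sup></tt> is Pareto-distributed. A sketch (the function name is just for illustration):

```python
# Inverse-CDF sampling for a Pareto(alpha, x_m) distribution.
import random

def pareto_draw(alpha, x_m=1.0):
    u = 1.0 - random.random()  # uniform on (0, 1]; avoids division by zero
    return x_m / u ** (1.0 / alpha)

random.seed(0)
draws = [pareto_draw(3.0) for _ in range(50000)]
mean = sum(draws) / len(draws)  # expect alpha*x_m/(alpha-1) = 1.5 for alpha=3
```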
</p><p><b>Update (10/25/10)</b>: Another distribution I didn't know about is the <a href="http://en.wikipedia.org/wiki/Negative_multinomial_distribution">negative multinomial distribution</a>. It seems like I would have come across it in all of my work on text classification. Then again, document length never seemed to be very useful for classification (especially compared to words). And I bet even the single-failure negative multinomial has too light a tail to model actual document lengths accurately.<br />
</p>

<h3>Open Source Machine Learning Software (2010-10-20)</h3>
<p>I've just learned about <a href="https://mlpy.fbk.eu/">mlpy</a>, an open source Python machine learning library. It includes a good variety of machine learning algorithms, including SVM, k-NN, LASSO, ridge regression, k-means clustering, k-median clustering, wavelet transforms, and resampling methods. It uses numpy, as one would expect. But I see no mention of sparse matrices, so I suspect the library would not scale well to the large feature matrices that are typical in my work.<br />
</p><p>It looks like a good resource for open source ML software is <a href="http://mloss.org/">MLoss</a>, which organized a <a href="http://mloss.org/workshop/nips08/">NIPS workshop</a>.<br />
</p>

<h3>A Better Lady Tasting Tea? (2010-10-13)</h3>
<p>I recently learned of the famous <a href="http://en.wikipedia.org/wiki/Lady_tasting_tea">Lady Tasting Tea</a> hypothesis-test experiment. When I read <a href="http://www.dean.usma.edu/math/people/sturdivant/images/MA376/dater/ladytea.pdf">Rod Sturdivant's account of the experiment</a>, it struck me that the experiment was not designed to maximize the amount of information learned for a fixed number of cups.<br />
</p><p>The designer of the experiment, <a href="http://en.wikipedia.org/wiki/Ronald_Fisher">R. A. Fisher</a>, was trying to determine whether a lady could tell whether milk was poured before or after tea was poured into a cup. To test this, he presented her with four cups poured milk-first and four cups poured tea-first, in random order. This yields 70 possible orderings and a 1.4% chance (<tt>1/70</tt>) of correctly determining the order given no ability. A single mistake on the part of the lady would make it impossible to reject the no-ability hypothesis with reasonable (p=0.05) confidence.<br />
</p><p>If Fisher had instead flipped 8 coins, filling milk first on heads and tea first on tails, then there would have been 256 possibilities, with a 0.4% chance (<tt>1/256</tt>) of guessing all 8 correctly given no ability and a 3.5% chance (<tt>9/256</tt>) of guessing 7-or-8 correctly with no ability. So, with this method, a single mistake would have still allowed one to reject the no-ability hypothesis.<br />
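The arithmetic is easy to check by enumeration; this little snippet reproduces the numbers above:

```python
# Quick check of the guessing probabilities for both designs.
from math import comb

# Fisher's design: the lady splits 8 cups into 4 milk-first and 4 tea-first.
orderings = comb(8, 4)       # 70 equally likely labelings
p_perfect = 1 / orderings    # ~1.4% by pure guessing

# Coin-flip design: each of the 8 cups is labeled independently.
outcomes = 2 ** 8            # 256
p_all_8 = 1 / outcomes       # ~0.4%
p_7_or_8 = (comb(8, 7) + comb(8, 8)) / outcomes  # 9/256, ~3.5%
```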
</p><p>Of course, the coin-flip test is technically more difficult because of the need to make absolute determinations, rather than merely determining an ordering of the cups, as is the case in the permutation test.<br />
</p>

<h3>Drawing Points from a Uniform Distribution (2010-08-06)</h3>
<p>Everyone knows how to draw points from a uniform distribution, right? Well, what if you want to draw points uniformly over a circle? You can draw from the bounding square and discard points outside the circle. But this doesn't work for the <a href="http://en.wikipedia.org/wiki/N-sphere"><i>n</i>-sphere</a> as <i>n</i> grows large (the expected number of pseudo-draws per accepted point increases without bound). The question of drawing uniformly from an ellipse recently arose on the scipy mailing list, and one participant posted an excellent discussion of <a href="http://sagenb.org/home/pub/2336/">how to make it work without discarding any draws</a>. One of the key points is that you need (1) a parameterization and (2) a transformation of that parameterization which yields a constant Jacobian.<br />
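Here is a sketch of both approaches for the 2-D case, the unit disk. The function names are my own; the direct method uses the square-root-of-uniform radius, which is exactly the constant-Jacobian transformation mentioned above.

```python
# Rejection sampling from the bounding square vs. a direct draw via a
# transformed parameterization (uniform angle, radius = sqrt(u)).
import math
import random

def disk_rejection():
    # Expected number of tries is 4/pi in 2-D, but the analogous ratio for
    # the n-ball blows up as n grows, which is the problem noted above.
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1.0:
            return x, y

def disk_direct():
    theta = random.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(random.random())  # sqrt makes the density uniform in area
    return r * math.cos(theta), r * math.sin(theta)

random.seed(0)
pts = [disk_direct() for _ in range(20000)]
mean_r = sum(math.hypot(x, y) for x, y in pts) / len(pts)  # expect E[r] = 2/3
```

A quick sanity check on the direct method: under a uniform density on the disk, the radius has density <tt>2r</tt>, so its mean is <tt>2/3</tt>, which the empirical average above should approximate.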
</p>

<h3>Sort by Rating (2010-01-26)</h3>
<p>A plea to <a href="http://www.overstock.com/">Overstock</a> (as well as any other web retailer which is willing to listen):</p>
<blockquote>
<p>
When I view a page of items from a category, I can select various "Sort By" options, such as "Top Sellers" and "Review Ratings". When I'm not looking for something specific, I would like to know what items are both popular and highly regarded by other customers. Neither "Top Sellers" nor "Review Ratings" does this. "Top Sellers" ignores ratings and a product with a single 5-star rating can be the top product according to "Review Ratings".
</p>
<p>
I would like to suggest an improvement to the "Review Ratings" sort which would make it more useful. Currently, you sort by average rating, i.e. sum-of-ratings divided by number-of-ratings. If you add a fictitious 3-star rating to this calculation (sum-of-ratings-plus-3 divided by number-of-ratings-plus-1), the sort becomes more valuable, as only products which are popular AND receive top reviews will make it to the top of the list.
</p>
</blockquote>
<p>
'course, anyone with a hint of machine learning background will recognize this as a very basic Bayesian posterior estimate. <a href="http://www.imdb.com/chart/top">IMDB</a> is one of the few sensible web sites which uses this sort of estimate for sorting based on rating.
</p>
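For the curious, here is a minimal sketch of the smoothed average; the product data is made up for illustration.

```python
# Add one fictitious 3-star rating before averaging: a crude Bayesian
# posterior mean with a weak prior centered at 3 stars.

def smoothed_rating(ratings, prior_rating=3.0, prior_count=1):
    return (sum(ratings) + prior_rating * prior_count) / (len(ratings) + prior_count)

products = {
    "lone 5-star review": [5],
    "popular and well-liked": [5, 4, 5, 5, 4, 5, 5, 4],
}
# Plain averaging ranks the single-review product first (5.0 vs. 4.625);
# the smoothed average ranks the popular product first (~4.44 vs. 4.0).
ranked = sorted(products, key=lambda name: smoothed_rating(products[name]),
                reverse=True)
```

Raising <code>prior_count</code> strengthens the prior, demanding more reviews before a product can climb the list.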
<p>
<b>Update (3/5/10)</b>: As Ashwin noted in the comments, <a href="http://www.evanmiller.org/how-not-to-sort-by-average-rating.html">the Wilson score</a> may be an even better solution.</p>

<h3>Stochastic Gradient Descent (2010-01-15)</h3>
<p>
Most machine learning papers which evaluate an algorithm do so in a <i>batch learning</i> setting: the data is split in two, the first chunk is used to train the model, and the second chunk is used to evaluate it. But the more I sit thinking about how to apply learning algorithms to the real world, the more I find this setting inapplicable. Typically, there is so much data, and the "world" changes sufficiently often, that it is difficult to construct a training set appropriate for batch learning within the time constraints imposed in practice. Of course, there is an entire subdiscipline devoted to the problem of learning in an iterative fashion: <i>online learning</i>.
</p>
<p>
When text classification was my research topic in grad school, one of my favorite papers was <a href="http://citeseerx.ksu.edu.sa/viewdoc/summary?doi=10.1.1.20.5553">Text categorization based on regularized linear classification methods by Tong Zhang and Frank J. Oles</a>. I felt, and still feel, that this paper showed engineering acumen that is so often missing from research papers, many of which emphasize novel, yet fruitless ideas. The Zhang and Oles paper takes well-known, mathematically simple ideas and shows how they can be applied to achieve state-of-the-art performance.
</p>
<p>
My search for practical, effective online learning algorithms led me to <a href="http://leon.bottou.org/">Léon Bottou</a> and his support of <a href="http://leon.bottou.org/research/stochastic">stochastic gradient descent</a>. His style reminds me a bit of the Zhang and Oles paper. SGD has been around for decades, yet here he is showing that <a href="http://leon.bottou.org/talks/largescale">SGD trains multiple orders of magnitude faster than batch-style learners, with no loss on test performance</a>. And, the theory supports the evidence. Traditionally, we consider two types of error: (1) the error due to limitations on our model class, and (2) error due to using a sample of data rather than the full joint distribution. Bottou points out a third error we should consider: optimization error (the difference between the empirical loss of our optimized parameter vector and the minimum empirical loss). SGD allows us to quickly find a parameter vector with empirical loss close to the minimum. What SGD cannot do is quickly find the parameter vector for the empirical loss minimum to a high degree of precision. But, since we are already making model class and sample approximations, there is no need to expend great effort trying to eliminate as much optimization error as possible.
</p>
<p>
The main caveat I see to using <a href="http://leon.bottou.org/research/stochastic">stochastic gradient descent</a> for online learning is setting the learning-rate schedule. Even for batch-style SGD, setting the learning rate sounds like it is not trivial. Bottou provides some tips in his <a href="http://leon.bottou.org/talks/largescale">Tradeoffs</a> lecture:
<ul>
<li>Set the learning rate to <code>η/(t + t<sub>0</sub>)</code>, where <code>η = 1/λ</code> (<code>λ</code> is the regularization parameter), <code>t</code> is the example index, and <code>t<sub>0</sub></code> is set so that initial updates are of the same magnitude as the final size of the weights.</li>
<li>If weight magnitudes are difficult to estimate, he suggests choosing <code>t<sub>0</sub></code> so that the initial learning rate matches the rate that would be chosen for the first step of gradient descent on a small batch of examples.</li>
</ul>
What the above tips do not address is the online/streaming scenario, where one never wants the learning rate to become so small that the model has difficulty reacting to changes in the underlying distribution. It seems that one could simply add a constant to the learning rate. Selection of this constant is not trivial either, but it should not be difficult to choose by observing the magnitude of the per-example change in the weights relative to the magnitude of the weights themselves.
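Here is a toy sketch of that schedule on a streaming least-squares problem. The constants (<code>t<sub>0</sub></code>, the rate floor) and the data are illustrative choices of mine, not values from Bottou's materials.

```python
# SGD with the decaying rate eta_t = 1/(lam*(t + t0)) plus an optional floor,
# so the model could keep adapting if the underlying distribution drifted.

def sgd(stream, lam=0.01, t0=1000.0, rate_floor=0.0):
    w, b = 0.0, 0.0
    for t, (x, y) in enumerate(stream):
        eta = max(1.0 / (lam * (t + t0)), rate_floor)
        err = (w * x + b) - y              # gradient factor for squared loss
        w -= eta * (err * x + lam * w)     # only w is regularized here
        b -= eta * err
    return w, b

# a stream whose underlying relationship is y = 3x + 1
stream = [((i % 7) / 7.0, 3.0 * ((i % 7) / 7.0) + 1.0) for i in range(20000)]
w, b = sgd(stream)
```

Passing a positive <code>rate_floor</code> is the "add a constant" idea from the paragraph above: the rate stops decaying once it hits the floor, at the cost of a persistent limit cycle around the optimum.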
</p>

<h3>Sam Roweis Suicide (2010-01-13)</h3>
<p>
I'm speechless.
</p>
<p>
<a href="http://gothamist.com/2010/01/13/professor_jumps_to_death.php">NYU Professor Jumps To Death</a>
</p>
<p>
<b>Update (1/14):</b> It sounds like the strain of caring for extremely premature, one-year-old twin girls may have been the main reason for Sam's suicide. See the <tt>fr_of_sams</tt> comment a page down in <a href="http://www.nydailynews.com/forums/thread.jspa?threadID=84311">this NY Daily News thread</a>:
<blockquote>
Their children were twin girls. Born over one year ago. Severely premature: 24 weeks gestation. Severely disabled. Require 24-hour nursing care for their entire lives. Cannot eat, and will be on feeding tubes for the rest of their lives. Constant medical treatment and medical crises. Maybe someday they will walk. Maybe not. They will never really talk. The financial pressures on this couple were severe: they had already used up all insurance benefit limits, and burnt through their own personal savings with their retirement savings following rapidly. We should put the blame for Sam's death where it belongs. We have the technology to keep little girls like that alive. We have a society that pretty much insists upon doing so. But we put the financial and organizational and emotional burden entirely on the parents. We allow insurance companies to cap benefits, and we have no social system for rescuing people from these straights until they are literally homeless and bereft.
</blockquote>
I had only met Sam a handful of times at conferences and certainly can't confirm or deny any of these details, but I don't know what else could have caused him to take his own life...
</p>
<p>
See <a href="http://hunch.net/?p=1172">John Langford's post</a> and <a href="http://samroweis1972-2010.blogspot.com/">the Sam Roweis memoriam blog</a>.
</p>

<h3>Smooth Absolute aka Huber Loss (2009-10-28)</h3>
<p>While working on my thesis, I was looking for a way to get some of the beneficial properties of loss/penalty functions like the hinge and <a href="http://mathworld.wolfram.com/L1-Norm.html">L1</a>/absolute loss while still being able to use my favorite optimization algorithm, <a href="http://www.cs.cmu.edu/~jrs/jrspapers.html#cg">conjugate gradients</a>. I came up with the trick of using <a href="http://mathworld.wolfram.com/L2-Norm.html">L2</a>/squared error close to the origin/hinge and L1/absolute error elsewhere. Later, when reading <a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/">The Elements of Statistical Learning</a>, I learned that this trick had been invented decades earlier by Peter Huber (1964). Unfortunately, the Huber loss definition is incorrect in both the 1st and 2nd editions. The correct (LaTeX) definition is:</p>
<pre>
\begin{align}
L(y,f(x)) = \left\{ \begin{array}{cl}
\frac{1}{2} \left[y-f(x)\right]^2 & \text{for }|y-f(x)| \le \delta, \\
\delta \left(|y-f(x)|-\delta/2\right) & \text{otherwise.}
\end{array}\right.
\end{align}
</pre>
<p>That is, let <tt>z ≡ y-f(x)</tt>; then the inner portion is <tt>z<sup>2</sup>/2</tt> and the outer portion is <tt>δ(|z|-δ/2)</tt>.</p>
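In code, the corrected definition reads as follows (a direct transcription; the function name is my own):

```python
# Huber loss: quadratic within delta of the origin, linear (L1-like) outside.
def huber_loss(y, fx, delta=1.0):
    z = y - fx
    if abs(z) <= delta:
        return 0.5 * z * z                     # inner, smooth portion
    return delta * (abs(z) - delta / 2.0)      # outer, absolute-value portion
```

Note that both the value (<tt>δ<sup>2</sup>/2</tt>) and the first derivative (<tt>±δ</tt>) agree where the two pieces meet at <tt>|z| = δ</tt>, which is what makes the loss friendly to gradient-based optimizers like conjugate gradients.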