Machine Learning Quirks: Smooth Absolute aka Huber Loss

Wednesday, October 28, 2009

Smooth Absolute aka Huber Loss

While working on my thesis, I was looking for a way to get some beneficial properties of loss/penalty functions like the hinge and L1/absolute while still being able to use my favorite optimization algorithm, conjugate gradients. I came up with the trick of using L2/squared error close to the origin/hinge and L1/absolute elsewhere. Later, when reading The Elements of Statistical Learning, I learned that this trick had been invented decades earlier by Peter Huber (1964). Unfortunately, the Huber Loss definition is incorrect in both the 1st and 2nd editions. The correct (LaTeX) definition is:

\begin{align}
L(y,f(x)) = \left\{ \begin{array}{cl}
\frac{1}{2} \left[y-f(x)\right]^2 & \text{for }|y-f(x)| \le \delta, \\
\delta \left(|y-f(x)|-\delta/2\right) & \text{otherwise.}
\end{array}\right.
\end{align}

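I.e., let z ≡ y-f(x); then the inner portion is z²/2 and the outer portion is δ(|z|-δ/2). Note the absolute value in the outer portion: the loss is symmetric in z, and the two pieces agree at |z| = δ, where both equal δ²/2.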
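For concreteness, here is a minimal Python sketch of the definition above, written in terms of the residual z. The function names `huber_loss` and `huber_grad` and the default `delta=1.0` are my own choices for illustration, not anything from Huber's paper or the book. The derivative is included because it is what a gradient-based method like conjugate gradients needs: it is just z clipped to [-δ, δ], so it is continuous everywhere, unlike the derivative of the plain absolute loss.

```python
def huber_loss(z, delta=1.0):
    """Huber loss of a residual z = y - f(x); delta > 0 is the switch point."""
    a = abs(z)
    if a <= delta:
        # Inner portion: squared error.
        return 0.5 * z * z
    # Outer portion: absolute error, shifted so the pieces meet at |z| = delta.
    return delta * (a - 0.5 * delta)


def huber_grad(z, delta=1.0):
    """Derivative of the Huber loss with respect to z: z clipped to [-delta, delta]."""
    return max(-delta, min(delta, z))


if __name__ == "__main__":
    for z in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
        print(z, huber_loss(z), huber_grad(z))
```

Small residuals are penalized quadratically and large ones only linearly, so outliers pull on the fit with a bounded gradient of magnitude δ.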