While working on my thesis, I was looking for a way to get some of the beneficial properties of loss/penalty functions like the hinge and L1/absolute loss while still being able to use my favorite optimization algorithm, conjugate gradients. I came up with the trick of using the L2/squared error close to the origin/hinge and the L1/absolute loss elsewhere. Later, when reading The Elements of Statistical Learning, I learned that this trick had been invented decades earlier by Peter Huber (1964). Unfortunately, the Huber loss definition is incorrect in both the 1st and 2nd editions. The correct (LaTeX) definition is:

\begin{align}
L(y,f(x)) = \left\{ \begin{array}{cl}
\frac{1}{2} \left[y-f(x)\right]^2 & \text{for }|y-f(x)| \le \delta, \\
\delta \left(|y-f(x)|-\delta/2\right) & \text{otherwise.}
\end{array}\right.
\end{align}

I.e., let `z ≡ y - f(x)`; then the inner portion is `z^2/2` and the outer portion is `δ*(|z| - δ/2)`. The absolute value in the outer portion matters: it makes the loss symmetric and continuous at `|z| = δ`.
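As a sketch of this piecewise definition (my own illustration, not code from the book or from Huber), the loss and its derivative can be written with NumPy. The continuous derivative is exactly what makes gradient-based methods like conjugate gradients comfortable with it:

```python
import numpy as np

def huber_loss(z, delta=1.0):
    """Huber loss on residuals z = y - f(x).

    Quadratic (L2) for |z| <= delta, linear (L1) beyond;
    the two pieces meet with matching value and slope at |z| = delta.
    """
    z = np.asarray(z, dtype=float)
    quad = 0.5 * z ** 2                        # inner portion: z^2 / 2
    lin = delta * (np.abs(z) - 0.5 * delta)    # outer portion: delta * (|z| - delta/2)
    return np.where(np.abs(z) <= delta, quad, lin)

def huber_grad(z, delta=1.0):
    """Derivative of the Huber loss w.r.t. z.

    Equals z in the quadratic region and delta * sign(z) outside,
    so it is continuous everywhere (unlike the L1 loss at zero).
    """
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= delta, z, delta * np.sign(z))
```

For example, with `delta = 1`, a residual of `0.5` sits in the quadratic region (`loss = 0.125`), while a residual of `2` is in the linear region (`loss = 1 * (2 - 0.5) = 1.5`), and at `z = 1` both branches agree (`0.5`).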