How do I train a (logistic?) regression in R using L1 loss function?

I can train a logistic regression in R using

glm(y ~ x, family = binomial(link = "logit"))

but, IIUC, this maximizes the log likelihood.

Is there a way to train the model using the linear (L1) loss function (which in this case is the same as the total variation distance)?

I.e., given a numeric vector x and a bit (logical) vector y, I want to construct a monotonic (in fact, increasing) function f such that |f(x)-y| is minimized.
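For concreteness, here is a minimal sketch (not part of the original question) of what such a direct fit could look like in R, using optim() to search for the intercept and slope of a logistic curve that minimize the summed absolute error; the data are simulated and the two-parameter set-up is only illustrative:

# Sketch: directly minimize sum(|plogis(b0 + b1*x) - y|) over (b0, b1).
set.seed(1)
x <- rnorm(200)                          # simulated predictor
y <- rbinom(200, 1, plogis(1 + 2 * x))   # simulated 0/1 response

l1_loss <- function(beta) sum(abs(plogis(beta[1] + beta[2] * x) - y))

# Nelder-Mead search; the L1 objective is not smooth, so treat this as illustrative only
fit_l1 <- optim(par = c(0, 0), fn = l1_loss)
fit_l1$par                               # fitted intercept and slope under the L1 criterion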


Answer

What you want to do does not exist because it is, for lack of a better word, mathematically flawed.

But first, I will stress why I think the premises of your question are sound. I will then try to explain why I think the conclusions you draw from them rest on a misunderstanding of the logistic model and, finally, I will suggest an alternative approach.

I will denote \{(\pmb x_i,y_i)\}_{i=1}^n your n observations (the bold letters denote vectors), which lie in p-dimensional space (the first entry of \pmb x_i is 1), with p<n, y_i\in[0,1], and f(\pmb x_i)=f(\pmb x_i'\pmb\beta) a monotone function of \pmb x_i'\pmb\beta, say like the logistic curve to fix ideas. For expediency, I will just assume that n is sufficiently large compared to p.

You are correct that if you intend to use TVD as the criterion to evaluate the fitted model, then it is reasonable to expect your fit to optimize that same criterion, among all possible candidates, on your data. Hence

\pmb\beta^{*}=\underset{\pmb\beta\in\mathbb{R}^{p}}{\arg\min}\;\;\sum_{i=1}^n|y_i-f(\pmb x_i'\pmb\beta)|

The problem is the error term: \epsilon_i=y_i-f(\pmb x_i'\pmb\beta), and if we enforce E(\pmb\epsilon)=0 (we simply want our model to be asymptotically unbiased), then \epsilon_i must be heteroskedastic. This is because y_i can take on only two values, 0 and 1. Therefore, given \pmb x_i, \epsilon_i can also take on only two values: 1-f(\pmb x_i'\pmb\beta) when y_i=1, which occurs with probability f(\pmb x_i'\pmb\beta), and -f(\pmb x_i'\pmb\beta) when y_i=0, which occurs with probability 1-f(\pmb x_i'\pmb\beta).

These considerations together imply that:

\text{var}(\pmb\epsilon)=E(\pmb\epsilon^2)=(1-f(\pmb x'\pmb\beta))^2f(\pmb x'\pmb\beta)+(-f(\pmb x'\pmb\beta))^2(1-f(\pmb x'\pmb\beta))\\
=(1-f(\pmb x'\pmb\beta))f(\pmb x'\pmb\beta)\\
=E(\pmb y|\pmb x)E(1-\pmb y|\pmb x)

hence \text{var}(\pmb\epsilon) is not constant but shaped like a concave parabola, and it is maximized when \pmb x is such that E(y|\pmb x)\approx .5.
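A quick simulation (not in the original answer, only meant to illustrate the formula above) shows this shape empirically: binning the true probabilities and averaging the squared errors recovers the p(1-p) parabola, with its peak near p = 0.5.

# Illustration: the error variance of a Bernoulli response equals p*(1-p)
# and is largest where the probability p is close to 0.5.
set.seed(2)
x <- rnorm(5000)
p <- plogis(1 + 2 * x)                # true E(y | x)
y <- rbinom(5000, 1, p)
eps <- y - p                          # the error term epsilon_i
bins <- cut(p, breaks = seq(0, 1, by = 0.1))
round(tapply(eps^2, bins, mean), 3)   # empirical var(eps) per bin; compare with p*(1-p)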

This inherent heteroskedasticity of the residuals has consequences. It implies, among other things, that when minimizing the l_1 loss function you are asymptotically over-weighting part of your sample. That is, the fitted \pmb\beta^* does not fit the data as a whole but only the portion of it clustered around places where \pmb x is such that E(\pmb y|\pmb x)\approx .5. To wit, these are the least informative data points in your sample: they correspond to the observations for which the noise component is the largest. Hence, your fit is pulled towards \pmb\beta^*=\pmb\beta:f(\pmb x'\pmb\beta)\approx .5, i.e. made irrelevant.

One solution, as is clear from the exposition above, is to drop the requirement of unbiasedness. A popular way to bias the estimator (with some Bayesian interpretation attached) is to include a shrinkage term. If we re-scale the response:

y^+_i=2(y_i-.5),\;1\leq i\leq n

and, for computational expediency, replace f(\pmb x'\pmb\beta) by another monotone function, g(\pmb x,[c,\pmb\gamma])=\pmb x'[c,\pmb\gamma] (it will be convenient for the sequel to denote the first component of the vector of parameters as c and the remaining p-1 ones as \pmb\gamma), and include a shrinkage term (for example one of the form ||\pmb\gamma||_2), the resulting optimization problem becomes:

[c^*,\pmb\gamma^{*}]=\underset{[c,\pmb\gamma]\in\mathbb{R}^{p}}{\arg\min}\;\;\sum_{i=1}^n\max(0,1-y_i^+\pmb x_i'[c,\pmb\gamma])+\frac{1}{2}||\pmb\gamma||_2

Note that in this new (also convex) optimization problem, the penalty for a correctly classified observation is 0, and it grows linearly with \pmb x_i'[c,\pmb\gamma] for a misclassified one, as in the l_1 loss. The solution [c^*,\pmb\gamma^*] to this second optimization problem gives the celebrated coefficients of the linear SVM (with perfect separation). As opposed to \pmb\beta^*, it makes sense to learn these [c^*,\pmb\gamma^{*}] from the data with a TVD-type penalty ('type' because of the bias term). Consequently, this solution is widely implemented. See for example the R package LiblineaR.
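As a rough sketch (assuming LiblineaR's interface; type = 3 requests its L2-regularized, L1-loss (hinge) support vector classifier, which matches the objective above up to the parametrization of the cost term), a fit might look like this:

library(LiblineaR)

set.seed(3)
x <- matrix(rnorm(200), ncol = 1)                    # design matrix (intercept handled by LiblineaR's bias term)
y <- factor(rbinom(200, 1, plogis(1 + 2 * x[, 1])))  # simulated class labels

# type = 3: L2-regularized L1-loss support vector classification (dual)
fit <- LiblineaR(data = x, target = y, type = 3, cost = 1)
fit$W                                                # fitted weights and bias, i.e. [gamma, c]
pred <- predict(fit, newx = x)$predictions           # predicted class labels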

Attribution
Source : Link , Question Author : sds , Answer Author : user603
