This is a rather general question (i.e. not necessarily specific to statistics), but I have noticed a trend in the machine learning and statistical literature where authors prefer to follow the following approach:
Approach 1: Obtain a solution to a practical problem by formulating a cost function for which it is possible (e.g. from a computational standpoint) to find a globally optimal solution (e.g. by formulating a convex cost function).
Approach 2: Obtain a solution to the same problem by formulating a cost function for which we may not be able to obtain a globally optimal solution (e.g. we can only get a locally optimal solution for it).
Note that rigorously speaking the two problems are different; the assumption is that we can find the globally optimal solution for the first one, but not for the second one.
Other considerations aside (i.e. speed, ease of implementation, etc.), I am looking for:
- An explanation of this trend (e.g. mathematical or historical arguments)
- Benefits (practical and/or theoretical) for following Approach 1 instead of 2 when solving a practical problem.
My believe is that the goal should be to optimize the function you are interested in. If that happens to be the number of misclassifications – and not a binomial likelihood, say – then you should try minimizing the number of misclassifications. However, for the number of practical reasons mentioned (speed, implementation, instability etc), this may not be so easy and it may even be impossible. In that case, we choose to approximate the solution.
I know of basically two approximation strategies; either we come up with algorithms that attempt to directly approximate the solution of the original problem, or we reformulate the original problem as a more directly solvable problem (e.g. convex relaxations).
A mathematical argument for preferring one approach over the other is whether we can understand a) the properties of the solution actually computed and b) how well the solution approximates the solution of the problem we are actually interested in.
I know of many results in statistics where we can prove properties of a solution to a optimization problem. To me it seems more difficult to analyze the solution of an
algorithm, where you don’t have a mathematical formulation of what it computes (e.g. that it solves a given optimization problem). I certainly won’t claim that you can’t, but it seems to be a theoretical benefit, if you can give a clear mathematical formulation of what you compute.
It is unclear to me, if such mathematical arguments give any practical benefits to Approach 1 over Approach 2. There are certainly somebody out there, who are not afraid of a non-convex loss function.