If you fit a nonlinear function to a set of points (assuming there is only one ordinate for each abscissa), the result can be either:

- a very complex function with small residuals
- a very simple function with large residuals

Cross-validation is commonly used to find the “best” compromise between these two extremes. But what does “best” mean? Is it “most likely”? How would you even start to prove what the most likely solution is?
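As a concrete illustration of what cross-validation is doing here (my own minimal sketch, not part of the original question; the sine data, noise level, and candidate degrees are all invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy samples from a smooth underlying curve.
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

def cv_error(degree, k=5):
    """Mean squared error on held-out folds for a polynomial of this degree."""
    idx = np.arange(x.size)
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # fit on everything but the fold
        coeffs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coeffs, x[fold])       # predict the held-out points
        errs.append(np.mean((y[fold] - pred) ** 2))
    return np.mean(errs)

# Score each candidate complexity by its held-out error and keep the winner.
scores = {d: cv_error(d) for d in range(1, 8)}
best_degree = min(scores, key=scores.get)
```

A very wiggly polynomial fits the training folds closely but predicts the held-out folds badly, while a straight line predicts everything badly, so the held-out score tends to bottom out at some intermediate complexity.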

My inner voice is telling me that CV is finding some sort of minimum-energy solution. That makes me think of entropy, which I vaguely know appears in both statistics and physics.

It seems to me that the “best” fit is generated by minimising the sum of functions of complexity and error, i.e.

`minimise m, where m = c(Complexity) + e(Error)`

Does this make any sense? What would the functions c and e be?
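For concreteness, here is one possible choice of c and e (my own toy sketch, not from the original post): e is the training mean-squared error, and c charges a fixed cost λ per model parameter. The data, candidate degrees, and λ = 0.05 are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: genuinely linear, plus noise.
x = np.linspace(0, 1, 30)
y = 2 * x + rng.normal(scale=0.2, size=x.size)

def training_error(degree):
    """e(Error): mean squared error of a degree-d polynomial on the data."""
    coeffs = np.polyfit(x, y, degree)
    return np.mean((y - np.polyval(coeffs, x)) ** 2)

def m(degree, lam=0.05):
    """c(Complexity) + e(Error): lam per parameter plus training error."""
    return lam * (degree + 1) + training_error(degree)

# Without a penalty, training error alone always prefers the wiggliest fit;
# with the penalty, extra wiggles must earn their keep.
unpenalised = min(range(1, 8), key=training_error)
penalised = min(range(1, 8), key=m)
```

The point of the sketch is only the shape of the trade-off: because the complexity term grows with the parameter count while the error term shrinks, the minimum of m lands at a simpler model than the raw-error minimum.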

Please can you explain using non-mathematical language, because I will not understand much maths.

**Answer**

I think this is an excellent question. I am going to paraphrase it just to be sure I have got it right:

> It would seem that there are lots of ways to choose the complexity penalty function c and the error penalty function e. Which choice is “best”? What should “best” even mean?

I think the answer (if there is one) will take you way beyond just cross-validation. I like how this question (and the topic in general) ties nicely to Occam’s Razor and the general concept of parsimony that is fundamental to science. I am by no means an expert in this area but I find this question hugely interesting. The best text I know on these sorts of questions is Universal Artificial Intelligence by Marcus Hutter (don’t ask me any questions about it though, I haven’t read most of it). I went to a talk by Hutter a couple of years ago and was very impressed.

You are right in thinking that there is a minimum-entropy argument in there somewhere (used for the complexity penalty function c in some manner). Hutter advocates the use of Kolmogorov complexity instead of entropy. Also, Hutter’s definition of “best” (as far as I remember) is, informally, the model that *best predicts the future* (i.e. best predicts the data that will be observed in the future). I can’t remember how he formalises this notion.
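One way to make “best predicts the future” concrete (my own toy sketch, with invented data, not Hutter’s formalism): fit each candidate model to the data seen so far, then score it on fresh draws from the same process.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    """Draw n points from a hypothetical data-generating process."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

x_past, y_past = sample(40)        # the data we have already observed
x_future, y_future = sample(200)   # data "yet to be observed"

def future_error(degree):
    """Fit on past data, measure squared error on future data."""
    coeffs = np.polyfit(x_past, y_past, degree)
    return np.mean((y_future - np.polyval(coeffs, x_future)) ** 2)

# Too simple (1), about right (3), and rather flexible (9).
errors = {d: future_error(d) for d in [1, 3, 9]}
```

Cross-validation can be read as an approximation of this criterion: since the future is unavailable, held-out folds of the past stand in for it.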

**Attribution**
*Source: Link, Question Author: bart, Answer Author: Robby McKilliam*