In Decision Tree, splitting criterion methods are applied say information gain to split the current tree node to built a decision tree, but in many machine learning problems, normally there is a cost/loss function to be minimised to get the best parameters.

My question is how to define such a cost function of Decision Tree?

**Answer**

I think it helps to distinguish between *training* metrics and *evaluation* metrics, and between *global* training metrics and *local* training metrics. When we talk about evaluation metrics, as @AlvaroFuentes said, a loss function can always be defined for decision trees, in the same way as for any other model. In training, it is true that often a global metric is chosen and training attempts to optimize over that metric *. But training does not have to be this way, and in the case of decision trees, training proceeds through a greedy search, each step based on a local metric (eg, information gain or Gini index). In fact, even when a global training metric is defined (say, likelihood), each step in training is still is evaluated based on some local metric (say, gradient of likelihood) and hence in a sense “greedy”; it is just that in this case the local metric is inspired by the global one. And in both cases there is no guarantee that a greedy search based on some local metric actually optimizes the global metric (eg, local optima).

* Side note: This training metric is often different from the evaluation metrics, chosen for its nicer mathematical properties to help the training; eg, likelihood, L2 or cross entropy vs accuracy or AUC.

**Attribution***Source : Link , Question Author : GoingMyWay , Answer Author : Lei Huang*