What is the VC dimension of a decision tree with k splits in two dimensions? Let us say the model is CART and the only allowed splits are parallel to the axes.
For one split, we can arrange 3 points in a triangle; then, for any labeling of those points, we can get perfect prediction (i.e., the points are shattered).
But what about 2 splits, or any general k?
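The one-split case can be checked by brute force. Here is a minimal sketch (the specific coordinates (0,0), (1,2), (2,1) are my own illustrative choice): enumerate every axis-parallel stump and collect the labelings it can induce on the three points.

```python
# Three points "in a triangle": no two share an x- or y-coordinate
pts = [(0, 0), (1, 2), (2, 1)]

def stump_outputs(axis, thr):
    """Labels induced by the split x[axis] > thr (one side = 1, other = 0)."""
    return tuple(int(p[axis] > thr) for p in pts)

achievable = set()
for axis in (0, 1):
    coords = sorted(p[axis] for p in pts)
    # candidate thresholds: below all points, and midpoints between neighbors
    thrs = [coords[0] - 1] + [(a + b) / 2 for a, b in zip(coords, coords[1:])]
    for thr in thrs:
        out = stump_outputs(axis, thr)
        achievable.add(out)
        achievable.add(tuple(1 - b for b in out))  # swap which side is labeled 1

print(len(achievable))  # 8: every one of the 2^3 labelings is achievable
```

All 8 labelings appear, so a single axis-parallel split shatters these 3 points.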
I’m not sure this is a question with a simple answer, nor do I believe it is a question that even needs to be asked about decision trees.
Consult Aslan et al., Calculating the VC-Dimension of Trees (2009). They address this problem by doing an exhaustive search over small trees and then providing an approximate recursive formula for estimating the VC dimension of larger trees. They then use this formula as part of a pruning algorithm. Had there been a closed-form answer to your question, I am sure they would have supplied it; instead, they felt the need to iterate their way through even fairly small trees.
My two cents' worth: I'm not sure that it's meaningful to talk about the VC dimension for decision trees. Consider d-dimensional binary inputs with a binary response. This is the situation considered by Aslan et al. There are 2^d possible points in this sample space and 2^(2^d) possible response patterns over them. If I build a complete tree, with d levels and 2^d leaves, then I can shatter any pattern of responses on the 2^d points. But nobody fits complete trees. Typically, you overfit and then prune back using cross-validation. What you get at the end is a smaller and simpler tree, but your hypothesis set is still large. Aslan et al. try to estimate the VC dimension of families of isomorphic trees. Each family is a hypothesis set with its own VC dimension.
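A minimal sketch of the complete-tree observation (the representation is my own, not from Aslan et al.): a complete tree over d binary features splits on coordinate k at level k, so each input follows its own root-to-leaf path, and any labeling of the 2^d points can be realized.

```python
from itertools import product

d = 3
points = list(product((0, 1), repeat=d))  # the 2^d possible binary inputs

def build(labeling, level=0, path=()):
    """Complete tree: level k splits on coordinate k; a leaf stores the
    label of the unique point whose bit pattern matches its path."""
    if level == d:
        return ("leaf", labeling[path])
    return ("split", level,
            build(labeling, level + 1, path + (0,)),
            build(labeling, level + 1, path + (1,)))

def predict(node, x):
    while node[0] == "split":
        _, k, left, right = node
        node = left if x[k] == 0 else right
    return node[1]

def realizes(lab):
    tree = build(dict(zip(points, lab)))
    return all(predict(tree, p) == want for p, want in zip(points, lab))

# every one of the 2^(2^d) labelings is realized by some complete tree
shattered = all(realizes(lab) for lab in product((0, 1), repeat=len(points)))
print(shattered)
```

The check is almost tautological, which is the point: a complete tree is a lookup table over the whole sample space.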
The previous picture illustrates a tree for a space with d=3 that shatters 4 points: (1,0,0,1), (1,1,1,0), (0,1,0,1), (1,1,0,1), where the fourth entry of each is the "response". Aslan et al. would regard a tree with the same shape, but splitting on x1 and x2, say, as isomorphic and part of the same hypothesis set. So, although there are only 3 leaves on each of these trees, the set of such trees can shatter 4 points, and the VC dimension is 4 in this case. However, the same tree shape could occur in a space with 4 variables, in which case the VC dimension would be 5. So it's complicated.
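As a sanity check on the example (the point coordinates and responses are from the answer above; the tree encoding is my own sketch), a brute-force search over 3-leaf trees of that shape, varying which variables are split on and the leaf labels, finds a tree reproducing the listed responses:

```python
from itertools import product

# the four points; the fourth entry of each tuple in the text is the response
data = [((1, 0, 0), 1), ((1, 1, 1), 0), ((0, 1, 0), 1), ((1, 1, 0), 1)]

def predictions(i, side, j, leaf, b0, b1):
    """3-leaf tree: the root tests x[i]; the branch x[i] == side is a leaf
    with label `leaf`; the other branch tests x[j] (leaf labels b0, b1)."""
    out = []
    for x, _ in data:
        if x[i] == side:
            out.append(leaf)
        else:
            out.append(b1 if x[j] else b0)
    return tuple(out)

target = tuple(y for _, y in data)
found = any(
    predictions(i, side, j, leaf, b0, b1) == target
    for i, j in product(range(3), repeat=2)
    for side, leaf, b0, b1 in product((0, 1), repeat=4)
)
print(found)  # True: some 3-leaf tree of this shape matches the responses
```

One tree that works splits on x1 at the root (x1 = 0 is a leaf labeled 1) and on x3 below it.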
Aslan et al.'s brute-force solution seems to work fairly well, but what they get isn't really the VC dimension of the algorithms people use, since these rely on pruning and cross-validation. It's hard to say what the hypothesis space actually is, since in principle we start with a shattering number of possible trees and then prune back to something more reasonable. Even if someone begins with an a priori choice not to go beyond two layers, say, there may still be a need to prune the tree. And we don't really need the VC dimension, since cross-validation goes after the out-of-sample error directly.
To be fair to Aslan et al., they don't use the VC dimension to characterize their hypothesis space. They calculate the VC dimension of individual branches and use that quantity to decide whether a branch should be cut. At each stage, they use the VC dimension of the specific configuration of the branch under consideration; they don't look at the VC dimension of the problem as a whole.
If your variables are continuous and the response depends on reaching a threshold, then a decision tree is basically creating a bunch of perceptrons, so the VC dimension would presumably be greater than that (since you also have to estimate the cutoff point at which to make the split). If the response depends monotonically on a continuous predictor, CART will chop it up into a bunch of steps, trying to recreate a regression model. I would not use trees in that case; a GAM or a regression would be preferable.
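That staircase behavior is easy to see with a toy sketch (a hand-rolled greedy regression tree, my own simplification rather than any particular library's CART): fitting a depth-3 tree to the monotone response y = x yields a piecewise-constant approximation with at most 2^3 = 8 levels.

```python
def fit_tree(pts, depth):
    """Toy 1-D CART regressor: greedy SSE-minimizing splits, means at leaves."""
    def sse(side):
        m = sum(y for _, y in side) / len(side)
        return sum((y - m) ** 2 for _, y in side)

    xs = sorted({x for x, _ in pts})
    if depth == 0 or len(xs) == 1:
        m = sum(y for _, y in pts) / len(pts)
        return lambda x: m  # leaf: predict the mean response
    # pick the midpoint threshold with the smallest total within-side SSE
    best = min(
        ((a + b) / 2 for a, b in zip(xs, xs[1:])),
        key=lambda t: sse([p for p in pts if p[0] <= t])
                      + sse([p for p in pts if p[0] > t]),
    )
    left = fit_tree([p for p in pts if p[0] <= best], depth - 1)
    right = fit_tree([p for p in pts if p[0] > best], depth - 1)
    return lambda x: left(x) if x <= best else right(x)

# a monotone response: y = x on 21 grid points
pts = [(i / 20, i / 20) for i in range(21)]
f = fit_tree(pts, depth=3)
steps = sorted({f(x) for x, _ in pts})
print(len(steps))  # at most 2^3 = 8 constant pieces approximating the line
```

The fitted function is a staircase chasing a straight line, which is exactly the mismatch described above.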