Why is a 0-1 loss function intractable?

In Ian Goodfellow’s Deep Learning book, it is written that

Sometimes, the loss function we actually care about (say, classification error) is not one that can be optimized efficiently. For example, exactly minimizing expected 0-1 loss is typically intractable (exponential in the input dimension), even for a linear classifier. In such situations, one typically optimizes
a surrogate loss function instead, which acts as a proxy but has advantages.

Why is 0-1 loss intractable, or how is it exponential in the input dimensions?

The 0-1 loss function is non-convex and discontinuous, so (sub)gradient methods cannot be applied. For binary classification with a linear separator, this loss function can be formulated as finding the $\beta$ that minimizes the average value of the indicator function $\mathbf{1}(y_{i}\beta\mathbf{x}_{i} \leq 0)$ over all $i$ samples. This is exponential in the inputs, as since there are two possible values for each pair, there are $2^{n}$ possible configurations to check for $n$ total sample points. This is known to be NP-hard. Knowing the current value of your loss function doesn’t provide any clue as to how you should possibly modify your current solution to improve, as you could derive if gradient methods for convex or continuous functions were available.