I’ve been reading Shalev-Shwartz & Ben-David’s book, “Understanding Machine Learning”, which presents PAC theory in its Part I. While the theory of PAC learnability does appear very elegant and remarkable to me, I’m not so sure about its implications for practical machine learning problems. What is the primary utility (or utilities) of PAC learnability and VC dimension?
Is it that PAC learnability theory tells us whether a hypothesis class H is well chosen, in the sense that the test error can be made small with high probability? In other words, if H is PAC learnable, are we assured that overfitting does not occur? So, is the primary value of PAC theory to provide guidance on choosing the hypothesis class and the training set size? And is the theory remarkable because these assertions hold without knowing the distribution over the inputs and outputs of the learning task?
I’m new to machine learning. I’d appreciate any corrections/comments/answers that would help me see how the PAC theory helps in practical machine learning problems.
I’m not sure if this answer addresses all of your questions above, but here’s my shot at answering your main question as to why PAC-learnability is useful in ML:
Let’s say you have a hypothesis h that belongs to some hypothesis space H. You want to find out how many training examples you need for a learner to reach some minimal performance level. Sample complexity (the number of training examples needed to reach that level) is exactly what PAC-learnability governs. Basically, we want h to be approximately correct — its true error is bounded by some epsilon, with 0 < epsilon < 1/2 — and we also want to be confident that this happens most of the time — the probability of failure is bounded by some delta, with 0 < delta < 1/2. We want all this achievable in polynomial time. Of course, the required sample size depends on epsilon, delta, and the richness of H (for example, its size or VC dimension). So you can use PAC-learnability to determine the number of examples you need for your hypothesis h to be probably approximately correct.
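To make this concrete, here’s a minimal sketch of the standard sample-complexity bound for a *finite* hypothesis class in the realizable PAC setting: if a learner outputs any h in H that is consistent with m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples, then with probability at least 1 - delta its true error is at most epsilon. (The function name and the example numbers are just illustrative choices, not from the book.)

```python
import math

def pac_sample_complexity(h_size, epsilon, delta):
    """Sample size sufficient for a consistent learner over a finite
    hypothesis class H of size h_size (realizable PAC setting):

        m >= (1/epsilon) * (ln|H| + ln(1/delta))

    Any hypothesis consistent with that many i.i.d. examples has true
    error <= epsilon with probability >= 1 - delta, for ANY distribution.
    """
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# e.g. |H| = 2**20, want error <= 5% with 99% confidence
print(pac_sample_complexity(2**20, epsilon=0.05, delta=0.01))  # -> 370
```

Notice the distribution-free character you asked about: the bound involves only |H|, epsilon, and delta, never the data distribution. For infinite classes, ln|H| is replaced by a term involving the VC dimension.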