Consider a finite collection of models $\mathcal{F}$, and recall the basic PAC bound: for any $\delta > 0$, with probability at least $1 - \delta$,
$$R(f) \;\le\; \hat{R}_n(f) + \sqrt{\frac{\log |\mathcal{F}| + \log(1/\delta)}{2n}} \qquad \text{for all } f \in \mathcal{F},$$
where
$$R(f) = \mathbb{E}\big[\ell(f(X), Y)\big] \qquad \text{and} \qquad \hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(X_i), Y_i),$$
and the loss is assumed to be bounded between 0 and 1. Note that we can write the inequality above as:
$$P\left( R(f) - \hat{R}_n(f) \le \sqrt{\frac{\log |\mathcal{F}| + \log(1/\delta)}{2n}}\;\; \text{for all } f \in \mathcal{F} \right) \;\ge\; 1 - \delta.$$
Letting $\epsilon = \sqrt{\frac{\log |\mathcal{F}| + \log(1/\delta)}{2n}}$, so that $\delta = |\mathcal{F}|\, e^{-2n\epsilon^2}$, we have:
$$P\left( \exists f \in \mathcal{F} : R(f) - \hat{R}_n(f) > \epsilon \right) \;\le\; |\mathcal{F}|\, e^{-2n\epsilon^2}.$$
This is precisely the form of Hoeffding's inequality, with $|\mathcal{F}|\, e^{-2n\epsilon^2}$ in place of the usual $e^{-2n\epsilon^2}$. In effect, in order to have Hoeffding's inequality hold with probability at least $1 - \delta$ for all $f \in \mathcal{F}$, we must distribute the “$\delta$-budget” or “confidence-budget” over all the models in $\mathcal{F}$ (in this case, evenly distributed):
$$\delta(f) = \frac{\delta}{|\mathcal{F}|}, \qquad \sum_{f \in \mathcal{F}} \delta(f) = \delta.$$
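To make the computation concrete, here is a small Python sketch that evaluates the uniform confidence budget and the corresponding deviation bound $\epsilon$; the values of $|\mathcal{F}|$, $n$, and $\delta$ are arbitrary illustrative choices.

import math

num_models = 1024   # |F|: size of the finite model class (illustrative)
n = 5000            # number of training samples (illustrative)
delta = 0.05        # total confidence budget (illustrative)

# Even split of the confidence budget over the models.
delta_per_model = delta / num_models

# Deviation epsilon for which Hoeffding's inequality holds for each fixed model
# with probability at least 1 - delta_per_model; by the union bound the
# deviation bound then holds for all models simultaneously w.p. >= 1 - delta.
epsilon = math.sqrt((math.log(num_models) + math.log(1.0 / delta)) / (2 * n))

print(f"delta(f) = {delta_per_model:.2e},  epsilon = {epsilon:.4f}")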
However, to apply the union bound, we do not need to distribute $\delta$ evenly among the candidate models. We only require:
$$\sum_{f \in \mathcal{F}} \delta(f) \;\le\; \delta.$$
So, if $c(f)$, $f \in \mathcal{F}$, are positive numbers satisfying $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$, then we can take $\delta(f) = \delta\, 2^{-c(f)}$. This provides two advantages: first, $\mathcal{F}$ may be countably infinite rather than finite; second, we can express a preference for certain models by assigning them smaller values of $c(f)$, giving them a larger share of the confidence budget and hence tighter bounds.
Prefix codes are one way to achieve this. If we assign a binary prefix code of length $c(f)$ to each $f \in \mathcal{F}$, then the codelengths satisfy $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$ according to the Kraft inequality.
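The codelength-based budget is easy to check numerically. The following Python sketch uses a handful of hypothetical codelengths (the model names and lengths are made up for illustration) to verify the Kraft inequality and form the per-model budgets $\delta(f)$.

# Hypothetical prefix codelengths; shorter codes express a stronger prior
# preference for the corresponding model (e.g. codes 0, 10, 110, 111).
codelengths = {"f1": 1, "f2": 2, "f3": 3, "f4": 3}

# Kraft inequality: a binary prefix code with these lengths exists
# if and only if sum_f 2^{-c(f)} <= 1.
kraft_sum = sum(2.0 ** (-c) for c in codelengths.values())
assert kraft_sum <= 1.0, "codelengths violate the Kraft inequality"

delta = 0.05
# Unequal confidence budget: delta(f) = delta * 2^{-c(f)}, so that
# sum_f delta(f) = delta * kraft_sum <= delta, which is all the union bound needs.
delta_f = {f: delta * 2.0 ** (-c) for f, c in codelengths.items()}
print(kraft_sum, delta_f)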
The main point of this lecture is to examine how PAC bounds of the form: w.p. at least $1 - \delta$,
$$R(f) \;\le\; \hat{R}_n(f) + \sqrt{\frac{c(f)\log 2 + \log(1/\delta)}{2n}} \qquad \text{for all } f \in \mathcal{F},$$
can be used to select a model that comes close to achieving the best possible performance over the collection, $\min_{f \in \mathcal{F}} R(f)$.
Let $\hat{f}_n$ be the model selected from $\mathcal{F}$ using the training data $\{X_i, Y_i\}_{i=1}^{n}$. We will specify this model in a moment, but keep in mind that it is not necessarily the model with minimum empirical risk as before. We would like to have
$$\mathbb{E}\big[R(\hat{f}_n)\big] - \min_{f \in \mathcal{F}} R(f)$$
as small as possible. First, for any $\delta > 0$, define
$$\hat{f}_n \;=\; \arg\min_{f \in \mathcal{F}} \Big\{ \hat{R}_n(f) + C(f, n, \delta) \Big\},$$
where
$$C(f, n, \delta) \;=\; \sqrt{\frac{c(f)\log 2 + \log(1/\delta)}{2n}}.$$
Then w.p. at least $1 - \delta$,
$$R(f) \;\le\; \hat{R}_n(f) + C(f, n, \delta) \qquad \text{for all } f \in \mathcal{F},$$
and in particular,
$$R(\hat{f}_n) \;\le\; \hat{R}_n(\hat{f}_n) + C(\hat{f}_n, n, \delta),$$
so, by the definition of $\hat{f}_n$,
$$R(\hat{f}_n) \;\le\; \min_{f \in \mathcal{F}} \Big\{ \hat{R}_n(f) + C(f, n, \delta) \Big\}.$$
We will make use of the inequality above in a moment. First note that, for any fixed $f \in \mathcal{F}$,
$$\mathbb{E}\big[R(\hat{f}_n)\big] - R(f) - C(f, n, \delta) \;=\; \mathbb{E}\big[R(\hat{f}_n) - \hat{R}_n(f) - C(f, n, \delta)\big] + \mathbb{E}\big[\hat{R}_n(f) - R(f)\big].$$
The second term is exactly 0, since $\mathbb{E}[\hat{R}_n(f)] = R(f)$ for any fixed $f \in \mathcal{F}$.
Now consider the first term, $\mathbb{E}\big[R(\hat{f}_n) - \hat{R}_n(f) - C(f, n, \delta)\big]$. Let $\Omega$ be the set of events on which
$$R(f') \;\le\; \hat{R}_n(f') + C(f', n, \delta) \qquad \text{for all } f' \in \mathcal{F}.$$
From the bound above, we know that $P(\Omega^c) \le \delta$. Thus,
$$\begin{aligned}
\mathbb{E}\big[R(\hat{f}_n) - \hat{R}_n(f) - C(f, n, \delta)\big]
&\le \mathbb{E}\big[R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) - C(\hat{f}_n, n, \delta)\big] \\
&= \mathbb{E}\big[\big(R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) - C(\hat{f}_n, n, \delta)\big)\,\mathbf{1}_{\Omega}\big] + \mathbb{E}\big[\big(R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) - C(\hat{f}_n, n, \delta)\big)\,\mathbf{1}_{\Omega^c}\big] \\
&\le 0 + P(\Omega^c) \;\le\; \delta,
\end{aligned}$$
where the first inequality holds because $\hat{f}_n$ minimizes $\hat{R}_n(\cdot) + C(\cdot, n, \delta)$, the $\mathbf{1}_{\Omega}$ term is nonpositive by the inequality defining $\Omega$, and on $\Omega^c$ the integrand is at most 1 because the loss is bounded by 1. Since $f \in \mathcal{F}$ was arbitrary, we conclude
$$\mathbb{E}\big[R(\hat{f}_n)\big] \;\le\; \min_{f \in \mathcal{F}} \Big\{ R(f) + C(f, n, \delta) \Big\} + \delta.$$
We can summarize our analysis with the following theorem.
Theorem: Let $\mathcal{F}$ be a countable collection of models, and assign a positive number $c(f)$ to each $f \in \mathcal{F}$ such that $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$. Define the minimum complexity regularized risk model
$$\hat{f}_n \;=\; \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + \sqrt{\frac{c(f)\log 2 + \tfrac{1}{2}\log n}{2n}} \right\}.$$
Then,
$$\mathbb{E}\big[R(\hat{f}_n)\big] \;\le\; \min_{f \in \mathcal{F}} \left\{ R(f) + \sqrt{\frac{c(f)\log 2 + \tfrac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}},$$
which follows from the analysis above by choosing $\delta = 1/\sqrt{n}$, so that $\log(1/\delta) = \tfrac{1}{2}\log n$.
This shows that
$$\hat{R}_n(f) + \sqrt{\frac{c(f)\log 2 + \tfrac{1}{2}\log n}{2n}}$$
is a reasonable surrogate for
$$R(f),$$
the quantity we would really like to minimize but cannot compute: minimizing the penalized empirical risk yields a model whose expected risk is within the complexity penalty (plus $1/\sqrt{n}$) of the best model in $\mathcal{F}$.
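As a sketch of how this selection rule might be implemented, the following Python fragment picks the model minimizing the penalized empirical risk; the empirical risks and codelengths are placeholder inputs assumed to have been computed elsewhere.

import math

def penalty(c_f, n):
    # Complexity penalty from the theorem, with delta = 1/sqrt(n),
    # i.e. log(1/delta) = (1/2) log(n).
    return math.sqrt((c_f * math.log(2) + 0.5 * math.log(n)) / (2 * n))

def select_model(empirical_risks, codelengths, n):
    # Index of the model minimizing  R_hat(f) + penalty(c(f), n).
    scores = [r + penalty(c, n) for r, c in zip(empirical_risks, codelengths)]
    return min(range(len(scores)), key=lambda i: scores[i])

# Placeholder example: three candidate models with made-up empirical risks
# and prefix codelengths, and n training samples.
n = 1000
empirical_risks = [0.30, 0.22, 0.21]
codelengths = [2, 8, 40]
print(select_model(empirical_risks, codelengths, n))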
Let $\mathcal{X} = [0,1]^d$ be the input space and $\mathcal{Y} = \{0,1\}$ be the output space. Let $\mathcal{F}_k$, $k = 1, 2, \dots$, denote the collection of histogram classification rules with $k$ equal-volume bins, so that $|\mathcal{F}_k| = 2^k$, and let $\mathcal{F} = \bigcup_{k \ge 1} \mathcal{F}_k$. One choice of prefix code for this example is: $k = 1 \leftrightarrow 0$, $k = 2 \leftrightarrow 10$, $k = 3 \leftrightarrow 110$, and so on, encoding $k$ with $k$ bits. Then, if the first part of the code corresponds to $k$, followed by $k$ bits to indicate which of the $2^k$ histogram rules in $\mathcal{F}_k$ is under consideration, we have
$$c(f) = 2k \;\; \text{for every } f \in \mathcal{F}_k, \qquad \text{and} \qquad \sum_{f \in \mathcal{F}} 2^{-c(f)} \;=\; \sum_{k \ge 1} 2^{k}\, 2^{-2k} \;=\; \sum_{k \ge 1} 2^{-k} \;=\; 1 \;\le\; 1.$$
Let $\hat{f}_n$ be the model that solves the minimization
$$\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + \sqrt{\frac{c(f)\log 2 + \tfrac{1}{2}\log n}{2n}} \right\},$$
i.e.,
$$\hat{f}_n \;=\; \arg\min_{k \ge 1}\; \min_{f \in \mathcal{F}_k} \left\{ \hat{R}_n(f) + \sqrt{\frac{2k\log 2 + \tfrac{1}{2}\log n}{2n}} \right\},$$
so the minimization can be carried out in two stages. That is, for each $k$, let
$$\hat{f}_n^{(k)} \;=\; \arg\min_{f \in \mathcal{F}_k} \hat{R}_n(f).$$
Then select the best $k$ according to
$$\hat{k} \;=\; \arg\min_{k \ge 1} \left\{ \hat{R}_n\big(\hat{f}_n^{(k)}\big) + \sqrt{\frac{2k\log 2 + \tfrac{1}{2}\log n}{2n}} \right\},$$
and set
$$\hat{f}_n \;=\; \hat{f}_n^{(\hat{k})}.$$
Then,
$$\mathbb{E}\big[R(\hat{f}_n)\big] \;\le\; \min_{k \ge 1} \left\{ \min_{f \in \mathcal{F}_k} R(f) + \sqrt{\frac{2k\log 2 + \tfrac{1}{2}\log n}{2n}} \right\} + \frac{1}{\sqrt{n}}.$$
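The two-stage search can be written as a short Python sketch. It is illustrative only: the histogram rules below partition $[0,1]$ (a single coordinate) into $k$ equal-width bins as a stand-in for $k$ equal-volume bins of the input space, and the data in the usage example are synthetic.

import math
import numpy as np

def histogram_rule(x, y, k):
    # Empirical risk minimizer over F_k: majority-vote label in each of
    # k equal-width bins of [0, 1].
    bins = np.minimum((x * k).astype(int), k - 1)      # bin index of each sample
    votes = np.zeros(k)
    for b, label in zip(bins, y):
        votes[b] += 1 if label == 1 else -1
    bin_labels = (votes >= 0).astype(int)               # majority label per bin
    empirical_risk = np.mean(bin_labels[bins] != y)     # 0/1 training error
    return bin_labels, empirical_risk

def select_histogram(x, y, k_max):
    # Two-stage minimum complexity regularized risk selection with c(f) = 2k.
    n = len(y)
    best = None
    for k in range(1, k_max + 1):
        rule, risk = histogram_rule(x, y, k)
        penalized = risk + math.sqrt((2 * k * math.log(2) + 0.5 * math.log(n)) / (2 * n))
        if best is None or penalized < best[0]:
            best = (penalized, k, rule)
    return best  # (penalized empirical risk, selected k, selected rule)

# Synthetic usage example: noisy threshold labels on [0, 1].
rng = np.random.default_rng(0)
x = rng.uniform(size=2000)
y = ((x > 0.5) ^ (rng.uniform(size=2000) < 0.1)).astype(int)
print(select_histogram(x, y, k_max=50)[:2])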
It is a simple exercise to show that if $\mathcal{X} = [0,1]^2$ and the Bayes decision boundary is a (suitably regular, e.g., Lipschitz) 1-d curve, then by setting $k \asymp \sqrt{n}$ and selecting the best rule from $\mathcal{F}_k$ we have
$$\mathbb{E}\big[R(\hat{f}_n)\big] - R^* \;=\; O\big(n^{-1/4}\big),$$
where $R^*$ denotes the Bayes risk.
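For intuition, a rough version of that calculation (suppressing constants and logarithmic factors, and assuming the boundary has bounded length): the boundary can intersect only on the order of $\sqrt{k}$ of the $k$ equal-volume bins, so the best rule in $\mathcal{F}_k$ has excess risk of order $k^{-1/2}$, while the complexity penalty with $c(f) = 2k$ contributes a term of order $\sqrt{k/n}$. Balancing the two,
$$\frac{1}{\sqrt{k}} \;\approx\; \sqrt{\frac{k}{n}} \;\;\Longrightarrow\;\; k \asymp \sqrt{n} \;\;\Longrightarrow\;\; \mathbb{E}\big[R(\hat{f}_n)\big] - R^* \;\lesssim\; n^{-1/4}.$$
In particular, because the oracle bound above already takes the minimum over $k$, the complexity regularized selection attains essentially this trade-off without prior knowledge of the appropriate number of bins.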