Given a set of training data $\{(X_i, Y_i)\}_{i=1}^{n}$ and a finite collection of candidate functions $\mathcal{F}$, select $\hat{f}_n \in \mathcal{F}$ that (hopefully) is a good predictor for future cases. That is,

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f),$$
where $\hat{R}_n(f)$ is the empirical risk. For any particular $f \in \mathcal{F}$, the corresponding empirical risk is defined as

$$\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(X_i), Y_i\big).$$
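To make the selection rule concrete, here is a minimal sketch (my own illustration, not part of the notes) of empirical risk minimization over a finite class, assuming the 0/1 loss; the class of threshold classifiers and all constants are hypothetical.

```python
import numpy as np

def empirical_risk(f, X, y):
    """Empirical risk with 0/1 loss: the fraction of training points f misclassifies."""
    return np.mean(f(X) != y)

def erm(candidates, X, y):
    """Return the candidate with the smallest empirical risk (ties broken by order)."""
    risks = [empirical_risk(f, X, y) for f in candidates]
    best = int(np.argmin(risks))
    return candidates[best], risks[best]

# Hypothetical finite class: threshold classifiers f_t(x) = 1{x > t}.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=200)
y = ((X > 0.6) ^ (rng.random(200) < 0.1)).astype(int)   # noisy labels
F = [lambda x, t=t: (x > t).astype(int) for t in np.linspace(0, 1, 11)]
f_hat, risk_hat = erm(F, X, y)
print("minimum empirical risk:", risk_hat)
```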
Hoeffding's inequality (Chernoff's bound in this case) allows us to gauge how close $\hat{R}_n(f)$ is to the true risk of $f$, $R(f) = E\big[\ell(f(X), Y)\big]$, in probability. Assuming the loss takes values in $[0, 1]$ (e.g., the 0/1 loss),

$$P\big(|\hat{R}_n(f) - R(f)| \ge \epsilon\big) \le 2\,e^{-2n\epsilon^2}.$$
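As a quick sanity check (my own illustration, not from the notes), one can compare the deviation probability to the Hoeffding bound by Monte Carlo for a single fixed classifier under 0/1 loss; the true risk $R = 0.3$ and the other constants below are arbitrary choices.

```python
import numpy as np

# With 0/1 loss, each loss term is a Bernoulli(R) variable, so the empirical risk
# is a binomial proportion; Hoeffding bounds its deviation from R.
rng = np.random.default_rng(1)
n, R, eps, trials = 100, 0.3, 0.1, 100_000

losses = rng.random((trials, n)) < R        # simulated 0/1 losses of a fixed classifier
emp_risk = losses.mean(axis=1)              # empirical risk in each trial
deviation_prob = np.mean(np.abs(emp_risk - R) >= eps)

print("Monte Carlo P(|R_n - R| >= eps):", deviation_prob)
print("Hoeffding bound 2 exp(-2 n eps^2):", 2 * np.exp(-2 * n * eps ** 2))
```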
Since our selection process involves deciding among all $f \in \mathcal{F}$, we would like to gauge how close the empirical risks are to their expected values. We can do this by studying the probability that one or more of the empirical risks deviates significantly from its expected value. This is captured by the probability

$$P\Big(\max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \ge \epsilon\Big).$$
Note that the event

$$\Big\{\max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \ge \epsilon\Big\}$$
is equivalent to the union of the events

$$\bigcup_{f \in \mathcal{F}} \big\{|\hat{R}_n(f) - R(f)| \ge \epsilon\big\}.$$
Therefore, we can use Bonferroni's bound (a.k.a. the “union of events” or “union” bound) to obtain

$$P\Big(\max_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \ge \epsilon\Big) \le \sum_{f \in \mathcal{F}} P\big(|\hat{R}_n(f) - R(f)| \ge \epsilon\big) \le 2\,|\mathcal{F}|\,e^{-2n\epsilon^2},$$
where $|\mathcal{F}|$ is the number of classifiers in $\mathcal{F}$. In the proof of Hoeffding's inequality we also obtained a one-sided inequality that implies

$$P\big(R(f) - \hat{R}_n(f) \ge \epsilon\big) \le e^{-2n\epsilon^2},$$
and hence

$$P\Big(\max_{f \in \mathcal{F}} \big(R(f) - \hat{R}_n(f)\big) \ge \epsilon\Big) \le |\mathcal{F}|\,e^{-2n\epsilon^2}.$$
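For instance (a numerical illustration, not from the notes), with $|\mathcal{F}| = 1000$ classifiers, $n = 2000$ samples, and $\epsilon = 0.05$, this gives $1000\,e^{-2 \cdot 2000 \cdot (0.05)^2} = 1000\,e^{-10} \approx 0.045$: with probability at least about $0.95$, no classifier in $\mathcal{F}$ has a true risk exceeding its empirical risk by more than $0.05$.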
We can restate the inequality above as follows. For all $\delta > 0$ and for all $f \in \mathcal{F}$, with probability at least $1 - \delta$,

$$R(f) \le \hat{R}_n(f) + \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}}.$$
This follows by setting $\delta = |\mathcal{F}|\,e^{-2n\epsilon^2}$ and solving for $\epsilon$. Thus, with high probability (at least $1 - \delta$), the true risk $R(f)$ for all $f \in \mathcal{F}$ is bounded by the empirical risk of $f$ plus a constant that depends on $\delta$, the number of training samples $n$, and the size $|\mathcal{F}|$. Most importantly, the bound does not depend on the unknown distribution $P_{XY}$. Therefore, we can call this a distribution-free bound.
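The deviation term is easy to compute; the following sketch (with hypothetical numbers of my own choosing) evaluates it and shows that the size of $\mathcal{F}$ enters the bound only logarithmically.

```python
import numpy as np

def deviation_term(n, num_classifiers, delta):
    """sqrt((log|F| + log(1/delta)) / (2n)) from the distribution-free bound,
    i.e. R(f) <= R_n(f) + deviation_term for all f in F, w.p. at least 1 - delta."""
    return np.sqrt((np.log(num_classifiers) + np.log(1.0 / delta)) / (2.0 * n))

# Even a class with 2**20 (about a million) classifiers only costs a log factor.
for n in (100, 1_000, 10_000):
    print(n, deviation_term(n, num_classifiers=2 ** 20, delta=0.05))
```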
We can use the distribution-free bound above to obtain a bound on the expected performance of the minimum empirical risk classifier

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f).$$
We are interested in bounding

$$E\big[R(\hat{f}_n)\big] - \min_{f \in \mathcal{F}} R(f),$$
the expected risk of $\hat{f}_n$ minus the minimum risk over all $f \in \mathcal{F}$. Note that this difference is always non-negative, since $\hat{f}_n$ can be at best as good as the minimum risk classifier in $\mathcal{F}$.
Recall that for all $\delta > 0$ and all $f \in \mathcal{F}$, with probability at least $1 - \delta$,

$$R(f) \le \hat{R}_n(f) + \epsilon,$$
where

$$\epsilon = \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}}.$$
In particular, since this holds for all $f \in \mathcal{F}$, including $\hat{f}_n$,

$$R(\hat{f}_n) \le \hat{R}_n(\hat{f}_n) + \epsilon,$$
and for any other $f \in \mathcal{F}$,

$$\hat{R}_n(\hat{f}_n) \le \hat{R}_n(f),$$
since $\hat{f}_n$ minimizes the empirical risk over $\mathcal{F}$. In particular,

$$R(\hat{f}_n) \le \hat{R}_n(f^*) + \epsilon,$$
where $f^* = \arg\min_{f \in \mathcal{F}} R(f)$.
Let $\Omega$ denote the set of outcomes on which the above inequality holds. Then, by definition,

$$P(\Omega) \ge 1 - \delta.$$
We can now bound $E\big[R(\hat{f}_n)\big]$ as follows:

$$E\big[R(\hat{f}_n)\big] = E\big[R(\hat{f}_n)\,\mathbf{1}_{\Omega}\big] + E\big[R(\hat{f}_n)\,\mathbf{1}_{\Omega^c}\big] \le E\big[R(\hat{f}_n)\,\mathbf{1}_{\Omega}\big] + \delta,$$
since $R(\hat{f}_n) \le 1$ and $P(\Omega^c) \le \delta$. The quantity above is bounded as follows:

$$E\big[R(\hat{f}_n)\,\mathbf{1}_{\Omega}\big] \le E\big[(\hat{R}_n(f^*) + \epsilon)\,\mathbf{1}_{\Omega}\big] \le E\big[\hat{R}_n(f^*)\big] + \epsilon = R(f^*) + \epsilon,$$
since $R(\hat{f}_n) \le \hat{R}_n(f^*) + \epsilon$ on $\Omega$, and

$$E\big[\hat{R}_n(f^*)\big] = E\left[\frac{1}{n}\sum_{i=1}^{n} \ell\big(f^*(X_i), Y_i\big)\right] = R(f^*).$$
Thus

$$E\big[R(\hat{f}_n)\big] \le R(f^*) + \epsilon + \delta = \min_{f \in \mathcal{F}} R(f) + \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}} + \delta.$$
So we have

$$E\big[R(\hat{f}_n)\big] - \min_{f \in \mathcal{F}} R(f) \le \sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2n}} + \delta.$$
In particular, for $\delta = \frac{1}{\sqrt{n}}$, we have

$$E\big[R(\hat{f}_n)\big] - \min_{f \in \mathcal{F}} R(f) \le \sqrt{\frac{\log|\mathcal{F}| + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}}.$$
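To see how the quantity on the left compares with this bound in a simple case, here is a Monte Carlo sketch (my own construction; the distribution, the threshold class, and all constants are hypothetical): it estimates $E[R(\hat{f}_n)] - \min_{f \in \mathcal{F}} R(f)$ for ERM over a small class of threshold classifiers and prints the bound with $\delta = 1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: X ~ Uniform(0,1), Y = 1{X > 0.5} with labels flipped w.p. 0.2.
# Finite class F: threshold classifiers f_t(x) = 1{x > t}, t on an 11-point grid.
thresholds = np.linspace(0.0, 1.0, 11)
flip = 0.2
grid = np.linspace(0.0, 1.0, 100_001)          # for numerical integration over x
p1 = np.where(grid > 0.5, 1 - flip, flip)      # P(Y = 1 | x)

def true_risk(t):
    """R(f_t) = P(f_t(X) != Y), computed by numerical integration."""
    pred = grid > t
    return float(np.mean(np.where(pred, 1 - p1, p1)))

R = np.array([true_risk(t) for t in thresholds])

def erm_index(X, y):
    """Index of the threshold with the smallest empirical risk."""
    return int(np.argmin([np.mean((X > t) != y) for t in thresholds]))

n, trials = 500, 2_000
excess = []
for _ in range(trials):
    X = rng.uniform(0, 1, n)
    y = ((X > 0.5) ^ (rng.random(n) < flip)).astype(int)
    excess.append(R[erm_index(X, y)] - R.min())

delta = 1 / np.sqrt(n)
bound = np.sqrt((np.log(len(thresholds)) + 0.5 * np.log(n)) / (2 * n)) + delta
print("Monte Carlo estimate of E[R(f_hat)] - min R(f):", np.mean(excess))
print("bound with delta = 1/sqrt(n):", bound)
```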
Let $\mathcal{F}_M$ be the collection of all histogram classifiers with $M$ equal-volume cells. Then $|\mathcal{F}_M| = 2^M$, and the histogram classification rule

$$\hat{f}_n = \arg\min_{f \in \mathcal{F}_M} \hat{R}_n(f)$$
satisfies

$$E\big[R(\hat{f}_n)\big] - \min_{f \in \mathcal{F}_M} R(f) \le \sqrt{\frac{M \log 2 + \frac{1}{2}\log n}{2n}} + \frac{1}{\sqrt{n}},$$
which suggests choosing $M$ to balance this estimation error term (roughly $\sqrt{M/n}$) against the approximation error of the class $\mathcal{F}_M$, which shrinks as $M$ grows. For instance, if the approximation error decays like $1/\sqrt{M}$, the balancing choice is $M \sim \sqrt{n}$, resulting in an overall excess risk of order $n^{-1/4}$.
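As an illustration of the rule above (a minimal sketch under assumptions not in the notes: features in $[0,1]^2$, 0/1 loss, and $M = m \times m$ square cells), note that ERM over all $2^M$ cell labelings decouples across cells, so it reduces to a majority vote within each cell.

```python
import numpy as np

def cell_index(X, m):
    """Map points in [0,1]^2 to indices of an m-by-m grid of equal-volume cells."""
    ij = np.clip((X * m).astype(int), 0, m - 1)
    return ij[:, 0] * m + ij[:, 1]

def fit_histogram(X, y, m):
    """Majority vote per cell: the empirical risk minimizer over all 2^M labelings."""
    idx = cell_index(X, m)
    ones = np.bincount(idx, weights=y, minlength=m * m)
    counts = np.bincount(idx, minlength=m * m)
    return (ones > counts / 2).astype(int)      # one label per cell (ties, empty cells -> 0)

def predict(cell_labels, X, m):
    return cell_labels[cell_index(X, m)]

# Toy usage with a hypothetical distribution; M = m*m is roughly sqrt(n).
rng = np.random.default_rng(3)
n = 1000
m = int(round(n ** 0.25))
X = rng.uniform(0, 1, size=(n, 2))
y = ((X[:, 0] + X[:, 1] > 1) ^ (rng.random(n) < 0.1)).astype(int)
cell_labels = fit_histogram(X, y, m)
print("cells M =", m * m, " training error:", np.mean(predict(cell_labels, X, m) != y))
```

The per-cell decoupling is what keeps ERM tractable here even though $|\mathcal{F}_M| = 2^M$ is exponentially large.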