The VC inequality is a powerful generalization of the bounds we obtained for the hyperplane classifier in the previous lecture. The basic idea of the proof is quite similar. Before stating the inequality, we need to introduce the concepts of shatter coefficients and VC dimension.
Let $\mathcal{A}$ be a collection of subsets of $\mathbb{R}^d$.

Definition (Shatter coefficient): The $n$th shatter coefficient of $\mathcal{A}$ is defined by
$$S_{\mathcal{A}}(n) \;=\; \max_{z_1,\dots,z_n \in \mathbb{R}^d} \left|\left\{ \{z_1,\dots,z_n\} \cap A \;:\; A \in \mathcal{A} \right\}\right| .$$
The shatter coefficients are a measure of the richness of the collection $\mathcal{A}$. $S_{\mathcal{A}}(n)$ is the largest number of different subsets of a set of $n$ points that can be generated by intersecting the set with elements of $\mathcal{A}$.
In 1-d, let $\mathcal{A} = \{(-\infty, t] : t \in \mathbb{R}\}$ and consider two points $z_1 < z_2$. The possible subsets of $\{z_1, z_2\}$ generated by intersecting with sets of the form $(-\infty, t]$ are $\emptyset$, $\{z_1\}$, and $\{z_1, z_2\}$. Hence $S_{\mathcal{A}}(2) = 3 < 2^2$.
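The following is a small sketch, added here for illustration only, that brute-forces this count for the half-line class above; the helper name `halfline_subsets` is made up for this example.

```python
# Illustrative sketch: enumerate the distinct subsets of a finite point set
# that half-lines (-inf, t] can pick out, to check S_A(2) = 3 for the class above.

def halfline_subsets(points):
    """Return the distinct subsets {z in points : z <= t} over all thresholds t."""
    subsets = {frozenset()}          # a threshold below every point gives the empty set
    for t in sorted(points):         # only thresholds at the points themselves matter
        subsets.add(frozenset(z for z in points if z <= t))
    return subsets

print(len(halfline_subsets([0.0, 1.0])))        # 3, i.e. S_A(2) = 3 < 2^2
print(len(halfline_subsets([0.0, 1.0, 2.0])))   # 4, i.e. S_A(3) = 4 < 2^3
```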
In 2-d, let $\mathcal{A}$ = all rectangles in $\mathbb{R}^2$.
Consider a set of $n = 4$ training points. If we arrange the four points at the corners of a diamond shape, it is easy to see that we can find a rectangle in $\mathcal{A}$ covering any subset of the four points, as in the picture above, i.e., $S_{\mathcal{A}}(4) = 2^4 = 16$.
Clearly, $S_{\mathcal{A}}(n) = 2^n$ for $n = 1, 2, 3$ as well.
However, $S_{\mathcal{A}}(n) < 2^n$ for $n \ge 5$. This is because, given any five points, we can always select four of them such that the rectangle that just contains those four also contains the fifth point. Consequently, we cannot find a rectangle classifier that contains the four outer points but does not contain the inner point, as shown above.
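Below is a rough numerical check, added for illustration and assuming axis-aligned rectangles as in the argument above; the function name `rectangle_shatter_count` is hypothetical.

```python
# Brute-force the number of subsets of a point set obtainable as
# {rectangle} ∩ {points}, assuming axis-aligned rectangles: a subset S can be
# picked out exactly when the bounding box of S contains no point outside S.
from itertools import combinations

def rectangle_shatter_count(points):
    count = 1                                   # the empty set is always obtainable
    for k in range(1, len(points) + 1):
        for subset in combinations(points, k):
            xs = [p[0] for p in subset]
            ys = [p[1] for p in subset]
            in_box = lambda p: min(xs) <= p[0] <= max(xs) and min(ys) <= p[1] <= max(ys)
            if all(not in_box(p) for p in points if p not in subset):
                count += 1
    return count

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
print(rectangle_shatter_count(diamond))              # 16 = 2^4: the diamond is shattered
print(rectangle_shatter_count(diamond + [(0, 0)]))   # fewer than 2^5: five points are not
```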
Note that $S_{\mathcal{A}}(n) \le 2^n$ for any $n$ and any collection $\mathcal{A}$.
If $S_{\mathcal{A}}(n) = 2^n$, then we say that $\mathcal{A}$ shatters sets of size $n$.
The VC dimension $V_{\mathcal{A}}$ of $\mathcal{A}$ is the largest integer $n$ for which $S_{\mathcal{A}}(n) = 2^n$. Let $\mathcal{A}$ be a collection of sets with VC dimension $V_{\mathcal{A}}$. Then
$$S_{\mathcal{A}}(n) \;\le\; \sum_{i=0}^{V_{\mathcal{A}}} \binom{n}{i},$$
and also $S_{\mathcal{A}}(n) \le (n+1)^{V_{\mathcal{A}}}$ for all $n$.
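As a small worked instance (added here for illustration, not part of the original notes): the axis-aligned rectangle class above has VC dimension $V_{\mathcal{A}} = 4$, so at $n = 10$ the two bounds give the following.

```latex
% Worked instance of the shatter-coefficient bounds for V_A = 4 and n = 10.
\[
  S_{\mathcal{A}}(10) \;\le\; \sum_{i=0}^{4}\binom{10}{i}
  \;=\; 1 + 10 + 45 + 120 + 210 \;=\; 386
  \;<\; 2^{10} = 1024,
\]
\[
  \text{while the cruder polynomial bound gives } S_{\mathcal{A}}(10) \;\le\; (10+1)^{4} = 14641 .
\]
```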
Let $\mathcal{F}$ be a collection of classifiers of the form $f : \mathbb{R}^d \to \{0,1\}$. Define
$$\mathcal{A} \;=\; \bigl\{\, \{(x, y) \in \mathbb{R}^d \times \{0,1\} : f(x) \ne y\} \;:\; f \in \mathcal{F} \,\bigr\}.$$
In words, this is the collection of subsets of $\mathbb{R}^d \times \{0,1\}$ on which a classifier in $\mathcal{F}$ maps the features to a label opposite of $y$. The size of $\mathcal{A}$ expresses the richness of $\mathcal{F}$: the larger $\mathcal{F}$ is, the more likely it is that there exists an $f \in \mathcal{F}$ for which the risk $R(f)$ is close to the Bayes risk $R(f^*)$, where $f^*$ is the Bayes classifier. The shatter coefficient of $\mathcal{F}$ is defined as $S_{\mathcal{F}}(n) = S_{\mathcal{A}}(n)$, and the VC dimension of $\mathcal{F}$ is defined as $V_{\mathcal{F}} = V_{\mathcal{A}}$.
Example: linear (hyperplane) classifiers in $\mathbb{R}^d$.
Consider $d = 2$ and let $n$ be the number of training points. When $n = 1$, by using linear classifiers in $\mathbb{R}^2$ we can assign 1 to the single point or 0 to it; that is, we can assign 1 to every possible subset and 0 to its complement. Hence $S_{\mathcal{F}}(1) = 2 = 2^1$.
When $n = 2$, we can likewise assign 1 to every possible subset of the two points and 0 to its complement, and vice versa. Hence $S_{\mathcal{F}}(2) = 4 = 2^2$.
When $n = 3$, we can arrange the points (non-collinear) so that the set of linear classifiers shatters the three points, hence $S_{\mathcal{F}}(3) = 8 = 2^3$.
When $n = 4$, no matter where the points are placed and what binary values are designated, the set of linear classifiers does not shatter the four points. To see this, first observe that the four points form a 4-gon (if all four points, or any three of them, are collinear, then clearly linear classifiers cannot shatter the points). The two points belonging to each diagonal form two groups, and no linear classifier can assign one value to both points of one group and the other value to both points of the other group. Hence $S_{\mathcal{F}}(4) < 2^4$ and $V_{\mathcal{F}} = 3$.
We state here, without proof, that in general the class of linear classifiers in $\mathbb{R}^d$ has VC dimension $V_{\mathcal{F}} = d + 1$.
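Here is a numerical sanity check for the $d = 2$ case, added purely for illustration; it assumes NumPy and SciPy are available, and the helpers `separable` and `shattered` are made up for this sketch.

```python
# Check whether every +/-1 labeling of a small point set in R^2 can be realized
# by a hyperplane classifier sign(w.x + b), via a feasibility linear program.
from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """True if some (w, b) satisfies y_i (w.x_i + b) >= 1 for all i (labels in {-1, +1})."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Variables are (w_1, w_2, b); constraints: -y_i (x_i . w + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(points):
    """True if every labeling of the points is realized by some linear classifier."""
    return all(separable(points, labels)
               for labels in product([-1.0, 1.0], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # True: 3 points in general position
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: the "diagonal" labeling fails
```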
Let $Z_1, \dots, Z_n$ be i.i.d. $\mathbb{R}^d$-valued random variables. Denote the common distribution of the $Z_i$ by
$$P(A) = \Pr(Z_i \in A)$$
for any subset $A \subseteq \mathbb{R}^d$. Similarly, define the empirical distribution
$$P_n(A) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{Z_i \in A\}}.$$
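A minimal sketch of the empirical measure, added for illustration only (the helper `empirical_measure` and the chosen set $A$ are made up here): $P_n(A)$ is simply the fraction of the sample falling in $A$.

```python
# P_n(A) = (1/n) * sum_i 1{Z_i in A}, with the set A represented by a membership test.
import random

def empirical_measure(sample, indicator):
    """Fraction of the sample points that belong to the set defined by `indicator`."""
    return sum(indicator(z) for z in sample) / len(sample)

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]   # i.i.d. draws of Z_i
A = lambda z: z <= 0.0                                    # the set A = (-inf, 0]
print(empirical_measure(sample, A))                       # approximately P(Z <= 0) = 0.5
```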
Theorem: For any probability measure $P$, any collection of subsets $\mathcal{A}$, and any $\epsilon > 0$,
$$\Pr\Bigl(\,\sup_{A \in \mathcal{A}} |P_n(A) - P(A)| > \epsilon\Bigr) \;\le\; 8\, S_{\mathcal{A}}(n)\, e^{-n\epsilon^2/32},$$
and
$$E\Bigl[\,\sup_{A \in \mathcal{A}} |P_n(A) - P(A)|\Bigr] \;\le\; 2\sqrt{\frac{\log S_{\mathcal{A}}(n) + \log 2}{n}}.$$
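A back-of-the-envelope sketch, added for illustration only: plugging the polynomial bound $S_{\mathcal{A}}(n) \le (n+1)^{V_{\mathcal{A}}}$ into the probability bound above shows how it behaves with $n$ (the function name `vc_probability_bound` is hypothetical).

```python
# Evaluate the deviation-probability bound 8 * S_A(n) * exp(-n * eps^2 / 32),
# using S_A(n) <= (n+1)^V. Note the bound is vacuous (> 1) unless n is large.
import math

def vc_probability_bound(n, vc_dim, eps):
    """Upper bound on P( sup_A |P_n(A) - P(A)| > eps )."""
    return 8 * (n + 1) ** vc_dim * math.exp(-n * eps ** 2 / 32)

for n in (10_000, 100_000, 1_000_000):
    print(n, vc_probability_bound(n, vc_dim=3, eps=0.05))
```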
Before giving a proof of the theorem, we present a corollary.
Corollary: Let $\mathcal{F}$ be a collection of classifiers of the form $f : \mathbb{R}^d \to \{0,1\}$ with VC dimension $V_{\mathcal{F}}$. Let
$$\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{f(X_i) \ne Y_i\}} \quad \text{and} \quad R(f) = \Pr\bigl(f(X) \ne Y\bigr),$$
where $(X_i, Y_i)$, $i = 1, \dots, n$, are i.i.d. with joint distribution $P_{XY}$.
Define
$$\hat{f}_n \;=\; \arg\min_{f \in \mathcal{F}} \hat{R}_n(f).$$
Then
$$E\bigl[R(\hat{f}_n)\bigr] - \inf_{f \in \mathcal{F}} R(f) \;\le\; 4\sqrt{\frac{V_{\mathcal{F}} \log(n+1) + \log 2}{n}}.$$
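For a feel of the numbers (a worked instance added here for illustration, not part of the original notes), take linear classifiers in $\mathbb{R}^2$, so $V_{\mathcal{F}} = 3$, and $n = 10000$ training points:

```latex
% Numeric instance of the corollary bound with V_F = 3 and n = 10000.
\[
  E\bigl[R(\hat{f}_n)\bigr] - \inf_{f\in\mathcal{F}} R(f)
  \;\le\; 4\sqrt{\frac{3\log(10001) + \log 2}{10000}}
  \;\approx\; 4\sqrt{\frac{27.6 + 0.7}{10000}}
  \;\approx\; 0.21 .
\]
```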
Proof: Let $A_f = \{(x, y) : f(x) \ne y\}$ and $\mathcal{A} = \{A_f : f \in \mathcal{F}\}$.
Note that
$$\hat{R}_n(f) \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{f(X_i) \ne Y_i\}} \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{Z_i \in A_f\}} \;=\; P_n(A_f),$$
where $Z_i = (X_i, Y_i)$.
Similarly, $R(f) = \Pr\bigl(f(X) \ne Y\bigr) = P(A_f)$.
Therefore, according to the VC theorem,
$$E\Bigl[\,\sup_{f \in \mathcal{F}} \bigl|\hat{R}_n(f) - R(f)\bigr|\Bigr] \;=\; E\Bigl[\,\sup_{A \in \mathcal{A}} |P_n(A) - P(A)|\Bigr] \;\le\; 2\sqrt{\frac{\log S_{\mathcal{A}}(n) + \log 2}{n}}.$$
Since $S_{\mathcal{F}}(n) = S_{\mathcal{A}}(n)$ and $S_{\mathcal{F}}(n) \le (n+1)^{V_{\mathcal{F}}}$, it follows that
$$E\Bigl[\,\sup_{f \in \mathcal{F}} \bigl|\hat{R}_n(f) - R(f)\bigr|\Bigr] \;\le\; 2\sqrt{\frac{V_{\mathcal{F}} \log(n+1) + \log 2}{n}}.$$
Next, note that
$$E\bigl[R(\hat{f}_n)\bigr] - \inf_{f \in \mathcal{F}} R(f) \;\le\; 2\, E\Bigl[\,\sup_{f \in \mathcal{F}} \bigl|\hat{R}_n(f) - R(f)\bigr|\Bigr],$$
the standard bound relating the excess risk of empirical risk minimization to the uniform deviation between empirical and true risks.
Therefore,
$$E\bigl[R(\hat{f}_n)\bigr] - \inf_{f \in \mathcal{F}} R(f) \;\le\; 4\sqrt{\frac{V_{\mathcal{F}} \log(n+1) + \log 2}{n}}.$$