The number of different labellings that a class of classifiers $\mathcal{F}$ can produce on a set of $n$ training data is a measure of the "effective size" of $\mathcal{F}$. The Vapnik-Chervonenkis (VC) dimension of $\mathcal{F}$ is proportional to the log of this effective size. Let $V(\mathcal{F})$ denote the VC dimension of $\mathcal{F}$; it is typically a constant, independent of $n$. The VC inequality states that, for all $\epsilon > 0$, the deviation between the empirical risk and the true risk is small uniformly over $\mathcal{F}$, with probability rapidly approaching one as $n$ grows.
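One standard way to make this precise, following DGL'96 and writing $R(f)$ for the true risk, $\hat{R}_n(f)$ for the empirical risk, and $S(\mathcal{F}, n)$ for the shatter coefficient (the number of distinct labellings of $n$ points achievable by $\mathcal{F}$), is the bound
$$P\Big(\sup_{f \in \mathcal{F}} \big|\hat{R}_n(f) - R(f)\big| > \epsilon\Big) \;\le\; 8\, S(\mathcal{F}, n)\, e^{-n\epsilon^2/32}, \qquad S(\mathcal{F}, n) \le (n+1)^{V(\mathcal{F})}.$$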
This type of uniform concentration inequality can be used in a similar fashion to our use of Hoeffding's inequality plus the union bound.
We will go into the details of VC Theory next lecture, and the remainder of this lecture will introduce the key ideas with an example. Consider the following setup. Let $\mathcal{X} = \mathbb{R}^d$, let $\mathcal{Y} = \{0, 1\}$, and let
$$\mathcal{F} = \Big\{ f(x) = \mathbf{1}_{\{w^T x + b \ge 0\}} \;:\; w \in \mathbb{R}^d,\ b \in \mathbb{R} \Big\},$$
with $w \in \mathbb{R}^d$ the normal vector and $b \in \mathbb{R}$ the offset of a hyperplane. This is the collection of all hyperplane classifiers. $\mathcal{F}$ is infinite and uncountable.
Suppose that we have $n$ training data $\{(X_i, Y_i)\}_{i=1}^n$, drawn i.i.d. from the joint distribution of $(X, Y)$.
There are at most $2\binom{n}{d}$ unique classifiers in $\mathcal{F}$ with respect to these data. To see this, consider $d$ arbitrary data points $X_{i_1}, \dots, X_{i_d}$, and let $\{x : w^T x + b = 0\}$ be a hyperplane containing these points. To be specific, take the hyperplane with $w$ and $b$ satisfying $w^T X_{i_j} + b = 0$ for $j = 1, \dots, d$ (when the points are in general position this hyperplane is unique, up to rescaling of $(w, b)$). This hyperplane coincides with two possible classification rules:
$$f(x) = \mathbf{1}_{\{w^T x + b \ge 0\}} \qquad \text{and} \qquad f(x) = \mathbf{1}_{\{w^T x + b < 0\}}.$$
Each $d$-tuple of training data produces two distinct classifiers, assuming the data points are in general position (for example, not collinear when $d = 2$). Thus, there are at most $2\binom{n}{d}$ unique classifiers in $\mathcal{F}$ with respect to the training data. (All other $f \in \mathcal{F}$ produce the same labels and empirical risk as one of these classifiers: any hyperplane can be translated and rotated until it contains $d$ of the training points without any training point crossing to the other side, and one of the two rules associated with the resulting hyperplane reproduces the original labels.) Let's enumerate the unique hyperplane classifiers $f_1, \dots, f_N$, with $N \le 2\binom{n}{d}$, let
$$\hat{R}_n(f_j) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{f_j(X_i) \ne Y_i\}}$$
denote the empirical risk of $f_j$ and $R(f_j) = P\big(f_j(X) \ne Y\big)$ its true risk, and define the empirical risk minimizer
$$\hat{f}_n = \arg\min_{f \in \{f_1, \dots, f_N\}} \hat{R}_n(f).$$
If multiple $f_j$ achieve the minimum empirical risk, pick $\hat{f}_n$ to be one of them in an arbitrary but fixed manner.
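To make the finite enumeration concrete, here is a small illustrative sketch (not part of the original notes) that enumerates the hyperplanes determined by $d$-tuples of training points and returns the empirical risk minimizer. The function name erm_hyperplanes, the $\{0,1\}$ labels, and the choice to represent each hyperplane as $w^T x = 1$ (ignoring the measure-zero case of hyperplanes through the origin) are assumptions made for this sketch; the brute-force search over all $\binom{n}{d}$ tuples is only practical for tiny examples.

import itertools
import numpy as np

def erm_hyperplanes(X, y):
    """Empirical risk minimization over the <= 2*C(n, d) hyperplane classifiers
    determined by d-tuples of the training points (brute-force illustration)."""
    n, d = X.shape
    best_risk, best_rule = np.inf, None
    for idx in itertools.combinations(range(n), d):
        A = X[list(idx)]
        try:
            # Hyperplane through the d chosen points, written as w^T x = 1.
            w = np.linalg.solve(A, np.ones(d))
        except np.linalg.LinAlgError:
            continue  # points not in general position
        side = (X @ w >= 1.0).astype(int)
        for flip in (False, True):
            pred = 1 - side if flip else side   # the two rules for this hyperplane
            risk = np.mean(pred != y)           # empirical risk (0/1 loss)
            if risk < best_risk:
                best_risk, best_rule = risk, (w, flip)
    return best_rule, best_risk

# Tiny synthetic example (d = 2): labels given by a noisy halfspace.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))
y = (X @ np.array([1.0, 1.0]) > 0.3).astype(int)
(w, flip), risk = erm_hyperplanes(X, y)
print("minimum empirical risk:", risk)

The value returned is $\hat{R}_n(\hat{f}_n)$ for the finite family $f_1, \dots, f_N$ described above.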
Theorem: Assume that $X$ has a density, but that the distribution of $Y$ is otherwise arbitrary. Then, for $n$ and $\epsilon$ satisfying a mild condition, the probability that $R(\hat{f}_n)$ exceeds $\min_{f \in \mathcal{F}} R(f)$ by more than $\epsilon$ is exponentially small in $n\epsilon^2$.
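A concrete statement consistent with the proof sketched below (the constants come from the particular choices made in that sketch, and the original hypotheses and constants may be stated differently) is: if $X$ has a density and $\epsilon \ge 4d/n$, then
$$P\Big(R(\hat{f}_n) - \min_{f \in \mathcal{F}} R(f) > \epsilon\Big) \;\le\; \Big(2\binom{n}{d} + 1\Big)\, e^{-n\epsilon^2/8}.$$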
The proof is a specialization of the basic ingredients of VC Theory to the case at hand. Here we follow the proof in DGL'96. Let $f^*$ denote a minimizer of the true risk over $\mathcal{F}$. First we note that the excess risk $R(\hat{f}_n) - R(f^*)$ is at most the sum of the two deviations $R(\hat{f}_n) - \hat{R}_n(\hat{f}_n)$ and $\hat{R}_n(f^*) - R(f^*)$, since $\hat{R}_n(\hat{f}_n) \le \hat{R}_n(f)$ for any $f \in \mathcal{F}$ (every $f \in \mathcal{F}$ labels the training data exactly as one of the $f_j$). Therefore, since $\hat{f}_n$ is one of $f_1, \dots, f_N$, by the union bound the probability that the excess risk exceeds $\epsilon$ is at most a summation of deviation probabilities over the classifiers $f_1, \dots, f_N$, plus a deviation probability for $f^*$. We can bound the second term of the above bound using Chernoff's/Hoeffding's inequality, since $f^*$ is a fixed (non-random) classifier and $\hat{R}_n(f^*)$ is an average of $n$ independent $\{0,1\}$-valued losses with mean $R(f^*)$.
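In symbols, a sketch of these steps (the split of $\epsilon$ into halves and the one-sided Hoeffding bound for averages of $[0,1]$-valued variables are the choices made in this sketch):
$$R(\hat{f}_n) - R(f^*) \;\le\; \big(R(\hat{f}_n) - \hat{R}_n(\hat{f}_n)\big) + \big(\hat{R}_n(f^*) - R(f^*)\big) \;\le\; \max_{1 \le j \le N} \big(R(f_j) - \hat{R}_n(f_j)\big) + \big(\hat{R}_n(f^*) - R(f^*)\big),$$
so that
$$P\Big(R(\hat{f}_n) - R(f^*) > \epsilon\Big) \;\le\; \sum_{j=1}^{N} P\Big(R(f_j) - \hat{R}_n(f_j) > \tfrac{\epsilon}{2}\Big) \;+\; P\Big(\hat{R}_n(f^*) - R(f^*) > \tfrac{\epsilon}{2}\Big),$$
and, since $f^*$ is fixed,
$$P\Big(\hat{R}_n(f^*) - R(f^*) > \tfrac{\epsilon}{2}\Big) \;\le\; e^{-2n(\epsilon/2)^2} \;=\; e^{-n\epsilon^2/2}.$$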
Next, let's bound one of the terms in the summation. For example, take the term corresponding to $f_1$. Note that by symmetry (the training data are i.i.d., hence exchangeable) all terms will have identical bounds, and since the bounds are independent of $j$, the summation is at most $2\binom{n}{d}$ times the bound on a single term.
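With the $\epsilon/2$ threshold used in the sketch above, this symmetry observation gives
$$\sum_{j=1}^{N} P\Big(R(f_j) - \hat{R}_n(f_j) > \tfrac{\epsilon}{2}\Big) \;\le\; 2\binom{n}{d}\, P\Big(R(f_1) - \hat{R}_n(f_1) > \tfrac{\epsilon}{2}\Big).$$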
Assume that $f_1$ is determined by the first $d$ data points $X_1, \dots, X_d$. By the smoothing property of expectations, we can write the deviation probability for $f_1$ as the expectation of its conditional probability given the first $d$ training pairs. From here, we will bound the conditional probability inside the expectation. Let $(X_1', Y_1'), \dots, (X_d', Y_d')$ be $d$ additional random samples that are independent of, and identically distributed as, the data $(X_1, Y_1), \dots, (X_n, Y_n)$. These are often called the "ghost sample," since they are not actually observed; they are a fictitious sample that leads to a simple bound on the conditional probability. Define $(\tilde{X}_i, \tilde{Y}_i) = (X_i, Y_i)$ if $i > d$, or $(\tilde{X}_i, \tilde{Y}_i) = (X_i', Y_i')$ if $i \le d$, and let $\tilde{R}_n(f_1)$ denote the empirical risk of $f_1$ computed on $(\tilde{X}_1, \tilde{Y}_1), \dots, (\tilde{X}_n, \tilde{Y}_n)$. That is, $\tilde{R}_n(f_1)$ agrees with our observed data on $i > d$, but the first $d$ samples are replaced with the ghost sample. Then the deviation of $\hat{R}_n(f_1)$ can be bounded in terms of $\tilde{R}_n(f_1)$, as follows.
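In symbols, a sketch of these two steps (the smoothing identity and the comparison of $\hat{R}_n(f_1)$ with $\tilde{R}_n(f_1)$; the correction term $B$ appearing here is part of this reconstruction and is defined just below):
$$P\Big(R(f_1) - \hat{R}_n(f_1) > \tfrac{\epsilon}{2}\Big) \;=\; E\Big[\, P\Big(R(f_1) - \hat{R}_n(f_1) > \tfrac{\epsilon}{2} \;\Big|\; (X_1, Y_1), \dots, (X_d, Y_d)\Big) \Big],$$
and, since $\hat{R}_n(f_1)$ and $\tilde{R}_n(f_1)$ differ in at most their first $d$ terms, each lying in $\{0, 1\}$,
$$R(f_1) - \hat{R}_n(f_1) \;\le\; R(f_1) - \tilde{R}_n(f_1) + \frac{B}{n},$$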
where $B = \sum_{i=1}^{d} \mathbf{1}_{\{f_1(X_i') \ne Y_i'\}}$ is the number of ghost samples misclassified by $f_1$.
Note that, given the first $d$ data points, $B$ is binomially distributed with parameters $d$ and $R(f_1)$, so its mean is $d\,R(f_1) \le d$, and it is independent of the remaining samples $(X_i, Y_i)$, $i > d$ (and hence of $\hat{R}_n(f_1)$, given the first $d$ points). In particular, $B \le d$ always. Therefore, conditionally on the first $d$ data points, the deviation of $\tilde{R}_n(f_1)$ is controlled by Hoeffding's inequality, while the $B/n$ correction is at most $d/n$.
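Continuing the sketch, split the threshold $\epsilon/2$ into two pieces of $\epsilon/4$ (a convenient choice here) and apply Hoeffding's inequality conditionally to $\tilde{R}_n(f_1)$, which, given the first $d$ data points, is an average of $n$ i.i.d. $\{0,1\}$-valued losses with mean $R(f_1)$:
$$P\Big(R(f_1) - \hat{R}_n(f_1) > \tfrac{\epsilon}{2} \;\Big|\; (X_1, Y_1), \dots, (X_d, Y_d)\Big) \;\le\; P\Big(R(f_1) - \tilde{R}_n(f_1) > \tfrac{\epsilon}{4} \;\Big|\; \cdot\Big) + P\Big(\tfrac{B}{n} > \tfrac{\epsilon}{4} \;\Big|\; \cdot\Big) \;\le\; e^{-n\epsilon^2/8},$$
provided $\epsilon \ge 4d/n$, since then $B/n \le d/n \le \epsilon/4$ and the second probability vanishes.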
In conclusion, combining the bounds above and averaging over the first $d$ data points establishes the theorem.
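Concretely, with the choices made in this sketch, taking expectations and combining with the union bound and the Hoeffding step above gives, for $\epsilon \ge 4d/n$,
$$P\Big(R(\hat{f}_n) - \min_{f \in \mathcal{F}} R(f) > \epsilon\Big) \;\le\; 2\binom{n}{d}\, e^{-n\epsilon^2/8} + e^{-n\epsilon^2/2} \;\le\; \Big(2\binom{n}{d} + 1\Big)\, e^{-n\epsilon^2/8},$$
which is the form of the bound stated above.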
Lastly, we have a corollary bounding the expected excess risk.
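One typical form of such a corollary, obtained by integrating the tail bound above and using $\binom{n}{d} \le n^d$ (the constant and the exact condition on $n$ depend on the choices made in this sketch), is: for $n$ sufficiently large relative to $d$,
$$E\big[R(\hat{f}_n)\big] - \min_{f \in \mathcal{F}} R(f) \;\le\; C \sqrt{\frac{d \log n}{n}}$$
for a constant $C$.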