
The number of different labellings that a class $\mathcal{F}$ can produce on a set of $n$ training data is a measure of the "effective size" of $\mathcal{F}$. The Vapnik-Chervonenkis (VC) dimension of $\mathcal{F}$ is proportional to the log of this effective size. Let $V(\mathcal{F}, n)$ denote the VC dimension of $\mathcal{F}$; it is typically a constant, independent of $n$. The VC inequality states that

$$P\left( \sup_{f \in \mathcal{F}} \left| \hat{R}_n(f) - R(f) \right| > \epsilon \right) \le 8\, e^{V(\mathcal{F}, n)}\, e^{-n\epsilon^2/32}.$$

This type of uniform concentration inequality can be used in a similar fashion to our use of Hoeffding's inequality plus the union bound.
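As a reminder of how such a uniform bound gets used (this consequence is standard and is stated here only for orientation, not as part of the original derivation): writing $\hat{f}$ for any empirical risk minimizer over $\mathcal{F}$,

$$R(\hat{f}) - \inf_{f \in \mathcal{F}} R(f) \le 2 \sup_{f \in \mathcal{F}} \left| \hat{R}_n(f) - R(f) \right|,$$

so the VC inequality immediately yields

$$P\left( R(\hat{f}) - \inf_{f \in \mathcal{F}} R(f) > 2\epsilon \right) \le 8\, e^{V(\mathcal{F}, n)}\, e^{-n\epsilon^2/32}.$$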

Hyperplane classifiers

We will go into the details of VC theory next lecture; the remainder of this lecture will introduce the key ideas with an example. Consider the following setup. Let $\mathcal{X} = [0,1]^d$ and $\mathcal{Y} = \{0,1\}$. Let

$$\mathcal{F} = \left\{ f : f(x) = \mathbf{1}\{ w^T x + w_0 > 0 \} \right\},$$

with $w_0 \in \mathbb{R}$ and $w \in \mathbb{R}^d$, so that $(w_0, w) \in \mathbb{R}^{d+1}$. This is the collection of all hyperplane classifiers. $\mathcal{F}$ is infinite and uncountable.

Suppose that we have n training data

$$\left\{ (X_i, Y_i) \right\}_{i=1}^n.$$

There are at most $2\binom{n}{d}$ unique classifiers in $\mathcal{F}$ with respect to these data. To see this, consider $d$ arbitrary data points $x_{i_1}, \ldots, x_{i_d}$, and let $w^T x + w_0 = 0$ be a hyperplane containing these points. To be specific, take the hyperplane with

$$\left\| (w_0, w) \right\| = 1.$$

This hyperplane corresponds to two possible classification rules:

$$f_1(x) = \mathbf{1}\{ w^T x + w_0 > 0 \},$$
$$f_2(x) = \mathbf{1}\{ w^T x + w_0 < 0 \}.$$

Each $d$-tuple of training data produces two distinct classifiers, assuming the data are not co-linear. Thus, there are at most $2\binom{n}{d}$ unique classifiers in $\mathcal{F}$ with respect to the training data. (All other $f \in \mathcal{F}$ produce the same labels and empirical risk as one of these classifiers; a small numerical sketch of this enumeration is given after the definitions below.) Let's enumerate the unique hyperplane classifiers $f_1, \ldots, f_{2\binom{n}{d}}$, and let

$$\hat{f}_n = \arg\min_{f \in \{ f_1, \ldots, f_{2\binom{n}{d}} \}} \hat{R}_n(f),$$

and let

$$R^* = \inf_{f \in \mathcal{F}} R(f),$$

and define

$$f^* = \arg\min_{f \in \mathcal{F}} R(f).$$

If multiple $f \in \mathcal{F}$ achieve $R^*$, pick $f^*$ to be one of them in an arbitrary but fixed manner.
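To make the enumeration concrete, here is a small numerical sketch (not part of the original notes) for $d = 2$ with a hypothetical toy labelling rule: it builds the two classifiers associated with each $d$-tuple of data points, counts the distinct labellings they induce, and performs empirical risk minimization over the resulting finite list.

```python
# Minimal sketch of the enumeration argument for d = 2 (toy data; the labelling rule below
# is a hypothetical choice for illustration only, not something specified in the notes).
from itertools import combinations
from math import comb

import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 12
X = rng.random((n, d))                                   # P_X uniform on [0,1]^2 (has a density)
Y = (X @ np.array([1.0, -1.0]) + 0.1 > 0).astype(int)    # hypothetical labelling rule

def classifiers_through(i, j):
    """The two rules f(x) = 1{w^T x + w0 > 0} and f(x) = 1{w^T x + w0 < 0} whose
    separating line passes through data points x_i and x_j."""
    u = X[j] - X[i]
    w = np.array([-u[1], u[0]])                          # normal to the line through x_i, x_j
    w0 = -w @ X[i]
    return [lambda Z, s=s: (s * (Z @ w + w0) > 0).astype(int) for s in (+1, -1)]

candidates = [f for (i, j) in combinations(range(n), d) for f in classifiers_through(i, j)]
print("candidate classifiers:", len(candidates), "= 2*C(n,d) =", 2 * comb(n, d))

# Distinct labellings of the data induced by the candidates: the "effective size" of F here.
print("distinct labellings:", len({tuple(f(X)) for f in candidates}))

# Empirical risk minimization over the finite list of candidates.
emp_risk = lambda f: float(np.mean(f(X) != Y))
f_hat = min(candidates, key=emp_risk)
print("empirical risk of f_hat:", emp_risk(f_hat))
```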

Theorem

Assume that $P_X$ has a density, but that the distribution of $(X, Y)$ is otherwise arbitrary. If $n \ge d$ and $2d/n \le \epsilon \le 1$, then

$$P\left( R(\hat{f}_n) - R(f^*) > \epsilon \right) \le e^{2d\epsilon} \left( 2\binom{n}{d} + 1 \right) e^{-n\epsilon^2/2}.$$
The assumption that $P_X$ has a density ensures that, with probability one, no $d+1$ data points are co-planar (i.e., lie on a common hyperplane). This, in turn, guarantees that there are exactly $2\binom{n}{d}$ unique classifiers and that the $2\binom{n}{d}$ under consideration are fully representative of all possible classifiers in $\mathcal{F}$ with respect to the data.
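As a quick sanity check on the form of this bound (a sketch with hypothetical values of $n$, $d$, and $\epsilon$, not taken from the notes), one can evaluate $e^{2d\epsilon}\left(2\binom{n}{d}+1\right)e^{-n\epsilon^2/2}$ numerically: the combinatorial factor grows only polynomially in $n$, so the bound becomes meaningful once $n\epsilon^2$ is large compared to $d \log n$.

```python
# Evaluate the theorem's tail bound for a few hypothetical (n, d, eps) values.
from math import comb, exp

def hyperplane_vc_bound(n, d, eps):
    assert n >= d and 2 * d / n <= eps <= 1, "theorem assumes n >= d and 2d/n <= eps <= 1"
    return exp(2 * d * eps) * (2 * comb(n, d) + 1) * exp(-n * eps**2 / 2)

for n in (10_000, 50_000, 200_000):
    print(f"n = {n:7d}:  P(R(f_hat) - R(f*) > 0.05) <= {hyperplane_vc_bound(n, d=2, eps=0.05):.3g}")
```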

The proof is a specialization of the basic ingredients of VC theory to the case at hand. Here we follow the proof in DGL'96 (Devroye, Györfi, and Lugosi, 1996). First we note that

$$\begin{aligned}
R(\hat{f}_n) - R(f^*) &= R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) + \hat{R}_n(\hat{f}_n) - R(f^*) \\
&\le R(\hat{f}_n) - \hat{R}_n(\hat{f}_n) + \hat{R}_n(f^*) - R(f^*) + d/n,
\end{aligned}$$

where the second line uses $\hat{R}_n(\hat{f}_n) \le \hat{R}_n(f) + d/n$ for any $f \in \mathcal{F}$ (an arbitrary $f$ assigns the same labels as one of the enumerated classifiers, except possibly at the $d$ points lying on that classifier's hyperplane). Moreover, since $\hat{f}_n$ is one of the enumerated classifiers,

$$R(\hat{f}_n) - R(f^*) \le \max_{i=1,\ldots,2\binom{n}{d}} \left( R(f_i) - \hat{R}_n(f_i) \right) + \hat{R}_n(f^*) - R(f^*) + d/n.$$

Therefore, by the union bound:

$$P\left( R(\hat{f}_n) - R(f^*) > \epsilon \right) \le \sum_{i=1}^{2\binom{n}{d}} P\left( R(f_i) - \hat{R}_n(f_i) > \epsilon/2 \right) + P\left( \hat{R}_n(f^*) - R(f^*) + d/n > \epsilon/2 \right).$$

We can bound the second term of the above bound using Chernoff's/Hoeffding's inequality:

$$\begin{aligned}
P\left( \hat{R}_n(f^*) - R(f^*) > \epsilon/2 - d/n \right) &\le e^{-2n(\epsilon/2 - d/n)^2} \\
&\le e^{2d\epsilon}\, e^{-n\epsilon^2/2}.
\end{aligned}$$
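The last inequality is just algebra; spelling it out (and noting that the assumption $\epsilon \ge 2d/n$ makes the deviation $\epsilon/2 - d/n$ nonnegative, so Hoeffding's inequality applies),

$$-2n\left( \frac{\epsilon}{2} - \frac{d}{n} \right)^2 = -\frac{n\epsilon^2}{2} + 2d\epsilon - \frac{2d^2}{n} \le -\frac{n\epsilon^2}{2} + 2d\epsilon.$$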

Next, let's bound one of the terms in the summation. For example, take

$$P\left( R(f_1) - \hat{R}_n(f_1) > \epsilon/2 \right).$$

Note that by symmetry all $2\binom{n}{d}$ terms will have identical bounds, since the bounds are independent of $P_{XY}$.

Assume that $f_1$ is determined by the first $d$ data points $X_1, \ldots, X_d$. By the smoothing property of expectation we can write

$$P\left( R(f_1) - \hat{R}_n(f_1) > \epsilon/2 \right) = E\left[ P\left( R(f_1) - \hat{R}_n(f_1) > \epsilon/2 \,\middle|\, X_1, \ldots, X_d \right) \right].$$

From here, we will bound the conditional probability inside the expectation. Let $(X_1'', Y_1''), \ldots, (X_d'', Y_d'')$ be $d$ additional random samples, independent of and identically distributed as the data $(X_1, Y_1), \ldots, (X_d, Y_d)$. The collection $\{(X_i'', Y_i'')\}_{i=1}^d$ is often called the "ghost sample," since it is not actually observed; it is a fictitious sample that leads to a simple bound on the conditional probability. Define, for $i \le d$,

$$(X_i', Y_i') = (X_i'', Y_i''),$$

and, for $i > d$,

$$(X_i', Y_i') = (X_i, Y_i).$$

That is, $\{(X_i', Y_i')\}_{i=1}^n$ agrees with our observed data for $i > d$, but the first $d$ samples are replaced with the ghost sample. Then,

$$\begin{aligned}
P\left( R(f_1) - \hat{R}_n(f_1) > \epsilon/2 \,\middle|\, X_1, \ldots, X_d \right)
&\le P\left( R(f_1) - \frac{1}{n} \sum_{i=d+1}^n \mathbf{1}\{ f_1(X_i) \ne Y_i \} > \epsilon/2 \,\middle|\, X_1, \ldots, X_d \right) \\
&\le P\left( R(f_1) - \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{ f_1(X_i') \ne Y_i' \} + \frac{d}{n} > \epsilon/2 \,\middle|\, X_1, \ldots, X_d \right) \\
&= P\left( R(f_1) - \hat{R}_n'(f_1) > \epsilon/2 - d/n \,\middle|\, X_1, \ldots, X_d \right),
\end{aligned}$$

where,

$$\hat{R}_n'(f_1) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{ f_1(X_i') \ne Y_i' \}.$$

Note that $n \hat{R}_n'(f_1)$ is binomially distributed with parameters $n$ and $R(f_1)$, and it is independent of $X_1, \ldots, X_d$. Therefore,

$$\begin{aligned}
P\left( R(f_1) - \hat{R}_n'(f_1) > \epsilon/2 - d/n \,\middle|\, X_1, \ldots, X_d \right)
&= P\left( R(f_1) - \hat{R}_n'(f_1) > \epsilon/2 - d/n \right) \\
&\le e^{-2n(\epsilon/2 - d/n)^2} \\
&\le e^{2d\epsilon}\, e^{-n\epsilon^2/2}.
\end{aligned}$$
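The binomial claim is easy to check numerically. Below is a small Monte Carlo sketch (with a hypothetical toy distribution $P_{XY}$, not something specified in the notes) for $d = 2$: the classifier $f_1$ is the line through the first two data points, and, because the primed sample is i.i.d. and independent of those points, $n\hat{R}_n'(f_1)$ matches the mean and variance of a $\mathrm{Binomial}(n, R(f_1))$ random variable.

```python
# Monte Carlo sketch of the ghost-sample claim for d = 2 (toy distribution, illustration only).
import numpy as np

rng = np.random.default_rng(0)
d, n, trials, flip = 2, 200, 20_000, 0.1

def sample(m):
    """m i.i.d. pairs: X uniform on [0,1]^2, Y = 1{x1 + x2 > 1} with labels flipped w.p. 0.1."""
    X = rng.random((m, d))
    Y = (X.sum(axis=1) > 1) ^ (rng.random(m) < flip)
    return X, Y.astype(int)

# f_1: the classifier whose separating line passes through the first d = 2 data points.
X_head, _ = sample(d)
u = X_head[1] - X_head[0]
w = np.array([-u[1], u[0]])                  # normal to the line through the two points
w0 = -w @ X_head[0]
f1 = lambda Z: (Z @ w + w0 > 0).astype(int)

# True risk R(f_1), approximated on a large independent sample.
Xb, Yb = sample(500_000)
R1 = float(np.mean(f1(Xb) != Yb))

# n * R_n'(f_1): the primed sample replaces (X_1, Y_1), ..., (X_d, Y_d) by the ghost sample,
# so all n primed pairs are i.i.d. and independent of f_1; redrawing all of them each trial
# is distributionally identical to the construction in the text.
errs = np.empty(trials)
for t in range(trials):
    Xp, Yp = sample(n)
    errs[t] = np.sum(f1(Xp) != Yp)

print(f"mean of n*R_n'(f_1): {errs.mean():.2f}   (Binomial mean n*R(f_1) = {n * R1:.2f})")
print(f"var  of n*R_n'(f_1): {errs.var():.2f}   (Binomial var = {n * R1 * (1 - R1):.2f})")
```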

In conclusion,

$$\begin{aligned}
P\left( R(\hat{f}_n) - R^* > \epsilon \right)
&\le \sum_{i=1}^{2\binom{n}{d}} P\left( R(f_i) - \hat{R}_n(f_i) > \epsilon/2 \right) + P\left( \hat{R}_n(f^*) - R(f^*) + d/n > \epsilon/2 \right) \\
&\le 2\binom{n}{d}\, e^{2d\epsilon}\, e^{-n\epsilon^2/2} + e^{2d\epsilon}\, e^{-n\epsilon^2/2} \\
&= e^{2d\epsilon} \left( 2\binom{n}{d} + 1 \right) e^{-n\epsilon^2/2}.
\end{aligned}$$

Lastly, we obtain the following corollary.

Corollary

If $n \ge d$, then

$$E\left[ R(\hat{f}_n) \right] - \min_{f \in \mathcal{F}} R(f) \le \sqrt{ \frac{2(d+1)(\log n + 2)}{n} }.$$
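To see where this rate comes from (a sketch only, glossing over the restriction $\epsilon \ge 2d/n$ and bounding $e^{2d\epsilon} \le e^{2d}$ for $\epsilon \le 1$): write $Z = R(\hat{f}_n) - R^*$ and $C = e^{2d}\left(2\binom{n}{d}+1\right)$, so that the theorem gives $P(Z > \epsilon) \le C e^{-n\epsilon^2/2}$. Then, for any $u \in (0,1]$,

$$E[Z] = \int_0^1 P(Z > \epsilon)\, d\epsilon \le u + \int_u^\infty C\, e^{-n\epsilon^2/2}\, d\epsilon \le u + \frac{C\, e^{-nu^2/2}}{nu}.$$

Choosing $u = \sqrt{2\log C / n}$ makes the last term equal to $1/(nu)$, and since $\binom{n}{d} \le n^d$ we have $\log C \le 2d + \log 3 + d\log n \le (d+1)(\log n + 2)$, which recovers the stated rate up to the lower-order $1/(nu)$ term.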

Source:  OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3