
Now $p_{Y|X}^{(1)}(Y_1 = 1 \mid X_1) = 1/2 + a$ and $p_{Y|X}^{(0)}(Y_1 = 1 \mid X_1) = 1/2 - a$. Also, under model 1, $Y_1 \sim \mathrm{Bernoulli}(1/2 + a)$. So we get:

$$K(P_1 \| P_0) = n \left[ \left( \tfrac{1}{2} + a \right) \log \frac{1/2 + a}{1/2 - a} + \left( \tfrac{1}{2} - a \right) \log \frac{1/2 - a}{1/2 + a} \right] = n \left[ 2a \log\left( \tfrac{1}{2} + a \right) - 2a \log\left( \tfrac{1}{2} - a \right) \right] = 2na \log \frac{1/2 + a}{1/2 - a} \le 2na \left( \frac{1/2 + a}{1/2 - a} - 1 \right) = 4na^2\, \frac{1}{1/2 - a},$$

where the inequality uses $\log x \le x - 1$.

Let $a = 1/\sqrt{n}$ and $n \ge 16$; then $K(P_1 \| P_0) \le 4n \cdot \frac{1}{n} \cdot \frac{1}{1/2 - 1/\sqrt{n}} \le 16$.
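As a quick numerical sanity check of the last two displays (a minimal sketch in Python; the particular values of $n$ below are arbitrary), one can compare the exact Bernoulli KL divergence with the bound $4na^2/(1/2 - a)$ for $a = 1/\sqrt{n}$:

```python
import math

def kl_bernoulli(p, q):
    """Exact KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

for n in [16, 100, 10_000]:
    a = 1 / math.sqrt(n)
    exact = n * kl_bernoulli(0.5 + a, 0.5 - a)   # K(P_1 || P_0) for n observations
    bound = 4 * n * a**2 / (0.5 - a)             # the bound derived above
    print(f"n={n:6d}  exact={exact:8.4f}  bound={bound:8.4f}  bound<=16: {bound <= 16}")
```

For every $n \ge 16$ the exact divergence stays below the bound, and the bound stays below 16, as used above.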

Using Theorem 1, since $d_\Delta(G_0^*, G_1^*) = 1$, we get:

$$\inf_{\hat G_n} \max_{j} P_j\!\left( d_\Delta(\hat G_n, G_j^*) \ge 1/2 \right) \ge \frac{1}{4} e^{-16}.$$

Taking $s_n = 1/\sqrt{n}$, this implies

$$\inf_{\hat G_n} \sup_{p \in \mathcal{P}} P\!\left( R(\hat G_n) - R(G^*) \ge \frac{1}{\sqrt{n}} \right) \ge \frac{1}{4} e^{-16},$$

or, after applying Markov's inequality ($E[Z] \ge s\, P(Z \ge s)$ for $Z \ge 0$, here with $Z = R(\hat G_n) - R(G^*)$ and $s = 1/\sqrt{n}$),

$$\inf_{\hat G_n} \sup_{p \in \mathcal{P}} E\!\left[ R(\hat G_n) - R(G^*) \right] \ge \frac{1}{4} e^{-16}\, \frac{1}{\sqrt{n}}.$$

Therefore, apart from the $\log n$ factor, ERM attains the best possible performance.

Reducing the initial problem to a binary hypothesis test does not always work. Sometimes we need $M$ hypotheses, with $M \to \infty$ as $n \to \infty$. If this is the case, we have the following theorem:

Theorem 2  Let $M \ge 2$ and let $\{f_0, \ldots, f_M\} \subset \mathcal{F}$ be such that

  • $d(f_j, f_k) \ge 2 s_n$ for all $j \ne k$, where $d$ is a semi-distance;
  • $\frac{1}{M} \sum_{j=1}^M K(P_j \| P_0) \le \alpha \log M$, with $0 < \alpha < 1/8$.

Then

$$\inf_{\hat f_n} \sup_{f \in \mathcal{F}} P_f\!\left( d(\hat f_n, f) \ge s_n \right) \ge \inf_{\hat f_n} \max_{j} P_j\!\left( d(\hat f_n, f_j) \ge s_n \right) \ge \frac{\sqrt{M}}{1 + \sqrt{M}} \left( 1 - 2\alpha - \sqrt{\frac{2\alpha}{\log M}} \right) > 0.$$
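To get a feel for the constant on the right-hand side, here is a small illustrative computation (a sketch only; the particular values of $M$ and $\alpha$ are arbitrary, subject to $\alpha < 1/8$):

```python
import math

def theorem2_constant(M, alpha):
    """sqrt(M)/(1+sqrt(M)) * (1 - 2*alpha - sqrt(2*alpha/log M)), the bound in Theorem 2."""
    return math.sqrt(M) / (1 + math.sqrt(M)) * (1 - 2 * alpha - math.sqrt(2 * alpha / math.log(M)))

for M in [2, 16, 1024]:
    for alpha in [1 / 16, 1 / 10]:
        print(f"M={M:5d}  alpha={alpha:.4f}  bound={theorem2_constant(M, alpha):.4f}")
```

All the printed values are strictly positive, and the constant improves as $M$ grows and $\alpha$ shrinks.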

We will use this theorem to show that the estimator of Lecture 4 is optimal. Recall the setup of Lecture 4. Let

$$\mathcal{F} = \left\{ f : |f(t) - f(s)| \le L |t - s| \ \ \forall\, t, s \right\},$$

i.e. the class of Lipschitz functions with constant $L$. Let

$$x_i = i/n, \quad i = 1, \ldots, n,$$

$$Y_i = f(x_i) + W_i,$$

where $E[W_i] = 0$, $E[W_i^2] = \sigma^2 < \infty$, and $W_i$, $W_j$ are independent for $i \ne j$. In that lecture, we constructed an estimator $\hat f_n$ such that

$$\sup_{f \in \mathcal{F}} E\!\left[ \| \hat f_n - f \|^2 \right] = O\!\left( n^{-2/3} \right).$$

Is this the best we can do?

We are going to construct a collection $f_0, \ldots, f_M \in \mathcal{F}$ and apply Theorem 2. Notice that the metric of interest is $d(\hat f_n, f) = \| \hat f_n - f \|$, a semi-distance. Let $W_i \overset{iid}{\sim} N(0, \sigma^2)$. Let $m \in \mathbb{N}$, $h = 1/m$, and define

$$K(x) = \left( \frac{Lh}{2} - L|x| \right) \mathbf{1}\{|x| \le h/2\} = \frac{L}{2}\left( h - 2|x| \right) \mathbf{1}\{|x| \le h/2\}.$$

Note that $|K(a) - K(b)| \le L|a - b|$ for all $a, b$. The subclass we are going to consider consists of “bump” functions built out of shifted copies of $K$. Let $\Omega = \{0, 1\}^m$ be the collection of binary vectors of length $m$, e.g. $w = (1, 0, 1, \ldots, 0) \in \Omega$, and define

$$f_w(x) = \sum_{i=1}^m w_i\, K\!\left( x - \frac{h}{2}(2i - 1) \right).$$

Note that for $w, w' \in \Omega$,

$$d(f_w, f_{w'}) = \| f_w - f_{w'} \| = \left( \int_0^1 \sum_{i=1}^m (w_i - w_i')^2\, K^2\!\left( x - \frac{h}{2}(2i - 1) \right) dx \right)^{1/2} = \sqrt{ \rho(w, w') \int K^2(x)\, dx },$$

where $\rho(w, w')$ is the Hamming distance: $\rho(w, w') = \sum_{i=1}^m |w_i - w_i'|^2 = \sum_{i=1}^m |w_i - w_i'|$. Now

$$\int K^2(x)\, dx = 2 \int_0^{h/2} L^2 x^2\, dx = \frac{2 L^2 h^3}{3 \cdot 8} = \frac{L^2}{12}\, h^3,$$

so

$$d(f_w, f_{w'}) = \sqrt{\rho(w, w')}\, \frac{L}{\sqrt{12}}\, h^{3/2}.$$
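This identity is easy to verify numerically. The sketch below uses arbitrary values of $L$, $m$, and the integration grid, approximates the $L^2$ norm by a Riemann sum, and places the bump centers at $\frac{h}{2}(2i-1)$ (written with 0-based indices):

```python
import numpy as np

L, m = 2.0, 8
h = 1.0 / m
rng = np.random.default_rng(0)

def K(x):
    """Triangular bump (L*h/2 - L*|x|) on |x| <= h/2, zero elsewhere."""
    return np.where(np.abs(x) <= h / 2, L * h / 2 - L * np.abs(x), 0.0)

def f_w(x, w):
    """Sum of bumps selected by the binary vector w (0-based i, centers h/2*(2i+1))."""
    return sum(w[i] * K(x - h / 2 * (2 * i + 1)) for i in range(m))

w1, w2 = rng.integers(0, 2, m), rng.integers(0, 2, m)
x = np.linspace(0.0, 1.0, 200_000, endpoint=False)               # grid on [0, 1)
dx = 1.0 / len(x)
numeric = np.sqrt(np.sum((f_w(x, w1) - f_w(x, w2)) ** 2) * dx)    # Riemann-sum L2 distance
rho = np.sum(np.abs(w1 - w2))                                     # Hamming distance
formula = np.sqrt(rho) * L / np.sqrt(12) * h ** 1.5
print(numeric, formula)                                           # should agree to several decimals
```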

Since $|\Omega| = 2^m$, the number of functions in our class is $2^m$. It turns out we do not need to consider all functions $f_w$, $w \in \Omega$, but only a select few. Using all the functions leads to a looser lower bound of the form $n^{-1}$, which corresponds to the parametric rate. The problem under consideration is non-parametric, and hence we expect a slower rate of convergence. To get a tighter lower bound, the following result is of use:

Lemma (Varshamov–Gilbert, 1962)

Let $m \ge 8$. There exists a subset $\{w^{(0)}, \ldots, w^{(M)}\}$ of $\Omega$ such that $w^{(0)} = (0, 0, \ldots, 0)$,

$$\rho(w^{(j)}, w^{(k)}) \ge \frac{m}{8} \quad \forall\; 0 \le j < k \le M, \qquad \text{and} \qquad M \ge 2^{m/8}.$$
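The lemma is purely combinatorial, and for moderate $m$ it is easy to exhibit such well-separated sets directly. The randomized greedy search below is only an illustration of the statement, not the usual proof; the value of $m$, the number of trials, and the stopping rule are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 32                                       # any m >= 8
threshold = m // 8                           # required pairwise Hamming distance m/8
guarantee = 2 ** (m // 8)                    # the lemma promises at least this many vectors
codebook = [np.zeros(m, dtype=int)]          # start from w^(0) = (0, ..., 0)

# Keep a random binary vector whenever it is at Hamming distance >= m/8
# from every vector kept so far.
for _ in range(10_000):
    w = rng.integers(0, 2, m)
    if all(np.sum(np.abs(w - v)) >= threshold for v in codebook):
        codebook.append(w)
    if len(codebook) - 1 >= 10 * guarantee:  # stop once we comfortably beat the guarantee
        break

print(f"found M = {len(codebook) - 1} vectors; lemma guarantees M >= {guarantee}")
```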

What this lemma says is that there are many (at least $2^{m/8}$) sequences in $\Omega$ that are far apart from one another (i.e. $\rho(w^{(j)}, w^{(k)}) \ge m/8$). We are going to use the lemma to construct a useful set of hypotheses. Let $\{w^{(0)}, \ldots, w^{(M)}\}$ be the set of sequences in the lemma and define

$$f_j \triangleq f_{w^{(j)}}, \quad j \in \{0, \ldots, M\}.$$

We now need to look at the conditions of Theorem 2 and choose m appropriately.

First note that for $j \ne k$,

$$d(f_j, f_k) = \sqrt{\rho(w^{(j)}, w^{(k)})}\, \frac{L}{\sqrt{12}}\, h^{3/2} \ge \sqrt{\frac{m}{8}}\, \frac{L}{\sqrt{12}}\, m^{-3/2} = \frac{L}{4\sqrt{6}}\, m^{-1}.$$

Now let $P_j \triangleq P^{(j)}_{Y_1, \ldots, Y_n}$, $j \in \{0, \ldots, M\}$, denote the joint distribution of $Y_1, \ldots, Y_n$ when the true regression function is $f_j$. Then

$$K(P_j \| P_0) = E_j\!\left[ \log \frac{p^{(j)}_{Y_1, \ldots, Y_n}}{p^{(0)}_{Y_1, \ldots, Y_n}} \right] = \sum_{i=1}^n E_j\!\left[ \log \frac{p^{(j)}_{Y_i}}{p^{(0)}_{Y_i}} \right] = \frac{1}{2\sigma^2} \sum_{i=1}^n f_j^2(x_i) \le \frac{1}{2\sigma^2} \sum_{i=1}^n \left( \frac{Lh}{2} \right)^2 = \frac{L^2}{8\sigma^2}\, n h^2 = \frac{L^2}{8\sigma^2}\, n\, m^{-2}.$$
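The second equality uses independence across $i$, and the third uses the fact that the KL divergence between $N(\mu, \sigma^2)$ and $N(0, \sigma^2)$ is $\mu^2/(2\sigma^2)$, together with $f_0 \equiv 0$ (recall $w^{(0)}$ is the all-zeros vector). A quick Monte Carlo check of that Gaussian identity (a sketch with an arbitrary mean shift and noise level):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.7, 1.3                           # arbitrary mean shift and noise level
y = rng.normal(mu, sigma, size=1_000_000)      # samples from N(mu, sigma^2)

# log-likelihood ratio  log[ N(y; mu, sigma^2) / N(y; 0, sigma^2) ]
llr = (y ** 2 - (y - mu) ** 2) / (2 * sigma ** 2)
print(llr.mean(), mu ** 2 / (2 * sigma ** 2))  # Monte Carlo KL vs. closed form mu^2/(2 sigma^2)
```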

Now notice that $\log M \ge \frac{m}{8} \log 2$ (from the Varshamov–Gilbert lemma). We want to choose $m$ such that

$$\frac{1}{M} \sum_{j=1}^M K(P_j \| P_0) \le \frac{L^2}{8\sigma^2}\, n\, m^{-2} < \alpha\, \frac{m}{8} \log 2 \le \alpha \log M.$$

This gives

$$m > \left( \frac{L^2}{\alpha \sigma^2 \log 2} \right)^{1/3} n^{1/3} := C_0\, n^{1/3},$$

so take $m = \lceil C_0\, n^{1/3} \rceil + 1$.
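Concretely, for illustrative values of $L$, $\sigma$, and $\alpha$ (all arbitrary here, subject to $\alpha < 1/8$), this choice of $m$ indeed keeps the average KL divergence below $\alpha \log M$:

```python
import math

L, sigma, alpha = 1.0, 0.5, 1 / 16             # illustrative values; alpha must be < 1/8
C0 = (L ** 2 / (alpha * sigma ** 2 * math.log(2))) ** (1 / 3)

for n in [100, 10_000, 1_000_000]:
    m = math.ceil(C0 * n ** (1 / 3)) + 1
    kl_bound = L ** 2 / (8 * sigma ** 2) * n / m ** 2   # bound on (1/M) sum_j K(P_j || P_0)
    budget = alpha * (m / 8) * math.log(2)              # lower bound on alpha * log M
    print(f"n={n:8d}  m={m:4d}  KL bound={kl_bound:8.4f}  alpha*(m/8)*log2={budget:8.4f}")
```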

With this choice of $m$,

$$d(f_j, f_k) \ge \frac{L}{4\sqrt{6}}\, m^{-1} \ge 2\, \mathrm{const}\; n^{-1/3} \quad \text{for } n \ge n_0(\mathrm{const}),$$

so both conditions of Theorem 2 hold with $s_n = \mathrm{const}\; n^{-1/3}$.

Therefore,

$$\inf_{\hat f_n} \sup_{f \in \mathcal{F}} P_f\!\left( \| \hat f_n - f \| \ge \mathrm{const}\; n^{-1/3} \right) \ge c > 0,$$

or,

$$\inf_{\hat f_n} \sup_{f \in \mathcal{F}} P_f\!\left( \| \hat f_n - f \|^2 \ge \mathrm{const}^2\, n^{-2/3} \right) \ge c > 0,$$

or, after applying Markov's inequality,

$$\inf_{\hat f_n} \sup_{f \in \mathcal{F}} E_f\!\left[ \| \hat f_n - f \|^2 \right] \ge c \cdot \mathrm{const}^2\, n^{-2/3}.$$

Therefore, the estimator constructed in class attains the optimal rate of convergence.

Source:  OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3