The VC inequality is a powerful generalization of the bounds we obtained for the hyperplane classifier in the previous lecture. The basic idea of the proof is quite similar. Before stating the inequality, we need to introduce the concepts of shatter coefficients and VC dimension.
Let $\mathcal{A}$ be a collection of subsets of $\mathbb{R}^d$.

Definition (Shatter coefficient): The $n$th shatter coefficient of $\mathcal{A}$ is defined by
$$S_{\mathcal{A}}(n) \;=\; \max_{z_1,\dots,z_n \in \mathbb{R}^d} \left|\left\{ \{z_1,\dots,z_n\} \cap A \;:\; A \in \mathcal{A} \right\}\right| .$$
The shatter coefficients are a measure of the richness of the collection $\mathcal{A}$. $S_{\mathcal{A}}(n)$ is the largest number of different subsets of a set of $n$ points that can be generated by intersecting the set with elements of $\mathcal{A}$.
In 1-d, let $\mathcal{A} = \{(-\infty, t] : t \in \mathbb{R}\}$ and consider two points $z_1 < z_2$. The possible subsets of $\{z_1, z_2\}$ generated by intersecting with sets of the form $(-\infty, t]$ are $\emptyset$, $\{z_1\}$, and $\{z_1, z_2\}$. Hence $S_{\mathcal{A}}(2) = 3 < 2^2$.
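The following is a small sketch, added here for illustration only, that brute-forces this count for the half-line class above; the helper name `halfline_subsets` is made up for this example.

```python
# Illustrative sketch: enumerate the distinct subsets of a finite point set
# that half-lines (-inf, t] can pick out, to check S_A(2) = 3 for the class above.

def halfline_subsets(points):
    """Return the distinct subsets {z in points : z <= t} over all thresholds t."""
    subsets = {frozenset()}          # a threshold below every point gives the empty set
    for t in sorted(points):         # only thresholds at the points themselves matter
        subsets.add(frozenset(z for z in points if z <= t))
    return subsets

print(len(halfline_subsets([0.0, 1.0])))        # 3, i.e. S_A(2) = 3 < 2^2
print(len(halfline_subsets([0.0, 1.0, 2.0])))   # 4, i.e. S_A(3) = 4 < 2^3
```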
In 2-d, let $\mathcal{A}$ = all rectangles in $\mathbb{R}^2$.
Consider a set of $n = 4$ training points. If we arrange the four points at the corners of a diamond shape, it is easy to see that we can find a rectangle in $\mathcal{A}$ covering any subset of the four points, as in the picture above, i.e., $S_{\mathcal{A}}(4) = 2^4 = 16$.
Clearly, $S_{\mathcal{A}}(n) = 2^n$ for $n = 1, 2, 3$ as well.
However, $S_{\mathcal{A}}(n) < 2^n$ for $n \ge 5$. This is because, given any five points, we can always select four of them such that the rectangle that just contains those four also contains the fifth point. Consequently, we cannot find a rectangle classifier that contains the four outer points but does not contain the inner point, as shown above.
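Below is a rough numerical check, added for illustration and assuming axis-aligned rectangles as in the argument above; the function name `rectangle_shatter_count` is hypothetical.

```python
# Brute-force the number of subsets of a point set obtainable as
# {rectangle} ∩ {points}, assuming axis-aligned rectangles: a subset S can be
# picked out exactly when the bounding box of S contains no point outside S.
from itertools import combinations

def rectangle_shatter_count(points):
    count = 1                                   # the empty set is always obtainable
    for k in range(1, len(points) + 1):
        for subset in combinations(points, k):
            xs = [p[0] for p in subset]
            ys = [p[1] for p in subset]
            in_box = lambda p: min(xs) <= p[0] <= max(xs) and min(ys) <= p[1] <= max(ys)
            if all(not in_box(p) for p in points if p not in subset):
                count += 1
    return count

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
print(rectangle_shatter_count(diamond))              # 16 = 2^4: the diamond is shattered
print(rectangle_shatter_count(diamond + [(0, 0)]))   # fewer than 2^5: five points are not
```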
Note that $S_{\mathcal{A}}(n) \le 2^n$ for any $n$ and any collection $\mathcal{A}$.
If $S_{\mathcal{A}}(n) = 2^n$, then we say that $\mathcal{A}$ shatters sets of size $n$.
The VC dimension $V_{\mathcal{A}}$ of $\mathcal{A}$ is the largest integer $n$ for which $S_{\mathcal{A}}(n) = 2^n$. Let $\mathcal{A}$ be a collection of sets with VC dimension $V_{\mathcal{A}}$. Then
$$S_{\mathcal{A}}(n) \;\le\; \sum_{i=0}^{V_{\mathcal{A}}} \binom{n}{i},$$
and also $S_{\mathcal{A}}(n) \le (n+1)^{V_{\mathcal{A}}}$ for all $n$.
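As a small worked instance (added here for illustration, not part of the original notes): the axis-aligned rectangle class above has VC dimension $V_{\mathcal{A}} = 4$, so at $n = 10$ the two bounds give the following.

```latex
% Worked instance of the shatter-coefficient bounds for V_A = 4 and n = 10.
\[
  S_{\mathcal{A}}(10) \;\le\; \sum_{i=0}^{4}\binom{10}{i}
  \;=\; 1 + 10 + 45 + 120 + 210 \;=\; 386
  \;<\; 2^{10} = 1024,
\]
\[
  \text{while the cruder polynomial bound gives } S_{\mathcal{A}}(10) \;\le\; (10+1)^{4} = 14641 .
\]
```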
Let $\mathcal{F}$ be a collection of classifiers of the form $f : \mathbb{R}^d \to \{0,1\}$. Define
$$\mathcal{A} \;=\; \bigl\{\, \{(x, y) \in \mathbb{R}^d \times \{0,1\} : f(x) \ne y\} \;:\; f \in \mathcal{F} \,\bigr\}.$$
In words, this is the collection of subsets of $\mathbb{R}^d \times \{0,1\}$ on which a classifier in $\mathcal{F}$ maps the features to a label opposite of $y$. The size of $\mathcal{A}$ expresses the richness of $\mathcal{F}$: the larger $\mathcal{F}$ is, the more likely it is that there exists an $f \in \mathcal{F}$ for which the risk $R(f)$ is close to the Bayes risk $R(f^*)$, where $f^*$ is the Bayes classifier. The shatter coefficient of $\mathcal{F}$ is defined as $S_{\mathcal{F}}(n) = S_{\mathcal{A}}(n)$, and the VC dimension of $\mathcal{F}$ is defined as $V_{\mathcal{F}} = V_{\mathcal{A}}$.
Example: linear (hyperplane) classifiers in $\mathbb{R}^d$.
Consider $d = 2$ and let $n$ be the number of training points. When $n = 1$, by using linear classifiers in $\mathbb{R}^2$ we can assign 1 to the single point or 0 to it; that is, we can assign 1 to every possible subset and 0 to its complement. Hence $S_{\mathcal{F}}(1) = 2 = 2^1$.
When $n = 2$, we can likewise assign 1 to every possible subset of the two points and 0 to its complement, and vice versa. Hence $S_{\mathcal{F}}(2) = 4 = 2^2$.
When $n = 3$, we can arrange the points (non-collinear) so that the set of linear classifiers shatters the three points, hence $S_{\mathcal{F}}(3) = 8 = 2^3$.
When $n = 4$, no matter where the points are placed and what binary values are designated, the set of linear classifiers does not shatter the four points. To see this, first observe that the four points form a 4-gon (if all four points, or any three of them, are collinear, then clearly linear classifiers cannot shatter the points). The two points belonging to each diagonal form two groups, and no linear classifier can assign one value to both points of one group and the other value to both points of the other group. Hence $S_{\mathcal{F}}(4) < 2^4$ and $V_{\mathcal{F}} = 3$.
We state here, without proof, that in general the class of linear classifiers in $\mathbb{R}^d$ has VC dimension $V_{\mathcal{F}} = d + 1$.
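Here is a numerical sanity check for the $d = 2$ case, added purely for illustration; it assumes NumPy and SciPy are available, and the helpers `separable` and `shattered` are made up for this sketch.

```python
# Check whether every +/-1 labeling of a small point set in R^2 can be realized
# by a hyperplane classifier sign(w.x + b), via a feasibility linear program.
from itertools import product
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """True if some (w, b) satisfies y_i (w.x_i + b) >= 1 for all i (labels in {-1, +1})."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Variables are (w_1, w_2, b); constraints: -y_i (x_i . w + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(points):
    """True if every labeling of the points is realized by some linear classifier."""
    return all(separable(points, labels)
               for labels in product([-1.0, 1.0], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # True: 3 points in general position
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))  # False: the "diagonal" labeling fails
```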
Let $Z_1, \dots, Z_n$ be i.i.d. $\mathbb{R}^d$-valued random variables. Denote the common distribution of the $Z_i$ by
$$P(A) = \Pr(Z_i \in A)$$
for any subset $A \subseteq \mathbb{R}^d$. Similarly, define the empirical distribution
$$P_n(A) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{Z_i \in A\}}.$$
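A minimal sketch of the empirical measure, added for illustration only (the helper `empirical_measure` and the chosen set $A$ are made up here): $P_n(A)$ is simply the fraction of the sample falling in $A$.

```python
# P_n(A) = (1/n) * sum_i 1{Z_i in A}, with the set A represented by a membership test.
import random

def empirical_measure(sample, indicator):
    """Fraction of the sample points that belong to the set defined by `indicator`."""
    return sum(indicator(z) for z in sample) / len(sample)

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]   # i.i.d. draws of Z_i
A = lambda z: z <= 0.0                                    # the set A = (-inf, 0]
print(empirical_measure(sample, A))                       # approximately P(Z <= 0) = 0.5
```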
Theorem: For any probability measure $P$, any collection of subsets $\mathcal{A}$, and any $\epsilon > 0$,
$$\Pr\Bigl(\,\sup_{A \in \mathcal{A}} |P_n(A) - P(A)| > \epsilon\Bigr) \;\le\; 8\, S_{\mathcal{A}}(n)\, e^{-n\epsilon^2/32},$$
and
$$E\Bigl[\,\sup_{A \in \mathcal{A}} |P_n(A) - P(A)|\Bigr] \;\le\; 2\sqrt{\frac{\log S_{\mathcal{A}}(n) + \log 2}{n}}.$$
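A back-of-the-envelope sketch, added for illustration only: plugging the polynomial bound $S_{\mathcal{A}}(n) \le (n+1)^{V_{\mathcal{A}}}$ into the probability bound above shows how it behaves with $n$ (the function name `vc_probability_bound` is hypothetical).

```python
# Evaluate the deviation-probability bound 8 * S_A(n) * exp(-n * eps^2 / 32),
# using S_A(n) <= (n+1)^V. Note the bound is vacuous (> 1) unless n is large.
import math

def vc_probability_bound(n, vc_dim, eps):
    """Upper bound on P( sup_A |P_n(A) - P(A)| > eps )."""
    return 8 * (n + 1) ** vc_dim * math.exp(-n * eps ** 2 / 32)

for n in (10_000, 100_000, 1_000_000):
    print(n, vc_probability_bound(n, vc_dim=3, eps=0.05))
```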
Before giving a proof of the theorem, we present a corollary.
Corollary: Let $\mathcal{F}$ be a collection of classifiers of the form $f : \mathbb{R}^d \to \{0,1\}$ with VC dimension $V_{\mathcal{F}}$. Let
$$\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{f(X_i) \ne Y_i\}} \quad \text{and} \quad R(f) = \Pr\bigl(f(X) \ne Y\bigr),$$
where $(X_i, Y_i)$, $i = 1, \dots, n$, are i.i.d. with joint distribution $P_{XY}$.
Define
$$\hat{f}_n \;=\; \arg\min_{f \in \mathcal{F}} \hat{R}_n(f).$$
Then
$$E\bigl[R(\hat{f}_n)\bigr] - \inf_{f \in \mathcal{F}} R(f) \;\le\; 4\sqrt{\frac{V_{\mathcal{F}} \log(n+1) + \log 2}{n}}.$$
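For a feel of the numbers (a worked instance added here for illustration, not part of the original notes), take linear classifiers in $\mathbb{R}^2$, so $V_{\mathcal{F}} = 3$, and $n = 10000$ training points:

```latex
% Numeric instance of the corollary bound with V_F = 3 and n = 10000.
\[
  E\bigl[R(\hat{f}_n)\bigr] - \inf_{f\in\mathcal{F}} R(f)
  \;\le\; 4\sqrt{\frac{3\log(10001) + \log 2}{10000}}
  \;\approx\; 4\sqrt{\frac{27.6 + 0.7}{10000}}
  \;\approx\; 0.21 .
\]
```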
Proof: Let $A_f = \{(x, y) : f(x) \ne y\}$ and $\mathcal{A} = \{A_f : f \in \mathcal{F}\}$.
Note that
$$\hat{R}_n(f) \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{f(X_i) \ne Y_i\}} \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{\{Z_i \in A_f\}} \;=\; P_n(A_f),$$
where $Z_i = (X_i, Y_i)$.
Similarly, $R(f) = \Pr\bigl(f(X) \ne Y\bigr) = P(A_f)$.
Therefore, according to the VC theorem,
$$E\Bigl[\,\sup_{f \in \mathcal{F}} \bigl|\hat{R}_n(f) - R(f)\bigr|\Bigr] \;=\; E\Bigl[\,\sup_{A \in \mathcal{A}} |P_n(A) - P(A)|\Bigr] \;\le\; 2\sqrt{\frac{\log S_{\mathcal{A}}(n) + \log 2}{n}}.$$
Since $S_{\mathcal{F}}(n) = S_{\mathcal{A}}(n)$ and $S_{\mathcal{F}}(n) \le (n+1)^{V_{\mathcal{F}}}$, it follows that
$$E\Bigl[\,\sup_{f \in \mathcal{F}} \bigl|\hat{R}_n(f) - R(f)\bigr|\Bigr] \;\le\; 2\sqrt{\frac{V_{\mathcal{F}} \log(n+1) + \log 2}{n}}.$$
Next, note that
$$E\bigl[R(\hat{f}_n)\bigr] - \inf_{f \in \mathcal{F}} R(f) \;\le\; 2\, E\Bigl[\,\sup_{f \in \mathcal{F}} \bigl|\hat{R}_n(f) - R(f)\bigr|\Bigr],$$
the standard bound relating the excess risk of empirical risk minimization to the uniform deviation between empirical and true risks.
Therefore,
$$E\bigl[R(\hat{f}_n)\bigr] - \inf_{f \in \mathcal{F}} R(f) \;\le\; 4\sqrt{\frac{V_{\mathcal{F}} \log(n+1) + \log 2}{n}}.$$