If $M \to \infty$ and $n/M \to \infty$ as $n \to \infty$, then the histogram classifier risk converges to the Bayes risk, $E[R(\hat{f}_n)] \to R^*$, for every distribution whose marginal density is bounded away from zero, $p_X(x) \ge c > 0$ for some constant $c$. Actually, the result holds for every distribution of $(X, Y)$. For the more general theorem, refer to the consistency theorem for histogram rules in A Probabilistic Theory of Pattern Recognition by Luc Devroye, László Györfi and Gábor Lugosi.
What the theorem tells us is that the number of partition cells must tend to infinity (to ensure that the bias, i.e., the approximation error, tends to zero), but it must grow more slowly than the number of samples (i.e., we want the number of samples per cell to tend to infinity, which drives the variance, i.e., the estimation error, to zero).
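To make the object of study concrete, here is a minimal sketch (not part of the original notes) of a histogram classifier on $[0,1]^d$ with $m$ bins per dimension, so $M = m^d$ cells; the function and variable names are illustrative choices only.

```python
import numpy as np

def fit_histogram_classifier(X, Y, m):
    """Fit a histogram classifier on [0,1]^d with m bins per dimension (M = m^d cells).

    The cell estimate eta_hat is the fraction of label-1 training points in the cell
    (0 if the cell is empty); the classifier predicts 1 wherever eta_hat >= 1/2.
    """
    n, d = X.shape
    # Index of the cell containing each training point.
    idx = np.minimum((X * m).astype(int), m - 1)
    flat = np.ravel_multi_index(idx.T, (m,) * d)
    N = np.bincount(flat, minlength=m ** d)                # N_j: samples per cell
    K = np.bincount(flat, weights=Y, minlength=m ** d)     # K_j: label-1 samples per cell
    eta_hat = np.divide(K, N, out=np.zeros(m ** d), where=N > 0)

    def predict(X_new):
        idx_new = np.minimum((X_new * m).astype(int), m - 1)
        return (eta_hat[np.ravel_multi_index(idx_new.T, (m,) * d)] >= 0.5).astype(int)

    return predict
```

Choosing $m$ so that $M = m^d \to \infty$ while $n/M \to \infty$ (for example, $m \approx n^{1/(2d)}$, so $M \approx \sqrt{n}$) is exactly the regime the theorem describes.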
Recall that for a plug-in classifier $\hat{f}_n(x) = \mathbf{1}\{\hat{\eta}_n(x) \ge 1/2\}$ the excess risk satisfies $E[R(\hat{f}_n)] - R^* \le 2\,E[\,|\hat{\eta}_n(X) - \eta(X)|\,]$, so it suffices to show that $E[\,|\hat{\eta}_n(X) - \eta(X)|\,] \to 0$. Let $P_j = \int_{Q_j} p_X(t)\,dt$ denote the probability that $X$ falls in cell $Q_j$ (the theoretical analog of the empirical fraction of training samples falling in $Q_j$) and define
$$\bar{\eta}(x) \;=\; \frac{\int_{Q_j} \eta(t)\, p_X(t)\, dt}{P_j} \quad \text{for } x \in Q_j,\ \ j = 1, \dots, M,$$
where $\eta(x) = P(Y = 1 \mid X = x)$. The function $\bar{\eta}$ is the theoretical analog of $\hat{\eta}_n$ (i.e., the function obtained by averaging $\eta$ over the partition cells). By the triangle inequality,
$$E\big[\,|\hat{\eta}_n(X) - \eta(X)|\,\big] \;\le\; \underbrace{E\big[\,|\hat{\eta}_n(X) - \bar{\eta}(X)|\,\big]}_{\text{estimation error}} \;+\; \underbrace{E\big[\,|\bar{\eta}(X) - \eta(X)|\,\big]}_{\text{approximation error}}.$$
Let's first bound the estimation error. For any $x$, let $Q(x)$ denote the histogram bin in which $x$ falls. Define the random variable
$$N(x) \;=\; \sum_{i=1}^{n} \mathbf{1}\{X_i \in Q(x)\},$$
the number of training samples that fall in $Q(x)$.
If $x \in Q_j$, then this random variable is simply $N_j$, the number of training samples in cell $Q_j$. Note that
$$\hat{\eta}_n(x) \;=\; \frac{K(x)}{N(x)} \qquad (\text{with } \hat{\eta}_n(x) := 0 \text{ if } N(x) = 0),$$
where $K(x) = \sum_{i=1}^{n} Y_i \mathbf{1}\{X_i \in Q(x)\}$ is simply the number of samples in cell $Q(x)$ labelled 1. Now $\hat{\eta}_n(x)$ is a fairly complicated random variable, but the conditional distribution of $K(x)$ given $N(x)$ is relatively simple. Note that
$$K(x) \mid N(x) \;\sim\; \mathrm{Binomial}\big(N(x),\, \bar{\eta}(x)\big),$$
since $\bar{\eta}(x)$ is the probability that a sample falling in $Q(x)$ has label 1, and we are conditioning on the event of observing $N(x)$ samples in $Q(x)$.
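This conditional binomial fact is easy to check empirically. The sketch below (illustrative only; the one-dimensional $\eta$, the uniform marginal, and the cell boundaries are arbitrary choices) draws many samples, restricts attention to one cell, and compares the label frequency in that cell with the cell average $\bar{\eta}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 200_000, 10                      # samples and number of cells on [0,1]
eta = lambda x: 0.2 + 0.6 * x           # a smooth choice of eta(x) = P(Y=1 | X=x)

X = rng.uniform(size=n)                 # X uniform on [0,1], so p_X = 1
Y = (rng.uniform(size=n) < eta(X)).astype(int)

j = 3                                   # look at cell Q_j = [j/M, (j+1)/M)
in_cell = (X >= j / M) & (X < (j + 1) / M)
N_j, K_j = in_cell.sum(), Y[in_cell].sum()

# Cell average of eta under p_X = 1: bar_eta = M * integral of eta over Q_j.
a, b = j / M, (j + 1) / M
bar_eta = M * (0.2 * (b - a) + 0.3 * (b ** 2 - a ** 2))

print(f"empirical label-1 fraction in cell: {K_j / N_j:.4f}")
print(f"cell average bar_eta:               {bar_eta:.4f}")
```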
Now consider the conditional expectation
$$E\big[\,|\hat{\eta}_n(x) - \bar{\eta}(x)|\;\big|\; N(x)\,\big].$$
Next note that
$$E\big[\,|\hat{\eta}_n(x) - \bar{\eta}(x)|\;\big|\; N(x)\,\big] \;\le\; \sqrt{E\big[\,(\hat{\eta}_n(x) - \bar{\eta}(x))^2\;\big|\; N(x)\,\big]}$$
by Jensen's inequality, $E[|Z|] \le \sqrt{E[Z^2]}$. Therefore, using the conditional binomial distribution of $K(x)$,
$$E\big[\,(\hat{\eta}_n(x) - \bar{\eta}(x))^2\;\big|\; N(x)\,\big] \;=\; \frac{\bar{\eta}(x)\,(1-\bar{\eta}(x))}{N(x)} \quad \text{when } N(x) > 0,$$
and
$$\bar{\eta}(x)\,(1-\bar{\eta}(x)) \;\le\; \tfrac{1}{4},$$
or in other words (recalling that $\hat{\eta}_n(x) = 0$ when $N(x) = 0$, so the error is at most 1 in that case),
$$E\big[\,|\hat{\eta}_n(x) - \bar{\eta}(x)|\;\big|\; N(x)\,\big] \;\le\; \frac{1}{2\sqrt{N(x)}}\,\mathbf{1}\{N(x) > 0\} \;+\; \mathbf{1}\{N(x) = 0\}.$$
Now taking expectation with respect to $N(x)$,
$$E\big[\,|\hat{\eta}_n(x) - \bar{\eta}(x)|\,\big] \;\le\; E\!\left[\frac{1}{2\sqrt{N(x)}}\,\mathbf{1}\{N(x) > 0\}\right] \;+\; P\big(N(x) = 0\big).$$
Now a key fact is that, for every $x$, $n\,P(X \in Q(x)) \to \infty$ as $n \to \infty$. This follows from the assumption that the marginal density satisfies $p_X(x) \ge c > 0$, for some constant $c$, and $n/M \to \infty$ as $n \to \infty$. This result is easily verified by contradiction: if $n\,P(X \in Q(x))$ remained bounded for some $x$, then, since each cell has volume $1/M$ and hence $P(X \in Q(x)) \ge c/M$, the ratio $n/M$ would remain bounded as well, and $n/M \to \infty$ is contradicted. Since $N(x) \sim \mathrm{Binomial}\big(n, P(X \in Q(x))\big) $ and its mean tends to infinity uniformly in $x$, for any $\epsilon > 0$ there exists an $n_\epsilon$ such that
$$E\!\left[\frac{1}{2\sqrt{N(x)}}\,\mathbf{1}\{N(x) > 0\}\right] \;\le\; \frac{\epsilon}{2} \quad \text{and} \quad P\big(N(x) = 0\big) \;\le\; \frac{\epsilon}{2}$$
for sufficiently large $n \ge n_\epsilon$. Therefore, for sufficiently large $n$ and every $x$,
$$E\big[\,|\hat{\eta}_n(x) - \bar{\eta}(x)|\,\big] \;\le\; \epsilon,$$
where the expectation is with respect to the distribution of the sample $\{X_i, Y_i\}_{i=1}^{n}$. Thus,
$$E\big[\,|\hat{\eta}_n(X) - \bar{\eta}(X)|\,\big] \;\le\; \epsilon,$$
where the expectation is now with respect to the distribution of the sample and the marginal distribution of $X$.
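The $1/(2\sqrt{N(x)})$ behaviour of the estimation error is easy to see numerically. The following sketch (an illustration, not part of the notes; the cell probability and the value of $\bar{\eta}$ are arbitrary) simulates a single cell with $N \sim \mathrm{Binomial}(n, p)$ and $K \mid N \sim \mathrm{Binomial}(N, \bar{\eta})$, and reports the average of $|\hat{\eta} - \bar{\eta}|$ alongside the bound derived above.

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 100_000
bar_eta, p = 0.3, 0.01                  # cell average of eta and cell probability P(X in Q)

for n in [1_000, 10_000, 100_000]:      # n*p grows, mimicking n/M -> infinity
    N = rng.binomial(n, p, size=trials)             # samples landing in the cell
    K = rng.binomial(N, bar_eta)                    # label-1 samples among them
    eta_hat = np.divide(K, N, out=np.zeros(trials), where=N > 0)
    err = np.abs(eta_hat - bar_eta).mean()          # Monte Carlo estimate of E|eta_hat - bar_eta|
    bound = np.where(N > 0, 0.5 / np.sqrt(np.maximum(N, 1)), 1.0).mean()
    print(f"n={n:>6}  E|eta_hat - bar_eta| ~ {err:.4f}   bound ~ {bound:.4f}")
```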
Next consider the approximation error $E\big[\,|\bar{\eta}(X) - \eta(X)|\,\big]$, where the expectation is over $X$ alone. The function $\eta$ may not itself be continuous, but there is another function $\tilde{\eta}$ that is uniformly continuous and such that $E\big[\,|\eta(X) - \tilde{\eta}(X)|\,\big] \le \epsilon$ (continuous functions are dense in $L_1(P_X)$). Recall that uniformly continuous functions can be well approximated by piecewise-constant functions. By the triangle inequality,
$$E\big[\,|\bar{\eta}(X) - \eta(X)|\,\big] \;\le\; E\big[\,|\bar{\eta}(X) - \bar{\tilde{\eta}}(X)|\,\big] \;+\; E\big[\,|\bar{\tilde{\eta}}(X) - \tilde{\eta}(X)|\,\big] \;+\; E\big[\,|\tilde{\eta}(X) - \eta(X)|\,\big],$$
where $\bar{\tilde{\eta}}$ is the function obtained by averaging $\tilde{\eta}$ over the partition cells, just as $\bar{\eta}$ is obtained from $\eta$. The first term satisfies $E\big[\,|\bar{\eta}(X) - \bar{\tilde{\eta}}(X)|\,\big] \le E\big[\,|\eta(X) - \tilde{\eta}(X)|\,\big] \le \epsilon$, since averaging over cells can only decrease the $L_1(P_X)$ distance, and the last term is at most $\epsilon$ by the choice of $\tilde{\eta}$; and since $\tilde{\eta}$ is uniformly continuous,
$$E\big[\,|\bar{\tilde{\eta}}(X) - \tilde{\eta}(X)|\,\big] \;\le\; \sup_{x}\; \sup_{t \in Q(x)} \big|\tilde{\eta}(t) - \tilde{\eta}(x)\big|,$$
which tends to zero as the diameter of the partition cells tends to zero. By taking $M$ sufficiently large, this term can be made arbitrarily small. So for large $M$,
$$E\big[\,|\bar{\eta}(X) - \eta(X)|\,\big] \;\le\; 3\epsilon.$$
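A quick numerical illustration of this last step (not from the notes; the choice of $\tilde{\eta}$ and the uniform marginal are arbitrary): averaging a uniformly continuous function over finer and finer cells drives the $L_1$ error to zero.

```python
import numpy as np

# A uniformly continuous "tilde eta" on [0,1]; X taken uniform so E|.| is a plain integral.
tilde_eta = lambda x: 0.5 + 0.4 * np.sin(2 * np.pi * x)

x = np.linspace(0.0, 1.0, 1_000_001)     # fine grid for numerical integration
vals = tilde_eta(x)

for M in [4, 16, 64, 256]:
    cells = np.minimum((x * M).astype(int), M - 1)
    # Cell averages of tilde_eta (the piecewise-constant approximation).
    cell_means = np.bincount(cells, weights=vals) / np.bincount(cells)
    err = np.abs(vals - cell_means[cells]).mean()    # approximates integral |tilde_eta - bar_tilde_eta|
    print(f"M={M:>4}   L1 error ~ {err:.5f}")
```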
Thus, we have shown
$$E\big[\,|\hat{\eta}_n(X) - \eta(X)|\,\big] \;\le\; 4\epsilon$$
for sufficiently large $n$. Since $\epsilon > 0$ was arbitrary, and $E[R(\hat{f}_n)] - R^* \le 2\,E\big[\,|\hat{\eta}_n(X) - \eta(X)|\,\big]$, we have shown that taking $M = M_n$ satisfies
$$\lim_{n \to \infty} E[R(\hat{f}_n)] \;=\; R^*$$
if
$$M_n \to \infty \quad \text{and} \quad \frac{n}{M_n} \to \infty \quad \text{as } n \to \infty.$$
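For instance, $M_n = \lceil \sqrt{n}\, \rceil$ satisfies both conditions. The following self-contained sketch (a toy illustration with an arbitrary one-dimensional $\eta$; not from the notes) estimates the risk of the histogram classifier with this choice and compares it with the Bayes risk.

```python
import numpy as np

rng = np.random.default_rng(2)
eta = lambda x: 0.2 + 0.6 * x                    # P(Y=1 | X=x); Bayes rule predicts 1 iff eta(x) >= 1/2

def sample(n):
    X = rng.uniform(size=n)                      # X uniform on [0,1]
    Y = (rng.uniform(size=n) < eta(X)).astype(int)
    return X, Y

xs = np.linspace(0.0, 1.0, 100_001)
bayes_risk = np.mean(np.minimum(eta(xs), 1 - eta(xs)))   # E[min(eta, 1 - eta)] under uniform X

for n in [100, 1_000, 10_000, 100_000]:
    M = int(np.ceil(np.sqrt(n)))                 # M -> infinity and n/M -> infinity
    X, Y = sample(n)
    cells = np.minimum((X * M).astype(int), M - 1)
    N = np.bincount(cells, minlength=M)
    K = np.bincount(cells, weights=Y, minlength=M)
    eta_hat = np.divide(K, N, out=np.zeros(M), where=N > 0)

    Xt, Yt = sample(200_000)                     # held-out test set to estimate the risk
    pred = (eta_hat[np.minimum((Xt * M).astype(int), M - 1)] >= 0.5).astype(int)
    risk = np.mean(pred != Yt)
    print(f"n={n:>6}  M={M:>4}  risk ~ {risk:.4f}  (Bayes risk ~ {bayes_risk:.4f})")
```

The estimated risk should drift toward the Bayes risk as $n$ grows, in line with the theorem.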