
Error bound

Suppose that we have a pool of candidate functions $F$, and we want to select a function $f$ from $F$ using the training data. Our usual approach is to show that the distribution of the empirical risk $\hat{R}_n(f)$ concentrates about its mean as $n$ grows. First, we assign a complexity $c(f) > 0$ to each $f \in F$ so that $\sum_{f \in F} 2^{-c(f)} \le 1$ (the Kraft inequality). Then we apply the union bound to get a uniform concentration inequality holding for all models in $F$. Finally, we use this concentration inequality to bound the expected risk of our selected model.
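As a concrete illustration of the complexity assignment, the following sketch (the candidate pool and the uniform choice $c(f) = \log_2 |F|$ are illustrative, not from the text) verifies that the Kraft-style condition $\sum_{f \in F} 2^{-c(f)} \le 1$ holds:

```python
import math

# A finite pool of candidate functions (here: constant predictors).
F = [lambda x, a=a: a for a in (-1.0, -0.5, 0.0, 0.5, 1.0)]

# Uniform complexity assignment: c(f) = log2 |F| satisfies the
# Kraft inequality sum_f 2^{-c(f)} <= 1 (with equality here).
c = {i: math.log2(len(F)) for i in range(len(F))}

kraft_sum = sum(2.0 ** (-c[i]) for i in range(len(F)))
assert kraft_sum <= 1.0 + 1e-12
print(round(kraft_sum, 12))  # 1.0
```

Any prefix code over $F$ gives valid complexities; shorter codewords (smaller $c(f)$) correspond to models we penalize less.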

We will accomplish essentially the same result here, but avoid the need for explicit concentration inequalities by instead making use of information-theoretic bounds.

We would like to select an $f \in F$ so that the excess risk is small:

$$0 \le R(f) - R(f^*) = \frac{1}{n}\,\mathbb{E}\big[\log p_{f^*}(Y) - \log p_f(Y)\big] = \frac{1}{n}\,\mathbb{E}\left[\log \frac{p_{f^*}(Y)}{p_f(Y)}\right] = \frac{1}{n}\,K(p_f, p_{f^*}),$$

where

$$K(p_f, p_{f^*}) = \sum_{i=1}^n \int \log \frac{p_{f^*(x_i)}(y_i)}{p_{f(x_i)}(y_i)}\; p_{f^*(x_i)}(y_i)\, dy_i = \sum_{i=1}^n K\big(p_{f(x_i)}, p_{f^*(x_i)}\big)$$

is again the KL divergence.

Unfortunately, as mentioned before, $K(p_f, p_{f^*})$ is not a true distance. So instead we will focus on the expected squared Hellinger distance as our measure of performance. We will get a bound on

$$\frac{1}{n}\,\mathbb{E}\big[H^2(p_f(Y), p_{f^*}(Y))\big] = \frac{1}{n} \sum_{i=1}^n \int \left(\sqrt{p_{f(x_i)}(y_i)} - \sqrt{p_{f^*(x_i)}(y_i)}\right)^2 dy_i.$$
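For intuition, the squared Hellinger distance between two densities can be computed numerically. The sketch below (parameters are illustrative) integrates $(\sqrt{p} - \sqrt{q})^2$ for two equal-variance Gaussians and checks the result against the known closed form $2\big(1 - e^{-(\mu_1-\mu_2)^2/(8\sigma^2)}\big)$:

```python
import math

def hellinger_sq_gauss(mu1, mu2, sigma, lo=-20.0, hi=20.0, m=100001):
    """Squared Hellinger distance between N(mu1, sigma^2) and N(mu2, sigma^2),
    by trapezoidal integration of (sqrt(p) - sqrt(q))^2 over [lo, hi]."""
    h = (hi - lo) / (m - 1)
    total = 0.0
    for k in range(m):
        y = lo + k * h
        p = math.exp(-(y - mu1) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
        q = math.exp(-(y - mu2) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
        w = h if 0 < k < m - 1 else h / 2  # trapezoidal endpoint weights
        total += w * (math.sqrt(p) - math.sqrt(q)) ** 2
    return total

# Closed form for equal variances: H^2 = 2 (1 - exp(-(mu1 - mu2)^2 / (8 sigma^2))).
mu1, mu2, sigma = 0.0, 1.0, 1.0
closed = 2.0 * (1.0 - math.exp(-(mu1 - mu2) ** 2 / (8 * sigma ** 2)))
h2 = hellinger_sq_gauss(mu1, mu2, sigma)
assert abs(h2 - closed) < 1e-5
```

Note that $H^2$ is always between $0$ and $2$, and $H^2 = 2 - 2A$ where $A$ is the affinity used later in the proof.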

Maximum complexity-regularized likelihood estimation

Theorem

Li and Barron (2000); Kolaczyk and Nowak (2002)

Let $\{x_i, Y_i\}_{i=1}^n$ be a random sample of training data with the $\{Y_i\}$ independent,

$$Y_i \sim p_{f^*(x_i)}(y_i), \quad i = 1, \ldots, n,$$

for some unknown function $f^*$.

Suppose we have a collection of candidate functions $F$, and complexities $c(f) > 0$, $f \in F$, satisfying

$$\sum_{f \in F} 2^{-c(f)} \le 1.$$

Define the complexity-regularized estimator

$$\hat{f}_n = \arg\min_{f \in F} \left\{ -\frac{1}{n} \sum_{i=1}^n \log p_f(Y_i) + \frac{2 c(f) \log 2}{n} \right\}.$$

Then,

$$\frac{1}{n}\,\mathbb{E}\big[H^2(p_{\hat{f}_n}(Y), p_{f^*}(Y))\big] \le -\frac{2}{n}\,\mathbb{E}\big[\log A(p_{\hat{f}_n}(Y), p_{f^*}(Y))\big] \le \min_{f \in F} \left\{ \frac{1}{n} K(p_f, p_{f^*}) + \frac{2 c(f) \log 2}{n} \right\}.$$

Before proving the theorem, let's look at a special case.

Gaussian noise

Suppose $Y_i = f^*(x_i) + W_i$, with $W_i \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$, so that

$$p_{f(x_i)}(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}}.$$

Using results from Example 1, we have

$$-2 \log A\big(p_{\hat{f}_n}(Y), p_{f^*}(Y)\big) = \sum_{i=1}^n -2 \log A\big(p_{\hat{f}_n(x_i)}(Y_i), p_{f^*(x_i)}(Y_i)\big) = \sum_{i=1}^n -2 \log \int \sqrt{p_{\hat{f}_n(x_i)}(y_i)\, p_{f^*(x_i)}(y_i)}\, dy_i = \frac{1}{4\sigma^2} \sum_{i=1}^n \big(\hat{f}_n(x_i) - f^*(x_i)\big)^2.$$

Then,

$$-\frac{2}{n}\,\mathbb{E}\big[\log A(p_{\hat{f}_n}, p_{f^*})\big] = \frac{1}{4\sigma^2 n} \sum_{i=1}^n \mathbb{E}\big[(\hat{f}_n(x_i) - f^*(x_i))^2\big].$$

We also have

$$\frac{1}{n} K(p_f, p_{f^*}) = \frac{1}{n} \sum_{i=1}^n \frac{(f(x_i) - f^*(x_i))^2}{2\sigma^2}$$

and, up to an additive constant that does not depend on $f$,

$$-\log p_f(Y) = \sum_{i=1}^n \frac{(Y_i - f(x_i))^2}{2\sigma^2}.$$

Combining everything, we get

$$\hat{f}_n = \arg\min_{f \in F} \left\{ \frac{1}{n} \sum_{i=1}^n \frac{(Y_i - f(x_i))^2}{2\sigma^2} + \frac{2 c(f) \log 2}{n} \right\}.$$

The theorem tells us that

$$\frac{1}{4n} \sum_{i=1}^n \frac{\mathbb{E}\big[(\hat{f}_n(x_i) - f^*(x_i))^2\big]}{\sigma^2} \le \min_{f \in F} \left\{ \frac{1}{n} \sum_{i=1}^n \frac{(f(x_i) - f^*(x_i))^2}{2\sigma^2} + \frac{2 c(f) \log 2}{n} \right\}$$

or

$$\frac{1}{n} \sum_{i=1}^n \mathbb{E}\big[(\hat{f}_n(x_i) - f^*(x_i))^2\big] \le \min_{f \in F} \left\{ \frac{2}{n} \sum_{i=1}^n (f(x_i) - f^*(x_i))^2 + \frac{8\sigma^2 c(f) \log 2}{n} \right\}.$$
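In the Gaussian case the estimator is just penalized least squares over the pool, which is easy to implement when $F$ is finite. The sketch below (the constant candidate pool, uniform complexities, and noise level are illustrative assumptions) selects a model by minimizing the penalized empirical risk:

```python
import math, random

def select_model(xs, ys, candidates, complexities, sigma):
    """Complexity-regularized least squares over a finite pool:
    minimize (1/n) sum_i (Y_i - f(x_i))^2 / (2 sigma^2) + 2 c(f) log 2 / n."""
    n = len(xs)
    best, best_obj = None, float("inf")
    for f, c in zip(candidates, complexities):
        rss = sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / (2 * sigma ** 2 * n)
        obj = rss + 2 * c * math.log(2) / n
        if obj < best_obj:
            best, best_obj = f, obj
    return best

# Illustrative setup: constant candidate functions, uniform complexities
# c(f) = log2 |F|, and data generated from f*(x) = 0.5 plus Gaussian noise.
random.seed(0)
candidates = [lambda x, a=a: a for a in (-1.0, 0.0, 0.5, 1.0)]
complexities = [math.log2(len(candidates))] * len(candidates)
sigma = 0.1
xs = [i / 50 for i in range(50)]
ys = [0.5 + random.gauss(0.0, sigma) for _ in xs]
f_hat = select_model(xs, ys, candidates, complexities, sigma)
print(f_hat(0.0))  # with this seed and small noise, the a = 0.5 candidate wins
```

With uniform complexities the penalty is constant over $F$ and the rule reduces to ordinary least squares; unequal complexities tilt the selection toward simpler models, exactly as the $8\sigma^2 c(f) \log 2 / n$ term in the bound suggests.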

Now let's come back to the proof.

Proof
$$H^2(p_{\hat{f}_n}, p_{f^*}) = \int \left(\sqrt{p_{\hat{f}_n}(y)} - \sqrt{p_{f^*}(y)}\right)^2 dy \le -2 \log \underbrace{\int \sqrt{p_{\hat{f}_n}(y)\, p_{f^*}(y)}\, dy}_{\text{affinity}},$$

where the inequality follows from expanding the square to get $H^2 = 2 - 2A$ and using $1 - x \le -\log x$. Hence

$$\mathbb{E}\big[H^2(p_{\hat{f}_n}, p_{f^*})\big] \le 2\,\mathbb{E}\left[\log \frac{1}{\int \sqrt{p_{\hat{f}_n}(y)\, p_{f^*}(y)}\, dy}\right].$$

Now, define the theoretical analog of $\hat{f}_n$:

$$f_n = \arg\min_{f \in F} \left\{ \frac{1}{n} K(p_f, p_{f^*}) + \frac{2 c(f) \log 2}{n} \right\}.$$

Since

$$\hat{f}_n = \arg\min_{f \in F} \left\{ -\frac{1}{n} \log p_f(Y) + \frac{2 c(f) \log 2}{n} \right\} = \arg\max_{f \in F} \big\{ \log p_f(Y) - 2 c(f) \log 2 \big\} = \arg\max_{f \in F} \left\{ \frac{1}{2} \log p_f(Y) - c(f) \log 2 \right\} = \arg\max_{f \in F} \log \left( \sqrt{p_f(Y)}\, e^{-c(f) \log 2} \right) = \arg\max_{f \in F} \sqrt{p_f(Y)}\, e^{-c(f) \log 2},$$

we can see that

$$\frac{\sqrt{p_{\hat{f}_n}(Y)}\, e^{-c(\hat{f}_n) \log 2}}{\sqrt{p_{f_n}(Y)}\, e^{-c(f_n) \log 2}} \ge 1.$$

Then we can write

$$\mathbb{E}\big[H^2(p_{\hat{f}_n}, p_{f^*})\big] \le 2\,\mathbb{E}\left[\log \frac{1}{\int \sqrt{p_{\hat{f}_n}(y)\, p_{f^*}(y)}\, dy}\right] \le 2\,\mathbb{E}\left[\log \left( \frac{\sqrt{p_{\hat{f}_n}(Y)}\, e^{-c(\hat{f}_n) \log 2}}{\sqrt{p_{f_n}(Y)}\, e^{-c(f_n) \log 2}} \cdot \frac{1}{\int \sqrt{p_{\hat{f}_n}(y)\, p_{f^*}(y)}\, dy} \right)\right].$$

Now, simply multiply the argument inside the logarithm by $\sqrt{p_{f^*}(Y)}/\sqrt{p_{f^*}(Y)}$ to get

$$\begin{aligned}
\mathbb{E}\big[H^2(p_{\hat{f}_n}, p_{f^*})\big] &\le 2\,\mathbb{E}\left[\log \left( \frac{\sqrt{p_{f^*}(Y)}}{\sqrt{p_{f_n}(Y)}} \cdot \frac{\sqrt{p_{\hat{f}_n}(Y)}}{\sqrt{p_{f^*}(Y)}} \cdot \frac{e^{-c(\hat{f}_n) \log 2}}{e^{-c(f_n) \log 2}} \cdot \frac{1}{\int \sqrt{p_{\hat{f}_n}(y)\, p_{f^*}(y)}\, dy} \right)\right] \\
&= \mathbb{E}\left[\log \frac{p_{f^*}(Y)}{p_{f_n}(Y)}\right] + 2 c(f_n) \log 2 + 2\,\mathbb{E}\left[\log \frac{\sqrt{p_{\hat{f}_n}(Y)/p_{f^*}(Y)}\; e^{-c(\hat{f}_n) \log 2}}{\int \sqrt{p_{\hat{f}_n}(y)\, p_{f^*}(y)}\, dy}\right] \\
&= K(p_{f_n}, p_{f^*}) + 2 c(f_n) \log 2 + 2\,\mathbb{E}\left[\log \frac{\sqrt{p_{\hat{f}_n}(Y)/p_{f^*}(Y)}\; e^{-c(\hat{f}_n) \log 2}}{\int \sqrt{p_{\hat{f}_n}(y)\, p_{f^*}(y)}\, dy}\right].
\end{aligned}$$

The terms $K(p_{f_n}, p_{f^*}) + 2 c(f_n) \log 2$ are, after dividing by $n$, precisely the upper bound of the theorem. So, to finish the proof we only need to show that the last term is non-positive. Applying Jensen's inequality, we get

$$2\,\mathbb{E}\left[\log \frac{\sqrt{p_{\hat{f}_n}(Y)/p_{f^*}(Y)}\; e^{-c(\hat{f}_n) \log 2}}{\int \sqrt{p_{\hat{f}_n}(y)\, p_{f^*}(y)}\, dy}\right] \le 2 \log \mathbb{E}\left[\frac{e^{-c(\hat{f}_n) \log 2}\, \sqrt{p_{\hat{f}_n}(Y)/p_{f^*}(Y)}}{\int \sqrt{p_{\hat{f}_n}(y)\, p_{f^*}(y)}\, dy}\right].$$

Both $Y$ and $\hat{f}_n$ are random, which makes the expectation difficult to compute. However, we can simplify the problem using the union bound, which eliminates the dependence on $\hat{f}_n$:

$$2\,\mathbb{E}\left[\log \frac{\sqrt{p_{\hat{f}_n}(Y)/p_{f^*}(Y)}\; e^{-c(\hat{f}_n) \log 2}}{\int \sqrt{p_{\hat{f}_n}(y)\, p_{f^*}(y)}\, dy}\right] \le 2 \log \mathbb{E}\left[\sum_{f \in F} \frac{e^{-c(f) \log 2}\, \sqrt{p_f(Y)/p_{f^*}(Y)}}{\int \sqrt{p_f(y)\, p_{f^*}(y)}\, dy}\right] = 2 \log \sum_{f \in F} 2^{-c(f)}\, \frac{\mathbb{E}\left[\sqrt{p_f(Y)/p_{f^*}(Y)}\right]}{\int \sqrt{p_f(y)\, p_{f^*}(y)}\, dy} = 2 \log \sum_{f \in F} 2^{-c(f)} \le 0,$$

where the last two steps follow from

$$\mathbb{E}\left[\sqrt{\frac{p_f(Y)}{p_{f^*}(Y)}}\right] = \int \sqrt{\frac{p_f(y)}{p_{f^*}(y)}}\; p_{f^*}(y)\, dy = \int \sqrt{p_f(y)\, p_{f^*}(y)}\, dy$$
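This identity is easy to sanity-check by Monte Carlo: sample $Y \sim p_{f^*}$ and average $\sqrt{p_f(Y)/p_{f^*}(Y)}$, which should match the affinity $\int \sqrt{p_f\, p_{f^*}}\, dy$. The sketch below (Gaussian densities with illustrative means and variance) does exactly that; for unit-variance Gaussians with means $0$ and $1$ the affinity is $e^{-1/8}$:

```python
import math, random

random.seed(1)
mu_star, mu_f, sigma = 0.0, 1.0, 1.0
N = 200_000

# Sample Y ~ p_{f*} = N(mu_star, sigma^2) and average sqrt(p_f(Y) / p_{f*}(Y)).
total = 0.0
for _ in range(N):
    y = random.gauss(mu_star, sigma)
    log_ratio = (-(y - mu_f) ** 2 + (y - mu_star) ** 2) / (2 * sigma ** 2)
    total += math.exp(0.5 * log_ratio)
mc_estimate = total / N

# Closed-form affinity for equal-variance Gaussians:
# int sqrt(p_f p_{f*}) dy = exp(-(mu_f - mu_star)^2 / (8 sigma^2)).
affinity = math.exp(-(mu_f - mu_star) ** 2 / (8 * sigma ** 2))
assert abs(mc_estimate - affinity) < 0.01
```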

and the Kraft inequality

$$\sum_{f \in F} 2^{-c(f)} \le 1,$$

which completes the proof.


Source:  OpenStax, Statistical learning theory. OpenStax CNX. Apr 10, 2009 Download for free at http://cnx.org/content/col10532/1.3