Recall the classification problem. In Lecture 6, where we assumed that $\min_{f \in \mathcal{F}} R(f) = 0$, we obtained the PAC bound
$$P\left( R(\hat{f}_n) > \epsilon \right) \;\le\; |\mathcal{F}| \, e^{-n\epsilon}.$$
From Corollary 1 in Lecture 6,
$$E\left[ R(\hat{f}_n) \right] \;\le\; \frac{1 + \log |\mathcal{F}|}{n}.$$
In Lectures 7 and 8, we dropped the assumption that $\min_{f \in \mathcal{F}} R(f) = 0$ and obtained, via Hoeffding's inequality and the union bound,
$$P\left( R(\hat{f}_n) - \min_{f \in \mathcal{F}} R(f) > \epsilon \right) \;\le\; 2\, |\mathcal{F}| \, e^{-n\epsilon^2/2}.$$
This led to an expected excess risk bound of the form
$$E\left[ R(\hat{f}_n) \right] - \min_{f \in \mathcal{F}} R(f) \;=\; O\!\left( \sqrt{\frac{\log |\mathcal{F}|}{n}} \right).$$
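Both bounds above follow the same two-step pattern: Hoeffding's inequality for each fixed $f$, then a union bound over the finite class. A brief sketch of the standard argument (the constants are chosen here to match the display above and are not essential):
$$P\left( \max_{f \in \mathcal{F}} \left| \hat{R}_n(f) - R(f) \right| > \frac{\epsilon}{2} \right) \;\le\; \sum_{f \in \mathcal{F}} P\left( \left| \hat{R}_n(f) - R(f) \right| > \frac{\epsilon}{2} \right) \;\le\; 2\, |\mathcal{F}| \, e^{-n\epsilon^2/2},$$
and on the complementary event the empirical risk minimizer satisfies
$$R(\hat{f}_n) \;\le\; \hat{R}_n(\hat{f}_n) + \frac{\epsilon}{2} \;\le\; \hat{R}_n(f) + \frac{\epsilon}{2} \;\le\; R(f) + \epsilon \quad \text{for every } f \in \mathcal{F}.$$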
Hoeffding's inequality was central to our analysis of learning under bounded loss functions. In many regression and signal estimation problems it is natural to consider squared error loss functions (rather than 0/1 or absolute error). In such cases, we will need to derive bounds using different techniques.
To illustrate the distinction between classification and regression, consider a simple, scalar signal plus noise problem. Consider $Y_i = \theta + W_i$, $i = 1, \dots, n$, where $\theta$ is a fixed unknown scalar parameter and the $W_i$ are independent, zero-mean, unit variance random variables. Let $\hat{\theta}_n = \frac{1}{n} \sum_{i=1}^n Y_i$. Then, according to the Central Limit Theorem, $\hat{\theta}_n$ is distributed approximately $\mathcal{N}(\theta, \tfrac{1}{n})$. A simple tail-bound on the Gaussian distribution gives us
$$P\left( |\hat{\theta}_n - \theta| \ge \epsilon \right) \;\le\; 2\, e^{-n\epsilon^2/2},$$
which implies that
$$P\left( |\hat{\theta}_n - \theta|^2 \ge \epsilon \right) \;\le\; 2\, e^{-n\epsilon/2}.$$
This is a bound on the deviations of the squared error $|\hat{\theta}_n - \theta|^2$. Notice that the exponential decay rate is a function of $\epsilon$ rather than $\epsilon^2$, as in Hoeffding's inequality. The squared error concentration inequality implies that $E\left[ |\hat{\theta}_n - \theta|^2 \right] = O\left( \tfrac{1}{n} \right)$ (just write $E\left[ |\hat{\theta}_n - \theta|^2 \right] = \int_0^\infty P\left( |\hat{\theta}_n - \theta|^2 \ge t \right) dt \le \int_0^\infty 2\, e^{-nt/2}\, dt = \tfrac{4}{n}$). Therefore, in regression with a squared error loss, we can hope to get a rate of convergence as fast as $n^{-1}$ instead of $n^{-1/2}$. The reason is simply because we are using a squared error loss instead of the 0/1 or absolute error loss.
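The contrast between the two rates is easy to see numerically. The following simulation sketch (not part of the notes; the Gaussian noise, the value of $\theta$, and the sample sizes are assumptions made purely for illustration) estimates $E[|\hat{\theta}_n - \theta|^2]$ and $E[|\hat{\theta}_n - \theta|]$ by Monte Carlo and compares them to $1/n$ and $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0          # true parameter (arbitrary choice for illustration)
num_trials = 20000   # Monte Carlo repetitions per sample size

for n in [10, 100, 1000]:
    # Draw num_trials independent datasets of size n: Y_i = theta + W_i, W_i ~ N(0, 1)
    Y = theta + rng.standard_normal((num_trials, n))
    theta_hat = Y.mean(axis=1)

    mse = np.mean((theta_hat - theta) ** 2)   # decays like 1/n
    mae = np.mean(np.abs(theta_hat - theta))  # decays like 1/sqrt(n)
    print(f"n={n:5d}  E|err|^2={mse:.5f} (1/n={1/n:.5f})  "
          f"E|err|={mae:.5f} (1/sqrt(n)={1/np.sqrt(n):.5f})")
```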
To begin our investigation into regression and function estimation, let us consider the following setup. Let $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \mathbb{R}$. Take $f$ such that $f$ is a map $f: \mathcal{X} \to \mathcal{Y}$. We have training data $\{X_i, Y_i\}_{i=1}^n$ drawn i.i.d. from $P_{XY}$. As our loss function, we take the squared error, i.e.,
$$\ell(f(X), Y) = (f(X) - Y)^2.$$
The risk is then the MSE:
$$R(f) = E\left[ (f(X) - Y)^2 \right].$$
We know that the function $f^*$ that minimizes the MSE is just the conditional expectation of $Y$ given $X$:
$$f^*(x) = E\left[ Y \mid X = x \right].$$
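To see why the conditional mean is the minimizer (a standard argument, sketched here for completeness), add and subtract $f^*(X)$ inside the square; the cross term vanishes because $E\left[ Y - f^*(X) \mid X \right] = 0$:
$$R(f) = E\left[ \left( f(X) - f^*(X) + f^*(X) - Y \right)^2 \right] = E\left[ \left( f(X) - f^*(X) \right)^2 \right] + R(f^*) \;\ge\; R(f^*).$$
In particular, under the squared error loss the excess risk of any $f$ is $R(f) - R(f^*) = E\left[ (f(X) - f^*(X))^2 \right]$, the squared $L_2(P_X)$ distance between $f$ and $f^*$.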
Now let $\mathcal{F}$ be a collection of candidate functions. We would like to select an $\hat{f}_n \in \mathcal{F}$ using the training data such that the excess risk
$$E\left[ R(\hat{f}_n) \right] - R(f^*)$$
is small. Let's consider the difference between the empirical risks:
$$\hat{R}_n(f) - \hat{R}_n(f^*) = \frac{1}{n} \sum_{i=1}^n \left[ (f(X_i) - Y_i)^2 - (f^*(X_i) - Y_i)^2 \right],$$
where $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n (f(X_i) - Y_i)^2$ denotes the empirical risk.
Note that $E\left[ \hat{R}_n(f) - \hat{R}_n(f^*) \right] = R(f) - R(f^*)$. Hence, by the Strong Law of Large Numbers (SLLN), we know that
$$\hat{R}_n(f) - \hat{R}_n(f^*) \;\to\; R(f) - R(f^*) \quad \text{almost surely}$$
as $n \to \infty$. But how fast is this convergence?
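To make the rate question concrete, here is a small illustrative simulation (the model $Y = \sin(2\pi X) + W$ with $X \sim \mathrm{Unif}[0,1]$, the fixed candidate $f \equiv 0$, and all constants are assumptions chosen only for this sketch). It tracks how the empirical excess risk $\hat{R}_n(f) - \hat{R}_n(f^*)$ fluctuates around its limit $R(f) - R(f^*) = E[(f(X) - f^*(X))^2] = \tfrac{1}{2}$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

f_star = lambda x: np.sin(2 * np.pi * x)   # true regression function E[Y | X = x]
f_cand = lambda x: np.zeros_like(x)        # a fixed (poor) candidate model

# True excess risk: E[(f_cand(X) - f_star(X))^2] = E[sin^2(2 pi X)] = 1/2 for X ~ Unif[0, 1]
true_excess = 0.5

for n in [100, 1000, 10000, 100000]:
    X = rng.uniform(0.0, 1.0, size=n)
    Y = f_star(X) + rng.standard_normal(n)   # unit-variance Gaussian noise
    emp_excess = np.mean((f_cand(X) - Y) ** 2) - np.mean((f_star(X) - Y) ** 2)
    print(f"n={n:6d}  empirical excess risk={emp_excess:.4f}  (true value {true_excess})")
```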
We will derive a PAC style bound for the difference $\left( R(f) - R(f^*) \right) - \left( \hat{R}_n(f) - \hat{R}_n(f^*) \right)$. The following derivation is from Barron 1991. The excess risk and its empirical counterpart will be denoted by
$$r(f) = R(f) - R(f^*) \quad \text{and} \quad \hat{r}_n(f) = \hat{R}_n(f) - \hat{R}_n(f^*).$$
Note that $\hat{r}_n(f)$ is the sum of independent random variables:
$$\hat{r}_n(f) = \frac{1}{n} \sum_{i=1}^n U_i,$$
where $U_i = (f(X_i) - Y_i)^2 - (f^*(X_i) - Y_i)^2$. Therefore,
$$r(f) - \hat{r}_n(f) = \frac{1}{n} \sum_{i=1}^n \left( E[U_i] - U_i \right).$$
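To make the centering explicit (a one-line check, using the decomposition noted after the definition of $f^*$), each $U_i$ has mean equal to the excess risk:
$$E[U_i] = E\left[ (f(X_i) - Y_i)^2 \right] - E\left[ (f^*(X_i) - Y_i)^2 \right] = R(f) - R(f^*) = r(f) = E\left[ (f(X_i) - f^*(X_i))^2 \right],$$
so $r(f) - \hat{r}_n(f)$ is an average of independent, zero-mean random variables.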
We are looking for a PAC bound of the form
$$P\left( r(f) - \hat{r}_n(f) \ge \epsilon \right) \;\le\; \delta.$$
If the variables $U_i$ are bounded, then we can apply Hoeffding's inequality. However, a more useful bound for our regression problem can be derived if the variables satisfy the following moment condition:
$$E\left[ \left| U_i - E[U_i] \right|^k \right] \;\le\; \frac{\mathrm{var}(U_i)}{2} \, k! \, h^{k-2}, \quad k = 2, 3, \dots,$$
for some $h > 0$.
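For intuition, bounded variables always satisfy a condition of this type: if $|U_i - E[U_i]| \le b$ almost surely, then $E\left[ |U_i - E[U_i]|^k \right] \le b^{k-2} \, \mathrm{var}(U_i) \le \frac{\mathrm{var}(U_i)}{2} \, k! \, (b/3)^{k-2}$, since $3^{k-2} \le k!/2$ for every $k \ge 2$, so the condition holds with $h = b/3$. The value of the moment condition is that it also accommodates unbounded variables with sufficiently light (e.g., exponential-like) tails, which is exactly the situation that arises under squared error loss with unbounded noise.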