The start and end points of each line segment are each one of $\sqrt{n}$ discrete values, as indicated in [link]. Since each line may start at any of the $\sqrt{n}$ levels and terminate at any of the $\sqrt{n}$ levels, there are a total of $n$ possible lines for each segment.
Given that there are $k$ intervals, we have
$$|\mathcal{F}_k| = n^k .$$
Therefore we can use $k \log_2 n$ bits to describe a function $f \in \mathcal{F}_k$.
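For concreteness, here is the count for one illustrative choice of sizes (the particular numbers are not from the notes): with $n = 1024$ samples the endpoints are quantized to $\sqrt{n} = 32$ levels, so each endpoint takes $\tfrac{1}{2}\log_2 n = 5$ bits, and a candidate with $k = 8$ pieces is described with
$$k \log_2 n = 8 \cdot 10 = 80 \text{ bits.}$$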
Let
$$\mathcal{F} = \bigcup_{k=1}^{n} \mathcal{F}_k .$$
Construct a prefix code for every $f \in \mathcal{F}$ by first encoding $k$ with $k$ bits (say $k-1$ ones followed by a zero) and then encoding the $k$ line segments with $k \log_2 n$ bits. Thus, if $f \in \mathcal{F}_k$, then the prefix code associated with $f$ has codeword length
$$c(f) = k + k \log_2 n ,$$
which satisfies the Kraft Inequality
$$\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1 .$$
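As a quick sanity check, the Kraft sum can be evaluated directly using $|\mathcal{F}_k| = n^k$ and $c(f) = k + k\log_2 n$:
$$\sum_{f \in \mathcal{F}} 2^{-c(f)} = \sum_{k=1}^{n} n^k \, 2^{-k - k\log_2 n} = \sum_{k=1}^{n} n^k \, 2^{-k} \, n^{-k} = \sum_{k=1}^{n} 2^{-k} \le 1 .$$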
Now we will apply our complexity regularization result to select a function $\hat{f}_n$ from $\mathcal{F}$ and bound its risk.
We are assuming Gaussian errors, so
$$-\log p_f(Y_i) = \frac{(Y_i - f(x_i))^2}{2\sigma^2} + \frac{1}{2}\log\!\big(2\pi\sigma^2\big) .$$
We can ignore the constant term and so our empirical selection is
$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \frac{1}{2\sigma^2} \sum_{i=1}^{n} \big(Y_i - f(x_i)\big)^2 + 2\, c(f) \log 2 \right\} .$$
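Equivalently (multiplying the criterion through by the constant $2\sigma^2$, which does not change the minimizer), the selection can be read as a penalized residual sum of squares; the penalty below simply restates the criterion above:
$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \sum_{i=1}^{n} \big(Y_i - f(x_i)\big)^2 + 4\sigma^2 \log 2 \; c(f) \right\} .$$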
We can compute $\hat{f}_n$ according to:
For $k = 1, \dots, n$, compute
$$\hat{f}_k = \arg\min_{f \in \mathcal{F}_k} \sum_{i=1}^{n} \big(Y_i - f(x_i)\big)^2 ,$$
then select
$$\hat{k} = \arg\min_{1 \le k \le n} \left\{ \frac{1}{2\sigma^2} \sum_{i=1}^{n} \big(Y_i - \hat{f}_k(x_i)\big)^2 + 2\, c(\hat{f}_k) \log 2 \right\} ,$$
and finally
$$\hat{f}_n = \hat{f}_{\hat{k}} .$$
Because the KL divergence and affinity simply reduce to squared error in the Gaussian case (Lecture 14), we arrive at a relatively simple bound on the mean square error of $\hat{f}_n$:
$$\frac{1}{n} \sum_{i=1}^{n} E\left[ \big( \hat{f}_n(x_i) - f^*(x_i) \big)^2 \right] \;\le\; \min_{1 \le k \le n} \left\{ \min_{f \in \mathcal{F}_k} \frac{1}{n} \sum_{i=1}^{n} \big( f(x_i) - f^*(x_i) \big)^2 \;+\; C\, \frac{k \log_2 n + k}{n} \right\} ,$$
where $f^*$ denotes the true function and $C$ is a constant depending only on $\sigma^2$. The first term in the brackets above is related to the error incurred by approximating $f^*$ by an element of $\mathcal{F}_k$.
The second term is related to the estimation error
involved with the model selection process.
Let's focus on the approximation error.
First, suppose
$$|f^{*\prime}(t) - f^{*\prime}(s)| \le L\,|t - s|$$
for all $t, s \in [0,1]$. Let $\bar{f}_k$ be the “best” piecewise linear approximation to $f^*$, with $k$ pieces on intervals
$$I_j = \left[ \frac{j-1}{k}, \frac{j}{k} \right) , \qquad j = 1, \dots, k .$$
Consider the difference between $f^*$ and $\bar{f}_k$ on one such interval, say $I_j$. By applying Taylor's theorem with remainder we have
$$f^*(t) = f^*(t_j) + f^{*\prime}(\xi)\,(t - t_j)$$
for $t \in I_j$ and some $\xi$ between $t_j$ and $t$, where $t_j = \frac{j-1}{k}$ is the left endpoint of $I_j$. Define
$$\bar{f}_k(t) = f^*(t_j) + f^{*\prime}(t_j)\,(t - t_j) , \qquad t \in I_j .$$
Note that $\bar{f}_k$ is not necessarily the best piecewise linear approximation to $f^*$, just good enough for our purposes. Then using the fact that $|t - t_j| \le \frac{1}{k}$ and $|\xi - t_j| \le \frac{1}{k}$, for $t \in I_j$ we have
$$|f^*(t) - \bar{f}_k(t)| = |f^{*\prime}(\xi) - f^{*\prime}(t_j)|\,|t - t_j| \le L\,|\xi - t_j|\,|t - t_j| \le \frac{L}{k^2} .$$
So, for all $t \in [0,1]$,
$$|f^*(t) - \bar{f}_k(t)| \le \frac{L}{k^2} .$$
Now let $f_k$ be the element of $\mathcal{F}_k$ closest to $\bar{f}_k$ ($f_k$ is the quantized version of $\bar{f}_k$). Then
$$|f_k(t) - \bar{f}_k(t)| \le \frac{1}{\sqrt{n}} \qquad \text{for all } t ,$$
since we used $\log_2 n$ bits ($\tfrac{1}{2}\log_2 n$ bits per endpoint, i.e., $\sqrt{n}$ levels) to quantize the endpoints of each line segment. Consequently,
$$|f^*(t) - f_k(t)| \le |f^*(t) - \bar{f}_k(t)| + |\bar{f}_k(t) - f_k(t)| \le \frac{L}{k^2} + \frac{1}{\sqrt{n}} .$$
Thus it follows that
$$\frac{1}{n} \sum_{i=1}^{n} E\left[ \big( \hat{f}_n(x_i) - f^*(x_i) \big)^2 \right] \;\le\; \left( \frac{L}{k^2} + \frac{1}{\sqrt{n}} \right)^2 + C\, \frac{k \log_2 n + k}{n} \;\le\; \frac{L^2}{k^4} + \frac{2L}{k^2 \sqrt{n}} + \frac{1}{n} + C\, \frac{k \log_2 n + k}{n} .$$
The first and last terms dominate the above expression. Therefore, the upper bound is minimized when $\frac{L^2}{k^4}$ and $C\,\frac{k \log_2 n}{n}$ are balanced. This is accomplished by choosing
$$k = \left( \frac{n}{\log n} \right)^{1/5} .$$
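To verify that this choice balances the two dominant terms (constants such as $L$ and $C$, and the base of the logarithm, only affect the leading constant), note that
$$k = \left( \frac{n}{\log n} \right)^{1/5} \;\Longrightarrow\; \frac{1}{k^4} = \left( \frac{\log n}{n} \right)^{4/5} \quad \text{and} \quad \frac{k \log n}{n} = \left( \frac{n}{\log n} \right)^{1/5} \frac{\log n}{n} = \left( \frac{\log n}{n} \right)^{4/5} .$$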
Then it follows that
$$\frac{1}{n} \sum_{i=1}^{n} E\left[ \big( \hat{f}_n(x_i) - f^*(x_i) \big)^2 \right] = O\!\left( \left( \frac{\log n}{n} \right)^{4/5} \right) .$$
If $f^*$ is not this smooth, then we have to settle for a coarser approximation and a slower rate.
If
$$|f^*(t) - f^*(s)| \le L\,|t - s|$$
for all $t, s \in [0,1]$, let $\bar{f}_k$ be the following piecewise constant approximation to $f^*$. Let
$$\bar{f}_k(t) = f^*(t_j) , \qquad t \in I_j ,\ j = 1, \dots, k ,$$
with $t_j = \frac{j-1}{k}$ as before. Then
$$|f^*(t) - \bar{f}_k(t)| = |f^*(t) - f^*(t_j)| \le L\,|t - t_j| \le \frac{L}{k} \qquad \text{for all } t \in [0,1] .$$
Repeating the same reasoning as in the piecewise linear case, we arrive at
$$\frac{1}{n} \sum_{i=1}^{n} E\left[ \big( \hat{f}_n(x_i) - f^*(x_i) \big)^2 \right] = O\!\left( \left( \frac{\log n}{n} \right)^{\frac{2\alpha}{2\alpha+1}} \right)$$
for Hölder smoothness $\alpha \in (0, 2]$ (the Lipschitz case above corresponds to $\alpha = 1$ and the Lipschitz-derivative case to $\alpha = 2$). In particular, for $\alpha = 1$ we get
$$O\!\left( \left( \frac{\log n}{n} \right)^{2/3} \right) ,$$
within a logarithmic factor of the rate we had before (in Lecture 4) for that case!
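For reference, evaluating the exponent at the two smoothness levels recovers both rates derived above:
$$\alpha = 1:\ \frac{2\alpha}{2\alpha+1} = \frac{2}{3} , \qquad \alpha = 2:\ \frac{2\alpha}{2\alpha+1} = \frac{4}{5} .$$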
Summary
$\hat{f}_n$ can be computed by finding least-square line fits to the data on partitions of the form $\left\{ \left[ \frac{j-1}{k}, \frac{j}{k} \right) \right\}_{j=1}^{k}$ for $k = 1, \dots, n$, and then selecting the best fit by the value of $k$ that gives the minimum of the complexity regularization criterion (see the sketch at the end of this summary).
If $f^*$ is Hölder smooth with exponent $\alpha$ for some $\alpha \in (0, 2]$, then $\hat{f}_n$ automatically picks the optimal number of bins. Essentially $\hat{k}$ (indirectly) estimates the smoothness of $f^*$ and produces a rate which is near minimax optimal! ($n^{-\frac{2\alpha}{2\alpha+1}}$ is the best possible).
The larger $\alpha$ is, the faster the convergence and the better the denoising!
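To make the computation in the summary concrete, here is a minimal sketch in Python. It assumes equally spaced samples $x_i = i/n$ on $[0,1]$, a known noise variance $\sigma^2$, and the penalized criterion written earlier, $\frac{1}{2\sigma^2}\sum_{i}\big(Y_i - f(x_i)\big)^2 + 2\,c(f)\log 2$ with $c(f) = k + k\log_2 n$; the function names and the synthetic-data example are illustrative rather than taken from the lecture.

```python
# Minimal sketch of the denoising procedure summarized above.  Assumptions:
# samples x_i = i/n on [0, 1], known noise variance sigma2, and the criterion
# (1 / (2 sigma^2)) * RSS + 2 log(2) * c(f) with c(f) = k + k * log2(n).
# The endpoint quantization used in the coding argument is omitted; it matters
# for the analysis, not for the least-squares computation.

import numpy as np


def piecewise_linear_fit(x, y, k):
    """Least-squares line fit on each of the k equal-width intervals of [0, 1].

    Returns the fitted values at the sample points and the residual sum of squares.
    """
    fitted = np.zeros_like(y, dtype=float)
    edges = np.linspace(0.0, 1.0, k + 1)
    for j in range(k):
        # Samples in the j-th interval [j/k, (j+1)/k); the last interval is closed.
        if j < k - 1:
            idx = (x >= edges[j]) & (x < edges[j + 1])
        else:
            idx = (x >= edges[j]) & (x <= edges[j + 1])
        if idx.sum() >= 2:
            slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
            fitted[idx] = slope * x[idx] + intercept
        elif idx.sum() == 1:
            fitted[idx] = y[idx]  # a single sample is fit exactly
    rss = np.sum((y - fitted) ** 2)
    return fitted, rss


def denoise(x, y, sigma2):
    """Pick k by the complexity-regularization criterion; return (fit, k_hat)."""
    n = len(y)
    best_crit, best_k, best_fit = np.inf, None, None
    for k in range(1, n + 1):
        fitted, rss = piecewise_linear_fit(x, y, k)
        c_f = k + k * np.log2(n)                       # codeword length c(f)
        crit = rss / (2.0 * sigma2) + 2.0 * np.log(2.0) * c_f
        if crit < best_crit:
            best_crit, best_k, best_fit = crit, k, fitted
    return best_fit, best_k


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 256
    x = np.arange(1, n + 1) / n
    f_star = np.sin(2 * np.pi * x)                     # an illustrative smooth f*
    y = f_star + 0.3 * rng.standard_normal(n)
    f_hat, k_hat = denoise(x, y, sigma2=0.3 ** 2)
    print("selected number of pieces:", k_hat)
    print("empirical MSE:", np.mean((f_hat - f_star) ** 2))
```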