Therefore, apart from that factor, ERM achieves the best possible performance.
Reducing the initial problem to binary hypothesis testing does not always work. Sometimes we need $M + 1$ hypotheses, with $M \to \infty$ as $n \to \infty$. If this is the case, we have the following theorem:
Theorem 2
Let $M \ge 2$ and let $f_0, f_1, \dots, f_M \in \mathcal{F}$ be such that $d(f_j, f_k) \ge 2s > 0$ for all $0 \le j < k \le M$, where $d$ is a semi-distance. Assume moreover that $\mathrm{KL}(P_j, P_0) \le \alpha \log M$ for all $j = 1, \dots, M$, with $0 < \alpha < 1/8$, where $P_j$ denotes the distribution of the data when the true function is $f_j$. Then
$$
\inf_{\hat f} \max_{0 \le j \le M} P_j\Big( d(\hat f, f_j) \ge s \Big) \;\ge\; \frac{\sqrt{M}}{1 + \sqrt{M}} \left( 1 - 2\alpha - \sqrt{\frac{2\alpha}{\log M}} \right) \;>\; 0.
$$
We will use this theorem to show that the estimator of Lecture 4 is optimal. Recall the setup of Lecture 4. Let
$$
\mathcal{F}(L) = \Big\{ f : [0, 1] \to \mathbb{R} \;:\; |f(x) - f(y)| \le L |x - y| \ \text{ for all } x, y \in [0, 1] \Big\},
$$
i.e. the class of Lipschitz functions with constant $L$. Let $Y_i = f(x_i) + \xi_i$, $i = 1, \dots, n$, with fixed design points $x_i = i/n$ and $\xi_i \sim \mathcal{N}(0, \sigma^2)$ i.i.d. In that lecture, we constructed an estimator $\hat f_n$ such that
$$
\sup_{f \in \mathcal{F}(L)} \mathbb{E}_f \, \|\hat f_n - f\|_2^2 \;\le\; C\, n^{-2/3}
$$
for a constant $C$ depending on $L$ and $\sigma$. Is this the best we can do?
We are going to construct a collection $\{f_0, f_1, \dots, f_M\} \subset \mathcal{F}(L)$ and apply Theorem 2. Notice that the metric of interest is
$$
d(f, g) = \|f - g\|_2 = \left( \int_0^1 \big( f(x) - g(x) \big)^2 \, dx \right)^{1/2},
$$
a semi-distance. Let $m \in \mathbb{N}$ (to be chosen later) and set $h = 1/m$. Let
$$
x_j = \frac{j - 1/2}{m}, \qquad j = 1, \dots, m,
$$
and define
$$
\varphi_j(x) = L\, h\, K\!\left( \frac{x - x_j}{h} \right),
$$
where $K : \mathbb{R} \to [0, \infty)$ is a fixed $1$-Lipschitz function supported on $[-1/2, 1/2]$. Note that each $\varphi_j$ belongs to $\mathcal{F}(L)$ and is supported on $\big[ \frac{j-1}{m}, \frac{j}{m} \big]$, so the $\varphi_j$ have disjoint supports.
The subclass we are going to consider are functions of the form
$$
\sum_{j=1}^m \omega_j \varphi_j(x), \qquad \omega_j \in \{0, 1\},
$$
i.e. sums of "bump" functions. Let $\Omega = \{0, 1\}^m$ be the collection of binary vectors of length $m$, e.g. $\omega = (0, 1, 1, 0, \dots, 1)$. Define
$$
f_\omega(x) = \sum_{j=1}^m \omega_j \varphi_j(x), \qquad \omega \in \Omega.
$$
Note that for $\omega, \omega' \in \Omega$,
$$
d(f_\omega, f_{\omega'})^2 = \int_0^1 \Big( \sum_{j=1}^m (\omega_j - \omega'_j) \varphi_j(x) \Big)^2 dx
= \sum_{j=1}^m (\omega_j - \omega'_j)^2 \|\varphi_j\|_2^2
= \rho(\omega, \omega')\, \|\varphi_1\|_2^2,
$$
where $\rho$ is the Hamming distance, $\rho(\omega, \omega') = \sum_{j=1}^m \mathbf{1}\{\omega_j \ne \omega'_j\}$ (the cross terms vanish because the $\varphi_j$ have disjoint supports, and all $\varphi_j$ have the same norm). Now
$$
\|\varphi_j\|_2^2 = L^2 h^2 \int_0^1 K\!\left( \frac{x - x_j}{h} \right)^2 dx = L^2 h^3 \|K\|_2^2,
$$
so
$$
d(f_\omega, f_{\omega'}) = L h^{3/2} \|K\|_2 \, \sqrt{\rho(\omega, \omega')}.
$$
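To make the identity above concrete, here is a small numerical sketch. It is an illustration only: the triangular bump $K(u) = (1/2 - |u|)_+$ (which is $1$-Lipschitz and supported on $[-1/2, 1/2]$) and the values of $L$ and $m$ are arbitrary choices not fixed by the notes, and integrals are approximated by Riemann sums. The script checks that $\|f_\omega - f_{\omega'}\|_2^2 = \rho(\omega, \omega')\, L^2 h^3 \|K\|_2^2$.

```python
import numpy as np

# Illustrative bump (an assumption; the notes leave K generic):
# 1-Lipschitz, nonnegative, supported on [-1/2, 1/2], with ||K||_2^2 = 1/12.
def K(u):
    return np.maximum(0.5 - np.abs(u), 0.0)

L, m = 1.0, 16                       # Lipschitz constant and number of bumps (toy values)
h = 1.0 / m
dx = 1e-5
x = np.arange(0.0, 1.0, dx)          # fine grid on [0, 1] for Riemann sums
centers = (np.arange(1, m + 1) - 0.5) * h

def f_omega(omega):
    # f_omega(x) = sum_j omega_j * L * h * K((x - x_j) / h)
    return sum(w * L * h * K((x - c) / h) for w, c in zip(omega, centers))

rng = np.random.default_rng(0)
omega1 = rng.integers(0, 2, size=m)
omega2 = rng.integers(0, 2, size=m)

# Squared L2 distance between the two bump sums.
d2 = np.sum((f_omega(omega1) - f_omega(omega2)) ** 2) * dx

# Predicted value: rho(omega1, omega2) * L^2 * h^3 * ||K||_2^2.
rho = int(np.sum(omega1 != omega2))
u = np.arange(-0.5, 0.5, dx)
K_norm_sq = np.sum(K(u) ** 2) * dx
print(d2, rho * L**2 * h**3 * K_norm_sq)   # agree up to discretization error
```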
Since $|\Omega| = 2^m$, the number of functions in our class is $2^m$. It turns out that we do not need to consider all functions $f_\omega$, but only a select few. Using all $2^m$ functions leads to a looser lower bound of the form $c/n$, which corresponds to the parametric rate. The problem under consideration is non-parametric, and hence we expect a slower rate of convergence. To get a tighter lower bound, the following result is of use:
Lemma (Varshamov-Gilbert '62)
Let $m \ge 8$. There exists a subset $\{\omega^{(0)}, \omega^{(1)}, \dots, \omega^{(M)}\}$ of $\Omega$ such that $\omega^{(0)} = (0, 0, \dots, 0)$,
$$
\rho\big(\omega^{(j)}, \omega^{(k)}\big) \;\ge\; \frac{m}{8} \quad \text{for all } 0 \le j < k \le M,
\qquad \text{and} \qquad
M \;\ge\; 2^{m/8}.
$$
What this lemma says is that there are many (at least $2^{m/8}$) sequences in $\Omega$ that are very different (i.e. at pairwise Hamming distance at least $m/8$). We are going to use the lemma to construct a useful set of hypotheses. Let $\{\omega^{(0)}, \dots, \omega^{(M)}\}$ be the class of sequences in the lemma and define
$$
f_j = f_{\omega^{(j)}}, \qquad j = 0, 1, \dots, M.
$$
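The lemma is purely existential, but it is easy to check empirically that such well-separated collections are abundant. The following sketch (an illustration, not the proof of the lemma) greedily keeps random binary vectors that are at Hamming distance at least $m/8$ from everything kept so far, starting from $\omega^{(0)} = (0, \dots, 0)$; for the toy values below it collects far more than $2^{m/8}$ vectors.

```python
import numpy as np

# Greedy illustration of the Varshamov-Gilbert lemma (a sanity check, not its proof).
rng = np.random.default_rng(0)
m = 64                               # length of the binary vectors (toy value)
threshold = m / 8                    # required pairwise Hamming separation
n_candidates = 1000

kept = [np.zeros(m, dtype=int)]      # start from omega^(0) = (0, ..., 0)
for _ in range(n_candidates):
    omega = rng.integers(0, 2, size=m)
    # keep omega only if it is at Hamming distance >= m/8 from every kept vector
    if all(np.sum(omega != w) >= threshold for w in kept):
        kept.append(omega)

M = len(kept) - 1                    # number of non-zero hypotheses found
print(M, 2 ** (m // 8))              # M far exceeds the guaranteed 2^(m/8) = 256
```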
We now need to look at the conditions of Theorem 2 and choose $h$ (equivalently $m = 1/h$) appropriately.
First note that for $j \ne k$,
$$
d(f_j, f_k) = L h^{3/2} \|K\|_2 \sqrt{\rho\big(\omega^{(j)}, \omega^{(k)}\big)} \;\ge\; L h^{3/2} \|K\|_2 \sqrt{\frac{m}{8}} \;=\; \frac{L h \|K\|_2}{2\sqrt{2}} \;=:\; 2s.
$$
Now let $P_j$ denote the distribution of $(Y_1, \dots, Y_n)$ when the true regression function is $f_j$. Then, since $f_0 \equiv 0$,
$$
\mathrm{KL}(P_j, P_0) = \frac{1}{2\sigma^2} \sum_{i=1}^n f_j(x_i)^2 \;\le\; \frac{n}{2\sigma^2} \|f_j\|_\infty^2 \;\le\; \frac{n L^2 h^2 K_{\max}^2}{2\sigma^2},
$$
where $K_{\max} = \max_u K(u)$. Now notice that
$$
\log M \;\ge\; \frac{m}{8} \log 2 \;=\; \frac{\log 2}{8h}
$$
(from the Varshamov-Gilbert lemma above).
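As a quick numerical check of the Kullback-Leibler bound above (assuming the fixed-design Gaussian regression model recalled earlier, and the same illustrative triangular bump and toy constants as before), the exact divergence $\mathrm{KL}(P_j, P_0) = \frac{1}{2\sigma^2}\sum_i f_j(x_i)^2$ can be compared to the bound $\frac{n L^2 h^2 K_{\max}^2}{2\sigma^2}$:

```python
import numpy as np

# KL(P_j, P_0) for two Gaussian product measures with equal variance sigma^2 and
# means f_j(x_i) and 0: sum_i f_j(x_i)^2 / (2*sigma^2). All constants are toy values.
def K(u):
    return np.maximum(0.5 - np.abs(u), 0.0)      # triangular bump, K_max = 1/2

L, sigma, n, m = 1.0, 1.0, 1000, 10
h = 1.0 / m
xi = np.arange(1, n + 1) / n                     # design points x_i = i/n
centers = (np.arange(1, m + 1) - 0.5) * h

omega = np.random.default_rng(1).integers(0, 2, size=m)
f_j = sum(w * L * h * K((xi - c) / h) for w, c in zip(omega, centers))

kl = np.sum(f_j ** 2) / (2 * sigma ** 2)         # exact KL divergence
bound = n * L**2 * h**2 * 0.5**2 / (2 * sigma ** 2)
print(kl, bound)                                 # kl stays below the bound
```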
We want to choose $h$ such that the Kullback-Leibler condition of Theorem 2 holds, i.e.
$$
\frac{n L^2 h^2 K_{\max}^2}{2\sigma^2} \;\le\; \alpha \log M,
\qquad \text{for which it suffices that} \qquad
\frac{n L^2 h^2 K_{\max}^2}{2\sigma^2} \;\le\; \alpha \, \frac{\log 2}{8h}.
$$
This gives
$$
h^3 \;\le\; \frac{\alpha \sigma^2 \log 2}{4\, n L^2 K_{\max}^2},
$$
so take $h \asymp n^{-1/3}$ (with a constant depending on $L$, $\sigma$, $\alpha$ and $K$). Now
$$
s \;\asymp\; L h \;\asymp\; n^{-1/3}.
$$
Therefore, by Theorem 2 (and since each $f_j$ belongs to $\mathcal{F}(L)$),
$$
\inf_{\hat f} \sup_{f \in \mathcal{F}(L)} P_f\Big( \|\hat f - f\|_2 \ge c_1 n^{-1/3} \Big) \;\ge\; c_2 \;>\; 0,
$$
or,
$$
\inf_{\hat f} \sup_{f \in \mathcal{F}(L)} P_f\Big( \|\hat f - f\|_2^2 \ge c_1^2 n^{-2/3} \Big) \;\ge\; c_2,
$$
or after Markov's inequality,
$$
\inf_{\hat f} \sup_{f \in \mathcal{F}(L)} \mathbb{E}_f \, \|\hat f - f\|_2^2 \;\ge\; c_1^2 c_2\, n^{-2/3}.
$$
Therefore, the estimator constructed in class attains the optimal rate of convergence.
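Finally, as a sanity check on the rate calculus, the following sketch solves the balancing condition $\frac{n L^2 h^2 K_{\max}^2}{2\sigma^2} = \alpha \frac{\log 2}{8h}$ for $h$ and confirms that the resulting separation satisfies $s^2 \propto n^{-2/3}$. All constants are placeholder values for the same illustrative bump as before; only the exponent in $n$ matters.

```python
import numpy as np

# Placeholder constants (illustration only; only the exponent of n matters).
L, sigma, alpha = 1.0, 1.0, 1.0 / 16             # alpha < 1/8 as required by Theorem 2
K_norm_sq, K_max = 1.0 / 12, 0.5                 # triangular bump from the earlier sketch

for n in [10**3, 10**4, 10**5, 10**6]:
    # Solve  n*L^2*h^2*K_max^2/(2*sigma^2) = alpha*log(2)/(8*h)  for h.
    h = (alpha * sigma**2 * np.log(2) / (4 * n * L**2 * K_max**2)) ** (1 / 3)
    s = L * h * np.sqrt(K_norm_sq) / (4 * np.sqrt(2))   # separation s from the display above
    # s^2 * n^(2/3) is the same constant for every n, i.e. s^2 is of order n^(-2/3).
    print(n, s**2 * n ** (2 / 3))
```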