The following Lemma relates total variation, affinity and KL divergence.
Lemma
Let $P_0$ and $P_1$ be probability measures with densities $p_0$ and $p_1$ with respect to a common dominating measure $\mu$. Then
$$1 - \mathrm{TV}(P_0, P_1) \;=\; \int \min(p_0, p_1)\, d\mu \;\ge\; \frac{1}{2}\left(\int \sqrt{p_0 p_1}\, d\mu\right)^2 \;\ge\; \frac{1}{2}\exp\!\big(-\mathrm{KL}(P_0 \,\|\, P_1)\big),$$
where $\int \sqrt{p_0 p_1}\, d\mu$ is the affinity between $P_0$ and $P_1$.
For the first inequality, write $\sqrt{p_0 p_1} = \sqrt{\min(p_0, p_1)}\sqrt{\max(p_0, p_1)}$ and apply the Cauchy-Schwarz inequality:
$$\left(\int \sqrt{p_0 p_1}\, d\mu\right)^2 \;\le\; \int \min(p_0, p_1)\, d\mu \int \max(p_0, p_1)\, d\mu \;\le\; 2 \int \min(p_0, p_1)\, d\mu,$$
since $\int \max(p_0, p_1)\, d\mu \le \int (p_0 + p_1)\, d\mu = 2$.
For the second inequality, by Jensen's inequality,
$$\left(\int \sqrt{p_0 p_1}\, d\mu\right)^2 = \exp\!\left(2 \log \int p_0 \sqrt{p_1/p_0}\, d\mu\right) \;\ge\; \exp\!\left(2 \int p_0 \log \sqrt{p_1/p_0}\, d\mu\right) = \exp\!\big(-\mathrm{KL}(P_0 \,\|\, P_1)\big).$$
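As a quick numerical sanity check of the Lemma (a minimal sketch; the Bernoulli pair and the constants below are illustrative choices, not from the notes):

    import numpy as np

    # Two Bernoulli distributions P0 = Ber(0.3), P1 = Ber(0.5),
    # represented by their probability vectors over {0, 1}.
    p0 = np.array([0.7, 0.3])
    p1 = np.array([0.5, 0.5])

    one_minus_tv = np.minimum(p0, p1).sum()               # 1 - TV(P0, P1)
    half_affinity_sq = 0.5 * np.sqrt(p0 * p1).sum() ** 2  # (1/2) * affinity^2
    kl = np.sum(p0 * np.log(p0 / p1))                     # KL(P0 || P1)
    half_exp_neg_kl = 0.5 * np.exp(-kl)

    # The Lemma states: one_minus_tv >= half_affinity_sq >= half_exp_neg_kl.
    print(one_minus_tv, half_affinity_sq, half_exp_neg_kl)
    assert one_minus_tv >= half_affinity_sq >= half_exp_neg_kl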
Putting everything together, we now have the following Theorem:
Theorem 1
Let $\mathcal{F}$ be a class of models, and suppose we have $n$ i.i.d. observations $Z_1, \dots, Z_n$ distributed according to $P_f$, $f \in \mathcal{F}$. Let $d(\hat f_n, f)$ be the performance measure of the estimator $\hat f_n$ relative to the true model $f$. Assume also $d$ is a semi-distance. Let $f_0, f_1 \in \mathcal{F}$ be s.t. $d(f_0, f_1) \ge 2s > 0$. Then
$$\inf_{\hat f_n} \max_{j \in \{0,1\}} P_{f_j}\big(d(\hat f_n, f_j) \ge s\big) \;\ge\; \frac{1}{4}\left(\int \sqrt{p_{f_0}\, p_{f_1}}\, d\mu\right)^{2n} \;\ge\; \frac{1}{4}\exp\!\big(-n\, \mathrm{KL}(P_{f_0} \,\|\, P_{f_1})\big),$$
where $p_{f_0}$ and $p_{f_1}$ are the densities of a single observation under $P_{f_0}$ and $P_{f_1}$.
How do we use this theorem?
Choose $f_0, f_1$ such that $n\, \mathrm{KL}(P_{f_0} \,\|\, P_{f_1}) \le \alpha < \infty$, then $\frac{1}{4}e^{-\alpha}$ is bounded away from
0 and we get a bound
$$\inf_{\hat f_n} \sup_{f \in \mathcal{F}} P_f\big(d(\hat f_n, f) \ge s\big) \;\ge\; \inf_{\hat f_n} \max_{j \in \{0,1\}} P_{f_j}\big(d(\hat f_n, f_j) \ge s\big) \;\ge\; \frac{1}{4}e^{-\alpha},$$
or, after Markov's inequality ($\mathbb{E}_f[d(\hat f_n, f)] \ge s\, P_f(d(\hat f_n, f) \ge s)$),
$$\inf_{\hat f_n} \sup_{f \in \mathcal{F}} \mathbb{E}_f\big[d(\hat f_n, f)\big] \;\ge\; \frac{s}{4}e^{-\alpha}.$$
To apply the theorem, we need to design $f_0, f_1 \in \mathcal{F}$ s.t. $d(f_0, f_1) \ge 2s$ with $s$ as large as possible, and $n\, \mathrm{KL}(P_{f_0} \,\|\, P_{f_1}) \le \alpha$ with $\alpha$ a fixed constant.
To reiterate, the design of $f_0, f_1$ requires careful construction so as to balance the tradeoff between
the first condition, which requires $f_0$ and $f_1$
to be far apart in the semi-distance $d$, and the second condition, which requires the corresponding distributions $P_{f_0}$ and $P_{f_1}$
to be close to each other.
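As a simple illustration of this tradeoff (a sketch in the Gaussian location model, which is not the example treated below; the constants are chosen only for concreteness), take $Z_1, \dots, Z_n$ i.i.d. $N(\theta, 1)$ with $d(\theta, \theta') = |\theta - \theta'|$, and set $\theta_0 = 0$, $\theta_1 = 1/\sqrt{n}$. Then
$$d(\theta_0, \theta_1) = \frac{1}{\sqrt{n}} = 2s, \qquad n\, \mathrm{KL}\big(N(\theta_0, 1) \,\|\, N(\theta_1, 1)\big) = \frac{n(\theta_1 - \theta_0)^2}{2} = \frac{1}{2} = \alpha,$$
so the bound above gives $\inf_{\hat\theta_n} \sup_{\theta} \mathbb{E}_\theta|\hat\theta_n - \theta| \ge \frac{1}{8\sqrt{n}} e^{-1/2}$. Pushing $\theta_0$ and $\theta_1$ further apart would increase $s$ but blow up the KL term, and pulling them closer would do the opposite.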
Let's use this theorem in a problem we are familiar with. Let $\mathcal{X} = [0,1]$ and
$\mathcal{Y} = \{0,1\}$,
with noiseless labels $Y = f_{t^*}(X)$, where $f_{t^*}$ belongs to the class of threshold classifiers
$\mathcal{F} = \{f_t : f_t(x) = \mathbb{1}\{x \ge t\},\ t \in [0,1]\}$.
Suppose we observe $n$ i.i.d. pairs $(X_1, Y_1), \dots, (X_n, Y_n)$ and let $\hat f_n$ be the empirical risk minimizer over $\mathcal{F}$. We proved that under these assumptions and an upper bound on the
density of
$X$, the Chernoff bounding technique yielded an expected error rate for ERM of order $\log n / n$.
Is this the best possible rate?
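As a quick numerical look at this question (a minimal sketch, assuming the threshold-classifier setup above with $X \sim \mathrm{Unif}[0,1]$ and noiseless labels; the true threshold $0.3$, the sample sizes, and the helper erm_threshold are all illustrative choices, not from the notes), one can simulate ERM and watch how the average excess risk scales with $n$:

    import numpy as np

    rng = np.random.default_rng(0)

    def erm_threshold(x, y):
        # ERM over threshold classifiers f_t(x) = 1{x >= t}: for a threshold placed
        # at the k-th order statistic, the empirical error counts the 1-labels
        # strictly to its left plus the 0-labels at or to its right.
        order = np.argsort(x)
        xs, ys = x[order], y[order]
        ones_left = np.concatenate(([0], np.cumsum(ys)))                      # k = 0, ..., n
        zeros_right = np.concatenate((np.cumsum((1 - ys)[::-1])[::-1], [0]))  # k = 0, ..., n
        k = int(np.argmin(ones_left + zeros_right))
        if k == 0:
            return 0.0            # classify everything as 1
        if k == len(xs):
            return 1.0 + 1e-12    # classify everything as 0
        return xs[k]

    t_star = 0.3                  # hypothetical true threshold, an assumption for this sketch
    for n in [100, 1000, 10000]:
        excess = []
        for _ in range(200):
            x = rng.uniform(0.0, 1.0, n)
            y = (x >= t_star).astype(int)                      # noiseless labels
            excess.append(abs(erm_threshold(x, y) - t_star))   # = P_X(f_hat != f*) for uniform X
        print(n, np.mean(excess), n * np.mean(excess))

The last printed column, $n$ times the average excess risk, should stay roughly constant, i.e. in this particular noiseless simulation the excess risk behaves like $1/n$.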
Construct two models in the above class (denote it by $\mathcal{P}$), $P_0$ and $P_1$. For both take $X \sim \mathrm{Unif}[0,1]$ and $Y = f_j^*(X)$, $j \in \{0,1\}$, so
$$f_0^*(x) = \mathbb{1}\{x \ge 1/2\}, \qquad f_1^*(x) = \mathbb{1}\{x \ge 1/2 + \delta\},$$
where $\delta \in (0, 1/2)$ is a separation parameter to be chosen later.
We are interested in controlling the excess risk $\mathbb{E}\big[R(\hat f_n)\big] - R^*$, where $R(f) = P(f(X) \ne Y)$.
Note that if the true underlying model is either $P_0$ or $P_1$, we have $R^* = 0$ and, for the corresponding $j \in \{0,1\}$,
$$R(\hat f_n) - R^* = P_X\big(\hat f_n(X) \ne f_j^*(X)\big) = \int_0^1 \mathbb{1}\{\hat f_n(x) \ne f_j^*(x)\}\, dx.$$
Proposition 1
$d(f, g) := P_X\big(f(X) \ne g(X)\big) = \int_0^1 \mathbb{1}\{f(x) \ne g(x)\}\, dx$ is a semi-distance.
It suffices to show that $d(f, g) \ge 0$, $d(f, g) = d(g, f)$,
and $d(f, g) \le d(f, h) + d(h, g)$.
The first two statements are obvious. The last one (triangle inequality) follows from the fact that
$$\{x : f(x) \ne g(x)\} \subseteq \{x : f(x) \ne h(x)\} \cup \{x : h(x) \ne g(x)\}.$$
Suppose this was not the case; then there exists $x$
s.t. $f(x) \ne g(x)$
and $x \notin \{f \ne h\} \cup \{h \ne g\}$. In other words, $f(x) = h(x)$ and $h(x) = g(x)$, so $f(x) = g(x)$, a contradiction.
Since $\{f \ne g\} \subseteq \{f \ne h\} \cup \{h \ne g\}$, we have:
$$d(f, g) = P_X\big(f(X) \ne g(X)\big) \le P_X\big(f(X) \ne h(X)\big) + P_X\big(h(X) \ne g(X)\big) = d(f, h) + d(h, g).$$
Let's look at the first reduction step:
$$\inf_{\hat f_n} \sup_{P \in \mathcal{P}} \mathbb{E}\big[R(\hat f_n) - R^*\big] \;\ge\; \inf_{\hat f_n} \max_{j \in \{0,1\}} \mathbb{E}_{P_j}\big[d(\hat f_n, f_j^*)\big].$$
So we can work out a bound on $\max_{j \in \{0,1\}} \mathbb{E}_{P_j}\big[d(\hat f_n, f_j^*)\big]$ and then translate it to excess risk. Let's apply
Theorem 1. Note that $d(f_0^*, f_1^*) = \delta$, and let $f_0 = f_0^*$ and $f_1 = f_1^*$, so that we may take $s = \delta/2$.
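Under the construction above (a sketch of the remaining computation, with the caveat that the specific models $P_0$, $P_1$ and hence the constants are the illustrative choices made here): the KL divergence $\mathrm{KL}(P_0 \,\|\, P_1)$ is infinite, since $P_0$ puts positive mass on pairs $(x, 1)$ with $x \in (1/2, 1/2 + \delta)$ that have probability zero under $P_1$, so the useful term in Theorem 1 is the affinity. The two joint densities agree outside the strip $x \in (1/2, 1/2 + \delta)$ and have disjoint supports on it, hence
$$\int \sqrt{p_0\, p_1}\, d\mu = 1 - \delta, \qquad \frac{1}{4}\left(\int \sqrt{p_0\, p_1}\, d\mu\right)^{2n} = \frac{1}{4}(1 - \delta)^{2n},$$
which remains bounded away from 0 for $\delta \asymp 1/n$, pointing to a lower bound on the excess risk of order $1/n$.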