In other modules, estimators/predictors are analyzed in order to obtain upper bounds on their performance. These bounds are of the form
\[
\sup_{f \in \mathcal{F}} \mathbb{E}_f\big[ d(\hat{f}_n, f) \big] \le C \psi_n ,
\]
where $C > 0$ and $\psi_n \to 0$ as $n \to \infty$ (all notation is made precise below). We would like to know if these bounds are tight, in the sense that there is no other estimator that is significantly better. To answer this, we need lower bounds of the form
\[
\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} \mathbb{E}_f\big[ d(\hat{f}_n, f) \big] \ge c \psi_n , \qquad c > 0 .
\]
We assume we have the following ingredients (a concrete instantiation is sketched after this list):

A class of models, $\mathcal{F}$. $\mathcal{F}$ is a class of models containing the “true” model and is a subset of some bigger class $\mathcal{S}$. E.g. $\mathcal{F}$ could be the class of Lipschitz density functions or distributions satisfying the box-counting condition.
An observation model, $\mathbb{P}_f$, indexed by $f \in \mathcal{F}$. $\mathbb{P}_f$ denotes the distribution of the data under model $f$. E.g. in regression and classification, this is the distribution of the sample $(X_1, Y_1), \ldots, (X_n, Y_n)$. We will assume that $\mathbb{P}_f$ is a probability measure on the measurable space $(\Omega, \mathcal{B})$.
A performance metric $d(\cdot, \cdot) \ge 0$. If you have a model estimate $\hat{f}$, then the performance of that model estimate relative to the true model $f$ is $d(\hat{f}, f)$. E.g. $d$ could be an $L_2$ distance between densities, or an excess-risk criterion.
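To make these ingredients concrete, here is one possible instantiation for density estimation; the specific choices below (the interval $[0,1]$, the Lipschitz constant $L$, the $L_2$ metric) are illustrative assumptions, not necessarily the ones used elsewhere in these notes:
\[
\mathcal{F} = \Big\{ f : [0,1] \to [0,\infty) \;:\; \int_0^1 f(x)\,dx = 1, \ |f(x) - f(y)| \le L |x - y| \ \ \forall x, y \Big\},
\]
\[
\mathbb{P}_f = \text{distribution of } X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f \ \text{ on } \Omega = [0,1]^n ,
\qquad
d(\hat{f}, f) = \Big( \int_0^1 \big( \hat{f}(x) - f(x) \big)^2 \, dx \Big)^{1/2} .
\]
Here $L > 0$ is a fixed Lipschitz constant, and an estimator $\hat{f}_n$ is any measurable map from the sample $(X_1, \ldots, X_n)$ to a candidate model.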
As before, we are interested in the risk of a learning rule, in particular the maximal risk, given by
\[
\sup_{f \in \mathcal{F}} \mathbb{E}_f\big[ d(\hat{f}_n, f) \big] ,
\]
where $\hat{f}_n$ is a function of the observations only and $\mathbb{E}_f$ denotes the expectation with respect to $\mathbb{P}_f$.
The main goal is to get results of the form
\[
\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} \mathbb{E}_f\big[ d(\hat{f}_n, f) \big] \ge c \psi_n ,
\]
where $c > 0$ and $\psi_n \to 0$ as $n \to \infty$. The $\inf$ is taken over all estimators, i.e. all measurable functions of the data taking values in $\mathcal{S}$.
Suppose we have shown that
\[
\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} \mathbb{E}_f\big[ d(\hat{f}_n, f) \big] \ge c \psi_n ,
\]
and also that, for a particular estimator $\tilde{f}_n$,
\[
\sup_{f \in \mathcal{F}} \mathbb{E}_f\big[ d(\tilde{f}_n, f) \big] \le C \psi_n .
\]
We say that $\psi_n$ is the optimal rate of convergence for this problem and that $\tilde{f}_n$ attains that rate.
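As an illustration (stated here without proof, and using the density-estimation setting sketched above rather than a result from this module): for the class of Lipschitz densities on $[0,1]$ with $d$ the $L_2$ distance, the optimal rate of convergence is
\[
\psi_n = n^{-1/3} ,
\]
attained, for instance, by suitably tuned histogram or kernel density estimators.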
Two rates of convergence $\psi_n$ and $\psi_n'$ are equivalent, i.e. $\psi_n \asymp \psi_n'$, iff
\[
0 < \liminf_{n \to \infty} \frac{\psi_n}{\psi_n'} \le \limsup_{n \to \infty} \frac{\psi_n}{\psi_n'} < \infty .
\]
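For example (a simple check, with rates chosen here purely for illustration): $n^{-1/3}$ and $(2n)^{-1/3}$ are equivalent, since their ratio is the constant $2^{1/3}$, whereas $n^{-1/3}$ and $n^{-1/2}$ are not, since
\[
\frac{n^{-1/3}}{n^{-1/2}} = n^{1/6} \to \infty \quad \text{as } n \to \infty .
\]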
General reduction scheme
Instead of directly bounding the expected performance, we are going to prove stronger probability bounds of the form
\[
\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} \mathbb{P}_f\big( d(\hat{f}_n, f) \ge \psi_n \big) \ge c > 0 .
\]
These bounds can be readily converted to expected performance bounds using Markov's inequality:
\[
\mathbb{E}_f\big[ d(\hat{f}_n, f) \big] \ge \psi_n \, \mathbb{P}_f\big( d(\hat{f}_n, f) \ge \psi_n \big) .
\]
Therefore it follows that
\[
\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} \mathbb{E}_f\big[ d(\hat{f}_n, f) \big] \ge c \psi_n .
\]
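To spell this step out (a routine verification, included only for completeness): for any fixed estimator $\hat{f}_n$,
\[
\sup_{f \in \mathcal{F}} \mathbb{E}_f\big[ d(\hat{f}_n, f) \big]
\;\ge\; \psi_n \sup_{f \in \mathcal{F}} \mathbb{P}_f\big( d(\hat{f}_n, f) \ge \psi_n \big)
\;\ge\; c \psi_n ,
\]
where the last inequality uses the probability bound above; taking the infimum over all estimators preserves the bound.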
First reduction step
Reduce the original problem to an easier one by replacing the larger class $\mathcal{F}$ with a smaller finite class $\{ f_0, f_1, \ldots, f_M \} \subseteq \mathcal{F}$. Observe that
\[
\inf_{\hat{f}_n} \sup_{f \in \mathcal{F}} \mathbb{P}_f\big( d(\hat{f}_n, f) \ge \psi_n \big)
\;\ge\;
\inf_{\hat{f}_n} \max_{j \in \{0, \ldots, M\}} \mathbb{P}_{f_j}\big( d(\hat{f}_n, f_j) \ge \psi_n \big) ,
\]
since the supremum over $\mathcal{F}$ is at least the maximum over any finite subset of $\mathcal{F}$. The key idea is to choose the finite collection of models such that the resulting problem is as hard as the original; otherwise the lower bound will not be tight.
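One standard way to make the reduced problem hard (an illustrative remark; the precise conditions are imposed when the reduction to testing is carried out) is to pick models that are well separated under $d$, e.g.
\[
d(f_j, f_k) \ge 2 \psi_n \quad \text{for all } j \ne k ,
\]
while keeping the distributions $\mathbb{P}_{f_0}, \ldots, \mathbb{P}_{f_M}$ statistically close, so that no test can reliably identify from the data which model generated it; the tension between these two requirements governs how large $\psi_n$ can be taken.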
Second reduction step
Next, we reduce the problem to a hypothesis test. Ideally, we would like to have something like
\[
\inf_{\hat{f}_n} \max_{j \in \{0, \ldots, M\}} \mathbb{P}_{f_j}\big( d(\hat{f}_n, f_j) \ge \psi_n \big)
\;\ge\;
\inf_{\psi} \max_{j \in \{0, \ldots, M\}} \mathbb{P}_{f_j}\big( \psi \ne j \big) .
\]
The $\inf$ on the right-hand side is over all measurable test functions $\psi : \Omega \to \{0, 1, \ldots, M\}$, and $\mathbb{P}_{f_j}(\psi \ne j)$ denotes the probability that, after observing the data, the test infers the wrong hypothesis. This might not always be true or easy to show, but in certain scenarios it can be done.

Suppose $d$ is a semi-distance, i.e. it satisfies