In the last lecture we derived a risk (MSE) bound for regression problems; i.e., select an $\hat{f}_n \in \mathcal{F}$ so that $E\left[\|\hat{f}_n - f^*\|^2\right]$ is small, where $f^*(x) = E[Y \mid X = x]$ is the regression function. The result is summarized below.
Theorem
Complexity regularization with squared error loss
Let $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = [-b/2, b/2]$, $\{X_i, Y_i\}_{i=1}^n$ iid, $P_{XY}$ unknown, $\mathcal{F}$ = {collection of candidate functions}, with $f: \mathbb{R}^d \to \mathcal{Y}$ for all $f \in \mathcal{F}$. Let $c(f)$, $f \in \mathcal{F}$, be positive numbers satisfying $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$, and select a function from $\mathcal{F}$ according to
$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + b^2 \sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}} \right\},$$
with
$$\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^n \left(f(X_i) - Y_i\right)^2$$
and $\|f - f^*\|^2 = E\left[\left(f(X) - f^*(X)\right)^2\right]$. Then,
$$E\left[\|\hat{f}_n - f^*\|^2\right] \;\le\; \min_{f \in \mathcal{F}} \left\{ \|f - f^*\|^2 + b^2\sqrt{\frac{c(f)\log 2 + \frac{1}{2}\log n}{2n}} \right\} + O\!\left(\frac{b^2}{\sqrt{n}}\right),$$
where $f^*(x) = E[Y \mid X = x]$.
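For example, if $\mathcal{F}$ is a finite class, one admissible choice (an illustrative special case, not the only one) is $c(f) = \log_2 |\mathcal{F}|$ for every $f \in \mathcal{F}$, since then $\sum_{f \in \mathcal{F}} 2^{-c(f)} = |\mathcal{F}|\, 2^{-\log_2 |\mathcal{F}|} = 1$. In that case the penalty is the same for every candidate and the estimator reduces to empirical risk minimization over $\mathcal{F}$. More generally, $c(f)$ can be interpreted as a codelength (in bits) assigned to $f$, so the penalty favors functions with short descriptions.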
Maximum likelihood estimation
The focus of this lecture is to consider another approach to learning based on maximum likelihood estimation. Consider the classical signal-plus-noise model
$$Y_i = f^*(x_i) + W_i, \quad i = 1, \dots, n,$$
where the $W_i$ are iid zero-mean noises. Furthermore, assume that $W_i \sim p_W(w)$ for some known density $p_W$. Then
$$Y_i \sim p_{f^*(x_i)}(y_i) \equiv p_W\!\left(y_i - f^*(x_i)\right),$$
since $W_i = Y_i - f^*(x_i)$.
A very common and useful loss function to consider is the negative log-likelihood
$$\ell(f(x_i), y_i) = -\log p_{f(x_i)}(y_i).$$
Minimizing the empirical risk $\sum_{i=1}^n -\log p_{f(x_i)}(Y_i)$ with respect to $f$ is equivalent to maximizing the log-likelihood $\sum_{i=1}^n \log p_{f(x_i)}(Y_i)$, or equivalently the likelihood
$$\prod_{i=1}^n p_{f(x_i)}(Y_i).$$
Thus, using the negative log-likelihood as a loss function leads to maximum likelihood estimation. If the $W_i$ are iid zero-mean Gaussian r.v.s, then this is just the squared error loss we considered last time. If the $W_i$ are Laplacian distributed, e.g. $p_W(w) \propto e^{-|w|}$, then we obtain the absolute error, or $\ell^1$, loss function. We can also handle non-additive models such as the Poisson model
$$Y_i \sim \text{Poisson}(f^*(x_i)).$$
In this case
$$-\log p_{f(x_i)}(y_i) = f(x_i) - y_i \log f(x_i) + \log(y_i!),$$
which is a very different loss function, but quite appropriate for many imaging problems.
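To make the Gaussian and Laplacian claims concrete (assuming, for illustration, $W_i \sim \mathcal{N}(0, \sigma^2)$ with known variance, and $p_W(w) = \frac{1}{2}e^{-|w|}$, respectively): in the Gaussian case
$$-\log p_{f(x_i)}(y_i) = \frac{(y_i - f(x_i))^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2),$$
so, up to additive and multiplicative constants that do not affect the minimizer, minimizing the negative log-likelihood is exactly minimizing the squared error; in the Laplacian case
$$-\log p_{f(x_i)}(y_i) = |y_i - f(x_i)| + \log 2,$$
i.e., the absolute error loss up to a constant.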
Before we investigate maximum likelihood estimation for model selection, let's review some of the basic concepts. Let $\Theta$ denote a parameter space (e.g., $\Theta = \mathbb{R}$), and assume we have observations
$$X_i \overset{iid}{\sim} p_{\theta^*}(x), \quad i = 1, \dots, n,$$
where $\theta^* \in \Theta$ is a parameter determining the density of the $\{X_i\}$. The ML estimator of $\theta^*$ is
$$\hat{\theta}_n = \arg\max_{\theta \in \Theta} \prod_{i=1}^n p_\theta(X_i) = \arg\max_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i).$$
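As a simple illustration (assuming, say, $X_i \sim \mathcal{N}(\theta^*, \sigma^2)$ with $\sigma^2$ known), the log-likelihood is
$$\sum_{i=1}^n \log p_\theta(X_i) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \theta)^2 - \frac{n}{2}\log(2\pi\sigma^2),$$
so maximizing it over $\theta$ is the same as minimizing $\sum_{i=1}^n (X_i - \theta)^2$, and the ML estimator is the sample mean $\hat{\theta}_n = \frac{1}{n}\sum_{i=1}^n X_i$.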
Note that $\theta^*$ maximizes the expected log-likelihood $E[\log p_\theta(X)]$. To see this, let's compare the expected log-likelihood of $\theta^*$ with that of any other $\theta \in \Theta$:
$$E[\log p_{\theta^*}(X)] - E[\log p_\theta(X)] = E\left[\log \frac{p_{\theta^*}(X)}{p_\theta(X)}\right] = \int \log\frac{p_{\theta^*}(x)}{p_\theta(x)}\, p_{\theta^*}(x)\, dx \;\equiv\; K(p_{\theta^*}, p_\theta) \;\ge\; 0,$$
where $K(p_{\theta^*}, p_\theta)$ is the Kullback-Leibler (KL) divergence between $p_{\theta^*}$ and $p_\theta$.
Why is $K(p_{\theta^*}, p_\theta) \ge 0$?
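One standard way to see this (a sketch of the usual Jensen's-inequality argument): since $\log$ is concave,
$$-K(p_{\theta^*}, p_\theta) = \int p_{\theta^*}(x) \log\frac{p_\theta(x)}{p_{\theta^*}(x)}\, dx \;\le\; \log \int p_{\theta^*}(x)\, \frac{p_\theta(x)}{p_{\theta^*}(x)}\, dx = \log \int p_\theta(x)\, dx = \log 1 = 0,$$
so $K(p_{\theta^*}, p_\theta) \ge 0$, with equality if and only if $p_\theta = p_{\theta^*}$ almost everywhere.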
On the other hand, since $\hat{\theta}_n$ maximizes the likelihood over $\Theta$, we have
$$\frac{1}{n}\sum_{i=1}^n \log p_{\theta^*}(X_i) \;\le\; \frac{1}{n}\sum_{i=1}^n \log p_{\hat{\theta}_n}(X_i).$$
Therefore,
$$\frac{1}{n}\sum_{i=1}^n \log p_{\theta^*}(X_i) - \frac{1}{n}\sum_{i=1}^n \log p_{\hat{\theta}_n}(X_i) \;\le\; 0,$$
or, re-arranging,
$$\frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(X_i)}{p_{\hat{\theta}_n}(X_i)} \;\le\; 0.$$
Notice that the quantity
$$\frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(X_i)}{p_{\theta}(X_i)}$$
is an empirical average whose mean is $K(p_{\theta^*}, p_\theta)$. By the law of large numbers, for each $\theta \in \Theta$,
$$\frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(X_i)}{p_{\theta}(X_i)} \;\to\; K(p_{\theta^*}, p_\theta) \quad \text{as } n \to \infty.$$
If this also holds for the sequence $\{\hat{\theta}_n\}$ (i.e., with $\theta$ replaced by $\hat{\theta}_n$), then, since the empirical average above is non-positive while the KL divergence is non-negative, we have
$$K(p_{\theta^*}, p_{\hat{\theta}_n}) \;\to\; 0,$$
which implies that
$$p_{\hat{\theta}_n} \;\to\; p_{\theta^*},$$
which often implies that $\hat{\theta}_n \to \theta^*$ in some appropriate sense (e.g., point-wise or in norm).
Gaussian distributions
For example, if $p_\theta = \mathcal{N}(\theta, \sigma^2)$ (Gaussian with mean $\theta$ and known variance $\sigma^2$), then
$$K(p_{\theta_1}, p_{\theta_2}) = \frac{(\theta_1 - \theta_2)^2}{2\sigma^2},$$
so convergence of the KL divergence to zero is equivalent to convergence of $\hat{\theta}_n$ to $\theta^*$.
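A minimal simulation sketch of this consistency argument for the Gaussian-mean example, assuming $X_i \sim \mathcal{N}(\theta^*, \sigma^2)$ with $\sigma$ known (the particular values of theta_star, sigma, and the sample sizes below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star, sigma = 2.0, 1.0  # illustrative values, not from the notes

def kl_gaussian(theta1, theta2, sigma):
    """K(N(theta1, sigma^2), N(theta2, sigma^2)) = (theta1 - theta2)^2 / (2 sigma^2)."""
    return (theta1 - theta2) ** 2 / (2 * sigma ** 2)

for n in [10, 100, 1000, 10000]:
    X = rng.normal(theta_star, sigma, size=n)
    theta_hat = X.mean()  # ML estimator of the mean (see the worked example above)
    # Empirical average of log(p_{theta*}(X_i) / p_{theta_hat}(X_i)); its mean is the KL divergence.
    emp = np.mean((X - theta_hat) ** 2 - (X - theta_star) ** 2) / (2 * sigma ** 2)
    print(n, theta_hat, emp, kl_gaussian(theta_star, theta_hat, sigma))
```

Because $\hat{\theta}_n$ maximizes the likelihood, the empirical average printed above is always non-positive, while the KL divergence is non-negative; both shrink toward zero as $n$ grows.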
Hellinger distance
The KL divergence is not a distance function (it is not symmetric and does not satisfy the triangle inequality). Therefore, it is often more convenient to work with the Hellinger metric,
$$H(p_1, p_2) = \left( \int \left( \sqrt{p_1(x)} - \sqrt{p_2(x)} \right)^2 dx \right)^{1/2}.$$
The Hellinger metric is symmetric, non-negative, and satisfies the triangle inequality, and therefore it is a distance measure. Furthermore, the squared Hellinger distance lower bounds the KL divergence, so convergence in KL divergence implies convergence in Hellinger distance.
Proposition 1
$$H^2(p_1, p_2) \;\le\; K(p_1, p_2).$$
Proof:
$$
\begin{aligned}
K(p_1, p_2) &= \int p_1(x) \log \frac{p_1(x)}{p_2(x)}\, dx \;=\; -2 \int p_1(x) \log \sqrt{\frac{p_2(x)}{p_1(x)}}\, dx \\
&\ge\; -2 \log \int p_1(x) \sqrt{\frac{p_2(x)}{p_1(x)}}\, dx \qquad \text{(Jensen's inequality)} \\
&=\; -2 \log \int \sqrt{p_1(x) p_2(x)}\, dx \\
&\ge\; 2\left(1 - \int \sqrt{p_1(x) p_2(x)}\, dx\right) \qquad \text{(using } 1 - x \le -\log x\text{)} \\
&=\; \int p_1(x)\, dx + \int p_2(x)\, dx - 2\int \sqrt{p_1(x) p_2(x)}\, dx \\
&=\; \int \left(\sqrt{p_1(x)} - \sqrt{p_2(x)}\right)^2 dx \;=\; H^2(p_1, p_2). \qquad \square
\end{aligned}
$$
Note that in the proof we also showed that
$$K(p_1, p_2) \;\ge\; -2 \log \int \sqrt{p_1(x) p_2(x)}\, dx,$$
and using the fact $1 - x \le -\log x$ again, we have
$$H^2(p_1, p_2) \;\le\; -2 \log \int \sqrt{p_1(x) p_2(x)}\, dx \;\le\; K(p_1, p_2).$$
The quantity inside the log is called the affinity between $p_1$ and $p_2$:
$$A(p_1, p_2) = \int \sqrt{p_1(x)\, p_2(x)}\, dx.$$
This is another measure of closeness between $p_1$ and $p_2$.
Gaussian distributions
If $p_i = \mathcal{N}(\theta_i, \sigma^2)$, $i = 1, 2$, then
$$A(p_1, p_2) = e^{-\frac{(\theta_1 - \theta_2)^2}{8\sigma^2}}, \qquad -2\log A(p_1, p_2) = \frac{(\theta_1 - \theta_2)^2}{4\sigma^2},$$
and hence $H^2(p_1, p_2) = 2\left(1 - e^{-\frac{(\theta_1 - \theta_2)^2}{8\sigma^2}}\right)$.
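The calculation behind these formulas is a short completion-of-squares argument (sketched here under the same equal-variance assumption):
$$\sqrt{p_1(x)\, p_2(x)} = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x - \theta_1)^2 + (x - \theta_2)^2}{4\sigma^2}\right) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{\left(x - \frac{\theta_1 + \theta_2}{2}\right)^2}{2\sigma^2}\right) \exp\!\left(-\frac{(\theta_1 - \theta_2)^2}{8\sigma^2}\right),$$
and the Gaussian factor integrates to one, leaving $A(p_1, p_2) = e^{-(\theta_1 - \theta_2)^2 / (8\sigma^2)}$.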
Poisson distributions
If $p_i = \text{Poisson}(\lambda_i)$, $i = 1, 2$, then
$$-2 \log A(p_1, p_2) = \left(\sqrt{\lambda_1} - \sqrt{\lambda_2}\right)^2.$$
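The corresponding calculation for the Poisson case (again a sketch): summing the square root of the product of the two pmfs,
$$A(p_1, p_2) = \sum_{k=0}^{\infty} \sqrt{\frac{e^{-\lambda_1}\lambda_1^k}{k!}\cdot\frac{e^{-\lambda_2}\lambda_2^k}{k!}} = e^{-\frac{\lambda_1 + \lambda_2}{2}} \sum_{k=0}^{\infty} \frac{\left(\sqrt{\lambda_1 \lambda_2}\right)^k}{k!} = e^{-\frac{\lambda_1 + \lambda_2}{2}}\, e^{\sqrt{\lambda_1 \lambda_2}} = e^{-\frac{1}{2}\left(\sqrt{\lambda_1} - \sqrt{\lambda_2}\right)^2},$$
so $-2\log A(p_1, p_2) = \left(\sqrt{\lambda_1} - \sqrt{\lambda_2}\right)^2$.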
Summary
- The maximum likelihood estimator $\hat{\theta}_n$ maximizes the empirical average $\frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i)$ (our empirical risk is the negative log-likelihood).
- $\theta^*$ maximizes the expectation $E[\log p_\theta(X)]$ (the risk is the expected negative log-likelihood).
- $\frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i) \to E[\log p_\theta(X)]$ as $n \to \infty$, so we expect some sort of concentration of measure.
- In particular, since $K(p_{\theta^*}, p_\theta) = E\left[\log \frac{p_{\theta^*}(X)}{p_\theta(X)}\right]$, we might expect that $K(p_{\theta^*}, p_{\hat{\theta}_n}) \to 0$ for the sequence of estimates $\{\hat{\theta}_n\}$.
So, the point is that the maximum likelihood estimator is just a special case of learning with a particular loss function, the negative log-likelihood. Due to its special structure, we are naturally led to consider KL divergences, Hellinger distances, and affinities.