Review: maximum likelihood estimation
In the last lecture, we had i.i.d. observations drawn from an unknown distribution,
$$X_i \overset{\text{i.i.d.}}{\sim} p_{\theta^*}, \quad i = 1, \dots, n, \qquad \theta^* \in \Theta.$$
With the loss function defined as
$$\ell(\theta, x) = -\log p_\theta(x),$$
the empirical risk is
$$\hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^n -\log p_\theta(X_i).$$
Essentially, we want to choose the distribution, from the collection of distributions indexed by the parameter space $\Theta$, that minimizes the empirical risk; i.e., we would like to select
$$p_{\hat{\theta}_n}, \quad \text{where} \quad \hat{\theta}_n = \arg\min_{\theta \in \Theta} \hat{R}_n(\theta).$$
The risk of $\theta$ is defined as
$$R(\theta) = E\big[-\log p_\theta(X)\big] = \int -\log\big(p_\theta(x)\big)\, p_{\theta^*}(x)\, dx.$$
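As a minimal numerical illustration of this procedure (an added sketch, not part of the lecture; the exponential model, sample size, and SciPy optimizer are illustrative choices), the estimate $\hat{\theta}_n$ for i.i.d. exponential observations can be computed by direct minimization of $\hat{R}_n(\theta)$ and compared with the closed-form MLE $1/\bar{X}_n$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta_star = 2.5                                  # hypothetical true rate (assumption)
X = rng.exponential(scale=1 / theta_star, size=1000)

def empirical_risk(theta):
    """Empirical risk: average of -log p_theta(X_i) for p_theta(x) = theta * exp(-theta * x)."""
    return np.mean(-np.log(theta) + theta * X)

theta_hat = minimize_scalar(empirical_risk, bounds=(1e-6, 100.0), method="bounded").x
print(theta_hat, 1 / X.mean())                    # numerical MLE vs. closed-form MLE 1 / sample mean
```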
Note that $\theta^*$ minimizes $R(\theta)$ over $\Theta$. Finally, the excess risk of $\theta$ is defined as
$$R(\theta) - R(\theta^*) = \int p_{\theta^*}(x)\, \log\frac{p_{\theta^*}(x)}{p_\theta(x)}\, dx.$$
We recognized that the excess risk corresponding to this loss function is simply the Kullback-Leibler (KL) divergence, or relative entropy, denoted by $K(p_{\theta^*}, p_\theta)$. It is easy to see that $K(p_{\theta^*}, p_\theta)$ is always non-negative and is zero if and only if $p_\theta = p_{\theta^*}$. The KL divergence measures how different two probability distributions are, and it is therefore a natural way to measure the convergence of maximum likelihood procedures. However, $K(p_{\theta^*}, p_\theta)$ is not a distance metric, because it is not symmetric and does not satisfy the triangle inequality. For this reason, two other quantities play a key role in maximum likelihood estimation, namely the Hellinger distance and the affinity.
The Hellinger distance $H(p_1, p_2)$ is defined as
$$H(p_1, p_2) = \left( \int \Big( \sqrt{p_1(x)} - \sqrt{p_2(x)} \Big)^2\, dx \right)^{1/2}.$$
We proved that the squared Hellinger distance lower bounds the KL divergence:
$$H^2(p_1, p_2) \le K(p_1, p_2).$$
The affinity $A(p_1, p_2)$ is defined as
$$A(p_1, p_2) = \int \sqrt{p_1(x)\, p_2(x)}\, dx.$$
We also proved that
$$H^2(p_1, p_2) = 2\big(1 - A(p_1, p_2)\big).$$
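These two facts are easy to check numerically; the sketch below (an addition, using two arbitrary discrete distributions) verifies $H^2(p_1, p_2) \le K(p_1, p_2)$ and $H^2(p_1, p_2) = 2(1 - A(p_1, p_2))$:

```python
import numpy as np

# Two arbitrary pmfs on a four-point alphabet (illustrative values, not from the notes).
p1 = np.array([0.1, 0.2, 0.3, 0.4])
p2 = np.array([0.25, 0.25, 0.25, 0.25])

K  = np.sum(p1 * np.log(p1 / p2))               # KL divergence K(p1, p2)
H2 = np.sum((np.sqrt(p1) - np.sqrt(p2)) ** 2)   # squared Hellinger distance
A  = np.sum(np.sqrt(p1 * p2))                   # affinity

print(H2 <= K)                       # squared Hellinger lower bounds the KL divergence
print(np.isclose(H2, 2 * (1 - A)))   # H^2 = 2(1 - A)
```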
Gaussian distribution
Suppose $Y$ is Gaussian with mean $\theta$ and variance $\sigma^2$, i.e., $p_\theta(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\theta)^2}{2\sigma^2}}$.
First, look at the affinity:
$$A(p_{\theta_1}, p_{\theta_2}) = \int \sqrt{p_{\theta_1}(y)\, p_{\theta_2}(y)}\, dy = \int \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\theta_1)^2 + (y-\theta_2)^2}{4\sigma^2}}\, dy = e^{-\frac{(\theta_1 - \theta_2)^2}{8\sigma^2}}.$$
Then,
$$H^2(p_{\theta_1}, p_{\theta_2}) = 2\Big(1 - e^{-\frac{(\theta_1 - \theta_2)^2}{8\sigma^2}}\Big), \qquad K(p_{\theta_1}, p_{\theta_2}) = \frac{(\theta_1 - \theta_2)^2}{2\sigma^2}.$$
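The closed-form expressions above can be verified by numerical integration; the following sketch (an addition, with arbitrary values of $\theta_1, \theta_2, \sigma$ and SciPy quadrature) compares the integrals against the formulas:

```python
import numpy as np
from scipy.integrate import quad

theta1, theta2, sigma = 0.3, 1.7, 0.8        # arbitrary illustrative values

def gauss(y, theta):
    """Density of N(theta, sigma^2) evaluated at y."""
    return np.exp(-(y - theta) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Densities are negligible outside [-20, 20] for these values, so finite limits are safe.
A_num, _ = quad(lambda y: np.sqrt(gauss(y, theta1) * gauss(y, theta2)), -20, 20)
K_num, _ = quad(lambda y: gauss(y, theta1) * np.log(gauss(y, theta1) / gauss(y, theta2)), -20, 20)

A_closed = np.exp(-(theta1 - theta2) ** 2 / (8 * sigma**2))
print(A_num, A_closed)                                   # affinity
print(2 * (1 - A_num), 2 * (1 - A_closed))               # squared Hellinger distance
print(K_num, (theta1 - theta2) ** 2 / (2 * sigma**2))    # KL divergence
```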
Maximum likelihood estimation and complexity regularization
Suppose that we have $n$ i.i.d. training samples,
$$\{X_i, Y_i\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} p_{XY}.$$
Using conditional probability, $p_{XY}(x, y)$ can be written as
$$p_{XY}(x, y) = p_{Y|X}(y|x)\, p_X(x).$$
Let's assume for the moment that $p_X$ is completely unknown, but $p_{Y|X}$ has a special form:
$$p_{Y|X}(y|x) = p_{f^*(x)}(y),$$
where $p_{f^*(x)}$ is a known parametric density function with parameter $f^*(x)$.
Signal-plus-noise observation model
$$Y_i = f^*(X_i) + W_i, \quad i = 1, \dots, n,$$
where $W_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$ and
$$p_{f^*(x)}(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - f^*(x))^2}{2\sigma^2}}.$$
Poisson
$$Y_i \sim \text{Poisson}\big(f^*(X_i)\big), \quad \text{i.e.,} \quad p_{f^*(x)}(y) = \frac{e^{-f^*(x)}\, \big(f^*(x)\big)^y}{y!}, \quad y = 0, 1, 2, \dots$$
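To make the two observation models concrete, here is a minimal simulation sketch (an addition; the choices of $f^*$, $p_X$, $\sigma$, and $n$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 500, 0.5
X = rng.uniform(size=n)                          # p_X is unknown in the notes; uniform is an arbitrary choice
f_star = lambda x: 2.0 + np.sin(2 * np.pi * x)   # hypothetical true f*(x) (assumption), kept positive

Y_gauss   = f_star(X) + sigma * rng.normal(size=n)   # signal-plus-noise: Y_i = f*(X_i) + W_i, W_i ~ N(0, sigma^2)
Y_poisson = rng.poisson(f_star(X))                   # Poisson: Y_i ~ Poisson(f*(X_i))
```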
The likelihood loss function is
$$\ell\big(f, (X, Y)\big) = -\log p_{f(X)}(Y).$$
The expected loss is
$$E\big[-\log p_{f(X)}(Y)\big] = \int\!\!\int -\log\big(p_{f(x)}(y)\big)\, p_{f^*(x)}(y)\, dy\; p_X(x)\, dx$$
$$= \int\!\left[\int -\log\big(p_{f^*(x)}(y)\big)\, p_{f^*(x)}(y)\, dy\right] p_X(x)\, dx \;+\; \int\!\left[\int \log\frac{p_{f^*(x)}(y)}{p_{f(x)}(y)}\, p_{f^*(x)}(y)\, dy\right] p_X(x)\, dx.$$
Notice that the first term is a constant with respect to $f$. Hence, we define our risk to be
$$R(f) = \int\!\left[\int \log\frac{p_{f^*(x)}(y)}{p_{f(x)}(y)}\, p_{f^*(x)}(y)\, dy\right] p_X(x)\, dx = \int K\big(p_{f^*(x)}, p_{f(x)}\big)\, p_X(x)\, dx.$$
The function $f^*$ minimizes this risk, since for each $x$ the parameter value $f^*(x)$ minimizes the integrand $K\big(p_{f^*(x)}, p_{f(x)}\big) \ge 0$.
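As a quick sanity check (an added remark, not spelled out above), for the signal-plus-noise example the Gaussian KL formula from the review gives $K\big(p_{f^*(x)}, p_{f(x)}\big) = \frac{(f^*(x) - f(x))^2}{2\sigma^2}$, so the risk is proportional to the mean-squared error:
$$R(f) = \int \frac{\big(f^*(x) - f(x)\big)^2}{2\sigma^2}\, p_X(x)\, dx = \frac{1}{2\sigma^2}\, E\Big[\big(f^*(X) - f(X)\big)^2\Big].$$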
Our empirical risk is the negative log-likelihood of the training samples:
$$\hat{R}_n(f) = \sum_{i=1}^n -\frac{1}{n}\log p_{f(X_i)}(Y_i).$$
The value $\frac{1}{n}$ is the empirical probability of observing $X = X_i$.
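As an illustration of minimizing this empirical risk (an added sketch; the linear parameterization of $f$, the Poisson model, and the Nelder-Mead optimizer are assumptions, not the notes' method):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 200
X = rng.uniform(size=n)
f_star = lambda x: 2.0 + 3.0 * x            # hypothetical true function (assumption)
Y = rng.poisson(f_star(X))                  # Poisson observation model

def empirical_risk(params):
    """Average negative log-likelihood of a linear f, up to the log(Y_i!) term (constant in f)."""
    a, b = params
    fx = a + b * X
    if np.any(fx <= 0):                     # Poisson rate must be positive
        return np.inf
    return np.mean(fx - Y * np.log(fx))

fit = minimize(empirical_risk, x0=[1.0, 1.0], method="Nelder-Mead")
print(fit.x)                                # estimated (a, b), close to (2, 3)
```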
Often in function estimation, we have control over where we sample $X$. Let's assume that $x \in [0, 1]^d$ and $f : [0, 1]^d \to \mathbb{R}$. Suppose we sample $X$ uniformly, with $n = m^d$ samples for some positive integer $m$ (i.e., take $m$ evenly spaced samples in each coordinate). Let $x_i$, $i = 1, \dots, n$, denote these sample points, and assume that $Y_i \sim p_{f^*(x_i)}$. Then, our empirical risk is
$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n -\log p_{f(x_i)}(Y_i).$$
Note that $x_i$ is now a deterministic quantity.
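For concreteness, such a deterministic design can be constructed as follows (an added sketch; whether the $m$ points per coordinate sit at grid midpoints or endpoints is not specified above, so midpoints are assumed):

```python
import numpy as np
from itertools import product

def grid_points(m, d):
    """n = m**d evenly spaced sample points in [0, 1]^d (midpoints of a uniform grid)."""
    ticks = (np.arange(m) + 0.5) / m                    # m evenly spaced points per coordinate
    return np.array(list(product(ticks, repeat=d)))     # shape (m**d, d)

x = grid_points(m=4, d=2)   # 16 deterministic sample points in [0, 1]^2
```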
Our risk is
$$R(f) = \frac{1}{n} \sum_{i=1}^n \int \log\frac{p_{f^*(x_i)}(y)}{p_{f(x_i)}(y)}\, p_{f^*(x_i)}(y)\, dy = \frac{1}{n} \sum_{i=1}^n K\big(p_{f^*(x_i)}, p_{f(x_i)}\big).$$
The risk is minimized by $f^*$. However, $f^*$ is not a unique minimizer: any $f$ that agrees with $f^*$ at the points $x_1, \dots, x_n$ also minimizes this risk.
Now, we will make use of the following vector and shorthand notation. The uppercase $Y$ denotes a random variable, while the lowercase $x$ and $y$ denote deterministic quantities. Then,
$$Y^n = [Y_1, \dots, Y_n]^T \quad \text{(random)},$$
$$x^n = [x_1, \dots, x_n]^T, \qquad y^n = [y_1, \dots, y_n]^T \quad \text{(deterministic)},$$
and we write the joint density of $Y^n$ under a candidate $f$ as
$$p_f(y^n) = \prod_{i=1}^n p_{f(x_i)}(y_i).$$
With this notation, the empirical risk and the true risk can be written as
$$\hat{R}_n(f) = -\frac{1}{n} \log p_f(Y^n), \qquad R(f) = \frac{1}{n} K\big(p_{f^*}, p_f\big).$$
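A minimal sketch of these vectorized quantities under the Gaussian signal-plus-noise model (added for illustration; the grid, $\sigma$, $f^*$, and candidate $f$ below are all assumptions):

```python
import numpy as np

sigma = 1.0
rng = np.random.default_rng(2)
x = (np.arange(16) + 0.5) / 16                      # n = 16 evenly spaced points in [0, 1]
f_star = lambda t: np.sin(2 * np.pi * t)            # assumed true function
Y = f_star(x) + sigma * rng.normal(size=x.size)     # Y_i ~ p_{f*(x_i)}

def log_p_f(f, Y):
    """log p_f(Y^n) = sum_i log p_{f(x_i)}(Y_i) for the Gaussian model."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (Y - f(x)) ** 2 / (2 * sigma**2))

f_candidate = lambda t: np.zeros_like(t)            # an arbitrary candidate f
R_hat = -log_p_f(f_candidate, Y) / x.size           # empirical risk = -(1/n) log p_f(Y^n)
print(R_hat)
```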