The last step of this derivation used Jensen's inequality. Specifically, $f(x) = \log x$ is a concave function, since $f''(x) = -1/x^2 < 0$ over its domain $x \in \mathbb{R}^+$. Also, the term
$$\sum_{z^{(i)}} Q_i(z^{(i)}) \left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right]$$
in the summation is just an expectation of the quantity $\left[ p(x^{(i)}, z^{(i)}; \theta) / Q_i(z^{(i)}) \right]$ with respect to $z^{(i)}$ drawn according to the distribution given by $Q_i$. By Jensen's inequality, we have
$$f\left( \mathbb{E}_{z^{(i)} \sim Q_i}\left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right] \right) \geq \mathbb{E}_{z^{(i)} \sim Q_i}\left[ f\left( \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right) \right],$$
where the “$z^{(i)} \sim Q_i$” subscripts above indicate that the expectations are with respect to $z^{(i)}$ drawn from $Q_i$. This allowed us to go from [link] to [link].
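To see this step concretely, here is a small numerical check of the inequality (the distribution and values below are arbitrary toy numbers, not anything from the derivation): for the concave function $f(x) = \log x$, the log of an expectation is never smaller than the expectation of the log.

```python
import numpy as np

# Toy check of Jensen's inequality for f(x) = log x: f(E[X]) >= E[f(X)] for positive X.
rng = np.random.default_rng(0)
Q = rng.random(5)
Q /= Q.sum()                     # a distribution Q over 5 values of z
ratio = rng.random(5) + 0.1      # positive values playing the role of p(x, z; theta) / Q(z)

lhs = np.log(np.sum(Q * ratio))  # f( E_{z ~ Q}[ p(x, z; theta) / Q(z) ] )
rhs = np.sum(Q * np.log(ratio))  # E_{z ~ Q}[ f( p(x, z; theta) / Q(z) ) ]
print(lhs >= rhs)                # True for any such Q and positive ratios
```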
Now, for any set of distributions $Q_i$, the formula [link] gives a lower-bound on $\ell(\theta)$. There are many possible choices for the $Q_i$'s. Which should we choose? Well, if we have some current guess $\theta$ of the parameters, it seems natural to try to make the lower-bound tight at that value of $\theta$. I.e., we'll make the inequality above hold with equality at our particular value of $\theta$. (We'll see later how this enables us to prove that $\ell(\theta)$ increases monotonically with successive iterations of EM.)
To make the bound tight for a particular value of $\theta$, we need the step involving Jensen's inequality in our derivation above to hold with equality. For this to be true, we know it is sufficient that the expectation be taken over a “constant”-valued random variable. I.e., we require that
$$\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$$
for some constant $c$ that does not depend on $z^{(i)}$. This is easily accomplished by choosing
$$Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta).$$
Actually, since we know $\sum_{z} Q_i(z^{(i)}) = 1$ (because it is a distribution), this further tells us that
$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta).$$
Thus, we simply set the $Q_i$'s to be the posterior distribution of the $z^{(i)}$'s given $x^{(i)}$ and the setting of the parameters $\theta$.
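As a concrete illustration of why this choice works, the following toy snippet (with made-up numbers standing in for $p(x^{(i)}, z; \theta)$, not anything from the notes) shows that normalizing the joint over $z$ makes the ratio $p(x^{(i)}, z^{(i)}; \theta)/Q_i(z^{(i)})$ constant in $z^{(i)}$, yields exactly the posterior, and makes the Jensen bound tight.

```python
import numpy as np

# One example x with a latent z taking 3 values; joint[z] stands in for p(x, z; theta).
joint = np.array([0.10, 0.25, 0.05])

Q = joint / joint.sum()       # Q(z) proportional to p(x, z; theta), normalized to sum to 1
print(joint / Q)              # the ratio p(x, z; theta) / Q(z) is the same constant for every z

# Normalizing the joint over z is Bayes' rule, so Q is the posterior p(z | x; theta),
# and with this choice the Jensen lower bound is tight:
log_lik = np.log(joint.sum())              # log p(x; theta)
bound = np.sum(Q * np.log(joint / Q))      # sum_z Q(z) log( p(x, z; theta) / Q(z) )
print(np.isclose(bound, log_lik))          # True
```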
Now, for this choice of the $Q_i$'s, Equation [link] gives a lower-bound on the log-likelihood $\ell$ that we're trying to maximize. This is the E-step. In the M-step of the algorithm, we then maximize our formula in Equation [link] with respect to the parameters to obtain a new setting of the $\theta$'s. Repeatedly carrying out these two steps gives us the EM algorithm, which is as follows:

Repeat until convergence {

(E-step) For each $i$, set
$$Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta).$$

(M-step) Set
$$\theta := \arg\max_{\theta} \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}.$$

}
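For concreteness, here is a minimal sketch of these two steps for a hypothetical one-dimensional mixture of two Gaussians. The notes state EM for a general model $p(x, z; \theta)$, so the specific model, parameterization, and all names below are illustrative assumptions rather than part of the notes.

```python
import numpy as np

# Hypothetical concrete instance: EM for a 1-D mixture of two Gaussians,
# with theta = (phi, mu, sigma2). The general algorithm does not assume this model.
def joint(x, phi, mu, sigma2):
    # joint[i, j] = p(x_i, z = j; theta) = phi_j * N(x_i; mu_j, sigma2_j)
    return phi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def log_likelihood(x, theta):
    # l(theta) = sum_i log sum_z p(x_i, z; theta)
    return np.log(joint(x, *theta).sum(axis=1)).sum()

def lower_bound(x, Q, theta):
    # sum_i sum_z Q_i(z) log( p(x_i, z; theta) / Q_i(z) )
    return (Q * np.log(joint(x, *theta) / Q)).sum()

def e_step(x, theta):
    # Q_i(z) = p(z | x_i; theta), obtained by normalizing the joint over z
    p = joint(x, *theta)
    return p / p.sum(axis=1, keepdims=True)

def m_step(x, Q):
    # Maximize the lower bound over theta; for this model the update is closed form.
    n_k = Q.sum(axis=0)
    phi = n_k / len(x)
    mu = (Q * x[:, None]).sum(axis=0) / n_k
    sigma2 = (Q * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
    return phi, mu, sigma2

def run_em(x, theta, n_iters=50):
    history = [log_likelihood(x, theta)]
    for _ in range(n_iters):
        Q = e_step(x, theta)                      # E-step
        theta = m_step(x, Q)                      # M-step
        history.append(log_likelihood(x, theta))
    return theta, history

# Example usage on synthetic data (also an assumption for illustration):
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.5, 300)])
theta0 = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
theta_hat, history = run_em(x, theta0)
```

In this sketch the M-step has a closed form because the weighted Gaussian maximum-likelihood updates maximize the lower bound exactly; for other models the M-step may itself require numerical optimization.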
How do we know if this algorithm will converge? Well, suppose $\theta^{(t)}$ and $\theta^{(t+1)}$ are the parameters from two successive iterations of EM. We will now prove that $\ell(\theta^{(t)}) \leq \ell(\theta^{(t+1)})$, which shows that EM always monotonically improves the log-likelihood. The key to showing this result lies in our choice of the $Q_i$'s. Specifically, on the iteration of EM in which the parameters had started out as $\theta^{(t)}$, we would have chosen $Q_i^{(t)}(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$. We saw earlier that this choice ensures that Jensen's inequality, as applied to get [link], holds with equality, and hence
$$\ell(\theta^{(t)}) = \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}.$$
The parameters $\theta^{(t+1)}$ are then obtained by maximizing the right hand side of the equation above. Thus,
$$\ell(\theta^{(t+1)}) \geq \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})} \geq \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})} = \ell(\theta^{(t)}).$$
The first inequality holds because the lower bound [link] holds for any set of distributions $Q_i$ and any $\theta$, and in particular for $Q_i = Q_i^{(t)}$ and $\theta = \theta^{(t+1)}$. The second inequality holds because $\theta^{(t+1)}$ is chosen explicitly to maximize the right hand side over $\theta$, so it attains a value at least as large as the one at $\theta = \theta^{(t)}$.
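As a quick numerical check of this chain of inequalities, the snippet below reuses the hypothetical mixture-of-Gaussians sketch and the synthetic data `x` from the previous code block (so it is not standalone); the starting parameters `theta_t` are arbitrary.

```python
# Reuses e_step, m_step, lower_bound, log_likelihood, x, and history from the sketch above.
theta_t = (np.array([0.4, 0.6]), np.array([0.0, 1.0]), np.array([2.0, 2.0]))
Q_t = e_step(x, theta_t)                  # Q_i^(t)(z) = p(z | x_i; theta^(t))
theta_next = m_step(x, Q_t)               # theta^(t+1) maximizes the lower bound

assert np.isclose(lower_bound(x, Q_t, theta_t), log_likelihood(x, theta_t))  # equality at theta^(t)
assert lower_bound(x, Q_t, theta_next) >= lower_bound(x, Q_t, theta_t)       # M-step improves the bound
assert log_likelihood(x, theta_next) >= lower_bound(x, Q_t, theta_next)      # bound still holds at theta^(t+1)
assert log_likelihood(x, theta_next) >= log_likelihood(x, theta_t)           # hence l(theta^(t+1)) >= l(theta^(t))

# The full run from the sketch above is monotone as well:
assert all(b >= a - 1e-9 for a, b in zip(history, history[1:]))
```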