Right. So we have some model for the joint distribution of x and z, and our goal is to find the maximum likelihood estimate of the parameters theta, where the likelihood is defined, as usual, as the sum from i equals 1 to m of the log probability of our data. And here p of x^(i), parameterized by theta, is now given by a sum over all the values of z^(i) of the joint probability of x^(i) and z^(i), parameterized by theta. Okay? So just by taking our model of the joint distribution of x and z and marginalizing out z^(i), we get p of x^(i) parameterized by theta. And so the EM algorithm will be a way of performing this maximum likelihood estimation problem, which is complicated by the fact that we have these z^(i)'s in our model that are unobserved. Before I actually do the math, here's a useful picture to keep in mind. The horizontal axis in this cartoon is the theta axis, and there's some function, the log-likelihood of theta, that we're trying to maximize, and usually maximizing it directly, by taking derivatives and setting them to zero, would be very hard to do. What the EM algorithm will do is the following. Let's say it initializes to some value theta 0. What the EM algorithm will end up doing is it will construct a lower bound on this log-likelihood function, and this lower bound will be tight, that is, it will hold with equality, at the current guess of the parameters. Then it maximizes this lower bound with respect to theta, so we'll end up with, say, that value. So that will be theta 1. Okay.
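For reference, here is the objective described above written out explicitly. The notation (m training examples x^(1), ..., x^(m), each with a latent variable z^(i)) is reconstructed from the spoken description rather than taken from the board.

```latex
% Log-likelihood of theta, with the latent variables z^{(i)} marginalized out
\ell(\theta) \;=\; \sum_{i=1}^{m} \log p\!\left(x^{(i)}; \theta\right)
             \;=\; \sum_{i=1}^{m} \log \sum_{z^{(i)}} p\!\left(x^{(i)}, z^{(i)}; \theta\right)
```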
And then the EM algorithm will look at theta 1, construct a new lower bound on the log-likelihood, and then maximize that. So you jump here; that's the next value, theta 2. And you do that again, and you get theta 3, theta 4, and so on, until you converge to a local optimum of the log-likelihood function. Okay. So this is a cartoon that displays what the EM algorithm will do. So let's actually make that formal now. So you want to maximize with respect to theta the sum from i equals 1 to m of log p of x^(i), parameterized by theta, which is the sum from i equals 1 to m of the log of the sum over all values of z^(i) of the joint probability of x^(i) and z^(i), parameterized by theta. Okay. So what I'm going to do is multiply and divide by the same thing, and I'm going to write this in terms of a distribution Q. Okay. So I'm going to construct a probability distribution Q_i over the latent random variable z^(i), and these Q_i will be proper distributions, so each Q_i will be greater than or equal to zero, and the sum over all the values of z^(i) of Q_i of z^(i) will be 1. So these Q's will be probability distributions that I get to construct. Okay. And I'll later describe the specific choice of this distribution Q_i. So this Q_i is a probability distribution over the random variable z^(i). Right. I see some frowns. Do you have questions about this? No. Okay.
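Concretely, the multiply-and-divide step introduces the distributions Q_i into the objective without changing its value. Again, the formula below is a reconstruction of what is being written on the board, in the same assumed notation as above.

```latex
% Multiply and divide each term by Q_i(z^{(i)}), where Q_i(z^{(i)}) >= 0
% and the Q_i(z^{(i)}) sum to 1 over z^{(i)}
\ell(\theta)
  = \sum_{i=1}^{m} \log \sum_{z^{(i)}} p\!\left(x^{(i)}, z^{(i)}; \theta\right)
  = \sum_{i=1}^{m} \log \sum_{z^{(i)}} Q_i\!\left(z^{(i)}\right)
      \frac{p\!\left(x^{(i)}, z^{(i)}; \theta\right)}{Q_i\!\left(z^{(i)}\right)}
```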
So the log function looks like that, and it's a concave function, so that tells us that the log of the expected value of X is greater than or equal to the expected value of log X, by the concave-function form of Jensen's inequality. And so continuing from the previous expression, this is a sum of the log of an expectation, and that must therefore be greater than or equal to the sum of the expected value of the log. Okay. That's using Jensen's inequality. And lastly, just to expand out this formula again, this is equal to that. Okay. Yeah.
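Spelled out, the Jensen's inequality step reads as follows: the inner sum over z^(i) is an expectation with respect to z^(i) drawn from Q_i, and since log is concave, the log of the expectation is at least the expectation of the log. The formulas below reconstruct the lower bound from the spoken derivation.

```latex
% Jensen's inequality for the concave log function gives the lower bound
\sum_{i=1}^{m} \log \mathbb{E}_{z^{(i)} \sim Q_i}\!\left[
    \frac{p\!\left(x^{(i)}, z^{(i)}; \theta\right)}{Q_i\!\left(z^{(i)}\right)}\right]
\;\ge\;
\sum_{i=1}^{m} \mathbb{E}_{z^{(i)} \sim Q_i}\!\left[
    \log \frac{p\!\left(x^{(i)}, z^{(i)}; \theta\right)}{Q_i\!\left(z^{(i)}\right)}\right]
\;=\;
\sum_{i=1}^{m} \sum_{z^{(i)}} Q_i\!\left(z^{(i)}\right)
    \log \frac{p\!\left(x^{(i)}, z^{(i)}; \theta\right)}{Q_i\!\left(z^{(i)}\right)}
```

As a quick sanity check of the inequality itself, here is a small numerical sketch; it is not part of the lecture, and the choice of an exponential distribution and the sample size are arbitrary.

```python
import numpy as np

# Numerical sanity check of Jensen's inequality for the concave log function:
# log(E[X]) >= E[log(X)] for any positive random variable X.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # arbitrary positive-valued samples

log_of_mean = np.log(x.mean())   # log of the (Monte Carlo) expectation
mean_of_log = np.log(x).mean()   # expectation of the log

print(f"log(E[X]) = {log_of_mean:.4f}")
print(f"E[log(X)] = {mean_of_log:.4f}")
assert log_of_mean >= mean_of_log  # Jensen's inequality holds
```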