Student:
[Inaudible]
Instructor (Andrew Ng) :
[Inaudible]
Student:
[Inaudible]
Instructor (Andrew Ng): Yeah. Okay. So this has the [inaudible]. So let's say I have a random variable Z, right, and Z has some distribution. Let's denote it Q. And let's say I have some function G of Z. Okay. Then by definition, the expected value of G of Z is equal to the sum, over all the values of Z, of the probability of that value of Z times G of Z. Right. That's sort of the definition of the expected value of a function of a random variable. And so the way I got from this step to this step is by using that. So in particular, I've been using the distribution QI to denote the distribution of Z, so this is, like, sum over Z of P of Z times [inaudible]. And so this is just the expected value, with respect to a random variable Z drawn from the distribution Q, of G of Z. Are there questions?
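As a small illustration of the definition just described, the sketch below computes the expected value of G of Z as a sum over the values of Z, each weighted by its probability under Q. The distribution `q` and the function `g` are made-up examples, not values from the lecture.

```python
# Hypothetical discrete distribution Q over three values of Z (illustrative only).
q = {0: 0.2, 1: 0.5, 2: 0.3}


def g(z):
    # Arbitrary example function of Z.
    return z * z


def expectation(q, g):
    # E_{Z ~ Q}[g(Z)] = sum over z of Q(z) * g(z)
    return sum(prob * g(z) for z, prob in q.items())


expected = expectation(q, g)
print(expected)
```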
Student: So in general, when you're doing maximum likelihood estimation, you use the likelihood of the data, but in this case you only use the probability of X because you have only observed X, whereas previously we used the probability of X given the labels?
Instructor (Andrew Ng): Yes. Exactly. Right. [Inaudible] we want to choose the parameters that maximize the probability of the data, and in this case, our data comprises only the Xs because we don't observe the Zs, and therefore, the likelihood of the parameters is given by the probability of the data, which is [inaudible]. So this is all we've done, right: we wanted to maximize the log likelihood of theta, and what we've done, through these manipulations, is we've now constructed a lower bound on the log likelihood of theta. Okay.
And in particular, this formula that we came up with, we should think of this as a function of theta. If you think about it, theta are the parameters of your model, right, so if you think about this as a function of your parameters theta, what we've just shown is that the log likelihood of your parameters theta is lower bounded by this thing. Okay. Remember that cartoon of repeatedly constructing a lower bound and optimizing the lower bound. So what we've just done is construct a lower bound for the log likelihood of theta. Now, the last piece we want for this lower bound is that we actually want this inequality to hold with equality at the current value of theta.
So just refer back to the previous cartoon. If this was the log likelihood of theta, we'd then construct some lower bound of it, some function of theta, and if this is my current value for theta, then I want my lower bound to be tight. I want my lower bound to be equal to the log likelihood of theta there, because that's what I need to guarantee that when I optimize my lower bound, I'll actually do even better on the true objective function. Yeah.
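The formula on the board is not captured in the transcript. One way to write the lower bound being discussed, assuming the usual notation of training examples x^(i), latent variables z^(i), and distributions Q_i over z^(i) (the superscript notation is an assumption here), is the following application of Jensen's inequality to the log likelihood:

```latex
\begin{align*}
\ell(\theta)
  &= \sum_i \log \sum_{z^{(i)}} p\bigl(x^{(i)}, z^{(i)}; \theta\bigr) \\
  &= \sum_i \log \sum_{z^{(i)}} Q_i\bigl(z^{(i)}\bigr)\,
     \frac{p\bigl(x^{(i)}, z^{(i)}; \theta\bigr)}{Q_i\bigl(z^{(i)}\bigr)} \\
  &\ge \sum_i \sum_{z^{(i)}} Q_i\bigl(z^{(i)}\bigr)\,
     \log \frac{p\bigl(x^{(i)}, z^{(i)}; \theta\bigr)}{Q_i\bigl(z^{(i)}\bigr)}
\end{align*}
```

The right-hand side is the function of theta that plays the role of the lower bound in the cartoon; it is tight at the current theta when the ratio inside the log does not depend on z^(i).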
Student:
How do [inaudible]
Instructor (Andrew Ng): Excuse me. Yeah. Great question. How do I know that function is concave? Yeah. I don't think I've shown it. It actually turns out to be true for all the models we work with. Do I know that the lower bound is a concave function of theta? I think you're right. In general, this may not be a concave function of theta. For many of the models we work with, this will turn out to be a concave function, but that's not always true. Okay. So let me go ahead and choose a value for Q. And I'll refer back to Jensen's inequality. We said that this inequality will become an equality if the random variable inside is a constant. Right. If you're taking an expectation with respect to a constant-valued variable.
So QI of ZI must sum to 1 over the values of ZI, and so to compute it you should just take P of XI, ZI, parameterized by theta, and normalize it so that it sums to one. There is a step that I'm skipping here to show that this is really the right thing to do. Hopefully, you'll just be convinced it's true. The actual steps that I skipped are written out in the lecture notes. So you then have that the denominator, by definition, is that, and so by the definition of conditional probability, QI of ZI is just equal to P of ZI given XI, parameterized by theta. Okay.
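A quick numeric check of this choice of Q, as a minimal sketch: for one fixed x, the made-up numbers in `joint` stand in for P of X, Z parameterized by theta, over three values of Z. Normalizing them gives the posterior, and with that Q the lower bound equals the log likelihood, while a different Q (uniform here) gives a looser bound.

```python
import math

# Hypothetical values of p(x, z; theta) for one fixed x and z in {0, 1, 2}.
joint = [0.10, 0.05, 0.25]


def posterior(joint):
    # Q(z) = p(x, z; theta) / sum_z p(x, z; theta) = p(z | x; theta)
    total = sum(joint)
    return [p / total for p in joint]


def lower_bound(q, joint):
    # sum_z Q(z) * log(p(x, z; theta) / Q(z)); terms with Q(z) = 0 are skipped
    return sum(qz * math.log(p / qz) for qz, p in zip(q, joint) if qz > 0)


log_likelihood = math.log(sum(joint))        # log p(x; theta)
q_posterior = posterior(joint)
q_uniform = [1.0 / len(joint)] * len(joint)

tight = lower_bound(q_posterior, joint)      # bound with Q set to the posterior
loose = lower_bound(q_uniform, joint)        # bound with an arbitrary other Q

print(log_likelihood, tight, loose)
```

With Q set to the posterior, the ratio inside the log is the same constant for every z, which is exactly the condition under which Jensen's inequality holds with equality.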
And so to summarize the algorithm, the EM algorithm has two steps. In the E step, we choose the distributions QI, so QI of ZI will be set equal to P of ZI given [inaudible] by theta. That's the formula we just worked out. And so by this step we've now created a lower bound on the log likelihood function that is tight at the current value of theta. And in the M step, we then optimize that lower bound with respect to our parameters theta and specifically to the [inaudible] of theta. Okay. And so that's the EM algorithm. I won't have time to do it today, but I'll probably show this in the next lecture: the EM algorithm that I wrote down for the mixture of Gaussians is actually a special case of this more general template, where the E step and the M step correspond pretty much exactly to this E step and this M step that I wrote down. The E step constructs this lower bound and makes sure that it is tight at the current value of theta. That's in my choice of Q. And then the M step optimizes the lower bound with respect to [inaudible] theta. Okay. So lots more to say about this in the next lecture. Let's check if there are any questions before we close. No. Okay. Cool. So let's wrap up for today and we'll continue talking about this in the next session.
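The two steps summarized above can be sketched for a one-dimensional mixture of two Gaussians. This is a minimal illustration under stated assumptions, not the lecture's own code: the data in `xs`, the starting parameters, and all function names are made up, and the M step uses the standard closed-form mixture-of-Gaussians updates.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, weights, means, variances):
    # Sum over data of log p(x; theta), with z marginalized out.
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in xs
    )


def e_step(xs, weights, means, variances):
    # Q_i(z) = p(z | x_i; theta): normalize p(x_i, z; theta) over z.
    qs = []
    for x in xs:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        qs.append([p / total for p in joint])
    return qs


def m_step(xs, qs, k, min_var=1e-6):
    # Maximize the lower bound over theta (closed form for this model).
    # min_var is a numerical guard, not part of EM; it is not hit on this data.
    n = len(xs)
    weights, means, variances = [], [], []
    for j in range(k):
        nj = sum(q[j] for q in qs)
        mean = sum(q[j] * x for q, x in zip(qs, xs)) / nj
        var = sum(q[j] * (x - mean) ** 2 for q, x in zip(qs, xs)) / nj
        weights.append(nj / n)
        means.append(mean)
        variances.append(max(var, min_var))
    return weights, means, variances


# Made-up data: two clusters around -2 and +2.
xs = [-2.1, -1.9, -2.0, -1.8, -2.2, 1.9, 2.1, 2.0, 2.2, 1.8]
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]

history = [log_likelihood(xs, weights, means, variances)]
for _ in range(20):
    qs = e_step(xs, weights, means, variances)       # make the bound tight
    weights, means, variances = m_step(xs, qs, 2)    # optimize the bound
    history.append(log_likelihood(xs, weights, means, variances))

print(means, history[0], history[-1])
```

Because each E step makes the bound tight at the current theta and each M step can only raise the bound, the values in `history` should never decrease, up to floating-point rounding.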
[End of Audio]
Duration: 76 minutes