And so that’s how you compute E of ZI transpose and E of ZI ZI transpose and substitute them back into this formula. And you would then have your M step update to the parameter matrix lambda. And the last thing I want to point out in this derivation is a common mistake, which is probably because of the name EM algorithm, expectation maximization. In the E step, some people want to take the expectation of the random variable Z, and then in the M step they just plug in the expected value everywhere they see it. So in particular, one common mistake in deriving EM for factor analysis is to look at this and say, “Oh, look. I see ZZ transpose. Let’s just plug in the expected value under the Q distribution.” And so they plug in mu of ZI given XI times mu of ZI given XI transpose for that expectation, and that would be an incorrect derivation of EM because it’s missing this other term, sigma of ZI given XI. So one common misconception for EM is that in the E step you just compute the expected value of the hidden random variable, and in the M step you plug in the expected value. In some algorithms that turns out to be the right thing to do. In the mixture of Gaussians and the mixture of [inaudible] models, that would actually give you the right answer, but in general the EM algorithm is more complicated than just taking the expected values of the random variables and then pretending that they were sort of observed at the expected values.
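To make the missing-term point concrete, here is a minimal NumPy sketch. It assumes the standard factor analysis model (Z drawn from N(0, I), X equal to mu plus lambda times Z plus noise with diagonal covariance psi) and the usual closed-form Gaussian posterior for Z given X. The function names `e_step` and `m_step_lambda`, the `include_cov` flag, and the synthetic data are illustrative assumptions, not something from the lecture. Only the lambda update is shown; mu and psi are held fixed.

```python
import numpy as np

def e_step(X, mu, Lam, Psi):
    """Posterior of z given x under the assumed model z ~ N(0, I), x = mu + Lam z + eps.

    Returns the posterior means, shape (m, k), and the posterior covariance,
    shape (k, k), which is the same for every example.
    """
    k = Lam.shape[1]
    G = Lam.T @ np.linalg.inv(Lam @ Lam.T + Psi)  # (k, n)
    means = (X - mu) @ G.T                         # row i is mu of z_i given x_i
    cov = np.eye(k) - G @ Lam                      # sigma of z_i given x_i
    return means, cov

def m_step_lambda(X, mu, means, cov, include_cov=True):
    """Update for Lam. With include_cov=False this is the common mistake:
    E[z z^T] is replaced by mu mu^T and the posterior covariance is dropped."""
    m = X.shape[0]
    first = (X - mu).T @ means   # sum_i (x_i - mu) E[z_i]^T, shape (n, k)
    second = means.T @ means     # sum_i mu_i mu_i^T, shape (k, k)
    if include_cov:
        second = second + m * cov  # adds sum_i sigma_i, giving sum_i E[z_i z_i^T]
    return first @ np.linalg.inv(second)

# Synthetic data (hypothetical sizes) to compare the two updates.
rng = np.random.default_rng(0)
n, k, m = 5, 2, 200
Lam_true = rng.normal(size=(n, k))
Psi = np.diag(rng.uniform(0.5, 1.0, size=n))
Z = rng.normal(size=(m, k))
X = Z @ Lam_true.T + rng.normal(size=(m, n)) @ np.sqrt(Psi)
mu = X.mean(axis=0)

Lam0 = rng.normal(size=(n, k))  # arbitrary current guess for Lam
means, cov = e_step(X, mu, Lam0, Psi)
lam_right = m_step_lambda(X, mu, means, cov, include_cov=True)
lam_wrong = m_step_lambda(X, mu, means, cov, include_cov=False)
print(np.abs(lam_right - lam_wrong).max())  # nonzero: the two updates disagree
```

The two updates differ only in whether the posterior covariance is added to the second-moment sum, which is exactly the term the plug-in derivation leaves out.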
So I wanna go through this just to illustrate that step as well. So just to summarize, the three key things that came up in this derivation were: first, that for the E step, we had a continuous Gaussian random variable, and so to compute the E step, we actually compute the mean and covariance of the distribution QI. The second thing that came up was in the M step: when you see these integrals, sometimes if you interpret them as expectations then the rest of the math becomes much easier. And the final thing was, again in the M step, that the EM algorithm is derived from a certain maximization problem that we solve. It is not necessarily just plugging in the expected value of ZI everywhere.
Let’s see. I feel like I just did a ton of math and wrote down way too many equations. And even doing this, I was skipping many steps. So you can go to the lecture notes to see all the derivations of the steps I skipped, like how you actually take derivatives with respect to the matrix lambda, and how to compute the updates for the other parameters as well, for mu and for psi, because this is only for lambda. And so that’s the factor analysis algorithm. Justin?
Student: I was just wondering in the step in the lower right board, you said that the second term doesn’t have any parameters that we’re interested in. The first term has all the parameters, so we’ll [inaudible], but it seems to me that QI has a lot of parameters [inaudible].
Instructor (Andrew Ng): I see. Right. Let’s see. So the question was, doesn’t the term QI have parameters? So it actually turns out in the EM algorithm, sometimes P of ZI may have parameters, but QI of ZI never has any parameters. In the specific case of factor analysis, P of ZI doesn’t have parameters. In other examples, say the mixture of Gaussians model, ZI was a multinomial random variable, and so in that example P of ZI has parameters, but it turns out that Q of ZI will never have any parameters.
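One way to see this answer, written as a sketch: the E step sets QI using the current parameter values, and the M step then treats QI as a fixed distribution while it maximizes over the parameters. The symbols \theta (all parameters collected together) and \theta^{\text{old}} (their values going into the E step) are notation assumed here for illustration.

```latex
% E step: Q_i is computed from the current parameter values and then frozen.
Q_i(z^{(i)}) := p\bigl(z^{(i)} \mid x^{(i)}; \theta^{\text{old}}\bigr)

% M step: maximize over \theta with every Q_i held fixed.
\max_{\theta} \;
  \sum_i \mathbb{E}_{z^{(i)} \sim Q_i}\bigl[\log p(x^{(i)}, z^{(i)}; \theta)\bigr]
  \;-\; \sum_i \mathbb{E}_{z^{(i)} \sim Q_i}\bigl[\log Q_i(z^{(i)})\bigr]

% The second sum depends only on \theta^{\text{old}}, so it is a constant
% with respect to \theta and can be dropped from the maximization.
```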