I can actually write down what that is, because XI is Gaussian with mean mu and covariance lambda lambda transpose plus psi, so you can actually write this down as one over two pi to the n over two, determinant of lambda lambda transpose plus psi to the one half, times e to the minus one half, x minus mu transpose, lambda lambda transpose plus psi inverse, x minus mu. So that's my formula for the density of a Gaussian that has mean mu and covariance lambda lambda transpose plus psi. So this is my likelihood of the parameters given a training set. And one thing you could do is actually take this likelihood and try to maximize it in terms of the parameters, try to find the maximum likelihood estimates of the parameters. But if you do that, you find that if you take the log to get the log likelihood, take derivatives, and set the derivatives equal to zero, you'll be unable to solve for the maximum of this analytically. You won't be able to solve this maximum likelihood estimation problem in closed form.
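For reference, the density being spoken here, and the resulting log likelihood over the m training examples, can presumably be written as follows (n denoting the dimension of X, following the notation above):

$$
p(x;\mu,\Lambda,\Psi) = \frac{1}{(2\pi)^{n/2}\,|\Lambda\Lambda^T+\Psi|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^T(\Lambda\Lambda^T+\Psi)^{-1}(x-\mu)\right)
$$

$$
\ell(\mu,\Lambda,\Psi) = \sum_{i=1}^{m} \log p\big(x^{(i)};\mu,\Lambda,\Psi\big)
$$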
If you take the derivatives of this with respect to the parameters, set the derivatives equal to zero, and try to solve for the values of the parameters lambda, mu, and psi in closed form, you won't be able to do that analytically. So what we'll use instead to estimate the parameters in a factor analysis model will be the EM algorithm.
Student: Why is the log likelihood P of X and not P of X given Z, or P of X and Z, or something?
Instructor (Andrew Ng): Oh, right. So the question is why is the likelihood P of X and not P of X given Z or P of X and Z. The answer is, let's see, by analogy to the mixture of Gaussians model, we're given a training set that just comprises a set of unlabeled training examples, and for convenience, for whatever reason, it's easier for me to write down a model that defines the joint distribution P of X comma Z.
But what I would like to do is really maximize, as a function of my parameters, and I'm using theta as a shorthand to denote all the parameters of the model, the probability of the data I actually observe. And this would actually be the product from I equals one to M of P of XI parameterized by theta, and this is really integrating out Z.
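Written out, the quantity being maximized here is the marginal likelihood of the observed Xs, with Z integrated out:

$$
\max_{\theta}\;\prod_{i=1}^{m} p\big(x^{(i)};\theta\big) \;=\; \max_{\theta}\;\prod_{i=1}^{m}\int_{z^{(i)}} p\big(x^{(i)},z^{(i)};\theta\big)\,dz^{(i)}
$$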
So I only ever get to observe the Xs, and the Zs are latent random variables or hidden random variables. And so what I'd like to do is do this maximization in which I've implicitly integrated out Z. Does that make sense? Actually, are there other questions about the factor analysis model? Okay. So here's the EM algorithm. In the E step, we compute the conditional distribution of ZI given XI under our current setting of the parameters. And in the M step, we perform this maximization. And if this looks a little bit different from the previous versions of EM that you've seen, the only difference is that now I'm integrating over ZI, because ZI is now a Gaussian random variable, now this continuous-valued thing, so rather than summing over ZI, I now integrate over ZI. And if you replace the integral with a sum over ZI, then this is exactly the M step you saw when we worked it out for the mixture of Gaussians model.
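To make the E step concrete: because Z and X are jointly Gaussian in the factor analysis model, the conditional distribution Q_i of ZI given XI computed in the E step is itself Gaussian. Its mean and covariance (the standard factor analysis formulas, not read out in this portion of the transcript) and the M step objective are:

$$
z^{(i)}\mid x^{(i)} \sim \mathcal{N}\big(\mu_{z^{(i)}\mid x^{(i)}},\,\Sigma_{z^{(i)}\mid x^{(i)}}\big)
$$

$$
\mu_{z^{(i)}\mid x^{(i)}} = \Lambda^T(\Lambda\Lambda^T+\Psi)^{-1}\big(x^{(i)}-\mu\big),\qquad
\Sigma_{z^{(i)}\mid x^{(i)}} = I - \Lambda^T(\Lambda\Lambda^T+\Psi)^{-1}\Lambda
$$

$$
\mu,\Lambda,\Psi := \arg\max_{\mu,\Lambda,\Psi}\;\sum_{i=1}^{m}\int_{z^{(i)}} Q_i\big(z^{(i)}\big)\,\log\frac{p\big(x^{(i)},z^{(i)};\mu,\Lambda,\Psi\big)}{Q_i\big(z^{(i)}\big)}\,dz^{(i)}
$$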