Your maximum likelihood estimate for the probability of getting examples from class j is just the fraction of examples in your training set that actually came from class j. So this is the maximum likelihood estimation for Gaussian Discriminant Analysis. Now, in the mixture of Gaussians model and the EM problem we don’t actually have these class labels, right, we just have an unlabeled data set like this. We just have a set of dots. I’m trying to draw the same data set that I had above, but without the class labels. So now, it’s as if you only get to observe the x^(i)’s, but the z^(i)’s are unknown. Okay. So the class label is unknown. So in the EM algorithm we’re going to try to take a guess for the values of the z^(i)’s, and specifically, in the E step we computed w_j^(i) as our current best guess for the probability that z^(i) equals j given that data point. Okay. So this just means: given my current hypothesis, the way the Gaussians are, and given everything else, can I compute the posterior probability that the point x^(i) actually came from class j? What is the probability that this point was a cross versus a circle? And now in the M step, my formula for estimating the parameter phi_j will be given by 1 over m, sum from i equals 1 through m, of w_j^(i). So w_j^(i) is my best guess for the probability that point i belongs to the Gaussian, or belongs to class j, and we use this formula instead of the one with the indicator functions. Okay.
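To make the E step and the M step concrete, here is a minimal sketch of one EM iteration for a mixture of Gaussians in Python with NumPy and SciPy. This is my own illustration, not code from the lecture; the function name em_step and the variable names X, phi, mu, and Sigma are assumptions chosen to mirror the notation x^(i), phi_j, mu_j, and Sigma_j.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, phi, mu, Sigma):
    """One EM iteration for a mixture of Gaussians (illustrative sketch).

    X:     (m, n) data matrix of unlabeled points x^(i)
    phi:   (k,)   mixing proportions, phi[j] = P(z = j)
    mu:    (k, n) component means
    Sigma: (k, n, n) component covariance matrices
    """
    m, n = X.shape
    k = len(phi)

    # E step: w[i, j] = P(z^(i) = j | x^(i); phi, mu, Sigma)
    w = np.zeros((m, k))
    for j in range(k):
        w[:, j] = phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
    w /= w.sum(axis=1, keepdims=True)   # normalize over the k components

    # M step: re-estimate the parameters using the soft weights w
    phi_new = w.mean(axis=0)            # phi_j = (1/m) * sum_i w[i, j]
    mu_new = (w.T @ X) / w.sum(axis=0)[:, None]
    Sigma_new = np.zeros_like(Sigma)
    for j in range(k):
        diff = X - mu_new[j]
        Sigma_new[j] = (w[:, j, None] * diff).T @ diff / w[:, j].sum()
    return phi_new, mu_new, Sigma_new
```

Repeating this step until the parameters stop changing is the whole algorithm; the soft weights w play the role that the known class labels played in Gaussian Discriminant Analysis.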
And similarly, this is my formula for the estimate for mu_j, and if you replace the w_j^(i)’s with the indicator functions 1{z^(i) = j}, you get back the formula that you had in Gaussian Discriminant Analysis. I’m trying to convey an intuitive sense of why these algorithms make sense. Can you raise your hand if this makes sense now? Cool. Okay. So what I want to do now is actually present a broader view of the EM algorithm. What you just saw was a special case of the EM algorithm specialized to the mixture of Gaussians model, and in the remaining half hour I have today I’m going to describe a general description of the EM algorithm, and everything you just saw will be derived as, sort of, a special case of this more general view that I’ll present now. And as a precursor to actually deriving this more general view of the EM algorithm, I’m gonna have to describe something called Jensen’s inequality that we use in the derivation.
So here’s Jensen’s inequality. Let f be a convex function. So a function is convex if its second derivative, which I’ve written f''(x), is greater than or equal to zero. Functions don’t have to be differentiable to be convex, but if a function has a second derivative, then f''(x) should be greater than or equal to zero. And let X be a random variable. Then f applied to the expectation of X is less than or equal to the expectation of f of X, that is, f(E[X]) <= E[f(X)]. Okay. And hopefully you remember, I often drop the square brackets, so E X is the expected value of X; I’ll often drop the square brackets.
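As a quick numerical sanity check of the inequality (my own example, not from the lecture), take the convex function f(x) = x squared and a random variable X that is 1 or 3 with equal probability; then f(E[X]) = f(2) = 4 while E[f(X)] = (1 + 9)/2 = 5, so f(E[X]) <= E[f(X)] as claimed.

```python
import numpy as np

# Jensen's inequality check with the convex function f(x) = x**2
# and a random variable X that is 1 or 3 with equal probability.
f = lambda x: x ** 2
values = np.array([1.0, 3.0])
probs = np.array([0.5, 0.5])

f_of_mean = f(np.dot(probs, values))   # f(E[X]) = f(2) = 4
mean_of_f = np.dot(probs, f(values))   # E[f(X)] = (1 + 9)/2 = 5
assert f_of_mean <= mean_of_f          # Jensen: f(E[X]) <= E[f(X)]
print(f_of_mean, mean_of_f)            # 4.0 5.0
```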