Instructor (Andrew Ng): Yeah, let’s see. So Laplace smoothing is a method to give you, sort of, hopefully, better estimates of the probability distribution over a multinomial, and so was I using X or Y in the previous lecture? So in trying to estimate the probability over a multinomial – I think X and Y are different. I think – was it X or Y? I think it was X, actually. Well – oh, I see, right, right. I think I was using a different definition for the random variable Y. So suppose you have a multinomial random variable X which takes on – let’s use a different alphabet. Suppose you have a multinomial random variable X which takes on L different values. Then the maximum likelihood estimate for the probability of X being equal to k, P(X = k), will be the number of observations of X equals k divided by the total number of observations of X, okay? So that’s the maximum likelihood estimate. And to add Laplace smoothing to this, you, sort of, add one to the numerator, and you add L to the denominator, where L is the number of possible values that X can take on. So, in this case, this is the probability that X equals k, and X can take on 50,000 values if 50,000 is the length of your dictionary; it may be something else, but that’s why I add 50,000 to the denominator. Are there other questions? Yeah.
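The two estimates described above can be illustrated with a minimal Python sketch. The function names, the encoding of the L values as the integers 0 to L - 1, and the sample data are assumptions made for illustration; they are not from the lecture.

```python
from collections import Counter


def mle_estimate(observations, num_values):
    """Maximum likelihood estimate: count(X = k) / total observations."""
    counts = Counter(observations)
    total = len(observations)
    return [counts[k] / total for k in range(num_values)]


def laplace_estimate(observations, num_values):
    """Laplace-smoothed estimate: (count(X = k) + 1) / (total + L)."""
    counts = Counter(observations)
    total = len(observations)
    return [(counts[k] + 1) / (total + num_values) for k in range(num_values)]


# Hypothetical data: X takes L = 4 values (0..3); value 3 is never observed.
observations = [0, 0, 1, 2, 0, 1]
L = 4

mle = mle_estimate(observations, L)
smoothed = laplace_estimate(observations, L)

print("MLE:     ", mle)
print("Smoothed:", smoothed)
```

In this sketch the unobserved value gets probability zero under the maximum likelihood estimate, while the smoothed estimate gives it a small positive probability and still sums to one.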
Student: Is there a specific definition for a maximum likelihood estimation of a parameter? We’ve talked about it a couple times, and all the examples make sense, but I don’t know what the, like, general formula for it is.
Instructor (Andrew Ng): I see. Yeah, right. So the question is, what’s the definition of a maximum likelihood estimate? So actually in today’s lecture and the previous lecture, when I talked about Gaussian Discriminant Analysis, I was, sort of, throwing out the maximum likelihood estimates on the board without proving them. The way to actually work this out is to write down the likelihood.
So the way to figure out all of these maximum likelihood estimates is to write down the likelihood of the parameters, phi k given y equals zero, phi y, right? And so given a training set, the likelihood, I guess I should be writing log likelihood, will be the log of the product from i equals one to N of P(x^(i), y^(i)), you know, parameterized by these things, okay? Where P(x^(i), y^(i)), right, is given by the product from j equals one to n_i of P(x_j^(i) given y^(i)). They are parameterized by – well, I’ll just drop the parameters to write this more simply – oh, I just put it in – times P(y^(i)), okay?
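Written out in symbols, the log likelihood described above might read as follows. This is a reconstruction from the spoken description, so the notation is an assumption: N is the number of training examples, n_i is the number of words in example i, x_j^{(i)} is the j-th word of example i, and the parameter list is abbreviated to the parameters named in the lecture.

```latex
\ell(\phi_{k \mid y=0}, \phi_y, \ldots)
  = \log \prod_{i=1}^{N} p\bigl(x^{(i)}, y^{(i)}\bigr),
\qquad
p\bigl(x^{(i)}, y^{(i)}\bigr)
  = \Biggl( \prod_{j=1}^{n_i} p\bigl(x_j^{(i)} \mid y^{(i)}\bigr) \Biggr)\, p\bigl(y^{(i)}\bigr)

% Taking the log turns the products into sums:
\ell
  = \sum_{i=1}^{N} \Biggl[ \sum_{j=1}^{n_i} \log p\bigl(x_j^{(i)} \mid y^{(i)}\bigr)
    + \log p\bigl(y^{(i)}\bigr) \Biggr]
```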
So this is my log likelihood, and so the way you get the maximum likelihood estimate of the parameters is you – so given a fixed training set, given a fixed set of x^(i), y^(i)’s, you maximize this in terms of these parameters, and then you get the maximum likelihood estimates that I’ve been writing out. So in a previous section of today’s lecture I wrote out some maximum likelihood estimates for the Gaussian Discriminant Analysis model and for Naïve Bayes, and I didn’t prove them, but you get to, sort of, play with that yourself in the homework problem as well for one of these models, and you’ll be able to verify that when you maximize the likelihood, or maximize the log likelihood, hopefully you do get the same formulas as what I was drawing up on the board. But the way these are derived is by maximizing this, okay? Cool.
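As a small numerical illustration of this idea, the Python sketch below takes the simplest case, a single multinomial variable rather than the full Naïve Bayes model, and checks that the count-based estimate is not beaten by randomly sampled alternative parameters. The counts, the function name, and the random search are assumptions made for illustration; this is a sanity check, not a derivation.

```python
import math
import random


def log_likelihood(probs, counts):
    """Log likelihood of multinomial parameters given observed counts."""
    return sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)


# Hypothetical counts of each of L = 3 values in a fixed training set.
counts = [5, 3, 2]
total = sum(counts)

# Closed-form maximum likelihood estimate: count / total.
mle = [c / total for c in counts]
best = log_likelihood(mle, counts)

# Compare against random alternative parameter vectors that sum to one.
rng = random.Random(0)
never_beaten = True
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in counts]
    s = sum(raw)
    candidate = [r / s for r in raw]
    if log_likelihood(candidate, counts) > best + 1e-12:
        never_beaten = False

print("MLE:", mle)
print("No random candidate scored higher:", never_beaten)
```

A random search like this cannot prove the formula, but it is a quick way to check a closed-form estimate against the log likelihood it is supposed to maximize.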