In other words, the assumption is this: once you know whether an email is spam or not spam, then knowing whether some words appear in the email won't help you predict whether any other word appears in the email, okay? And, obviously, this assumption is false, right? This assumption can't possibly be true. I mean, if you see the word – I don't know, CS229 – in an email, you're much more likely to see my name in the email, or the TAs' names, or whatever.
So this assumption is normally just false for normal written English, right? But it turns out that despite this assumption being, sort of, false in the literal sense, Naive Bayes is, sort of, an extremely effective algorithm for classifying text documents: for classifying emails into spam or not spam, for automatically sorting your emails into different folders, for looking at web pages and classifying whether a webpage is trying to sell something, or whatever. It turns out this assumption works very well for classifying text documents, and for other applications too that I'll talk a bit about later.
As a digression that'll make sense only to some of you: if you're familiar with Bayesian networks, say graphical models, the Bayesian network associated with this model looks like this – you have a random variable y that then generates x1, x2, through x50,000, okay? If you've not seen Bayes nets before, if you don't know graphical models, just ignore this; it's not important for our purposes. But if you've seen them before, that's what it would look like.
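Written out in standard notation, the factorization this Bayes net encodes – the conditional independence assumption itself – is that the word indicators factor given the label (with n = 50,000 being the dictionary size in the running example):

p(x_1, \dots, x_n \mid y) \;=\; \prod_{j=1}^{n} p(x_j \mid y)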
Okay. So the parameters of the model are as follows: phi_{j|y=1}, which is the probability that x_j = 1 given y = 1; phi_{j|y=0}, which is the probability that x_j = 1 given y = 0; and phi_y, which is the probability that y = 1, okay? So these are the parameters of the model, and, therefore, to fit the parameters of the model, you can write down the joint likelihood of the training set, right, which is equal to the product over your training examples of p(x, y), as usual, okay?
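Written out – this is the standard Naive Bayes likelihood that "as usual" refers to, in my notation – the joint likelihood over the m training examples is:

\mathcal{L}(\phi_y, \phi_{j|y=0}, \phi_{j|y=1}) \;=\; \prod_{i=1}^{m} p\big(x^{(i)}, y^{(i)}\big) \;=\; \prod_{i=1}^{m} \Big( \prod_{j=1}^{n} p\big(x_j^{(i)} \mid y^{(i)}\big) \Big)\, p\big(y^{(i)}\big)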
So given the training set, you can write down the joint likelihood of the parameters, and then when you do maximum likelihood estimation, you find that the maximum likelihood estimates of the parameters are really pretty much what you'd expect. The maximum likelihood estimate for phi_{j|y=1} is the sum from i = 1 to m of the indicator that x_j^{(i)} = 1 and y^{(i)} = 1, divided by the sum from i = 1 to m of the indicator that y^{(i)} = 1, okay?
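In standard notation, with 1{·} the indicator function, that estimate is:

\hat{\phi}_{j|y=1} \;=\; \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \,\wedge\, y^{(i)} = 1\}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}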
And, I guess stated more simply, the numerator just says, "Run through your entire training set of m examples and count up the number of times you saw word j in a piece of email for which the label y was equal to 1." So, in other words, look through all your spam emails and count the number of emails in which word j appeared. And the denominator is, you know, the sum from i = 1 to m of the indicator that y^{(i)} = 1; the denominator is just the number of spam emails you got.
And so this ratio is: of all the spam emails in your training set, what fraction of those emails did word j – the j-th word in your dictionary – appear in? And that's the maximum likelihood estimate for the probability of seeing word j conditioned on the piece of email being spam, okay? And similarly, your maximum likelihood estimate for phi_y is pretty much what you'd expect, right? Okay?
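To make the counting concrete, here is a minimal sketch, not from the lecture, of how these maximum likelihood estimates could be computed; the function name and the assumption that you already have a binary m-by-n word-occurrence matrix X and a 0/1 label vector y are mine.

import numpy as np

def fit_naive_bayes_mle(X, y):
    """Maximum likelihood estimates for the Naive Bayes parameters.

    X: (m, n) 0/1 array, X[i, j] = 1 if word j appears in email i.
    y: (m,)   0/1 array, y[i] = 1 if email i is spam.
    Assumes both classes appear at least once in the training set.
    """
    spam = X[y == 1]          # all spam emails in the training set
    not_spam = X[y == 0]      # all non-spam emails

    # phi_y: fraction of training emails that are spam
    phi_y = y.mean()

    # phi_{j|y=1}: fraction of spam emails in which word j appears
    phi_spam = spam.mean(axis=0)
    # phi_{j|y=0}: fraction of non-spam emails in which word j appears
    phi_not_spam = not_spam.mean(axis=0)

    return phi_y, phi_spam, phi_not_spam

Each estimate really is just a ratio of counts, which is all that "pretty much what you'd expect" is pointing at.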