And to do this, we’ll use some simple heuristics: for every feature xi, we’ll just compute some rough measure of how informative xi is about y. There are many ways you can do this. One way is to compute the correlation between xi and y: for each of your features, see how correlated it is with your class label y, and then pick the top k most correlated features. For the case of text classification, there’s one other measure of informativeness that’s used very commonly for choosing these k features, which is called mutual information.
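As a rough sketch of the correlation heuristic just described, the snippet below scores each feature by its Pearson correlation with the label and keeps the top k. The function names `pearson` and `top_k_by_correlation` are made up for illustration, and ranking by absolute value (so that strongly negatively correlated features also count as informative) is an assumption rather than something stated in the lecture.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    # A constant sequence has zero variance; score it 0 instead of dividing by 0.
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def top_k_by_correlation(X, y, k):
    """Indices of the k features with the largest |correlation| with y.

    X is a list of rows (one list of feature values per example).
    """
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    # sorted() is stable, so ties keep their original feature order.
    order = sorted(range(n_features), key=lambda j: -scores[j])
    return order[:k]
```

Here feature columns are scored one at a time and independently of each other, which is what makes this kind of selection cheap compared with searching over feature subsets.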
You’ll see some of these ideas in the problem sets, so I’ll just say this very briefly. The mutual information between feature xi and y, I’ll just write out the definition, I guess. Let’s say this is text classification, so xi can take on two values, 0 and 1. The mutual information between xi and y is defined as a sum over all possible values of xi, and a sum over all possible values of y, of the joint distribution p(xi, y) times the log of p(xi, y) over p(xi) times p(y). That is, MI(xi, y) = sum over xi in {0, 1}, sum over y, of p(xi, y) log [ p(xi, y) / (p(xi) p(y)) ]. All of these distributions, the joint distribution over xi and y and the marginals as well, you would estimate from your training data. So you would estimate from the training data what’s the probability that xi is 0, what’s the probability that xi is 1, what’s the probability that xi is 0 and y is 0, that xi is 1 and y is 0, and so on.
So it turns out there’s a standard information-theoretic measure of how different two probability distributions are, called the K-L divergence. And I’m not gonna prove this here, but it turns out that this mutual information is actually the K-L divergence between the joint distribution p(xi, y) and the product p(xi) times p(y). If you’ve taken a class in information theory, you’ll have seen the concepts of mutual information and the K-L divergence, but if you haven’t, don’t worry about it. The intuition is just that there’s something called K-L divergence that’s a formal measure of how different two probability distributions are. And mutual information is a measure of how different the joint distribution of x and y is from the distribution you would get if you were to assume they were independent; okay?
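Written out, the relationship described here looks roughly as follows. The notation MI and KL is chosen for this sketch; the sums run over the values the feature and the label can take.

```latex
\mathrm{MI}(x_i, y)
  = \mathrm{KL}\bigl(p(x_i, y) \,\big\|\, p(x_i)\,p(y)\bigr)
  = \sum_{x_i} \sum_{y} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\,p(y)}
```

If p(x_i, y) = p(x_i) p(y) for every pair of values, each log term is log 1 = 0, so the whole sum is 0.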
So if x and y were independent, then p of x, y would be equal to p of x times p of y. And so you know, this distribution and this distribution would be identical, and the K-L divergence would be 0. In contrast, if x and y were very non-independent – in other words, if x and y are very informative about each other, then this K-L divergence will be large. And so mutual information is a formal measure of how non-independent x and y are. And if x and y are highly non-independent then that means that x will presumably tell you something about y, and so they’ll have large mutual information. And this measure of information will tell you x might be a good feature. And you get to play with some of these ideas more in the problem sets. So I won’t say much more about it.
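To make this concrete, here is a minimal sketch of estimating mutual information from training data by counting. The function name `mutual_information` is made up for illustration, and the estimate uses raw empirical frequencies with no smoothing, which is an assumption of the sketch.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences.

    xs holds one feature's value for each training example and ys the labels.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # counts of each x value
    py = Counter(ys)              # counts of each y value
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # Only observed pairs are visited, so p_xy > 0 and the log is defined.
        # Unobserved pairs contribute 0, by the convention 0 log 0 = 0.
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi
```

For example, a feature that always equals a balanced binary label gives log 2 nats, and a feature whose empirical joint distribution with the label factorizes exactly gives 0.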
And what you do then is, having chosen some measure like correlation or mutual information or something else, you then pick the top k features; meaning that you compute the correlation between xi and y for all the features, or the mutual information between xi and y for all the features. And then you include in your learning algorithm the k features with the largest correlation with the label, or the largest mutual information with the label, whatever. And to choose k, you can actually use cross validation, as well; okay? So you would take all your features, and sort them in decreasing order of mutual information. And then you’d try using just the top one feature, the top two features, the top three features, and so on. And you decide how many features to include using cross validation; okay? Or sometimes you can just choose this by hand, as well.
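The procedure above can be sketched as follows: rank the features once by mutual information, then try the top 1, top 2, and so on, scoring each choice by cross validation. Everything named here is an assumption for illustration: `rank_features`, `choose_k`, the interleaved fold assignment, and in particular `fit_predict`, a hypothetical callable standing in for whatever learning algorithm you are using.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def rank_features(X, y):
    """Feature indices sorted in decreasing order of mutual information with y."""
    n_features = len(X[0])
    scores = [mutual_information([row[j] for row in X], y)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])

def choose_k(X, y, fit_predict, n_folds=5):
    """Pick k by cross validation over the top-1, top-2, ... feature sets.

    fit_predict(X_train, y_train, X_test) -> list of predicted labels
    is a hypothetical stand-in for your learner.
    Returns (best_k, cross-validated accuracy of best_k).
    """
    n = len(X)
    order = rank_features(X, y)  # rank once on all the data, as described above
    folds = [list(range(i, n, n_folds)) for i in range(n_folds)]
    best_k, best_acc = 1, -1.0
    for k in range(1, len(order) + 1):
        keep = order[:k]
        correct = 0
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(n) if i not in held]
            X_tr = [[X[i][j] for j in keep] for i in train]
            y_tr = [y[i] for i in train]
            X_te = [[X[i][j] for j in keep] for i in held_out]
            preds = fit_predict(X_tr, y_tr, X_te)
            correct += sum(p == y[i] for p, i in zip(preds, held_out))
        acc = correct / n
        if acc > best_acc:  # strict ">" keeps the smallest k among ties
            best_k, best_acc = k, acc
    return best_k, best_acc
```

One design note: this sketch ranks the features on the full training set before splitting into folds, following the description above. A common variant recomputes the ranking inside each training fold so that held-out labels do not influence which features are selected.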
Okay. Questions about this? Okay. Cool. Great. So next lecture I’ll continue, I’ll wrap up the Bayesian model selection, but let’s close here for today.
[End of Audio]
Duration: 77 minutes