Take logistic regression: say you have ten parameters and you want 0.01 error with 95 percent probability. How many training examples do you need? If you actually plug concrete constants into learning theory bounds, you often get extremely pessimistic estimates of the number of examples you need. You end up with some ridiculously large number – say, 10,000 training examples just to fit ten parameters. So a good way to think of these learning theory bounds is – and this is also why, when I write papers on learning theory bounds, I quite often use big-O notation and just ignore the constant factors – that the bounds tend to be very loose.
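To make that concrete, here is a minimal sketch, assuming the finite-hypothesis-class Hoeffding-style bound from the course notes and the common trick of treating each of the d parameters as a 64-bit float, so |H| = 2^(64d). The exact figure depends entirely on which bound and which constants you plug in – which is the point.

```python
import math

# Finite-hypothesis-class uniform convergence bound (as in the CS229 notes):
#   m >= (1 / (2 * gamma**2)) * log(2 * |H| / delta)
# Treating each of d parameters as a 64-bit float gives a "finite"
# hypothesis class of size |H| = 2**(64 * d).

d = 10        # number of parameters (from the question above)
gamma = 0.01  # target error tolerance
delta = 0.05  # 95% confidence  =>  delta = 0.05

bits = 64 * d
# log(2 * |H| / delta) = log(2 / delta) + bits * log(2)
log_term = math.log(2 / delta) + bits * math.log(2)
m = log_term / (2 * gamma ** 2)

print(f"bound: m >= {m:,.0f} examples")  # roughly 2.2 million -- wildly pessimistic
```

In practice a few hundred examples would fit ten parameters just fine; the gap between that and the bound's millions is exactly the looseness being described.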
There are some attempts to use these bounds to give guidelines as to what model to choose, and so on. But I personally tend to use the bounds for intuition – for example, whether the number of training examples you need grows linearly in the number of parameters, or grows linearly in the VC dimension, or whether it grows quadratically in the number of parameters. So it's quite often the shape of the bounds that matters. The fact that the sample complexity is linear in the VC dimension – that's the sort of useful intuition you can get from these theories (sketched just below). But the actual magnitude of the bound will tend to be much looser than what holds true for the particular problem you are working on. So did that answer your question?
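For reference, the "shape" being described is roughly the VC sample complexity result from the course notes; here is a sketch of the statement, with the constants deliberately suppressed by the big-O:

```latex
% With probability at least 1 - \delta, for a hypothesis class H of
% VC dimension d, the hypothesis \hat{h} found by empirical risk
% minimization on m examples satisfies
\varepsilon(\hat{h}) \;\le\; \min_{h \in H} \varepsilon(h)
  \;+\; O\!\left( \sqrt{ \frac{d}{m}\log\frac{m}{d}
  + \frac{1}{m}\log\frac{1}{\delta} } \right),
% so to get within \gamma of the best hypothesis in H it suffices that
m = O_{\gamma,\delta}(d)
% -- sample complexity linear in the VC dimension.
```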
Student: Uh-huh.
Instructor (Andrew Ng): Yeah. And it turns out, by the way, for myself, a rule of thumb that I often use is: if you're trying to fit a logistic regression model with n (or n plus one) parameters, and the number of training examples is ten times your number of parameters, then you're probably in good shape; and if your number of training examples is many times the number of parameters, then you're almost certainly fine fitting that model. So those are the sorts of intuitions that you can get from these bounds.
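As a sanity check of that rule of thumb, here is a minimal sketch assuming scikit-learn and synthetic data; the dataset, dimensions, and sample sizes are all illustrative assumptions, not from the lecture.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

n_features = 50  # so the model has ~51 parameters (weights + intercept)

for m in (60, 500, 5000):  # roughly 1x, 10x, 100x the parameter count
    X, y = make_classification(n_samples=m + 2000, n_features=n_features,
                               n_informative=10, random_state=0)
    # Hold out a large test set; train on only m examples.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=m,
                                              random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"m={m:5d}  train acc={clf.score(X_tr, y_tr):.2f}  "
          f"test acc={clf.score(X_te, y_te):.2f}")
```

Typically the train/test gap is large at m near the parameter count and shrinks to a few points once m is around ten times the parameter count, which matches the rule of thumb.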
Student: In cross validation, do we split the examples randomly?
Instructor (Andrew Ng): Yes. So by convention we usually split the train and test sets randomly. One more thing I want to talk about for model selection is a special case of model selection, called the feature selection problem. And here's the intuition: for many machine learning problems you may have a very high dimensional feature space – your feature vectors x may be very high dimensional. For example, for text classification – and I want to talk about this text classification example of spam versus non-spam – you may easily have on the order of 30,000 or 50,000 features. I think I used 50,000 in my earlier examples. So if you have that many features – 50,000 features – then, depending on what learning algorithm you use, there may be a real risk of overfitting.
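To see where a 50,000-dimensional feature vector comes from, here is a minimal sketch assuming scikit-learn's bag-of-words vectorizer and a two-email toy corpus; both are illustrative assumptions, not the dataset from the lecture.

```python
from sklearn.feature_extraction.text import CountVectorizer

emails = [
    "win a free prize now",           # spam-like
    "meeting notes for the project",  # non-spam-like
]
# On a real corpus the vocabulary easily reaches tens of thousands of
# words; each email becomes a sparse vector with one entry per word.
vec = CountVectorizer()
X = vec.fit_transform(emails)
print(X.shape)                    # (n_emails, vocabulary_size)
print(vec.get_feature_names_out())
```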
And so if you can reduce the number of features, maybe you can reduce the variance of your learning algorithm and reduce the risk of overfitting. For the specific case of text classification, you might imagine that there is only a small number of truly "relevant" features. There are all these English words, and many of them probably don't tell you anything at all about whether the email is spam or non-spam – English function words like "the", "of", "a", "and" are probably words that carry no information about spam versus non-spam. There will, in contrast, be a much smaller set of features that are truly "relevant" to the learning problem.
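One simple way to act on that intuition is to score each word by how much it tells you about the label and keep only the top-scoring ones. The sketch below assumes scikit-learn's mutual-information-based SelectKBest on a tiny made-up corpus; this is just one standard filter method for illustration, not the feature selection procedure from the lecture.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Toy labeled corpus (illustrative; not the dataset from the lecture).
emails = ["win free money now", "free prize claim now",
          "notes from the meeting", "the agenda for the project"]
labels = np.array([1, 1, 0, 0])  # 1 = spam, 0 = non-spam

vec = CountVectorizer()
X = vec.fit_transform(emails)
vocab = vec.get_feature_names_out()

# Keep the k words carrying the most mutual information with the label;
# function words like "the" should score near zero and be dropped.
selector = SelectKBest(mutual_info_classif, k=4).fit(X, labels)
print(vocab[selector.get_support()])
```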