And so, well, if you set delta to be 2k e to the negative 2 gamma squared m, this is that form that I had on the left. And if you solve for m, what you find is that there’s an equivalent form of this result that says that so long as your training set size m is greater than this, and this is the formula that I get by solving for m. Okay? So long as m is greater than or equal to this, then with probability, which I’m abbreviating to w.p. again, with probability at least one minus delta, we have this for all hypotheses. Okay? So this says how large a training set I need to guarantee that, with probability at least one minus delta, the training error is within gamma of the generalization error for all my hypotheses, and this gives an answer.
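The algebra behind "solving for m" is short enough to write out. This is a sketch that assumes the setup described in the lecture: k is the number of hypotheses in the class, m is the training set size, and the failure probability delta is set equal to 2k e to the negative 2 gamma squared m. The symbols \hat{\varepsilon}(h) and \varepsilon(h) for training and generalization error are notation assumed here, not taken from the transcript.

```latex
\begin{align*}
\delta &= 2k \exp\!\left(-2\gamma^{2} m\right) \\
\log\frac{\delta}{2k} &= -2\gamma^{2} m \\
m &= \frac{1}{2\gamma^{2}} \log\frac{2k}{\delta}
\end{align*}
% Since the failure probability 2k exp(-2 gamma^2 m) decreases in m,
% any larger m also works:
\[
m \ge \frac{1}{2\gamma^{2}} \log\frac{2k}{\delta}
\quad\Longrightarrow\quad
\text{w.p. at least } 1-\delta,\;
\left|\hat{\varepsilon}(h) - \varepsilon(h)\right| \le \gamma
\ \text{ for all } h \in \mathcal{H}.
\]
```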
And just to give this another name, this is an example of a sample complexity bound. So from undergrad computer science classes you may have heard of computational complexity, which is how much computation you need to achieve something. So sample complexity just means how large a training set, how large a sample of examples, you need in order to achieve a certain bound on the error. And it turns out that many of the theorems we write out can be posed in the form of a probability bound, or a sample complexity bound, or in some other form. I personally often find the sample complexity bounds the easiest to interpret, because they say how large a training set you need to get a certain bound on the errors.
And in fact, well, we’ll see this later, sample complexity bounds often help to give guidance: if you’re trying to achieve something on a machine learning problem, this really is trying to give guidance on how much training data you need to prove something.
The one thing I want to note here is that m grows like the log of k, right, and the log of k grows extremely slowly as a function of k. The log is one of the slowest growing functions, right. Some of you may have heard this, right? I learned this from a colleague, Andrew Moore, at Carnegie Mellon: that in computer science, for all practical purposes, for all values of k, log k is less [inaudible]. This is almost true. So log is one of the slowest growing functions, and so the fact that the sample complexity m grows like the log of k means that you can increase the number of hypotheses in your hypothesis class quite a lot and the number of training examples you need won’t grow very much.
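To see how slowly the required training set size grows with the number of hypotheses, here is a minimal Python sketch. It assumes the sample complexity bound takes the form m ≥ (1/(2γ²)) log(2k/δ), which is what you get by solving δ = 2k exp(−2γ²m) for m. The function name `sample_complexity` and the example values of gamma and delta are illustrative choices, not from the lecture.

```python
import math


def sample_complexity(gamma, delta, k):
    """Smallest integer m with m >= (1 / (2 * gamma**2)) * log(2 * k / delta).

    gamma: allowed gap between training and generalization error
    delta: allowed failure probability
    k:     number of hypotheses in the (finite) hypothesis class
    """
    return math.ceil(math.log(2 * k / delta) / (2 * gamma ** 2))


# Illustrative values: increase k by factors of a thousand and watch m.
for k in (10, 10**3, 10**6, 10**9):
    print(k, sample_complexity(gamma=0.05, delta=0.05, k=k))
```

Each thousandfold increase in k adds the same fixed amount to m rather than multiplying it, which is the logarithmic growth described above.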
[Inaudible]. This property will be important later when we talk about infinite hypothesis classes. The final form is, I guess, sometimes called the error bound, which is when you hold m and delta fixed and solve for gamma. And what you get then is that with probability at least one minus delta, we have that, for all hypotheses in my hypothesis class, the difference between the training and generalization error will be less than or equal to that. Okay? And that’s just solving for gamma and plugging the value I get in there. Okay?
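As a rough illustration of the error bound form, the sketch below solves δ = 2k exp(−2γ²m) for gamma, giving γ = sqrt((1/(2m)) log(2k/δ)). This assumes the same finite hypothesis class setup as above; the function name `error_bound` and the numbers in the example are hypothetical.

```python
import math


def error_bound(m, delta, k):
    """Gamma such that, with probability at least 1 - delta, training and
    generalization error differ by at most gamma for all k hypotheses,
    given m training examples."""
    return math.sqrt(math.log(2 * k / delta) / (2 * m))


# Illustrative values: the gap shrinks like 1 / sqrt(m).
for m in (1_000, 4_000, 16_000):
    print(m, round(error_bound(m=m, delta=0.05, k=100), 4))
```

Note that quadrupling m only halves the bound, so tightening gamma is much more expensive in data than enlarging the hypothesis class.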