Student: But there are sets of size three that it cannot shatter; right? [Inaudible] was your point.
Instructor (Andrew Ng): Yes, absolutely. So it turns out that if I choose a set like this – this is actually a set s – then there are labelings on this that h cannot realize. And so, h cannot shatter this set. But that’s okay because – right – there definitely is – there exists some other set of size three that can be shattered. So the VC dimension is three. And then there is no set of size four that h can shatter. Yeah?
Student: [Inaudible].
Instructor (Andrew Ng): Not according to this definition. No. Right. So again, let’s see, I can choose my set s to be a set of three points that are all overlapping. Three points in exactly the same place. And clearly, I can’t shatter this set, but that’s okay. And I can’t shatter this set, either, but that’s okay because there are some other sets of size three that I can shatter. And it turns out this result holds true more generally, in n dimensions: the VC dimension of the class of linear classifiers in n dimensions is equal to n plus one. Okay? So this is in [inaudible], and if you have linear classifiers over an n-dimensional feature space, the VC dimension in n dimensions is equal to n plus one.
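The shattering argument for linear classifiers in two dimensions can be checked numerically. The sketch below is illustrative only and is not from the lecture: the function names `separable` and `shatters` are hypothetical, and it uses a capped perceptron search, so a `False` result means "no separator found within `max_epochs`", which is evidence rather than proof that a labeling cannot be realized.

```python
from itertools import product

def separable(points, labels, max_epochs=1000):
    """Perceptron search for (w, b) with y * (w.x + b) > 0 on every point.

    Returns True if a separating linear classifier is found within
    max_epochs passes, False otherwise (a heuristic cap, by assumption).
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shatters(points, max_epochs=1000):
    """True if every one of the 2^k labelings of the k points is realized."""
    return all(
        separable(points, labels, max_epochs)
        for labels in product((-1, 1), repeat=len(points))
    )

triangle = [(0, 0), (1, 0), (0, 1)]        # three points in general position
collinear = [(0, 0), (1, 0), (2, 0)]       # three points on a line
overlapping = [(1, 1), (1, 1), (1, 1)]     # three points in the same place
square = [(0, 0), (1, 1), (1, 0), (0, 1)]  # four points, includes the XOR labeling

for name, pts in [("triangle", triangle), ("collinear", collinear),
                  ("overlapping", overlapping), ("square", square)]:
    print(name, shatters(pts))
```

Only the triangle is shattered. The other size-three sets fail, which is fine for the definition: the VC dimension asks whether some set of that size can be shattered, not whether every set can.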
So maybe you wanna write it down: what is arguably the best-known result in all of learning theory, I guess, which is this. Let a hypothesis class be given, and let the VC dimension of h be equal to d. Then with probability one minus delta, we have that – the formula on the right looks a bit complicated, but don’t worry about it. I’ll point out the essential aspects of it later. But the key to this result is that if you have a hypothesis class with VC dimension d, and now this can be an infinite hypothesis class, what Vapnik and Chervonenkis showed is that with probability at least one minus delta, you enjoy this sort of uniform convergence result; okay? We have that for all hypotheses h – that for all the hypotheses in your hypothesis class – you have that the generalization error of h minus the training error of h.
So the difference between these two things is bounded above by some complicated formula like this; okay? And thus, with probability one minus delta, we also have that – have the same thing; okay? And going from this step to this step; right? Going from this step to this step is actually something that you saw yourself; that we actually proved earlier. Because – you remember, in the previous lecture we proved that if you have uniform convergence, then that implies that – it appears actually that we showed that if generalization error and training error are close to each other, within gamma of each other, then the generalization error of the hypothesis you pick will be within two gamma of the best generalization error.
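The formula on the board is not captured in the transcript. As the result is commonly stated, and assuming the usual notation (none of these symbols appear in the transcript itself): $\varepsilon(h)$ is the generalization error, $\hat{\varepsilon}(h)$ the training error, $m$ the training set size, $\hat{h}$ the hypothesis with the lowest training error, and $h^*$ the best hypothesis in the class. With probability at least $1 - \delta$, the two steps read roughly as follows:

```latex
% Uniform convergence for a class H with VC dimension d:
\[
  \bigl|\varepsilon(h) - \hat{\varepsilon}(h)\bigr|
  \le O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\right)
  \quad \text{for all } h \in \mathcal{H},
\]
% and hence, for the hypothesis chosen by minimizing training error:
\[
  \varepsilon(\hat{h})
  \le \varepsilon(h^*)
  + O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\right).
\]
```

The second line follows from the first by the earlier argument: if every hypothesis has training error within $\gamma$ of its generalization error, then $\varepsilon(\hat{h}) \le \varepsilon(h^*) + 2\gamma$, and the factor of two is absorbed into the big-O.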
So this is really generalization error of h [inaudible] best possible generalization error plus two times gamma. And there’s just the factor of two in front here that I’ve absorbed into the big-O notation. So that formula is slightly more complicated. Let me just rewrite this as a corollary, which is that in order to guarantee that this holds with probability one minus delta – with probability at least one minus delta, I should say – it suffices that – I’m gonna write this – this way: I’m gonna write m equals big-O of d, and I’m going to put gamma and delta in as a subscript there to denote that. Let’s see, if we treat gamma and delta as constants, so that allows me to absorb terms that depend on gamma and delta into the big-O notation, then in order to guarantee this holds, it suffices that m is on the order of the VC dimension of the hypothesis class; okay?
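To get a feel for the corollary, one can evaluate the expression inside the big-O while growing m linearly with d. This is a rough sketch under stated assumptions: the hidden constant is taken to be 1, the expression used is the commonly stated form sqrt((d/m) log(m/d) + (1/m) log(1/delta)) rather than anything recorded in the transcript, and `vc_bound_term` is a hypothetical helper name. The printed numbers are therefore not actual error guarantees.

```python
import math

def vc_bound_term(d, m, delta):
    """Expression inside the big-O, with the unknown constant set to 1.

    Assumes m > d so that log(m / d) is positive.
    """
    return math.sqrt((d / m) * math.log(m / d) + (1 / m) * math.log(1 / delta))

delta = 0.05
for d in (1, 10, 100, 1000):
    m = 100 * d  # training set size proportional to the VC dimension
    print(d, m, round(vc_bound_term(d, m, delta), 4))
```

With m held at a fixed multiple of d, the term does not grow as d increases, which is what writing m as big-O of d for fixed gamma and delta is expressing.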