So here’s a somewhat broken argument. Let’s say I want to apply this result to analyzing logistic regression. So say your hypothesis class is the class of all linear decision boundaries; right? So say script H is parameterized by d real numbers; okay? For example, if you’re applying logistic regression with n features, then d would be n plus one, because logistic regression finds a linear decision boundary parameterized by n plus one real numbers.
Now think about how your hypothesis class is really represented in a computer – computers use zero-one bits to represent real numbers. And if you use a normal standard computer, it will normally represent real numbers by what’s called double precision floating point numbers. And what that means is that each real number is stored in a 64-bit representation; right?
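As a quick sanity check on that 64-bit claim – this is just an illustrative sketch, not part of the lecture – you can ask Python how many bytes a double-precision float actually occupies:

```python
import struct

# Pack a Python float (a C double under the hood) into its raw
# IEEE 754 double-precision representation and count the bits.
raw = struct.pack('>d', 3.14159)   # big-endian 64-bit double
print(len(raw) * 8)                # -> 64 bits per real number

# A hypothesis parameterized by d real numbers therefore occupies
# 64 * d bits in this representation.
d = 101                            # e.g. n = 100 features plus an intercept (illustrative)
print(64 * d)                      # -> 6464 bits, so at most 2**(64*d) distinct hypotheses
```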
So really – you know what floating point is in a computer. A 64-bit floating point number is what almost all of us use routinely. And so a hypothesis class parameterized by d real numbers is really, as far as the computer is concerned, parameterized by 64 times d bits. Computers can’t represent real numbers exactly; they can only represent them with finitely many bits. And so, in your computer representation, the size of your hypothesis class is determined by the 64d bits you can flip, and the number of possible values for those 64d bits is just 2 to the power of 64d; okay? Because that’s the number of ways you can set 64d bits. And this is why it’s important that we had log k there; right? So k is therefore 2 to the 64d, and if I plug that into this equation over here, what you find is that in order to get this sort of guarantee, it suffices that m is greater than or equal to, on the order of, 1 over gamma squared times log of 2 to the 64d over delta, which is that; okay?
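To make that substitution concrete, here is a sketch in LaTeX, assuming the finite-hypothesis-class bound quoted earlier in the lecture (roughly m ≥ (1/(2γ²)) log(2k/δ)); the exact constants are not the point, only the order of growth:

```latex
% With k = |\mathcal{H}| = 2^{64d} hypotheses representable in 64d bits,
% plugging into the finite-class sample-complexity bound gives
\begin{align*}
m \;\ge\; \frac{1}{2\gamma^2}\log\frac{2k}{\delta}
  &= \frac{1}{2\gamma^2}\log\frac{2\cdot 2^{64d}}{\delta} \\
  &= \frac{1}{2\gamma^2}\left(64d\,\log 2 + \log\frac{2}{\delta}\right)
  \;=\; O\!\left(\frac{1}{\gamma^2}\left(d + \log\frac{1}{\delta}\right)\right),
\end{align*}
% i.e. the number of training examples needed grows linearly in d.
```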
So just to be clear, this gives essentially the same sort of sample complexity result as we had before. The question is: suppose you want a guarantee that the hypothesis returned by empirical risk minimization will have generalization error within two gamma of the best hypothesis in your hypothesis class. Then what this result says is that, in order to give that sort of error bound guarantee, it suffices that m is greater than or equal to this. In other words, your number of training examples has to be on the order of d over gamma squared, times log 1 over delta; okay? And the intuition that this conveys is actually roughly right.
This says that the number of training examples you need is roughly linear in the number of parameters of your hypothesis class – that m is on the order of something linear in d. That intuition is actually roughly right; I’ll say more about this later. This result is clearly slightly broken, in the sense that it relies on a 64-bit representation of floating-point numbers. So let me go ahead and outline the “right way” to show this more formally; all right? It turns out the “right way” to show this more formally involves a much longer argument – the proof is extremely involved – so I’m just going to state the result and not prove it.
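As a rough illustration of that linear-in-d intuition – the constants below are hypothetical, since the lecture only claims an order-of-growth result – a small helper makes the dependence on d visible:

```python
import math

def sample_complexity_bound(d, gamma, delta, bits=64):
    """Heuristic bound m >= (1 / (2*gamma**2)) * log(2*k / delta) with
    k = 2**(bits*d); constants are illustrative, not from the lecture."""
    log_k = bits * d * math.log(2)                     # log(2**(bits*d))
    return (log_k + math.log(2.0 / delta)) / (2 * gamma**2)

# Doubling d roughly doubles the required number of training examples.
for d in (10, 20, 40):
    print(d, round(sample_complexity_bound(d, gamma=0.1, delta=0.05)))
```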