So what the Hoeffding inequality says is that if you pick a value of gamma, let me put down one interval gamma here and another interval gamma there. Then it's saying that the probability mass in the tails, in other words the probability that my value of phi hat is more than a gamma away from the true value, that the total mass – the total probability mass in these tails – is at most two e to the negative two gamma squared m. Okay? That's what the Hoeffding inequality says – so if you can't read that, this is just the right hand side of the bound, two e to the negative two gamma squared m. So this bounds the probability that you make a mistake in estimating the mean of a Bernoulli random variable.
And the cool thing about this bound – the interesting thing behind this bound is that it decreases exponentially in m, so it says that for a fixed value of gamma, as you increase the size of your training set, as you toss a coin more and more, then the width of this Gaussian will shrink. The width of this Gaussian will actually shrink like one over root m. And that will cause the probability mass left in the tails to decrease exponentially, quickly, as a function of m. And this will be important later. Yeah?
Student: Does this come from the central limit theorem [inaudible]?
Instructor (Andrew Ng): No, it doesn't. So this is proved by a different argument – this is proved – no – so the central limit theorem – there may be a version of the central limit theorem, but the versions I'm familiar with are sort of asymptotic, whereas this works for any finite value of m. Oh, and for your – this bound holds even if m is equal to two, or m is [inaudible]; if m is very small, the central limit theorem approximation is not gonna hold, but this theorem holds regardless. Okay? I'm drawing this just as a cartoon to help explain the intuition, but this theorem just holds true, without reference to the central limit theorem.
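To make the bound concrete, here is a small simulation sketch – not from the lecture itself, and the choice of phi, gamma, and the sample sizes is just for illustration – that compares the empirical tail probability that phi hat lands more than gamma away from phi against the Hoeffding bound 2 e^{-2 gamma^2 m}:

```python
import numpy as np

# Sketch: empirically compare P(|phi_hat - phi| > gamma) with the
# Hoeffding bound 2 * exp(-2 * gamma^2 * m) for Bernoulli(phi) samples.
# (Illustrative values only; not taken from the lecture.)
rng = np.random.default_rng(0)
phi, gamma = 0.3, 0.1      # true bias of the coin and the tolerance gamma
trials = 10000             # number of repeated experiments for each m

for m in [10, 50, 100, 500]:
    # Each row is one experiment of m coin tosses; phi_hat is the sample mean.
    phi_hat = rng.binomial(1, phi, size=(trials, m)).mean(axis=1)
    empirical = np.mean(np.abs(phi_hat - phi) > gamma)
    bound = 2 * np.exp(-2 * gamma**2 * m)
    print(f"m={m:4d}  empirical tail mass={empirical:.4f}  bound={bound:.4f}")
```

As m grows, both the empirical tail mass and the bound fall off quickly, which is the exponential shrinkage described above.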
All right. So let's start to understand empirical risk minimization, and what I want to do is begin with studying empirical risk minimization for a simpler case than logistic regression, and in particular I want to start with studying the case of finite hypothesis classes. So let's say script H is a class of K hypotheses. Right. So these are K functions with no parameters – each of these is just a function mapping from inputs to outputs; there are no parameters in this. And so what empirical risk minimization would do is it would take the training set, it'll then look at each of these K functions, and it'll pick whichever of these functions has the lowest training error. Okay?
So now, logistic regression uses an infinitely large – a continuous, infinitely large class of hypotheses, script H, but to describe our first learning theorem I actually want to do it for the case of when you have a finite hypothesis class, and then we'll later generalize to infinite hypothesis classes. So empirical risk minimization takes the hypothesis with the lowest training error, and what I'd like to do is prove a bound on the generalization error of h hat. All right. So in other words I'm gonna prove that somehow minimizing training error allows me to do well on generalization error.
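As a rough illustration of that procedure – a sketch I'm adding here, not code from the lecture, with a made-up finite class of threshold classifiers and toy data – empirical risk minimization over a finite class just evaluates each of the K fixed hypotheses on the training set and keeps the one with the lowest training error:

```python
import numpy as np

# Sketch: empirical risk minimization over a finite hypothesis class.
# The hypotheses are fixed functions with no parameters; ERM picks the one
# with the lowest 0-1 training error. (Toy example, not from the lecture.)

# A toy finite class of K = 3 threshold classifiers on a 1-D input.
hypotheses = [
    lambda x: (x > 0.25).astype(int),
    lambda x: (x > 0.50).astype(int),
    lambda x: (x > 0.75).astype(int),
]

def training_error(h, X, y):
    """Fraction of training examples that hypothesis h misclassifies."""
    return np.mean(h(X) != y)

def erm(hypotheses, X, y):
    """Return the hypothesis in the finite class with the lowest training error."""
    errors = [training_error(h, X, y) for h in hypotheses]
    best = int(np.argmin(errors))
    return hypotheses[best], errors[best]

# Toy training data: labels come from a 0.5 threshold with 5% label noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=200)
y = (X > 0.5).astype(int) ^ (rng.uniform(size=200) < 0.05).astype(int)

h_hat, train_err = erm(hypotheses, X, y)
print("training error of h_hat:", train_err)
```

The h_hat returned here plays the role of the hypothesis whose generalization error the upcoming bound is about.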