And so taking derivatives, you have one-half something squared. So the two comes down. So you have two times one-half times h theta of X minus Y, and then by the chain rule of derivatives, we also must multiply this by the derivative of what's inside the square. Right, the two and the one-half cancel. So this leaves [inaudible] times that, theta zero, X zero plus [inaudible].
Okay, and if you look inside this sum, we're taking the partial derivative of this sum with respect to the parameter theta I. But all the terms in the sum, except for one, do not depend on theta I. In this sum, the only term that depends on theta I will be some term here of theta I, X I. And so we take the partial derivative with respect to theta I, X I – take the partial derivative with respect to theta I of this term theta I, X I, and so you get that times X I.
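Written out, the derivation described above might look like the following sketch. It assumes a single training example (x, y), the cost J(θ) = ½(h_θ(x) − y)², and a linear hypothesis h_θ(x) = θ_0 x_0 + … + θ_n x_n; the symbol n for the number of features is an assumption made here for illustration.

```latex
\begin{align*}
\frac{\partial}{\partial\theta_i} J(\theta)
  &= \frac{\partial}{\partial\theta_i}\,\frac{1}{2}\bigl(h_\theta(x)-y\bigr)^2 \\
  &= 2\cdot\frac{1}{2}\,\bigl(h_\theta(x)-y\bigr)\cdot
     \frac{\partial}{\partial\theta_i}\bigl(h_\theta(x)-y\bigr) \\
  &= \bigl(h_\theta(x)-y\bigr)\cdot
     \frac{\partial}{\partial\theta_i}\bigl(\theta_0 x_0+\dots+\theta_n x_n-y\bigr) \\
  &= \bigl(h_\theta(x)-y\bigr)\,x_i
\end{align*}
```

The last step uses the fact that θ_i x_i is the only term in the sum that depends on θ_i.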
Okay, and so this gives us our learning rule, right, of theta I gets updated as theta I minus alpha times that. Okay, and this Greek alphabet alpha here is a parameter of the algorithm called the learning rate, and this parameter alpha controls how large a step you take. So you're standing on the hill. You decided what direction to take a step in, and so this parameter alpha controls how aggressive – how large a step you take in this direction of steepest descent.
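As a rough illustration of this update rule, here is a minimal Python sketch of one gradient descent step on a single training example. The function name `lms_update`, the toy values, and the convention that `x[0] = 1` is the intercept term are assumptions made for the example, not something stated in the lecture.

```python
def lms_update(theta, x, y, alpha):
    """Apply theta_i := theta_i - alpha * (h(x) - y) * x_i for every i."""
    # Linear hypothesis: h(x) = theta_0 * x_0 + ... + theta_n * x_n
    prediction = sum(t * xi for t, xi in zip(theta, x))
    error = prediction - y
    # Every parameter is updated using the same error term.
    return [t - alpha * error * xi for t, xi in zip(theta, x)]


# Hypothetical toy example: x[0] = 1 plays the role of the intercept feature.
theta = [0.0, 0.0]
x = [1.0, 2.0]
y = 3.0
theta = lms_update(theta, x, y, alpha=0.1)
print(theta)
```

Because the prediction starts below the target, the error is negative and both parameters move upward.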
And so if you – and this is a parameter of the algorithm that's often set by hand. If you choose alpha to be too small, then your steepest descent algorithm will take very tiny steps and take a long time to converge. If alpha is too large, then the steepest descent may actually end up overshooting the minimum, if you're taking too aggressive a step.
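To make the effect of the learning rate concrete, the sketch below runs gradient descent on a one-parameter problem with three values of alpha. The example data, the function name `run_gradient_descent`, and the particular alpha values are all hypothetical choices made for illustration.

```python
def run_gradient_descent(alpha, steps=20):
    """Minimize J(theta) = 0.5 * (theta * x - y)**2 for one fixed example."""
    x, y = 2.0, 4.0          # hypothetical example; the minimum is at theta = 2
    theta = 0.0
    history = []
    for _ in range(steps):
        gradient = (theta * x - y) * x   # dJ/dtheta
        theta = theta - alpha * gradient
        history.append(theta)
    return history


small = run_gradient_descent(alpha=0.001)  # tiny steps, slow progress
good = run_gradient_descent(alpha=0.1)     # settles close to theta = 2
large = run_gradient_descent(alpha=0.6)    # jumps past the minimum each step

for name, history in [("small", small), ("good", good), ("large", large)]:
    print(name, "distance from minimum after 20 steps:", abs(history[-1] - 2.0))
```

In this toy setup the smallest alpha is still far from the minimum after 20 steps, while the largest alpha overshoots on the first step and moves farther away on each subsequent one.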
Yeah?
Student: [Inaudible].
Instructor (Andrew Ng): Say that again?
Student: Isn't there a one over two missing somewhere?
Instructor (Andrew Ng): Is there a one-half missing?
Student: I was [inaudible].
Instructor (Andrew Ng): Thanks. I do make lots of errors in that. Any questions about this?
All right, so let me just wrap this up properly into an algorithm. So over there I derived the algorithm where you have just one training example. More generally, for M training examples, gradient descent becomes the following. We're going to repeat until convergence the following step.
Okay, theta I gets updated as theta I and I'm just writing out the appropriate equation for M examples rather than one example. Theta I gets updated. Theta I minus alpha times the sum from J equals one to M. Okay, and I won't bother to show it, but you can go home and sort of verify for yourself that this summation here, this is indeed the partial derivative with respect to theta I of J of theta, if you use the original definition of J of theta for when you have M training examples.
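The M-example version of the algorithm could be sketched in Python roughly as follows. The function name `batch_gradient_descent`, the stopping test (stop when no parameter changes by more than `tol`), the default values, and the toy data are assumptions for this example; the lecture only says "repeat until convergence". The code uses separate loop variables for the parameter index and for the training examples.

```python
def batch_gradient_descent(X, y, alpha=0.05, tol=1e-9, max_iters=100_000):
    """X is a list of feature lists (with x[0] = 1); y is a list of targets."""
    n = len(X[0])
    theta = [0.0] * n
    for _ in range(max_iters):
        # Error h_theta(x) - y for each of the M training examples.
        errors = [sum(t * xi for t, xi in zip(theta, row)) - target
                  for row, target in zip(X, y)]
        # Partial derivative of J with respect to theta_i: sum over all examples.
        gradient = [sum(e * row[i] for e, row in zip(errors, X))
                    for i in range(n)]
        new_theta = [t - alpha * g for t, g in zip(theta, gradient)]
        # Assumed convergence test: parameters have essentially stopped moving.
        if max(abs(a - b) for a, b in zip(new_theta, theta)) < tol:
            return new_theta
        theta = new_theta
    return theta


# Hypothetical toy data generated from y = 1 + 2 * x, with x[0] = 1 as intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = batch_gradient_descent(X, y)
print(theta)
```

Each iteration looks at every training example before changing any parameter, which is why the sum over the M examples appears inside the update.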
Okay, so I'm just going to show – switch back to the laptop display. I'm going to show you what this looks like when you run the algorithm. So it turns out that for the specific problem of linear regression, or ordinary least squares, which is what we're doing today, the function J of theta actually does not look like this nasty one that I'll show you just now with multiple local optima.