The reader can easily verify that the quantity in the summation in the update rule above is just $\frac{\partial}{\partial \theta_j} J(\theta)$ (for the original definition of $J$). So, this is simply gradient descent on the original cost function $J$. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global optimum and no other local optima; thus gradient descent always converges (assuming the learning rate $\alpha$ is not too large) to the global minimum. Indeed, $J$ is a convex quadratic function. Here is an example of gradient descent as it is run to minimize a quadratic function.
The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48,30). The x's in the figure (joined by straight lines) mark the successive values of $\theta$ that gradient descent went through.
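To make the batch update concrete, here is a minimal sketch of batch gradient descent for linear regression in Python with NumPy. It is illustrative rather than taken from the notes: the function name, learning rate, iteration count, and the 1/m scaling of the gradient are all my own choices. The update it performs is gradient descent on the least-squares cost $J(\theta) = \frac{1}{2}\sum_i \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$, using every training example on each step.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Fit linear-regression parameters theta by batch gradient descent.

    X     : (m, n) design matrix; include a column of ones for the intercept term
    y     : (m,) vector of targets
    alpha : learning rate
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        # Gradient of J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2,
        # computed from the entire training set on every step (the "batch" update).
        grad = X.T @ (X @ theta - y)
        # Scaling by 1/m is a numerical convenience, not part of the cost above.
        theta -= (alpha / m) * grad
    return theta
```

Because $J$ is a convex quadratic, any sufficiently small fixed learning rate brings the returned theta arbitrarily close to the unique global minimum. For example, building X with a leading column of ones and a column of living areas and calling batch_gradient_descent(X, y) yields estimates of the intercept $\theta_0$ and the slope $\theta_1$.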
When we run batch gradient descent to fit $\theta$ on our previous dataset, to learn to predict housing price as a function of living area, we obtain fitted values for $\theta_0$ and $\theta_1$. If we plot $h_\theta(x)$ as a function of $x$ (area), along with the training data, we obtain the following figure:
If the number of bedrooms were included as one of the input features as well, we get fitted values for $\theta_0$, $\theta_1$, and $\theta_2$.
The above results were obtained with batch gradient descent. There is an alternative to batch gradient descent that also works very well. Consider the following algorithm:

Loop {
    for $i = 1$ to $m$, {
        $\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$   (for every $j$).
    }
}
In this algorithm, we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only. This algorithm is called stochastic gradient descent (also incremental gradient descent). Whereas batch gradient descent has to scan through the entire training set before taking a single step (a costly operation if $m$ is large), stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at. Often, stochastic gradient descent gets “close” to the minimum much faster than batch gradient descent. (Note however that it may never “converge” to the minimum, and the parameters $\theta$ will keep oscillating around the minimum of $J(\theta)$; but in practice most of the values near the minimum will be reasonably good approximations to the true minimum. While it is more common to run stochastic gradient descent as we have described it, with a fixed learning rate $\alpha$, it is also possible to ensure that the parameters converge to the global minimum, rather than merely oscillate around it, by slowly letting the learning rate decrease to zero as the algorithm runs.) For these reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.
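Below is a corresponding sketch of stochastic gradient descent in the same NumPy style. Again this is illustrative rather than from the notes: the shuffled visiting order and the simple $\alpha / t$ decay schedule are assumptions, just two of many reasonable choices. Each update uses only the single example currently being visited.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, num_epochs=50, decay=False, rng=None):
    """Fit linear-regression parameters theta by stochastic (incremental) gradient descent.

    Each update uses the gradient of the squared error on a single training
    example, so the parameters start moving before a full pass over the data.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = X.shape
    theta = np.zeros(n)
    t = 0
    for _ in range(num_epochs):                  # repeated passes through the training set
        for i in rng.permutation(m):             # visit the examples in a shuffled order
            t += 1
            lr = alpha / t if decay else alpha   # optionally let the rate shrink toward zero
            error = y[i] - X[i] @ theta          # y^(i) - h_theta(x^(i))
            theta += lr * error * X[i]           # update from this single example only
    return theta
```

With a fixed rate (decay=False), the parameters typically end up oscillating in a small neighborhood of the minimum of $J(\theta)$; passing decay=True shrinks the step size toward zero so that they settle instead, matching the discussion above.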