Gradient descent gives one way of minimizing $J$. Let's discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. In this method, we will minimize $J$ by explicitly taking its derivatives with respect to the $\theta_j$'s, and setting them to zero. To enable us to do this without having to write reams of algebra and pages full of matrices of derivatives, let's introduce some notation for doing calculus with matrices.
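For a function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ mapping from $m$-by-$n$ matrices to the real numbers, we define the derivative of $f$ with respect to $A$ to be:

$$
\nabla_A f(A) =
\begin{bmatrix}
\dfrac{\partial f}{\partial A_{11}} & \cdots & \dfrac{\partial f}{\partial A_{1n}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial f}{\partial A_{m1}} & \cdots & \dfrac{\partial f}{\partial A_{mn}}
\end{bmatrix}
$$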
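Thus, the gradient $\nabla_A f(A)$ is itself an $m$-by-$n$ matrix, whose $(i,j)$-element is $\partial f / \partial A_{ij}$. For example, suppose $A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$ is a 2-by-2 matrix, and the function $f : \mathbb{R}^{2 \times 2} \to \mathbb{R}$ is given by

$$
f(A) = \frac{3}{2} A_{11} + 5 A_{12}^2 + A_{21} A_{22}.
$$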
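Here, $A_{ij}$ denotes the $(i,j)$ entry of the matrix $A$. We then have

$$
\nabla_A f(A) =
\begin{bmatrix}
\dfrac{3}{2} & 10 A_{12} \\
A_{22} & A_{21}
\end{bmatrix}.
$$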
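As a sanity check on this example, the gradient can be compared against a finite-difference estimate. The following is a minimal sketch in plain Python; the helper names (`f`, `analytic_grad`, `numerical_grad`), the test point, and the step size `eps` are our own choices for illustration and are not part of the notes. Indices in the code are 0-based, so `A[0][1]` plays the role of $A_{12}$.

```python
def f(A):
    """Example function f(A) = (3/2) A11 + 5 A12^2 + A21 A22."""
    return 1.5 * A[0][0] + 5.0 * A[0][1] ** 2 + A[1][0] * A[1][1]


def analytic_grad(A):
    """Gradient of f as a 2-by-2 matrix of partial derivatives."""
    return [[1.5, 10.0 * A[0][1]],
            [A[1][1], A[1][0]]]


def numerical_grad(func, A, eps=1e-6):
    """Central-difference estimate of the matrix of partials d func / d A_ij."""
    rows, cols = len(A), len(A[0])
    grad = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            plus = [row[:] for row in A]    # copy of A with entry (i, j) nudged up
            minus = [row[:] for row in A]   # copy of A with entry (i, j) nudged down
            plus[i][j] += eps
            minus[i][j] -= eps
            grad[i][j] = (func(plus) - func(minus)) / (2 * eps)
    return grad


A = [[1.0, 2.0], [3.0, 4.0]]   # arbitrary test point (an assumption)
ana = analytic_grad(A)
num = numerical_grad(f, A)
max_err = max(abs(num[i][j] - ana[i][j]) for i in range(2) for j in range(2))
print(ana)       # analytic gradient at the test point
print(max_err)   # largest entrywise disagreement; should be close to zero
```

The same `numerical_grad` pattern works for any function of a matrix, which makes it a convenient way to test a hand-derived matrix derivative before relying on it.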
We also introduce the trace operator, written “$\operatorname{tr}$.” For an $n$-by-$n$ (square) matrix $A$, the trace of $A$ is defined to be the sum of its diagonal entries:

$$
\operatorname{tr} A = \sum_{i=1}^{n} A_{ii}
$$
If $a$ is a real number (i.e., a 1-by-1 matrix), then $\operatorname{tr} a = a$. (If you haven't seen this “operator notation” before, you should think of the trace of $A$ as $\operatorname{tr}(A)$, or as application of the “trace” function to the matrix $A$. It's more commonly written without the parentheses, however.)
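The trace operator has the property that for two matrices $A$ and $B$ such that $AB$ is square, we have that $\operatorname{tr} AB = \operatorname{tr} BA$. (Check this yourself!) As corollaries of this, we also have, e.g.,

$$
\begin{aligned}
\operatorname{tr} ABC &= \operatorname{tr} CAB = \operatorname{tr} BCA, \\
\operatorname{tr} ABCD &= \operatorname{tr} DABC = \operatorname{tr} CDAB = \operatorname{tr} BCDA.
\end{aligned}
$$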
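One way to carry out the suggested check is to write both traces out entrywise. The sketch below assumes $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times m}$ (these dimension names are our own choice), so that $AB$ is $m$-by-$m$ and $BA$ is $n$-by-$n$:

$$
\operatorname{tr} AB
= \sum_{i=1}^{m} (AB)_{ii}
= \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij} B_{ji}
= \sum_{j=1}^{n} \sum_{i=1}^{m} B_{ji} A_{ij}
= \sum_{j=1}^{n} (BA)_{jj}
= \operatorname{tr} BA.
$$

The cyclic identities for three or four matrices follow by grouping: for instance, treating $AB$ as a single matrix gives $\operatorname{tr} (AB)C = \operatorname{tr} C(AB)$.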
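The following properties of the trace operator are also easily verified. Here, $A$ and $B$ are square matrices, and $a$ is a real number:

$$
\begin{aligned}
\operatorname{tr} A &= \operatorname{tr} A^T \\
\operatorname{tr}(A + B) &= \operatorname{tr} A + \operatorname{tr} B \\
\operatorname{tr} aA &= a \operatorname{tr} A
\end{aligned}
$$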
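The trace identities discussed so far (cyclic invariance, invariance under transpose, additivity, and scaling) can be spot-checked on small integer matrices. This is only a sketch on hand-picked examples, not a proof; the helper functions `matmul`, `transpose`, and `trace` and all matrix values are assumptions made for illustration.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def trace(X):
    assert len(X) == len(X[0]), "trace is only defined for square matrices"
    return sum(X[i][i] for i in range(len(X)))


# A is 2-by-3 and B is 3-by-2, so AB is 2-by-2 while BA is 3-by-3.
A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C = [[1, 0], [2, -1]]

tr_AB = trace(matmul(A, B))
tr_BA = trace(matmul(B, A))

# Cyclic shifts of a three-matrix product.
tr_ABC = trace(matmul(matmul(A, B), C))
tr_CAB = trace(matmul(C, matmul(A, B)))
tr_BCA = trace(matmul(B, matmul(C, A)))

# Transpose, sum, and scalar-multiple properties on square matrices.
P = [[1, 2], [3, 4]]
Q = [[0, -1], [5, 2]]
a = 3
P_plus_Q = [[P[i][j] + Q[i][j] for j in range(2)] for i in range(2)]
a_times_P = [[a * P[i][j] for j in range(2)] for i in range(2)]

print(tr_AB, tr_BA)                  # 212 212
print(tr_ABC, tr_CAB, tr_BCA)        # three equal values
print(trace(P), trace(transpose(P)))
print(trace(P_plus_Q), trace(P) + trace(Q))
print(trace(a_times_P), a * trace(P))
```

Note that $AB$ and $BA$ have different sizes here, yet their traces agree, which is the point of the cyclic property.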
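We now state without proof some facts of matrix derivatives (we won't need some of these until later this quarter). Equation (4) applies only to non-singular square matrices $A$, where $|A|$ denotes the determinant of $A$. We have:

$$
\begin{aligned}
\nabla_A \operatorname{tr} AB &= B^T && (1) \\
\nabla_{A^T} f(A) &= \left( \nabla_A f(A) \right)^T && (2) \\
\nabla_A \operatorname{tr} ABA^TC &= CAB + C^TAB^T && (3) \\
\nabla_A |A| &= |A| \left( A^{-1} \right)^T && (4)
\end{aligned}
$$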
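Identities like these are easy to get subtly wrong (a missing transpose, say), so a numerical check can be reassuring. The sketch below tests the third fact, $\nabla_A \operatorname{tr} ABA^TC = CAB + C^TAB^T$, by finite differences. The shapes ($A$ is 2-by-3, $B$ is 3-by-3, $C$ is 2-by-2), the matrix entries, the step size, and all helper names are assumptions chosen for illustration.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def trace(X):
    return sum(X[i][i] for i in range(len(X)))


def add(X, Y):
    return [[X[i][j] + Y[i][j] for j in range(len(X[0]))] for i in range(len(X))]


def numerical_grad(func, A, eps=1e-5):
    """Central-difference estimate of the matrix of partials d func / d A_ij."""
    grad = [[0.0] * len(A[0]) for _ in range(len(A))]
    for i in range(len(A)):
        for j in range(len(A[0])):
            plus = [row[:] for row in A]
            minus = [row[:] for row in A]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad[i][j] = (func(plus) - func(minus)) / (2 * eps)
    return grad


B = [[1.0, 2.0, 0.0], [-1.0, 0.5, 3.0], [0.0, 1.0, -2.0]]   # fixed 3-by-3
C = [[2.0, -1.0], [0.5, 1.0]]                               # fixed 2-by-2


def f(A):
    """f(A) = tr(A B A^T C) for the fixed B and C above."""
    return trace(matmul(matmul(matmul(A, B), transpose(A)), C))


A = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]                  # 2-by-3 test point

# Right-hand side of the identity: C A B + C^T A B^T.
analytic = add(matmul(matmul(C, A), B),
               matmul(matmul(transpose(C), A), transpose(B)))
numeric = numerical_grad(f, A)
max_err = max(abs(numeric[i][j] - analytic[i][j])
              for i in range(2) for j in range(3))
print(max_err)   # should be close to zero
```

Deliberately using non-symmetric $B$ and $C$ matters here: with symmetric matrices, a misplaced transpose in the formula would go undetected.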
To make our matrix notation more concrete, let us now explain in detail the meaning of the first of these equations. Suppose we have some fixed matrix $B \in \mathbb{R}^{n \times m}$. We can then define a function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ according to $f(A) = \operatorname{tr} AB$. Note that this definition makes sense, because if $A \in \mathbb{R}^{m \times n}$, then $AB$ is a square matrix, and we can apply the trace operator to it; thus, $f$ does indeed map from $\mathbb{R}^{m \times n}$ to $\mathbb{R}$. We can then apply our definition of matrix derivatives to find $\nabla_A f(A)$, which will itself be an $m$-by-$n$ matrix. The first equation above states that the $(i,j)$ entry of this matrix will be given by the $(i,j)$-entry of $B^T$, or equivalently, by $B_{ji}$.
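To see where this comes from, it can help to expand the trace entrywise. A short sketch, using summation indices $k$ and $l$ that are our own notation:

$$
f(A) = \operatorname{tr} AB = \sum_{k=1}^{m} (AB)_{kk} = \sum_{k=1}^{m} \sum_{l=1}^{n} A_{kl} B_{lk}.
$$

The entry $A_{ij}$ appears in exactly one term of this double sum, namely the one with $k = i$ and $l = j$, so

$$
\frac{\partial f}{\partial A_{ij}} = B_{ji} = (B^T)_{ij},
$$

which is the statement $\nabla_A \operatorname{tr} AB = B^T$ read one entry at a time.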
The proofs of the first three equations above are reasonably simple, and are left as an exercise to the reader. The fourth equation can be derived using the adjoint representation of the inverse of a matrix. If we define $A'$ to be the matrix whose $(i,j)$ element is $(-1)^{i+j}$ times the determinant of the square matrix resulting from deleting row $i$ and column $j$ from $A$, then it can be proved that $A^{-1} = (A')^T / |A|$. (You can check that this is consistent with the standard way of finding $A^{-1}$ when $A$ is a 2-by-2 matrix. If you want to see a proof of this more general result, see an intermediate or advanced linear algebra text, such as Charles Curtis, 1991, Linear Algebra, Springer.) This shows that $A' = |A| (A^{-1})^T$. Also, the determinant of a matrix can be written $|A| = \sum_j A_{ij} A'_{ij}$ for any fixed row $i$. Since $A'_{ij}$ does not depend on $A_{ij}$ (as can be seen from its definition), this implies that $\frac{\partial}{\partial A_{ij}} |A| = A'_{ij}$. Putting all this together shows the result.
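The argument above can be traced numerically on a small example. The sketch below builds the cofactor matrix $A'$ directly from its definition, checks that $A (A')^T = |A| I$ (which is equivalent to $A^{-1} = (A')^T / |A|$), and compares $A'$ with a finite-difference estimate of the gradient of the determinant. The helper names, the 3-by-3 test matrix, and the step size are assumptions made for illustration, and the recursive determinant is only sensible for tiny matrices.

```python
def minor(A, i, j):
    """Matrix obtained by deleting row i and column j of A (0-based)."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(A) if r != i]


def det(A):
    """Determinant by cofactor expansion along the first row."""
    if len(A) == 0:
        return 1.0          # determinant of the empty matrix
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j)) for j in range(len(A)))


def cofactor_matrix(A):
    """A' with (i, j) entry (-1)^(i+j) times the determinant of the minor."""
    n = len(A)
    return [[(-1) ** (i + j) * det(minor(A, i, j)) for j in range(n)]
            for i in range(n)]


def numerical_grad(func, A, eps=1e-6):
    """Central-difference estimate of the matrix of partials d func / d A_ij."""
    n = len(A)
    grad = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            plus = [row[:] for row in A]
            minus = [row[:] for row in A]
            plus[i][j] += eps
            minus[i][j] -= eps
            grad[i][j] = (func(plus) - func(minus)) / (2 * eps)
    return grad


A = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]]   # non-singular test matrix
n = len(A)
d = det(A)
A_prime = cofactor_matrix(A)

# A (A')^T should equal |A| times the identity.
prod = [[sum(A[i][k] * A_prime[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)]
adj_err = max(abs(prod[i][j] - (d if i == j else 0.0))
              for i in range(n) for j in range(n))

# The gradient of the determinant should match A' entry by entry.
numeric = numerical_grad(det, A)
grad_err = max(abs(numeric[i][j] - A_prime[i][j])
               for i in range(n) for j in range(n))

print(d)          # determinant of the test matrix
print(adj_err)    # should be close to zero
print(grad_err)   # should be close to zero
```

Because the code uses 0-based indices, the sign $(-1)^{i+j}$ has the same parity as in the 1-based definition in the text, so no adjustment is needed.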