
And I know some of you may have seen some of what I'm about to do before, in an undergraduate linear algebra course, and the way it's typically done requires [inaudible] projections, or taking lots of derivatives and writing lots of algebra. What I'd like to do is show you a way to derive the closed-form solution for theta in just a few lines of algebra.

But to do that, I'll need to introduce a new notation for matrix derivatives. It turns out that the notation I'm about to define here has, in my own personal work, turned out to be one of the most useful things I use all the time. Having a notation for how to take derivatives with respect to matrices means you can solve for the minimum of J of theta with a few lines of algebra rather than writing out pages and pages of matrices and derivatives.

So then we're going to define this new notation first, and then we'll go ahead and work out the minimization. Given a function J, since J is a function of a vector of parameters theta, I'm going to define the derivative, the gradient of J with respect to theta, to itself be a vector. Okay, and so this is going to be an n plus one dimensional vector. Theta is an n plus one dimensional vector with indices ranging from zero to n. And so I'm going to define this derivative to be the vector of partial derivatives of J with respect to each component of theta.
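Written out, the definition being pointed to on the board would presumably read as follows (a reconstruction, since the board contents aren't in the transcript):

```latex
\nabla_\theta J =
\begin{bmatrix}
  \partial J / \partial \theta_0 \\
  \partial J / \partial \theta_1 \\
  \vdots \\
  \partial J / \partial \theta_n
\end{bmatrix}
\in \mathbb{R}^{n+1}
```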

Okay, and so we can actually rewrite the gradient descent algorithm as follows. This is batch gradient descent, and we write gradient descent as updating the parameter vector theta – notice there's no subscript i now – updating the parameter vector theta as its previous value minus alpha times the gradient of J.
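As an illustrative sketch (not part of the original lecture), here is what that vectorized update, theta := theta - alpha * gradient, could look like in NumPy, assuming the least-squares cost J(theta) = (1/2) ||X theta - y||^2 from earlier in the lecture; the function name and default step size are hypothetical:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Vectorized batch gradient descent for least squares (a sketch).

    X is the m x (n+1) design matrix (first column all ones for the
    intercept term theta_0); y is the m-vector of targets; alpha is
    the learning rate.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)           # theta is an (n+1)-dimensional vector
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y)     # gradient of J with respect to theta
        theta = theta - alpha * grad     # theta := theta - alpha * gradient
    return theta
```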

Okay, and so in this equation all of these quantities – theta and this gradient vector – are n plus one dimensional vectors. I was using the boards out of order, wasn't I? So more generally, suppose you have a function F that maps from, say, the space of n by n matrices to the space of real numbers. So you have a function F of A, where A is an n by n matrix.

So this function maps from matrices to real numbers; it's a function that takes a matrix as its input. Let me define the derivative of F with respect to the matrix A. Now, I'm just taking the gradient of F with respect to its input, which is a matrix, and I'm going to define this to itself be a matrix.
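Spelled out, the definition is the matrix of partial derivatives, one per entry of A (again a reconstruction of the board):

```latex
\nabla_A F(A) =
\begin{bmatrix}
  \frac{\partial F}{\partial A_{11}} & \cdots & \frac{\partial F}{\partial A_{1n}} \\
  \vdots & \ddots & \vdots \\
  \frac{\partial F}{\partial A_{n1}} & \cdots & \frac{\partial F}{\partial A_{nn}}
\end{bmatrix}
```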

Okay, so the derivative of F with respect to A is itself a matrix, and that matrix contains all the partial derivatives of F with respect to the elements of A. One more definition: if A is a square matrix – so if A is an n by n matrix, where the number of rows equals the number of columns – let me define the trace of A to be equal to the sum of A's diagonal elements. So this is just the sum over i of A sub i, i.

For those of you who haven't seen this sort of operator notation before, you can think of trace of A as the trace operator applied to the square matrix A, but it's more commonly written without the parentheses. So I usually write it as tr A, and this just means the sum of the diagonal elements.
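As a quick numerical check of this definition (an illustrative sketch, not part of the lecture), NumPy's np.trace computes exactly this sum of diagonal elements:

```python
import numpy as np

# tr(A) is the sum of A's diagonal elements, for a square (n x n) matrix A.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

assert np.trace(A) == A[0, 0] + A[1, 1]   # tr(A) = sum over i of A_ii
print(np.trace(A))                         # 5.0
```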

Source: OpenStax, Machine Learning. OpenStax CNX, Oct 14, 2013. Download for free at http://cnx.org/content/col11500/1.4