Here, $\nabla_\theta \ell(\theta)$ is, as usual, the vector of partial derivatives of $\ell(\theta)$ with respect to the $\theta_i$'s; and $H$ is an $n$-by-$n$ matrix (actually, $(n+1)$-by-$(n+1)$, assuming that we include the intercept term) called the Hessian, whose entries are given by

$$H_{ij} = \frac{\partial^2 \ell(\theta)}{\partial \theta_i \, \partial \theta_j}.$$
Newton's method typically enjoys faster convergence than (batch) gradient descent, and requires many fewer iterations to get very close to the minimum. One iteration of Newton's can, however, be more expensive than one iteration of gradient descent, since it requires finding and inverting an $n$-by-$n$ Hessian; but so long as $n$ is not too large, it is usually much faster overall. When Newton's method is applied to maximize the logistic regression log likelihood function $\ell(\theta)$, the resulting method is also called Fisher scoring.
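As a concrete sketch of this procedure (not part of the original notes; the function names and toy data below are ours), Newton's method for the logistic regression log likelihood can be written in a few lines of NumPy. Each iteration forms the gradient and Hessian of $\ell(\theta)$ and applies the update $\theta := \theta - H^{-1}\nabla_\theta \ell(\theta)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, num_iters=10):
    """Maximize the logistic regression log likelihood via Newton's method.

    X: (m, n) design matrix; include a column of ones for the intercept term.
    y: (m,) labels in {0, 1}.
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)          # h_theta(x^(i)) for each example
        grad = X.T @ (y - h)            # gradient of the log likelihood
        W = h * (1.0 - h)               # per-example weights h(1 - h)
        H = -(X.T * W) @ X              # Hessian (negative definite here)
        # Newton update: theta := theta - H^{-1} grad, via a linear solve
        # rather than an explicit matrix inverse.
        theta = theta - np.linalg.solve(H, grad)
    return theta
```

Solving the $n$-by-$n$ linear system is the expensive step each iteration, which is the cost/benefit trade-off against gradient descent described above.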
So far, we've seen a regression example, and a classification example. In the regression example, we had $y \mid x; \theta \sim \mathcal{N}(\mu, \sigma^2)$, and in the classification one, $y \mid x; \theta \sim \text{Bernoulli}(\phi)$, for some appropriate definitions of $\mu$ and $\phi$ as functions of $x$ and $\theta$. In this section, we will show that both of these methods are special cases of a broader family of models, called Generalized Linear Models (GLMs). We will also show how other models in the GLM family can be derived and applied to other classification and regression problems.
To work our way up to GLMs, we will begin by defining exponential family distributions. We say that a class of distributions is in the exponential family if it can be written in the form

$$p(y; \eta) = b(y) \exp\left(\eta^T T(y) - a(\eta)\right).$$
Here, $\eta$ is called the natural parameter (also called the canonical parameter) of the distribution; $T(y)$ is the sufficient statistic (for the distributions we consider, it will often be the case that $T(y) = y$); and $a(\eta)$ is the log partition function. The quantity $e^{-a(\eta)}$ essentially plays the role of a normalization constant, that makes sure the distribution $p(y; \eta)$ sums/integrates over $y$ to 1.
A fixed choice of $T$, $a$ and $b$ defines a family (or set) of distributions that is parameterized by $\eta$; as we vary $\eta$, we then get different distributions within this family.
We now show that the Bernoulli and the Gaussian distributions are examples of exponential family distributions. The Bernoulli distribution with mean $\phi$, written $\text{Bernoulli}(\phi)$, specifies a distribution over $y \in \{0, 1\}$, so that $p(y = 1; \phi) = \phi$; $p(y = 0; \phi) = 1 - \phi$. As we vary $\phi$, we obtain Bernoulli distributions with different means. We now show that this class of Bernoulli distributions, ones obtained by varying $\phi$, is in the exponential family; i.e., that there is a choice of $T$, $a$ and $b$ so that Equation [link] becomes exactly the class of Bernoulli distributions.
We write the Bernoulli distribution as:

$$\begin{aligned}
p(y; \phi) &= \phi^y (1 - \phi)^{1-y} \\
&= \exp\left(y \log \phi + (1 - y) \log(1 - \phi)\right) \\
&= \exp\left(\left(\log\left(\frac{\phi}{1-\phi}\right)\right) y + \log(1 - \phi)\right).
\end{aligned}$$
Thus, the natural parameter is given by $\eta = \log(\phi / (1 - \phi))$. Interestingly, if we invert this definition for $\eta$ by solving for $\phi$ in terms of $\eta$, we obtain $\phi = 1/(1 + e^{-\eta})$. This is the familiar sigmoid function! This will come up again when we derive logistic regression as a GLM. To complete the formulation of the Bernoulli distribution as an exponential family distribution, we also have

$$\begin{aligned}
T(y) &= y \\
a(\eta) &= -\log(1 - \phi) = \log(1 + e^\eta) \\
b(y) &= 1.
\end{aligned}$$
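The identification above is easy to check numerically. The following sketch (our own illustration, not from the notes) evaluates the Bernoulli pmf both directly and through the exponential family form $b(y)\exp(\eta T(y) - a(\eta))$ with $\eta = \log(\phi/(1-\phi))$, $T(y) = y$, $a(\eta) = \log(1 + e^\eta)$, and $b(y) = 1$:

```python
import numpy as np

def bernoulli_pmf(y, phi):
    # Direct form: p(y; phi) = phi^y (1 - phi)^(1 - y), for y in {0, 1}
    return phi**y * (1.0 - phi)**(1.0 - y)

def exp_family_bernoulli(y, phi):
    # Exponential family form: p(y; eta) = b(y) exp(eta * T(y) - a(eta))
    eta = np.log(phi / (1.0 - phi))   # natural parameter
    T = y                              # sufficient statistic T(y) = y
    a = np.log(1.0 + np.exp(eta))      # log partition: -log(1 - phi)
    b = 1.0                            # b(y) = 1
    return b * np.exp(eta * T - a)
```

Both forms agree for any $\phi \in (0, 1)$ and $y \in \{0, 1\}$, and inverting $\eta$ through the sigmoid $1/(1 + e^{-\eta})$ recovers $\phi$, as claimed above.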