We could approach the classification problem ignoring the fact that $y$ is discrete-valued, and use our old linear regression algorithm to try to predict $y$ given $x$. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn't make sense for $h_\theta(x)$ to take values larger than 1 or smaller than 0 when we know that $y \in \{0, 1\}$.
To fix this, let's change the form for our hypotheses $h_\theta(x)$. We will choose

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}},$$

where

$$g(z) = \frac{1}{1 + e^{-z}}$$

is called the logistic function or the sigmoid function. Here is a plot showing $g(z)$:
Notice that $g(z)$ tends towards 1 as $z \to \infty$, and tends towards 0 as $z \to -\infty$. Moreover, $g(z)$, and hence also $h(x)$, is always bounded between 0 and 1. As before, we are keeping the convention of letting $x_0 = 1$, so that $\theta^T x = \theta_0 + \sum_{j=1}^{n} \theta_j x_j$.
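These properties are easy to see numerically. Here is a minimal sketch (not part of the original notes, assuming NumPy is available) of the sigmoid and its squashing behavior:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5: g crosses 1/2 at z = 0
print(sigmoid(10.0))   # close to 1 as z grows large
print(sigmoid(-10.0))  # close to 0 as z grows very negative
```

Any input, however large in magnitude, is mapped into the interval $(0, 1)$, which is exactly the property we wanted for modeling $y \in \{0, 1\}$.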
For now, let's take the choice of $g$ as given. Other functions that smoothly increase from 0 to 1 can also be used, but for a couple of reasons that we'll see later (when we talk about GLMs, and when we talk about generative learning algorithms), the choice of the logistic function is a fairly natural one. Before moving on, here's a useful property of the derivative of the sigmoid function, which we write as $g'$:

$$g'(z) = \frac{d}{dz}\,\frac{1}{1+e^{-z}} = \frac{1}{\left(1+e^{-z}\right)^2}\,e^{-z} = \frac{1}{1+e^{-z}}\cdot\left(1 - \frac{1}{1+e^{-z}}\right) = g(z)\left(1 - g(z)\right).$$
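As a quick sanity check (a sketch added here, not from the notes), the identity $g'(z) = g(z)(1-g(z))$ can be verified against a finite-difference approximation of the derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6

# Central finite difference approximation of g'(z).
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# The closed-form expression g(z) * (1 - g(z)).
analytic = sigmoid(z) * (1 - sigmoid(z))

# Prints a tiny number, confirming the identity numerically.
print(np.max(np.abs(numeric - analytic)))
```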
So, given the logistic regression model, how do we fit $\theta$ for it? Following how we saw least squares regression could be derived as the maximum likelihood estimator under a set of assumptions, let's endow our classification model with a set of probabilistic assumptions, and then fit the parameters via maximum likelihood.
Let us assume that

$$P(y = 1 \mid x; \theta) = h_\theta(x)$$
$$P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

Note that this can be written more compactly as

$$p(y \mid x; \theta) = \left(h_\theta(x)\right)^{y} \left(1 - h_\theta(x)\right)^{1-y}$$
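In code, the compact Bernoulli form looks like this (a small illustrative sketch, not from the notes; `bernoulli_pmf` is a hypothetical helper name):

```python
def bernoulli_pmf(y, h):
    """p(y | x; theta) = h^y * (1 - h)^(1 - y), for y in {0, 1}."""
    return (h ** y) * ((1 - h) ** (1 - y))

h = 0.8  # suppose h_theta(x) = 0.8 for some example x
print(bernoulli_pmf(1, h))  # picks out h       = P(y = 1 | x)
print(bernoulli_pmf(0, h))  # picks out 1 - h   = P(y = 0 | x)
```

The exponents act as a switch: when $y = 1$ the first factor survives, and when $y = 0$ the second does, so a single expression covers both cases.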
Assuming that the $m$ training examples were generated independently, we can then write down the likelihood of the parameters as

$$L(\theta) = p(\vec{y} \mid X; \theta) = \prod_{i=1}^{m} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right) = \prod_{i=1}^{m} \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}$$
As before, it will be easier to maximize the log likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} y^{(i)} \log h\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h\!\left(x^{(i)}\right)\right)$$
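The log likelihood is straightforward to compute. Below is a minimal sketch (not from the notes; the dataset is made up for illustration, with $x_0 = 1$ as the intercept column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Tiny made-up dataset: first column is x_0 = 1.
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])

theta = np.zeros(2)  # with theta = 0, h(x) = 0.5 for every example
print(log_likelihood(theta, X, y))  # 3 * log(0.5), about -2.079
```

Maximum likelihood then amounts to searching for the $\theta$ that makes this quantity as large as possible.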
How do we maximize the likelihood? Similar to our derivation in the case of linear regression, we can use gradient ascent. Written in vectorial notation, our updates will therefore be given by $\theta := \theta + \alpha \nabla_\theta \ell(\theta)$. (Note the positive rather than negative sign in the update formula, since we're maximizing, rather than minimizing, a function now.) Let's start by working with just one training example $(x, y)$, and take derivatives to derive the stochastic gradient ascent rule:

$$\begin{aligned}
\frac{\partial}{\partial \theta_j}\,\ell(\theta) &= \left(y\,\frac{1}{g(\theta^T x)} - (1-y)\,\frac{1}{1 - g(\theta^T x)}\right)\frac{\partial}{\partial \theta_j}\,g(\theta^T x)\\
&= \left(y\,\frac{1}{g(\theta^T x)} - (1-y)\,\frac{1}{1 - g(\theta^T x)}\right) g(\theta^T x)\left(1 - g(\theta^T x)\right)\frac{\partial}{\partial \theta_j}\,\theta^T x\\
&= \left(y\left(1 - g(\theta^T x)\right) - (1-y)\,g(\theta^T x)\right)x_j\\
&= \left(y - h_\theta(x)\right)x_j
\end{aligned}$$
Above, we used the fact that $g'(z) = g(z)(1 - g(z))$. This therefore gives us the stochastic gradient ascent rule

$$\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta\!\left(x^{(i)}\right)\right)x_j^{(i)}$$
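The update rule above can be sketched as a short training loop. This is an illustrative implementation (not from the notes; the toy dataset, learning rate, and epoch count are made up for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sga_logistic(X, y, alpha=0.1, epochs=200, seed=0):
    """Stochastic gradient ascent: theta += alpha * (y_i - h(x_i)) * x_i."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Visit the training examples one at a time, in random order.
        for i in rng.permutation(len(y)):
            h = sigmoid(X[i] @ theta)
            theta += alpha * (y[i] - h) * X[i]  # note the + sign: maximizing
    return theta

# Linearly separable toy data; first column is the intercept x_0 = 1.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = sga_logistic(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
print(preds)  # matches y on this toy set
```

Note that the per-coordinate rule $\theta_j := \theta_j + \alpha(y^{(i)} - h_\theta(x^{(i)}))x_j^{(i)}$ becomes a single vector update over all $j$ simultaneously.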
If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because $h_\theta(x^{(i)})$ is now defined as a non-linear function of $\theta^T x^{(i)}$. Nonetheless, it's a little surprising that we end up with the same update rule for a rather different algorithm and learning problem. Is this coincidence, or is there a deeper reason behind it? We'll answer this when we get to GLM models. (See also the extra credit problem on Q3 of problem set 1.)