We could approach the classification problem ignoring the fact that $y$ is discrete-valued, and use our old linear regression algorithm to try to predict $y$ given $x$. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn't make sense for $h_\theta(x)$ to take values larger than 1 or smaller than 0 when we know that $y \in \{0, 1\}$.
To fix this, let's change the form for our hypotheses $h_\theta(x)$. We will choose

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}},$$

where

$$g(z) = \frac{1}{1 + e^{-z}}$$

is called the logistic function or the sigmoid function. Here is a plot showing $g(z)$:
Notice that $g(z)$ tends towards 1 as $z \to \infty$, and tends towards 0 as $z \to -\infty$. Moreover, $g(z)$, and hence also $h(x)$, is always bounded between 0 and 1. As before, we are keeping the convention of letting $x_0 = 1$, so that $\theta^T x = \theta_0 + \sum_{j=1}^{n} \theta_j x_j$.
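These properties are easy to see numerically. Here is a minimal sketch (not part of the original notes, assuming NumPy is available) of the sigmoid and its squashing behavior:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5: g crosses 1/2 at z = 0
print(sigmoid(10.0))   # close to 1 as z grows large
print(sigmoid(-10.0))  # close to 0 as z grows very negative
```

Any input, however large in magnitude, is mapped into the interval $(0, 1)$, which is exactly the property we wanted for modeling $y \in \{0, 1\}$.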
For now, let's take the choice of $g$ as given. Other functions that smoothly increase from 0 to 1 can also be used, but for a couple of reasons that we'll see later (when we talk about GLMs, and when we talk about generative learning algorithms), the choice of the logistic function is a fairly natural one. Before moving on, here's a useful property of the derivative of the sigmoid function, which we write as $g'$:

$$g'(z) = \frac{d}{dz}\,\frac{1}{1+e^{-z}} = \frac{1}{\left(1+e^{-z}\right)^2}\,e^{-z} = \frac{1}{1+e^{-z}}\cdot\left(1 - \frac{1}{1+e^{-z}}\right) = g(z)\left(1 - g(z)\right).$$
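As a quick sanity check (a sketch added here, not from the notes), the identity $g'(z) = g(z)(1-g(z))$ can be verified against a finite-difference approximation of the derivative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6

# Central finite difference approximation of g'(z).
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# The closed-form expression g(z) * (1 - g(z)).
analytic = sigmoid(z) * (1 - sigmoid(z))

# Prints a tiny number, confirming the identity numerically.
print(np.max(np.abs(numeric - analytic)))
```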
So, given the logistic regression model, how do we fit $\theta$ for it? Following how we saw least squares regression could be derived as the maximum likelihood estimator under a set of assumptions, let's endow our classification model with a set of probabilistic assumptions, and then fit the parameters via maximum likelihood.
Let us assume that

$$P(y = 1 \mid x; \theta) = h_\theta(x)$$
$$P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

Note that this can be written more compactly as

$$p(y \mid x; \theta) = \left(h_\theta(x)\right)^{y} \left(1 - h_\theta(x)\right)^{1-y}$$
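In code, the compact Bernoulli form looks like this (a small illustrative sketch, not from the notes; `bernoulli_pmf` is a hypothetical helper name):

```python
def bernoulli_pmf(y, h):
    """p(y | x; theta) = h^y * (1 - h)^(1 - y), for y in {0, 1}."""
    return (h ** y) * ((1 - h) ** (1 - y))

h = 0.8  # suppose h_theta(x) = 0.8 for some example x
print(bernoulli_pmf(1, h))  # picks out h       = P(y = 1 | x)
print(bernoulli_pmf(0, h))  # picks out 1 - h   = P(y = 0 | x)
```

The exponents act as a switch: when $y = 1$ the first factor survives, and when $y = 0$ the second does, so a single expression covers both cases.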
Assuming that the $m$ training examples were generated independently, we can then write down the likelihood of the parameters as

$$L(\theta) = p(\vec{y} \mid X; \theta) = \prod_{i=1}^{m} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right) = \prod_{i=1}^{m} \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}}$$
As before, it will be easier to maximize the log likelihood:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} y^{(i)} \log h\!\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h\!\left(x^{(i)}\right)\right)$$
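The log likelihood is straightforward to compute. Below is a minimal sketch (not from the notes; the dataset is made up for illustration, with $x_0 = 1$ as the intercept column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    h = sigmoid(X @ theta)
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Tiny made-up dataset: first column is x_0 = 1.
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])

theta = np.zeros(2)  # with theta = 0, h(x) = 0.5 for every example
print(log_likelihood(theta, X, y))  # 3 * log(0.5), about -2.079
```

Maximum likelihood then amounts to searching for the $\theta$ that makes this quantity as large as possible.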
How do we maximize the likelihood? Similar to our derivation in the case of linear regression, we can use gradient ascent. Written in vectorial notation, our updates will therefore be given by $\theta := \theta + \alpha \nabla_\theta \ell(\theta)$. (Note the positive rather than negative sign in the update formula, since we're maximizing, rather than minimizing, a function now.) Let's start by working with just one training example $(x, y)$, and take derivatives to derive the stochastic gradient ascent rule:

$$\begin{aligned}
\frac{\partial}{\partial \theta_j}\,\ell(\theta) &= \left(y\,\frac{1}{g(\theta^T x)} - (1-y)\,\frac{1}{1 - g(\theta^T x)}\right)\frac{\partial}{\partial \theta_j}\,g(\theta^T x)\\
&= \left(y\,\frac{1}{g(\theta^T x)} - (1-y)\,\frac{1}{1 - g(\theta^T x)}\right) g(\theta^T x)\left(1 - g(\theta^T x)\right)\frac{\partial}{\partial \theta_j}\,\theta^T x\\
&= \left(y\left(1 - g(\theta^T x)\right) - (1-y)\,g(\theta^T x)\right)x_j\\
&= \left(y - h_\theta(x)\right)x_j
\end{aligned}$$
Above, we used the fact that $g'(z) = g(z)(1 - g(z))$. This therefore gives us the stochastic gradient ascent rule

$$\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta\!\left(x^{(i)}\right)\right)x_j^{(i)}$$
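The update rule above can be sketched as a short training loop. This is an illustrative implementation (not from the notes; the toy dataset, learning rate, and epoch count are made up for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sga_logistic(X, y, alpha=0.1, epochs=200, seed=0):
    """Stochastic gradient ascent: theta += alpha * (y_i - h(x_i)) * x_i."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Visit the training examples one at a time, in random order.
        for i in rng.permutation(len(y)):
            h = sigmoid(X[i] @ theta)
            theta += alpha * (y[i] - h) * X[i]  # note the + sign: maximizing
    return theta

# Linearly separable toy data; first column is the intercept x_0 = 1.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = sga_logistic(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
print(preds)  # matches y on this toy set
```

Note that the per-coordinate rule $\theta_j := \theta_j + \alpha(y^{(i)} - h_\theta(x^{(i)}))x_j^{(i)}$ becomes a single vector update over all $j$ simultaneously.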
If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because $h_\theta(x^{(i)})$ is now defined as a non-linear function of $\theta^T x^{(i)}$. Nonetheless, it's a little surprising that we end up with the same update rule for a rather different algorithm and learning problem. Is this coincidence, or is there a deeper reason behind it? We'll answer this when we get to GLM models. (See also the extra credit problem on Q3 of problem set 1.)