The last step of this derivation used Jensen's inequality. Specifically, $f(x) = \log x$ is a concave function, since $f''(x) = -1/x^2 < 0$ over its domain $x \in \mathbb{R}^+$. Also, the term
$$\sum_{z^{(i)}} Q_i(z^{(i)}) \left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right]$$
in the summation is just an expectation of the quantity $\left[ p(x^{(i)}, z^{(i)}; \theta) / Q_i(z^{(i)}) \right]$ with respect to $z^{(i)}$ drawn according to the distribution given by $Q_i$. By Jensen's inequality, we have
$$f\left( \mathbb{E}_{z^{(i)} \sim Q_i}\left[ \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right] \right) \geq \mathbb{E}_{z^{(i)} \sim Q_i}\left[ f\left( \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} \right) \right],$$
where the “$z^{(i)} \sim Q_i$” subscripts above indicate that the expectations are with respect to $z^{(i)}$ drawn from $Q_i$. This allowed us to go from [link] to [link].
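To see this step concretely, here is a small numerical check of the inequality (the distribution and values below are arbitrary toy numbers, not anything from the derivation): for the concave function $f(x) = \log x$, the log of an expectation is never smaller than the expectation of the log.

```python
import numpy as np

# Toy check of Jensen's inequality for f(x) = log x: f(E[X]) >= E[f(X)] for positive X.
rng = np.random.default_rng(0)
Q = rng.random(5)
Q /= Q.sum()                     # a distribution Q over 5 values of z
ratio = rng.random(5) + 0.1      # positive values playing the role of p(x, z; theta) / Q(z)

lhs = np.log(np.sum(Q * ratio))  # f( E_{z ~ Q}[ p(x, z; theta) / Q(z) ] )
rhs = np.sum(Q * np.log(ratio))  # E_{z ~ Q}[ f( p(x, z; theta) / Q(z) ) ]
print(lhs >= rhs)                # True for any such Q and positive ratios
```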
Now, for any set of distributions $Q_i$, the formula [link] gives a lower-bound on $\ell(\theta)$. There are many possible choices for the $Q_i$'s. Which should we choose? Well, if we have some current guess $\theta$ of the parameters, it seems natural to try to make the lower-bound tight at that value of $\theta$. I.e., we'll make the inequality above hold with equality at our particular value of $\theta$. (We'll see later how this enables us to prove that $\ell(\theta)$ increases monotonically with successive iterations of EM.)
To make the bound tight for a particular value of $\theta$, we need the step involving Jensen's inequality in our derivation above to hold with equality. For this to be true, we know it is sufficient that the expectation be taken over a “constant”-valued random variable. I.e., we require that
$$\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$$
for some constant $c$ that does not depend on $z^{(i)}$. This is easily accomplished by choosing
$$Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta).$$
Actually, since we know $\sum_{z} Q_i(z^{(i)}) = 1$ (because it is a distribution), this further tells us that
$$Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta).$$
Thus, we simply set the $Q_i$'s to be the posterior distribution of the $z^{(i)}$'s given $x^{(i)}$ and the setting of the parameters $\theta$.
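As a concrete illustration of why this choice works, the following toy snippet (with made-up numbers standing in for $p(x^{(i)}, z; \theta)$, not anything from the notes) shows that normalizing the joint over $z$ makes the ratio $p(x^{(i)}, z^{(i)}; \theta)/Q_i(z^{(i)})$ constant in $z^{(i)}$, yields exactly the posterior, and makes the Jensen bound tight.

```python
import numpy as np

# One example x with a latent z taking 3 values; joint[z] stands in for p(x, z; theta).
joint = np.array([0.10, 0.25, 0.05])

Q = joint / joint.sum()       # Q(z) proportional to p(x, z; theta), normalized to sum to 1
print(joint / Q)              # the ratio p(x, z; theta) / Q(z) is the same constant for every z

# Normalizing the joint over z is Bayes' rule, so Q is the posterior p(z | x; theta),
# and with this choice the Jensen lower bound is tight:
log_lik = np.log(joint.sum())              # log p(x; theta)
bound = np.sum(Q * np.log(joint / Q))      # sum_z Q(z) log( p(x, z; theta) / Q(z) )
print(np.isclose(bound, log_lik))          # True
```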
Now, for this choice of the $Q_i$'s, Equation [link] gives a lower-bound on the log-likelihood $\ell$ that we're trying to maximize. This is the E-step. In the M-step of the algorithm, we then maximize our formula in Equation [link] with respect to the parameters to obtain a new setting of the $\theta$'s. Repeatedly carrying out these two steps gives us the EM algorithm, which is as follows:

Repeat until convergence {

(E-step) For each $i$, set
$$Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta).$$

(M-step) Set
$$\theta := \arg\max_{\theta} \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}.$$

}
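For concreteness, here is a minimal sketch of these two steps for a hypothetical one-dimensional mixture of two Gaussians. The notes state EM for a general model $p(x, z; \theta)$, so the specific model, parameterization, and all names below are illustrative assumptions rather than part of the notes.

```python
import numpy as np

# Hypothetical concrete instance: EM for a 1-D mixture of two Gaussians,
# with theta = (phi, mu, sigma2). The general algorithm does not assume this model.
def joint(x, phi, mu, sigma2):
    # joint[i, j] = p(x_i, z = j; theta) = phi_j * N(x_i; mu_j, sigma2_j)
    return phi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def log_likelihood(x, theta):
    # l(theta) = sum_i log sum_z p(x_i, z; theta)
    return np.log(joint(x, *theta).sum(axis=1)).sum()

def lower_bound(x, Q, theta):
    # sum_i sum_z Q_i(z) log( p(x_i, z; theta) / Q_i(z) )
    return (Q * np.log(joint(x, *theta) / Q)).sum()

def e_step(x, theta):
    # Q_i(z) = p(z | x_i; theta), obtained by normalizing the joint over z
    p = joint(x, *theta)
    return p / p.sum(axis=1, keepdims=True)

def m_step(x, Q):
    # Maximize the lower bound over theta; for this model the update is closed form.
    n_k = Q.sum(axis=0)
    phi = n_k / len(x)
    mu = (Q * x[:, None]).sum(axis=0) / n_k
    sigma2 = (Q * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
    return phi, mu, sigma2

def run_em(x, theta, n_iters=50):
    history = [log_likelihood(x, theta)]
    for _ in range(n_iters):
        Q = e_step(x, theta)                      # E-step
        theta = m_step(x, Q)                      # M-step
        history.append(log_likelihood(x, theta))
    return theta, history

# Example usage on synthetic data (also an assumption for illustration):
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.5, 300)])
theta0 = (np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
theta_hat, history = run_em(x, theta0)
```

In this sketch the M-step has a closed form because the weighted Gaussian maximum-likelihood updates maximize the lower bound exactly; for other models the M-step may itself require numerical optimization.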
How do we know if this algorithm will converge? Well, suppose $\theta^{(t)}$ and $\theta^{(t+1)}$ are the parameters from two successive iterations of EM. We will now prove that $\ell(\theta^{(t)}) \leq \ell(\theta^{(t+1)})$, which shows that EM always monotonically improves the log-likelihood. The key to showing this result lies in our choice of the $Q_i$'s. Specifically, on the iteration of EM in which the parameters had started out as $\theta^{(t)}$, we would have chosen $Q_i^{(t)}(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta^{(t)})$. We saw earlier that this choice ensures that Jensen's inequality, as applied to get [link], holds with equality, and hence
$$\ell(\theta^{(t)}) = \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}.$$
The parameters $\theta^{(t+1)}$ are then obtained by maximizing the right hand side of the equation above. Thus,
$$\ell(\theta^{(t+1)}) \geq \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})} \geq \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})} = \ell(\theta^{(t)}).$$
The first inequality holds because the lower bound [link] holds for any set of distributions $Q_i$ and any $\theta$, and in particular for $Q_i = Q_i^{(t)}$ and $\theta = \theta^{(t+1)}$. The second inequality holds because $\theta^{(t+1)}$ is chosen explicitly to maximize the right hand side over $\theta$, so it attains a value at least as large as the one at $\theta = \theta^{(t)}$.
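As a quick numerical check of this chain of inequalities, the snippet below reuses the hypothetical mixture-of-Gaussians sketch and the synthetic data `x` from the previous code block (so it is not standalone); the starting parameters `theta_t` are arbitrary.

```python
# Reuses e_step, m_step, lower_bound, log_likelihood, x, and history from the sketch above.
theta_t = (np.array([0.4, 0.6]), np.array([0.0, 1.0]), np.array([2.0, 2.0]))
Q_t = e_step(x, theta_t)                  # Q_i^(t)(z) = p(z | x_i; theta^(t))
theta_next = m_step(x, Q_t)               # theta^(t+1) maximizes the lower bound

assert np.isclose(lower_bound(x, Q_t, theta_t), log_likelihood(x, theta_t))  # equality at theta^(t)
assert lower_bound(x, Q_t, theta_next) >= lower_bound(x, Q_t, theta_t)       # M-step improves the bound
assert log_likelihood(x, theta_next) >= lower_bound(x, Q_t, theta_next)      # bound still holds at theta^(t+1)
assert log_likelihood(x, theta_next) >= log_likelihood(x, theta_t)           # hence l(theta^(t+1)) >= l(theta^(t))

# The full run from the sketch above is monotone as well:
assert all(b >= a - 1e-9 for a, b in zip(history, history[1:]))
```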