In detail, the algorithm is as follows:

1. Randomly sample $m$ states $s^{(1)}, s^{(2)}, \ldots, s^{(m)} \in S$.
2. Initialize $\theta := 0$.
3. Repeat {
   For $i = 1, \ldots, m$ {
     For each action $a \in A$ {
       Sample $s'_1, \ldots, s'_k \sim P_{s^{(i)}a}$ (using a model of the MDP).
       Set $q(a) = \frac{1}{k} \sum_{j=1}^{k} R(s^{(i)}) + \gamma V(s'_j)$.
       (Hence, $q(a)$ is an estimate of $R(s^{(i)}) + \gamma \mathrm{E}_{s' \sim P_{s^{(i)}a}}[V(s')]$.)
     }
     Set $y^{(i)} = \max_a q(a)$.
     (Hence, $y^{(i)}$ is an estimate of $R(s^{(i)}) + \gamma \max_a \mathrm{E}_{s' \sim P_{s^{(i)}a}}[V(s')]$.)
   }
   Set $\theta := \arg\min_\theta \frac{1}{2} \sum_{i=1}^{m} \left( \theta^T \phi(s^{(i)}) - y^{(i)} \right)^2$.
   }

Here $V(s) = \theta^T \phi(s)$, where $\phi$ is a feature mapping of the states.
Above, we had written out fitted value iteration using linear regression as the algorithm to try to make $V(s^{(i)})$ close to $y^{(i)}$. That step of the algorithm is completely analogous to a standard supervised learning (regression) problem in which we have a training set $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m)}, y^{(m)})$, and want to learn a function mapping from $x$ to $y$; the only difference is that here $s$ plays the role of $x$. Even though our description above used linear regression, clearly other regression algorithms (such as locally weighted linear regression) can also be used.
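To make the loop above concrete, here is a minimal Python/NumPy sketch of fitted value iteration with a linear value function $V(s) = \theta^T \phi(s)$. The feature map `phi`, reward function `R`, and simulator `sample_next_state` (which draws $s' \sim P_{sa}$) are hypothetical stand-ins for whatever the problem provides; this is an illustration of the idea, not a reference implementation.

```python
import numpy as np

def fitted_value_iteration(states, actions, phi, R, sample_next_state,
                           gamma=0.99, k=10, n_iters=50):
    """Sketch of fitted value iteration with V(s) = theta^T phi(s).
    `states` holds the m sampled states; `sample_next_state(s, a)`
    draws s' ~ P_{sa} from a model/simulator of the MDP."""
    m = len(states)
    Phi = np.array([phi(s) for s in states])      # m x d design matrix
    theta = np.zeros(Phi.shape[1])

    V = lambda s: theta @ phi(s)                  # current value estimate

    for _ in range(n_iters):
        y = np.empty(m)
        for i, s in enumerate(states):
            q = []
            for a in actions:
                # q(a) estimates R(s) + gamma * E_{s' ~ P_{sa}}[V(s')]
                samples = [sample_next_state(s, a) for _ in range(k)]
                q.append(np.mean([R(s) + gamma * V(sp) for sp in samples]))
            y[i] = max(q)                         # y^(i) = max_a q(a)
        # Supervised learning step: least-squares fit of theta to the targets y
        theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return theta
```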
Unlike value iteration over a discrete set of states, fitted value iteration cannot be proved to always converge. However, in practice, it often does converge (or approximately converge), and works well for many problems. Note also that if we are using a deterministic simulator/model of the MDP, then fitted value iteration can be simplified by setting $k = 1$ in the algorithm. This is because the expectation over $s' \sim P_{s^{(i)}a}$ becomes an expectation over a deterministic distribution, and so a single sample is sufficient to exactly compute that expectation. Otherwise, in the algorithm above, we had to draw $k$ samples and average to try to approximate that expectation (see the definition of $q(a)$ in the algorithm pseudo-code).
Finally, fitted value iteration outputs $V$, which is an approximation to $V^*$. This implicitly defines our policy. Specifically, when our system is in some state $s$, and we need to choose an action, we would like to choose the action

$$\arg\max_a \mathrm{E}_{s' \sim P_{sa}}[V(s')].$$
The process for computing/approximating this is similar to the inner-loop of fitted value iteration, where for each action $a$ we sample $s'_1, \ldots, s'_k \sim P_{sa}$ to approximate the expectation. (And again, if the simulator is deterministic, we can set $k = 1$.)
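Continuing with the same hypothetical names (`phi`, `sample_next_state`, and a fitted `theta`), this action-selection step might be sketched as:

```python
import numpy as np

def choose_action(s, actions, theta, phi, sample_next_state, k=10):
    """Pick arg max_a E_{s' ~ P_{sa}}[V(s')], approximating each
    expectation by averaging V over k sampled next states."""
    V = lambda sp: theta @ phi(sp)
    return max(actions,
               key=lambda a: np.mean([V(sample_next_state(s, a))
                                      for _ in range(k)]))
```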
In practice, there are often other ways to approximate this step as well. For example, one very common case is if the simulator is of the form $s_{t+1} = f(s_t, a_t) + \epsilon_t$, where $f$ is some deterministic function of the states (such as $f(s_t, a_t) = A s_t + B a_t$), and $\epsilon_t$ is zero-mean Gaussian noise. In this case, we can pick the action given by

$$\arg\max_a V(f(s, a)).$$
In other words, here we are just setting $\epsilon_t = 0$ (i.e., ignoring the noise in the simulator), and setting $k = 1$. Equivalently, this can be derived from the expression $\arg\max_a \mathrm{E}_{s' \sim P_{sa}}[V(s')]$ above using the approximation

$$\mathrm{E}_{s'}[V(s')] \approx V(\mathrm{E}_{s'}[s']) = V(f(s, a)),$$
where here the expectation is over the random $s' \sim P_{sa}$. So long as the noise terms $\epsilon_t$ are small, this will usually be a reasonable approximation.
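Assuming a deterministic mean-dynamics function `f(s, a)` is available (again a hypothetical name), this approximation reduces action selection to a single evaluation of $V$ per action, for instance:

```python
def choose_action_deterministic(s, actions, theta, phi, f):
    """Pick arg max_a V(f(s, a)): ignore the noise term and evaluate
    the fitted value function at the mean next state f(s, a)."""
    V = lambda sp: theta @ phi(sp)
    return max(actions, key=lambda a: V(f(s, a)))
```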
However, for problems that don't lend themselves to such approximations, having to sample $k \cdot |A|$ states using the model, in order to approximate the expectation above, can be computationally expensive.