
It turns out that policy search algorithms are especially effective when you can choose a simple policy class pi. So the question really is, for your problem, does there exist a simple function, like a linear function or a logistic function, that maps from features of the state to the action and works pretty well? For a problem like the inverted pendulum, this is quite likely to be true: among the different choices of parameters, there is probably one that says something like, if the pole is leaning towards the right, then accelerate towards the right to try to catch it. The same holds for lots of what are called low-level control tasks. Think of driving a car, where the low-level reflexes are decisions like do you steer left to avoid another car, or do you steer left to follow the road; or flying a helicopter, again very short-time-scale decisions. I like to think of these as the decisions a trained operator makes, like a trained driver or a trained pilot; they are almost reflexes, very quick, instinctive things where you map very directly from the inputs to an action. These are problems for which you can probably choose a reasonable policy class, like a logistic function, and it will often work well. In contrast, there are problems that require long multi-step reasoning, things like a game of chess where you have to reason carefully: if I do this, then they'll do that, then I'll do this, then they'll do that. Those I think of as less instinctual, very high-level decision making. For problems like that, I would sometimes use value function approximation approaches instead.
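
To make that concrete, here is a minimal sketch (my own illustration, not from the lecture) of such a simple policy class for a two-action problem like the inverted pendulum: a logistic function of the state features whose output is the probability of accelerating to the right. The particular features and parameter values are just assumptions for the example.

```python
import numpy as np

def logistic_policy(theta, s):
    """Simple policy class: pi_theta(a = right | s) = 1 / (1 + exp(-theta^T s)).

    theta: parameter vector, same length as the state feature vector s.
    Returns the probability of the 'accelerate right' action.
    """
    return 1.0 / (1.0 + np.exp(-theta.dot(s)))

def sample_action(theta, s, rng):
    """Sample an action (1 = right, 0 = left) from the stochastic policy."""
    p_right = logistic_policy(theta, s)
    return 1 if rng.random() < p_right else 0

# Hypothetical cart-pole style state: [cart position, cart velocity, pole angle, pole angular velocity].
rng = np.random.default_rng(0)
theta = np.array([0.0, 0.0, 5.0, 1.0])   # assumed parameters that weight the pole angle heavily
s = np.array([0.0, 0.0, 0.1, 0.0])       # pole leaning slightly to the right
print(sample_action(theta, s, rng))       # most likely accelerates right to catch the pole
```

With a policy class this simple, searching over theta amounts to searching over a small family of reflex-like rules of exactly the kind described above.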

Let me say more about that later. As a side comment, it turns out that if you have a POMDP, a partially observable MDP, and you only have an approximation of the true state, call it s hat (for example s hat equal to the Kalman filter estimate s_{t|t}), then you can still use these sorts of policy search algorithms by choosing a policy pi_theta(s hat, a). There are various other ways to use policy search algorithms for POMDPs, but this is one of them: if you only have estimates of the state, you can choose a policy class that looks only at your estimate of the state to choose the action. As long as you estimate the state the same way at training time and at test time, this usually works reasonably well, so these sorts of policy search algorithms can often be applied quite effectively to POMDPs as well. There's one more algorithm I want to talk about, but first some final words on the REINFORCE algorithm. It turns out the REINFORCE algorithm often works well but is often extremely slow. So it does work, but one thing to watch out for is that the gradient ascent steps are very noisy: you sample a state-action sequence and then take a step in what is essentially a random direction, one that is correct only in expectation.
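
To make the point about noisy gradient steps concrete, here is a rough sketch (again my own, not from the lecture) of one REINFORCE update using the logistic policy above: roll out a single state-action sequence under the stochastic policy, weight the gradient of the log-probabilities by the total payoff, and take a small ascent step. The simulator interface (env_reset, env_step) is a hypothetical stand-in.

```python
import numpy as np

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a | s) for the logistic policy pi_theta(right | s) = sigmoid(theta^T s)."""
    p_right = 1.0 / (1.0 + np.exp(-theta.dot(s)))
    return (a - p_right) * s   # a is 1 for 'right', 0 for 'left'

def reinforce_update(theta, env_reset, env_step, horizon, learning_rate, rng):
    """One REINFORCE step: sample a trajectory, then ascend the payoff-weighted log-likelihood gradient.

    Assumed simulator interface: env_reset() -> initial state;
    env_step(s, a) -> (next state, reward, done).
    """
    s = env_reset()
    grads, total_payoff = [], 0.0
    for _ in range(horizon):
        p_right = 1.0 / (1.0 + np.exp(-theta.dot(s)))
        a = 1 if rng.random() < p_right else 0
        grads.append(grad_log_pi(theta, s, a))
        s, reward, done = env_step(s, a)
        total_payoff += reward
        if done:
            break
    # The step direction depends on a single sampled trajectory, so it is noisy;
    # only its expectation points along the gradient of the expected payoff.
    return theta + learning_rate * total_payoff * np.sum(grads, axis=0)
```

Because each update uses one sampled trajectory, the step direction is random and only correct on average, which is why the algorithm can work and yet be extremely slow.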





Source: OpenStax, Machine learning. OpenStax CNX. Oct 14, 2013. Download for free at http://cnx.org/content/col11500/1.4
