The gradient ascent direction for REINFORCE can sometimes be a bit noisy, so it’s not uncommon to need a million, ten million, or even 100 million iterations of gradient ascent for REINFORCE [inaudible]. That’s just something to watch out for. One consequence of that: in the REINFORCE algorithm (I shouldn’t really call it REINFORCE; in what’s essentially the REINFORCE algorithm) there’s this step where you need to sample a state-action sequence. In principle you could do this on your own robot. If there were a robot you were trying to control, you could physically initialize it in some state, pick an action, and so on, and go from there to sample a state-action sequence. But if you need to do this ten million times, you probably don’t want to [inaudible] your robot ten million times. You can easily run ten thousand or ten million simulations of your robot in a simulator, but you might not want to have your robot physically repeat some action ten million times. So I personally have seen many more applications of REINFORCE that learn using a simulator than ones that actually do this on a physical device.
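To make the sampling step concrete, here is a minimal sketch of drawing many state-action sequences from a simulator rather than a physical robot. Everything in it is an assumption made for illustration: the one-dimensional dynamics in `simulate_step`, the logistic form of `stochastic_policy`, the initial state, and the parameter value are all hypothetical and are not taken from the lecture.

```python
import math
import random

def stochastic_policy(theta, state, rng):
    """Hypothetical logistic policy: choose +1 with probability sigmoid(theta * state)."""
    p_plus = 1.0 / (1.0 + math.exp(-theta * state))
    return 1 if rng.random() < p_plus else -1

def simulate_step(state, action, rng):
    """Toy stand-in for a robot simulator: noisy 1-D dynamics (an assumption)."""
    return state + 0.1 * action + rng.gauss(0.0, 0.05)

def sample_trajectory(theta, horizon, rng):
    """Sample one state-action sequence (s0, a0), (s1, a1), ... of length `horizon`."""
    state = 1.0  # arbitrary initial state, chosen for illustration
    trajectory = []
    for _ in range(horizon):
        action = stochastic_policy(theta, state, rng)
        trajectory.append((state, action))
        state = simulate_step(state, action, rng)
    return trajectory

# Ten thousand simulated rollouts are cheap; the same count on hardware would not be.
rng = random.Random(0)
trajectories = [sample_trajectory(theta=-2.0, horizon=20, rng=rng)
                for _ in range(10_000)]
```

Each call to `sample_trajectory` plays the role of one physical trial, which is why the iteration counts mentioned above are practical in simulation.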
The last thing I wanted to do is tell you about one other algorithm, one final policy search algorithm. [Inaudible] the laptop display please. It’s a policy search algorithm called Pegasus, which is actually what we used for our autonomous helicopter flight work for many years. There are some other things we do now. So here’s the idea. There’s a reminder slide on the RL formalism. There’s nothing here that you don’t know, but I just want to pictorially describe the RL formalism because I’ll use that later. I’m gonna draw the reinforcement learning picture as follows. You initialize the [inaudible] system, say a helicopter or whatever, in some state S0, you choose an action A0, and then, say, the helicopter dynamics take you to some new state S1, you choose some other action A1, and so on. And then you have some reward function, which you apply to the sequence of states and sum up, and that’s your total payoff.
So this is just a picture I wanna use to summarize the RL problem. Our goal is to maximize the expected payoff, which is this, the expected sum of the rewards. And our goal is to learn the policy, which I denote by a green box. So our policy – and I’ll switch back to deterministic policies for now. So my deterministic policy will be some function mapping from the states to the actions.
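Written out, the objective described here might look like the following. The symbols are assumptions layered on the spoken description: $R$ for the reward function, $T$ for the last time step, and $\pi$ for the deterministic policy.

```latex
\max_{\pi} \; \mathrm{E}\big[ R(s_0) + R(s_1) + \cdots + R(s_T) \big],
\qquad \pi : S \to A, \quad a_t = \pi(s_t)
```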
As a concrete example, imagine that in the policy search setting you have a linear class of policies. So you may imagine that the action A will be, say, a linear function of the state, and your goal is to learn the parameters of the linear function. So imagine trying to do linear regression on policies, except you’re trying to optimize the reinforcement learning objective. So just [inaudible] imagine that the action A is theta transpose S, and you run policy search to come up with good parameters theta so as to maximize the expected payoff. That would be one setting in which this picture applies. Here’s the idea. Quite often we come up with a model or a simulator for the MDP, and as before a model or a simulator is just a box that takes as input some state ST, takes as input some action AT, and then outputs some [inaudible] state ST plus one that you might transition to in the MDP. This ST plus one will be a random state. It will be drawn from the random state transition probabilities of the MDP. This is important. Very important: ST plus one will be a random function of ST and AT. In the simulator, this is [inaudible].
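The picture of a linear policy feeding a stochastic simulator can be sketched as below. This is only an illustration under stated assumptions: the two-dimensional (position, velocity) state, the dynamics inside `simulator`, the quadratic `reward`, and the values of `theta` and `s0` are all hypothetical. The expected payoff is approximated by averaging over simulated rollouts.

```python
import random

def linear_policy(theta, state):
    """Deterministic linear policy: a = theta^T s."""
    return sum(t * s for t, s in zip(theta, state))

def simulator(state, action, rng):
    """Toy stochastic model: maps (s_t, a_t) to a random next state s_{t+1}.
    The dynamics are invented for illustration."""
    position, velocity = state
    new_velocity = velocity + 0.1 * action + rng.gauss(0.0, 0.1)
    new_position = position + 0.1 * new_velocity
    return (new_position, new_velocity)

def reward(state):
    """Hypothetical reward: penalize squared distance from the origin."""
    position, _ = state
    return -position ** 2

def rollout_payoff(theta, s0, horizon, rng):
    """Total payoff R(s0) + R(s1) + ... for one simulated trajectory."""
    state = s0
    total = reward(state)
    for _ in range(horizon):
        action = linear_policy(theta, state)
        state = simulator(state, action, rng)
        total += reward(state)
    return total

def estimate_expected_payoff(theta, s0, horizon, n_rollouts, seed=0):
    """Monte Carlo estimate of the expected payoff under the simulator."""
    rng = random.Random(seed)
    payoffs = [rollout_payoff(theta, s0, horizon, rng) for _ in range(n_rollouts)]
    return sum(payoffs) / n_rollouts

theta = (-1.0, -1.0)  # illustrative parameters, not tuned
s0 = (1.0, 0.0)
estimate = estimate_expected_payoff(theta, s0, horizon=50, n_rollouts=200)
```

Because `simulator` draws noise on every call, two calls with the same state and action generally return different next states, which is the sense in which ST plus one is a random function of ST and AT.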