So for example, for autonomous helicopter flight, you can build a simulator using supervised learning, with an algorithm like linear regression or a nonlinear generalization of linear regression, so we can get a nonlinear model of the dynamics of what St+1 is as a random function of St and At. Now once you have a simulator, given any fixed policy you can quite straightforwardly evaluate that policy in the simulator. Concretely, our goal is to find the policy pi mapping from states to actions, so the goal is to find the green box like that, a policy that works well. So if you have any one fixed policy pi, you can evaluate the policy pi just using the simulator via the picture shown at the bottom of the slide. So concretely, you can take your initial state S0, feed it into the policy pi, your policy pi will output some action A0, you plug that into the simulator, the simulator outputs a random state S1, you feed S1 into the policy and so on, and you get a sequence of states S0 through ST that your helicopter flies through in simulation. Then sum up the rewards, and this gives you an estimate of the expected payoff of the policy.
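As a minimal sketch of this rollout procedure, assuming hypothetical Python functions policy(s), simulator_step(s, a), and reward(s) standing in for the policy, the learned dynamics model, and the reward function (the lecture itself gives no code):

```python
def rollout_payoff(policy, simulator_step, reward, s0, T):
    # One simulated trajectory: S0 -> A0 -> S1 -> A1 -> ... -> ST,
    # summing the rewards along the way.
    s = s0
    total = 0.0
    for _ in range(T):
        a = policy(s)              # A_t = pi(S_t)
        total += reward(s)         # accumulate R(S_t)
        s = simulator_step(s, a)   # S_{t+1} drawn at random given (S_t, A_t)
    total += reward(s)             # reward at the final state ST
    return total
```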
This picture is just a fancy way of saying fly your helicopter in simulation, see how well it does, and measure the sum of rewards you get on average in the simulator. The picture I’ve drawn here assumes that you run your policy through the simulator just once. In general, you would run the policy through the simulator some number of times and average over M simulations to get a better estimate of the policy’s expected payoff. So given any one fixed policy, this gives me a way to evaluate the expected payoff of that policy. One reasonably obvious thing you might try to do is then just search for a policy, in other words search for parameters theta for your policy, that gives you high estimated payoff. Does that make sense? So my policy has some parameters theta; say my actions A are equal to theta transpose S if it’s a linear policy. For any fixed value of the parameters theta, I can get an estimate of how good the policy is using the simulator. One thing I might try to do is search for parameters theta that maximize my estimated payoff. It turns out you can sort of do that, but that idea as I’ve just stated it is hard to get to work. Here’s the reason. The simulator allows us to evaluate a policy, so let’s search for a policy of high value. A sketch of this idea appears below.
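Continuing the sketch above (still with the hypothetical helper names), averaging over M rollouts and doing a crude search over theta for a linear policy A = theta transpose S might look like the following; this is only meant to illustrate "search for theta with high estimated payoff", not a recommended algorithm:

```python
import numpy as np

def estimated_payoff(theta, simulator_step, reward, s0, T, M=100):
    # Monte Carlo estimate of the expected payoff of the linear policy
    # A = theta^T S: run M simulated trajectories and average the summed rewards.
    linear_policy = lambda s: theta @ s
    return np.mean([rollout_payoff(linear_policy, simulator_step, reward, s0, T)
                    for _ in range(M)])

def naive_policy_search(simulator_step, reward, s0, T, state_dim, n_candidates=50):
    # Try random parameter vectors and keep the one with the highest estimated
    # payoff. This is the "search for theta" idea stated naively; the noise
    # issue discussed next is exactly what makes it hard to get to work.
    best_theta, best_value = None, -np.inf
    for _ in range(n_candidates):
        theta = np.random.randn(state_dim)
        value = estimated_payoff(theta, simulator_step, reward, s0, T)
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta
```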
The difficulty is that the simulator is random, and so every time we evaluate a policy, we get back a slightly different answer. So in the cartoon below, I want you to imagine that the horizontal axis is the space of policies. In other words, as I vary the parameters theta, I get different policies, and so I’m moving along the X axis; I’m going to plot the total payoff on the vertical axis, and the red line in this cartoon is the expected payoff of the different policies. My goal is to find the policy with the highest expected payoff. You could search for a policy with high expected payoff, but every time you evaluate a policy the estimate is noisy: say I evaluate some policy, then I might get a point that just by chance looked a little bit better than it should have. If I evaluate a second policy, just by chance it might look a little bit worse. I evaluate a third policy, a fourth, and so on. Sometimes I might actually evaluate exactly the same policy twice and get back slightly different answers, just because my simulator is random.
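To see the noise problem in terms of the sketch above: evaluating the exact same theta twice gives two different numbers, and with a small number of rollouts the gap between the two estimates can easily be larger than the true gap between two candidate policies.

```python
# Same policy, two separate evaluations: v1 and v2 will generally differ,
# because every rollout in the simulator is stochastic.
v1 = estimated_payoff(theta, simulator_step, reward, s0, T, M=10)
v2 = estimated_payoff(theta, simulator_step, reward, s0, T, M=10)
```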