So when I started to talk about reinforcement learning, I showed that video of a helicopter flying upside down. That was actually done using exactly this method, exactly this policy search algorithm. This seems to scale well even to fairly large problems, even to fairly high-dimensional state spaces. With Pegasus policy search, the optimization problem is much easier than the stochastic version, but it's still not entirely trivial, so typically you apply this sort of method with maybe tens of parameters, say 30 or 40, but not thousands of parameters, at least in these sorts of applications.
Student: So is that method different from just assuming that you know your simulator exactly, just throwing away all the random numbers entirely?
Instructor (Andrew Ng): So is this different from assuming that we have a deterministic simulator? The answer is no. For the sake of simplicity I talked about one sequence of random numbers, but here is how you actually do it. Imagine that the random numbers are simulating different wind gusts against your helicopter. What you want to do isn't really to evaluate against just one pattern of wind gusts; you want to sample some set of different patterns of wind gusts, evaluate against all of them, and average. So what you actually do is sample, say, 100 sequences of random numbers (100 is just a number I made up), and every time you want to evaluate a policy, you evaluate it against all 100 sequences of random numbers and then average. This is exactly the same as in the earlier picture, where you wouldn't necessarily evaluate the policy just once; you'd evaluate it maybe 100 times in simulation and then average to get a better estimate of the expected reward. You do the same thing here, but with 100 fixed sequences of random numbers. Does that make sense? Any other questions?
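As a rough sketch of what that evaluation looks like in code, here is a toy Python version of the idea. The simulator, the two-parameter linear policy, the horizon, and the quadratic reward below are all hypothetical stand-ins for illustration, not anything from the actual helicopter work; only the structure (fix the random sequences once, then score a policy by averaging over them) is from the lecture.

import numpy as np

# Pegasus-style evaluation: draw the random sequences ("scenarios") once, then
# score every policy against the same fixed scenarios and average the payoffs.
HORIZON = 100          # time steps per rollout (made-up value)
NUM_SCENARIOS = 100    # the "100 sequences of random numbers" from the lecture

rng = np.random.default_rng(0)
scenarios = rng.standard_normal((NUM_SCENARIOS, HORIZON))   # sampled once, then fixed

def rollout(policy_params, noise_sequence):
    # One deterministic rollout: the simulator consumes a pre-drawn noise
    # sequence (think wind gusts) instead of calling the random number
    # generator itself.
    state = np.zeros(2)            # toy two-dimensional state
    total_reward = 0.0
    for t in range(HORIZON):
        action = np.tanh(policy_params @ state)              # toy linear policy
        state = 0.9 * state + np.array([action, 0.1]) + 0.05 * noise_sequence[t]
        total_reward += -np.sum(state ** 2)                  # toy quadratic cost
    return total_reward

def estimated_payoff(policy_params):
    # Average over all fixed scenarios. Because the scenarios never change,
    # this is a deterministic function of the policy parameters.
    return np.mean([rollout(policy_params, s) for s in scenarios])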
Student: If we use 100 scenarios and get an estimate for the policy, [inaudible] 100 times [inaudible]random numbers [inaudible] won’t you get similar ideas [inaudible]?
Instructor (Andrew Ng): Yeah, I guess you're right. For a fixed policy, the quality of the approximation is equally good in both cases. The advantage of fixing the random numbers is that you end up with an optimization problem that's much easier. Think of it as a search problem: on the horizontal axis there's a space of control policies, and my goal is to find a control policy that maximizes the payoff.
The problem with this earlier setting was that when I evaluate policies I get these noisy estimates, and then it’s just very hard to optimize the red curve if I have these points that are all over the place. And if I evaluate the same policy twice, I don’t even get back the same answer. By fixing the random numbers, the algorithm still doesn’t get to see the red curve, but at least it’s now optimizing a deterministic function. That makes the optimization problem much easier. Does that make sense?
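Continuing the hypothetical sketch above: because estimated_payoff is now a deterministic function of the policy parameters, you can hand it to any standard optimizer. Here is a simple random local search, again just an illustration under the same made-up assumptions, not the optimizer used in the actual work.

import numpy as np

# Simple random local search over the deterministic objective sketched above
# (assumes the estimated_payoff function and two-parameter policy from before).
# Re-evaluating a candidate always returns the same number, which is what
# makes this kind of optimization tractable.
params = np.zeros(2)
best = estimated_payoff(params)
search_rng = np.random.default_rng(1)
for _ in range(200):
    candidate = params + 0.5 * search_rng.standard_normal(2)
    value = estimated_payoff(candidate)
    if value > best:               # keep the candidate only if it improves the payoff
        params, best = candidate, value
print("best estimated payoff:", best)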