So as I evaluate more and more policies, these are the pictures I get. My goal is to try to optimize the red line. I hope you appreciate this is a hard problem, especially when all my optimization algorithm gets to see are these black dots, and it doesn’t have direct access to the red line. So when my input space is some fairly high dimensional space, if I have ten parameters, the horizontal axis would actually be a 10-D space, all I get are these noisy estimates of what the red line is. This is a very hard stochastic optimization problem. So it turns out there’s one way to make this optimization problem much easier. Here’s the idea. And the method is called Pegasus, which is an acronym for something. I’ll tell you later. So the simulator repeatedly makes calls to a random number generator to generate random numbers RT, which are used to simulate the stochastic dynamics. What I mean by that is that the simulator takes as input the state and action, and it outputs the next state randomly, and if you peer into the simulator, if you open the box of the simulator and ask how is my simulator generating these random next states ST plus one, pretty much the only way to write a simulator with random outputs is to make calls to a random number generator, get random numbers, these RTs, and then have some function that takes as input S0, A0, and the result of your random number generator, and computes the next state as a function of the inputs and of the random number it got from the random number generator.
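To make that structure concrete, here is a minimal Python sketch of what such a simulator looks like; the dynamics function, the function names, and the constants are all invented for illustration and are not from the lecture. The point is only that the next state is a deterministic function of the state, the action, and the random number, so the sole source of randomness is the call to the random number generator.

```python
import numpy as np

def transition(state, action, r):
    # Deterministic function of (state, action, random number r).
    # These particular dynamics are invented purely for illustration.
    return state + 0.1 * action + 0.05 * r

def simulate_step(state, action, rng):
    # An ordinary stochastic simulator: draw a fresh random number,
    # then apply the deterministic transition function to it.
    r = rng.standard_normal()
    return transition(state, action, r)

rng = np.random.default_rng()
s0, a0 = 0.0, 1.0
# Calling the simulator twice with the same (state, action) gives two
# different next states, because the random numbers differ each time.
print(simulate_step(s0, a0, rng), simulate_step(s0, a0, rng))
```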
This is pretty much the only way anyone implements any random code, any code that generates random outputs: you make a call to a random number generator, and you compute some function of the random number and of your other inputs. The reason that when you evaluate different policies you get different answers is that every time you rerun the simulator, you get a different sequence of random numbers from the random number generator, and so you get a different answer every time, even if you evaluate the same policy twice. So here’s the idea. Let’s say we fix in advance the sequence of random numbers: choose R1, R2, up to RT minus one. Fix the sequence of random numbers in advance, and we’ll always use the same sequence of random numbers to evaluate different policies. Go into your code and fix R1, R2, through RT minus one. Choose them randomly once and then fix them forever.
If you always use the same sequence of random numbers, then the system is no longer random, and if you evaluate the same policy twice, you get back exactly the same answer. And these sequences of random numbers are what we call scenarios, and Pegasus is an acronym for policy evaluation-of-goodness and search using scenarios. So when you do that, this is the picture you get. As before, the red line is your expected payoff, and by fixing the random numbers, you’ve defined some deterministic estimate of the expected payoff. And as you evaluate different policies, these estimates are still only approximations to the true expected payoff, but at least now you have a deterministic function to optimize, and you can now apply your favorite algorithms, be it gradient ascent or some sort of local search, to try to optimize the black curve. This gives you a much easier optimization problem, and you can optimize this to find hopefully a good policy. So this is the Pegasus policy search method.
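As a rough sketch of the idea (again with invented dynamics, reward, and policy class, none of which come from the lecture), you might draw the scenarios once, then define a policy-evaluation function that replays every policy against those same fixed random numbers. The resulting estimate is a deterministic function of the policy parameters, so evaluating the same policy twice returns exactly the same number, and you can hand the estimate to any standard optimizer.

```python
import numpy as np

def transition(state, action, r):
    # Same deterministic core as before: next state as a function of
    # (state, action, random number). Dynamics are illustrative only.
    return state + 0.1 * action + 0.05 * r

def reward(state):
    # Illustrative reward: payoff for staying near a target state of 1.0.
    return -(state - 1.0) ** 2

# Fix the scenarios once: M sequences of T random numbers, drawn up front
# and never redrawn afterwards.
M, T = 100, 50
scenarios = np.random.default_rng(seed=0).standard_normal((M, T))

def evaluate_policy(theta, scenarios):
    # Deterministic estimate of the expected payoff of a simple
    # (hypothetical) linear policy a = theta * (1 - s), averaged over
    # the fixed scenarios. Same theta in, same number out, every time.
    total = 0.0
    for rs in scenarios:
        s = 0.0  # fixed initial state
        for r in rs:
            a = theta * (1.0 - s)
            s = transition(s, a, r)
            total += reward(s)
    return total / len(scenarios)

# The estimate is now a deterministic function of theta, so gradient
# ascent, local search, or any off-the-shelf optimizer can be applied.
assert evaluate_policy(0.5, scenarios) == evaluate_policy(0.5, scenarios)
```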