So here’s my somewhat arbitrary choice. I’m gonna say that the probability of action A1, so pi of S comma A1, I’m gonna write as one over one plus E to the minus theta transpose S, okay? And I just chose the logistic function because it’s a convenient function we’ve used a lot. So I’m gonna say that my policy is parameterized by a set of parameters theta, and for any given set of parameters theta, that gives me a stochastic policy. And if I’m executing that policy with parameters theta, that means that the chance of my choosing action A1 is given by this number. Because my chances of executing actions A1 or A2 must sum to one, this gives me pi of S comma A2 as one minus that number. So just [inaudible], this means that when I’m in some state S, I’m going to compute this number, compute one over one plus E to the minus theta transpose S. And then with this probability, I will execute the accelerate right action, and with one minus this probability, I’ll execute the accelerate left action.

And again, just to give you a sense of why this might be a reasonable thing to do, let’s say my state vector is, this is [inaudible] state, and I added an extra one as an intercept term, just to give my logistic function an extra feature. If I choose the parameters of my policy to be, say, this, then that means that at any state, the probability of my taking action A1, the probability of my taking the accelerate right action, is this one over one plus E to the minus theta transpose S, and taking the inner product of theta and S just gives you phi, so this equals one over one plus E to the minus phi.
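As a rough illustration of the policy just described, here is a minimal Python sketch. The state layout `[1, x, x_dot, phi, phi_dot]` (intercept, cart position, cart velocity, pole angle, angular velocity), the function names, and the numeric values are assumptions made for this example; the lecture only says that the state has an extra one added as an intercept and that theta is chosen so that theta transpose S equals phi.

```python
import math
import random

def prob_accelerate_right(theta, s):
    """Probability of action A1 (accelerate right): 1 / (1 + exp(-theta^T s))."""
    z = sum(t * x for t, x in zip(theta, s))  # inner product theta^T s
    return 1.0 / (1.0 + math.exp(-z))

def sample_action(theta, s, rng=random):
    """Draw A1 ('right') with probability pi(s, A1), otherwise A2 ('left')."""
    p_right = prob_accelerate_right(theta, s)
    return "right" if rng.random() < p_right else "left"

# Assumed state layout: [1, x, x_dot, phi, phi_dot].
# This theta picks out only the pole angle, so theta^T s = phi.
theta = [0.0, 0.0, 0.0, 1.0, 0.0]
s = [1.0, 0.3, -0.1, 0.5, 0.2]  # made-up state with phi = 0.5 radians

p = prob_accelerate_right(theta, s)
action = sample_action(theta, s)
print(f"P(accelerate right) = {p:.3f}, sampled action: {action}")
```

Because there are only two actions, the probability of accelerating left is simply one minus the value returned by `prob_accelerate_right`, so no separate function is needed for it.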
And so if I choose my parameters theta as follows, what that means is that the chance of my accelerating to the right depends only on the angle phi of my inverted pendulum. And so this means for example that if my inverted pendulum is leaning far over to the right, then I’m very likely to accelerate to the right to try to catch it. I hope the physics of this inverted pendulum thing make sense. If my pole’s leaning over to the right, then I wanna accelerate to the right to catch it. And conversely if phi is negative, it’s leaning over to the left, and I’ll accelerate to the left to try to catch it. So this is one example for one specific choice of parameters theta. Obviously, this isn’t a great policy because it ignores the rest of the features. Maybe if the cart is further to the right, you want it to be less likely to accelerate to the right, and you can capture that by changing one of these coefficients to take into account the actual position of the cart. And then depending on the velocity of the cart and the angular velocity, you might want to change theta to take into account these other effects as well. Maybe if the pole’s leaning far to the right, but is actually on its way to swinging back, as specified by the angular velocity, then you might be less worried about having to accelerate hard to the right. And so these are the sorts of behavior you can get by varying the parameters theta.
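To make the effect of varying theta concrete, here is a small Python sketch that compares an angle-only theta with one that also weighs the cart position and the angular velocity. The state layout `[1, x, x_dot, phi, phi_dot]`, the sign conventions (positive x means the cart is to the right, positive phi means the pole leans right), and the coefficients in `richer` are all hypothetical values picked for illustration, not values from the lecture.

```python
import math

def p_right(theta, s):
    """Logistic policy: probability of accelerating right in state s."""
    z = sum(t * x for t, x in zip(theta, s))
    return 1.0 / (1.0 + math.exp(-z))

# Assumed state layout: [1, x, x_dot, phi, phi_dot].
angle_only = [0.0, 0.0, 0.0, 1.0, 0.0]   # theta^T s = phi
richer = [0.0, -0.5, 0.0, 1.0, 0.5]      # hypothetical coefficients

# Three made-up states, all with the pole leaning right (phi = 0.4).
centered = [1.0, 0.0, 0.0, 0.4, 0.0]
cart_far_right = [1.0, 2.0, 0.0, 0.4, 0.0]
swinging_back = [1.0, 0.0, 0.0, 0.4, -1.0]

for name, s in [("centered", centered),
                ("cart far right", cart_far_right),
                ("swinging back", swinging_back)]:
    print(f"{name:15s} angle-only: {p_right(angle_only, s):.3f}  "
          f"richer: {p_right(richer, s):.3f}")
```

With the angle-only theta all three states get the same probability, since only phi enters the inner product. With the richer theta, the negative coefficient on x lowers the chance of accelerating right when the cart is already far to the right, and the positive coefficient on the angular velocity lowers it when the pole is swinging back.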