<< Chapter < Page | Chapter >> Page > |
In the same way, as I define policy search algorithms, there’ll sort of be a step where I say, “Well, let’s try to compute the actions. Let’s try to approximate what a good action is using a logistic function of the state.” So again, I’ll sort of pull a function out of the air. I’ll say, “Let’s just choose a function, and that’ll be our choice of the policy cost,” and I’ll say, “Let’s take this input the state, and then we’ll map it through logistic function, and then hopefully, we’ll approximate what is a good function – excuse me, we’ll approximate what is a good action using a logistic function of the state.” So there’s that sort of – the function of the choice of policy cost that’s again a little bit arbitrary, but it’s arbitrary as it was when we were talking about supervised learning. So to develop our first policy search algorithm, I’m actually gonna need the new definition. So our first policy search algorithm, we’ll actually need to work with stochastic policies. What I mean by stochastic policy is there’s going to be a function that maps from the space of states across actions. They’re real numbers where pi of S comma A will be interpreted as the probability of taking this action A in sum state S. And so we have to add sum over A – In other words, for every state a stochastic policy specifies a probability distribution over the actions. So concretely, suppose you are executing some policy pi. Say I have some stochastic policy pi. I wanna execute the policy pi. What that means is that – in this example let’s say I have three actions.
What that means is that suppose I’m in some state S. I would then compute pi of S comma A1, pi of S comma A2, pi of S comma A3, if I have a three action MDP. These will be three numbers that sum up to one, and then my chance of taking action A1 will be equal to this. My chance of taking action A2 will be equal to pi of S comma A2. My chance of taking action A3 will be equal to this number. So that’s what it means to execute a stochastic policy. So as a concrete example, just let me make this – the concept of why you wanna use stochastic policy is maybe a little bit hard to understand. So let me just go ahead and give one specific example of what a stochastic policy may look like. For this example, I’m gonna use the inverted pendulum as my motivating example. It’s that problem of balancing a pole. We have an inverted pendulum that swings freely, and you want to move the cart left and right to keep the pole vertical. Let’s say my actions – for today’s example, I’m gonna use that angle to denote the angle of the pole phi. I have two actions where A1 is to accelerate left and A2 is to accelerate right. Actually, let me just write that the other way around. A1 is to accelerate right. A2 is to accelerate left. So let’s see. Choose a reward function that penalizes the pole falling over whatever. And now let’s come up with a stochastic policy for this problem. To come up with a class of stochastic policies really means coming up with some class of functions to approximate what action you want to take as a function of the state.
Notification Switch
Would you like to follow the 'Machine learning' conversation and receive update notifications?