And so our goal is to tune the parameters theta – our goal in policy search is to tune the parameters theta so that when we execute the policy pi subscript theta, the pole stays up as long as possible. In other words, our goal is to maximize, as a function of theta, the expected value of the payoff when we execute the policy pi theta. We want to choose parameters theta to maximize that. Are there questions about the problem setup, policy search, policy classes, or anything? Yeah.
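Written out, the objective described here might look like the following. This is a sketch under assumptions: the reward notation R(s, a) and the finite horizon T are not stated in this passage and are chosen to match the "expected sum" over the sequence S0, A0, ..., ST, AT mentioned later in the lecture.

```latex
\max_{\theta} \; \mathbb{E}\!\left[\, R(s_0, a_0) + R(s_1, a_1) + \cdots + R(s_T, a_T) \;\middle|\; \pi_\theta \,\right]
```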
Student: In a case where we have more than two actions, would we use a different theta for each of the distributions, or still have the same parameters?
Instructor (Andrew Ng): Oh, yeah. Right. So what if we have more than two actions? It turns out you can choose almost anything you want for the policy class, but if you have, say, a fixed number of discrete actions, I would sometimes use something like a softmax parameterization. Similar to softmax regression that we saw earlier in the class, you may say that – [inaudible] out of space. You may have a set of parameters theta 1 through theta D if you have D actions and – pi equals e to the theta i transpose s over – so that would be an example of a softmax parameterization for multiple actions. It turns out that if you have continuous actions, you can actually make this be a density over the actions A, parameterized by other things as well.
But the choice of policy class is somewhat up to you, in the same way that in supervised learning the choice of whether to use a linear function, or a linear function with quadratic features, or whatever, was sort of up to us. Anything else? Yeah.
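The softmax parameterization mentioned in this answer can be sketched roughly as follows. The spoken formula trails off after "over"; this sketch assumes the usual softmax-regression denominator, the sum over all D actions of e to the theta j transpose s. The function names `softmax_policy` and `sample_action` and the example numbers are hypothetical, chosen only for illustration.

```python
import math
import random


def softmax_policy(thetas, state):
    """Return action probabilities for a softmax policy.

    thetas: list of D parameter vectors, one per discrete action.
    state:  feature vector s, same length as each parameter vector.
    Assumes pi(s, a_i) = exp(theta_i . s) / sum_j exp(theta_j . s).
    """
    scores = [sum(t * s for t, s in zip(theta, state)) for theta in thetas]
    # Subtracting the max score leaves the ratios unchanged and avoids overflow.
    max_score = max(scores)
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(thetas, state, rng=random):
    """Draw one action index from the stochastic policy."""
    probs = softmax_policy(thetas, state)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Example with three actions and a two-dimensional state (made-up numbers).
example_thetas = [[0.0, 0.0], [1.0, -1.0], [0.5, 0.5]]
example_state = [0.2, -0.1]
example_probs = softmax_policy(example_thetas, example_state)
```

With D = 2 this reduces to a logistic-style policy over two actions, which is one reason the softmax form is a natural extension to more actions.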
Student: [Inaudible] stochastic?
Instructor (Andrew Ng): Yes.
Student: So is it possible to [inaudible] a stochastic policy using numbers [inaudible]?
Instructor (Andrew Ng): I see. Given that the MDP has stochastic transition probabilities, is it possible to use [inaudible] policies and [inaudible] the stochasticity of the state transition probabilities. The answer is yes, but for the purposes of what I want to show later, that won’t be useful. But formally, it is possible. If you already have a fixed – if you have a fixed policy, then you’d be able to do that. Anything else? Yeah. No, I guess even a [inaudible] class of policy can do that, but for the derivation later, I actually need to keep it separate. Actually, could you just – I know the concept of policy search is sometimes a little confusing. Could you just raise your hand if this makes sense? Okay. Thanks.
So let’s talk about an algorithm. The first algorithm I’m going to present is sometimes called the REINFORCE algorithm. What I’m going to present, it turns out, isn’t exactly the REINFORCE algorithm as it was originally presented by its author, Ron Williams, but it sort of captures its essence.
Here’s the idea. In the sequel – in what I’m about to do, I’m going to assume that S0 is some fixed initial state. Or it turns out if S0 is drawn from some fixed initial state distribution then everything else [inaudible], but let’s just say S0 is some fixed initial state. So my goal is to maximize this expected sum [inaudible]. Given the policy and whatever else, drop that. So the random variables in this expectation are a sequence of states and actions: S0, A0, S1, A1, and so on, up to ST, AT.
So let me write out this expectation explicitly as a sum over all possible state and action sequences of that – so that’s what an expectation is. It’s the probability of the random variables times that. Let me just expand out this probability. So the probability of seeing this exact sequence of states and actions is the probability of the MDP starting in that state. If this is a deterministic initial state, then all the probability mass would be on one state. Otherwise, there’s some distribution over initial states. Then times the probability that you chose action A0 from that state S0, and then times the probability that the MDP’s transition probabilities happen to transition you to state S1, given that you chose action A0 in state S0, times the probability that you chose that and so on. The last term here is that, and then times that.
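The expansion described here, where the probability of a sequence factors into the initial-state probability, the policy's action probabilities, and the MDP's transition probabilities, can be checked on a tiny example by brute-force enumeration. The sketch below is illustrative only: the two-state, two-action MDP, its numbers, the tabular policy `PI`, and the reward table `R` are all made up, and the payoff is assumed to be the sum of rewards R(s_t, a_t) along the sequence.

```python
from itertools import product

# Hypothetical toy MDP; every number below is invented for illustration.
STATES = [0, 1]
ACTIONS = [0, 1]
P0 = [1.0, 0.0]                      # P0[s]: deterministic initial state s0 = 0
PI = [[0.7, 0.3], [0.4, 0.6]]        # PI[s][a]: probability the policy picks a in s
P = [[[0.9, 0.1], [0.2, 0.8]],       # P[s][a][s_next]: transition probabilities
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.0, 2.0]]         # R[s][a]: assumed per-step reward


def trajectory_probability(states, actions):
    """P(s0) * pi(s0,a0) * P_{s0,a0}(s1) * pi(s1,a1) * ... * pi(sT,aT)."""
    prob = P0[states[0]] * PI[states[0]][actions[0]]
    for t in range(1, len(states)):
        prob *= P[states[t - 1]][actions[t - 1]][states[t]]  # MDP transition
        prob *= PI[states[t]][actions[t]]                    # policy's choice
    return prob


def payoff(states, actions):
    """Sum of rewards along one state/action sequence."""
    return sum(R[s][a] for s, a in zip(states, actions))


def expected_payoff(horizon):
    """Expectation as a sum over all sequences s0, a0, ..., sT, aT."""
    total = 0.0
    for states in product(STATES, repeat=horizon + 1):
        for actions in product(ACTIONS, repeat=horizon + 1):
            total += trajectory_probability(states, actions) * payoff(states, actions)
    return total
```

Enumerating every sequence is only feasible for toy problems like this one, since the number of sequences grows exponentially with the horizon; the point is just to make the factorization concrete.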