So, okay. So this is how you might come up with a linear model, and if you do that then, let's see. So for LQR we also have this sort of quadratic reward function, right? Where the matrices UT and VT are positive semi-definite, so the rewards are never positive; that's this minus sign. And then if you take exactly the dynamic programming algorithm that I've written down just now, then, let's see. It turns out that the value function at every state, excuse me, the value function for every time step will be a quadratic function of the state and can be written like that. And so you initialize the dynamic programming algorithm as follows. And I'll just write this down; there's actually just one property I want to point out later. This equation is, well, somewhat big and hairy, but don't worry about most of its details.
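For readers without the board in front of them, the two objects described here can be written down as follows. This is a sketch under assumed notation, not a copy of the board: s_t and a_t stand for the state and action at time t, and Phi_t (a matrix) and Psi_t (a scalar) are names chosen here for the quadratic and constant parts of the value function.

```latex
\begin{align*}
R_t(s_t, a_t) &= -\left( s_t^\top U_t\, s_t + a_t^\top V_t\, a_t \right),
  \qquad U_t \succeq 0,\; V_t \succeq 0 \\
V_t^*(s_t) &= s_t^\top \Phi_t\, s_t + \Psi_t
\end{align*}
```

Because U_t and V_t are positive semi-definite, the bracketed term is never negative, which is why the leading minus sign makes every reward at most zero.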
Let me just put this. Shoot, there's one more equation I want to fit in. Well, okay. All right. So it turns out the value function is a quadratic function, where V star T is this, and so you initialize the dynamic programming step with this. This Phi and this Psi give you V capital T, and then it recurses backwards. So these two equations will give you V subscript T as a function of V T plus one. Okay? So it recurses backwards for this learning algorithm. And the last thing, let me get this on the same board. Sorry about the disorganized use of the boards. I just wanted this in the same place.
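The initialization and the backward step can be sketched as below. Everything here is derived under assumptions rather than copied from the board: the value function is taken to be s^T Phi_t s + Psi_t, the dynamics are assumed to be s_{t+1} = A_t s_t + B_t a_t + w_t with zero-mean noise w_t of covariance Sigma_w, and V_t is assumed positive definite so the inverse exists. The board's version may group terms or place signs differently.

```latex
\begin{align*}
\Phi_T &= -U_T, \qquad \Psi_T = 0 \\
\Phi_t &= A_t^\top \Phi_{t+1} A_t
  + A_t^\top \Phi_{t+1} B_t
    \left( V_t - B_t^\top \Phi_{t+1} B_t \right)^{-1}
    B_t^\top \Phi_{t+1} A_t
  - U_t \\
\Psi_t &= \operatorname{tr}\!\left( \Sigma_w \Phi_{t+1} \right) + \Psi_{t+1}
\end{align*}
```

The first line comes from the fact that at the last time step the best action is zero, leaving only the state cost. The other two lines come from maximizing the reward plus the expected next-step value over the action, which is a concave quadratic in a_t.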
And, finally, the actual policy pi star of ST is given by some linear function of ST, LT times ST, so LT here is a matrix, where LT is equal to [inaudible]. Okay? And so when you do this you now have the actual policy, right? So just concretely, you run the dynamic programming algorithm to compute Phi T and Psi T for all values of T, and then you plug them in to compute the matrices LT, and now you know the optimal action to take and [inaudible]. Okay?

So there's one very interesting property. These equations are a huge mess, and you can re-derive them yourself, but don't worry about the details. There's one specific property of this dynamic programming algorithm that's very interesting that I want to point out, which is the following. Notice that to compute the optimal policy I need to compute these matrices LT, and notice that LT depends on A, it depends on B, it depends on V, and it depends on Phi, but it doesn't depend on Psi, right?

Notice further that when I carry out my dynamic programming algorithm, my recursive definition for Psi, well, it depends on, oh, excuse me. It should be Psi T, right. Psi T plus one. Okay. In order to carry out my dynamic programming algorithm for Psi T, I need to know what Psi T plus one is. So Psi T depends on these things, but in order to carry out the dynamic programming for Phi T, Phi T doesn't actually depend on Psi T plus one, right? And so, in other words, in order to compute the Phi T's I don't need these Psi's.

So if all I want is the Phi's, I can actually omit this step of the dynamic programming algorithm and not bother to carry out the updates for the Psi's. I can forget about the Psi's and just do the dynamic programming updates for the Phi matrices, and then, having done my DP updates for the Phi T's, I can plug them into this formula to compute LT. Okay?
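To make this property concrete, here is a minimal NumPy sketch of the backward pass. It is an illustration under stated assumptions, not the board's equations: the function name `lqr_backward`, the time-invariant A, B, U, V, the noise covariance `Sigma_w`, and the sign conventions are all choices made here, with the gain taken as L_t = (V - B^T Phi_{t+1} B)^{-1} B^T Phi_{t+1} A from maximizing the quadratic. The constant term is only computed when a noise covariance is supplied, and the gains should come out the same either way.

```python
import numpy as np

def lqr_backward(A, B, U, V, T, Sigma_w=None):
    """Backward pass for finite-horizon LQR with time-invariant A, B, U, V.

    Assumed model (hypothetical, for illustration):
      s' = A s + B a + w,  w ~ N(0, Sigma_w)
      reward = -(s^T U s + a^T V a)
      value  V_t(s) = s^T Phi_t s + Psi_t
    Returns (gains, Phi_0, Psi_0); Psi_0 is None when Sigma_w is None.
    """
    Phi = -U                                  # Phi_T = -U
    Psi = None if Sigma_w is None else 0.0    # Psi_T = 0
    gains = [None] * T
    for t in range(T - 1, -1, -1):
        # At this point Phi holds Phi_{t+1}.
        M = V - B.T @ Phi @ B                 # positive definite when V is
        L = np.linalg.solve(M, B.T @ Phi @ A) # L_t uses A, B, V, Phi only
        if Psi is not None:
            Psi = np.trace(Sigma_w @ Phi) + Psi   # Psi_t needs Psi_{t+1}
        Phi = A.T @ Phi @ A + A.T @ Phi @ B @ L - U  # Phi_t: no Psi involved
        gains[t] = L
    return gains, Phi, Psi

# Usage: a small made-up system, solved once without and once with the Psi updates.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
U, V = np.eye(2), np.array([[0.1]])
gains_no_psi, _, _ = lqr_backward(A, B, U, V, T=20)
gains_noisy, _, psi0 = lqr_backward(A, B, U, V, T=20, Sigma_w=0.5 * np.eye(2))
same = all(np.allclose(a, b) for a, b in zip(gains_no_psi, gains_noisy))
print(same, psi0)
```

In this sketch the noise covariance only ever enters through the Psi update, so dropping that update changes the value function's constant term but leaves every L_t, and therefore the policy, untouched.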