Then we came up with this dynamic programming algorithm in the last lecture, where you compute V star of capital T, the value function for the last time step. In other words, what's the value if you start in the state S and you just get to take one action and then the clock runs out? And the dynamic programming algorithm then repeatedly computes V star lowercase t in terms of V star t plus one, so we compute V star capital T and then recurse backwards, right? And so on, until we get down to V star zero, and then pi star was given by, as usual, the argmax of the thing we had in the definition of the value function.
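To make the backward recursion concrete, here is a minimal sketch for a small tabular MDP, assuming known transition matrices P and rewards R; the names (P, R, horizon, finite_horizon_dp) are illustrative assumptions, not notation from the lecture, which actually works with continuous states in LQR.

```python
# Sketch of the finite-horizon DP recursion from the lecture, for a tabular MDP.
# P[a] is an (n_states, n_states) transition matrix for action a; R[s, a] is the reward.
import numpy as np

def finite_horizon_dp(P, R, horizon):
    n_actions = len(P)
    n_states = R.shape[0]

    V = np.zeros((horizon + 1, n_states))
    pi = np.zeros((horizon + 1, n_states), dtype=int)

    # Base case V*_T: you take one action and then the clock runs out.
    V[horizon] = R.max(axis=1)
    pi[horizon] = R.argmax(axis=1)

    # Recurse backwards: compute V*_t in terms of V*_{t+1}, down to V*_0.
    for t in range(horizon - 1, -1, -1):
        # Q[s, a] = R(s, a) + E_{s'}[ V*_{t+1}(s') ]
        Q = R + np.stack([P[a] @ V[t + 1] for a in range(n_actions)], axis=1)
        V[t] = Q.max(axis=1)      # V*_t(s)
        pi[t] = Q.argmax(axis=1)  # pi*_t(s): the argmax from the value function definition
    return V, pi
```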
So last time, the specific example we saw of, or one specific example of, a finite horizon problem that we solved by DP was the LQR problem, where we worked directly with continuous states and actions. And in the LQR problem we had these linear dynamics where the state ST plus one is a linear function of the previous state and action, plus this Gaussian noise WT, which has covariance sigma W. And I said briefly last time that one specific way to come up with these linear dynamics, excuse me, one specific way to take a system and come up with a linear model for it is if you have some simulator, say. So in this cartoon, the vertical axis represents ST plus one and the horizontal axis represents ST, AT. So say you have a simulator, and let's say it's a deterministic simulator, so we have a function F that tells you what the next state ST plus one is as a function of the previous state and action. And then you can choose a point around which to linearize this simulator, by which I mean that you choose a point and approximate the function F using a linear function, the tangent to the function F at that point. So if you do that you have ST plus one equals – shoot.
Sorry about writing down so many lines, but this will be the linearization approximation to the function F, where I've taken a linearization around the specific point s bar T, a bar T. Okay? And so you can take those terms and collect them into a linear equation like that, where the next state ST plus one is now some linear function of the previous state ST and AT, and these matrices, AT and BT, will depend on your choice of location around which to linearize this function. Okay? I said last time that this linearization approximation you'd, sort of, expect to be particularly good in the vicinity of s bar T, a bar T, because this linear function is a pretty good approximation to F in this little neighborhood there. And – yes?
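One way to actually get those matrices AT and BT from a black-box simulator is numerical finite differences around the chosen point; this is just a sketch under that assumption, and the names (simulator, eps, linearize) are placeholders, not anything from the lecture.

```python
# Sketch: linearize a deterministic simulator f around (s_bar, a_bar) so that
# s_{t+1} is approximately A @ s_t + B @ a_t + c near that point.
import numpy as np

def linearize(simulator, s_bar, a_bar, eps=1e-5):
    f0 = simulator(s_bar, a_bar)          # f(s_bar, a_bar), the point on the tangent
    n, d = len(s_bar), len(a_bar)

    # A: partial derivatives of f with respect to each state coordinate.
    A = np.zeros((len(f0), n))
    for i in range(n):
        ds = np.zeros(n); ds[i] = eps
        A[:, i] = (simulator(s_bar + ds, a_bar) - f0) / eps

    # B: partial derivatives of f with respect to each action coordinate.
    B = np.zeros((len(f0), d))
    for j in range(d):
        da = np.zeros(d); da[j] = eps
        B[:, j] = (simulator(s_bar, a_bar + da) - f0) / eps

    # Constant offset so the tangent passes through f(s_bar, a_bar).
    c = f0 - A @ s_bar - B @ a_bar
    return A, B, c
```

As the instructor notes, this tangent approximation is only trustworthy near (s_bar, a_bar), which is why the choice of linearization point matters.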
Student: [Inaudible] Is there an assumption that you are looking at something [inaudible], like the helicopter? Are you assuming pilot behavior is the same as [inaudible] behavior, or –
Instructor (Andrew Ng): Yeah, right. So let me not call this an assumption. Let me just say that when I use this algorithm, when I choose to linearize this way, then my approximation will be particularly good in the vicinity here and it may be less good elsewhere, and so, when I actually talk about DDP I'll actually make use of this property, which is why I'm going over it now. Okay? But, yeah, there is an intuition that you want to linearize in the vicinity of the states where you expect your system to spend the most time. Right.