Okay. So welcome back, and let’s continue our discussion of reinforcement learning algorithms. Now, for today, the first thing I want to do is actually talk a little bit about debugging reinforcement learning algorithms, and then I’ll continue the technical discussion from last week on LQR, on linear-quadratic regulation. In particular, I want to tell you about an algorithm called differential dynamic programming, which I think is actually a very effective controls-style reinforcement learning algorithm for many problems. Then we’ll talk about Kalman filters and linear-quadratic Gaussian control, LQG. Let’s start with debugging RL algorithms. And can I switch to the laptop display, please?
And so what I’m about to do here is a specific example that I had done earlier in this quarter, but that I promised to do again. So remember, roughly halfway through the quarter I’d given a lecture on debugging learning algorithms, right? This idea that very often you run a learning algorithm and maybe it does roughly what you want it to, and maybe it doesn’t do as well as you were hoping. Then what do you do next? And I talked about this idea that some of the really, really good people in machine learning, the people that really understand learning algorithms, are really good at getting these things to work. Very often what they’re really good at is figuring out why a learning algorithm is working or is not working, and that saves them from spending six months on things that someone else may be able to just look at and say, gee, there was no point collecting more training data, because your learning algorithm had high bias rather than high variance. So those six months you spent collecting more training data – I could have told you six months ago it was a waste of time. Right?
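To make the bias-versus-variance diagnostic mentioned above concrete, here is a minimal sketch of a learning-curve check in Python. It is not from the lecture itself; the estimator (a ridge regression) and the subset sizes are just illustrative assumptions. The idea is simply: if training and validation error are both high and close together, you likely have high bias and more data won’t help; if training error is much lower than validation error, you likely have high variance and more data may help.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def learning_curve_diagnostic(X_train, y_train, X_val, y_val, sizes):
    """Train on increasing subsets of the data and compare train/validation error.

    Both errors high and close together  -> likely high bias (more data won't help).
    Train error low, validation error high -> likely high variance (more data may help).
    """
    for m in sizes:
        model = Ridge(alpha=1.0).fit(X_train[:m], y_train[:m])
        train_err = mean_squared_error(y_train[:m], model.predict(X_train[:m]))
        val_err = mean_squared_error(y_val, model.predict(X_val))
        print(f"m={m:6d}  train MSE={train_err:.4f}  val MSE={val_err:.4f}")
```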
So these are the sorts of things that the people who are really good at machine learning, who really get machine learning, are very good at. Well, just a few of my slides. I won’t actually talk through these; these are exactly the same slides you saw last time. Actually, I’ll just skip ahead, I guess. So last time you saw this discussion on – right – diagnostics for whether you happen to have a bias problem or a variance problem, or, in other cases, whether your optimization algorithm is converging or whether it’s the [inaudible] optimization objective, and so on. And we’ll talk about that again, but the one example that I, sort of, promised to show again was actually a reinforcement learning example. At that time we hadn’t talked about reinforcement learning yet, so I promised to do exactly the same example again, all right?
So let’s go for the example. The motivating example was robotic control. Let’s see. Right – let’s say you want to design a controller for this helicopter. So this is a very typical way by which you might apply a machine learning algorithm, or several machine learning algorithms, to a control problem, right? The control problem is that you want to build a controller to make the helicopter hover in place. So the first thing you might do is build a simulator of the helicopter. And this just means a model of the state transition probabilities, P_sa, of the helicopter, and you can do this in many different ways. Maybe you can try reading a helicopter textbook and building a simulator based on what’s known about the aerodynamics of helicopters. It actually turns out this is very hard to do. Another thing you could do is collect data and maybe fit a linear model, or maybe fit a non-linear model, to what the next state is as a function of the current state and current action. Okay? So there are different ways of estimating the state transition probabilities. So, now, you have a simulator, and I’m showing a screenshot of a simulator we have at Stanford on the upper right there. The second thing you might do is then choose a reward function. So you might choose this sort of quadratic cost function, right? So the reward for being at a state x is going to be minus the norm squared of the difference between the current state and some desired state. That’s one simple example of a reward function. And this sort of quadratic reward function is what we’ve been using in the last lecture on LQR control, linear-quadratic regulation control.
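As a rough illustration of the two steps just described – fitting a linear model of the dynamics from collected data, and choosing a quadratic reward – here is a minimal sketch in Python. The function names, array shapes, and the plain least-squares fit are my own illustrative assumptions, not the actual helicopter simulator from the lecture.

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of a linear model x_{t+1} ~ A x_t + B u_t from logged data.

    states:      (T, n_x) array of observed states x_t
    actions:     (T, n_u) array of actions u_t
    next_states: (T, n_x) array of observed next states x_{t+1}
    """
    Z = np.hstack([states, actions])          # (T, n_x + n_u)
    # Solve next_states ~ Z @ W in the least-squares sense
    W, *_ = np.linalg.lstsq(Z, next_states, rcond=None)
    n_x = states.shape[1]
    A = W[:n_x].T                             # (n_x, n_x) state matrix
    B = W[n_x:].T                             # (n_x, n_u) control matrix
    return A, B

def quadratic_reward(x, x_desired):
    """Quadratic reward R(x) = -||x - x_desired||^2, as in the LQR-style cost above."""
    return -float(np.sum((x - x_desired) ** 2))
```

In practice you would fit A and B on trajectories collected from the real system (or add a non-linear model if a linear one fits poorly), and then hand the fitted dynamics and the reward to an LQR-style solver.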