And finally, right? You run a reinforcement learning algorithm in simulation, meaning that you use your model of the dynamics to try to maximize that finite horizon sum of rewards, and when you do that you get a policy out, which I’m gonna call the policy pi subscript RL to denote the policy output by the reinforcement learning algorithm. Okay? Let’s say you do this and the resulting controller gets much worse performance than a human pilot that you hire to fly the helicopter for you. So how do you go about figuring out what to do next? Well, actually, you have several things you might do, right? You might try to improve the simulator. Say, maybe you want a new model of the helicopter dynamics, because you think it is actually non-linear. Or maybe you want to collect more training data, so you can get better estimates of the transition probabilities of the helicopter. Or maybe you want to fiddle with the features you used to model the dynamics of your helicopter, right? Another thing you might do is modify the reward function R, if you think, you know, it’s not just a quadratic function, maybe it’s something else. I don’t know. Or maybe you’re unsatisfied with the reinforcement learning algorithm. Maybe you think, you know, the algorithm isn’t quite doing the right thing. Or maybe you think you need to discretize the states more finely in order to apply your reinforcement learning algorithm. Or maybe you need to fiddle with the features you used in value function approximation. Okay? So these are three examples of things you might do: improve the simulator, modify the reward function, or modify the reinforcement learning algorithm. And, again, quite often if you choose the wrong one to work on you can easily spend, you know, actually this one I don’t want to say six months. You can easily spend a year or two working on the wrong thing. Hey, Dan, a favor. I’m, sort of, out of chalk. Could you wander around and help me get some? Thanks.
So those three things, I’ve copied them to the yellow box at the upper right of this slide. What can you do? So this is the sort of reasoning we actually go through on the helicopter project often to decide what to work on. So let me just step through this example fairly slowly. So this is the sort of reasoning you might go through. Suppose these three assumptions hold true, right? Suppose that the helicopter simulator is accurate, so let’s suppose that you built an accurate model of the dynamics. And suppose, this is the second one on the slide, that the reinforcement learning algorithm correctly controls the helicopter in simulation, so it maximizes that expected payoff, right? And suppose that maximizing the expected payoff corresponds to autonomous flight, right? If all three of these assumptions hold true, then that means that you would expect the learned controller pi subscript RL to fly well on the actual helicopter. Okay? So I’m, sort of, showing you the sort of reasoning that I go through when I’m trying to come up with a set of diagnostics for this problem. So these are some of the diagnostics we actually use routinely on various control problems. So pi subscript RL, right? We said it doesn’t fly well on the actual helicopter. So the first diagnostic you want to run is just to check if it flies well in simulation, all right? So if it flies well in simulation, but not in real life, then the problem is in the simulator, right? Because the simulator predicts that your controller, pi subscript RL, flies well, but it doesn’t actually fly well in real life. So if this holds true, then that suggests the problem is in the simulator. Question?
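To make the first diagnostic concrete, here is a minimal Python sketch. It is an illustration, not code from the lecture or the helicopter project. Everything in it is an assumption made for the example: the one-dimensional state, the toy `sim_step` and `real_step` dynamics (the "real" system has a constant drift the simulator misses), the quadratic reward, the stand-in policy `pi_rl`, and the payoff threshold used to decide what "flies well" means.

```python
from typing import Callable

# Hypothetical toy setup: a 1-D "helicopter" whose state is its distance
# from the hover point. None of these names come from the lecture.

def reward(s: float) -> float:
    """Quadratic reward: penalize squared distance from the hover point."""
    return -(s ** 2)

def pi_rl(s: float) -> float:
    """Stand-in for the learned policy: push straight back toward zero."""
    return -s

def sim_step(s: float, a: float) -> float:
    """Simulator dynamics (the model of the dynamics)."""
    return s + a

def real_step(s: float, a: float) -> float:
    """Stand-in for the real system: a constant drift the simulator misses."""
    return s + a + 0.5

def finite_horizon_payoff(policy: Callable[[float], float],
                          step: Callable[[float, float], float],
                          s0: float, horizon: int) -> float:
    """Sum of rewards R(s_0) + ... + R(s_T) along one rollout."""
    s = s0
    total = reward(s)
    for _ in range(horizon):
        s = step(s, policy(s))
        total += reward(s)
    return total

def first_diagnostic(sim_payoff: float, real_payoff: float,
                     flies_well_threshold: float) -> str:
    """If the policy does well in simulation but not in real life,
    the problem is in the simulator. Otherwise this check alone
    does not locate the problem."""
    well_in_sim = sim_payoff >= flies_well_threshold
    well_in_real = real_payoff >= flies_well_threshold
    if well_in_sim and not well_in_real:
        return "simulator"
    return "undetermined"

sim_payoff = finite_horizon_payoff(pi_rl, sim_step, s0=1.0, horizon=5)
real_payoff = finite_horizon_payoff(pi_rl, real_step, s0=1.0, horizon=5)
verdict = first_diagnostic(sim_payoff, real_payoff, flies_well_threshold=-1.5)
print(sim_payoff, real_payoff, verdict)
```

In this toy case the simulator predicts a good payoff for the policy while the drifting "real" system gives a clearly worse one, so the check points at the simulator. On an actual system the real-life payoff would come from flight data rather than from a second function, and the threshold would be a judgment call.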