Student: [Inaudible] the helicopter pilots try to fly on the simulator? Do you have the real helicopter pilots try to fly on the simulator?
Instructor (Andrew Ng) :Do I try to have the real helicopter flying simulator?
Student: Real helicopter pilots.
Instructor (Andrew Ng) :Oh, I see. Do we ask the helicopter pilots to fly in simulation? So, yeah. It turns out that's one of the diagnostics you could do. On our project we don’t do that very often. We informally ask the pilot, who’s one of the best pilots in the world, Gary Zoco, to look at the simulator sometimes. We don’t very often ask him to fly in the simulator. That answers your question. But let me actually go on and show you some of the other diagnostics that you might use then. Okay?
Second is, let me use pi subscript human to denote the human control policy, right? So pi subscript human is the policy that, you know, is however the human flies it. So one thing to do is look at the value of pi subscript RL compared to the value of pi subscript human. Okay? So what this means really is, look at what the helicopter does when it’s flying under the control of pi subscript RL, and look at what the helicopter does when it’s flying under the human pilot's control, and then, sort of, compute the sum of rewards, right? Compute the sum of rewards for your human pilot's performance and compute the sum of rewards for the learning controller's performance. You can do this on the real helicopter or, for that matter, in simulation, actually, and you can see whether the human obtains a higher or a lower sum of rewards on average than does the controller you just learned. Okay? And the way you do this is you actually go and fly the helicopter and just measure the sum of rewards, right? On the actual sequence of states the helicopter flew through, right? So if this condition holds true, right? Where my mouse pointer is, if – oh, excuse me. Okay. If this condition holds true where my mouse pointer is, if it holds true that the value of pi subscript RL is less than the value of pi subscript human (those of you watching online, I don’t know if you can see this, but this is V pi subscript RL of s zero less than V pi subscript human of s zero).
But if this holds true, then that suggests that the problem is in the reinforcement learning algorithm, because the human has found a policy that obtains a higher reward than does the reinforcement learning algorithm. So this proves, or this shows, that your reinforcement learning algorithm is not maximizing the sum of rewards, right? And lastly, the last test is this: if the inequality holds in the opposite direction, so if the reinforcement learning algorithm obtains a higher sum of rewards on average than does the human, but the reinforcement learning algorithm still flies worse than the human does, right? Then this suggests that the problem is in the cost function, or rather the reward function, because the reinforcement learning algorithm is obtaining a higher sum of rewards than the human, but it still flies worse than the human. So that suggests that maximizing the sum of rewards does not correspond to very good autonomous flight. So if this holds true, then the problem is in your reward function. Or in other words, the problem is in your optimization objective, rather than in the algorithm that’s trying to maximize the optimization objective, and so you might change your optimization objective. In other words, you might change your reward function. Okay?
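The two tests just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the lecture: the function names, the scalar "distance from the hover target" state, and the quadratic `hover_reward` are all hypothetical, and treating a tie as the "opposite direction" case is an assumption as well. The value of each policy is estimated the way the lecture describes, by summing rewards over the states actually flown and averaging over flights.

```python
from statistics import mean


def total_reward(states, reward_fn):
    """Sum of rewards along one flight, given the sequence of states visited."""
    return sum(reward_fn(s) for s in states)


def estimate_value(trajectories, reward_fn):
    """Estimate V^pi(s0) as the average sum of rewards over several flights."""
    return mean(total_reward(t, reward_fn) for t in trajectories)


def diagnose(rl_trajectories, human_trajectories, reward_fn):
    """Apply the two tests, assuming the learned controller flies worse
    than the human pilot (otherwise there is nothing to diagnose).

    Returns "rl_algorithm" if the human obtains a higher average sum of
    rewards, else "reward_function". Ties fall in the second case, which
    is an assumption of this sketch.
    """
    v_rl = estimate_value(rl_trajectories, reward_fn)
    v_human = estimate_value(human_trajectories, reward_fn)
    if v_rl < v_human:
        # The human found a policy with higher reward, so the RL
        # algorithm is not maximizing the sum of rewards.
        return "rl_algorithm"
    # RL obtains at least as much reward yet flies worse, so maximizing
    # this reward does not correspond to good flight.
    return "reward_function"


def hover_reward(distance):
    """Hypothetical reward: negative squared distance from the hover target."""
    return -(distance ** 2)


# Made-up example flights: each list is the distance from the target over time.
rl_flights = [[0, 2, 3]]
human_flights = [[0, 1, 1]]
verdict = diagnose(rl_flights, human_flights, hover_reward)
```

With these made-up flights the human's sum of rewards is higher, so `verdict` points at the reinforcement learning algorithm; swapping the two sets of flights would point at the reward function instead.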