Student: So every time you fix the random numbers, you get a nice curve to optimize. And then you change the random numbers to get a bunch of different curves that are easy to optimize. And then you smush them together?
Instructor (Andrew Ng): Let's see. I have just one nice black curve that I'm trying to optimize.
Student: For each scenario.
Instructor (Andrew Ng): I see. So I'm gonna average over M scenarios, so I'm gonna average over 100 scenarios. So the black curve here is defined by averaging over a large set of scenarios. Does that make sense? If the averaging thing doesn't make sense, imagine that there's just one sequence of random numbers; that might be easier to think about. Fix one sequence of random numbers, and every time I evaluate another policy, I evaluate it against that same sequence of random numbers, and that gives me a nice deterministic function to optimize. Any other questions? The advantage is really that, one way to think about it, when I evaluate the same policy twice, I at least get back the same answer. This gives me a deterministic function mapping from the parameters of my policy to my estimate of the expected payoff, and that's just a function I can try to optimize using a search algorithm. So we used this algorithm for inverted hovering, and again, policy search algorithms tend to work well when you can find a reasonably simple policy mapping from the states to the actions. This is especially true for low-level control tasks, which I think of as almost reflexes.
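[To make the fixed-scenario trick concrete, here is a minimal Python sketch, not from the lecture itself. The simulator and linear policy below (initial_state, policy_action, simulate_step) are hypothetical toy stand-ins; what matters is that evaluate_policy reuses the same M = 100 seeds on every call, which turns a noisy payoff estimate into a deterministic function of the policy parameters.]

    import numpy as np

    # Hypothetical stand-ins for a real simulator and parameterized policy;
    # toy definitions so the sketch runs, not anything from the lecture.
    def initial_state():
        return np.zeros(2)

    def policy_action(theta, state):
        # A simple linear policy: action = theta . state
        return float(theta @ state)

    def simulate_step(state, action, rng):
        # Toy noisy dynamics; rng supplies this scenario's random numbers.
        next_state = 0.9 * state + action + 0.1 * rng.standard_normal(2)
        reward = -float(next_state @ next_state)  # penalize distance from origin
        return next_state, reward

    def evaluate_policy(theta, seeds, horizon=50):
        """Average payoff over a fixed set of scenarios (one per seed).

        Because the same seeds are reused on every call, this is a
        deterministic function of theta: evaluating the same policy twice
        gives the same answer, so an ordinary search algorithm can
        optimize it."""
        total = 0.0
        for seed in seeds:
            rng = np.random.default_rng(seed)  # same random numbers each call
            state, payoff = initial_state(), 0.0
            for _ in range(horizon):
                action = policy_action(theta, state)
                state, reward = simulate_step(state, action, rng)
                payoff += reward
            total += payoff
        return total / len(seeds)  # the deterministic "black curve" at theta

    seeds = range(100)  # fix M = 100 scenarios once, reuse for every policy
    print(evaluate_policy(np.array([-0.5, -0.5]), seeds))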
Conversely, if you want to solve a problem like Tetris, where you might plan ahead a few steps about what's a nice configuration of the board, or something like a game of chess, or a long path-planning problem of go here, then go there, then go there, then sometimes you might apply a value function method instead. But for tasks like helicopter flight, for low-level control tasks, for the reflexes of driving or controlling various robots, where you can sometimes more easily approximate the policy directly, policy search works very well. Some [inaudible] the state of RL today. RL algorithms are applied to a wide range of problems, and the key is really sequential decision making. The place where you think about applying a reinforcement learning algorithm is when you need to make a decision, then another decision, then another decision, and some of your actions may have long-term consequences. I think that is the heart of RL: sequential decision making, where you make multiple decisions, and some of your actions may have long-term consequences. I've shown you a bunch of robotics examples. RL is also applied to things like medical decision making, where you may have a patient and you want to choose a sequence of treatments: you do this treatment for the patient now, the patient may then be in some other state, and you choose another treatment later, and so on.
It turns out there's a large community of people applying these sorts of tools to queues. So imagine you have a bank where you have people lining up, and after they go to one cashier, some of them have to go to the manager to deal with something else. You have a system of multiple people standing in multiple queues, and the question is how you route people optimally to minimize the waiting time. And not just people, but objects in an assembly line and so on. It turns out there's a surprisingly large community working on optimizing queues. I mentioned game playing a little bit already. There are things like financial decision making: if you have a large amount of stock, how do you time the selling off of your stock so as to not affect market prices too adversely? There are many operations research problems, things like factory automation: can you design a factory to optimize throughput, or minimize cost, or whatever? These are all sorts of problems that people are applying reinforcement learning algorithms to.