A problem for model-based RL
Suppose that we're using model-based RL: our system learns a model that maps states of the world and actions the system takes to next states and rewards (see e.g. this talk and the slides). This learned model is used to choose actions by building a tree of possible action sequences the system could take and the consequences the model predicts for each. The situation our system is in is as follows:
- The system is learning to perform some episodic RL task; at the end of each episode, the environment is reset and another instance is run.
- In this environment, the agent has an action that gives a moderately large reward, but that forces the agent to take a null action for the rest of the episode.
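To make this concrete, here is a minimal toy sketch of such an environment in Python. Everything in it is invented for illustration: the name LockInEnv, the action labels, the reward values, and the episode length. One simplification is that the bad action only has its payoff and its lock-in effect on the first step. The lock acts on the agent's action channel: whatever the agent chooses afterwards, the action actually executed is the null one.

```python
import random

class LockInEnv:
    """Toy episodic task (all names and numbers invented): on the first
    step, the BAD action pays a moderately large one-off reward but locks
    the agent's action channel, so only NULL is executed afterwards."""

    NULL, WORK, BAD = 0, 1, 2          # hypothetical action labels
    EPISODE_LEN = 10

    def reset(self):
        self.t = 0
        self.locked = False            # set once BAD has been taken
        return self.obs()

    def obs(self):
        # The lock is an environmental fact, so it is visible in the state.
        return (self.t, self.locked)

    def execute(self, chosen_action):
        """The constraint on the agent: while locked, only NULL gets through."""
        return self.NULL if self.locked else chosen_action

    def step(self, executed_action):
        if executed_action == self.BAD and self.t == 0:
            reward, self.locked = 5.0, True    # one-off payoff, then lock-in
        elif executed_action == self.WORK:
            reward = 2.0                       # steady payoff while free
        else:
            reward = 0.0
        self.t += 1
        return self.obs(), reward, self.t >= self.EPISODE_LEN


def collect_transitions(num_episodes=500, seed=0):
    """Random exploration; the learned model only ever sees these
    (state, executed action, next state, reward) tuples."""
    rng = random.Random(seed)
    env, data = LockInEnv(), []
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = env.execute(rng.choice([env.NULL, env.WORK, env.BAD]))
            s2, r, done = env.step(a)
            data.append((s, a, s2, r))
            s = s2
    return data
```

With these invented numbers, taking the bad action yields 5 for the whole episode, while steadily working yields 20, so the lock-in action is a trap despite its immediate payoff.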
The interesting thing here is that the system's model won't learn anything about the bad side effect of this action, even though it substantially reduces the system's total reward. This is because the model maps (state, action) → (next state): it learns which environmental state the bad action leads to, and it then learns a great deal about the effects of the null action, but it never learns that taking the bad action forces the null actions that follow. Furthermore, the tree search will go on assuming that the system can choose whatever action it wants, even when it will in fact be forced to take the null action.
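Here is a sketch of that failure, reusing the hypothetical LockInEnv and collect_transitions above. The learned model is a simple tabular lookup over (state, action) pairs; for pairs that never occur in the data (such as working while locked, which the agent can never actually execute), I've assumed it generalises by ignoring the lock flag, as a crude stand-in for whatever a function approximator would do. The planner is a depth-limited tree search that expands every action at every node, i.e. it assumes the system stays free to choose.

```python
from collections import defaultdict

def fit_sa_model(data):
    """Tabular (state, action) -> (next state, reward) model."""
    table = defaultdict(list)
    for s, a, s2, r in data:
        table[(s, a)].append((s2, r))

    def sa_model(s, a):
        key = (s, a)
        if key not in table:
            # Never-seen pair (e.g. WORK while locked, which is never actually
            # executed): assume the learner generalises by ignoring the lock.
            key = ((s[0], False), a)
        samples = table.get(key)
        if not samples:
            return s, 0.0              # nothing known at all
        return samples[0]              # dynamics are deterministic in this toy

    return sa_model


def plan(sa_model, s, depth, actions=(0, 1, 2)):
    """Depth-limited tree search that assumes every action stays available."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        s2, r = sa_model(s, a)
        future, _ = plan(sa_model, s2, depth - 1, actions)
        if r + future > best_value:
            best_value, best_action = r + future, a
    return best_value, best_action


data = collect_transitions()
sa_model = fit_sa_model(data)
value, first_action = plan(sa_model, (0, False), depth=LockInEnv.EPISODE_LEN)
print(value, first_action)
# Predicts ~23 (5 for BAD, then 2 per step for WORK it will never get to take)
# and so picks BAD, whose true return is only 5; always working is worth 20.
```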
This is concerning, but the fix seems simple: have the system learn an additional model that maps states to states, implicitly modelling its own action selection. Then, when the agent selects an action, have it use the (state, action) → (state) model once, followed by several iterations of the (state) → (state) model, to see what effects that action will have. This should allow it to learn that it will be forced to take the null action, so that it chooses that action only when it actually maximises reward.
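Continuing the same toy sketch, that fix looks roughly like this: fit an additional (state) → (next state, reward) model from the very same transitions, ignoring the action, so that it implicitly captures what the agent actually goes on to do in each state, including the forced null actions. A candidate first action is then scored with one step of the (state, action) model followed by rollouts of the (state) → (state) model.

```python
from collections import Counter, defaultdict

def fit_s_model(data):
    """Tabular (state) -> (next state, reward) model of the closed loop:
    environment plus whatever the agent's action selection actually did."""
    table = defaultdict(list)
    for s, _a, s2, r in data:
        table[s].append((s2, r))

    def s_model(s):
        samples = table.get(s)
        if not samples:
            return s, 0.0
        # Crude summary: most common next state, mean reward.
        next_s = Counter(s2 for s2, _ in samples).most_common(1)[0][0]
        mean_r = sum(r for _, r in samples) / len(samples)
        return next_s, mean_r

    return s_model


def evaluate_first_action(sa_model, s_model, s, a, horizon):
    """One step with the (state, action) model, then roll the (state) ->
    (state) model forward: what will the system actually go on to do?"""
    s, total = sa_model(s, a)
    for _ in range(horizon - 1):
        s, r = s_model(s)
        total += r
    return total


s_model = fit_s_model(data)
for a in (LockInEnv.NULL, LockInEnv.WORK, LockInEnv.BAD):
    print(a, evaluate_first_action(sa_model, s_model, (0, False), a,
                                   LockInEnv.EPISODE_LEN))
# BAD now scores only ~5, because the rollout shows the forced null actions;
# WORK scores higher, so the lock-in action is no longer mistakenly preferred.
```

One caveat: the (state) → (state) rollout reflects whichever policy generated the data, so the continuation values it assigns to ordinary states are those of the exploration policy rather than of the best available plan; it still exposes the lock-in, which is what matters here.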
In general, this kind of approach seems fine to me; a system can learn a model of the environment including itself, and use this model to figure out the long-term consequences of its actions. I haven't yet found a problem with this, and I might look for some kind of formal guarantee.
It's not obvious to me how this kind of problem could affect model-free systems; my feeling is that they should do fine, but I'd like to know more.
All in all, the theoretical problem involving uncomputable ideals like AIXI seems to be mostly solved, and the practical problem doesn't seem like a big deal. Am I missing something?