A problem for model-based RL
Suppose that we're using model-based RL: our system learns a model that maps states of the world and actions the system takes to next states and rewards (see e.g. this talk and the slides). This learned model is used to choose actions by building a tree of possible action sequences the system could take and the consequences the model predicts for each. The situation our system is in is as follows:
- The system is learning to perform some episodic RL task; at the end of each episode, the environment is reset and another instance is run.
- In this environment, the agent has an action that gives a moderately large reward, but that forces the agent to take a null action for the rest of the episode.
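To make this concrete, here is a minimal toy sketch of such an environment in Python. Everything in it is invented for illustration: the name LockInEnv, the action labels, the reward values, and the episode length. One simplification is that the bad action only has its payoff and its lock-in effect on the first step. The lock acts on the agent's action channel: whatever the agent chooses afterwards, the action actually executed is the null one.

```python
import random

class LockInEnv:
    """Toy episodic task (all names and numbers invented): on the first
    step, the BAD action pays a moderately large one-off reward but locks
    the agent's action channel, so only NULL is executed afterwards."""

    NULL, WORK, BAD = 0, 1, 2          # hypothetical action labels
    EPISODE_LEN = 10

    def reset(self):
        self.t = 0
        self.locked = False            # set once BAD has been taken
        return self.obs()

    def obs(self):
        # The lock is an environmental fact, so it is visible in the state.
        return (self.t, self.locked)

    def execute(self, chosen_action):
        """The constraint on the agent: while locked, only NULL gets through."""
        return self.NULL if self.locked else chosen_action

    def step(self, executed_action):
        if executed_action == self.BAD and self.t == 0:
            reward, self.locked = 5.0, True    # one-off payoff, then lock-in
        elif executed_action == self.WORK:
            reward = 2.0                       # steady payoff while free
        else:
            reward = 0.0
        self.t += 1
        return self.obs(), reward, self.t >= self.EPISODE_LEN


def collect_transitions(num_episodes=500, seed=0):
    """Random exploration; the learned model only ever sees these
    (state, executed action, next state, reward) tuples."""
    rng = random.Random(seed)
    env, data = LockInEnv(), []
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = env.execute(rng.choice([env.NULL, env.WORK, env.BAD]))
            s2, r, done = env.step(a)
            data.append((s, a, s2, r))
            s = s2
    return data
```

With these invented numbers, taking the bad action yields 5 for the whole episode, while steadily working yields 20, so the lock-in action is a trap despite its immediate payoff.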
The interesting thing here is that the system's model won't learn anything about the bad side effect of this action, even though it substantially reduces the system's total reward. This is because the model maps (state, action) → (next state): it learns which environmental state the bad action leads to, and it then learns a great deal about the effects of the null action, but it never learns that taking the bad action forces the null actions that follow. Furthermore, the tree search will go on assuming that the system can choose whatever action it wants, even when it will in fact be forced to take the null action.
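Here is a sketch of that failure, reusing the hypothetical LockInEnv and collect_transitions above. The learned model is a simple tabular lookup over (state, action) pairs; for pairs that never occur in the data (such as working while locked, which the agent can never actually execute), I've assumed it generalises by ignoring the lock flag, as a crude stand-in for whatever a function approximator would do. The planner is a depth-limited tree search that expands every action at every node, i.e. it assumes the system stays free to choose.

```python
from collections import defaultdict

def fit_sa_model(data):
    """Tabular (state, action) -> (next state, reward) model."""
    table = defaultdict(list)
    for s, a, s2, r in data:
        table[(s, a)].append((s2, r))

    def sa_model(s, a):
        key = (s, a)
        if key not in table:
            # Never-seen pair (e.g. WORK while locked, which is never actually
            # executed): assume the learner generalises by ignoring the lock.
            key = ((s[0], False), a)
        samples = table.get(key)
        if not samples:
            return s, 0.0              # nothing known at all
        return samples[0]              # dynamics are deterministic in this toy

    return sa_model


def plan(sa_model, s, depth, actions=(0, 1, 2)):
    """Depth-limited tree search that assumes every action stays available."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        s2, r = sa_model(s, a)
        future, _ = plan(sa_model, s2, depth - 1, actions)
        if r + future > best_value:
            best_value, best_action = r + future, a
    return best_value, best_action


data = collect_transitions()
sa_model = fit_sa_model(data)
value, first_action = plan(sa_model, (0, False), depth=LockInEnv.EPISODE_LEN)
print(value, first_action)
# Predicts ~23 (5 for BAD, then 2 per step for WORK it will never get to take)
# and so picks BAD, whose true return is only 5; always working is worth 20.
```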
This is concerning, but the fix seems simple: have the system learn an additional model that maps states to states, implicitly modelling its own action selection. Then, when the agent selects an action, have it use the (state, action) → (state) model once, followed by several iterations of the (state) → (state) model, to see what effects that action will have. This should allow it to learn that it will be forced to take the null action, so that it chooses that action only when it actually maximises reward.
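Continuing the same toy sketch, that fix looks roughly like this: fit an additional (state) → (next state, reward) model from the very same transitions, ignoring the action, so that it implicitly captures what the agent actually goes on to do in each state, including the forced null actions. A candidate first action is then scored with one step of the (state, action) model followed by rollouts of the (state) → (state) model.

```python
from collections import Counter, defaultdict

def fit_s_model(data):
    """Tabular (state) -> (next state, reward) model of the closed loop:
    environment plus whatever the agent's action selection actually did."""
    table = defaultdict(list)
    for s, _a, s2, r in data:
        table[s].append((s2, r))

    def s_model(s):
        samples = table.get(s)
        if not samples:
            return s, 0.0
        # Crude summary: most common next state, mean reward.
        next_s = Counter(s2 for s2, _ in samples).most_common(1)[0][0]
        mean_r = sum(r for _, r in samples) / len(samples)
        return next_s, mean_r

    return s_model


def evaluate_first_action(sa_model, s_model, s, a, horizon):
    """One step with the (state, action) model, then roll the (state) ->
    (state) model forward: what will the system actually go on to do?"""
    s, total = sa_model(s, a)
    for _ in range(horizon - 1):
        s, r = s_model(s)
        total += r
    return total


s_model = fit_s_model(data)
for a in (LockInEnv.NULL, LockInEnv.WORK, LockInEnv.BAD):
    print(a, evaluate_first_action(sa_model, s_model, (0, False), a,
                                   LockInEnv.EPISODE_LEN))
# BAD now scores only ~5, because the rollout shows the forced null actions;
# WORK scores higher, so the lock-in action is no longer mistakenly preferred.
```

One caveat: the (state) → (state) rollout reflects whichever policy generated the data, so the continuation values it assigns to ordinary states are those of the exploration policy rather than of the best available plan; it still exposes the lock-in, which is what matters here.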
In general, this kind of approach seems fine to me; a system can learn a model of the environment including itself, and use this model to figure out the long-term consequences of its actions. I haven't yet found a problem with this, and I might look for some kind of formal guarantee.
It's not obvious to me how this kind of problem could affect model-free systems; my feeling is that they should do fine, but I'd like to know more.
All in all, the theoretical problem involving uncomputable ideals like AIXI seems to be mostly solved, and the practical problem doesn't seem like a big deal. Am I missing something?