Thursday, March 3, 2016

Failing to learn about counterfactuals

Let's say that we have a model-based RL system doing episodic RL in the following environment:
  • At the start of each episode, the system is copied onto another computer
  • The system and its copy play one round of the Prisoner's Dilemma game
  • At the end of the episode, the situation is reset (the copy is erased)
When the system has a fairly abstract and non-physical model that maps (state, action) → state, it can simply model the copy's action as if it directly depended on the system's action. So, when it predicts what would happen if it cooperated, it will predict that the copy will cooperate as well, and likewise for defection.
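To make that concrete, here's a toy sketch of such a symmetric model (the payoff numbers are just a standard Prisoner's Dilemma matrix I'm assuming for illustration):

```python
# Toy symmetric model: the copy is predicted to do whatever the system does.
PAYOFFS = {  # (my_action, copy_action) -> my reward; assumed PD numbers
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def symmetric_model(my_action):
    """Abstract model: the copy's action mirrors mine."""
    return my_action

def choose_action():
    # Evaluate each action under the symmetric model and take the best one.
    return max(["C", "D"], key=lambda a: PAYOFFS[(a, symmetric_model(a))])

print(choose_action())  # -> "C": mutual cooperation beats mutual defection
```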

However, we might hope that the system will gradually learn a more physically realistic model of the world. This model won't contain any physical pathway linking the system's action to the copy's action (since the copy was made before the Prisoner's Dilemma is played), and it will allow for the copy to be interfered with in a variety of ways, breaking the symmetry of the game.

Clearly, this model can't be perfect -- if the system had to predict what a perfect copy would do before it acted, it would fall into an infinite regress. The system will need some way around this, like a restricted model class, a fallback for models that run too long, or some way of recognizing these kinds of situations (though halting-like problems seem to keep it from recognizing all situations of this type). If it can successfully recognize this situation, it seems like it "should" assume that the copy will take the same action as it does, and the problem is resolved. That would be nice! However, I don't think that models as we currently build them will do this by default.
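Here's a rough illustration of the regress, with a depth cutoff standing in for a fallback for long-running models (the cutoff and its default estimate of "defect" are arbitrary choices on my part):

```python
# A naive model that simulates a perfect copy ends up simulating itself,
# so it needs some cutoff; what it returns at the cutoff is arbitrary.
PAYOFFS = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def decide(depth=0, max_depth=5):
    copy_action = predict_copy(depth, max_depth)
    return max(["C", "D"], key=lambda a: PAYOFFS[(a, copy_action)])

def predict_copy(depth, max_depth):
    """Predict the copy by running its (identical) decision procedure."""
    if depth >= max_depth:
        return "D"  # default estimate once the model has run "too long"
    return decide(depth + 1, max_depth)

print(decide())  # the cutoff bottoms out at "D", so defection looks best
```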

Let's assume that the model is not totally accurate, either because of a limited model class or because the system defaults to some estimate of the next state when a model takes too long to run. Now, when the system predicts what the copy will do, this prediction is independent of the system's action. Without loss of generality, let's say that the system predicts the copy will defect.
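In code, such an imprecise model might look like this sketch -- the only point is that the predicted copy action doesn't depend on the system's own action (defaulting to "defect" is an arbitrary assumption here):

```python
# Sketch of an imprecise model: the predicted copy action ignores my_action.
def imprecise_model(my_action, estimated_copy_action="D"):
    # The physically realistic model has no causal path from my_action to the
    # copy, so the prediction is just some fixed estimate ("defect" here).
    return estimated_copy_action

# The prediction is the same whether the system cooperates or defects:
assert imprecise_model("C") == imprecise_model("D") == "D"
```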

Now, the system needs to make a decision. It will evaluate cooperation and defection, both under the prediction that the copy will defect, and in this case it will choose defection. After it makes this choice, both it and the copy will in fact defect, reinforcing the system's model. However, the situation that the system predicted would follow from the action it didn't take -- the one where it cooperates while its copy defects -- wouldn't actually have happened! Since the system and its copy are identical, they will always behave the same way; the model has predicted incorrectly. Furthermore, the system can never witness this incorrect prediction, which makes it very hard to correct (perhaps fixable only by some kind of generalization or regularization?).
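A toy episode loop (again with assumed payoffs) shows how this becomes self-confirming: the observed (defect, defect) outcome keeps matching the prediction, while the counterfactual where the system cooperates and the copy defects is never played out:

```python
# Toy episode loop: the prediction about the copy never depends on the chosen
# action, the copy mirrors the system, and the counterfactual is never tested.
PAYOFFS = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
predicted_copy_action = "D"  # the model's action-independent prediction

for episode in range(3):
    # Evaluate both actions under the same fixed prediction about the copy.
    my_action = max(["C", "D"], key=lambda a: PAYOFFS[(a, predicted_copy_action)])
    copy_action = my_action  # the copy runs the same code, so it chooses the same
    # The observed (D, D) outcome matches the prediction, reinforcing the model;
    # the predicted-but-wrong outcome (C, D) is never observed.
    print(episode, my_action, copy_action, PAYOFFS[(my_action, copy_action)])
```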

In the case I've given above, the system and its copy miss out on some reward as a result of this inaccuracy -- they "could have" both cooperated and gotten better rewards. However, the failure could just as well be harmless (though still unsettling) -- if the system's imprecise model predicts the copy will cooperate, it will cooperate as well, and everything will be fine. I think there are stranger situations where the system's predictions will always be wrong (for halting-problem reasons).

I hope I'll be able to post a simpler version of this problem in the future -- this one is a little too long-winded to be useful!
