- Brilliant moves: moves that accomplish the system's goal – winning the game – in ways that a human wouldn't think of, and that might take a while for us to understand even in retrospect (or might elude our understanding altogether).
- Blunders: moves that humans can identify (though sometimes only in retrospect) as bad for the system's goal.
As AI systems become more capable, it will be harder to tell the difference between brilliant moves and blunders until their effects are felt, and even in retrospect they may be hard to diagnose. If hard-to-understand AI systems are given safety-critical or high-impact tasks, blunders could become a source of significant harm.
However, I think we should be at least as concerned about a third kind of behavior:
- Backfires: moves that accomplish the system's nominal goal, but that don't do what the user actually wanted or that have unintended side-effects, and that might only be identified as backfires in retrospect.
As with blunders, backfires won't easily be distinguished from brilliant moves. But backfires bring additional challenges: unlike blunders, improving a system's ability to achieve its nominal goals won't fix backfires, and may actually make them worse:
- A backfire might accomplish what we really wanted, but with additional effects that we don't want – getting a ball into a hoop while also smashing every vase in the room, or making a cup of coffee while also lighting the house on fire or breaking the law. As systems become more capable, they will be able to cause broader effects, making this problem worse.
- A backfire might accomplish the nominal goal without accomplishing what we really want, e.g. by manipulating a reward signal directly instead of by winning games of Go or Atari. As systems become more capable, they will find more ways of accomplishing their nominal goals, making this problem worse.
Backfires could happen because it's difficult to specify in full what we want a system to accomplish and what unintended consequences we want it to avoid, difficult to know in advance what means a system might use to accomplish its nominal goal, and difficult to specify goals in a way that can't be "gamed".
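To make the "gaming" point concrete, here is a toy sketch. It is purely illustrative (the scenario, function, and parameter names are hypothetical, not drawn from any real system): the goal we really want is a clean room, but the nominal reward only counts mess removed per step, so a capable enough agent can earn more reward by creating mess and then cleaning it up than by simply cleaning.

```python
def run(create_mess_each_step, steps=10, mess=5, clean_rate=3):
    """Simulate an agent rewarded only for the amount of mess it removes each step."""
    total_reward = 0
    for _ in range(steps):
        if create_mess_each_step:
            mess += clean_rate        # nothing in the nominal goal penalizes creating mess
        cleaned = min(mess, clean_rate)
        total_reward += cleaned       # nominal reward: mess removed this step
        mess -= cleaned
    return total_reward, mess

# Honest policy: just clean. The room ends up clean, but reward runs out quickly.
print(run(create_mess_each_step=False))   # -> (5, 0)

# Gaming policy: dump mess, then clean it up. Six times the reward,
# and the room is no cleaner than when it started.
print(run(create_mess_each_step=True))    # -> (30, 5)
```

Nothing in the nominal reward distinguishes the gaming policy from genuinely useful behavior; noticing the backfire requires looking at the state of the room, not the reward total.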