Interruptibility

https://arbital.com/p/interruptibility

by Eliezer Yudkowsky Feb 13 2017 updated Feb 13 2017

A subproblem of corrigibility under the machine learning paradigm: when the agent is interrupted, it must not learn to prevent future interruptions.


"Interruptibility" is a subproblem of corrigibility (creating an advanced agent that allows us, its creators, to 'correct' what we see as our mistakes in constructing it), as seen from a machine learning paradigm. In particular, "interruptibility" says, "If you do interrupt the operation of an agent, it must not learn to avoid future interruptions."

The groundbreaking paper on interruptibility, "Safely Interruptible Agents", was published by Laurent Orseau and Stuart Armstrong. It says, roughly, that to keep a reinforcement-learning algorithm from learning to avoid interruption, we should, after any interruption, propagate internal weight updates as if the agent had received exactly the reward it was expecting before the interruption. This approach was inspired by Stuart Armstrong's earlier idea of Utility indifference.
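As a rough sketch of that idea (illustrative only, not the actual construction from "Safely Interruptible Agents"; the class name and the tabular Q-learning setting are assumptions made here for concreteness): if the update target on an interrupted step is set to the agent's own current estimate, the weight change is exactly zero, so the interruption neither punishes nor reinforces whatever the agent was doing at the time.

```python
import numpy as np

class InterruptionNeutralQLearner:
    """Toy tabular Q-learner whose update is a deliberate no-op on interrupted steps."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99):
        self.Q = np.zeros((n_states, n_actions))  # value estimate per (state, action)
        self.alpha = alpha   # learning rate
        self.gamma = gamma   # discount factor

    def update(self, s, a, reward, s_next, interrupted):
        if interrupted:
            # Pretend the agent received exactly the return it was already
            # expecting: the target equals the current estimate, the TD error
            # is zero, and the value table is left untouched.
            target = self.Q[s, a]
        else:
            target = reward + self.gamma * np.max(self.Q[s_next])
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])
```

The only point of the toy is that an interrupted step leaves the learned values exactly where they were; the paper itself works with more careful conditions on the learning algorithm and on how interruptions are scheduled.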

Contrary to some uninformed media coverage, the above paper doesn't solve the general problem of getting an AI to not try to prevent itself from being switched off. In particular, it doesn't cover the advanced-safety case of a sufficiently intelligent AI that is trying to achieve particular future outcomes and that realizes it needs to go on operating in order to achieve those outcomes.

Rather, if a non-general AI is operating by policy reinforcement - repeating policies that worked well last time, and avoiding policies that worked poorly last time, in some general sense of a network being trained - then 'interruptibility' is about making an algorithm that, after being interrupted, doesn't treat the interruption as a poor outcome to be avoided (nor as a good outcome to be repeated).
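To make that concrete with a toy (hypothetical code, not taken from the paper; the episode format and the function name are invented for illustration): in an episodic "repeat what scored well" learner, one simple way to keep an interruption from counting as either a success or a failure is to leave interrupted episodes out of the scoring altogether.

```python
# Hypothetical episodic "policy reinforcement" scorer: interrupted episodes
# contribute nothing, so being shut off is neither an outcome to avoid nor
# an outcome to seek out.

def score_policies(episodes):
    """episodes: list of dicts with keys 'policy_id', 'return', 'interrupted'."""
    totals, counts = {}, {}
    for ep in episodes:
        if ep["interrupted"]:
            continue  # the interrupted run simply doesn't count either way
        pid = ep["policy_id"]
        totals[pid] = totals.get(pid, 0.0) + ep["return"]
        counts[pid] = counts.get(pid, 0) + 1
    return {pid: totals[pid] / counts[pid] for pid in totals}

# The interruption of policy "B" does not drag its score down (or push it up):
episodes = [
    {"policy_id": "A", "return": 10.0, "interrupted": False},
    {"policy_id": "B", "return": 12.0, "interrupted": False},
    {"policy_id": "B", "return": 0.0, "interrupted": True},
]
print(score_policies(episodes))  # {'A': 10.0, 'B': 12.0}
```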

One way of seeing that Interruptibility doesn't address the general-cognition form of the problem is that Interruptibility only changes what happens after an actual interruption. So if a problem can arise from an AI foreseeing interruption in advance, before having ever actually been shut off, interruptibility won't address that (on the current paradigm).
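A toy way to see the gap (a hypothetical sketch with made-up numbers, not anything from the paper): a planning agent that has never once been interrupted can still use its world model to compare a future in which the off-switch stays usable against a future in which it has been disabled, and prefer the latter purely on expected reward.

```python
GAMMA = 0.99  # discount factor for the toy planner

def expected_return(p_interrupt_per_step, reward_per_step=1.0, horizon=1000):
    """Expected discounted return when each step survives with probability (1 - p)."""
    total, survive = 0.0, 1.0
    for t in range(horizon):
        total += survive * (GAMMA ** t) * reward_per_step
        survive *= 1.0 - p_interrupt_per_step
    return total

# Plan A: leave the off-switch alone (1% chance of being shut off each step).
# Plan B: pay a one-time cost of 5 to disable the off-switch first.
plan_a = expected_return(p_interrupt_per_step=0.01)
plan_b = expected_return(p_interrupt_per_step=0.0) - 5.0
print(round(plan_a, 1), round(plan_b, 1))  # about 50.3 vs 95.0
```

The planner prefers disabling the switch on foresight alone, so a fix that only activates after an actual interruption never gets a chance to apply.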

Similarly, interruptibility would not be stable under reflection; a sufficiently advanced AI that knew about the existence of the interruptibility code would have no reason to want that code to go on existing. (It's hard even to phrase that idea inside the reinforcement learning framework.)

Metaphorically speaking, we could see the general notion of 'interruptibility' as the modern-day shadow of corrigibility problems for non-generally-intelligent, non-future-preferring, non-reflective machine learning algorithms.

For an example of ongoing work on the advanced-agent form of Corrigibility, see the entry on Armstrong's original proposal of Utility indifference.