AI-boxing is the theory dealing with machine intelligences that are allegedly safer because they allegedly have only extremely restricted channels through which they can causally manipulate the outside universe.
AI-boxing theory includes:
- The straightforward problem of building elaborate sandboxes (computers and simulation environments designed to offer no channels of causal interaction with the outside universe beyond the intended ones).
- Foreseeable difficulties whereby the remaining, limited channels of interaction may be exploited to manipulate the outside universe, especially the human operators.
- The attempt to design preference frameworks that are not incentivized to escape the Box or to manipulate the outside universe or the human operators, and that are incentivized to answer questions accurately or to perform whatever other task is supposed to be carried out inside the Box.
The central difficulty of AI boxing is to describe a channel that cannot be used to manipulate the human operators, yet still provides information important enough to be pivotal or game-changing relative to larger events. For example, it seems not unthinkable that we could safely extract, from a boxed AI setup, reliable reports that prespecified theorems had been proved within Zermelo-Fraenkel set theory; but there is no known way to save the world if only we could sometimes know that prespecified theorems had been reliably proven in Zermelo-Fraenkel set theory.
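As a minimal sketch of the kind of narrow channel being described, the fragment below uses a toy verifiable claim (that a prespecified integer is composite) as a stand-in for "a prespecified theorem proved in ZF." The function names and the choice of claim are illustrative assumptions, not part of any actual boxing proposal; the point is only the shape of the interface, in which the operators never inspect the boxed system's output directly and the only information crossing the box boundary is a single bit computed outside the box by a verifier the operators trust.

```python
# Illustrative sketch only, not a real boxing protocol.
# The "theorem" is the toy claim that a prespecified integer N is composite;
# the boxed system's entire output is a purported nontrivial factor.
# A real setup would replace trusted_verifier with a formally verified
# proof checker for derivations in ZF.

N = 1000000016000000063  # prespecified number whose compositeness is the claim


def trusted_verifier(claimed_factor: int) -> bool:
    """Runs outside the box; computes the single bit the operators see."""
    return 1 < claimed_factor < N and N % claimed_factor == 0


def narrow_channel(boxed_ai_output: int) -> bool:
    # Operators never read boxed_ai_output itself; the boxed system's only
    # causal influence on the outside world is the value of this one bit.
    return trusted_verifier(boxed_ai_output)


# Example: the boxed system outputs 1000000007, and the operators learn
# only that the prespecified claim was verified.
print(narrow_channel(1000000007))  # True
```

Even granting this interface, the difficulty stated above remains: a one-bit verdict about a prespecified formal claim is very hard to exploit for manipulation, but it is correspondingly hard to see how such verdicts alone could be pivotal.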