I agree that reflective degrees of freedom won't "fix themselves" automatically, and that this is a useful concept.
There are at least two different approaches to getting the reflective degrees of freedom right:
1. Figure out the right settings and build a reflectively consistent system that has those settings.
2. Build a system which is motivated to defer to human judgments or to hypothetical human judgments.
A system of type 2 might be motivated to adopt the settings that humans would endorse upon reflection, rather than to continue using its interim decision theory/prior/etc.
On its face, the type 2 approach seems significantly more promising. The techniques needed to defer to human views about decision theory / priors / etc. already seem necessary to defer to human values.
You've given the argument that the interim prior/decision theory/whatever would lead to catastrophically bad outcomes, either because there are exotic failures, or because we wouldn't have a good enough theory and so would be forced to use a less principled approach (which we wouldn't actually be able to make aligned).
I don't find this argument especially convincing. I think it is particularly weak in the context of act-based agents, and especially proposals like this one. In this context I don't think we have compelling examples of plausible gotchas. We've seen some weird cases like simulation warfare, but these appear to be ruled out by the kinds of robustness guarantees that are already needed in more prosaic cases. Others, like blackmail or Pascal's mugging, don't seem to come up.
Comments
Eliezer Yudkowsky
I think we have a foundational disagreement here about the extent to which saying "Oh, the AI will just predict that by modeling humans" solves all these issues, versus sweeping the same unsolved issues under the rug into whatever is supposed to be modeling the humans.
Let's say you have a schmuck human who hasn't studied Pascal's Mugging. They build a Solomonoff-like prior into their AI, and an aggregative utility function, which both seem to them like reasonable approximate models of how humans behave. The AI seems to behave reasonably during the training phase, but once it's powerful enough is Pascal's Mugged into weird edge-case behavior.
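As a toy illustration of that failure mode (a sketch under stated assumptions, not a model of any real system): suppose a claim that takes k bits to state gets prior weight 2^-k, and utility is simply the claimed payoff added up. The function names and all the numbers below are hypothetical. The point is only that a short description can name a payoff far larger than 2^k, so the complexity penalty fails to cancel it.

```python
from fractions import Fraction


def complexity_prior(description_bits: int) -> Fraction:
    """Toy stand-in for a Solomonoff-like prior: weight 2^-k for a k-bit claim."""
    return Fraction(1, 2 ** description_bits)


def expected_gain_of_paying(description_bits: int, claimed_payoff: int, cost: int) -> Fraction:
    """Expected utility of paying the mugger under an aggregative (additive) utility.

    Assumption: the only uncertainty is whether the claim is true, and paying
    always costs `cost` units of utility.
    """
    return complexity_prior(description_bits) * claimed_payoff - cost


def pays_mugger(description_bits: int, claimed_payoff: int, cost: int) -> bool:
    """The naive agent pays whenever the expected gain is positive."""
    return expected_gain_of_paying(description_bits, claimed_payoff, cost) > 0


# A mundane claim seen in "training": modest payoff, so the prior penalty wins.
mundane = pays_mugger(description_bits=20, claimed_payoff=1000, cost=5)

# An exotic claim: about 100 bits suffice to state a payoff of 2**1024,
# so the payoff outruns the 2^-100 prior and the agent pays.
exotic = pays_mugger(description_bits=100, claimed_payoff=2 ** 1024, cost=5)

print(mundane, exotic)
```

On the mundane case the agent declines, which is the "behaves reasonably during training" part; the same rule pays out on the exotic case.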
When I imagine trying to use a 'predict human acts' system, I worry that, unless we have strong transparency into the system internals and we know about the Pascal's Mugging problem, what would happen to the equivalent schmuck would be that the system generalized something a lot like consequentialism and aggregative ethics as mostly compactly predicting the acts that the humans approved or produced after a lot of reflection, and then the generalization would break down later on the same edge case.
Some of this probably reflects the degree to which you're imagining using an act-based agent that is a strong superintelligence with access to brain scans which is hence relatively epistemically efficient on every prediction, while I'm imagining trying to use something that isn't yet that smart (because we can't let it FOOM up to superintelligence, because we don't fully trust it, or because there's a chicken-and-egg problem with requiring trustworthy predictions to bootstrap in a trustworthy way).
You also seem to be imagining that the problem of corrigibility has otherwise already been solved, or is maybe being solved via some other predictive thing, whereas I'm treating generalization failures that can kill you before you have time to register or spot the prediction failure as being indeed failures - you seem to assume there's a mature corrigibility system which catches that.
I'm not sure this is the right page to have this discussion; we should probably be talking about this inside the act-based system pages.
Paul Christiano
I responded here.
I'm not imagining superintelligences. If we have this conversation in the context of existing machine learning systems, I feel just as good about it.