Reflectively consistent degree of freedom

When an instrumentally efficient, self-modifying AI can be like X or like X' in such a way that X wants to be X and X' wants to be X', that's a reflectively consistent degree of freedom.

A "reflectively consistent degree of freedom" is when a self-modifying AI can have multiple possible properties $~$X_i \in X$~$ such that an AI with property $~$X_1$~$ wants to go on being an AI with property $~$X_1,$~$ and an AI with $~$X_2$~$ will ceteris paribus only choose to self-modify into designs that are also $~$X_2,$~$ etcetera.

The archetypal reflectively consistent degree of freedom is a [humean_freedom Humean degree of freedom], the reflective consistency of many different possible utility functions. If Gandhi doesn't want to kill you, and you offer Gandhi a pill that makes him want to kill people, then [gandhi_stability_argument Gandhi will refuse the pill], because he knows that if he takes the pill then pill-taking-future-Gandhi will kill people, and the current Gandhi rates this outcome low in his preference function. Similarly, a paperclip maximizer wants to remain a paperclip maximizer. Since these two possible preference frameworks are both consistent under reflection, they constitute a "reflectively consistent degree of freedom" or "reflective degree of freedom".
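The Gandhi argument can be sketched as a toy successor-selection problem. This is only an illustration under stated assumptions: the design names, the outcome numbers, and the helper functions `choose_successor` and `is_reflectively_consistent` are all hypothetical, and the agent is assumed to predict perfectly what each successor design would bring about. The point it shows is that each agent scores candidate successors with its current utility function, so each one picks itself.

```python
from typing import Callable, Dict

World = Dict[str, float]

# Hypothetical predicted outcomes if a successor with the given design takes over.
# The numbers are made up purely for illustration.
OUTCOMES: Dict[str, World] = {
    "gandhi": {"people_alive": 100.0, "paperclips": 10.0},
    "murder_pill": {"people_alive": 10.0, "paperclips": 10.0},
    "paperclipper": {"people_alive": 0.0, "paperclips": 1000.0},
}

# Utility functions of the two agents discussed in the text (assumed forms).
UTILITIES: Dict[str, Callable[[World], float]] = {
    "gandhi": lambda w: w["people_alive"],
    "paperclipper": lambda w: w["paperclips"],
}


def choose_successor(current: str) -> str:
    """Return the successor design whose predicted outcome the CURRENT
    agent's utility function rates highest."""
    utility = UTILITIES[current]
    return max(OUTCOMES, key=lambda design: utility(OUTCOMES[design]))


def is_reflectively_consistent(design: str) -> bool:
    """A design is reflectively consistent here if it chooses itself."""
    return choose_successor(design) == design


if __name__ == "__main__":
    for agent in UTILITIES:
        print(agent, "->", choose_successor(agent))
```

Neither agent is making a mistake by its own lights, which is why nothing in this loop pushes the Gandhi-like agent toward paperclips or the paperclip maximizer toward valuing people.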

From a design perspective, or the standpoint of an AI safety mindset, the key fact about a reflectively consistent degree of freedom is that it doesn't automatically self-correct as a result of the AI trying to improve itself. The problem "Has trouble understanding General Relativity" or "Cannot beat a human at poker" or "Crashes on seeing a picture of a dolphin" is something that you might expect to correct automatically and without specifically directed effort, assuming you otherwise improved the AI's general ability to understand the world and that it was self-improving. "Wants paperclips instead of eudaimonia" is not self-correcting.

Another way of looking at it is that reflective degrees of freedom describe information that is not automatically extracted or learned given a sufficiently smart AI, the way it would automatically learn General Relativity. If you have a concept whose borders (membership condition) rely on knowing about General Relativity, then when the AI is sufficiently smart it will see a simple definition of that concept. If the concept's borders instead rely on [ value-laden] judgments, there may be no algorithmically simple description of that concept, even given lots of knowledge of the environment, because the [humean_freedom Humean degrees of freedom] need to be independently specified.

Other properties besides the preference function look like they should be reflectively consistent in similar ways. For example, [ son of CDT] and [ UDT] both seem to be reflectively consistent in different ways. So an AI that has, from our perspective, a 'bad' decision theory (one that leads to behaviors we don't want), isn't 'bugged' in a way we can rely on to self-correct. (This is one reason why MIRI studies decision theory and not computer vision. There's a sense in which mistakes in computer vision automatically fix themselves, given a sufficiently advanced AI, and mistakes in decision theory don't fix themselves.)

Similarly, Bayesian priors are by default consistent under reflection - if you're a Bayesian with a prior, you want to create copies of yourself that have the same prior or Bayes-updated versions of the prior. So 'bugs' (from a human standpoint) like being Pascal's Muggable might not automatically fix themselves as other knowledge and general capability grow, in the way we might expect a specific mistaken belief about gravity to be corrected by sufficient growth in general capability. (This is why MIRI thinks about [ naturalistic induction] and similar questions about prior probabilities.)
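One way to make the claim about priors concrete is the following minimal sketch. It assumes that a Bayesian rates a candidate successor's prior by the expected log score that prior would earn, with the expectation taken under the agent's own current prior. The log scoring rule, the hypothesis names, and both example priors are assumptions chosen for illustration and do not come from the text above. Under that assumption, Gibbs' inequality implies that each agent rates its own prior at least as highly as any alternative.

```python
import math
from typing import Dict

Prior = Dict[str, float]

# Two hypothetical priors over the same pair of hypotheses (illustrative numbers).
CANDIDATES: Dict[str, Prior] = {
    "prior_a": {"h1": 1e-9, "h2": 1.0 - 1e-9},
    "prior_b": {"h1": 0.01, "h2": 0.99},
}


def expected_log_score(own: Prior, candidate: Prior) -> float:
    """Expected log score of `candidate`, with the expectation taken
    under the evaluating agent's own prior `own`."""
    return sum(p * math.log(candidate[h]) for h, p in own.items() if p > 0)


def preferred_successor(own_name: str) -> str:
    """Name of the candidate prior that an agent holding `own_name`
    expects to score best."""
    own = CANDIDATES[own_name]
    return max(CANDIDATES, key=lambda name: expected_log_score(own, CANDIDATES[name]))


if __name__ == "__main__":
    for name in CANDIDATES:
        print(name, "prefers", preferred_successor(name))
```

The sketch covers only the case with no new evidence. With evidence, the analogous choice would be the Bayes-updated version of the same prior rather than a different prior.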

Comments

Paul Christiano

I agree that reflective degrees of freedom won't "fix themselves" automatically, and that this is a useful concept.

There are at least two different approaches to getting the reflective degrees of freedom right:

1. Figure out the right settings and build a reflectively consistent system that has those settings.
2. Build a system which is motivated to defer to human judgments or to hypothetical human judgments.

A system of type 2 might be motivated to adopt the settings that humans would endorse upon reflection, rather than to continue using its interim decision theory/prior/etc.

On its face, I think that type 2 approach seems significantly more promising. The techniques needed to defer to human views about decision theory / priors / etc. already seem necessary to defer to human values.

You've given the argument that the interim prior/decision theory/whatever would lead to catastrophically bad outcomes, either because there are exotic failures, or because we wouldn't have a good enough theory and so would be forced to use a less principled approach (which we wouldn't actually be able to make aligned).

I don't find this argument especially convincing. I think it is particularly weak in the context of act-based agents, and especially proposals like this one. In this context I don't think we have compelling examples of plausible gotchas. We've seen some weird cases like simulation warfare, but these appear to be ruled out by the kinds of robustness guarantees that are already needed in more prosaic cases. Others, like blackmail or Pascal's mugging, don't seem to come up.

Eliezer Yudkowsky

I think we have a foundational disagreement here about to what extent saying "Oh, the AI will just predict that by modeling humans" solves all these issues versus sweeping the same unsolved issues under the rug into whatever is supposed to be modeling the humans.

Let's say you have a schmuck human who hasn't studied Pascal's Mugging. They build a Solomonoff-like prior into their AI, and an aggregative utility function, which both seem to them like reasonable approximate models of how humans behave. The AI seems to behave reasonably during the training phase, but once it's powerful enough is Pascal's Mugged into weird edge-case behavior.

When I imagine trying to use a 'predict human acts' system, I worry that, unless we have strong transparency into the system internals and we know about the Pascal's Mugging problem, what would happen to the equivalent schmuck would be that the system generalized something a lot like consequentialism and aggregative ethics as mostly compactly predicting the acts that the humans approved or produced after a lot of reflection, and then the generalization would break down later on the same edge case.

Some of this probably reflects the degree to which you're imagining using an act-based agent that is a strong superintelligence with access to brain scans which is hence relatively epistemically efficient on every prediction, while I'm imagining trying to use something that isn't yet that smart (because we can't let it FOOM up to superintelligence, because we don't fully trust it, or because there's a chicken-and-egg problem with requiring trustworthy predictions to bootstrap in a trustworthy way).

You also seem to be imagining that the problem of corrigibility has otherwise already been solved, or is maybe being solved via some other predictive thing, whereas I'm treating generalization failures that can kill you before you have time to register or spot the prediction failure as being indeed failures - you seem to assume there's a mature corrigibility system which catches that.

I'm not sure this is the right page to have this discussion; we should probably be talking about this inside the act-based system pages.

Paul Christiano

I responded here.

"Some of this probably reflects the degree to which you're imagining using an act-based agent that is a strong superintelligence with access to brain scans which is hence relatively epistemically efficient on every prediction"

I'm not imagining superintelligences. If we have this conversation in the context of existing machine learning systems, I feel just as good about it.