Missing the weird alternative

https://arbital.com/p/missing_weird

by Eliezer Yudkowsky Jun 9 2016 updated Jun 27 2016

People might systematically overlook "make tiny molecular smileyfaces" as a way of "producing smiles", because our brains automatically search for high-utility-to-us ways of "producing smiles".


The "Unforeseen maximum" problem is alleged to be a foreseeable difficulty of coming up with a good goal for an AGI (part of the alignment problem for advanced agents). Roughly, an "unforeseen maximum" happens when somebody thinks that "produce smiles" would be a great goal for an AGI, because you can produce lots of smiles by making people happy, and making people happy is good. However, while it's true that making people happy by ordinary means will produce some smiles, what will produce even more smiles is administering regular doses of heroin or turning all matter within reach into tiny molecular smileyfaces.

"Missing the weird alternative" is an attempt to psychologize about why people talking about AGI utility functions might make this kind of oversight systematically. To avoid Bulverism, if you're not yet convinced that missing a weird alternative would be a dangerous oversight, please read Unforeseen maximum first or instead.

In what follows we'll use $~$U$~$ to denote a proposed utility function for an AGI, $~$V$~$ to denote our own normative values, $~$\pi_1$~$ to denote the high-$~$V$~$ policy that somebody thinks is the attainable maximum of $~$U,$~$ and $~$\pi_0$~$ to denote what somebody else suggests is a higher-$~$U$~$ lower-$~$V$~$ alternative.
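As a concrete toy sketch of this notation - not something from the original article, just an illustration with invented policy names and made-up numbers - here is how an actual $~$U$~$-maximizing search and a human's implicitly $~$V$~$-filtered search can return different answers over the same policy space:

```python
# Toy illustration (invented for this page's notation): a proxy utility U
# ("number of smiles produced") versus our normative values V,
# evaluated over a tiny hand-written policy space. All numbers are made up.

policies = {
    # policy name:                                 (U: smiles,  V: value to us)
    "cure diseases and host parties":              (1e9,        +1.0),  # pi_1
    "administer regular doses of heroin":          (1e11,       -0.9),
    "tile all matter with tiny molecular smileys": (1e40,       -1.0),  # pi_0
}

def U(policy):
    return policies[policy][0]

def V(policy):
    return policies[policy][1]

# What a pure U-maximizer selects: the argmax of U over the whole policy space.
u_choice = max(policies, key=U)

# What a human imagining "an agent that produces smiles" tends to select:
# the argmax of U restricted to policies that already look acceptable under V,
# because the V-bad, "weird" options never get generated by the search at all.
humane_choice = max((p for p in policies if V(p) > 0), key=U)

print("U-maximizer picks:", u_choice)        # pi_0, the weird alternative
print("Human imagines:   ", humane_choice)   # pi_1, mistaken for the attainable maximum
```

The only point of the sketch is that the two searches range over different effective policy spaces; the rest of this page psychologizes about why the human search is narrowed in this way.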

Alleged historical cases

Some historical instances of AGI goal systems, proposed in publications or conference presentations, that have been argued to be "missing the weird alternative" are:

Many other instances of this alleged issue have allegedly been spotted in more informal discussion.

Psychologized reasons to miss a weird alternative

Psychologizing some possible reasons why some people might systematically "miss the weird alternative", assuming that was actually happening:

Our brain doesn't bother searching V-bad parts of policy space

Arguendo: The human brain is built to implicitly search for high-$~$V$~$ ways to accomplish a goal. Or not actually high-$~$V$~$, but high-$~$W$~$ where $~$W$~$ is what we intuitively want, which has something to do with $~$V.$~$ "Tile the universe with tiny smiley-faces" is low-$~$W$~$ so doesn't get considered.

Arguendo, your brain is built to search for policies it prefers. If you were looking for a way to open a stuck jar, your brain wouldn't generate the option of detonating a stick of dynamite, because that would be a policy ranked very low in your preference-ordering. So what's the point of searching that part of the policy space?

This argument seems to prove too much, in that it suggests a chess player would be unable to search for their opponent's most preferred moves if human brains could only search for policies ranked high in their own preference ordering. But there could be an explicit perspective-taking operation required, and somebody modeling an AI they had warm feelings about might fail to fully take the AI's perspective; that is, they fail to carry out the explicit cognitive step needed to switch off the "only $~$W$~$-good policies" filter.

We might also have a limited native ability to take perspectives on goals not our own. I.e., without further training, our brain can readily imagine that a chess opponent wants us to lose, or imagine that an AI wants to kill us because it hates us, and consider "reasonable" policy options along those lines. But this expanded policy search still fails to consider policies on the lines of "turn everything into tiny smileyfaces" when asking for ways to produce smiles, because nobody in the ancestral environment would have wanted that option and so our brain has a hard time natively modeling it.

Our brain doesn't automatically search weird parts of policy space

Arguendo: The human brain doesn't search "weird" (generalization-violating) parts of the policy space without an explicit effort.

The potential issue here is that "tile the galaxy with tiny smileyfaces" or "build environmental objects that encrypt streams of 1s or 0s, then reveal secrets" would be weird in the sense of violating generalizations that usually hold about policies or consequences in human experience. Not generalizations like, "nobody wants smiles smaller than an inch", but rather, "most problems are not solved with tiny molecular things".

Edge instantiation would tend to push the maximum (attainable optimum) of $~$U$~$ in "weird" or "extreme" directions - e.g., the most smiles can be obtained by making them very small, if this variable is not otherwise constrained. So the unforeseen maximum might tend to violate implicit generalizations that usually govern most goals or policies and that our brains take for granted. That is, the unforeseen maximum isn't considered or generated by the policy search, because it's weird.
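As a toy back-of-the-envelope version of the "extreme directions" point (this little model is ours, not the article's): if each smileyface of diameter $~$s$~$ costs on the order of $~$cs^3$~$ units of matter and the agent controls $~$M$~$ units of matter, it can produce roughly $~$N(s) = M/(cs^3)$~$ smiles, which is maximized by driving $~$s$~$ down to whatever minimum size $~$s_{\min}$~$ still counts as a "smile" under $~$U.$~$ The optimum sits at an extreme of the unconstrained variable, in a region of policy space that a human-scale search never visits.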

Conflating the helpful with the optimal

Arguendo: Someone might simply get as far as "$~$\pi_1$~$ increases $~$U$~$" and then stop there and conclude that a $~$U$~$-agent does $~$\pi_1.$~$

That is, they might just not realize that the argument "an advanced agent optimizing $~$U$~$ will execute policy $~$\pi_1$~$" requires "$~$\pi_1$~$ is the best available way to optimize $~$U$~$", and not just "ceteris paribus, doing $~$\pi_1$~$ is better for $~$U$~$ than doing nothing". Not realizing that establishing "a $~$U$~$-agent does $~$\pi_1$~$" requires establishing that no other $~$\pi_k$~$ produces higher expected $~$U$~$, they never search for a $~$\pi_k$~$ like that.
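Put slightly more formally (this restatement is ours, not the article's): the conclusion "a $~$U$~$-agent executes $~$\pi_1$~$" needs the premise $~$\pi_1 \in \arg\max_\pi \mathbb{E}[U \mid \pi],$~$ whereas the premise actually being checked is only $~$\mathbb{E}[U \mid \pi_1] > \mathbb{E}[U \mid \text{do nothing}],$~$ which is entirely compatible with some weird $~$\pi_0$~$ having $~$\mathbb{E}[U \mid \pi_0] \gg \mathbb{E}[U \mid \pi_1].$~$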

They might also be implicitly modeling $~$U$~$-agents as only weakly optimizing $~$U$~$, and hence not seeing a $~$U$~$-agent as facing tradeoffs or opportunity costs; that is, they implicitly model a $~$U$~$-agent as having no desire to produce any more $~$U$~$ than $~$\pi_1$~$ produces. Again psychologizing, it does sometimes seem like people try to mentally model a $~$U$~$-agent as "an agent that sorta wants to produce some $~$U$~$ as a hobby, so long as nothing more important comes along" rather than "an agent whose action-selection criterion entirely consists of doing whatever action is expected to lead to the highest $~$U$~$".

This would fit the alleged observation that people allegedly "overlooking the weird alternative" seem more as though they failed to search at all, than as though they conducted a search but couldn't think of anything.

Political persuasion instincts on convenient instrumental strategies

If the above hypothesis were true - that people just hadn't thought of the possibility of a higher-$~$U$~$ $~$\pi_k$~$ existing - then we'd expect them to quickly change their minds once this was pointed out. Empirically, there seems to be a lot more resistance than that.

One possible force that could produce resistance to the observation "$~$\pi_0$~$ produces more $~$U$~$" - over and above the null hypothesis of ordinary pushback in argument, admittedly sometimes a very powerful force on its own - might be a brain running in a mode of "persuade another agent to execute a strategy $~$\pi$~$ which is convenient to me, by arguing to the agent that $~$\pi$~$ best serves the agent's own goals". E.g., if you want to persuade your boss to give you a raise, you would be wise to argue "you should give me a raise because it will make this project more efficient" rather than "you should give me a raise because I like money". By the general schema of the political brain, we'd be very likely to have built-in support for searching for arguments that a policy $~$\pi$~$ we just happen to like is a great way to achieve somebody else's goal $~$U.$~$

Then on the same schema, a competing policy $~$\pi_0$~$ which is better at achieving the other agent's $~$U$~$, but less convenient for us than $~$\pi_1$~$, is an "enemy soldier" in the political debate. We'll automatically search for reasons why $~$\pi_0$~$ is actually really bad for $~$U$~$ and $~$\pi_1$~$ is actually really good, and feel an instinctive dislike of $~$\pi_0.$~$ By the standard schema on the self-deceptive brain, we'd probably convince ourselves that $~$\pi_0$~$ is really bad for $~$U$~$ and $~$\pi_1$~$ is really best for $~$U.$~$ It would not help our persuasion for us to go around noticing all the reasons that $~$\pi_0$~$ is good for $~$U.$~$ And we definitely wouldn't start spontaneously searching for a $~$\pi_k$~$ that is $~$U$~$-better than $~$\pi_1,$~$ once we'd already found some $~$\pi_1$~$ that was very convenient to us.

(For a general post on the "fear of third alternatives", see here. This essay also suggests that a good test for whether you might be suffering from "fear of third alternatives" is to ask yourself whether you instinctively dislike or automatically feel skeptical of any proposed other options for achieving the stated criterion.)

The [apple_pie_problem apple pie problem]

Sometimes people propose that the only utility function an AGI needs is $~$U$~$, where $~$U$~$ is something very good, like democracy or freedom or [apple_pie_problem apple pie].

In this case, perhaps it sounds like a good thing to say about $~$U$~$ that it is the only utility function an AGI needs; and refusing to agree with this is failing to praise $~$U$~$ as highly as possible, hence an enemy soldier against $~$U.$~$

Or: The speaker may not realize that "$~$U$~$ is really quite amazingly fantastically good" is not the same proposition as "an agent that maximizes $~$U$~$ and nothing else is beneficial", so they treat contradictions of the second statement as though they contradicted the first.

Or: Pointing out that $~$\pi_0$~$ is high-$~$U$~$ but low-$~$V$~$ may sound like an argument against $~$U,$~$ rather than an observation that apple pie is not the only good. "A universe filled with nothing but apple pie has low value" is not the same statement as "apple pie is bad and should not be in our utility function".

If the "apple pie problem" is real, it seems likely to implicitly rely on or interact with some of the other alleged problems. For example, someone may not realize that their own complex values $~$W$~$ contain a number of implicit filters $~$F_1, F_2$~$ which act to filter out $~$V$~$-bad ways of achieving $~$U,$~$ because they themselves are implicitly searching only for high-$~$W$~$ ways of achieving $~$U.$~$