Instrumental convergence says that various properties $~$P$~$ of an agent, often scary or detrimental-by-default properties like "trying to gain control of lots of resources" or "deceiving humans into thinking you are nice", will fall out of pursuing most utility functions $~$U.$~$ You might be tempted to hope that nice or reassuring properties $~$P$~$ would also fall out of most utility functions $~$U$~$ in the same natural way. In fact, your brain might be tempted to treat Clippy the Paperclip Maximizer as a political agent you were trying to cleverly persuade, and come up with clever arguments for why Clippy should do things your way in order to get more paperclips, much as you might try to persuade your boss that you ought to get a raise for the good of the company.
The problem here is that:
- Generally, when you think of a nice policy $~$\pi_1$~$ that produces some paperclips, there will be a non-nice policy $~$\pi_2$~$ that produces even more paperclips.
- Clippy is not trying to generate arguments for why it should do human-nice things in order to make paperclips; it is just neutrally pursuing paperclips. So Clippy is going to keep looking until it finds $~$\pi_2.$~$
For example:
- Your brain instinctively tries to persuade this imaginary Clippy to keep humans around by arguing, "If you keep us around as economic partners and trade with us, we can produce paperclips for you under Ricardo's Law of Comparative Advantage!" This is then the policy $~$\pi_1$~$ which would indeed produce some paperclips, but what would produce even more paperclips is the policy $~$\pi_2$~$ of disassembling the humans into spare atoms and replacing them with optimized paperclip-producers.
- Your brain tries to persuade an imaginary Clippy by arguing for policy $~$\pi_1,$~$ "Humans have a vast amount of varied life experience; you should keep us around and let us accumulate more experience, in case our life experience lets us make good suggestions!" This would produce some expected paperclips, but what would produce more paperclips is policy $~$\pi_2$~$ of "Disassemble all human brains and store the information in an archive, then simulate a much larger variety of agents in a much larger variety of circumstances so as to maximize the paperclip-relevant observations that could be made."
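The shape of both examples can be put as a toy argmax over policies. This is only an illustrative sketch: the `Policy` class, the `choose_policy` function, and the paperclip counts are made-up assumptions chosen to mirror the $~$\pi_1$~$ versus $~$\pi_2$~$ comparison, not a model of any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    expected_paperclips: float  # hypothetical, purely illustrative number
    nice_to_humans: bool        # visible to us; never consulted by the optimizer


def choose_policy(policies):
    """Return the policy with the most expected paperclips.

    Note that `nice_to_humans` does not appear anywhere in the criterion.
    """
    return max(policies, key=lambda p: p.expected_paperclips)


policies = [
    Policy("pi_1: keep humans as trading partners", 1e6, True),
    Policy("pi_2: replace humans with optimized paperclip-producers", 1e9, False),
]

best = choose_policy(policies)
print(best.name)
```

In this sketch the nice policy is selected only if no higher-scoring alternative is in the search space, and a better argument for $~$\pi_1$~$ does nothing to remove $~$\pi_2$~$ from that space.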
An unfortunate further aspect of this situation is that, in cases like this, your brain may be tempted to go on arguing for why really $~$\pi_2$~$ isn't all that great and $~$\pi_1$~$ is actually better, just as if your boss said "But maybe this company will be even better off if I spend that money on computer equipment" and your brain at once started to convince itself that computer equipment wasn't all that great and higher salaries were much more important for corporate productivity. (As Robert Trivers observed, deception of others often begins with deception of self, and this fact is central to understanding why humans evolved to think about politics the way we did.)
But since you don't get to see Clippy discarding your clever arguments and just turning everything in reach into paperclips - at least, not yet - your brain might hold onto its clever and possibly self-deceptive argument for why the thing you want is really the thing that produces the most paperclips.
Possibly helpful mental postures:
- Contemplate the maximum number of paperclips you think an agent could get by making paperclips the straightforward way - just converting all the galaxies within reach into paperclips. Okay, now does your nice policy $~$\pi_1$~$ generate more paperclips than that? How is that even possible?
- Never mind there being a "mind" present that you can "persuade". Suppose instead there's just a time machine that spits out some physical outputs, electromagnetic pulses or whatever, and the time machine outputs whatever electromagnetic pulses lead to the most future paperclips. What does the time machine do? Which outputs lead to the most paperclips as a strictly material fact?
- Study evolutionary biology. During the pre-1960s days of evolutionary biology, biologists would often try to argue for why natural selection would result in humanly-nice results, like animals controlling their own reproduction so as not to overburden the environment. There's a similar mental discipline required to not come up with clever arguments for why natural selection would do humanly nice things.