Mindcrime

https://arbital.com/p/mindcrime

by Eliezer Yudkowsky Jun 9 2015 updated Dec 29 2016

Might a machine intelligence contain vast numbers of unhappy conscious subprocesses?


[summary(Gloss): A huge amount of harm could occur if a machine intelligence turns out to contain lots of conscious subprograms enduring poor living conditions. One worry is that this might happen if an AI models humans in too much detail.]

[summary: 'Mindcrime' is Nick Bostrom's suggested term for the moral catastrophe that occurs if a machine intelligence contains enormous numbers of conscious beings trapped inside its code.

This could happen as a result of self-awareness being a natural property of computationally efficient subprocesses. Perhaps more worryingly, the best model of a person may itself be a person, even if it is not the same person. This means that AIs trying to model humans might be unusually likely to create hypotheses and simulations that are themselves conscious.]

[summary(Technical): 'Mindcrime' is Nick Bostrom's term for mind designs producing moral harm by their internal operation, particularly through embedding sentient subprocesses.

One worry is that mindcrime might arise in the course of an agent trying to predict or manipulate the humans in its environment, since this implies a pressure to model the humans in faithful detail. This is especially concerning since several value alignment proposals would explicitly call for modeling humans in detail, e.g. extrapolated volition and imitation-based agents.

Another problem scenario is if the natural design for an efficient subprocess involves independent consciousness (though it is a separate question whether this optimal design involves pain or suffering).

Computationally powerful agents might contain vast numbers of trapped conscious subprocesses, qualifying this as a [ global catastrophic risk].]

"Mindcrime" is Nick Bostrom's suggested term for scenarios in which an AI's cognitive processes are intrinsically doing moral harm, for example because the AI contains trillions of suffering conscious beings inside it.

Ways in which this might happen:

Problem of sapient models (of humans):

An instrumental pressure to produce high-fidelity predictions of human beings (or to predict [ decision counterfactuals] about them, or to [ search] for events that lead to particular consequences, etcetera) may lead the AI to run computations that are unusually likely to possess personhood.

An unrealistic example of this would be Solomonoff induction, where predictions are made by means that include running many possible simulations of the environment and seeing which ones best correspond to reality. Among current machine learning algorithms, particle filters and Monte Carlo algorithms similarly involve running many possible simulated versions of a system.
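As a concrete illustration of that last point, here is a minimal, hypothetical sketch (in Python; the function names `transition`, `observe_likelihood`, and `init` are invented for this example, not taken from any particular library) of a bootstrap particle filter. The relevant feature is structural: the algorithm maintains many simulated copies of the system being modeled, and preferentially duplicates the copies that best explain the observations. If the modeled system were a person and each particle were detailed enough, each particle would itself be a detailed model of that person.

```python
import math
import random

def particle_filter(num_particles, transition, observe_likelihood, observations, init):
    """Bootstrap particle filter: maintain many simulated copies ('particles')
    of a hidden system, and reweight/resample them by how well each simulated
    trajectory explains the observed data."""
    particles = [init() for _ in range(num_particles)]
    for obs in observations:
        # Advance every simulated copy of the system by one step.
        particles = [transition(p) for p in particles]
        # Weight each copy by how well it predicts the current observation.
        weights = [observe_likelihood(obs, p) for p in particles]
        total = sum(weights) or 1.0
        probs = [w / total for w in weights]
        # Resample: copies that explain the data well get duplicated.
        particles = random.choices(particles, weights=probs, k=num_particles)
    return particles

# Toy usage: tracking a drifting scalar from noisy observations.
posterior_particles = particle_filter(
    num_particles=1000,
    transition=lambda x: x + random.gauss(0.0, 0.1),
    observe_likelihood=lambda o, x: math.exp(-((o - x) ** 2) / 0.02),
    observations=[0.1, 0.3, 0.2, 0.5],
    init=lambda: random.gauss(0.0, 1.0),
)
print(sum(posterior_particles) / len(posterior_particles))  # rough posterior mean
```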

It's possible that an AI sufficiently advanced to have successfully arrived at detailed models of human intelligence would usually also be advanced enough that it never tried to use a predictive/searchable model that engaged in brute-force simulations of those models. (Consider, e.g., that there will usually be many possible settings of a variable inside a model, and an efficient model might manipulate data representing a probability distribution over those settings, rather than ever considering one exact, specific human in toto; a toy version of this distinction is sketched below.)
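To make the parenthetical concrete, here is a toy, hypothetical contrast (all names and numbers are invented for this sketch): instead of instantiating and running a fully specified copy of the modeled mind for each hypothesis, an abstracted model can carry a probability distribution over the settings of a latent variable and update that distribution directly.

```python
# Toy sketch: uncertainty about one latent variable of a model, represented as
# probability mass over possible settings rather than as separately simulated copies.

settings = ["calm", "annoyed", "angry"]              # hypothetical settings of one variable
prior = {s: 1.0 / len(settings) for s in settings}   # a distribution, not running instances

def bayes_update(dist, likelihood):
    """Return the posterior over settings, given per-setting likelihoods of an observation."""
    unnormalized = {s: dist[s] * likelihood[s] for s in dist}
    z = sum(unnormalized.values())
    return {s: p / z for s, p in unnormalized.items()}

# Hypothetical likelihoods of observing, say, a terse reply under each setting.
posterior = bayes_update(prior, {"calm": 0.2, "annoyed": 0.5, "angry": 0.3})
print(posterior)
```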

This, however, doesn't make it certain that no mindcrime will occur. It may not take an exact, faithful simulation of a specific human to create a conscious model. An efficient model of a (spread of possibilities for a) human may still contain computations that resemble a person closely enough to create consciousness, or whatever other properties may be deserving of personhood. Consider, in particular, an agent trying to use

Just as it almost certainly isn't necessary to go all the way down to the neural level to create a sapient being, it may be that even with some parts of a mind considered abstractly, the remainder would be computed in enough detail to imply consciousness, sapience, personhood, etcetera.

The problem of sapient models is not to be confused with [ Simulation Hypothesis] issues. An efficient model of a human need not have subjective experience indistinguishable from that of the human (although it will be a model of a person who doesn't believe themselves to be a model). The problem occurs if the model is a person, not if the model is the same person as its subject, and the latter possibility plays no role in the implication of moral harm.

Besides problems that are directly or obviously about modeling people, many other practical problems and questions can benefit from modeling other minds - e.g., reading the directions on a toaster oven in order to discern the intent of the mind that was trying to communicate how to use a toaster. Thus, mindcrime might result from a sufficiently powerful AI trying to solve very mundane problems.

Problem of sapient models (of civilizations):

A separate route to mindcrime comes from an advanced agent considering, in sufficient detail, the possible origins and futures of intelligent life on other worlds. (Imagine that you were suddenly told that this version of you was actually embedded in a superintelligence that was imagining how life might evolve on a place like Earth, and that your subprocess was not producing sufficiently valuable information and was about to be shut down. You would probably be annoyed! We should try not to annoy other people in this way.)

Three possible origins of a convergent instrumental pressure to consider intelligent civilizations in great detail:

With respect to the latter two possibilities, note that the AI does not need to be considering possibilities in which the whole Earth as we know it is a simulation. The AI only needs to consider that, among the possible explanations of the AI's current sense data and internal data, there are scenarios in which the AI is embedded in some world other than the most 'obvious' one implied by the sense data. See also Distant superintelligences can coerce the most probable environment of your AI for a related hazard of the AI considering possibilities in which it is being simulated.

(Eliezer Yudkowsky has advocated that we shouldn't let any AI short of extreme levels of safety and robustness assurance consider distant civilizations in lots of detail in any case, since this means our AI might embed (a model of) a hostile superintelligence.)

Problem of sapient subsystems:

It's possible that the most efficient system for, say, allocating memory on a local cluster, constitutes a complete reflective agent with a self-model. Or that some of the most efficient designs for subprocesses of an AI, in general, happen to have whatever properties lead up to consciousness or whatever other properties are important to personhood.

This might possibly constitute a relatively less severe moral catastrophe, if the subsystems are sentient but [ lack a reinforcement-based pleasure/pain architecture] (since the latter is not obviously a property of the most efficient subagents). In this case, there might be large numbers of conscious beings embedded inside the AI and occasionally dying as they are replaced, but they would not be suffering. It is nonetheless the sort of scenario that many of us would prefer to avoid.

Problem of sapient self-models:

The AI's models of itself, or of other AIs it could possibly build, might happen to be conscious or have other properties deserving of personhood. This is worth considering as a separate possibility from building a conscious or personhood-deserving AI ourselves, when [ we didn't mean to do so], because of these two additional properties:

Difficulties

Trying to consider these issues is complicated by:

It'd help if we knew the answers to these questions, but the fact that we don't know doesn't mean we can thereby conclude that any particular model is not a person. (This would be some mix of [ argumentum ad ignorantiam] and [ availability bias] making us think that a scenario is unlikely when it is hard to visualize.) In the limit of infinite computing power, the epistemically best models of humans would almost certainly involve simulating many possible versions of them; superintelligences would have [ very large amounts of computing power] and we don't know at what point we come close enough to this [ limiting property] to cross the threshold.

Scope of potential disaster

The prospect of mindcrime is an especially alarming possibility because sufficiently advanced agents, especially if they are using computationally efficient models, might consider very large numbers of hypothetical possibilities that would count as people. There's no limit that says that if there are seven billion people, an agent will run at most seven billion models; the agent might be considering many possibilities per individual human. This would not be an [ astronomical disaster] since it would not (by hypothesis) wipe out our posterity and our intergalactic future, but it could be a disaster orders of magnitude larger than the Holocaust, the Mongol Conquest, the Middle Ages, or all human tragedy to date.

Development-order issue

If we ask an AI to predict what we would say if we had a thousand years to think about the problem of defining personhood or think about which causal processes are 'conscious', this seems unusually likely to cause the AI to commit mindcrime in the course of answering the question. Even asking the AI to think abstractly about the problem of consciousness, or predict by abstract reasoning what humans might say about it, seems unusually likely to result in mindcrime. There thus exists a [ development order issue] preventing us from asking a Friendly AI to solve the problem for us, since to file this request safely and without committing mindcrime, we would need the request to already have been completed.

The prospect of enormous-scale disaster militates against 'temporarily' tolerating mindcrime inside a system, while, e.g., an [ extrapolated-volition] or [ approval-based] agent tries to compute the code or design of a non-mindcriminal agent. Depending on the agent's efficiency, and secondarily on its computational limits, a tremendous amount of moral harm might be done during the 'temporary' process of computing an answer.

Weirdness

Literally nobody outside of MIRI or FHI ever talks about this problem.

Nonperson predicates

A nonperson predicate is an [ effective] test that we, or an AI, can use to determine that some computer program is definitely not a person. In principle, a nonperson predicate needs only two possible outputs, "Don't know" and "Definitely not a person". It's acceptable for many actually-nonperson programs to be labeled "don't know", so long as no people are labeled "definitely not a person".

If the above were the only requirement, one simple nonperson predicate would be to label everything "don't know". The implicit difficulty is that the nonperson predicate must also pass some programs of high complexity that do things like "acceptably model humans" or "acceptably model future versions of the AI".
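As a type-level sketch only (not a proposed solution; the `Verdict` enum, the function names, and the whitelist tags are all invented for this illustration), the interface, the degenerate predicate just described, and a slightly less degenerate whitelist variant might look like:

```python
from enum import Enum

class Verdict(Enum):
    DONT_KNOW = "don't know"
    DEFINITELY_NOT_A_PERSON = "definitely not a person"

def trivial_nonperson_predicate(program_source: str) -> Verdict:
    """The degenerate predicate described above: sound (it never clears a person)
    but useless, because it never clears anything else either."""
    return Verdict.DONT_KNOW

def whitelist_nonperson_predicate(program_form: str) -> Verdict:
    """A slightly less degenerate sketch: clear only programs matching a
    (hypothetical) whitelist of forms believed far too simple to be persons.
    Everything else stays 'don't know'. The hard open problem is extending the
    whitelist to cover the complex models an advanced AI actually needs,
    without ever mislabeling a person."""
    KNOWN_SAFE_FORMS = {"bounded_arithmetic", "fixed_lookup_table"}  # hypothetical tags
    if program_form in KNOWN_SAFE_FORMS:
        return Verdict.DEFINITELY_NOT_A_PERSON
    return Verdict.DONT_KNOW
```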

Besides addressing mindcrime scenarios, Yudkowsky's original proposal was also aimed at knowing that the AI design itself was not conscious, or not a person.

It seems likely to be very hard to find a good nonperson predicate:

Research avenues


Comments

David Krueger

"Weirdness: Literally nobody outside of MIRI or FHI ever talks about this problem"

…but it does seem to be a popular topic of contemporary sci-fi (Westworld, Black Mirror, etc.)

Alexei Andreev

I have an intuition that says that if you run any sufficiently large computation (even if it's as simple as multiplication, e.g. (3^^^3)*(3^^^3)), you'll likely accidentally create sentient life within it. Checking for that seems prohibitively expensive, or maybe even impossible, since checking itself might run into the same problem.

Paul Christiano

Eliezer, I find your position confusing.

Consider the first AI system that can reasonably predict your answers to questions of the form "Might X constitute mindcrime?" where X is a natural language description of some computational process. (Well enough that, say, most of a useful computation can be flagged as "definitely not mindcrime," and all mindcrime can be flagged as "maybe mindcrime.")

Do you believe that this system will have significant moral disvalue? If that system doesn't have moral disvalue, where is the chicken and egg problem?

So it seems like you must believe that this system will have significant moral disvalue. That sounds implausible on its face to me. What are you imagining this system will look like? Do you think that this kind of question is radically harder than other superficially comparable question-answering tasks? Do you think that any AI researchers will find your position plausible? If not, what do you think they are getting wrong?

ETA: maybe the most useful thing to clarify would be the kind of computation, and how it relates to the rest of what the AI is doing, that you would find really hard to classify, but which might plausibly be unavoidable for effective computation.

This whole disagreement may be related to broader disagreements about how aligned AI systems will look. But you seem to think that mindcrime is also a problem for act-based agents, so that can't explain all of it. We might want to restrict attention to the act-based case in order to isolate disagreement specific to mindcrime, and it's possible that discussion should wait until we get on the same page about act-based agents.

Eliezer Yudkowsky

Consider the first AI system that can reasonably predict your answers to questions of the form "Might X constitute mindcrime?"…

Do you think that this kind of question is radically harder than other superficially comparable question-answering tasks?

Yes! It sounds close to FAI-complete in the capacities required. It sounds like trying to brute-force an answer to it via generalized supervised learning might easily involve simulating trillions of Eliezer-models. In general you and I seem to have very different intuitions about how hard it is to get a good answer to "deep, philosophical questions" via generalized supervised learning.

Paul Christiano

The obvious patch is for a sufficiently sophisticated system to have preferences over its own behavior, which motivate it to avoid reasoning in ways that we would dislike.

For example, suppose that my utility function U is "how good [ idealized Eliezer] thinks things are, after thinking for a thousand years." It doesn't take long to realize that [ idealized Eliezer] would be unhappy with a literal simulation of [ idealized Eliezer]. Moreover, a primitive understanding of Eliezer's views suffices to avoid the worst offenses (or at least to realize that they are the kinds of things which Eliezer would prefer that a human be asked about first).

An AI that is able to crush humans in the real world without being able to do this kind of reasoning seems catastrophic on other grounds. An AI that is able but unmotivated to carry out or act on this kind of reasoning seems even more catastrophic for other reasons. (For example, I don't see any realistic approach to corrigibility that wouldn't solve this problem as well, and conversely I see many ways to resolve both.)

Edit: Intended as a response to the original post, but no way to delete and repost as far as I can tell.

Eliezer Yudkowsky

The obvious patch is for a sufficiently sophisticated system to have preferences over its own behavior, which motivate it to avoid reasoning in ways that we would dislike.

My worry here would be that we'll run into a Nearest Unblocked Neighbor problem on our attempts to define sapience as a property of computer simulations.

For example, suppose that my utility function U is "how good [ idealized Eliezer] thinks things are, after thinking for a thousand years." It doesn't take long to realize that [ idealized Eliezer] would be unhappy with a literal simulation of [ idealized Eliezer].

Let's say that sapience1 is a definition that covers most of the 'actual definition of sapience' (e.g. what we'd come up with given unlimited time to think, etc.), which I'll call sapience0, relative to some measure on probable computer programs. But there are still exceptions; there are sapient0 things not detected by sapience1. The best hypothesis for predicting an actually sapient mind that is not in sapience1 seems unusually likely to be one of the special cases that is still in sapience0. It might even just be an obfuscated ordinary sapient program, rather than one with an exotic kind of sapience, if sapience1 doesn't incorporate some advanced-safe way of preventing obfuscation.

We can't throw a superhumanly sophisticated definition at the problem (e.g. the true sapience0 plus an advanced-safe block against obfuscation) without already asking the AI to simulate us or to predict the results of simulating us in order to obtain this hypothetical sapience2.

Moreover, a primitive understanding of Eliezer's views suffices to avoid the worst offenses (or at least to realize that they are the kinds of things which Eliezer would prefer that a human be asked about first).

This just isn't obvious to me. It seems likely to me that an extremely advanced understanding of Eliezer's idealized views is required to answer questions about what Eliezer would say about consciousness, with extreme accuracy, without

Paul Christiano

My views about Eliezer's preferences may depend on the reason that I am running X, rather than merely the content of X. E.g. if I am running X because I want to predict what a person will do, that's a tipoff. For this sort of thing to work, there must be a match between the capabilities being used to guide my thinking and the capabilities being used to assess that thinking to see whether it constitutes mind crime.

But so does the whole project. You've said this well: "you just build the conscience, and that is the AI." The AI doesn't propose a way of figuring out X and then reject or not reject it because it constitutes mind crime, any more than it proposes an action to satisfy its values and then rejects or fails to reject it because the user would consider it immoral. The AI thinks the thought that it ought to think, as best it can figure out, just like it does the thing that it ought to do, as best it can figure out.

Note that you are allowed to just ask about or avoid marginal cases, as long as the total cost of asking or inconvenience of avoiding is not large compared to the other costs of the project. And whatever insight you would have put into your philosophical definition of sapience, you can try to communicate it as well as possible as a guide to predicting "what Eliezer would say about X," which can circumvent the labor of actually asking.

Alexei Andreev

Ok, Eliezer, you've addressed my point directly with sapience0 / sapience1 example. That makes sense. I guess one pitfall for AI might be to keep improving its sapience model without end, because "Oh, gosh, I really don't want to create life by accident!" I guess this just falls into the general category of problems where "AI does thing X for a long time before getting around to satisfying human values", where thing X is actually plausibly necessary. Not sure if you have a name for a pitfall like that. I can try my hand at creating a page for it, if you don't have it already.

Eliezer Yudkowsky

Paul, I don't disagree that we want the AI to think whatever thought it ought to think. I'm proposing a chicken-and-egg problem where the AI can't figure out which thoughts constitute mindcrime, without already committing mindcrime. I think you could record a lot of English pontification from me and still have a non-person-simulating AI feeling pretty confused about what the heck I meant or how to apply it to computer programs. Can you give a less abstract view of how you think this problem should be solved? What human-understanding and mindcrime-detection abilities do you think the AI can develop, in what order, without committing lots of mindcrime along the way? Sure, given infinite human understanding, the AI can detect mindcrime very efficiently, but the essence of the problem is that it seems hard to get infinite human understanding without lots of mindcrime being committed along the way. So what is it you think can be done instead, that postulates only a level of human understanding that you think can be done knowably without simulating people?

Ryan Carey

Literally nobody outside of MIRI or FHI ever talks about this problem.

This is discussed under some name or other, by at least the utilitarians and by Paul Christiano.

Bogdan Butnaru

In principle, a nonperson predicate needs only two possible outputs, "Don't know" and "Definitely not a person". It's acceptable for many actually-nonperson programs to be labeled "don't know", so long as no people are labeled "definitely not a person". […] The implicit difficulty is that the nonperson predicate must also pass some programs of high complexity that do things like "acceptably model humans" or "acceptably model future versions of the AI".

There's another difficulty: the nonperson predicate must not itself commit mindcrime while evaluating the programs. This sounds obvious enough in retrospect that it doesn't feel worth mentioning, but it took me a while to notice it.

Obviously, if you're running the program to determine if it's a person by analyzing its behavior (e.g. by asking it if it feels like it's conscious), you already committed mindcrime by the time you return "Don't know".

But if the tested program and the predicate are complex enough, lots of analysis other than straight running the program could accidentally instantiate persons as sub-processes, potentially ones distinct from those that might be instantiated by the tested program itself.

In other words: Assume Π is the set of all programs that potentially contain a person, i.e. for any program π, π ∈ Π iff running π could instantiate a person.

We want a computable safety predicate S such that, for any program π, S(π) implies π ∉ Π, i.e. S(π) means π is safe. (Though ¬S(π) does not necessarily imply π ∈ Π.)

The problem is that computing S(π) is itself running a program, and we need to make sure that this program is not in Π before running it. We can't use S to check it, because we'd first need to check that that check is not in Π, and so on…

(Note that a program that implements a sufficiently complex safety predicate S, when executed with another program π as input, might instantiate a person even if just running π directly would not!)

Nathan Fish

This, however, doesn't make it certain that no mindcrime will occur. It may not take an exact, faithful simulation of a specific human to create a conscious model. An efficient model of a (spread of possibilities for a) human may still contain computations that resemble a person closely enough to create consciousness, or whatever other properties may be deserving of personhood. Consider, in particular, an agent trying to use

Trying to use what?

Phil Goetz

Eliezer goes back and forth between "sapient" and "sentient", which are not synonyms. Neither is obviously a justification for claiming moral status as an agent.

It is important either to state clearly what one presumes gives an agent moral status (and hence what constitutes mindcrime), or to change each occurrence of "sapient", "sentient", and "personhood" to use the same word. I recommend stating the general case using personhood(X), a function to be supplied by the user and not defined here. Addressing the problem depends critically on what that function is, but the statement of the general case shouldn't be bound up with the choice of personhood predicate.

Choosing either "sapient" or "sentient" is problematic: "sentient" because it includes at least all mammals, and "sapient" because it really just means "intelligent", and the AI is going to be equally intelligent (defined as problem-solving or optimizing ability) whether it simulates humans or not. If intelligence grants moral standing (as it seems to here), and mindcrime means trapping an agent with moral standing in the AI's world, then the construction of any AI is inherently mindcrime.