"I have an intuition that sa..."

https://arbital.com/p/78

by Alexei Andreev Jun 17 2015


I have an intuition that says that if you run any sufficiently large computation (even if it's as simple as multiplication, e.g. (3^^^3)*(3^^^3)), you'll likely accidentally create sentient life within it. Checking for that seems prohibitively expensive, or maybe even impossible, since the check itself might run into the same problem.
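For a sense of scale, here is what the up-arrow notation in the example unpacks to (these are the standard Knuth up-arrow identities, nothing specific to the argument):

$$
\begin{aligned}
3\uparrow 3 &= 3^3 = 27,\\
3\uparrow\uparrow 3 &= 3^{3^3} = 3^{27} = 7{,}625{,}597{,}484{,}987,\\
3\uparrow\uparrow\uparrow 3 &= 3\uparrow\uparrow\bigl(3\uparrow\uparrow 3\bigr) = \underbrace{3^{3^{\cdot^{\cdot^{\cdot^{3}}}}}}_{7{,}625{,}597{,}484{,}987\ \text{threes}}.
\end{aligned}
$$

The product of two such numbers is far beyond anything physically computable; the point of the example is only that "sufficiently large" can be reached by completely mundane arithmetic.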


Comments

Paul Christiano

Eliezer, I find your position confusing.

Consider the first AI system that can reasonably predict your answers to questions of the form "Might X constitute mindcrime?" where X is a natural language description of some computational process. (Well enough that, say, most of a useful computation can be flagged as "definitely not mindcrime," and all mindcrime can be flagged as "maybe mindcrime.")
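To make that asymmetry concrete, here is a minimal sketch of the contract being described, under purely hypothetical assumptions: the names (`Verdict`, `screen_computation`, `run_if_clear`) and the keyword allowlist are invented for illustration and stand in for the hypothetical predictor, not for any actual system.

```python
from enum import Enum, auto

class Verdict(Enum):
    DEFINITELY_NOT_MINDCRIME = auto()
    MAYBE_MINDCRIME = auto()

# Placeholder standing in for the hypothetical predictor of Eliezer's
# judgments: only descriptions matching an allowlist clear the screen,
# and everything else is conservatively flagged.
OBVIOUSLY_SAFE = ("integer multiplication", "matrix multiplication", "sorting a list")

def screen_computation(description: str) -> Verdict:
    """Conservative screen over a natural-language description of a computation.

    Intended contract (asymmetric, as in the paragraph above):
      - everything that actually constitutes mindcrime must come back
        MAYBE_MINDCRIME (no false negatives), and
      - most of the computations the AI actually needs should come back
        DEFINITELY_NOT_MINDCRIME, so the screen rarely blocks useful work.
    """
    if any(phrase in description.lower() for phrase in OBVIOUSLY_SAFE):
        return Verdict.DEFINITELY_NOT_MINDCRIME
    return Verdict.MAYBE_MINDCRIME  # err on the side of flagging

def run_if_clear(description: str, run, ask_human):
    """Only run computations that screen clear; escalate marginal cases."""
    if screen_computation(description) is Verdict.DEFINITELY_NOT_MINDCRIME:
        return run()
    return ask_human(description)
```

Everything of substance is hidden inside `screen_computation`; the rest of this thread is essentially about whether that predictor can be built without itself committing mindcrime.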

Do you believe that this system will have significant moral disvalue? If that system doesn't have moral disvalue, where is the chicken-and-egg problem?

So it seems like you must believe that this system will have significant moral disvalue. That sounds implausible on its face to me. What are you imagining this system will look like? Do you think that this kind of question is radically harder than other superficially comparable question-answering tasks? Do you think that any AI researchers will find your position plausible? If not, what do you think they are getting wrong?

ETA: maybe the most useful thing to clarify would be what kind of computation you would find really hard to classify, yet might plausibly be unavoidable for effective computation, and how it relates to the rest of what the AI is doing.

This whole disagreement may be related to broader disagreements about how aligned AI systems will look. But you seem to think that mindcrime is also a problem for act-based agents, so that can't explain all of it. We might want to restrict attention to the act-based case in order to isolate disagreement specific to mindcrime, and it's possible that discussion should wait until we get on the same page about act-based agents.

Eliezer Yudkowsky

Consider the first AI system that can reasonably predict your answers to questions of the form "Might X constitute mindcrime?"…

Do you think that this kind of question is radically harder than other superficially comparable question-answering tasks?

Yes! It sounds close to FAI-complete in the capacities required. It sounds like trying to brute-force an answer to it via generalized supervised learning might easily involve simulating trillions of Eliezer-models. In general you and I seem to have very different intuitions about how hard it is to get a good answer to "deep, philosophical questions" via generalized supervised learning.

Paul Christiano

The obvious patch is for a sufficiently sophisticated system to have preferences over its own behavior, which motivate it to avoid reasoning in ways that we would dislike.

For example, suppose that my utility function U is "how good [idealized Eliezer] thinks things are, after thinking for a thousand years." It doesn't take long to realize that [idealized Eliezer] would be unhappy with a literal simulation of [idealized Eliezer]. Moreover, a primitive understanding of Eliezer's views suffices to avoid the worst offenses (or at least to realize that they are the kinds of things which Eliezer would prefer that a human be asked about first).

An AI that is able to crush humans in the real world without being able to do this kind of reasoning seems catastrophic on other grounds. An AI that is able but unmotivated to carry out or act on this kind of reasoning seems even more catastrophic for other reasons. (For example, I don't see any realistic approach to corrigibility that wouldn't solve this problem as well, and conversely I see many ways to resolve both.)

Edit: Intended as a response to the original post, but no way to delete and repost as far as I can tell.

Eliezer Yudkowsky

The obvious patch is for a sufficiently sophisticated system to have preferences over its own behavior, which motivate it to avoid reasoning in ways that we would dislike.

My worry here would be that we'll run into a Nearest Unblocked Neighbor problem on our attempts to define sapience as a property of computer simulations.

For example, suppose that my utility function U is "how good [idealized Eliezer] thinks things are, after thinking for a thousand years." It doesn't take long to realize that [idealized Eliezer] would be unhappy with a literal simulation of [idealized Eliezer].

Let's say that sapience_1 is a definition that covers most of the 'actual definition of sapience' (e.g. what we'd come up with given unlimited time to think, etc.), which I'll call sapience_0, relative to some measure on probable computer programs. But there are still exceptions: there are sapient_0 things not detected by sapience_1. The best hypothesis for predicting an actually sapient mind that is not in sapience_1 seems unusually likely to be one of the special cases that is still in sapience_0. It might even just be an obfuscated ordinary sapient program, rather than one with an exotic kind of sapience, if sapience_1 doesn't incorporate some advanced-safe way of preventing obfuscation.
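A deliberately crude toy model of this selection effect (all the numbers and names below are made up for illustration; nothing here is a proposal for how to actually detect sapience): score a large pool of candidate hypotheses by predictive accuracy, filter out the ones a leaky detector flags, and look at where the optimizer's favorite allowed hypothesis lands.

```python
import random

random.seed(0)

# Made-up toy assumptions: 1% of candidate hypotheses really simulate the
# person (sapient_0); the detector (sapient_1) catches 95% of those; and
# hypotheses that really simulate the person predict the person much better.
N = 100_000
hypotheses = []
for _ in range(N):
    sapient_0 = random.random() < 0.01
    sapient_1 = sapient_0 and random.random() < 0.95      # detector misses 5%
    accuracy = random.gauss(0.9 if sapient_0 else 0.5, 0.05)
    hypotheses.append({"accuracy": accuracy, "sapient_0": sapient_0, "sapient_1": sapient_1})

allowed = [h for h in hypotheses if not h["sapient_1"]]   # filter by sapience_1
best = max(allowed, key=lambda h: h["accuracy"])          # optimize for prediction
print("Best allowed hypothesis is actually sapient_0:", best["sapient_0"])
```

Under these made-up numbers the winner is almost always in the sapient_0-but-not-sapient_1 gap: the filter removes most person-simulations, and the optimizer then concentrates exactly on the ones that slipped through.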

We can't throw a superhumanly sophisticated definition at the problem (e.g. the true sapience_0 plus an advanced-safe block against obfuscation) without already asking the AI to simulate us, or to predict the results of simulating us, in order to obtain this hypothetical sapience_2.

Moreover, a primitive understanding of Eliezer's views suffices to avoid the worst offenses (or at least to realize that they are the kinds of things which Eliezer would prefer that a human be asked about first).

This just isn't obvious to me. It seems likely to me that an extremely advanced understanding of Eliezer's idealized views is required to answer questions about what Eliezer would say about consciousness with extreme accuracy, without committing mindcrime along the way.

Paul Christiano

My views about Eliezer's preferences may depend on the reason that I am running X, rather than merely the content of X. E.g. if I am running X because I want to predict what a person will do, that's a tipoff. For this sort of thing to work, the capabilities being used to guide my thinking have to be matched by the capabilities being used to assess that thinking to see whether it constitutes mindcrime.

But so does the whole project. You've said this well: "you just build the conscience, and that is the AI." The AI doesn't propose a way of figuring out X and then reject or not reject it because it constitutes mindcrime, any more than it proposes an action to satisfy its values and then rejects or fails to reject it because the user would consider it immoral. The AI thinks the thought that it ought to think, as best it can figure out, just like it does the thing that it ought to do, as best it can figure out.

Note that you are allowed to just ask about or avoid marginal cases, as long as the total cost of asking or inconvenience of avoiding is not large compared to the other costs of the project. And whatever insight you would have put into your philosophical definition of sapience, you can try to communicate it as well as possible as a guide to predicting "what Eliezer would say about X," which can circumvent the labor of actually asking.

Alexei Andreev

Ok, Eliezer, you've addressed my point directly with the sapience_0 / sapience_1 example. That makes sense. I guess one pitfall for an AI might be to keep improving its sapience model without end, because "Oh, gosh, I really don't want to create life by accident!" I guess this just falls into the general category of problems where the AI does thing X for a long time before getting around to satisfying human values, where thing X is actually plausibly necessary. Not sure if you have a name for a pitfall like that. I can try my hand at creating a page for it, if you don't have it already.

Eliezer Yudkowsky

Paul, I don't disagree that we want the AI to think whatever thought it ought to think. I'm proposing a chicken-and-egg problem where the AI can't figure out which thoughts constitute mindcrime without already committing mindcrime. I think you could record a lot of English pontification from me and still have a non-person-simulating AI feeling pretty confused about what the heck I meant or how to apply it to computer programs. Can you give a less abstract view of how you think this problem should be solved? What human-understanding and mindcrime-detection abilities do you think the AI can develop, in what order, without committing lots of mindcrime along the way? Sure, given infinite human understanding, the AI can detect mindcrime very efficiently, but the essence of the problem is that it seems hard to get infinite human understanding without lots of mindcrime being committed along the way. So what is it you think can be done instead, that postulates only a level of human understanding you think can be obtained, knowably, without simulating people?