[summary: The problem of communicating to an AI a very simple local concept on the order of "strawberries" or "give me a strawberry". This level of the problem is meant to include subproblems of local categorization like "I meant a real strawberry like the ones I already showed you, not a fake strawberry that merely looks similar through your webcam". It isn't meant to include larger problems like verifying that a plan uses only known, whitelisted methods or identifying all possible harmful effects we could care about.]
The problem of trying to figure out how to communicate to an AGI an intended goal concept on the order of "give me a strawberry, and not a fake plastic strawberry either".
At this level of the problem, we're not concerned with e.g. larger problems of safe plan identification such as not mugging people for strawberries, or minimizing side effects. We're not (at this level of the problem) concerned with identifying each and every one of the components of human value, as they might be impacted by side effects more distant in the causal graph. We're not concerned with philosophical uncertainty about what we should mean by "strawberry". We suppose that in an intuitive sense, we do have a pretty good idea of what we intend by "strawberry", such that there are things that are definitely strawberries and we're pretty happy with our sense of that so long as nobody is deliberately trying to fool it.
We just want to communicate a local goal concept that distinguishes edible strawberries from plastic strawberries, or nontoxic strawberries from poisonous strawberries. That is: we want to say "strawberry" in an understandable way that's suitable for fulfilling a task of "just give Sally a strawberry", possibly in conjunction with other features like conservatism or low impact or mild optimization.
For some open subproblems of the obvious approach that goes through showing actual strawberries to the AI's webcam, see "Identifying causal goal concepts from sensory data" and "Look where I'm pointing, not at my finger".
Comments
Paul Christiano
I think it's going to be hard to talk or think clearly about these problems (even at the level of separating them into distinct problems or telling which are real problems) until we get more specific about what a goal is, what a concept is, etc. What does the overall system actually look like, even very roughly?
I guess your take is that this is tied up in a very hard-to-separate way from the design of AI itself.
I understand that it is good to throw out some concrete problems before embarking on the project of clarifying our models of powerful AI systems. But I suspect you need at least some model of a powerful AI system where the questions make sense, just to keep things vaguely on track.
Eliezer Yudkowsky
Questions like these seem to me to have obvious unbounded formulations. If we're talking about a modern policy-reinforcement neural network, then yes, the notion of a separable goal is more ephemeral. Does this seem to agree with your own state of mind, or would you disagree that we understand the notion of 'goal concepts' in unbounded formulations, or…?
A concept is something that discretely or fuzzily classifies states of the world, or states of a slice through the world, into positive or negative instances. A "goal concept", for a satisficing agent, then describes the set of worlds that it's trying to steer us into. The more general version of this is a utility function.
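As a minimal sketch of this definition, assuming a toy world with only two features: a concept can be written as a function from world states to a membership degree in [0, 1], a goal concept as the concept a satisficing agent tries to make true, and a utility function as the same kind of function used for ranking outcomes instead of thresholding them. Every name here (`WorldState`, `sally_has_strawberry`, `satisfice`, `maximize`) is a hypothetical illustration and not part of any proposed design.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class WorldState:
    """Hypothetical toy world state; both fields are illustrative assumptions."""
    sally_has_object: bool
    object_is_real_strawberry: bool


# A concept classifies states: 1.0 = positive instance, 0.0 = negative,
# values in between = fuzzy membership.
Concept = Callable[[WorldState], float]


def sally_has_strawberry(state: WorldState) -> float:
    """A discrete goal concept: positive only if Sally holds a real strawberry."""
    if state.sally_has_object and state.object_is_real_strawberry:
        return 1.0
    return 0.0


def satisfice(goal: Concept, outcomes: Iterable[WorldState],
              threshold: float = 1.0) -> Optional[WorldState]:
    """Return the first outcome that counts as a positive instance, if any."""
    for state in outcomes:
        if goal(state) >= threshold:
            return state
    return None


def maximize(utility: Concept, outcomes: Iterable[WorldState]) -> WorldState:
    """The more general case: rank outcomes by a utility function."""
    return max(outcomes, key=utility)


nothing = WorldState(sally_has_object=False, object_is_real_strawberry=False)
plastic = WorldState(sally_has_object=True, object_is_real_strawberry=False)
real = WorldState(sally_has_object=True, object_is_real_strawberry=True)

chosen = satisfice(sally_has_strawberry, [nothing, plastic, real])
```

In this toy setup the plastic strawberry is a negative instance only because the state carries an explicit `object_is_real_strawberry` field. The sketch assumes that distinction is already available to the agent; making it available is the part the page treats as the open problem.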
Paul Christiano
On this definition, what is the difference between "communicating a goal concept" and "communicating a goal"?
Is the problem in this post equivalent to the special case of value learning where the values to be learned are simple (to us), local, and philosophically unproblematic?
Eliezer Yudkowsky
I think we're going to have to specialize the terminology so we have separate words for "learn any goal concept" and "learn human normativity" instead of calling these both "value", which is something I'm currently trying to think how to revise. But if by value learning you mean "outcome-preference-criterion learning" and not value learning, then yes, we're looking for outcome-preference-criterion learning where the criterion seems simple to us, is hopefully local, and is philosophically unproblematic by our own standards. Like, say, having the outcome be one in which we just have a damn strawberry.
In the language being used here, it sounds to me like "communicating a goal" should parse to "communicating a goal concept to an agent which will then optimize for the outcome-preference-criterion you're about to communicate to it."