I think it's going to be hard to talk or think clearly about these problems (even at the level of separating them into distinct problems or telling which are real problems) until we get more specific about what a goal is, what a concept is, etc. What does the overall system actually look like, even very roughly?
I guess your take is that this is tied up in a very hard-to-separate way from the design of AI itself.
I understand that it is good to throw out some concrete problems before embarking on the project of clarifying our models of powerful AI systems. But I suspect you need at least some model of a powerful AI system in which the questions make sense, just to keep things vaguely on track.
Comments
Eliezer Yudkowsky
Questions like these seem to me to have obvious unbounded formulations. If we're talking about a modern policy-reinforcement neural network, then yes, the notion of a separable goal is more ephemeral. Does this seem to agree with your own state of mind, or would you disagree that we understand the notion of 'goal concepts' in unbounded formulations, or…?
A concept is something that discretely or fuzzily classifies states of the world, or states of a slice through the world, into positive or negative instances. A "goal concept", for a satisficing agent, then describes the set of worlds that it's trying to steer us into. The more general version of this is a utility function.
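As a rough illustration of these definitions, here is a minimal Python sketch. Everything in it is an assumption made for the example: the toy `World` dictionary, the strawberry-counting concepts, and the 0.5 threshold are hypothetical and not part of any actual proposal. It shows a discrete concept, a fuzzy concept, the set of worlds a satisficing agent would accept, and a utility function as the more general real-valued case.

```python
from typing import Callable, Dict, List

# Hypothetical toy world state: just named integer quantities.
World = Dict[str, int]

# A concept maps a world to a membership degree in [0, 1].
# A discrete concept only ever returns 0.0 or 1.0.
Concept = Callable[[World], float]


def has_strawberry(world: World) -> float:
    """Discrete concept: positive iff there is at least one strawberry."""
    return 1.0 if world.get("strawberries", 0) >= 1 else 0.0


def many_strawberries(world: World) -> float:
    """Fuzzy concept: membership grows with the count, capped at 1.0."""
    return min(world.get("strawberries", 0) / 10, 1.0)


def satisfices(goal: Concept, world: World, threshold: float = 0.5) -> bool:
    """A satisficing agent accepts any world that counts as a positive instance."""
    return goal(world) >= threshold


def goal_set(goal: Concept, worlds: List[World]) -> List[World]:
    """The set of worlds the goal concept picks out as acceptable."""
    return [w for w in worlds if satisfices(goal, w)]


def strawberry_utility(world: World) -> float:
    """A utility function: any real-valued score, not just in/out of a set."""
    return float(world.get("strawberries", 0))


worlds: List[World] = [
    {"strawberries": 0},
    {"strawberries": 1},
    {"strawberries": 7},
]

acceptable = goal_set(has_strawberry, worlds)   # worlds with >= 1 strawberry
best = max(worlds, key=strawberry_utility)      # a maximizer picks one world
```

In this toy setup the goal concept only separates acceptable worlds from unacceptable ones, while the utility function also ranks the acceptable worlds against each other. A discrete goal concept can be read as the special case of a utility function that takes only the values 0 and 1.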
Paul Christiano
On this definition, what is the difference between "communicating a goal concept" and "communicating a goal"?
Is the problem in this post equivalent to the special case of value learning where the values to be learned are simple (to us), local, and philosophically unproblematic?
Eliezer Yudkowsky
I think we're going to have to specialize the terminology so we have separate words for "learn any goal concept" and "learn human normativity" instead of calling both of these "value"; I'm currently trying to work out how to revise this. But if by value learning you mean "outcome-preference-criterion learning" rather than learning human normativity, then yes, we're looking for outcome-preference-criterion learning where the criterion seems simple to us, is hopefully local, and is philosophically unproblematic by our own standards. Like, say, having the outcome be one in which we just have a damn strawberry.
In the language being used here, it sounds to me like "communicating a goal" should parse to "communicating a goal concept to an agent which will then optimize for the outcome-preference-criterion you're about to communicate to it."