In the context of a Task AGI, one application of what we call 'conservatism' is the Burrito Problem. Suppose I show the AI five burritos and five non-burritos. Rather than learning the simplest concept that distinguishes burritos from non-burritos and then creating something that is maximally a burrito under this concept, we would like the AI to learn a simple and narrow concept that classifies these five things as burritos according to some simple rule (not just the rule, "only these exact five objects are burritos") but which also classifies as few other objects as burritos as possible. This concept, however, must still be broad enough to permit the construction of a sixth burrito that is not molecularly identical to any of the first five, but not so broad that the burrito includes botulinum toxin (because, hey, anything made out of mostly carbon-hydrogen-oxygen-nitrogen that looks like a burrito ought to be fine).
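As a rough illustration of what "simple and narrow" could mean, here is a minimal sketch that assumes each object is summarized by two made-up numeric features and that the narrow concept is the tightest axis-aligned box around the positive examples. The feature choice, the `Box` class, and `tightest_box` are all hypothetical; nothing in the original proposal says conservatism should be implemented this way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """An axis-aligned box: a simple rule of the form lo <= feature <= hi."""
    lows: tuple
    highs: tuple

    def contains(self, x):
        return all(lo <= v <= hi for lo, v, hi in zip(self.lows, x, self.highs))


def tightest_box(positives):
    """Smallest box containing every positive example (assumed stand-in for 'narrow')."""
    dims = list(zip(*positives))
    return Box(tuple(min(d) for d in dims), tuple(max(d) for d in dims))


# Hypothetical features: (mass in grams, fraction of mass that is tortilla).
burritos = [(300.0, 0.20), (320.0, 0.22), (350.0, 0.25), (310.0, 0.21), (340.0, 0.24)]
non_burritos = [(50.0, 0.90), (500.0, 0.05), (300.0, 0.60), (10.0, 0.0), (800.0, 0.30)]

concept = tightest_box(burritos)

sixth = (330.0, 0.23)     # a new object, not identical to any of the five
extreme = (5000.0, 0.23)  # far outside anything seen in training

print(concept.contains(sixth))    # the narrow concept still admits a sixth burrito
print(concept.contains(extreme))  # but rejects objects far from the examples
```

The box covers all five positives with a simple rule, excludes the negatives, and leaves room for a sixth object that differs from the first five. It says nothing about features the box does not measure, which is where the botulinum toxin worry comes in.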
To me, the most natural way to approach this is to take a probability distribution over "what it means to be a burrito," and to produce a thing that is maximally likely to be a burrito rather than a thing which is maximally burrito-like. Of course this still depends on having a good distribution over "what it means to be a burrito" (as does your approach).
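A toy sketch of the distinction, under loud assumptions: objects are reduced to one hypothetical feature value, "burrito-like" means a single linear score, and the distribution over "what it means to be a burrito" is a uniform prior over every grid interval consistent with the labels. None of these choices come from the discussion itself; they only make the two objectives concrete.

```python
# Toy 1-D world: each object is summarized by one hypothetical feature value in [0, 10].
positives = [6.0, 6.5, 7.0, 7.5, 8.0]   # the five labeled burritos
negatives = [1.0, 2.0, 3.0, 3.5, 4.0]   # the five labeled non-burritos
grid = [i / 2 for i in range(21)]       # candidate values 0.0, 0.5, ..., 10.0


def burrito_likeness(x):
    """A single simple separating score: larger x means 'more burrito-like'."""
    return x - 5.0


def consistent_intervals():
    """All intervals [lo, hi] on the grid that contain every positive and no negative."""
    return [
        (lo, hi)
        for lo in grid
        for hi in grid
        if lo <= min(positives)
        and hi >= max(positives)
        and not any(lo <= n <= hi for n in negatives)
    ]


def probability_of_burrito(x, hypotheses):
    """Uniform-prior mixture: fraction of hypotheses that call x a burrito."""
    return sum(lo <= x <= hi for lo, hi in hypotheses) / len(hypotheses)


hypotheses = consistent_intervals()
most_burrito_like = max(grid, key=burrito_likeness)
most_likely_burrito = max(grid, key=lambda x: probability_of_burrito(x, hypotheses))

print(most_burrito_like, probability_of_burrito(most_burrito_like, hypotheses))
print(most_likely_burrito, probability_of_burrito(most_likely_burrito, hypotheses))
```

In this toy setup, maximizing the score picks the extreme value 10.0, which only a fifth of the consistent hypotheses accept. Maximizing the mixture probability picks a value inside the range of the labeled burritos, which every hypothesis accepts. The result depends entirely on the assumed hypothesis class and prior.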
Comments
Eliezer Yudkowsky
It's not obvious to me that these two approaches mean the same thing. Let's say that an AI sees some stale burritos and some fresh burritos, with the former classified as negative examples and the latter as positive examples. If you use the simplest, but not conservative, concept that classifies the training data, maybe you max out the probability that something will be classified as a burrito by eliminating every trace of staleness… or by moving even further along some dimension that distinguishes stale from fresh burritos.
Now, it's possible that this would be fixed automatically by having a mixture of hypotheses about what might underlie the good-burrito classification and that one of the hypotheses would be "maybe a burrito can't be too fresh", but again, this is not obvious to me.
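The following sketch illustrates how much rides on that one hypothesis being present. It assumes a single hypothetical "freshness" score, a handful of soft-threshold hypotheses that all say "fresher is better", and an optional extra hypothesis that penalizes excessive freshness. The functional forms, thresholds, and weights are invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


grid = [i / 2 for i in range(21)]  # hypothetical freshness scores 0.0 .. 10.0

# "Simple" hypotheses: fresher is always better, with different soft thresholds.
thresholds = [4.5, 5.0, 5.5]


def simple_mixture(x):
    """Average probability of 'burrito' under the fresher-is-better hypotheses."""
    return sum(sigmoid(2.0 * (x - t)) for t in thresholds) / len(thresholds)


def too_fresh_hypothesis(x):
    """Hypothetical extra hypothesis: fresh is good, but not fresher than about 9."""
    return sigmoid(2.0 * (x - 5.0)) * sigmoid(2.0 * (9.0 - x))


def mixture(x, weight_on_too_fresh):
    """Mixture probability with the given prior weight on the extra hypothesis."""
    w = weight_on_too_fresh
    return (1.0 - w) * simple_mixture(x) + w * too_fresh_hypothesis(x)


best_without = max(grid, key=lambda x: mixture(x, 0.0))
best_with = max(grid, key=lambda x: mixture(x, 0.5))
print(best_without, best_with)
```

When every hypothesis in the mixture increases with freshness, the mixture does too, so its maximum sits at the edge of the candidate range. The maximum moves to an interior value only once a "can't be too fresh" hypothesis is included with appreciable weight, and nothing in this sketch explains where that hypothesis would come from.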
It seems to me that, in general, when we learn a mixture of the simplest concepts that might assign probabilities well over previously labeled classifications, we might still end up with something that has a nonconservative maximum. Maybe the AI learns to model the human system for classifying burritos and then presents us with a weird object whose appearance hacks us into suddenly being absolutely certain that it is a burrito. This is just me trying to wave my hands in the direction of what seems like it might be an underlying difference between "learn a probabilistic classification rule and max it out" and "try to draw a simple concept that is conservatively narrow".
It might be the case that given sufficient imagination to consider many possible hypotheses, trying to fit all of those hypotheses well (which might not be the same as maxing out the mixture) is an implementation of conservatism, or even that just trying to max out the mixture turns out to implement conservatism in practice. But then it might also be the case that in the not-far-superhuman regime, taking a direct approach to making merely powerful learning systems be 'conservative' rather than 'max out the probability considering many hypotheses' would be more tractable or straightforward as an engineering problem.