This (and many of your concerns) seems basically sensible to me. But I tend to read them more broadly as a reductio against particular approaches to building aligned AI systems (e.g. building an AI that pursues an explicit and directly defined goal). And so I tend to say things like "I don't expect X to be a problem," because any design that suffers from problem X is likely to be totally unworkable for a wide range of reasons. You tend to say "X seems like a serious problem." But it's not clear whether we actually disagree.
One way we may disagree is about what we expect people to do. I think that for the most part reasonable people will be exploring workable designs, or designs that are unworkable for subtle reasons, rather than trying to fix manifestly unworkable designs. You perhaps doubt that there are any reasonable people in this sense.
Another difference is that I am inclined to look at people who say "X is not a problem" and imagine them saying something closer to what I am saying. E.g. if you present a difficulty with building rational agents with explicitly represented goals and an AI researcher says that they don't believe this is a real difficulty, it may be because your comments are (at best) reinforcing their view that sophisticated AI systems will not be agents pursuing explicitly represented goals.
(Of course, I agree that both happen. If we disagree, it's about whether the charitable interpretation is sometimes accurate vs. almost never accurate, or perhaps about whether proceeding under maximally charitable assumptions is tactically worthwhile even if it often proves to be wrong.)
Comments
Paul Christiano
It seems unlikely we'll ever build systems that "maximize X, but rule out some bad solutions with the ad hoc penalty term Y," because that looks totally doomed. If you want to maximize something that can't be explicitly defined, it looks like you have to build a system that doesn't maximize something which is explicitly defined. (This is part of an even broader point: "do X but not Y" is just one kind of ad hoc proxy for our values, and ad hoc proxies for what we really care about just don't seem very promising.)
In some sense this is merely strong agreement with the basic view behind this post. I'm not sure if there is any real disagreement.
Eric Rogstad
Did you mean to say "will not be"?
Paul Christiano
Yeah, thanks.
Eliezer Yudkowsky
Paul, I'm having trouble isolating a background proposition on which we could more sharply disagree. Maybe it's something like, "Will relevant advanced agents be consequentialists or take the maximum of anything over a rich space?" where I think "Yes" and you think "No, because approval agents aren't like that" and I reply "I bet approval agents will totally do that at some point if we cash out the architecture more." Does that sound right?
I'll edit the article to flag that Nearest Neighbor emerges from consequentialism and/or bounded maximization over a rich domain where values cannot be precisely and accurately hardcoded.