So you see the difference as whether the programmers have to actually supply the short-term objective, or whether the AI learns the short-term objective they would have defined (or would accept or prefer)?
The distinction seems to buy you relatively little safety at a great cost (basically taking the system from "maybe it's good enough?" to "obviously operating at an incredible disadvantage"). You seem to think that it buys you much more safety than I do.
This statement confuses me. (Remember that you know more about my scenarios than I know about yours, so it will help if you can be more specific and concrete than your first-order intuition says is necessary.)
Considering these two scenarios…
- X. Online genie. The AI gets short-term objectives from humans and carries them out under some general imperative to act conservatively or with 'low unnecessary impact' in some sense of that phrase; it describes plans and probable consequences, which are subject to further human checking, then executes them, and the humans observe the results and file more requests.
- Y. Friendly sovereign. The AI runs something like coherent extrapolated volition, deciding entirely on its own what is a good idea in the long term and then doing it.
…it seems to me that the gap between X and Y very plausibly marks a case where it is much easier to safely build X, though I also reserve some probability mass for the case where almost all the difficulty of value alignment lies in things like reflective stability and "getting the AI to do anything you specify, at all," so that going from X to Y adds only 1% more real difficulty. I also don't think that X would be at a computational disadvantage compared to Y. X seems to need to solve far fewer of the sorts of problems that I consider dangerous and philosophically fraught (though I think we have a core disagreement here, where you regard 'philosophically fraught' as much less dangerous than I do).
I suspect you're parsing the AI space differently, such that X and Y are not natural clusters for you. Rather than have me guess, would you like to go ahead and state your own parsing?