There seems to be some equivocation here between two motivations for studying corrigibility.
As far as I can tell, there are two obvious routes to solving the "switch problem":
- Have a principled treatment of normative uncertainty + indirect normativity that yields the desired behavior with respect to reflective consistency (and VOI)
- Adopt the instrumental preferences of users over possible shutdown / self-modification / etc.
It looks like both of these will probably work if we are able to solve the rest of the AI control problem.
With this in mind, I thought the motivation for studying corrigibility was the intuition that it should follow from some kind of intellectual humility, which we don't yet understand or have any model of. This seems pretty sensible to me. It's also explicit in the Arbital page on [corrigibility].
But utility indifference doesn't seem to address this motivation at all, no matter how well it works out. Instead, it is aimed at resolving some of the symptoms of the underlying issue. So talking about it as an approach to corrigibility (and indeed one of the only concrete approaches) seems to undermine the offered motivation for corrigibility, and to presuppose that the more natural approaches to the "switch problem" don't work. This at least requires some kind of explanation.
I think this may be practically relevant because many mainstream AI researchers might be very sympathetic to work on corrigibility if they understood the problem (and would be much more open to the intellectual humility angle).