[summary: There's a laundry list of things that might go wrong when we suppose that an advanced AI is checking something Potentially Bad with the user/operator/programmer to see if the user labels the thing as Considered Bad, and relying on this step of the workflow to exclude things that are Actually Bad. E.g., the user might not be able to detect Actually Bad things reliably; the space of Potentially Bad things might be so broad that the Actually Bad things are 1,000 items down the list of Potentially Bad things; the AI might just learn to do things that won't be Considered Bad and thereby seek out special cases of bad things that the user can't detect as bad; etcetera.]
If we're supposing that an advanced agent is checking something Potentially Bad with its user to find out if the thing is Considered Bad by that user, we need to worry about the following generic issues:
- Can the AI tell which things are Potentially Bad in a way that includes all things that are Actually Bad?
- Can the user reliably tell which Potentially Bad things are Actually Bad?
- Does the AI, emergently or deliberately, seek out Potentially Bad things that the user will not label as Considered Bad, thereby potentially optimizing for Actually Bad things that the user mislabels as Not Bad? (E.g., if the agent learns to avoid new attempts similar to those already labeled bad, we're only excluding the Considered Bad space, and what's left may still contain Actually Bad things via the Nearest unblocked strategy or a similar phenomenon.)
- Is the criterion for Potential Badness so broad, and are Actually Bad things hard enough to reliably prioritize within that space, that 10% of the time an Actually Bad thing will not appear among the top 1,000 Potentially Bad things the user can afford the time to check? (The toy sketch after this list illustrates this kind of budget gap.)
- Can the AI successfully communicate to the user the details of what set off the flag for Potential Badness, or even communicate to the user exactly what was flagged as Potentially Bad, if this is an important part of the user making the decision?
- Do the AI's communication goals risk [optimizing_user optimizing the user]?
- Are the details of Potential Badness, or even the subject of Potential Badness, so inscrutable as to be impenetrable? (E.g., AlphaGo trying to explain to a human why a Go move is potentially bad, or for that matter, a Go professional trying to explain to an amateur why a Go move is potentially bad. In that case we might just be left with blind trust, at which point we might as well tell the AI not to do Potentially Bad things rather than asking it to pointlessly check with the user.)
- Does the AI, emergently or instrumentally, optimize for the user not labeling things as Considered Bad, thereby potentially leading to deception of the user?
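Several of these failure modes come down to the interaction between an imperfect flagging criterion, a finite user-checking budget, and a fallible labeler. The toy sketch below (in Python, with entirely hypothetical names such as `potential_badness`, `USER_BUDGET`, and `check_with_user`) just makes those moving parts concrete; it is an illustration of the failure modes above under made-up numbers, not a proposed mechanism.

```python
import random

random.seed(0)


def flag_score(actually_bad):
    """Hypothetical flagging heuristic for Potential Badness.

    Correlated with actual badness, but noisy -- which is exactly the gap
    that lets Actually Bad plans fall below the user's checking budget.
    """
    base = 0.6 if actually_bad else 0.3
    return base + random.gauss(0, 0.2)


def make_plan(i):
    actually_bad = random.random() < 0.02     # ground truth, hidden from the AI
    return {"id": i,
            "actually_bad": actually_bad,
            "potential_badness": flag_score(actually_bad)}


PLANS = [make_plan(i) for i in range(100_000)]

USER_BUDGET = 1_000        # flags the user can afford the time to check
USER_ERROR_RATE = 0.05     # the user sometimes mislabels a flagged plan


def user_labels_considered_bad(plan):
    """The user's judgment: usually right, sometimes wrong."""
    correct = plan["actually_bad"]
    return correct if random.random() > USER_ERROR_RATE else not correct


def check_with_user(plans):
    # The Potentially Bad criterion is broad, so rank by flag score and show
    # only the top USER_BUDGET plans to the user.
    flagged = sorted(plans, key=lambda p: p["potential_badness"], reverse=True)
    shown = flagged[:USER_BUDGET]
    blocked = {p["id"] for p in shown if user_labels_considered_bad(p)}

    # Anything Actually Bad that ranked below the budget cutoff, or that the
    # user mislabeled, stays available -- and an agent steering around the
    # blocked set (nearest unblocked strategy) will tend to land on it.
    missed = [p for p in plans if p["actually_bad"] and p["id"] not in blocked]
    return blocked, missed


blocked, missed = check_with_user(PLANS)
print(f"plans blocked: {len(blocked)}")
print(f"Actually Bad plans still available: {len(missed)}")
```

With these toy numbers, most of the Actually Bad plans typically remain available after the check, through the combined effect of the budget cutoff, the noisy flag, and the user's occasional mislabels; the deliberate versions of the problem (the agent seeking out exactly the plans the user will not label as Considered Bad, or optimizing against being flagged at all) only widen that gap.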