These arguments seem weak to me.
- I think the basic issue is that you are not properly handling uncertainty about what will be practically needed to train an agent out of infrahuman errors in language understanding. Your arguments seem much more reasonable under a particular model (e.g. a system that is making predictions or plans and develops language understanding as a tool for making better predictions), but it seems hard to justify 90% confidence in that model.
- It's not at all clear that language understanding means identifying "natural" categories. Whether values have high information content doesn't seem like a huge consideration given what I consider plausible approaches to language learning: it's the kind of thing that makes the problem linearly harder / requires linearly more data, rather than causing a qualitative change.
- It seems clear that "right" does not mean "whatever a human would judge right given a persuasive argument." That's one way we might try to define "right," but it's clearly an alternative to a natural-language understanding of "right" (an alternative I consider more plausible), not an aspect of it.
- "Do the right thing" does not have to cash out as a function from outcomes --> rightness followed by rightness-maximization. That's not even really an intuitive way to cash it out.
- The key issue may be how gracefully natural language understanding degrades under uncertainty. Again, you seem to be imagining a distribution over vague maps from outcome --> rightness which is then maximized in expectation, whereas I (and I think most people) are imagining an incomplete set of tentative views about rightness. That incomplete set can include strong claims about things like violations of human autonomy (even though autonomy is similarly defined by an incomplete set of tentative views rather than by a distribution over maps from outcome --> autonomy). A toy sketch of this contrast follows below.
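To make that contrast concrete, here is a toy Python sketch of my own (purely illustrative; the outcomes, views, and numbers are all made up): the first agent maximizes expected rightness over sampled candidate rightness maps, while the second consults an incomplete set of tentative views, treats some of them (here, the autonomy claim) as strong constraints, and defaults to conservatism where its views are silent.

```python
import random

# Toy illustration only; every outcome, score, and "view" below is hypothetical.

OUTCOMES = ["ask_permission", "act_unilaterally", "do_nothing"]

# --- Picture 1: a distribution over vague maps outcome --> rightness,
#     which the agent then maximizes in expectation. ---

def sample_rightness_map():
    """Draw one vague candidate map from outcome to rightness (made-up noise)."""
    return {
        "ask_permission":   random.gauss(0.6, 0.3),
        "act_unilaterally": random.gauss(0.7, 0.3),  # looks slightly better on average
        "do_nothing":       random.gauss(0.1, 0.3),
    }

def expected_rightness_choice(n_samples=1000):
    """Pick the outcome with the highest average rightness across sampled maps."""
    maps = [sample_rightness_map() for _ in range(n_samples)]
    def expected_rightness(outcome):
        return sum(m[outcome] for m in maps) / n_samples
    return max(OUTCOMES, key=expected_rightness)

# --- Picture 2: an incomplete set of tentative views about rightness.
#     Some views are strong claims (e.g. about autonomy violations);
#     where the views are silent, the agent stays conservative. ---

TENTATIVE_VIEWS = {
    # outcome: (verdict, strength) -- deliberately incomplete.
    "act_unilaterally": ("wrong", "strong"),    # violates human autonomy
    "ask_permission":   ("right", "tentative"),
    # no view at all about "do_nothing"
}

def tentative_views_choice():
    """Rule out anything a strong view condemns; act only where a view endorses."""
    allowed = [o for o in OUTCOMES
               if TENTATIVE_VIEWS.get(o, (None, None)) != ("wrong", "strong")]
    endorsed = [o for o in allowed
                if TENTATIVE_VIEWS.get(o, (None, None))[0] == "right"]
    # No maximization step: absent any endorsement, fall back to doing nothing.
    return endorsed[0] if endorsed else "do_nothing"

if __name__ == "__main__":
    print("expected-rightness maximizer picks:", expected_rightness_choice())
    print("tentative-views agent picks:", tentative_views_choice())
```

The point of the toy is only structural: the second agent's incomplete set of tentative views is not a probability distribution over complete rightness maps, and its strong claims (here, about autonomy) survive as constraints even when everything else is uncertain, whereas the expectation-maximizer will happily trade them off against slightly higher average scores.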
I agree that many commenters and some researchers are too optimistic about this kind of thing working automatically or by default. But I think your post doesn't engage with the substantive optimistic view.
It would be easier to respond if you gave a tighter argument for your conclusion, but it might also be worth someone actively making a tighter case for the optimistic view, especially if you genuinely don't understand the strong optimistic view (rather than just responding first to a weak version of it for clarity).