As you've probably gathered, I feel hopeless about case (1).
Okay, I didn't understand this. My reaction was something like: "Isn't conservatively generalizing burritos from sample burritos a much simpler problem than defining an ideal criterion for burritos, which probably requires something like an ideal advisor theory over extrapolated humans to talk about all the poisons that people could detect given enough computing power?" But I think I should maybe just ask you to clarify what you mean. The interpretation my brain generated was something like "Predicting a human 9p's Go moves is easier than generating 9p-level Go moves," which seems clearly false to me, so I probably misunderstood you.
In case (2), any agent that can learn the concept "definitely a burrito" could use this concept to produce definitely-burritos and thereby achieve high reward in the RL game. So the mere existence of the easy-to-learn definitely-a-burrito concept seems to imply that our learner will behave well. We don't actually have to do any explicit work on conservative concepts (except to better understand the behavior of our learner).
I don't understand this at all. Are we supposing that we have an inviolable physical machine that outputs burrito ratings and can't be gamed by seizing control of the reward channel or by including poisons that the machine-builders didn't know about? …Actually, I should just ask you to clarify this paragraph.
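For what it's worth, the closest concrete reading I can construct of the case-(2) argument is something like the sketch below, where `is_definitely_burrito`, `propose_candidate`, and everything else are placeholders I'm inventing to check my understanding, not anything you've actually specified:

```python
# My best-guess reading of case (2): the agent has already learned a conservative
# "definitely a burrito" classifier, and the claim (as I understand it) is that the
# agent can achieve high reward just by filtering its proposals through that
# classifier. All names and structures here are stand-ins I made up.

import random

def is_definitely_burrito(candidate) -> bool:
    """Conservative concept: True only for things squarely inside the
    distribution of sample burritos (placeholder)."""
    return candidate.get("looks_like_training_burrito", False)

def propose_candidate():
    """The agent's generative model of things it could produce (placeholder)."""
    return random.choice([
        {"looks_like_training_burrito": True},                       # ordinary burrito
        {"looks_like_training_burrito": False, "novel_filling": 1},  # weird edge case
    ])

def act():
    """Only output things the conservative concept endorses."""
    while True:
        candidate = propose_candidate()
        if is_definitely_burrito(candidate):
            return candidate  # presumably the reward channel then pays this out
```

If that's roughly the intended picture, then my question above is about the last step, where the filtered output turns into reward: whether the rating machine itself is being assumed un-gameable.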