My weak claim is that the pseudo-genie will not have catastrophic failures unless either (1) it makes an inaccurate prediction or (2) the real genie has a catastrophic failure. This seems obvious on its face.
This seems true "so long as nothing goes wrong", i.e., so long as the human's behavior doesn't change when they haven't actually encountered the last 99 simulated questions, compared to the case where they did encounter them; so long as the pseudo-genie's putative outputs to the human don't differ in any way from the real-genie case, and in particular don't introduce any new cases of operator maximization that didn't exist in the real-genie case; etcetera.
Note that I would not expect many classes of Do What I Mean genies that we'd actually want to build in practice to be capable of making knowably, reliably accurate predictions at all the most critical junctures. In other words, I think that for most genies we'd want to build, the project of converting them to pseudo-genies would fail at the joint of inaccurate prediction. I think that if we had a real genie that was capable of knowable, reliable, full-coverage prediction of which orders it would receive, we could probably convert it to an acceptable working pseudo-genie using a lot less further effort and insight than was required to build the real genie, even taking into account that the genie might previously have relied on the human remembering the interactions from previous queries, etcetera. In this sense I think we're probably mostly on the same page about the weak claim, and differ mainly in how good a prediction of human behavior we expect from 'AIs we ought to construct' (about which we also currently have different beliefs).
Oh, and of course a minor caveat that mindcrime must not be considered catastrophic in this case.
My strong claim is that if the human behaves sensibly the pseudo-genie will not have catastrophic failures unless either (1) it makes a prediction which seems obviously and badly wrong, or (2) the real genie has a catastrophic failure.
The big caveat I have about this is that "obviously and badly wrong" must be evaluated relative to human notions of "obviously and badly wrong" rather than some formal third-party sense of how much it was a reasonable mistake to make given the previous data. (Imagine a series of ten barrels where the human can look inside the barrel but the AI can't. The first nine barrels contain red balls, and the human says 'red'. The tenth barrel has white balls and the AI says 'red' and the human shouts "No, you fool! Red is nothing like white!" It's an obvious big error from a human perspective but not from a reasonable-modeling perspective. The contents of the barrels, in this case, are metaphors for reflectively stable or 'philosophical' degrees of freedom which are not fully correlated.) The possible divergence in some critical new case between 'obviously objectionable to a human' and 'obviously human-objectionable to a good modeler given the previously observed data provided on human objectionality' is of course exactly the case for continuing to have humans in the loop.
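To make the gap between "reasonable given the data" and "actually right" concrete, here is a minimal sketch of the barrel example above. It assumes, purely for illustration, that the modeler uses Laplace's rule of succession (a uniform prior over the unknown chance of 'red' versus 'not red'); the function name `rule_of_succession` and that choice of modeling rule are my assumptions, not something the barrel story specifies.

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's rule: estimated probability that the next trial is a success,
    given `successes` out of `trials` so far and a uniform prior."""
    return Fraction(successes + 1, trials + 2)


if __name__ == "__main__":
    # Nine barrels observed, all labeled 'red' by the human.
    p_red_on_tenth = rule_of_succession(successes=9, trials=9)
    print(p_red_on_tenth)         # estimated chance the tenth barrel is 'red'
    print(float(p_red_on_tenth))  # same value as a decimal
```

Under this assumed rule the modeler's 'red' guess on the tenth barrel is well supported by the data, which is exactly why it is not an obvious error from the modeling perspective even though the human, who can see the white balls, finds it glaring.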
The minor caveat that jumps out at me about the strong claim is that, e.g., we can imagine a case where the real genie is supposed to perform 10 checks from different angles, such that it's not an "obviously and badly wrong misprediction" to say that the human misses any one of the checks, but normal operation would usually have the human catching at least one of the checks. This exact case seems unlikely because I'd expect enough correlation on which checks the actual humans miss, that if the pseudo-genie can make a catastrophic error that way, then the real genie probably fails somewhere (if not on the exact same problem). But the general case of systematic small divergences adding up to some larger catastrophe strikes me as more genuinely worrisome. To phrase it in a more plausible way, imagine that there's some important thing that a human would say in at least 1 out of 100 rounds of commands, so that failing to predict the statement on any given round is predictively reasonable and indeed the modal prediction, but having it not appear in all 100 rounds is catastrophic. I expect you to respond that in this case the human should have a conditionally very high probability of making the statement on the 100th round and therefore it's a big actual prediction error not to include it, but you can see how it's more the kind of prediction error that a system with partial coverage might make, plus we have to consider the situation if there is no final round and just a gradual decline into catastrophic badness over one million simulated rounds, etcetera.
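The arithmetic behind the 1-in-100 worry can be sketched in a few lines. This assumes, purely for illustration, that the human makes the important statement independently on each round with a fixed probability `p_say`; the independence assumption, the value 0.05, and all function names are hypothetical, and real human behavior would be correlated across rounds.

```python
def modal_prediction(p_say: float) -> bool:
    """Most likely outcome for a single round: True if the statement is made."""
    return p_say > 0.5


def prob_said_at_least_once(p_say: float, rounds: int) -> float:
    """Probability the statement appears in at least one of `rounds` rounds,
    assuming independent rounds (an illustrative assumption)."""
    return 1.0 - (1.0 - p_say) ** rounds


def statements_under_modal_predictor(p_say: float, rounds: int) -> int:
    """Number of rounds containing the statement when every round is
    filled in with its own modal prediction."""
    return sum(1 for _ in range(rounds) if modal_prediction(p_say))


if __name__ == "__main__":
    p_say = 0.05   # hypothetical per-round probability
    rounds = 100
    print(modal_prediction(p_say))                          # per-round guess
    print(prob_said_at_least_once(p_say, rounds))           # real human, somewhere
    print(statements_under_modal_predictor(p_say, rounds))  # predicted transcript
```

With these made-up numbers, every individual round's prediction is the reasonable one, yet the stitched-together transcript never contains a statement that the real human would almost certainly have made somewhere along the way.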
No, I'm just arguing that if you had an AI that works well with human involvement, then you can make one that works well with minimal human involvement, modulo certain well-specified problems in AI (namely making good enough predictions about humans).
I think I mostly agree, subject to the aforementioned caveat that "good enough prediction" means "really actually accurate" rather than "reasonable given previously observed data". I'm willing to call that well-specified, and it's possible that total coverage of it could be obtained given a molecular-nanotech level examination of a programmer's brain plus some amount of mindcrime.
This is like one step of ten in the act-based approach, and so to the extent that we disagree it seems important to clear that up.
I'm sorry if I seem troublesome or obstinate here. My possibly wrong or strawmanning instinctive model of one of our core disagreements is that, in general, Eliezer thinks "The problem of making a smarter-than-human intelligence that doesn't kill you, on the first try, is at least as metaphorically difficult as building a space shuttle in a realm where having the wrong temperature on one O-ring will cause the massive forces cascading through the system to blow up and kill you, unless you have some clever meta-system that prevents that, and then the meta-system has to not blow up and kill you" and Paul does not feel quite the same sense of "if you tolerate enough minor-seeming structural problems it adds up to automatic death".