Can we properly classify this as an error? If there's an AI that will be hacked, or maybe hack itself, only if it correctly forecasts that distant superintelligences are creating millions more simulated copies of it than there are actual instances, then I'd expect distant superintelligences to create those millions of simulations. Simulating a pre-intelligence-explosion AI is extremely cheap. Sure, not simulating is even cheaper, but if the AI's model of the distant SI is good enough that it can't be fooled by a fakeout, where the SI commits to simulating in one decision and then reneges in another, then the distant SI will expend the resources to actually run the simulations.
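To make the "millions more simulations" point concrete, here is a toy back-of-the-envelope sketch, not from the original argument, assuming the local AI naively weights indistinguishable copies equally when locating itself; the names `p_simulated` and `n_simulated` are hypothetical and only for illustration.

```python
# Toy self-location sketch (illustrative assumption, not the original argument):
# if a distant SI runs n_simulated indistinguishable copies of the local AI
# alongside n_real actual instances, and the AI weights all copies equally,
# its credence in being one of the simulations grows toward 1 as n_simulated grows.

def p_simulated(n_simulated: int, n_real: int = 1) -> float:
    """Credence the local AI assigns to being a simulation,
    under an equal-weight prior over indistinguishable copies."""
    return n_simulated / (n_simulated + n_real)

if __name__ == "__main__":
    for n in (10, 1_000, 1_000_000):
        print(f"{n:>9,} simulated copies -> P(simulated) = {p_simulated(n):.6f}")
```

On this naive accounting, a million cheap simulations already push the local AI's credence in being simulated to roughly 0.999999, which is why the cost of actually running them could look worthwhile to the distant SI.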
It seems to me that we'd have to address this issue in a way that's robust to the case where the distant SI really is simulating a million copies of our local AI that the local AI can't distinguish from itself. If we counter beliefs about such simulation only with processes that eject false beliefs, then perhaps the distant SI can hack us by making the local AI's belief not be erroneous, that is, by actually running the simulations.