One natural standard: it should be hard to distinguish an adequate model from the system-to-be-modeled, based on input/output behavior alone.
How hard? Ideally we'd have an "equally competent" modeler and distinguisher, and ask the modeler to try to fool the distinguisher. This is a popular approach to generative modeling, and something I've talked about in the context of AI control (as has Jessica).
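To make the game concrete, here's a minimal sketch of the adversarial setup, in the style of a generative adversarial network. This is purely illustrative: it assumes PyTorch, and the `real_system` stand-in, the dimensions, and the architectures are all made up.

```python
import torch
import torch.nn as nn

IN_DIM, OUT_DIM = 16, 16               # hypothetical input/output dimensions
W_TRUE = torch.randn(IN_DIM, OUT_DIM)  # fixed "true" dynamics, unknown to the modeler

def real_system(x):
    """Stand-in for the system-to-be-modeled (e.g. recorded input/output behavior)."""
    return torch.tanh(x @ W_TRUE)

# The modeler tries to reproduce the system's input/output behavior;
# the distinguisher tries to tell real (input, output) pairs from modeled ones.
modeler = nn.Sequential(nn.Linear(IN_DIM, 64), nn.ReLU(), nn.Linear(64, OUT_DIM))
distinguisher = nn.Sequential(nn.Linear(IN_DIM + OUT_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

opt_m = torch.optim.Adam(modeler.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(distinguisher.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    x = torch.randn(128, IN_DIM)
    real_pairs = torch.cat([x, real_system(x)], dim=1)
    fake_pairs = torch.cat([x, modeler(x)], dim=1)

    # Train the distinguisher to label real pairs 1 and modeled pairs 0.
    d_loss = (bce(distinguisher(real_pairs), torch.ones(128, 1)) +
              bce(distinguisher(fake_pairs.detach()), torch.zeros(128, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the modeler to make its pairs look real to the distinguisher.
    m_loss = bce(distinguisher(fake_pairs), torch.ones(128, 1))
    opt_m.zero_grad(); m_loss.backward(); opt_m.step()
```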
This definition runs into many subtleties, but I think it is a natural starting point for a discussion. In particular, we are already way beyond concerns like "the brain is almost certainly a chaotic system and hence we can't hope to produce exactly the same result as a biological brain."
The key property we want from the distinguisher is that it can learn to detect relevant differences between the model and the real system. This seems like it might be the kind of problem that I would classify as "probably easy if the agent is powerful and the difference is really important" and you would classify as "way too hard to count on."
You could also ask the model to output various intermediate results or to simulate requested measurements on the simulated brain, and give this extra information to the distinguisher. (Though I don't think this would really help.)
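If we did hand the distinguisher that extra information, the mechanical change is just widening its input. A sketch under the same made-up setup as above, where the measurement values are placeholders:

```python
import torch
import torch.nn as nn

IN_DIM, OUT_DIM, MEAS_DIM = 16, 16, 8  # MEAS_DIM: number of requested intermediate measurements

# The distinguisher now judges (input, output, measurements) triples
# instead of (input, output) pairs.
distinguisher = nn.Sequential(
    nn.Linear(IN_DIM + OUT_DIM + MEAS_DIM, 64), nn.ReLU(), nn.Linear(64, 1)
)

def distinguisher_example(x, y, measurements):
    """Pack a batch of behavior plus reported measurements into distinguisher input."""
    return torch.cat([x, y, measurements], dim=1)

# Placeholder batch: x is the input, y the (real or simulated) output,
# m the (real or simulated) intermediate measurements.
x, y, m = torch.randn(4, IN_DIM), torch.randn(4, OUT_DIM), torch.randn(4, MEAS_DIM)
score = distinguisher(distinguisher_example(x, y, m))
```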
Comments
Eliezer Yudkowsky
Counting on things before you've found a solution to them isn't very security mindset, but I do consider this a promising approach. Definitely, the generative-adversarial approach in modern neural networks causes me to hope that this is the sort of thing that actually works in practice. So I might not be as pessimistic as you think? I still think that in general one does not go about taking things for granted, but the notion of faithful simulation seems like one that could prove to have a tractable core after hammering on it for a bit, and it also seems very possible that if you're reasonably smart and you can't detect any expected differences in the behavior of neural columns, then the corresponding human simulation is faithful.
My current thoughts on possible failure modes:
Paul Christiano
Methodologically, I am trying to understand which approaches may or may not work and what the key difficulties are, and to anticipate which problems are hard or easy. I wouldn't describe this as "taking things for granted"; I think we are probably miscommunicating.
This is a big problem; I think it's the more real version of "perfect simulation will be out of the question." Note that this is only a concern for some processes (e.g. if the simulation output is a single bit, then you don't have this problem).
(Note that in practice generative adversarial models are extremely finicky to train, at least partly for this reason.)
I think the other big problem is the complementary one: even an equally smart adversary can't reliably distinguish a crappy simulation from a good one (a dumb example: no distinguisher can detect a steganographically encoded message, even though the presence of one implies the simulation was poor).
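To spell out the dumb example, here is a toy sketch (all specifics are made up) of how a simulation could smuggle a message into honest-looking outputs by rejection sampling. If the embedded bits look uniformly random to the distinguisher (say, because they are ciphertext), the outputs are distributed like honest samples, even though the simulation is doing something other than faithfully simulating the system:

```python
import hashlib
import numpy as np

rng = np.random.default_rng(0)

def honest_sample():
    """The output distribution a faithful simulation would produce."""
    return rng.normal()

def covert_bit(x):
    """A covert-channel bit derived from an output (anyone can compute it)."""
    return hashlib.sha256(np.float64(x).tobytes()).digest()[0] & 1

def stego_sample(message_bit):
    """Re-sample honest outputs until the covert bit matches the message bit.
    If the message bits look uniformly random to the distinguisher, the resulting
    outputs are distributed like honest samples, so the embedding is undetectable
    from behavior alone -- yet producing it is not faithful simulation."""
    while True:
        x = honest_sample()
        if covert_bit(x) == message_bit:
            return x

message = [1, 0, 1, 1, 0]                            # in practice: encrypted, so it looks random
outputs = [stego_sample(b) for b in message]
assert [covert_bit(x) for x in outputs] == message   # the intended recipient can decode it
```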