The Plan is backchained from the desired goal:
Jill's team has solved the hard problem of intelligence, meaning they understand how intelligence works and can implement it. They have also solved the problem of value-loading.
Jill is sitting at her computer, ready to run the next, improved iteration of the AGI she and her team have been working on. We assume that this version is different in that it will go FOOM, whereas every previous version did not (whether by design or not). She and her team also believe this AGI will implement something like CEV.
There is currently no ongoing AI arms race, or at the very least Jill strongly believes her team is many months ahead, far enough that she would feel comfortable delaying the launch of this version if her team actually needed more time.
What can go wrong?
Assuming the given scenario, what can go wrong? I think the most likely failure mode is that Jill and her team were wrong in thinking that this AGI will implement their version of CEV. We can assume that the previous versions they ran showed correct behavior. So what could have gone wrong with this new version:
- Mistaken about CEV: it's possible Jill's team was mistaken about how to value-load the AGI.
- Previous versions lied: earlier versions of their AI were already "advanced" enough to lie on the tests.
- Value drift: the AGI was modifying its value function, and this latest iteration modified it in a bad way. If Jill's team had proved that this couldn't happen, it's now clear that the proof was mistaken.
- A code bug: a bug was introduced after the previous version was run, and the team had no correct unit test or proof checker to catch it (a minimal sketch of such a check follows this list).
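To make the last two failure modes slightly more concrete, here is a minimal sketch, in Python, of the kind of pre-launch regression check the last bullet gestures at: score a fixed suite of probe scenarios with both the previous, trusted build and the new candidate, and refuse to launch if the two value functions diverge. Every name here (`Scenario`, `detect_value_drift`, the probe suite) is a hypothetical illustration, not anything Jill's team is assumed to actually have.

```python
# Hypothetical sketch of a pre-launch regression check on the value function.
# All names are illustrative assumptions, not a real framework.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Scenario:
    """A canonical situation the value function is expected to score."""
    name: str
    state: Dict  # assumed encoding of a world state


ValueFn = Callable[[Dict], float]


def detect_value_drift(
    old_value: ValueFn,
    new_value: ValueFn,
    probes: List[Scenario],
    tolerance: float = 1e-6,
) -> List[str]:
    """Return the names of probe scenarios on which the two builds disagree."""
    drifted = []
    for probe in probes:
        if abs(old_value(probe.state) - new_value(probe.state)) > tolerance:
            drifted.append(probe.name)
    return drifted


# Usage: block the launch if any probe drifted.
#   drifted = detect_value_drift(v_previous, v_candidate, probe_suite)
#   assert not drifted, f"value drift detected on: {drifted}"
```

Note that this is exactly the kind of black-box check the "previous versions lied" failure mode would defeat: it only compares outputs on scenarios the team thought to probe, which is why the bullets above also mention proofs and proof checkers rather than tests alone.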
Arguments for a different goal
There are some arguments for why the goal stated above shouldn't be the one we aim for…
There are also arguments against some of the assumptions stated in the goal…
ML variation
It's possible that the first AGI to go FOOM will employ some variation of current ML algorithms. In that case, the scenario and the list of things that can go wrong look quite different…