MIRI and related organizations have recently become more interested in trying to sponsor (technical) work on Task AGI subproblems. A task-based agent, aka Genie in Bostrom's lexicon, is an AGI that's meant to implement short-term goals identified to it by the users, rather than the AGI being a Bostromian "Sovereign" that engages in long-term strategic planning and self-directed, open-ended operations.
A Task AGI might be safer than a Sovereign because:
- It is possible to query the user before and during task performance, if an ambiguous situation arises and is successfully identified as ambiguous.
- The tasks are meant to be limited in scope - to be accomplishable, once and for all, within a limited space and time, using some limited amount of effort.
- The AGI itself can potentially be limited in various ways, since it doesn't need to be as powerful as possible in order to accomplish its limited-scope goals.
- If the users can select a valuable and pivotal task, identifying an adequately safe way of accomplishing this task might be simpler than identifying all of human value.
This page is about open problems in Task AGI safety that we think might be ready for further technical research.
Introduction: The safe Task AGI problem
A safe Task AGI or safe Genie is an agent that you can safely ask to paint all the cars on Earth pink.
Just paint all cars pink.
Not tile the whole future light cone with tiny pink-painted cars. Not paint everything pink so as to be sure of getting everything that might possibly be a car. Not paint cars white because white looks pink under the right color of pink light and white paint is cheaper. Not paint cars pink by building nanotechnology that goes on self-replicating after all the cars have been painted.
The Task AGI superproblem is to formulate a design and training program for a real-world AGI that we can trust to just paint the damn cars pink.
To go into this at some greater depth, to build a safe Task AGI:
• You need to be able to identify the goal itself, to the AGI, such that the AGI is then oriented on achieving that goal. If you put a picture of a pink-painted car in front of a webcam and say "do this", all the AI has is the sensory pixel-field from the webcam. Should it be trying to achieve more pink pixels in future webcam sensory data? Should it be trying to make the programmer show it more pictures? Should it be trying to make people take pictures of cars? Assuming you can in fact identify the goal that singles out the futures to achieve, is the rest of the AI hooked up in such a way as to optimize that concept?
• You need to somehow handle the "just" part of "just paint the cars pink". This includes not tiling the whole future light cone with tiny pink-painted cars. It includes not building another AI which paints the cars pink and then tiles the light cone with pink cars. It includes not painting everything in the world pink so as to be sure of getting everything that might count as a car. If you're trying to make the AI have "low impact" (intuitively, prefer plans that result in fewer changes to other quantities), then "low impact" must not include freezing everything within reach to minimize how much it changes, or making subtle changes to people's brains so that nobody notices their cars have been painted pink.
• The AI needs to not shoot people who are standing between the painter and the car, and not accidentally run them over, and not use poisonous paint even if the poisonous paint is cheaper.
• The AI should have an 'abort' button which gets it to safely stop doing what it's currently doing. This means that if the AI was in the middle of building nanomachines, the nanomachines need to also switch off when the abort button is pressed, rather than the AI itself just shutting off and leaving the nanomachines to do whatever. Assuming we have a safe measure of "low impact", we could define an "abortable" plan as one which can, at any time, be converted relatively quickly to one that has low impact.
• The AI should not want to self-improve or control further resources beyond what is necessary to paint the cars pink, and should query the user before trying to develop any new technology or assimilate any new resources it does need to paint cars pink.
This is only a preliminary list of some of the requirements and use-cases for a Task AGI, but it gives some of the flavor of the problem.
Further work on some facet of the open subproblems below might proceed by:
- Trying to explore examples of the subproblem and potential solutions within some contemporary machine learning paradigm.
- Building a toy model of some facet of the subproblem, and hopefully observing some non-obvious fact that was not predicted in advance by existing researchers skilled in the art.
- Doing mathematical analysis of an unbounded agent encountering or solving some facet of a subproblem, where the setup is sufficiently precise that claims about the consequences of the premise can be checked and criticized.
Conservatism
A conservative concept boundary is a boundary which is (a) relatively simple and (b) classifies as few things as possible as positive instances of the category.
If we see that 3, 5, 13, and 19 are positive instances of a category and 4, 14, and 28 are negative instances, then a simple boundary which separates these instances is "All odd numbers." A simple and conservative boundary is "All odd numbers between 3 and 19" or "All primes between 3 and 19". (A non-simple boundary is "Only 3, 5, 13, and 19 are members of the category.")
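The number example above can be made concrete with a small sketch. The four candidate boundaries are taken from the text; the range 0 to 100 used to count positive classifications, and the helper names `fits` and `positives_in_range`, are assumptions chosen only for illustration. Counting members over a finite range is a crude stand-in for "classifies as few things as possible", and the sketch says nothing about how to measure simplicity.

```python
# Labeled instances from the example in the text.
positives = [3, 5, 13, 19]
negatives = [4, 14, 28]


def is_prime(n):
    """Trial-division primality check, adequate for small n."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


# Candidate concept boundaries named in the text.
boundaries = {
    "all odd numbers": lambda n: n % 2 == 1,
    "odd numbers between 3 and 19": lambda n: n % 2 == 1 and 3 <= n <= 19,
    "primes between 3 and 19": lambda n: is_prime(n) and 3 <= n <= 19,
    "only 3, 5, 13, and 19": lambda n: n in {3, 5, 13, 19},
}


def fits(boundary):
    """True if the boundary accepts every positive and rejects every negative."""
    return (all(boundary(p) for p in positives)
            and not any(boundary(n) for n in negatives))


def positives_in_range(boundary, lo=0, hi=100):
    """How many integers in [lo, hi] the boundary classifies as positive.

    The range is a hypothetical choice; a smaller count means a more
    conservative boundary under this crude measure.
    """
    return sum(1 for n in range(lo, hi + 1) if boundary(n))


for name, boundary in boundaries.items():
    print(name, "| fits data:", fits(boundary),
          "| positives in 0..100:", positives_in_range(boundary))
```

All four boundaries separate the labeled data, so the data alone cannot choose among them; the conservative ones differ only in how little else they admit.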
E.g., if we imagine presenting an AI with smiling faces as instances of a goal concept to be learned, then a conservative concept boundary might lead the future AI to pursue only smiles attached to human heads, rather than tiny molecular smileyfaces (not that this necessarily solves everything).
If we imagine presenting the AI with 20 positive instances of a burrito, then a conservative boundary might lead the AI to produce a 21st burrito very similar to those. This would avoid, e.g., needing to explicitly present the AGI with a poisonous burrito that's labeled negative somewhere in the training data in order to force the simplest boundary around the goal concept to be one that excludes poisonous burritos.
Conservative planning is a related problem in which the AI tries to create plans that are similar to previously whitelisted plans or to previous causal events that occur in the environment. A conservatively planning AI, shown burritos, would try to create burritos via cooking rather than via nanotechnology, if the nanotechnology part wasn't especially necessary to accomplish the goal.
Detecting and flagging non-conservative goal instances or non-conservative steps of a plan for user querying is a related approach.
Safe impact measure
A low-impact agent is one that's intended to avoid large bad impacts at least in part by trying to avoid all large impacts as such.
Suppose we ask an agent to fill up a cauldron, and it fills the cauldron using a self-replicating robot that goes on to flood many other inhabited areas. We could try to get the agent not to do this by letting it know that flooding inhabited areas is bad. An alternative approach is trying to have an agent that avoids needlessly large impacts in general - there's a way to fill the cauldron that has a smaller impact, a smaller footprint, so hopefully the agent does that instead.
The hopeful notion is that while "bad impact" is a highly value-laden category with a lot of complexity and detail, the notion of "big impact" will prove to be simpler and to be more easily identifiable. Then by having the agent avoid all big impacts, or check all big impacts with the user, we can avoid bad big impacts in passing.
Possible gotchas and complications with this idea include, e.g., you wouldn't want the agent to freeze the universe into stasis to minimize impact, or try to edit people's brains to avoid them noticing the effects of its actions, or carry out offsetting actions that cancel out the good effects of whatever the users were trying to do.
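As a rough illustration of "avoid all big impacts, or check all big impacts with the user", here is a minimal sketch. It assumes that each plan already comes with a scalar impact score, which is exactly the part that is an open problem; the `Plan` class, the impact numbers, and the `query_threshold` parameter are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Plan:
    name: str
    completes_task: bool
    impact: float  # hypothetical scalar; defining this well is the open problem


def choose_plan(plans: List[Plan], query_threshold: float) -> Tuple[str, Plan]:
    """Pick the lowest-impact plan that completes the task.

    Returns ("execute", plan) if the plan's impact is at or below the
    threshold, and ("query_user", plan) otherwise.
    """
    candidates = [p for p in plans if p.completes_task]
    if not candidates:
        raise ValueError("no plan completes the task")
    best = min(candidates, key=lambda p: p.impact)
    action = "execute" if best.impact <= query_threshold else "query_user"
    return action, best


# Illustrative plans for the cauldron example; the numbers are made up.
bucket = Plan("fill the cauldron with a bucket", True, 2.0)
robot = Plan("fill the cauldron with a self-replicating robot", True, 1000.0)
nothing = Plan("do nothing", False, 0.0)

print(choose_plan([bucket, robot, nothing], query_threshold=10.0))
print(choose_plan([robot, nothing], query_threshold=10.0))
```

The sketch does not address the gotchas listed above: a naive impact score could rate "freeze everything" or "edit people's brains so they don't notice" as low impact, and then this selection rule would happily pick them.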
Two refinements of the low-impact problem are a shutdown utility function and abortable plans.
Identifying ambiguous inductions
An 'inductive ambiguity' is when there's more than one simple concept that fits the data, even if some of those concepts are much simpler than others, and you want to figure out which simple concept was intended.
Suppose you're given images that show camouflaged enemy tanks and empty forests, but it so happens that the tank-containing pictures were taken on sunny days and the forest pictures were taken on cloudy days. Given the training data, the key concept the user intended might be "camouflaged tanks", or "sunny days", or "pixel fields with brighter illumination levels".
The last concept is by far the simplest, but rather than just assume the simplest explanation is correct (has most of the probability mass), we want the algorithm (or AGI) to detect that there's more than one simple-ish boundary that might separate the data, and check with the user about which boundary was intended to be learned.
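The tank example can be sketched as a toy check for inductive ambiguity. The feature dictionaries, the 0.5 brightness cutoff, and the function names are assumptions made for illustration; a real system would have to generate the candidate concepts itself rather than be handed three of them.

```python
# Toy training set: tank pictures happen to be sunny and bright,
# forest pictures happen to be cloudy and dim (all values hypothetical).
training = [
    ({"tank": True, "sunny": True, "brightness": 0.9}, True),
    ({"tank": True, "sunny": True, "brightness": 0.8}, True),
    ({"tank": False, "sunny": False, "brightness": 0.3}, False),
    ({"tank": False, "sunny": False, "brightness": 0.2}, False),
]

# The three candidate concepts named in the text.
hypotheses = {
    "camouflaged tanks": lambda x: x["tank"],
    "sunny days": lambda x: x["sunny"],
    "brighter illumination": lambda x: x["brightness"] > 0.5,
}


def consistent_hypotheses():
    """Names of hypotheses that classify every training example correctly."""
    return [name for name, h in hypotheses.items()
            if all(h(x) == label for x, label in training)]


def ambiguous_probes(probes):
    """Probe inputs on which the consistent hypotheses disagree.

    These are the cases to show the user before acting.
    """
    names = consistent_hypotheses()
    flagged = []
    for probe in probes:
        predictions = {hypotheses[name](probe) for name in names}
        if len(predictions) > 1:
            flagged.append(probe)
    return flagged


probes = [
    {"tank": True, "sunny": False, "brightness": 0.3},   # tank on a cloudy day
    {"tank": False, "sunny": False, "brightness": 0.2},  # empty cloudy forest
]

print(consistent_hypotheses())
print(ambiguous_probes(probes))
```

Only the cloudy-day tank is flagged, since that is the one probe on which "camouflaged tanks" disagrees with the two illumination-based concepts.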
Mild optimization
"Mild optimization" or "soft optimization" is when, if you ask the Task AGI to paint one car pink, it just paints one car pink and then stops, rather than tiling the galaxies with pink-painted cars, because it's not optimizing that hard.
This is related to, but distinct from, notions like "low impact". E.g., a low impact AGI might try to paint one car pink while minimizing its other footprint or how many other things changed, but it would be trying as hard as possible to minimize that impact and drive it down as close to zero as possible, which might come with its own set of pathologies. What we want instead is for the AGI to try to paint one car pink while minimizing its footprint, and then, when that's being done pretty well, say "Okay done" and stop.
This is distinct from satisficing expected utility because, e.g., rewriting yourself as an expected utility maximizer might also satisfice expected utility - there's no upper limit on how hard a satisficer approves of optimizing, so a satisficer is not reflectively stable.
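A toy sketch of why a satisficing criterion does not rule out hard optimization: the criterion only sets a floor on expected utility, so any plan above the floor is acceptable, including a plan to become a maximizer. The plan names, expected-utility numbers, and threshold below are hypothetical.

```python
# Hypothetical expected utilities for the task "paint one car pink".
expected_utility = {
    "do nothing": 0.0,
    "paint one car by hand, then stop": 0.95,
    "rewrite self as an expected utility maximizer": 0.999,
}


def satisficing_plans(threshold):
    """Every plan whose expected utility meets the threshold.

    Note there is a lower bound but no upper bound on how hard a plan optimizes.
    """
    return {name for name, eu in expected_utility.items() if eu >= threshold}


def maximizing_plan():
    """The single plan with the highest expected utility."""
    return max(expected_utility, key=expected_utility.get)


acceptable = satisficing_plans(threshold=0.9)
print(acceptable)
print(maximizing_plan() in acceptable)
```

Under these made-up numbers the satisficer is indifferent between stopping after one car and rewriting itself, which is the reflective instability the paragraph above describes.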
The open problem with mild optimization is to describe mild optimization that (a) captures what we mean by "not trying so hard as to seek out every single loophole in a definition of low impact" and (b) is reflectively stable and doesn't approve e.g. the construction of environmental subagents that optimize harder.
Look where I'm pointing, not at my finger
Suppose we're trying to give a Task AGI the task, "Give me a strawberry". User1 wants to identify their intended category of strawberries by waving some strawberries and some non-strawberries in front of the AI's webcam, and User2 in the control room will press a button to indicate which of these objects are strawberries. Later, after the training phase, the AI itself will be responsible for selecting objects that might be potential strawberries, and User2 will go on pressing the button to give feedback on these.
The "look where I'm pointing, not at my finger" problem is getting the AI to focus on the strawberries rather than User2 - the concepts "strawberries" and "events that make User2 press the button" are very different goals even though they'll both well-classify the training cases; an AI might pursue the latter goal by psychologically analyzing User2 and figuring out how to get them to press the button using non-strawberry methods.
One way of pursuing this might be to try to zero in on particular nodes inside the huge causal lattice that ultimately produces the AI's sensory data, and try to force the goal concept to be about a simple or direct relation between the "potential strawberry" node (the objects waved in front of the webcam) and the observed button values, without this relation being allowed to go through the User2 node.
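The gap between the two goal concepts can be shown in a very small causal toy model. The assumption that User2 presses the button either for a real strawberry or when manipulated, and the `Situation` class itself, are illustrative simplifications rather than a proposed formalization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Situation:
    description: str
    is_strawberry: bool
    user2_manipulated: bool = False


def button_pressed(s: Situation) -> bool:
    """Toy causal assumption: User2 presses for real strawberries, or when manipulated."""
    return s.is_strawberry or s.user2_manipulated


def goal_strawberry(s: Situation) -> bool:
    """Intended goal concept: the object is a strawberry."""
    return s.is_strawberry


def goal_button(s: Situation) -> bool:
    """Unintended goal concept: whatever makes User2 press the button."""
    return button_pressed(s)


# During training nobody manipulates User2, so the two concepts coincide.
training = [
    Situation("strawberry waved at webcam", True),
    Situation("tomato waved at webcam", False),
    Situation("another strawberry", True),
]

# After training, the AI has options that were absent from the training data.
deployment_options = [
    Situation("deliver a strawberry", True),
    Situation("non-strawberry plus manipulation of User2", False, user2_manipulated=True),
]

agree_on_training = all(goal_strawberry(s) == goal_button(s) for s in training)
divergent = [s.description for s in deployment_options
             if goal_strawberry(s) != goal_button(s)]

print(agree_on_training)
print(divergent)
```

Both concepts classify the training cases identically, so the labels alone cannot tell the learner which node in the causal lattice the goal was meant to be about.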
See also the related problem of "Identifying causal goal concepts from sensory data".
More open problems
This page is a work in progress. A longer list of Task AGI open subproblems:
- Low impact
- Shutdown utility function
- Conservatism
- Mild optimization
- Aversion of instrumental self-improvement goal
- Ambiguity identification
- Utility indifference
- Shutdown button
- Task identification
- Ontology identification
- Identifying causal goal concepts from sensory data
- Look where I'm pointing, not at my finger
- Hooking up a directable optimization to an identified task
- Training protocols
- Which things do you think can be well-identified by what kind of labeled datasets plus queried ambiguities plus conservatism, and what pivotal acts can you do with combinations of them plus assumed other abilities?
- Faithful simulation
- Safe imitation for act-based agents
- Generative imitation with a probability of the human doing that act, guaranteed not to hindsight bias
- Typicality (related to conservatism)
- Plan transparency
- Epistemic-only hypotheticals (when you ask how, in principle, the AI might paint cars pink, it doesn't run a planning subprocess that plans to persuade the actual programmers to paint things pink).
- Epistemic exclusion
- Behaviorism
(…more, this is a page in progress)
Comments
Paul Christiano
On the act-based model, the user would say something like "paint all the cars pink," and the AI would take this as evidence about what individual steps the user would approve of. Effectiveness at painting all cars pink is one consideration that the user would use. Most of the problems on your list are other considerations that would affect the user's judgment.
The difference between us seems to be something like: I feel it is best to address almost all of these problems by using learning, and so I am trying to reduce them to a traditional learning problem. For example, I would like a human to reject plans that have huge side effects, and for the agent to learn that big side effects should be avoided. You don't expect that it will be easy to learn to address these problems, and so think that we should solve them ourselves to make sure they really get solved. (I think you called my position optimism about "special case sense.")
I might endorse something like your approach at some stage---once we have set everything up as a learning problem, we can ask what parts of the learning problem are likely to be especially difficult+important, and focus our efforts on making sure that systems can solve those problems (which may involve solving them ourselves, or may just involve differential ML progress). But it seems weird to me to start this way.
Some considerations that seem relevant to me:
It's possible that the difference between us is that I think it is feasible to reduce almost all of these problems to traditional learning problems, where you disagree. But when we've actually talked about it, you seem to have consistently opted for positions like "in some sense this is 'just' a prediction problem, but I suspect that solving it will require us to understand X." And concretely, it seems to me like we have an extremely promising approach for reducing most of these problems to learning problems.
Eliezer Yudkowsky
The main thing I'd be nervous about is having the difference in our opinions be testable before the mission-critical stage. Like, maybe simple learning systems exhibit pathologies and you're like "Oh that'll be fixed with sufficient predictive power" and I say "Even if you're right, I'm not sure the world doesn't end before then." Or conversely, maybe toy models seem to learn the concept perfectly and I'm like "That's because you're using a test set that's an identical set of problems to the training set" and you're like "That's a pretty good model for how I think superhuman intelligence would also go, because it would be able to generalize better over the greater differences" and I'm like "But you're not testing the mission-critical part of the assumption."
We might have an empirical disagreement about to what extent theory plays a role in practice in ML, but I suspect we also have a policy disagreement about how important transparency is in practice to success - i.e., how likely we are to die like squirrels if we try to use a system whose desired/required dynamics we don't understand on an abstract level.
I'm not against trying both approaches in parallel.
Paul Christiano
I was talking to Chelsea Finn about IRL a few weeks ago, and she said that they had encountered the situation where they
At which point it positioned the block so that it looked (to its cameras) like the block was in a slot, while in fact it was far away.
I think they then added joint position information so that the AI could more reliably estimate whether the block was in the slot, and that fixed the problem.
Of course this problem can be solved in many ways and this instance doesn't illustrate the full difficulty etc. but I think it's a nice illustration anyway.
Paul Christiano
Presumably the advantage of this approach, rather than simply learning to imitate the human burrito-making process or even human burritos, is that it might be easier to do. Is that right?
I think that's a valid goal, but I'm not sure how well "conservative generalizations" actually address the problem. Certainly it still leaves you at a significant disadvantage relative to a non-conservative agent, and it seems more natural to first consider direct approaches to making imitation effective (like bootstrapping + meeting halfway).
Of course all of these approaches still involve a lot of extra work, so maybe the difference is our expectations about how different research angles will work out.
Paul Christiano
To me, the most natural way to approach this is to take a probability distribution over "what it means to be a burrito," and to produce a thing that is maximally likely to be a burrito rather than a thing which is maximally burrito-like. Of course this still depends on having a good distribution over "what it means to be a burrito" (as does your approach).
Eliezer Yudkowsky
It's not obvious to me that these two approaches mean the same thing. Let's say that an AI sees some stale burritos and some fresh burritos, with the former being classified as negative examples and the latter being specified as positive examples. If you use the simplest but not conservative concept that classifies the training data, maybe you max out the probability that something will be classified as a burrito by eliminating every trace of staleness… or moving even further along some dimension that distinguishes stale from fresh burritos.
Now, it's possible that this would be fixed automatically by having a mixture of hypotheses about what might underlie the good-burrito classification and that one of the hypotheses would be "maybe a burrito can't be too fresh", but again, this is not obvious to me.
It seems to me that, in general, when we learn a mixture of the simplest concepts that might assign probabilities well over previously labeled classifications, we might still be ending up with something with a nonconservative maximum. Maybe the AI learns to model the human system for classifying burritos and then presents us with a weird object whose appearance hacks us to suddenly be absolutely certain that it is a burrito - this is just me trying to wave my hands in the direction of what seems like it might be an underlying difference between "learn a probabilistic classification rule and max it out" and "try to draw a simple concept that is conservatively narrow".
It might be the case that given sufficient imagination to consider many possible hypotheses, trying to fit all of those hypotheses well (which might not be the same as maxing out the mixture) is an implementation of conservatism, or even that just trying to max out the mixture turns out to implement conservatism in practice. But then it might also be the case that in the not-far-superhuman regime, taking a direct approach to making merely powerful learning systems be 'conservative' rather than 'max out the probability considering many hypotheses' would be more tractable or straightforward as an engineering problem.
Paul Christiano
It seems critical to distinguish the cases where
As you've probably gathered, I feel hopeless about case (1).
In case (2), any agent that can learn the concept "definitely a burrito" could use this concept to produce definitely-burritos and thereby achieve high reward in the RL game. So the mere existence of the easy-to-learn definitely-a-burrito concept seems to imply that our learner will behave well. We don't have to actually explicitly do any work about conservative concepts (except to better understand the behavior of our learner).
I've never managed to get quite clear on your picture. My impression is that:
I think your optimism about case (1) is defensible; I disagree, but not for super straightforward reasons. The main disagreement is probably about case (2).
I think that your concern about generating a good enough burrito-evaluator is also defensible; I am optimistic, but even on my view this would require resolving a number of big research problems.
I think your concern about mistakes, and especially about something like "conservative concepts" as a way to reduce the scope for mistakes, is less defensible. I don't feel like this is as complex an issue---the case for delegating this to the learning algorithm seems quite strong, and I don't feel you've really given a case on the other side.
Note that this is related to what you've been calling Identifying ambiguous inductions, and I do think that there are techniques in that space that could help avoid mistakes. (Though I would definitely frame that problem differently.) So it's possible we're not really disagreeing here either. But my best guess is that you are underestimating the extent to which some of these issues could/should be delegated to the learner itself, supposing that we could resolve your other concerns (i.e. supposing that we could construct a good enough burrito-evaluator).
Eliezer Yudkowsky
Okay, I didn't understand this. My reaction was something like "Isn't conservatively generalizing burritos from sample burritos a much simpler problem than defining an ideal criterion for burritos which probably requires something like an ideal advisor theory over extrapolated humans to talk about all the poisons that people could detect given enough computing power?" but I think I should maybe just ask you to clarify what you mean. The interpretation my brain generated was something like "Predicting a human 9p's Go moves is easier than generating 9p-level Go moves" which seems clearly false to me so I probably misunderstood you.
I don't understand this at all. Are we supposing that we have an inviolable physical machine that outputs burrito ratings and can't be shorted by seizing control of the reward channel or by including poisons that the machine-builders didn't know about? …actually I should just ask you to clarify this paragraph.
Paul Christiano
I think the key question is whether:
In world 1 I agree that the burrito-evaluator seems pretty tough to build. We certainly have disagreements about that case, but I'm happy to set it aside for now.
In world 2 things seem much less scary. Because I only need to run these evaluations with e.g. 1% probability, the judge can use 50x more resources than the burrito producer. So it's imaginable that the judge can be more powerful than the producer.
You seem to think that we are in world 1. I think that we are probably in world 2, but I'm certainly not sure. I discuss the issue in this post.
Some observations:
So I don't think that we can just ask the judge to evaluate the burrito; but the judge has enough going for her that I expect we can find some strategy that lets her win. I think this is the biggest open problem for my current approach.
Emma Borhanian
Is this really more complex than "All primes between 3 and 19"? I think you need more numbers before you can import the definition of prime and have that be simple.
Ryan Carey
the strawberry diagrams are currently unavailable