I’ve recently discussed three kinds of learning systems:

Approval-directed agents which take the action the user would most approve of.
Imitation learners which take the action that the user would tell them to take.
Narrow value learners which take the actions that the user would prefer they take.

These proposals all focus on the short-term instrumental preferences of their users. From the perspective of AI control I think this is the interesting aspect that deserves more attention.

Going forward I’ll call this kind of approach “act-based” unless I hear something better (credit to Eliezer), and I’ll call agents of this type “act-based agents.”

Robustness

Act-based agents seem to be robust to certain kinds of errors. You need only the vaguest understanding of humans to guess that killing the user is: (1) not something they would approve of, (2) not something they would do, (3) not in line with their instrumental preferences.

So in order to get bad outcomes here you have to really mess up your model of what humans want (or more likely mess up the underlying framework in an important way). If we imagine a landscape of possible interpretations of human preferences, there is a “right” interpretation that we are shooting for. But if you start with a wrong answer that is anywhere in the neighborhood, you will do things like “ask the user what to do, and don’t manipulate them.” And these behaviors will eventually get you where you want to go.

That is to say, the “right” behavior is surrounded by a massive crater of “good enough” behaviors, and in the long-term they all converge to the same place. We just need to land in the crater.

Human enhancement

All of these approaches have a common fundamental drawback: they only have as much foresight as the user. In some sense this is why they are robust.

In order for these systems to behave wisely, the user has to actually be wise. Roughly, the users need to be intellectual peers of the AI systems they are using.

This may sound quite demanding. But after making a few observations, I think it may be a realistic goal:

The user can draw upon every technology at their disposal — including other act-based agents. (This is discussed more precisely here under the heading of “efficacy.”)
The user doesn’t need to be quite as smart as the AI systems they are using; they merely need to be within striking distance. For example, it seems fine if it takes a human a few days to make a decision, or to understand and evaluate a decision, that an AI can make in a few seconds.
The user can delegate this responsibility to other humans whom they are willing to trust (e.g. Google engineers), just like they do today.

In this story the capabilities of humans grow in parallel with the capabilities of AI systems, driven by close interaction between the two. AI systems do not pursue explicitly defined goals, but instead help the humans do whatever the humans want to do at any given time. The entire process remains necessarily comprehensible to humans — if humans can’t understand how an action helps them achieve their goals, then that action doesn’t get taken.

In speculations about the long-term future of AI, I think this may be the most common positive vision. But I don’t think there has been much serious thinking about what this situation actually looks like, and certainly not much thinking about how to actually realize such a vision.

Note that the involvement of actual humans is not intended as a very long-term solution. It’s a solution built to last (at most) until all contemporary thinking about AI has been thoroughly obsoleted, that is, until the capability of society is perhaps ten or a hundred times greater than it is today. I don’t think there is a strong case for thinking much further ahead than that.

What is “narrow” anyway?

There is clearly a difference between act-based agents and traditional rational agents. But it’s not entirely clear what the key difference is.

Consider a machine choosing a move in a game of chess. I could articulate preferences over that move (castling looks best to me), over its consequences (I don’t want to lose the bishop), over the outcome of the game (I want to win), over immediate consequences of that outcome (I want people to respect my research team), over distant consequences (I want to live a fulfilling life).

We could also go the other direction and get even narrower: rather than thinking about preferences over moves we can think about preferences over particular steps of the cognitive strategy that produces moves.

As I advance from “narrow” to “broad” preferences, many things are changing. It’s not really clear what the important differences are, what exactly we mean by “narrow” preferences, at what scales outcomes are robust to errors, at what scales learning is feasible, and so on. I would like to understand the picture better.

The upshot

Thinking about act-based agents suggests a different (and in my view more optimistic) picture of AI control. There are a number of research problems that are common across act-based approaches, especially related to keeping humans up to speed, and I think that for the moment these are the most promising directions for work on AI control.

Comments

Paul Christiano

Act-based agents seem to be robust to certain kinds of errors. You need only the vaguest understanding of humans to guess that killing the user is: (1) not something they would approve of, (2) not something they would do, (3) not in line with their instrumental preferences.

Eliezer objects to this post's optimism about robustness.

Concretely, the complaint seems to be that a human-predictor would form generalizations like "the human takes and approves actions that maximize expected utility" for some notion of "utility" and some notion of counterfactuals etc. It might then end up killing the users (or making some other irreversibly bad decision) because the bad action is the utility-maximizing thing to do according to the learned values/decision theory/priors/etc. (which aren't identical to humans' values/decision theory/priors/etc.).

I'm not impressed by this objection.

Clearly this would be an objectively bad prediction of the human. And so the question is entirely about how hard it is to notice that it's a bad, or at least uncertain, prediction. That is, to a human it appears to be a comically bad prediction. So the question is: to what extent is this just because we are humans predicting humans?

This class of errors has literally been talked about by humans in advance, as has the general observation that humans won't endorse irreversible and potentially catastrophic actions without checking in with humans first. It will probably be talked about in much more detail at the time. So noticing this is an error only requires something like an understanding of how humans' actions relate to their words, which is significantly easier than building a model of a human as an approximately rational goal-directed agent (since that approximate model would also need to explain human utterances). That is, you just need to be able to infer from a human saying "I think X would be a catastrophic mistake" that a human won't do X.
It seems like this error is only possible for an agent that is unable to predict anything like "how a human would talk about their decision," or "how other people would respond to a decision," or so on. Are you imagining a system that can't predict any of these properties, but can just make OK predictions about actions? Or are you imagining a system that fills in the details of the "kill all humans" action with the human patiently explaining how the action is good because we are probably living in a simulation controlled by an adversarial superintelligence who will torture us if we don't take it, yet isn't able to distinguish this explanation from the explanations that are actually given in the real world for real actions?
You seem to be describing the situation as though expected utility maximization with an aggregate utility function is an OK description of human behavior but for some issues like Pascal's mugging that only appear in future edge cases. This view seems surprising for a few reasons. First, how does it account for human philosophical deliberation, and the actual discussions that humans engage in when faced with cases superficially resembling these pathological edge cases? I don't see how any plausible human model is going to throw out the human deliberative model in favor of some simple general theory. Second, expected utility maximization basically can't reproduce even a single human decision. Taken literally, these philosophical frameworks are mostly predictively useless; it's not as though this is a basically right framework that has a few weird edge cases. A muggable value system doesn't just behave badly in weird corner cases; it behaves badly literally all of the time (except perhaps when implementing convergent instrumental values).
It doesn't seem necessary for a learner to generalize correctly to some far-out case on the first shot; it only seems necessary for it to know that this is a case where it is uncertain (e.g. because it entertains several conflicting hypotheses, or because there are several general regularities that come into conflict in this case).
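To make the last point concrete, here is a minimal sketch of "know when you are uncertain." It is only an illustration under assumptions of mine, not a description of any actual system: the names `decide`, `max_spread`, and `min_approval`, the idea of representing each hypothesis as a function returning a probability of endorsement, and the specific thresholds are all hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical: each hypothesis maps an action description to the predicted
# probability that the human would endorse taking that action.
Hypothesis = Callable[[str], float]


def decide(
    action: str,
    hypotheses: Sequence[Hypothesis],
    max_spread: float = 0.2,    # assumed tolerance for disagreement
    min_approval: float = 0.5,  # assumed bar for acting
) -> str:
    """Return "act", "refrain", or "ask" for a proposed action."""
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    predictions = [h(action) for h in hypotheses]
    # Conflicting hypotheses are treated as a signal of uncertainty:
    # the agent checks in with the human instead of picking a side.
    if max(predictions) - min(predictions) > max_spread:
        return "ask"
    mean = sum(predictions) / len(predictions)
    return "act" if mean >= min_approval else "refrain"


# Toy hypotheses: one models the human as a utility maximizer, the other
# models how the human's stated views relate to their actions.
def utility_model(action: str) -> float:
    return {"make tea": 0.9, "drastic irreversible plan": 0.95}.get(action, 0.1)


def words_to_actions_model(action: str) -> float:
    return {"make tea": 0.9, "drastic irreversible plan": 0.01}.get(action, 0.1)


models = [utility_model, words_to_actions_model]
```

With these toy models, `decide("drastic irreversible plan", models)` falls into the "ask" branch because the two hypotheses disagree sharply, while an everyday action that both models endorse is simply taken.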

I don't think these points totally capture my position, but hopefully they help explain where I am coming from. I still feel pretty good about the argument in the "robustness" section of this post. It really does seem like it is pretty easy to predict that the human won't generally endorse actions that leave them dead.

Ryan Carey

That is to say, the “right” behavior is surrounded by a massive crater of “good enough” behaviors, and in the long-term they all converge to the same place. We just need to land in the crater.

This does seem true if you're talking about acts in a human distribution, i.e. if you've smeared actions out over a space such that a uniform probability density over that space is roughly the distribution of human actions. Then, actions near the "good enough" behaviors might also be good.

If you're optimizing, and not sampling from a known distribution over human actions (i.e. quantilizing or similar), then it looks like you'll still get problems with unforeseen maxima, edge instantiation and the like, problems that could easily end up with catastrophic outcomes.
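The contrast between optimizing and quantilizing can be sketched roughly as follows. This is a simplified illustration rather than a definitive implementation: it assumes the human action distribution is available as a finite list of samples, and the function names, the `q` parameter, and the toy utility are hypothetical choices of mine.

```python
import random
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def argmax_action(candidates: Sequence[A], utility: Callable[[A], float]) -> A:
    """Plain optimization: pick the single highest-utility candidate."""
    return max(candidates, key=utility)


def quantilize(
    human_samples: Sequence[A],
    utility: Callable[[A], float],
    q: float,
    rng: random.Random,
) -> A:
    """Sample uniformly from the top q fraction of human-like actions.

    human_samples is assumed to be drawn from the distribution of actions
    a human might take, so the result never leaves that distribution.
    """
    if not human_samples:
        raise ValueError("need at least one sample")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(human_samples, key=utility, reverse=True)
    k = max(1, int(q * len(ranked)))  # size of the top-q slice, at least 1
    return rng.choice(ranked[:k])


# Toy setup: actions are numbers and the (possibly mistaken) utility is the
# number itself. 10**6 stands in for an extreme action no human would take.
human_like = list(range(100))
extreme = 10**6
rng = random.Random(0)

optimized = argmax_action(human_like + [extreme], float)
quantilized = quantilize(human_like, float, 0.1, rng)
```

In this toy setup the optimizer selects the extreme action as soon as it appears among the candidates, whereas the quantilizer can only return one of the better human-like actions, which is the sense in which it avoids unforeseen maxima.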