Autonomous AGI

https://arbital.com/p/Sovereign

by Eliezer Yudkowsky Dec 28 2015 updated Jun 6 2016

The hardest possible class of Friendly AI to build, with the least moral hazard; an AI intended to neither require nor accept further direction.


An autonomous or self-directed advanced agent is a machine intelligence that acts in the real world in pursuit of its preferences without further user intervention or steering. In Bostrom's typology of advanced agents, this is a "Sovereign", as distinguished from a "Genie" or an "Oracle". ("Sovereign" in this sense means self-sovereign, and is not to be confused with the concept of a Bostromian singleton or any particular kind of social governance.)

Usually, when we say "Sovereign" or "self-directed", we'll be talking about a supposedly aligned AI that acts autonomously by design. Failure to solve the alignment problem probably means the resulting AI is self-directed-by-default.

Trying to construct an autonomous Friendly AI implies that we trust the AI more than the programmers in any conflict between them, and that we're okay with removing all constraints and off-switches except those the agent voluntarily takes upon itself.

A successfully aligned autonomous AGI would carry the least moral hazard of any scenario, since it hands off steering to some fixed preference framework or objective that the programmers can no longer modify. Nonetheless, being really, really, really sure, not just getting it right but knowing we've gotten it right, seems like a large enough problem that perhaps we shouldn't be trying to build this class of AI on our first try, and should first target a Task AGI instead, or something else involving ongoing user steering.

An autonomous superintelligence would be the most difficult possible class of AGI to align, requiring total alignment. Coherent extrapolated volition is a proposed alignment target for an autonomous superintelligence, but again, probably not something we should attempt to do on our first try.


Comments

Paul Christiano

This topic consistently frustrates me; the proposed typology is obviously incomplete, and I don't think it produces any useful conclusions except by equivocating between definitions (e.g. establishing that X is a sovereign and later that sovereigns have property P), by assuming exhaustiveness without justification, or by straightforwardly smuggling in associations.

Note that "an AI intended to act freely in the world according to its own preferences" need not entail "without further direction," since the preferences of the AI may make reference to human direction. And neither of these directly entails the need to get it right on the first try to any greater extent than for any other AI system.

And the complement of these properties doesn't really imply anything at all, certainly not that a system is a genie or an oracle.

Paul Christiano

I obviously disagree with "under intelligence explosion scenarios a Singleton seems like a quite probable result of constructing a Sovereign."

This is true in an uninteresting sense, namely: in the very long run a singleton seems pretty likely. If technological/economic/social change accelerates enough, then from the outside it may look like a singleton appears immediately. But that's not a useful notion for forecasting the character of that singleton or the future trajectory of civilization, and the resulting singleton has little more relation to the early AI than it has to us.

Relatedly, I feel that "sovereign" is a really bad name.

Eliezer Yudkowsky

I have similar qualms about the name. Got something better?

Leaving that aside: if you have an AI acting under its own goals in a way not intended to involve constantly consulting humans, and an intelligence explosion is feasible, why wouldn't it be able to take over the world, and why wouldn't that be a convergent instrumental goal even of nice agents? 48 hours to get the custom proteins, etcetera.

Paul Christiano

I expect you know my answer on this one.

I agree that if there is a really fast transition (e.g. doubling capability in a day), starting from a world that looks generally like the world of today (and in particular one which isn't already moving incredibly quickly) then it could result in world takeover depending on the conditions of AI development. Maybe I'd call it more likely than not in that case, with the main uncertainty being how concentrated relevant information is and how well-coordinated the people with that information already are (of course according to calendar time they might quickly form a singleton anyway as their coordination ability improves, but that's precisely the uninteresting sense I was describing before).

You could reserve "intelligence explosion" for the really fast transition + "standing start" scenario. But from my perspective the broader notion is quite useful: it looks like a probable consequence of our understanding of technological development, it is consistent with history and with the contemporary understanding of AI, and still no one takes it seriously despite its being one of the most important facts about the future. The broader notion is also what people normally say the definition is, e.g. it's what Chalmers argues for and I think it's the definition Nick uses.

The narrower notion is perhaps even more important if it will actually occur, but also is (at a minimum) highly controversial, and is based on a view of AI progress that few experts endorse. It seems best to reserve "intelligence explosion" for the crazy but probable event that no one takes seriously, to continue to try to get the broader intellectual community to understand why that event seems likely, and to have a more nuanced discussion (amongst people who already take the basic claim seriously) about how fast and abrupt progress is likely to be.

I don't have a better name for "sovereign", in part because I don't think it's a useful or entirely coherent concept; it feels practically designed to smuggle in assumptions. I do think that we can make better names for various more precise versions of it, e.g. "fully autonomous agent."