{ localUrl: '../page/optimization_goals.html', arbitalUrl: 'https://arbital.com/p/optimization_goals', rawJsonUrl: '../raw/1tc.json', likeableId: '758', likeableType: 'page', myLikeValue: '0', likeCount: '1', dislikeCount: '0', likeScore: '1', individualLikes: [ 'AlexRay' ], pageId: 'optimization_goals', edit: '7', editSummary: '', prevEdit: '6', currentEdit: '7', wasPublished: 'true', type: 'wiki', title: 'Optimization and goals', clickbait: '', textLength: '10500', alias: 'optimization_goals', externalUrl: '', sortChildrenBy: 'likes', hasVote: 'false', voteType: '', votesAnonymous: 'false', editCreatorId: 'PaulChristiano', editCreatedAt: '2016-03-04 00:13:31', pageCreatorId: 'PaulChristiano', pageCreatedAt: '2016-02-01 01:31:04', seeDomainId: '0', editDomainId: '705', submitToDomainId: '0', isAutosave: 'false', isSnapshot: 'false', isLiveEdit: 'true', isMinorEdit: 'false', indirectTeacher: 'false', todoCount: '0', isEditorComment: 'false', isApprovedComment: 'true', isResolved: 'false', snapshotText: '', anchorContext: '', anchorText: '', anchorOffset: '0', mergedInto: '', isDeleted: 'false', viewCount: '64', text: 'If we want to write a program that _doesn’t_ pursue a goal, we can have two kinds of trouble:\n\n1. We might need to explicitly introduce goal-directed behavior into our program, because it’s the easiest way to do what we want to do.\n2. We might try to write a program that doesn’t pursue a goal, and fail.\n\nIssue \\[2\\] sounds pretty strange—it’s not the kind of bug most software has. 
But when you are programming with gradient descent, strange things can happen.\n\nIn this post I illustrate issue \\[2\\] by considering a possible design for an [approval-directed](https://arbital.com/p/1t7) system, intended to take individual actions the user would approve of without considering their long-term consequences.\n\n### The architecture\n\n\nOur agent takes as input a sequence of observations *x[t]* ∈ {0, 1} and outputs a sequence of actions *a[t]* ∈ {0, 1}.\n\nThe main ingredient is a pair of functions A : ℜⁿ → {0, 1} and S : ℜⁿ × {0, 1} → ℜⁿ.\n\nTo run the agent:\n\n- We initialize s[0] = 0ⁿ.\n- Define a[t] = A(s[t]).\n- Define s[t+1] = S(s[t], x[t]).\n\nFor concreteness you might imagine that A and S are feed-forward neural networks, so that the entire system is a typical recurrent neural network. I’m going to imagine performance well beyond anything we can currently get out of a recurrent neural network.\n\nOur training data consists of some sequence of observations _x_[_t_] and reward functions _r[t]_ : {0, 1} → ℜ. We train A and S to maximize Σ _r[t]_(_a_[_t_]). We’ll define _r[t]_(_a_) to be how much we approve of taking _a_ in the context of step _t_.\n\nTo gather training data, we use the current version of A and S to control a robot, we define _x_[_t_] as the robot’s sensor reading at step _t_, and we manually provide the reward functions _r[t]_ as look-up tables.\n\nWe hope that if we use A and S to control a robot, the resulting system chooses each action to maximize our approval _of that action_. (This might seem like a crazy design, but for now let’s just ask whether the resulting system will exhibit goal-directed behavior.)\n\nThe state _s_ is important for most applications. We’d like our agent to make decisions based on everything it knows, not based only on what it currently observes. 
By optimizing S to store useful information, we allow the system to remember information across time steps and aggregate many observations to make informed judgments.\n\n### What happens?\n\n\n\nThis system looks like it shouldn’t form any long-term plans; each action should be selected greedily, without concern for its consequences. The separation of A and S makes this perfectly explicit: A is essentially a classifier, which labels each state _s_ with the most-approved action _a_ ∈ {0, 1}. There isn’t any room for A to consider long-term consequences.\n\nHow would this system work, if it worked well? What would the learned function S look like?\n\nHere are some possible properties, which seem like the “right” way to optimize the reward function. That is, if you optimized well enough, over a large enough space of functions, I would expect you to observe these behaviors:\n\n1. The state _s_ encodes beliefs about the agent’s environment. (Assuming that our approval, and hence the reward function _r_, is mostly determined by the environment.)\n2. The function S updates the agent’s current beliefs based on the observation _x_.\n3. _s_ selectively remembers facts that are likely to be useful, because it doesn’t have enough space to remember everything in detail. It also stores the results of cached computations—the function A doesn’t have time to do much computation, so if complicated computations are useful, the agent must carry them out over many steps and remember the results. (For example, if the agent saw a puzzle written on a sheet of paper, the agent might compute and store its answer to use later.)\n4. S uses non-trivial rules to decide what to remember or compute. All of these decisions are at least implicitly optimized to make efficient use of limited resources. These choices depend on what situations will arise in the future, and so S should use its full predictive ability to make them. 
For example, when deciding whether to forget X or forget Y, S might predict what reward it will receive in each case and choose the better option (this computation could use cached intermediate results and predictions, implicitly taking place over many steps). **In general, S updates _s_ rationally to maximize its expected future reward.**\n5. The agent is able to reason about the relationship between its own beliefs and the world. For example, it might realize that a particular fact will be repeated later, and so not bother remembering it. This clearly requires making a prediction about what it will observe in the future. But it’s also based on a prediction about how S will update _s_ in response to that observation, and about how A will respond to different values of _s_.\n\nThese all sound like good things. Each of them leads the agent to perform better when deployed. We could imagine accomplishing the same goals with simple heuristics, but as we get enough training data the heuristics should converge to rational behavior.\n\nBut these properties, and especially property 4, may have some counterintuitive consequences.\n\nWe can easily see the shadow of plans: when the agent chooses the state _s_, it won’t greedily maximize the next round’s performance (and we wouldn’t want it to, since this would mean promptly forgetting everything that isn’t immediately relevant). Instead, it will choose a state _s_ that is useful for long-term performance. And to do that, it will think about long-term consequences.\n\n### What goes wrong?\n \n\nWhen the agent anticipates the consequences of choosing different values of s[t], it sees that different choices lead to different sequences of states, hence different sequences of predictions, hence different rewards. 
During training, the agent _doesn’t_ expect different states to lead to different events in the world, because it was trained on the fixed actions taken by previous versions of A and S.\n\nBut when the system is actually deployed, changing _s_ will also change what action is actually taken. If the system is a good general-purpose learner, it will notice this relationship (along with many other patterns that happen to appear while the system is deployed but didn’t appear in any training episode).\n\nFor example, if the agent forgets where an item is and is unable to find it when asked, the agent is likely to be reminded where to look. In general, the agent should notice patterns like “If I forget X, I am more likely to be reminded of X in the future.” Once the agent notices this pattern, the agent should act on it; after all, the agent is already acting on many similar patterns.\n\nIf the agent rationally chooses _s_ to maximize future rewards, it might take more extreme measures. For example, it might reason: “If I choose a state _s_ which incorrectly ‘remembers’ that the user asked me to write down X, then I will write down X. If I write down X, I will remember X in the future. This will lead to low reward in the next few rounds, since the user didn’t ask me to write down X. But this is more than offset by the high future reward from remembering X on many different occasions.”\n\nIn an even more extreme case, the agent might reason: “If I choose a state _s_ which incorrectly ‘remembers’ that the user asked me to implement another AI whose goal is maximizing my future reward, I will achieve a high future reward. I will get an approval of 0 while I am implementing the new AI, but this will be more than offset by my high reward thereafter.”\n\nThis example is of course facetious, but hopefully it illustrates the point. If we actually want the robot to maximize its total approval, then we should be happy with this result. 
But now we are right back in the goal-directed case.\n\n### Upshot\n\n\nIf you trained a sufficiently weak S, it would neither display goals nor exhibit human-level reasoning. If you trained a sufficiently powerful S, it might display goals and would definitely be superhuman. Somewhere in between you get human-level performance, and somewhere in between you probably get goal-directed behavior.\n\nIt’s not clear whether goal-directed behavior emerges before or after human-level performance. But I am wary of systems that only behave as intended in a sweet spot between being “good enough” and “too good,” or which depend on optimistic assumptions. It would be more satisfying to design systems that perform robustly regardless of how smart they are.\n\nI suspect that similar problems will arise frequently if we train very intelligent but goal-free systems.\n\nOne response is to give up on designing intelligent systems without goals: to conclude that in order to build systems that work robustly, we should conservatively assume that they have goals. This is probably somewhat premature.\n\nAnother response is to find a better training criterion, which doesn’t involve maximizing a sum of future rewards. But it’s not clear how to do this while also training the system to build a useful representation, since the usefulness of a representation is determined by long-term performance. We could train S to create a state _s_ which is useful on the very next time step (i.e. we could propagate gradients from s[n+1] to the parameters of S, but not back to s[n]), and hope that the resulting update rule happens to be appropriate over longer time scales. But this seems to rest on a pretty optimistic assumption.\n\nA third response is to also update the state _s_ in the most-approved-of way. 
But this forces the human engineers to figure out how the system should store its knowledge, which may be a serious problem (though it [might](https://arbital.com/p/1t8) be manageable).\n\nA final possible response is to “cross that bridge when we come to it.” I would like to understand AI safety further in advance, and understanding the inevitability of goal-directed behavior is a natural step. So although this option is tempting, I am hesitant to accept it. Our theoretical understanding is also so weak that I feel we can go somewhat further before we run into a hard requirement for empirical feedback. I’ll find this option more tempting once we are going in circles rather than making steady headway.', metaText: '', isTextLoaded: 'true', isSubscribedToDiscussion: 'false', isSubscribedToUser: 'false', isSubscribedAsMaintainer: 'false', discussionSubscriberCount: '1', maintainerCount: '1', userSubscriberCount: '0', lastVisit: '', hasDraft: 'false', votes: [], voteSummary: 'null', muVoteSummary: '0', voteScaling: '0', currentUserVote: '-2', voteCount: '0', lockedVoteType: '', maxEditEver: '0', redLinkCount: '0', lockedBy: '', lockedUntil: '', nextPageId: '', prevPageId: '', usedAsMastery: 'false', proposalEditNum: '0', permissions: { edit: { has: 'false', reason: 'You don't have domain permission to edit this page' }, proposeEdit: { has: 'true', reason: '' }, delete: { has: 'false', reason: 'You don't have domain permission to delete this page' }, comment: { has: 'false', reason: 'You can't comment in this domain because you are not a member' }, proposeComment: { has: 'true', reason: '' } }, summaries: {}, creatorIds: [ 'PaulChristiano' ], childIds: [], parentIds: [ 'paul_ai_control' ], commentIds: [], questionIds: [], tagIds: [], relatedIds: [], markIds: [], explanations: [], learnMore: [], requirements: [], subjects: [], lenses: [], lensParentId: '', pathPages: [], learnMoreTaughtMap: {}, learnMoreCoveredMap: {}, learnMoreRequiredMap: {}, editHistory: {}, 
domainSubmissions: {}, answers: [], answerCount: '0', commentCount: '0', newCommentCount: '0', linkedMarkCount: '0', changeLogs: [ { likeableId: '0', likeableType: 'changeLog', myLikeValue: '0', likeCount: '0', dislikeCount: '0', likeScore: '0', individualLikes: [], id: '8253', pageId: 'optimization_goals', userId: 'AlexeiAndreev', edit: '7', type: 'newEdit', createdAt: '2016-03-04 00:13:31', auxPageId: '', oldSettingsValue: '', newSettingsValue: '' }, { likeableId: '0', likeableType: 'changeLog', myLikeValue: '0', likeCount: '0', dislikeCount: '0', likeScore: '0', individualLikes: [], id: '8252', pageId: 'optimization_goals', userId: 'JessicaChuan', edit: '6', type: 'newEdit', createdAt: '2016-03-04 00:12:47', auxPageId: '', oldSettingsValue: '', newSettingsValue: '' }, { likeableId: '0', likeableType: 'changeLog', myLikeValue: '0', likeCount: '0', dislikeCount: '0', likeScore: '0', individualLikes: [], id: '6608', pageId: 'optimization_goals', userId: 'JessicaChuan', edit: '5', type: 'newEdit', createdAt: '2016-02-09 03:43:42', auxPageId: '', oldSettingsValue: '', newSettingsValue: '' }, { likeableId: '0', likeableType: 'changeLog', myLikeValue: '0', likeCount: '0', dislikeCount: '0', likeScore: '0', individualLikes: [], id: '6605', pageId: 'optimization_goals', userId: 'JessicaChuan', edit: '4', type: 'newEdit', createdAt: '2016-02-09 03:29:45', auxPageId: '', oldSettingsValue: '', newSettingsValue: '' }, { likeableId: '0', likeableType: 'changeLog', myLikeValue: '0', likeCount: '0', dislikeCount: '0', likeScore: '0', individualLikes: [], id: '6604', pageId: 'optimization_goals', userId: 'JessicaChuan', edit: '3', type: 'newEdit', createdAt: '2016-02-09 03:22:11', auxPageId: '', oldSettingsValue: '', newSettingsValue: '' }, { likeableId: '0', likeableType: 'changeLog', myLikeValue: '0', likeCount: '0', dislikeCount: '0', likeScore: '0', individualLikes: [], id: '6415', pageId: 'optimization_goals', userId: 'JessicaChuan', edit: '2', type: 'newEdit', createdAt: 
'2016-02-04 02:11:45', auxPageId: '', oldSettingsValue: '', newSettingsValue: '' }, { likeableId: '0', likeableType: 'changeLog', myLikeValue: '0', likeCount: '0', dislikeCount: '0', likeScore: '0', individualLikes: [], id: '6001', pageId: 'optimization_goals', userId: 'JessicaChuan', edit: '1', type: 'newEdit', createdAt: '2016-02-01 01:31:04', auxPageId: '', oldSettingsValue: '', newSettingsValue: '' }, { likeableId: '0', likeableType: 'changeLog', myLikeValue: '0', likeCount: '0', dislikeCount: '0', likeScore: '0', individualLikes: [], id: '5999', pageId: 'optimization_goals', userId: 'JessicaChuan', edit: '0', type: 'newParent', createdAt: '2016-02-01 01:20:26', auxPageId: 'paul_ai_control', oldSettingsValue: '', newSettingsValue: '' } ], feedSubmissions: [], searchStrings: {}, hasChildren: 'false', hasParents: 'true', redAliases: {}, improvementTagIds: [], nonMetaTagIds: [], todos: [], slowDownMap: 'null', speedUpMap: 'null', arcPageIds: 'null', contentRequests: {} }