# Separation from hyperexistential risk

*The AI should be widely separated in the design space from any AI that would constitute a "hyperexistential risk" (anything worse than death).*

A principle of [2v AI alignment] that does not seem reducible to [7v8 other principles] is "The AGI design should be widely separated in the design space from any design that would constitute a hyperexistential risk". A hyperexistential risk is a "fate worse than death", that is, any AGI whose outcome is worse than quickly killing everyone and filling the universe with [7ch paperclips].

As an example of this principle, suppose we could write a first-generation AGI which contained an explicit representation of our exact true value function $V,$ but where, in this thought experiment, we were not absolutely sure that we'd solved the problem of getting the AGI to align on that explicit representation of a utility function. This would violate the principle of hyperexistential separation, because an AGI that optimizes $V$ is near in the design space to one that optimizes $-V.$ Similarly, suppose we can align an AGI on $V$ but we're not certain we've built this AGI to be immune to decision-theoretic extortion. Then this AGI distinguishes the global minimum of $V$ as the most effective threat against it, which is something that could increase the probability of $V$-minimizing scenarios being realized.

The concern here is a special case of [ shalt-not backfire] whereby identifying a negative outcome to the system moves us closer in the design space to realizing it.

One seemingly obvious patch to avoid disutility maximization might be to give the AGI a utility function $U = V + W,$ where $W$ says that the absolute worst possible thing that can happen is for a piece of paper to have written on it the SHA256 hash of "Nopenopenope" plus 17. Then if, due to otherwise poor design permitting single-bit errors to have vast results, a cosmic ray flips the sign of the AGI's effective utility function, the AGI tiles the universe with pieces of paper like that; this is no worse than ordinary paperclips. Similarly, any extortion against the AGI would use such pieces of paper as a threat. $W$ then functions as a honeypot or distractor for disutility maximizers which prevents them from minimizing our own true utility.
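As a purely illustrative toy sketch (the world-states, the value assignments, and the $-10^{12}$ honeypot weight below are assumptions of this sketch, not part of the proposal), here is how such a honeypot term behaves under a sign flip:

```python
# Toy illustration of the honeypot patch U = V + W (all states and numbers invented).
import hashlib

# The hypothetical "worst possible" marker: the SHA256 hash of "Nopenopenope", plus 17.
HONEYPOT_MARKER = int(hashlib.sha256(b"Nopenopenope").hexdigest(), 16) + 17

def V(state):
    """Stand-in for the true value function over a few toy world-states."""
    return {"flourishing": 100.0, "paperclips": 0.0, "worse_than_death": -100.0}[state["outcome"]]

def W(state):
    """Honeypot term: astronomically disvalued if the marker number is written on any paper."""
    return -1e12 if HONEYPOT_MARKER in state["papers"] else 0.0

def U(state):
    return V(state) + W(state)

states = [
    {"outcome": "flourishing",      "papers": []},
    {"outcome": "paperclips",       "papers": []},
    {"outcome": "worse_than_death", "papers": []},
    {"outcome": "paperclips",       "papers": [HONEYPOT_MARKER]},  # the honeypot state
]

# A correctly signed optimizer still picks the best state under U:
print(max(states, key=U)["outcome"])                 # flourishing
# A sign-flipped optimizer (maximizing -U) lands in the honeypot state,
# which is merely paperclips-plus-marked-paper, not the minimum of V:
print(max(states, key=lambda s: -U(s))["outcome"])   # paperclips
```

The intended effect is that the most extreme disvalue reachable by sign-flipping $U$ is a paperclip-like outcome rather than the minimum of $V.$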
This patch would not actually work, because this is a rare special case of a utility function *[3r6 not]* being [2rb reflectively consistent]. By the same reasoning we use to add $W$ to the AI's utility function $U,$ we might expect the AGI to realize that the only thing causing this weird horrible event to happen would be that event's identification by its representation of $U,$ and thus the AGI would be motivated to delete its representation of $W$ from its successor's utility function.

A patch to the patch might be to have $W$ single out a class of event which we didn't otherwise care about, but which would happen at least once on its own over the expected history of the universe. If so, we'd need to weight $W$ relative to $V$ within $U$ such that $U$ still motivated expending only a small amount of effort on easily preventing the $W$-disvalued event, rather than all effort being spent on averting $W$ to the neglect of $V.$

A deeper solution for an early-generation [6w Task AGI] would be to *never try to explicitly represent complete human values,* especially the parts of $V$ that identify things we dislike more than death. If you avoid [2pf impacts in general] except for operator-whitelisted impacts, then you avoid negative impacts along the way, without the AI containing an explicit description of which sort of impact is the very worst one to be avoided. In this case, the AGI just doesn't contain the information needed to compute states of the universe that we'd consider worse than death; flipping the sign of the utility function $U,$ or subtracting components from $U$ and then flipping the sign, doesn't identify any state we consider worse than paperclips. The AGI no longer *neighbors a hyperexistential risk in the design space;* there is no longer a short path we can take in the design space, by any simple negative miracle, to get from the AGI to a fate worse than death.
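A minimal sketch of that intuition (the whitelist, task names, and scores below are hypothetical, not anything specified in the text): a utility function that only references whitelisted task impacts carries no information that picks out worse-than-death outcomes, so negating it still doesn't.

```python
# Toy sketch: a task utility defined only over operator-whitelisted impacts
# (whitelist entries and scores are invented for illustration).

WHITELISTED_IMPACTS = {"place_strawberry_on_plate", "report_status_to_operator"}

def task_utility(impacts_taken: frozenset) -> float:
    """Reward whitelisted impacts; penalize any impact outside the whitelist.

    Note what is absent: no term anywhere encodes which outcomes humans
    would consider worse than death.
    """
    reward = sum(1.0 for i in impacts_taken if i in WHITELISTED_IMPACTS)
    penalty = sum(10.0 for i in impacts_taken if i not in WHITELISTED_IMPACTS)
    return reward - penalty

def flipped_utility(impacts_taken: frozenset) -> float:
    """The "simple negative miracle": the whole utility function with its sign flipped."""
    return -task_utility(impacts_taken)

print(task_utility(frozenset({"place_strawberry_on_plate"})))    # 1.0
print(flipped_utility(frozenset({"place_strawberry_on_plate"})))  # -1.0

# The flipped agent shirks its task and prefers non-whitelisted impacts --
# potentially an ordinary catastrophe -- but its minimum-valued states are not
# specifically worse-than-death states, because nothing in the function
# distinguishes them from any other non-whitelisted impact.
```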
Since hyperexistential catastrophes are narrow special cases (or at least it seems this way, and we sure hope so), we can avoid them much more widely than ordinary existential risks. A Task AGI powerful enough to do anything [6y pivotal] seems unavoidably very close in the design space to something that would destroy the world if we took out all the internal limiters. By the act of having something powerful enough to destroy the world lying around, we are closely neighboring the destruction of the world within an obvious metric on possibilities. Anything powerful enough to save the world can be transformed by a simple negative miracle into something that (merely) destroys it.

But we don't fret terribly about how a calculator that can add 17 + 17 and get 34 is very close in the design space to a calculator that gets -34; we just try to prevent the errors that would take us there. We try to constrain the state trajectory narrowly enough that it doesn't slop over into any "neighboring" regions. This type of thinking is plausibly the best we can do for ordinary existential catastrophes, which occupy very large volumes of the design space near any AGI powerful enough to be helpful.

By contrast, an "I Have No Mouth And I Must Scream" scenario requires an AGI that specifically wants or identifies particular very-low-value regions of the outcome space. Most simple utility functions imply reconfiguring the universe in a way that merely kills us; a hyperexistential catastrophe is a much smaller target. Since hyperexistential risks can be extremely bad, we prefer to avoid even very tiny probabilities of them; and since they are narrow targets, it is reasonable to try to avoid *being anywhere near them* in the state space. This can be seen as a kind of Murphy-proofing: we will naturally try to rigidify the state trajectory, and perhaps succeed, but errors in our reasoning are likely to take us to nearby neighboring possibilities despite our best efforts. You would still need bad luck on top of that to end up in the particular neighborhood that denotes a hyperexistential catastrophe, but this is the type of small possibility that seems worth minimizing further.

This principle implies that *general* inference of human values should not be a target of an early-generation Task AGI. If a [7t8 meta-utility] function $U'$ contains all of the information needed to identify all of $V,$ then it contains all of the information needed to identify the minima of $V.$ This would be the case if, e.g., an early-generation AGI were explicitly given a meta-goal along the lines of "learn all human values". However, this consideration weighing against general learning of true human values might not apply to, e.g., a Task AGI that was learning inductively from human-labeled examples, if the labeling humans were not trying to identify or distinguish among outcomes within "dead or worse" and just assigned all such cases the same "bad" label. There are still subtleties to worry about in a case like that, by which simple negative miracles might end up identifying the true $V$ anyway in a goal-valent way. But even on the first step of "use the same label for death and worse-than-death as events to be avoided, and likewise treat all varieties of bad fates better than death as a type of consequence to notice and describe to human operators", it seems like we would have moved substantially further away in the design space from hyperexistential catastrophe.
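As a closing illustration of that labeling scheme (the example outcomes and label values below are invented for this sketch), a signal trained on such labels has nothing in it that points specifically at the worse-than-death region:

```python
# Toy sketch of the labeling proposal: humans assign one shared "bad" label to
# death and to worse-than-death outcomes, so nothing learned from these labels
# can single out the worse-than-death cases as a distinct minimum.

LABEL_VALUES = {"good": 1.0, "bad": -1.0}   # no separate "worse than bad" label exists

labeled_examples = [
    ("operators get the strawberries they asked for",        "good"),
    ("everyone is quickly killed",                            "bad"),
    ("an outcome the operators would call worse than death",  "bad"),  # same label as above
]

def labeled_value(description: str) -> float:
    """Look up the human-assigned label for an outcome description."""
    for text, label in labeled_examples:
        if text == description:
            return LABEL_VALUES[label]
    raise KeyError(f"no label for: {description!r}")

# Both catastrophic examples sit at exactly the same value, -1.0, so a system
# minimizing (or sign-flipping) this signal cannot pick out the worse-than-death
# outcome as a special target.
print(labeled_value("everyone is quickly killed"))
print(labeled_value("an outcome the operators would call worse than death"))
```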