[summary: It seems likely that an advanced agent's representation of the world will change in unforeseen ways as it becomes smarter. The ontology identification problem is to create a preference framework for the agent that optimizes the same external facts even as the agent modifies its representation of the world. For example, if the intended goal were to create large amounts of diamond material, one type of ontology identification problem would arise if the programmers thought of carbon atoms as primitive during the AI's development phase, and the advanced AI then discovered nuclear physics.]
[toc:]
Introduction: The ontology identification problem for unreflective diamond maximizers
A simplified but still very difficult open problem in AI alignment is to state an unbounded program implementing a diamond maximizer that will turn as much of the physical universe into diamond as possible. The goal of "making diamonds" was chosen to have a crisp-seeming definition for our universe (the amount of diamond is the number of carbon atoms covalently bound to four other carbon atoms). If we can crisply define exactly what a 'diamond' is, we can avert issues of trying to convey complex values into the agent. (The unreflective diamond maximizer putatively has [ unlimited computing power], runs on a [ Cartesian processor], and confronts no other agents [ similar to itself]. This averts many other problems of reflectivity, decision theory and value alignment.)
Even with a seemingly crisp goal of "make diamonds", we might still run into two problems if we tried to write a hand-coded object-level utility function that [ identified] the amount of diamond material:
- Unknown substrate: We might not know the true, fundamental ontology of our own universe, hence not know what stuff diamonds are really made of. (What exactly is a carbon atom? If you say it's a nucleus with six protons, what's a proton? If you define a proton as being made of quarks, what if there are unknown other particles underlying quarks?)
- It seems intuitively like there ought to be some way to identify carbon atoms to an AI in some way that doesn't depend on talking about quarks. Doing this is part of the ontology identification problem.
- Unknown representation: We might crisply know what diamonds are in our universe, but not know how to find diamonds inside the agent's model of the environment.
- Again, it seems intuitively like it ought to be possible to identify diamonds in the environment, even if we don't know details of the agent's exact internal representation. Doing this is part of the ontology identification problem.
To introduce the general issues in ontology identification, we'll try to walk through the [ anticipated difficulties] of constructing an unbounded agent that would maximize diamonds, considering specific methods and the difficulties those methods seem to encounter.
Difficulty of making AIXI-tl maximize diamond
The classic unbounded agent - an agent using far more computing power than the size of its environment - is AIXI. Roughly speaking, AIXI considers all computable hypotheses for how its environment might be turning AIXI's motor outputs into AIXI's sensory inputs and rewards. We can think of AIXI's hypothesis space as including all Turing machines that, sequentially given AIXI's motor choices as inputs, will output a sequence of predicted sense items and rewards for AIXI. The finite variant AIXI-tl has a hypothesis space that includes all Turing machines that can be specified using fewer than $~$l$~$ bits and run in less than time $~$t$~$.
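To make this hypothesis space concrete, here is a minimal Python sketch (purely illustrative, not any standard AIXI formalization) in which a small list of Python callables stands in for 'all Turing machines specifiable in fewer than $~$l$~$ bits'. Note that the only typed interface each hypothesis exposes is its predicted sense bits; that interface is all we have to work with below.

```python
from typing import Callable, List, Tuple

# A "hypothesis" here is a stand-in for a tl-bounded Turing machine: given the
# history of motor outputs, it returns the predicted sense bits.
Hypothesis = Callable[[List[int]], List[int]]

def posterior_weights(hypotheses: List[Tuple[int, Hypothesis]],
                      motor_history: List[int],
                      sense_history: List[int]) -> List[float]:
    """Weight each (description_length_bits, hypothesis) pair by a simplicity
    prior 2^-length, zeroing out hypotheses whose predictions contradict the
    observed sense history (a crude stand-in for AIXI-tl's update rule)."""
    weights = []
    for length_bits, h in hypotheses:
        prediction = h(motor_history)[:len(sense_history)]
        weights.append(2.0 ** -length_bits if prediction == sense_history else 0.0)
    total = sum(weights) or 1.0
    return [w / total for w in weights]

# Toy usage: two "machines", one predicting all zeros, one echoing motor outputs.
hypotheses = [(3, lambda motors: [0] * (len(motors) + 1)),
              (5, lambda motors: motors + [1])]
print(posterior_weights(hypotheses, motor_history=[0, 1], sense_history=[0, 1]))
# -> [0.0, 1.0]: only the second machine survives the update.
```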
One way of seeing the difficulty of ontology identification is considering why it would be difficult to make an AIXI-tl variant that maximized 'diamonds' instead of 'reward inputs'.
The central difficulty here is that there's no way to find 'diamonds' inside the implicit representations of AIXI-tl's sequence-predicting Turing machines. Given an arbitrary Turing machine that is successfully predicting AIXI-tl's sense inputs, there is no general rule for going from the representation of that Turing machine to a statement about diamonds or carbon atoms. The highest-weighted Turing machines that have best predicted the sensory data so far presumably contain some sort of representation of the environment, but we have no idea how to get 'the number of diamonds' out of that representation.
If AIXI has a webcam, then the final outputs of the Turing machine are predictions about the stream of bits produced by the webcam, going down the wire into AIXI. We can understand the meaning of that Turing machine's output predictions; those outputs are meant to match types with the webcam's input. But we have no notion of anything else that Turing machine is representing. Even if somewhere in the Turing machine happens to be an atomically detailed model of the world, we don't know what representation it uses, or what format it has, or how to look inside it for the number of diamonds that will exist after AIXI's next motor action.
This difficulty ultimately arises from AIXI being constructed around a [ Cartesian] paradigm of [ sequence prediction], with AIXI's sense inputs and motor outputs being treated as sequence elements, and the Turing machines in its hypothesis space having inputs and outputs matched to the sequence elements and otherwise being treated as black boxes. This means we can only get AIXI to maximize direct functions of its sensory input, not any facts about the outside environment.
(We can't make AIXI maximize diamonds by making it want pictures of diamonds because then it will just, e.g., [ build an environmental subagent that seizes control of AIXI's webcam and shows it pictures of diamonds]. If you ask AIXI to show itself sensory pictures of diamonds, you can get it to show its webcam lots of pictures of diamonds, but this is not the same thing as building an environmental diamond maximizer.)
Agent using classical atomic hypotheses
As an [ unrealistic example]: Suppose someone was trying to define 'diamonds' for the AI's utility function, and suppose they knew about atomic physics but not nuclear physics. Suppose they built an AI which, during its development phase, learned about atomic physics from the programmers, and thus built a world-model based on atomic physics.
Again for purposes of [ unrealistic examples], suppose that the AI's world-model is encoded in such fashion that when the AI imagines a molecular structure - represents a mental image of some molecules - carbon atoms are represented as a particular kind of basic element of the representation. Again, as an [ unrealistic example], imagine that there are [ little LISP tokens] representing environmental objects, and that the environmental-object-type of carbon-objects is encoded by the integer 6. Imagine also that each atom, inside this representation, is followed by a list of the other atoms to which it's covalently bound. Then when the AI is imagining a carbon atom participating in a diamond, inside the representation we would see an object of type 6, followed by a list containing exactly four other 6-objects.
Can we fix this representation for all hypotheses, and then write a utility function for the AI that counts the number of type-6 objects that are bound to exactly four other type-6 objects? And if we did so, would the result actually be a diamond maximizer?
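Under those (unrealistic) assumptions, the hand-coded utility function itself is easy to write. Here is a minimal sketch assuming the toy representation just described, with atoms as objects carrying an integer type and a list of covalently bound neighbors; the class and function names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List

CARBON = 6  # the environmental-object-type integer used for carbon in the toy encoding

@dataclass
class Atom:
    element_type: int                                    # e.g. 6 for carbon
    bonds: List["Atom"] = field(default_factory=list)    # covalently bound neighbors

def diamondness(world: List[Atom]) -> int:
    """Count type-6 objects whose bond list contains exactly four other type-6
    objects -- the hand-coded 'amount of diamond' over this fixed representation."""
    return sum(
        1
        for atom in world
        if atom.element_type == CARBON
        and len(atom.bonds) == 4
        and all(n.element_type == CARBON for n in atom.bonds)
    )

# Toy usage: a central carbon bonded to four carbons counts as 1 unit of 'diamond'.
center = Atom(CARBON)
neighbors = [Atom(CARBON, bonds=[center]) for _ in range(4)]
center.bonds = neighbors
print(diamondness([center] + neighbors))  # -> 1 (only the center has four carbon bonds)
```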
AIXI-atomic
We can imagine formulating a variant of AIXI-tl that, rather than all tl-bounded Turing machines, considers tl-bounded simulated atomic universes - that is, simulations of classical, pre-nuclear physics. Call this AIXI-atomic.
A first difficulty is that universes composed only of classical atoms are not good explanations of our own universe, even in terms of surface phenomena; e.g., the ultraviolet catastrophe. So let it be supposed that we have simulation rules for classical physics that replicate at least whatever phenomena the programmers have observed at [ development time], even if the rules have some seemingly ad-hoc elements (like there being no ultraviolet catastrophes).
A second difficulty is that a simulated universe of classical atoms does not, by itself, identify where in that universe the AIXI-atomic agent resides, and AIXI-atomic's sense inputs don't have types commensurate with the types of atoms. We can elide this difficulty by imagining that AIXI-atomic simulates classical universes containing a single hypercomputer, and that AIXI-atomic knows a simple function from each simulated universe onto its own sensory data (e.g., it knows to look at the simulated universe and translate simulated photons impinging on its webcam into predicted webcam data in the received format). This elides most of the problem of [ naturalized induction], by fixing the ontology of all hypotheses and standardizing their hypothetical [ bridging laws].
So the analogous AIXI-atomic agent that maximizes diamond works as follows (a minimal code sketch follows this list):
- Considers only hypotheses that directly represent universes as huge systems of classical atoms, so that the function 'count atoms bound to four other carbon atoms' can be directly run over any possible future the agent considers.
- Assigns probabilistic priors over these possible atomic representations of the universe.
- Somehow [ maps each atomic representation onto the agent's sensory experiences and motor actions].
- [Bayes-updates its priors] based on actual sensory experiences, the same as classical AIXI.
- Can evaluate the 'expected diamondness on the next turn' of a single action by looking at all hypothetical universes where that action is performed, weighted by their current probability, and summing over the expectation of diamond-bound carbon atoms on their next clock tick.
- Can evaluate the 'future expected diamondness' of an action, over some finite time horizon, by assuming that its future self will also Bayes-update and maximize expected diamondness over that time horizon.
- On each turn, outputs the action with greatest expected diamondness over some finite time horizon.
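The following is a one-step sketch of that decision rule, reusing the toy `Atom`/`diamondness` representation from the earlier sketch. It compresses the finite time horizon to a single tick and stands in for 'simulated atomic universes' with plain Python callables, so it illustrates the shape of the computation rather than the actual AIXI-atomic construction.

```python
from typing import Callable, List

# Each hypothesis stands in for one simulated classical-atom universe: given an
# action, it returns the predicted world-state (a list of Atom objects from the
# earlier sketch) on the next clock tick.
AtomicHypothesis = Callable[[str], list]

def best_action(actions: List[str],
                hypotheses: List[AtomicHypothesis],
                weights: List[float]) -> str:
    """One-step analogue of AIXI-atomic's rule: choose the action with greatest
    probability-weighted expected diamondness over the atomic hypotheses."""
    def expected_diamondness(action: str) -> float:
        # diamondness() is the hand-coded utility function defined earlier.
        return sum(w * diamondness(h(action)) for w, h in zip(weights, hypotheses))
    return max(actions, key=expected_diamondness)
```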
Suppose our own real universe were amended to be otherwise exactly the same, but to contain a single [ impermeable] hypercomputer. Suppose we defined an agent like the one above, using simulations of 1900-era classical models of physics, and ran that agent on the hypercomputer. Should we expect the result to be an actual diamond maximizer - that is, that most mass in the universe will be turned into carbon and arranged into diamonds?
Anticipated failure of AIXI-atomic in our own universe: trying to maximize diamond outside the simulation
Our own universe isn't atomic, it's nuclear and quantum-mechanical. This means that AIXI-atomic's hypothesis space contains no hypotheses that directly represent our universe: the carbon atoms in AIXI-atomic's best available hypotheses do not correspond to the carbon atoms of our own world.
Intuitively, we would think it was [ common sense] for an agent that wanted diamonds to react to the experimental data identifying nuclear physics, by deciding that a carbon atom is 'really' a nucleus containing six protons, and atomic binding is 'really' covalent electron-sharing. We can imagine this agent [ common-sensically] updating its model of the universe to a nuclear model, and redefining the 'carbon atoms' that its old utility function counted to mean 'nuclei containing exactly six protons'. Then the new utility function could evaluate outcomes in the newly discovered nuclear-physics universe. We will call this the utility rebinding problem.
We don't yet have a crisp formula that seems like it would yield commonsense behavior for utility rebinding. In fact we don't yet have any candidate formulas for utility rebinding, period. Stating one is an open problem. See below.
For the 'classical atomic AIXI' agent we defined above, what happens instead is that the 'simplest atomic hypothesis that fits the facts' will be an enormous atom-based computer, simulating nuclear physics and quantum physics in order to control AIXI's webcam, which is still believed to be composed of atoms in accordance with the prespecified bridging laws. From our perspective this hypothesis seems silly, but if you restrict the hypothesis space to only classical atomic universes, that's what ends up being the computationally simplest hypothesis to explain the results of quantum experiments.
AIXI-atomic will then try to choose actions so as to maximize the amount of expected diamond inside the probable outside universes that could contain the giant atom-based simulator of quantum physics. It is not obvious what sort of behavior this would imply.
Metaphor for difficulty: AIXI-atomic cares about only fundamental carbon
One metaphorical way of looking at the problem is that AIXI-atomic was implicitly defined to care only about diamonds made out of ontologically fundamental carbon atoms, not diamonds made out of quarks. A probability function that assigns 0 probability to all universes made of quarks, and a utility function that outputs a constant on all universes made of quarks, [ yield functionally identical behavior]. So it is an exact metaphor to say that AIXI-atomic only cares about universes with ontologically basic carbon atoms, given that AIXI-atomic only believes in universes with ontologically basic carbon atoms.
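To spell the metaphor out as a quick calculation, consider a single action choice. The expected utility of an action $~$a$~$ splits as $~$\mathbb{E}[U \mid a] = \sum_{h \in \mathrm{Atomic}} P(h)\,U(h,a) + \sum_{h \in \mathrm{Quark}} P(h)\,U(h,a)$~$. If $~$P(h) = 0$~$ on all quark universes, the second sum vanishes; if instead $~$U(h,a) = c$~$ on all quark universes, the second sum equals $~$c \sum_{h \in \mathrm{Quark}} P(h)$~$, a constant that does not depend on $~$a$~$. Either way, the argmax over actions is determined entirely by the atomic-universe hypotheses.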
Since AIXI-atomic only cares about diamond made of fundamental carbon, when it encounters the experimental data implying that almost all of its probability mass should reside in nuclear or quantum universes containing no fundamental carbon atoms, AIXI-atomic stops caring about the effect its actions have on the vast majority of probability mass inside its model. Instead it tries to maximize inside the tiny remaining probability mass of universes with fundamental carbon atoms that are somehow reproducing its sensory experience of nuclei and quantum fields; for example, a classical atomic universe containing an atomic computer that simulates a quantum universe and shows the results to AIXI-atomic.
From our perspective, we failed to solve the 'ontology identification problem' and get the real-world result we wanted, because we tried to define the agent's utility function in terms of properties of a universe made out of atoms, and the real universe turned out to be made of quantum fields. This caused the utility function to fail to bind to the agent's representation in the way we intuitively had in mind.
Advanced-nonsafety of hardcoded ontology identifications
Today we do know about quantum mechanics, so if we tried to build an unreflective diamond maximizer using the above formula, it might not fail on account of the particular exact problem of atomic physics being false.
But perhaps there are discoveries still remaining that would change our picture of the universe's ontology to imply something else underlying quarks or quantum fields. Human beings have only known about quantum fields for less than a century; our model of the ontological basics of our universe has been stable for less than a hundred years of our human experience. So we should seek an AI design that does not assume we know the exact, true, fundamental ontology of our universe during an AI's development phase. Or if our failure to know the exact laws of physics causes catastrophic failure of the AI, we should at least heavily mark that this is a [ relied-on assumption].
Beyond AIXI-atomic: Diamond identification in multi-level maps
A realistic, bounded diamond maximizer wouldn't represent the outside universe with atomically detailed models. Instead, it would have some equivalent of a [ multi-level map] of the world in which the agent knew in principle that things were composed of atoms, but didn't model most things in atomic detail. E.g., its model of an airplane would have wings, or wing shapes, rather than atomically detailed wings. It would think about wings when doing aerodynamic engineering, atoms when doing chemistry, nuclear physics when doing nuclear engineering.
At present, there are not yet any proposed formalisms for how to do probability theory with multi-level maps (in other words: [ nobody has yet put forward a guess at how to solve the problem even given infinite computing power]). Having some idea of how an agent could reason with multi-level maps would be a good first step toward being able to define a bounded expected utility optimizer with a utility function that could be evaluated on multi-level maps. This in turn would be a first step towards defining an agent with a utility function that could rebind itself to changing representations in an updating multi-level map.
If we were actually trying to build a diamond maximizer, we would be likely to encounter this problem long before it started formulating new physics. The equivalent of a computational discovery that changes 'the most efficient way to represent diamonds' is likely to happen much earlier than a physical discovery that changes 'what underlying physical systems probably constitute a diamond'.
This also means that, on the actual [ value loading problem], we are liable to encounter the ontology identification problem long before the agent starts discovering new physics.
Discussion of the generalized ontology identification problem
If we don't know how to solve the ontology identification problem for maximizing diamonds, we probably can't solve it for much more complicated values over universe-histories.
View of human angst as ontology identification problem
Argument: A human being who feels angst on contemplating a universe in which "By convention sweetness, by convention bitterness, by convention color, in reality only atoms and the void" (Democritus), or wonders where there is any room in this cold atomic universe for love, free will, or even the existence of people - since, after all, people are just mere collections of atoms - can be seen as undergoing an ontology identification problem: they don't know how to find the objects of value in a representation containing atoms instead of ontologically basic people.
Human beings simultaneously evolved a particular set of standard mental representations (e.g., a representation for colors in terms of a 3-dimensional subjective color space, a representation for other humans that simulates their brain via [empathy]) along with evolving desires that bind to these representations (identification of flowering landscapes as beautiful, a preference not to be embarrassed in front of other objects designated as people). When someone visualizes any particular configurations of 'mere atoms', their built-in desires don't automatically fire and bind to that mental representation, the way they would bind to the brain's native representation of other people. Generalizing that no set of atoms can be meaningful, and being told that reality is composed entirely of such atoms, they feel they've been told that the true state of reality, underlying appearances, is a meaningless one.
Arguably, this is structurally similar to a utility function so defined as to bind only to true diamonds made of ontologically basic carbon, which evaluates as unimportant any diamond that turns out to be made of mere protons and neutrons.
Ontology identification problems may reappear on the reflective level
An obvious thought (especially for online genies) is that if the AI is unsure about how to reinterpret its goals in light of a shifting mental representation, it should query the programmers.
Since the definition of a programmer would then itself be baked into the preference framework, the problem might [ reproduce itself on the reflective level] if the AI became unsure of where to find programmers. ("My preference framework said that programmers were made of carbon atoms, but all I can find in this universe are quantum fields.")
Value lading in category boundaries
Taking apart objects of value into smaller components can sometimes create new moral [ edge cases]. In this sense, rebinding the terms of a utility function decides a [ value-laden] question.
Consider chimpanzees. One way of viewing questions like "Is a chimpanzee truly a person?" - meaning, not, "How do we arbitrarily define the syllables per-son?" but "Should we care a lot about chimpanzees?" - is that they're about how to apply the 'person' category in our desires to things that are neither typical people nor typical nonpeople. We can see this as arising from something like an ontological shift: we're used to valuing cognitive systems that are made from whole human minds, but it turns out that minds are made of parts, and then we have the question of how to value things that are made from some of the person-parts but not all of them.
Redefining the value-laden category 'person' so that it talked about brains made out of neural regions, rather than whole human beings, would implicitly say whether or not a chimpanzee was a person. Chimpanzees definitely have neural areas of various sizes, and particular cognitive abilities - we can suppose the empirical truth is unambiguous at this level, and known to us. So the question is then whether we regard a particular configuration of neural parts (a frontal cortex of a certain size) and particular cognitive abilities (consequentialist means-end reasoning and empathy, but no recursive language) as something that our 'person' category values… once we've rewritten the person category to value configurations of cognitive parts, rather than whole atomic people.
In this sense the problem we face with chimpanzees is exactly analogous to the question a diamond maximizer would face after discovering nuclear physics and asking itself whether a carbon-14 atom counted as 'carbon' for purposes of caring about diamonds. Once a diamond maximizer knows about neutrons, it can see that C-14 is chemically like carbon and forms the same kind of chemical bonds, but that it's heavier because it has two extra neutrons. We can see that chimpanzees have a brain architecture similar to that of the sort of people we always considered before, but that they have smaller frontal cortexes and no ability to use recursive language, etcetera.
Without knowing more about the diamond maximizer, we can't guess what sort of considerations it might bring to bear in deciding what is Truly Carbon and Really A Diamond. But the breadth of considerations human beings need to invoke in deciding how much to care about chimpanzees is one way of illustrating that the problem of rebinding a utility function to a shifted ontology is [value-laden] and can potentially undergo [ excursions] into [ arbitrarily complicated desiderata]. Redefining a [ moral category] so that it talks about the underlying parts of what were previously seen as all-or-nothing atomic objects may carry an implicit ruling about how to value many kinds of [ edge case] objects that were never seen before.
A formal part of this problem may need to be carved out from the edge-case-reclassification part: e.g., how you would redefine carbon as C12 if there were no other isotopes, how you would rebind the utility function to at least include C12, or how edge cases would be identified and queried.
Potential research avenues
'Transparent priors' constrained to meaningful but Turing-complete hypothesis spaces
The reason why we can't bind a description of 'diamond' or 'carbon atoms' to the hypothesis space used by AIXI or [ AIXI-tl] is that the hypothesis space of AIXI is all Turing machines that produce binary strings, or probability distributions over the next sense bit given previous sense bits and motor input. These Turing machines could contain an unimaginably wide range of possible contents.
(Example: Maybe one Turing machine that is producing good sequence predictions inside AIXI actually does so by simulating a large universe, identifying a superintelligent civilization that evolves inside that universe, and motivating that civilization to try to intelligently predict future bits from past bits (as provided by some intervention). To write a formal utility function that could extract the 'amount of real diamond in the environment' from arbitrary predictors in the above case, we'd need the function to read the Turing machine, decode that universe, find the superintelligence, decode the superintelligence's thought processes, find the concept (if any) resembling 'diamond', and hope that the superintelligence had precalculated how much diamond was around in the outer universe being manipulated by AIXI.)
This suggests that to solve the ontology identification problem, we may need to constrain the hypothesis space to something [ less general] than 'an explanation is any computer program that outputs a probability distribution on sense bits'. A constrained explanation space can still be Turing complete (contain a possible explanation for every computable sense input sequence) without every possible computer program constituting an explanation.
An [ unrealistic example] would be to constrain the hypothesis space to Dynamic Bayesian Networks. DBNs can represent any Turing machine with bounded memory,[todo: Not sure where to look for a citation, but I'd be very surprised if this wasn't true.] so they are very general; but since a DBN is a causal model, it becomes possible for a preference framework to talk about 'the cause of a picture of a diamond' in a way that you couldn't look for inside a general Turing machine. Again, this might fail if the DBN has no 'natural' way of representing the environment except as a DBN simulating some other program that simulates the environment.
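As an illustrative sketch (a hand-built dictionary standing in for a real DBN, which would also carry conditional probability tables and a transition model between time slices), a causal graph at least exposes a well-typed query for 'the causes of a picture of a diamond':

```python
from typing import Dict, List, Set

# A toy stand-in for one time-slice of a Dynamic Bayesian Network: each named
# variable lists its causal parents.
toy_dbn: Dict[str, List[str]] = {
    "carbon_lattice_present": [],
    "photons_reflected":      ["carbon_lattice_present"],
    "webcam_pixels":          ["photons_reflected"],
    "picture_of_diamond":     ["webcam_pixels"],
}

def causal_ancestors(graph: Dict[str, List[str]], node: str) -> Set[str]:
    """Return every variable upstream of `node` -- the sense in which a causal
    hypothesis lets a preference framework ask for 'the cause of a picture of a
    diamond', which a black-box Turing machine does not."""
    frontier, seen = list(graph[node]), set()
    while frontier:
        parent = frontier.pop()
        if parent not in seen:
            seen.add(parent)
            frontier.extend(graph[parent])
    return seen

print(causal_ancestors(toy_dbn, "picture_of_diamond"))
# -> {'webcam_pixels', 'photons_reflected', 'carbon_lattice_present'}
```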
Suppose we had a rich causal language, such as a [ dynamic system] of objects with [ causal relations] and [ hierarchical categories of similarity]. The hope is that in this language, the natural hypotheses representing the environment - the simplest hypotheses within this language that well predict the sense data, or those hypotheses of highest probability under some simplicity prior after updating on the sense data - would be such that there was a natural 'diamond' category inside the most probable causal models. In other words, the winning hypothesis for explaining the universe would already have postulated diamondness as a [ natural category] and represented it as, say, Category #803,844, in a rich language where we already know how to look through the environmental model and find the list of categories.
Given some transparent prior, there would then exist the further problem of developing a utility-identifying preference framework that could look through the most likely environmental representations and identify diamonds. Some likely (interacting) ways of binding would be, e.g., to "the causes of pictures of diamonds", to "things that are bound to four similar things", querying ambiguities to programmers, or direct programmer inspection of the AI's model (but in this case the programmers might need to re-inspect after each ontological shift). See below.
(A bounded value loading methodology would also need some way of turning the bound preference framework into the estimation procedures for expected diamond and the agent's search procedures for strategies high in expected diamond, i.e., the bulk of the actual AI that carries out the goal optimization.)
Matching environmental categories to descriptive constraints
Given some transparent prior, there would exist a further problem of how to actually bind a preference framework to that prior. One possible contributing method for pinpointing an environmental property would be to understand the prior well enough to know what the described object ought to look like inside it - the equivalent of being able to search for 'things W made of six smaller things X near six smaller things Y and six smaller things Z, that are bound by shared Xs to four similar things W in a tetrahedral structure' in order to identify carbon atoms and diamond.
We would need to understand the representation well enough to make a guess about how carbon or diamond would be represented inside it. But if we could guess that, we could write a program that identifies 'diamond' inside the hypothesis space without needing to know in advance that diamondness will be Category #823,034. Then we could rerun the same utility-identification program when the representation updates, so long as this program can reliably identify diamond inside the model each time, and the agent acts so as to optimize the utility identified by the program.
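A minimal sketch of what such a utility-identifying program might look like, assuming (hypothetically) that the transparent hypothesis exposes a list of inferred categories with structural features we can match against; the `Category` record and its feature names are invented for illustration:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Category:
    """One entry in a transparent hypothesis's list of inferred categories,
    exposing structural features a description can be matched against."""
    name: str                       # whatever internal label the model uses
    part_counts: Dict[str, int]     # counts of the smaller kinds of parts
    typical_bonds_to_same_kind: int

def find_carbon_like(categories: List[Category]) -> Optional[Category]:
    """Look for 'things made of six smaller things of three kinds, bound to four
    similar things' -- the descriptive constraint standing in for carbon-in-diamond."""
    for cat in categories:
        if (sorted(cat.part_counts.values()) == [6, 6, 6]
                and cat.typical_bonds_to_same_kind == 4):
            return cat
    return None

# Toy usage: the program does not need to know in advance which category number
# the model assigned to 'carbon'; it re-derives the match from the description.
model_categories = [
    Category("cat_17", {"X": 1, "Y": 0, "Z": 1}, 0),       # hydrogen-ish
    Category("cat_803844", {"X": 6, "Y": 6, "Z": 6}, 4),   # carbon-in-diamond-ish
]
print(find_carbon_like(model_categories).name)  # -> cat_803844
```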
One particular class of objects that might plausibly be identifiable in this way is 'the AI's programmers' (aka the agents that are causes of the AI's code) if there are parts of the preference framework that say to query programmers to resolve ambiguities.
A toy problem for this research avenue might involve (a code sketch tying these pieces together follows the lists below):
- One of the richer representation frameworks that can be inducted at the time of the work, e.g., a simple Dynamic Bayes Net.
- An agent environment that can be thus represented.
- A goal over properties relatively distant from the agent's sensory experience (e.g., the goal is over the cause of the cause of the sensory data).
- A program that identifies the objects of utility in the environment, within the model thus freely inducted.
- An agent that optimizes the identified objects of utility, once it has inducted a sufficiently good model of the environment to optimize what it is looking for.
Further work might add:
- New information that can change the model of the environment.
- An agent that smoothly updates what it optimizes for in this case.
And further:
- Environments complicated enough that there is real structural ambiguity (e.g., dependence on exact initial conditions of the inference program) about how exactly the utility-related parts are modeled.
- Agents that can optimize through a probability distribution about environments that differ in their identified objects of utility.
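Pulling the ingredients above into one loop, a hedged sketch of the toy agent's cycle might look like the following; the three plug-in components (`induct_model`, `identify_utility`, `choose_action`) are hypothetical placeholders, not existing routines.

```python
from typing import Callable, Dict, List

Model = Dict[str, object]

def toy_agent_step(induct_model: Callable[[List[object]], Model],
                   identify_utility: Callable[[Model], Callable[[Model], float]],
                   choose_action: Callable[[Model, Callable[[Model], float]], str],
                   observations: List[object]) -> str:
    """One cycle of the toy agent: re-induct the environment model from all
    observations so far, re-run the utility-identifying program on the new model
    (so the goal rebinds if the representation has shifted), then optimize."""
    model = induct_model(observations)      # freely inducted environment model
    utility = identify_utility(model)       # e.g. the description-matcher above
    return choose_action(model, utility)    # action maximizing the identified utility

# Degenerate usage with stub components, just to show the wiring:
print(toy_agent_step(
    induct_model=lambda obs: {"diamonds": len(obs)},
    identify_utility=lambda model: (lambda m: float(m["diamonds"])),
    choose_action=lambda model, u: "dig" if u(model) < 3 else "stop",
    observations=["d", "d"]))  # -> 'dig'
```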
A potential agenda for unbounded analysis might be:
- An [ unbounded analysis] showing that a utility-identifying preference framework is a generalization of a [ VNM utility] and can [ tile] in an architecture that tiles a generic utility function.
- A Corrigibility analysis showing that an agent is not motivated to try to cause the universe to be such as to have utility identified in a particular way.
- A Corrigibility analysis showing that the identity and category boundaries of the objects of utility will be treated as a [ historical fact] rather than one lying in the agent's [ decision-theoretic future].
Identifying environmental categories as the causes of labeled sense data
Another potential approach, given a prior transparent enough that we can find causal data inside it, would be to try to identify diamonds as the causes of pictures of diamonds.
[todo: expand]
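A crude sketch of the idea, assuming we have sense episodes hand-labeled by the programmers and a transparent model exposing candidate environmental variables; note that simple agreement-counting like this is only a stand-in for genuine causal identification within the prior:

```python
from typing import Dict, List

def best_explaining_cause(episodes: List[Dict[str, int]],
                          label_key: str = "labeled_diamond") -> str:
    """Among the non-label variables recorded in each episode, return the one
    whose value agrees with the programmer-supplied diamond label most often --
    a crude stand-in for 'identify diamond as the cause of labeled pictures'."""
    candidates = [k for k in episodes[0] if k != label_key]
    def agreement(var: str) -> float:
        return sum(ep[var] == ep[label_key] for ep in episodes) / len(episodes)
    return max(candidates, key=agreement)

# Toy data: the lattice variable tracks the label; the lighting variable does not.
episodes = [
    {"carbon_lattice_present": 1, "bright_lighting": 1, "labeled_diamond": 1},
    {"carbon_lattice_present": 0, "bright_lighting": 1, "labeled_diamond": 0},
    {"carbon_lattice_present": 1, "bright_lighting": 0, "labeled_diamond": 1},
]
print(best_explaining_cause(episodes))  # -> 'carbon_lattice_present'
```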
Security note
[5j Christiano's hack]: If your AI is advanced enough to model distant superintelligences, it's important to note that those superintelligences can make 'the most probable cause of the AI's sensory data' be anything they want, by making a predictable decision to simulate AIs in such a way that your AI lacks the information to distinguish itself from the distant AIs it imagines being simulated.
Ambiguity resolution
Both the description-matching and cause-inferring methods might produce ambiguities. Rather than having the AI optimize for a probabilistic mix over all the matches (as if it were uncertain which match were the true one), it would be better to refer the ambiguity to the programmers (especially if different probable models imply different strategies). This problem shares structure with [ inductive inference with ambiguity resolution] as a strategy for resolving [ unforeseen inductions].
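A minimal sketch of that policy, with an invented score margin standing in for whatever criterion actually distinguishes 'clearly best match' from 'ambiguous':

```python
from typing import Callable, List, Tuple

def resolve_or_query(matches: List[Tuple[str, float]],
                     ask_programmers: Callable[[List[str]], str],
                     margin: float = 0.1) -> str:
    """Given candidate bindings for the utility function, each with a match
    score, accept the top candidate only if it clearly beats the runner-up;
    otherwise refer the ambiguity to the programmers rather than optimizing
    over a probabilistic mix of interpretations."""
    ranked = sorted(matches, key=lambda m: m[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
        return ranked[0][0]
    return ask_programmers([name for name, _ in ranked[:2]])

# Toy usage: two near-tied candidate 'diamond' categories trigger a query.
print(resolve_or_query([("cat_803844", 0.90), ("cat_512077", 0.88)],
                       ask_programmers=lambda names: f"query:{names}"))
```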
[todo: if you try to solve the reflective problem by defining the queries in terms of sense data, you might run into Cartesian problems. if you try to ontologically identify the programmers in terms more general than a particular webcam, so that the AI can have new webcams, the ontology identification problem might reproduce itself on the reflective level. you have to note it down as a dependency either way.]
Multi-level maps
Being able to describe, in purely theoretical principle, a prior over epistemic models that have at least two levels and can switch between them in some meaningful sense would constitute major progress over the present state of the art.
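Purely to make the desideratum concrete - this is a toy data structure, not a proposed formalism - a two-level map would need at least a high-level model, a lower level that is only partially realized, and an explicit bridge between them:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class TwoLevelMap:
    """Toy illustration of the desideratum: a high-level model used for most
    predictions, a lower level that is only partially filled in, and a bridge
    saying which low-level description corresponds to which high-level object."""
    high_level: Dict[str, str]                  # e.g. {"wing": "produces lift"}
    low_level: Dict[str, Optional[List[str]]]   # None = not modeled in detail
    bridge: Dict[str, str]                      # high-level object -> low-level key

    def refine(self, obj: str) -> Optional[List[str]]:
        """Drop to the lower level for one object, if that part of the map has
        actually been realized."""
        return self.low_level.get(self.bridge.get(obj, ""), None)

plane = TwoLevelMap(
    high_level={"wing": "produces lift"},
    low_level={"wing_atoms": None, "fuel_molecules": ["C8H18", "O2"]},
    bridge={"wing": "wing_atoms", "fuel": "fuel_molecules"},
)
print(plane.refine("fuel"))  # -> ['C8H18', 'O2']; the wing stays un-refined
```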
[todo: try this with just two levels. half adders as potential models? requirements: that the lower level be only partially realized rather than needing to be fully modeled; that it can describe probabilistic things; that we can have a language for things like this and a prior over them that gets updated on the evidence, rather than just a particular handcrafted two-level map.]
Implications
[todo: if the programmers can read through updates to the AI's representation fast enough, or if most of the routine ones leave certain levels intact or imply a defined relation between old and new models, then it might be possible to solve this problem programmatically for genies. especially if it's a nonrecursive genie with known algorithms, because then it might have a known representation that might be known not to change suddenly, and be corrigible-by-default while the representation is being worked out. so this is one of the problems more likely to be averted in practice but understanding it does help to see one more reason why You Cannot Just Hardcode the Utility Function By Hand.]
[todo: Hard to solve entire problem because it has at least some entanglement with the full AGI problem.]
The problem of using sensory data to build computationally efficient probabilistic maps of the world, and to efficiently search for actions that are predicted by those maps to have particular consequences, could be identified with the entire problem of AGI. So the research goal of ontology identification is not to publish a complete bounded system like that (i.e., an AGI), but to develop an unbounded analysis of utility rebinding that seems to say something useful specifically about the ontology-identification part of the problem.
Comments
Alexei Andreev
This is probably explained elsewhere, but what's AIXI-tl?
Alexei Andreev
It's not clear to me what point you are making here.
Paul Christiano
I don't see indirect specifications as encountering these difficulties; all of the contenders so far go straight for the throat (defining behavior directly in terms of perceptions) rather than trying to pick out the programmer in the AI's ontology. Even formal accounts of e.g. language learning seem like they will have to go for the throat in this sense (learning the correspondence between language and an initially unknown world, based on perceptions), rather than manually binding nouns to parts of a particular ontology or something like that. So whatever mechanism you used to initially learn what a "programmer" is, it seems like you can use the same one to learn what a programmer is under your new physical theory (or more likely, your beliefs about the referent of "programmer" will automatically adjust with your beliefs about physics, and indeed will be used to help inform your changing beliefs about physics).
The "direct" approaches, that pick out what is valuable directly in the hard-coded ontology of the AI, seem clearly unsatisfactory on other grounds.
Paul Christiano
Six months and several discussions later this still seems like a serious concern (Nick Bostrom seemed to have the same response independently, and to consider it a pretty serious objection).
It really seems like the problem is an artifact of the toy example of diamond-maximization. This "easy" problem is so easy, in a certain sense, that it tempts us to a particular class of simple strategies where we literally specify a model of the world and say what diamond is.
Those strategies seem like an obvious dead end in the real case, and I think everyone is in agreement about that. They also seem like an almost-as-obvious dead end even in the diamond maximization case.
That's fine, but it means that the real justification is quite different from the simple story offered here. Everyone at MIRI I have talked to has fallen back to some variant of this more subtle justification when pressed. I don't know anywhere that the real justification has been fleshed out in any publicly available writing. I would be game for a public discussion about it.
It does seem like there is some real problem about getting agents to actually care about stuff in the real world. This just seems like a very strange way of describing or attacking the problem.