[summary: This page contains a thorough list of all value-alignment subjects.]
Safety paradigm for advanced agents
- Advanced Safety
- Advanced agents
- AI safety mindset
- Methodology of foreseeable difficulties
- Context Change problems ("Treacherous problems"?)
- Methodology of unbounded analysis
- Priority of astronomical failures (those that destroy error recovery or are immediately catastrophic)
Foreseen difficulties
- Value identification
- Edge instantiation
- Unforeseen maximums
- Ontology identification
- Cartesian boundary
- Human identification
- Inductive value learning
- Ambiguity-querying
- Moral uncertainty
- Indifference
- Patch resistance
- Nearest Unblocked Neighbor
- Corrigibility
- Anapartistic reasoning
- Programmer deception
- Early conservatism
- Reasoning under confusion
- User maximization / Unshielded argmax
- Hypothetical user maximization
- Genie theory
- Limited AI
- Weak optimization
- Safe optimization measure (such that we are confident it has no Edge that secretly optimizes more)
- Factoring of an agent by stage/component optimization power
- 'Checker' smarter than 'inventor / chooser'
- 'Checker' can model humans, 'strategizer' cannot
- Transparency
- Domain restriction
- Effable optimization (opposite of cognitive uncontainability; uses only comprehensible strategies)
- Minimal concepts (simple, though not necessarily the simplest, containing the fewest whitelisted strategies)
- Weak optimization
- Genie preferences
- Low-impact AGI
- Minimum Safe AA (just flip the off switch and shut down safely)
- Safe impact measure
- Armstrong-style permitted output channels
- Shutdown utility function
- Oracle utility function
- Safe indifference?
- Online checkability
- Reporting without programmer maximization
- Do What I Know I Mean
- Low-impact AGI
- Superintelligent security (all subproblems placing us in adversarial context vs. other SIs)
- Bargaining
- Non-blackmailability
- Secure counterfactual reasoning
- First-mover penalty / epistemic low ground advantage
- Division of gains from trade
- Epistemic exclusion of distant SIs
- Distant superintelligences can coerce the most probable environment of your AI
- Breaking out of hypotheses
- 'Philosophical' problems
- One True Prior
- Pascal's Mugging / leverage prior
- Second-orderness
- Anthropics
- How would an AI decide what to think about QTI?
- Mindcrime
- Nonperson predicates (and unblocked neighbor problem)
- Do What I Don't Know I Mean - CEV
- Philosophical competence - Unprecedented excursions
Reflectivity problems
- Vingean reflection
- Satisficing / meliorizing / staged maximization / ?
- Academic agenda: view current algorithms either as finding a global logically-uncertain maximum, or as teleporting to the current maximum, surveying, updating on a logical fact, and teleporting to the new maximum (see the toy sketch after this list).
- Logical decision theory
- Naturalized induction
- Benja: Investigate multi-level representation of DBNs (with categorical structure)
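To make the satisficing / meliorizing / staged-maximization distinction above concrete, here is a toy sketch (not from the original list; the names `estimate`, `meliorize`, and `global_argmax` are purely illustrative). It contrasts jumping straight to an estimated global maximum with a meliorizing policy that only abandons its incumbent plan when an alternative is estimated to be strictly better, with "logical uncertainty" crudely stood in for by estimation noise that shrinks as more facts arrive.

```python
# Toy sketch (assumption-laden, not from the original page): contrasting a
# one-shot global argmax with a meliorizing / staged-maximization policy.
import random

CANDIDATES = {"plan_a": 3.0, "plan_b": 5.0, "plan_c": 4.2}  # true (unknown) scores

def estimate(plan, facts_seen):
    """Noisy estimate of a plan's score; noise shrinks as 'logical facts' accumulate."""
    noise = random.gauss(0, 2.0 / (1 + facts_seen))
    return CANDIDATES[plan] + noise

def global_argmax(facts_seen):
    """'Teleport' straight to the estimated global maximum in one step."""
    return max(CANDIDATES, key=lambda p: estimate(p, facts_seen))

def meliorize(incumbent, facts_seen):
    """Staged maximization: keep the incumbent plan unless some alternative
    is estimated to be strictly better after updating on the new facts."""
    best, best_score = incumbent, estimate(incumbent, facts_seen)
    for plan in CANDIDATES:
        score = estimate(plan, facts_seen)
        if score > best_score:
            best, best_score = plan, score
    return best

if __name__ == "__main__":
    random.seed(0)
    current = "plan_a"                    # start from an arbitrary incumbent
    for facts_seen in range(5):           # each step: update on one more 'fact'
        current = meliorize(current, facts_seen)
        print(facts_seen, current, global_argmax(facts_seen))
```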
Foreseen normal difficulties
- Reproducibility
- Oracle boxes
- Triggers
- Ascent metrics
- Tripwires
- Honeypots
General agent theory
- Bounded rational agency
- Instrumental convergence
Value theory
- Orthogonality Thesis
- Complexity of value
- Complexity of object-level terminal values
- Incompressibilities of value
- Bounded logical incompressibility
- Terminal empirical incompressibility
- Instrumental nonduplication of value
- Economic incentives do not encode value
- Selection among advanced agents would not encode value
- Strong selection among advanced agents would not encode value
- Selection among advanced agents will be weak.
- Fragility of value
- Metaethics
- Normative preferences are not compelling to a paperclip maximizer
- Most 'random' stable AIs are like paperclip maximizers in this regard
- It's okay for valid normative reasoning to be incapable of compelling a paperclip maximizer
- Thick definitions of 'rationality' aren't part of what gets automatically produced by self-improvement
- Alleged fallacies
- Alleged fascination of the One True Moral Command
- Alleged rationalization of user-preferred options as formal-criterion-maximal options
- Alleged metaethical alief that value must be internally morally compelling to all agents
- Alleged alief that an AI must be stupid to do something inherently dispreferable
Larger research agendas
- Corrigible reflective unbounded safe genie
- Bounding the theory
- Derationalizing the theory (e.g. for a neuromorphic AI)
- Which machine learning systems do and don't behave like the corresponding ideal agents.
- Normative Sovereign
- Approval-based agents
- Mindblind AI (cognitively powerful in physical science and engineering, weak at modeling minds or agents, unreflective)
Possible future use-cases
- A carefully designed bounded reflective agent.
- An overpowered set of known algorithms, heavily constrained in what is authorized, with little recursion.
Possible escape routes
- Some cognitively limited task which is relatively safe to carry out at great power, and resolves the larger problem.
- Newcomers can't invent these well because they don't understand what counts as a cognitively limited task (e.g., "Tool AI" suggestions).
- General cognitive tasks that seem boxable and resolve the larger problem.
- Can you save the world by knowing which consequences of ZF a superintelligence could prove? It's unusually boxable, but what good is it?
Background
- Intelligence explosion microeconomics
- Civilizational adequacy/inadequacy
Strategy
- Misleading Encouragement / context change / treacherous designs for naive projects
- Programmer prediction & infrahuman domains hide complexity of value
- Context change problems
- Problems that only appear in advanced regimes
- Problem classes that seem debugged in infrahuman regimes and suddenly break again in advanced regimes
- Methodologies that only work in infrahuman regimes
- Programmer deception
- Academic inadequacy
- 'Ethics' work neglects technical problems that need the longest serial research times and fails to give priority to astronomical failures over survivable small hits, but 'ethics' work has higher prestige, higher publishability, and higher cognitive accessibility
- Understanding of big technical picture currently very rare
- Most possible funding sources cannot predict for themselves what might be technically useful in 10 years
- Many possible funding sources may not regard MIRI as trusted to discern this
- Noise problems
- Ethics research drowns out technical research
- And provokes counterreaction
- And makes the field seem nontechnical
- Naive technical research drowns out sophisticated technical research
- And makes problems look more solvable than they really are
- And makes tech problems look trivial, therefore nonprestigious
- And distracts talent/funding from hard problems
- Bad methodology louder than good methodology
- So projects can appear safety-concerned while adopting bad methodologies
- Future adequacy counterfactuals seem distant from the present regime
- (To classify)
- Coordinative development hypothetical
Comments
Alexei Andreev
Ideally we shouldn't have pages like this. It means that the hierarchy feature failed. Is this just meant to be temporary? Or do you foresee this as a permanent page?
Eliezer Yudkowsky
I think one will often still need 'introductory' or 'tutorial' type pages that walk through the hierarchy as English text, but this exact page was something I whipped up during the recent Experimental Research Retreat as an alternative to just dumping the info and because I thought I might start filling it in as Arbital pages.
Anna Salamon
I'm finding this page helpful. Alexei, does your theory think I shouldn't be?
Alexei Andreev
I definitely think something like this should exist and will be helpful, but I think Arbital should be able to generate something like this automatically. Until it can, we are stuck doing it manually.
Expanding all children in the Children tab on the AI alignment page achieves something similar, but not quite as clean.
Mike Johnson
Within the "Value Theory" section, I'd propose two subpoints:
Unity of Value Thesis
Necessity of Physical Representation
The 'Unity of Value Thesis' is simply what we get if the Complexity of Value Thesis is wrong. And it could be wrong; we just don't know. For what this could look like, see e.g. https://qualiacomputing.com/2016/11/19/the-tyranny-of-the-intentional-object/
'Necessity of Physical Representation' refers to the notion that ultimately, a proper theory of value must compile to physics. We are made from physical stuff, and everything we interact with and value is made from the same physical stuff, and so ethics ultimately is about how to move & arrange the physical stuff in our light-cone. If a theory of value does not operate at this level, it can't be a final theory of value. See e.g., Tegmark's argument here: https://arxiv.org/abs/1409.0813