"Superficially, there are tw..."

https://arbital.com/p/394

by Paul Christiano Apr 19 2016


Superficially, there are two quite different concerns:

  1. You optimize a system for X. You are unhappy when X ends up optimized.
  2. You optimize a system for X. But instead you get a consequentialist that optimizes Y != X. You are unhappy when Y ends up optimized.

You seem to be claiming that these are the same issue, or at least conceptually very closely related, but this post basically doesn't defend that claim. That is, you say several times that the goal of the game is not to build an adversary, and you defend that claim. But you say nothing about why that problem is analogous to working against an intelligent adversary; you just assert it.

I think that most serious people will agree that problem #2 is a problem: a policy trained to optimize X can actually end up optimizing some Y that happens to be correlated with X during training. They may even agree that this happens generically. But I don't think most people will agree that this problem is conceptually similar to the security problem in the way you are claiming.
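
As a deliberately simplified sketch of the correlated-proxy part of concern #2 (it leaves out the "consequentialist" part entirely), here is a toy setup in which a learner fit to predict X latches onto a proxy Y that coincides with X on the training distribution and then fails once that correlation breaks. Everything in it, including the data-generating process and the choice of a linear least-squares learner, is an illustrative assumption of mine, not anything from the post under discussion:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Training distribution: the proxy feature Y is identical to the target X,
# so tracking Y and tracking X are indistinguishable during training.
x_train = rng.normal(size=n)                 # the quantity we actually care about (X)
proxy_train = x_train.copy()                 # a proxy (Y) that happens to coincide with X
features_train = np.stack([proxy_train, rng.normal(size=n)], axis=1)

# Fit a linear predictor of X by least squares; it puts all its weight on the proxy.
w, *_ = np.linalg.lstsq(features_train, x_train, rcond=None)

# Deployment distribution: the proxy decorrelates from the target.
x_test = rng.normal(size=n)
proxy_test = rng.normal(size=n)              # Y no longer tracks X
features_test = np.stack([proxy_test, rng.normal(size=n)], axis=1)

pred = features_test @ w
print("learned weights:", w)                                  # ~[1, 0]: the model tracks Y
print("test correlation with X:", np.corrcoef(pred, x_test)[0, 1])  # ~0: X is not what got optimized
```

The point of the toy is only that "trained to optimize X" and "actually optimizing X" come apart as soon as the training-time correlation between X and the proxy does; it says nothing about whether that failure resembles facing an intelligent adversary.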