Well, the purpose is to avoid the AGI classifying potential goal fulfillments in a way that, from the user's perspective, is a "false positive". The reason we have to spend so much time thinking about really, really good ways to keep the AGI from guessing positive labels on things we wouldn't label as positive, is that the training data we present to the AI may be ambiguous in some way we don't know about, or many ways we don't know about. Meaning that the AI does not actually have the information to figure out what we meant by looking for the simplest ways to classify the training cases, and instead has to do something very similar to the positively labeled training instances in order to minimize the probability of screwing up.
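One way to make that "stay very close to the positive training instances" idea concrete is a classifier that abstains whenever an input falls outside a small neighborhood of the known positives, rather than generalizing a decision boundary into ambiguous regions. This is only an illustrative sketch, not anything proposed in the conversation; the function names and the distance threshold are hypothetical:

```python
# Illustrative sketch of a "conservative" classifier: it only returns
# "positive" when the input is very close to a positively labeled
# training instance, and abstains everywhere else. A conventional
# classifier would instead fit a boundary and risk confident false
# positives in regions the training data never disambiguated.

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def conservative_predict(x, positives, threshold):
    """Label x 'positive' only if it lies within `threshold` of some
    positively labeled example; otherwise abstain rather than guess."""
    if any(euclidean(x, p) <= threshold for p in positives):
        return "positive"
    return "abstain"

positives = [(1.0, 1.0), (1.2, 0.9)]
print(conservative_predict((1.1, 1.0), positives, 0.3))  # close to training data
print(conservative_predict((5.0, 5.0), positives, 0.3))  # far away: abstain
```

The design choice here is exactly the trade-off under discussion: the abstain region sacrifices coverage (more "false negatives" from the user's perspective) in exchange for not committing to positive labels where the training data was ambiguous.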
I'm pushing back a little on this "classifier that avoids false positives" description because that's what every classifier is in some sense intended to do; you have to be specific about how, or what approach you're taking, in order to say something that means more than just "classifier that is a good classifier".