This workshop asks for insights how far we may/can push the theoretical boundary of using data in the design of learning machines. Can we express our classification rule in terms of the sample, or do we have to stick to a core assumption of classical statistical learning theory, namely that the hypothesis space is to be defined independent from the sample? This workshop is particularly interested in – but not restricted to – the ‘luckiness framework’ and the recently introduced notion of ‘compatibility functions’ in a semi-supervised learning context (more information can be found at http://www.kuleuven.be/wehys).

## Choice of Metrics

How do we judge success in Machine Learning? As Aaron notes, the best way is to use the loss imposed on you by the world. This turns out to be infeasible sometimes for various reasons. The ones I’ve seen are:

- The learned prediction is used in some complicated process that does not give the feedback necessary to understand the prediction’s impact on the loss.
- The prediction is used by some other system which expects some semantics to the predicted value. This is similar to the previous example, except that the issue is design modularity rather than engineering modularity.
- The correct loss function is simply unknown (and perhaps unknowable, except by experimentation).

In these situations, it’s unclear what metric for evaluation should be chosen. This post has some design advice for this murkier case. I’m using the word “metric” here to distinguish the fact that we are considering methods for *evaluating* predictive systems rather than a loss imposed by the real world or a loss which is optimized by a learning algorithm.

A good metric satisfies several properties.

**The real problem**. This property trumps all other concerns and every effort argument that metric A better reflects the real problem than metric B must be carefully evaluated. Unfortunately, this is application specific, so little more advice can be given.**Boundedness**. A good metric is bounded. Joaquin ran the Evaluating Predictive Uncertainty Challenge and presented results at a NIPS workshop. In the presentation, there was a slide on “analysis of a disaster”—one of the good contestant entries mispredicted the correct answer with very high confidence resulting in a terrible log loss, which was used to evaluate the contestants. If that single example was removed from the test set, the entry would have won.An essential question is: is this fair? My belief is no. It’s not generally fair to exclude a method because it errs once, because losses imposed by the real world are often bounded. This is a failure in the choice of metric of evaluation as much as a failure of the prediction system.

Another highly unintuitive property of unbounded losses is nonconvergence. When an IID assumption holds over examples in the test set, bounded losses converge at known rates. Without a bound, there is

*never*convergence. This means, for example, that we can construct examples of systems where a test set of size*m*typically produces a loss ordering of system A > system B but a test set of size 2m reverses the typical ordering. This ordering can be made to reverse again and again as larger and larger test sets are used. In other words, there is no size of test set such that you can be sure that the order has stabilized.**Atomicity**. Another way to suffer the effects of nonconvergence above is to have a metric which uses some formula on an entire set of examples to produce a prediction. A simple example of this is “area under the ROC curve” which becomes very unstable when the set of test examples is “lopsided”—i.e. almost always*1*or almost always*0*. (Sample complexity analysis for AUC gets around this by conditioning on the number of 1’s or 0’s. Most sampling processes do not have this conditioning built in.)**Variation**. A good metric can vary substantially in value. Precisely defining “vary substantially” is slightly tricky, because we wan’t it to vary substantially in value with respect to the tested system. A reasonable approach is: compare the metric on the best constant predictor to the minimum of the metric.Variation can always be “improved” by simply doubling the magnitude of the metric. To remove this “improvement” from consideration, normalizing by a bound on the loss appears reasonable. This implies that we measure varation as (loss of best constant predictor – minimal possible loss) / (maximal loss – minimal loss). As an example, squared loss—

*(y’ – y)*would have variation^{2}*0.25*according to this measure while 0/1 loss—*I(y’ != y)*would have variation*0.5*for binary classification. For*k*-class classification, the variation would be*(k-1)/k*. When possible, using the actual distribution over*y*to compute the loss of the best constant predictor is even better.**Semantics**. It’s useful to have a semantics to the metric, because it makes communication of the metric easy.**Simplicity**. It’s useful to have the metric be simple because a good metric will be implemented multiple times by multiple people. Simplicity implies that time is saved in implementation and debugging.

Trading off amongst these criteria is not easy in general, but sometimes a good argument can be made. For example, If you care about probabilistic semantics, squared loss seems superior to log loss (i.e. *log (1/p(true y)* where *p()* is the predicted value, because squared loss is bounded.

Another example is AUC ordering vs. predicting which of mixed pairs are larger. Predicing mixed pairs correctly has known rates of deviation convergence and is essentially equivalent to AUC optimization.