In the following pages we support the individual objectives of the working group; however, before we go into such detail, it is important to give an overall account of our view of evaluation.
The obvious first question to ask is: why evaluate? We need to be able to say something precise about the merits of different machine techniques, and in particular to compare techniques so that we can tell whether a new technique is better than previous ones. New ideas that cannot be evaluated with existing techniques are usually unpublishable, so we do not know whether they have been developed; they are also discouraged by advisers if proposed.
We want a broader framework for evaluation. It needs to take users into account: their needs, and their actual (not idealised) behaviour. Otherwise the evaluation will be unconnected with the actual use, and so the actual value, of the technology. On the other hand, we also need (because it is our original and continuing motivation) ways of testing new techniques quantitatively and qualitatively.
A framework must relate three issues or features:
(a) user behaviour and performance,
(b) actual information needs, i.e. a study of what tasks people bring to IR and how well they are satisfied, and
(c) machine-oriented performance measures that can be used to compare machine performance.
Real users, and their actual performance as part of the overall interactive system. Modern systems, even more than older ones, are interactive, and the work done by them, i.e. the information actually extracted, depends not only on the machine part of the system but also on the human user. Furthermore, as response times become faster, people do not think carefully before each action (it is not cost-effective for them to do so), so idealising user actions is not even an approximation. Evaluations must include studies of actual user behaviour, and the design of systems must in part focus on the aspects that shape that behaviour, i.e. on the user interface.
Actual user tasks or information needs: the tasks external and prior to the machines we build. Without knowing what these are, and to what extent our designs serve them, all work will be unconnected with application and may be useless in practice. Most IR work on standard "collections" has used queries that were not drawn from observations of queries arising in practice, so we have no information about their relevance to real needs. This is comparable to the early use of fuel consumption figures for automobiles that were unrelated to the driving conditions met in practice, and so essentially irrelevant. Another aspect is that different information needs probably make different demands on a retrieval machine. We need a taxonomy of information needs, so that different techniques can be matched to different needs.
Machine tasks or benchmark measures. These are what is needed for immediate engineering work, as targets for improvement and as measurements of whether improvement has been achieved. These measures are necessarily machine-oriented. Examples of such measures for automobiles would be fuel consumption, top speed, and acceleration from 0 to 60 mph. For IR systems, precision and recall are examples of long-established measures.
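For concreteness, the following is a minimal sketch (in Python) of the standard set-based definitions of these two measures for a single query with binary relevance judgments; the function name and the representation of results and judgments as sets of document identifiers are our own illustrative choices, not part of any particular test collection or benchmark.

    def precision_recall(retrieved, relevant):
        # retrieved: set of document ids returned by the system for one query
        # relevant:  set of document ids judged relevant for that query
        if not retrieved or not relevant:
            # precision is undefined for an empty result set, and recall for an
            # empty relevant set; we report 0.0 in these degenerate cases
            return 0.0, 0.0
        hits = len(retrieved & relevant)   # relevant documents actually retrieved
        precision = hits / len(retrieved)  # proportion of retrieved documents that are relevant
        recall = hits / len(relevant)      # proportion of relevant documents that were retrieved
        return precision, recall

    # Example: 4 documents retrieved, 10 judged relevant, 2 of them found.
    p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7, 8, 9, 10, 11, 12})
    print(p, r)   # 0.5 0.2

Note that both numbers depend entirely on the relevance judgments supplied: if, as discussed above, the queries and judgments are not drawn from real information needs, the resulting figures are as uninformative as unrealistic fuel consumption figures.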
We see, then, that all three of these elements must be dealt with in an evaluation framework, even though most individual pieces of measurement work will deal with only one or two of them at a time.