Themes
A balanced approach (system vs. user)
Evaluation of new types of interactive IR systems has to cope with
several methodological problems. The combination of factors or models of
retrieval, user, task, interaction, and interfaces in an integrated system
complicates the evaluation of these systems, since they are interdependent and
cannot easily be isolated or controlled for evaluation. We assume that the
design principles and methods that are used to test more traditional IR systems
cannot be adopted directly for our purposes. Evaluation criteria and measures
must instead be redefined where possible, or newly developed, combining
quantitative and qualitative evaluation criteria in a multi-dimensional
evaluation. This can only be done with regard to the specific type of system
and the conceptual approaches in question. Thus, the methodological
considerations themselves must become one important part of the evaluation
phase. The evaluation design decisively depends on the main evaluation goal,
e.g., seeking evidence about how the major factors of a system interact with
each other and influence the user's conception of the system and his or her
performance and satisfaction, as opposed to a comparative evaluation of the
performance of different retrieval engines or interfaces.
* How does the system support the user in formulating a query? (For instance, is the retrieval interface easy to comprehend and are there facilities for inspecting the scope and the vocabulary of a collection?)
* Are the retrieval results presented in such a way that the user can easily judge the relevance of the information units to his/her information need?
* Is the user supported in using the advanced facilities (e.g. relevance feedback) offered by the system?
* How well can the IR system be integrated into the working environment of the user?
Traditional evaluation measures do not address these questions, since they only consider the query and the results achieved. However, a system that offers appropriate answers to the above questions but shows poorer precision values might overall be more useful than a system with more efficient retrieval and indexing techniques that offers little support to the user.
Recall and precision measures lose some of their importance in advanced IR systems that provide links between information units and offer navigation and browsing facilities. In such systems a query might just serve to find some suitable starting points for exploring the information space in search of useful information. Moreover, the link between traditional measures of effectiveness and optimal retrieval is not clear in these contexts and will have to be re-examined.
Interaction affects dynamic evaluation
The classical model of IR system evaluation, initiated by the Cranfield
experiments and currently manifest in the TREC programme, demonstrates very
clearly its origins in the era of batch retrieval systems. The system is seen
as taking well-defined input (a query or topic) and producing well-defined
output (a list of documents). However, with modern interactive systems, that
input-output model is clearly becoming more and more inadequate as a
representation of the IR situation. A dominant problem in current IR research
is the question of what model or models we need instead. One possible source of
ideas and methods is work elsewhere (outside IR) on evaluating the HCI
characteristics of systems. However, this work suffers from two limitations,
at least as regards its applicability to IR.
* HCI evaluation is really in its infancy, certainly very much less mature than the traditional IR evaluation model.
* HCI evaluation tends to concentrate on very immediate aspects of the communication process, and not to consider interaction in the context of the wider task that the user is trying to pursue.
There has, however, been some work on user information-seeking behaviour and users' perceptions of information sources. This has provided us with strong evidence for the importance of this context, in IR at least, in understanding user-system interaction. Thus what seems to be needed is a paradigm or model for the evaluation of interactive IR systems, drawing from all these sources.
Another source of ideas is the Case Based Reasoning (CBR) approach to IR. In order to adapt a former retrieval session to a current one, we have to know how the former session developed. The well-known precision and recall rates, used to evaluate the global results of a given search, do not give an adequate account of how the search has progressed. Furthermore, when searching images, motion pictures, or sounds, the relevance judgements are not context independent and depend at least on the user and on the documents seen earlier in the retrieval session. So the evaluation of a retrieval session must be done on-line (i.e., traditional evaluation protocols using test collections are not suitable). A two-stage evaluation method is used:
Evaluate each retrieval step as the aggregation of three criteria,
* a) user satisfaction (we can measure here the suitability of some interface parameters by asking precise questions of the user),
* b) quantitative evaluation of the step results (precision, or something more sophisticated, for instance the notions of "acceptable ranking strategy" and "user preference" during the feedback step),
* c) evaluation of the query reformulation
then, evaluate the session as a whole.
In CBR it is more appropriate to evaluate how the search has progressed by analysing the variation rate of performance during a session: for example, a monotonically increasing performance curve denotes a good session while a decreasing curve denotes that a problem happened at least at the step corresponding to the beginning of the fall. A bad session involves a failure at some step, i.e., no new documents are found or all documents are rejected by the user.
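The two-stage method and the curve analysis described above can be sketched as follows. This is only an illustration under stated assumptions: the names `StepScores`, `step_score`, `first_drop` and `evaluate_session` are hypothetical, each criterion is assumed to be scored in [0, 1], the aggregation is assumed to be a weighted mean with equal weights, and a non-decreasing curve is treated as a good session. None of these choices is prescribed by the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StepScores:
    """Scores for one retrieval step, each assumed to lie in [0, 1]."""
    satisfaction: float     # a) user satisfaction, e.g. from a questionnaire
    result_quality: float   # b) quantitative measure of the results, e.g. precision
    reformulation: float    # c) evaluation of the query reformulation


def step_score(s: StepScores,
               weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Stage 1: aggregate the three criteria (weighted mean is an assumption)."""
    w_a, w_b, w_c = weights
    return w_a * s.satisfaction + w_b * s.result_quality + w_c * s.reformulation


def first_drop(scores: List[float]) -> Optional[int]:
    """Index of the first step scoring lower than its predecessor, or None."""
    for i in range(1, len(scores)):
        if scores[i] < scores[i - 1]:
            return i
    return None


def evaluate_session(steps: List[StepScores]) -> dict:
    """Stage 2: evaluate the session as a whole from its performance curve."""
    scores = [step_score(s) for s in steps]
    drop = first_drop(scores)
    return {
        "scores": scores,
        "monotonic": drop is None,   # no fall anywhere in the session
        "first_drop_step": drop,     # where the problem is first visible
    }
```

A session whose `first_drop_step` is not `None` would then be inspected at that step, for example to check whether no new documents were found or all documents were rejected by the user.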
There is no doubt that work from both the HCI and CBR communities is of importance to the development of new evaluation methods for interactive IR systems. It is also clear that any advance will have an impact on the evaluation methodology in those disciplines since in each they are relatively underdeveloped.
The move from static to dynamic evaluation
Dynamic evaluation will thus investigate the definition of new criteria
that will track the dynamic behaviour of both the system and the users within
retrieval sessions, and criteria that estimate the importance of users' efforts
compared to their satisfaction level throughout sessions. The use of such
criteria would in turn provide an effective evaluation of the influence of
components such as search strategies, retrieval interfaces, and relevance
feedback techniques, all of which play a major role within an interactive
retrieval process. It also seems clear that, through extensive experiments,
these evaluations will ultimately help to improve the individual design of
these components, and will facilitate an optimal integration of their
individual contributions. An important issue here will be to investigate how
these new criteria will motivate the specification of new performance measures.
Problems of evaluation of interactive systems also require a careful methodological discussion and design decisions. An appropriate evaluation design, for example, demands an identification of realistic problems or tasks of typical users which they are to solve during interaction with the system. Real tasks, however, are often complex and, also, depend on the everyday working context and environment. Often they cannot be reduced to a single realistic query, but are embedded in more complex task-settings (e.g., looking for "interesting" new developments in a certain research area). Besides, the user's conception of the information problem, queries and search strategies changes through interaction. One cannot assume that the initial problem holds through the entire interaction and that there is an optimal solution to the problem that may be reached under ideal conditions (after several iterations). Traditional "objective" measures like recall, precision, number of errors, time spent, etc. might be used (in addition to qualitative measures), but always w.r.t. a certain stage of the interaction which is characterized by a certain (changed) information need, task, or goal of the user. The dynamic process of the user-system interaction must be taken into account in the evaluation design.
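As a rough illustration of computing traditional measures with respect to a stage of the interaction, the sketch below evaluates precision and recall separately for each stage, using the relevance set that holds at that stage rather than one fixed set for the whole session. The function names and the representation of a stage as a `(retrieved, relevant)` pair are assumptions made for this example only.

```python
from typing import List, Set, Tuple

# A stage is assumed to be a pair: (documents retrieved at this stage,
# documents relevant to the information need as it stands at this stage).
Stage = Tuple[List[str], Set[str]]


def precision_recall(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall for one stage of the interaction."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate_stages(stages: List[Stage]) -> List[Tuple[float, float]]:
    """Compute (precision, recall) per stage, each against its own relevance set."""
    return [precision_recall(retrieved, relevant) for retrieved, relevant in stages]


# Example: the information need shifts between the first and second stage,
# so the set of relevant documents changes as well.
session = [
    (["d1", "d2"], {"d1"}),
    (["d3", "d4"], {"d3", "d4", "d5"}),
]
per_stage = evaluate_stages(session)
```

The point of the design is that no single relevance set is assumed to hold through the entire interaction; qualitative measures would still have to be collected alongside these figures.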
New media affect evaluation
One complication in evaluating retrieval of multi-media material is that
even basic questions, for example, what retrieval tasks people actually want to
carry out, are highly dependent on the particular media represented in the
system and on special skills of the users. Do people want to retrieve pictures
on the basis of colour, texture, light intensity, or some other physical
property? Are people interested in working with an intermediary to sketch the
'ideal house', for which the system returns interesting candidates that are for
sale in the neighbourhood? Do people wish to access video by content, for
example, to check if Darth Vader said 'light sabre' or 'life saver'?
Technically, we can offer users tools which allow for these sorts of tasks to
be carried out but we seem less than confident that users are interested in
actually carrying them out. At present, the crucial questions in multimedia
evaluation may be upstream, when requirements are formulated. Progress in
multimedia development may be best fostered by an emphasis on the evaluation of
scenarios of use, rather than by the evaluation of finished systems.
A basic problem in multimedia systems is the nature of a 'document' and a 'query'. Documents are usually richly structured and linked in multi-media repositories. Queries are often a mix of querying, navigation, consulting displays, note-taking, and reformulation. That 'query' and 'document' are more difficult to define in multimedia raises the question of the relevance of documents to user need as expressed in queries. Ultimately, we need to reconsider the whole question of evaluation of IR systems, which may result in new framework(s) for evaluating complete systems, the individual components thereof, and the interactions between the components.
Another crucial feature of multimedia is that normally any one unit of material plays a rhetorical role in a larger context of exposition. An example is where a user failed to solve a programming problem efficiently because, although a search got him to the right spot, he failed to read necessary prerequisite material. The system failed to reveal the rhetorical relations between one unit of material and the others within a larger substructure. So perhaps an important kind of evaluation is not how fast a retrieval engine can take a person to a unit of relevant material, but rather how well a retrieval engine can reveal the relations between one unit of relevant material and its context, or how well a retrieval engine can present an alternative view of some set of potentially relevant units. That is, how well do search facilities and electronic exposition interpenetrate?
In interactive searches, it is important how the documents are presented to the user and how fast the users can perceive their contents. Hence, the techniques that support a fast perception of the retrieved documents are one of the subjects of the evaluation method. The evaluation of this particular aspect is anything but trivial because documents of different media are perceived differently and their perception must be supported in different ways. The same applies to the identification of relevant parts of multimedia documents for relevance feedback.
New information retrieval systems face the problem of providing access to non-textual information, that is, image, video, speech, and so on. In such systems, handling this access is more complex than in text-based systems. Can such systems be evaluated in the same way as text-based information retrieval systems? In systems managing non-textual data, we probably have to reconsider the usual assumption that recall and precision values are a good way to evaluate systems. For instance, low precision in an information retrieval system managing images is not necessarily a sign of poor system quality, since a user sees immediately whether an image fits his or her need. With time-based media like audio and video data, it seems important to evaluate precisely the granularity of query results.
A crucial point for the evaluation of text-based systems is the experts' judgement of the relevance of documents. This notion of relevance has to be refined for new media documents. For instance, with images, relevance can be based on subjective elements, on physical characteristics (colours, ...), or on objects in the image scene, and the evaluation of systems has to handle these different points of view for relevance judgements.
Evaluation methods for smaller groups
The evaluation of retrieval methods is traditionally based on the
pooling method. Only big organizations, however, can afford the pooling method.
Little research has been done on "evaluation on a shoestring", which would allow
Small and Medium Enterprises (SMEs) to evaluate dedicated retrieval methods for
special applications.
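For readers unfamiliar with pooling, a minimal sketch of the usual idea follows: the top-ranked documents from each participating system's run are merged into a pool, and only the pooled documents are judged for relevance by assessors. The function name `build_pool`, the run identifiers, and the pool depth used here are illustrative assumptions, not details taken from the text.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Merge the top `depth` documents of every run into one judgement pool."""
    pool: Set[str] = set()
    for ranking in runs.values():
        # Only the highest-ranked documents of each run enter the pool.
        pool.update(ranking[:depth])
    return pool


# Hypothetical ranked runs from two systems for a single topic.
runs = {
    "system_A": ["d1", "d2", "d3"],
    "system_B": ["d2", "d4", "d5"],
}
pool = build_pool(runs, depth=2)
```

The size of the pool grows with the number of runs and with the depth, and every pooled document needs a human relevance judgement for every topic, which is where the cost that smaller organizations cannot afford arises.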
New projects for new evaluation frameworks
The members of Mira are involved in a number of other programmes which will
support the creation of new RTD projects in this area. For example, we have
close connections with two working groups, started under framework 3, MIRO and
FADIVA. We are well represented on the network of excellence IDOMENEUS. We are
actively involved in FERMI, FIDE II, IMIS. A number of us are on the academic
panel for the Information Engineering sector under Telematics. Thus we are in a
good position to propose new projects to support further research and
development.