Mira - Themes

Evaluation Frameworks for Interactive Multimedia Information Retrieval Applications

Themes

A balanced approach (system vs. user)
Evaluation of new types of interactive IR systems has to cope with several methodological problems. Such a system integrates factors or models of retrieval, user, task, interaction, and interface; because these are interdependent and cannot easily be isolated or controlled, evaluating the system is complicated. We assume that the design principles and methods used to test more traditional IR systems cannot be adopted directly for our purposes. Rather, evaluation criteria and measures have to be redefined where possible, or newly developed, combining quantitative and qualitative criteria in a multi-dimensional evaluation. This can only be done with regard to the specific type of system and the conceptual approaches in question. Thus, the methodological considerations themselves must become an important part of the evaluation phase. The evaluation design depends decisively on the main evaluation goal: for example, seeking evidence about how the major factors of a system interact with each other and influence the user's conception of the system and his/her performance and satisfaction, as opposed to a comparative evaluation of the performance of different retrieval engines or interfaces.

* How does the system support the user in formulating a query? (For instance, is the retrieval interface easy to comprehend and are there facilities for inspecting the scope and the vocabulary of a collection?)

* Are the retrieval results presented in such a way that the user can easily judge the relevance of the information units to his/her information need?

* Is the user supported in using the advanced facilities (e.g. relevance feedback) offered by the system?

* How well can the IR system be integrated into the working environment of the user?

Traditional evaluation measures do not address these questions, since they only consider the query and the results achieved. However, a system that offers appropriate answers to the above questions but shows poorer precision values might overall be more useful than a system with more efficient retrieval and indexing techniques that offers only little support to the user.

Recall and precision measures lose some of their importance in advanced IR systems that provide links between information units and offer navigation and browsing facilities. In such systems a query might just serve to find some suitable starting points for exploring the information space in search of useful information. Moreover, the link between traditional measures of effectiveness and optimal retrieval is not clear in these contexts and will have to be re-examined.

Interaction affects dynamic evaluation
The classical model of IR system evaluation, initiated by the Cranfield experiments and currently manifest in the TREC programme, demonstrates very clearly its origins in the era of batch retrieval systems. The system is seen as taking well-defined input (a query or topic) and producing well-defined output (a list of documents). With modern interactive systems, however, that input-output model is clearly becoming more and more inadequate as a representation of the IR situation. A dominant problem in current IR research is the question of what model or models we need instead. One possible source of ideas and methods is work elsewhere (outside IR) on evaluating the HCI characteristics of systems. However, this work suffers from two limitations, at least as regards its applicability to IR:

* HCI evaluation is really in its infancy, certainly very much less mature than the traditional IR evaluation model.

* HCI evaluation tends to concentrate on very immediate aspects of the communication process, and not to consider interaction in the context of the wider task that the user is trying to pursue.

There has, however, been some work on user information-seeking behaviour and users' perceptions of information sources. This has provided us with strong evidence for the importance of this context, in IR at least, in understanding user-system interaction. Thus what seems to be needed is a paradigm or model for the evaluation of interactive IR systems, drawing from all these sources.

Another source of ideas is the Case Based Reasoning (CBR) approach to IR. In order to adapt a former retrieval session to a current one, we have to know how the former session developed. The well-known precision and recall rates, used to evaluate the global results of a given search, do not give an adequate account of how the search has progressed. Furthermore, when searching images, motion pictures, or sounds, relevance judgements are not context independent: they depend at least on the user and on the documents seen earlier in the retrieval session. So the evaluation of a retrieval session must be done on-line (i.e., traditional evaluation protocols using test collections are not suitable). A two-stage evaluation method is used:

Evaluate each retrieval step as the aggregation of three criteria,

* a) user satisfaction (we can measure here the suitability of some interface parameters by asking precise questions of the user),

* b) quantitative evaluation of the step results (precision, or something more sophisticated, for instance the notions of "acceptable ranking strategy" and "user preference" during the feedback step),

* c) evaluation of the query reformulation

then, evaluate the session as a whole.
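
The two-stage method above can be sketched in code. This is only an illustration under stated assumptions: the text does not say how the three criteria are aggregated or how the session score is derived, so the weighted mean per step and the plain mean per session used here are hypothetical choices, as are the names StepScores, step_score and session_score and the convention that every score lies in [0, 1].

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class StepScores:
    """Scores for one retrieval step, each assumed to lie in [0, 1]."""
    satisfaction: float    # a) answers to precise questions put to the user
    result_quality: float  # b) e.g. precision of the step results
    reformulation: float   # c) quality of the query reformulation


def step_score(step: StepScores, weights=(1.0, 1.0, 1.0)) -> float:
    """Stage 1: aggregate the three criteria (assumption: weighted mean)."""
    values = (step.satisfaction, step.result_quality, step.reformulation)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def session_score(steps: list[StepScores]) -> float:
    """Stage 2: evaluate the session as a whole (assumption: mean of steps)."""
    return mean(step_score(s) for s in steps)
```

Keeping the per-step scores, rather than only the session score, preserves the information about how the search progressed.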

In CBR it is more appropriate to evaluate how the search has progressed by analysing the rate of variation of performance during a session: for example, a monotonically increasing performance curve denotes a good session, while a decreasing curve indicates that a problem occurred at least at the step where the fall begins. A bad session involves a failure at some step, i.e., no new documents are found or all documents are rejected by the user.
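
A minimal sketch of this kind of session analysis follows. It assumes that a per-step performance value is already available as a list of numbers. The function names are hypothetical, and reporting the first step that scores lower than its predecessor is one possible reading of "the beginning of the fall".

```python
from typing import Optional, Sequence


def first_fall(curve: Sequence[float]) -> Optional[int]:
    """Return the index of the first step that scores lower than the
    step before it, or None if the curve never decreases."""
    for i in range(1, len(curve)):
        if curve[i] < curve[i - 1]:
            return i
    return None


def step_failed(new_documents: int, accepted_documents: int) -> bool:
    """A step fails if no new documents were found or the user
    rejected every document that was shown."""
    return new_documents == 0 or accepted_documents == 0
```

For example, first_fall([0.2, 0.5, 0.3, 0.4]) points at step index 2, the step at which performance first drops.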

There is no doubt that work from both the HCI and CBR communities is of importance to the development of new evaluation methods for interactive IR systems. It is also clear that any advance will have an impact on the evaluation methodology in those disciplines since in each they are relatively underdeveloped.

The move from static to dynamic evaluation
Dynamic evaluation will thus investigate the definition of new criteria that track the dynamic behaviour of both the system and the users within retrieval sessions, and criteria that estimate the importance of users' efforts compared to their satisfaction level throughout sessions. The use of such criteria would in turn provide an effective evaluation of the influence of components such as search strategies, retrieval interfaces, and relevance feedback techniques, which all have a major impact within an interactive retrieval process. It also seems clear that, through extensive experiments, these evaluations will ultimately help to improve the design of each of these components and will facilitate an optimal integration of their individual contributions. An important issue here will be to investigate how these new criteria will motivate the specification of new performance measures.

Problems of evaluation of interactive systems also require careful methodological discussion and design decisions. An appropriate evaluation design, for example, demands the identification of realistic problems or tasks that typical users are to solve during interaction with the system. Real tasks, however, are often complex and also depend on the everyday working context and environment. Often they cannot be reduced to a single realistic query, but are embedded in more complex task settings (e.g., looking for "interesting" new developments in a certain research area). Besides, the user's conception of the information problem, queries and search strategies changes through interaction. One cannot assume that the initial problem holds through the entire interaction and that there is an optimal solution to the problem that may be reached under ideal conditions (after several iterations). Traditional "objective" measures like recall, precision, number of errors, time spent, etc. might be used (in addition to qualitative measures), but always with respect to a certain stage of the interaction, which is characterized by a certain (changed) information need, task, or goal of the user. The dynamic process of the user-system interaction must be taken into account in the evaluation design.
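
To illustrate measuring with respect to a stage of the interaction, the sketch below computes precision and recall separately for each stage, each against the set of documents relevant to the information need as it stands at that stage. The function names and the representation of a stage as a (retrieved, relevant) pair of document-identifier sets are assumptions made for the example.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Standard set-based precision and recall for one stage."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def per_stage(stages: list) -> list:
    """stages is a list of (retrieved, relevant_at_this_stage) pairs.
    The relevant set may differ from stage to stage, because the
    user's information need can change during the interaction."""
    return [precision_recall(ret, rel) for ret, rel in stages]
```

The point of the design is that no single relevant set is fixed for the whole session; each stage is judged against its own.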

New media affect evaluation
One complication in evaluating retrieval of multi-media material is that even the answers to basic questions, for example what retrieval tasks people actually want to carry out, are highly dependent on the particular media represented in the system and on special skills of the users. Do people want to retrieve pictures on the basis of colour, texture, light intensity, or some other physical property? Are people interested in working with an intermediary to sketch the 'ideal house', in response to which the system returns interesting candidates that are for sale in the neighbourhood? Do people wish to access video by content, for example, to check if Darth Vader said 'light sabre' or 'life saver'? Technically, we can offer users tools which allow these sorts of tasks to be carried out, but we seem less than confident that users are interested in actually carrying them out. At present, the crucial questions in multimedia evaluation may be upstream, when requirements are formulated. Progress in multimedia development may be best fostered by an emphasis on the evaluation of scenarios of use, rather than by the evaluation of finished systems.

A basic problem in multimedia systems is the nature of a 'document' and a 'query'. Documents are usually richly structured and linked in multi-media repositories. Queries are often a mix of querying, navigation, consulting displays, note-taking, and reformulation. That 'query' and 'document' are more difficult to define in multimedia raises the question of what the relevance of documents to a user need, as expressed in queries, actually means. Ultimately, we need to reconsider the whole question of evaluation of IR systems, which may result in new framework(s) for evaluating complete systems, the individual components thereof, and the interactions between the components.

Another crucial feature of multimedia is that normally any one unit of material plays a rhetorical role in a larger context of exposition. An example is where a user failed to solve a programming problem efficiently because, although a search got him to the right spot, he failed to read necessary prerequisite material. The system failed to reveal the rhetorical relations between one unit of material and the others within a larger substructure. So perhaps an important kind of evaluation is not how fast a retrieval engine can take a person to a unit of relevant material, but rather how well a retrieval engine can reveal the relations between one unit of relevant material and its context, or how well it can present an alternative view of some set of potentially relevant units. That is, how well do search facilities and electronic exposition interpenetrate?

In interactive searches, it is important how the documents are presented to the user and how fast the users can perceive their contents. Hence, the techniques that support fast perception of the retrieved documents are one of the subjects of the evaluation method. The evaluation of this particular aspect is anything but trivial, because documents of different media are perceived differently and their perception must be supported in different ways. The same applies to the identification of relevant parts of multimedia documents for relevance feedback.

New information retrieval systems face the problem of providing access to non-textual information, that is, image, video, speech, ... In such systems, handling this access is more complex than in text-based systems. Can such systems be evaluated in the same way as text-based information retrieval systems? For systems managing non-textual data, we probably have to revisit the usual assumption that recall and precision values are a good way to evaluate systems. For instance, low precision in a system that retrieves images is not necessarily a sign of poor quality, since a user sees immediately whether an image fits his/her need. With time-based media like audio and video data, it seems important to evaluate precisely the granularity of query results.
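
The text does not define how the granularity of a time-based result should be measured. As one hypothetical illustration, the sketch below compares a returned audio or video segment with the segment the user actually needed, both given as (start, end) times in seconds. The two ratios, here called tightness and coverage, are assumptions introduced for this example and not an established measure.

```python
def overlap(returned: tuple, relevant: tuple) -> float:
    """Length in seconds of the part shared by two (start, end) segments."""
    start = max(returned[0], relevant[0])
    end = min(returned[1], relevant[1])
    return max(0.0, end - start)


def granularity_scores(returned: tuple, relevant: tuple) -> tuple:
    """Return (tightness, coverage) for one returned segment.

    tightness: share of the returned segment that is relevant
               (low when the result is much too coarse).
    coverage:  share of the relevant segment that was returned
               (low when the result is cut too short).
    """
    returned_len = returned[1] - returned[0]
    relevant_len = relevant[1] - relevant[0]
    if returned_len <= 0 or relevant_len <= 0:
        raise ValueError("segments must have positive length")
    shared = overlap(returned, relevant)
    return shared / returned_len, shared / relevant_len
```

A system that returns a 40-second clip around a 10-second relevant passage would score full coverage but a tightness of only 0.25 under these definitions.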

A crucial point for the evaluation of text-based systems is the experts' judgement of the relevance of documents. This notion of relevance has to be refined for new media documents. For instance, with images, relevance can be based on subjective elements, on physical characteristics (colours, ...), or on the objects in the image scene, and the evaluation of systems has to handle these different points of view in relevance judgements.

Evaluation methods for smaller groups
The evaluation of retrieval methods is traditionally based on the pooling method. Only big organizations, however, can afford the pooling method. Little research has been done on "evaluation on a shoestring", which would allow Small and Medium Enterprises (SMEs) to evaluate dedicated retrieval methods for special applications.
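
For readers unfamiliar with pooling, the following sketch shows the general idea as it is commonly described: the top-ranked documents from several retrieval runs are merged, and only the merged pool is passed to human assessors. The function name, the run names and the pool depth are hypothetical. The size of the pool is what drives the judging cost.

```python
def build_pool(runs: dict, depth: int) -> set:
    """Merge the top `depth` documents of every run into one pool.

    runs maps a run name to its ranked list of document identifiers.
    Only documents in the returned pool would be judged for relevance.
    """
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


# Hypothetical example with two runs and a pool depth of 2.
example_runs = {
    "run_a": ["d1", "d2", "d3"],
    "run_b": ["d2", "d4", "d5"],
}
example_pool = build_pool(example_runs, depth=2)
```

Each pooled document needs a human judgement, so the effort grows with both the number of runs and the pool depth.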

New projects for new evaluation frameworks
The members of Mira are involved in a number of other programmes which will support the creation of new RTD projects in this area. For example, we have close connections with two working groups started under Framework 3, MIRO and FADIVA. We are well represented on the network of excellence IDOMENEUS. We are actively involved in FERMI, FIDE II, and IMIS. A number of us are on the academic panel for the Information Engineering sector under Telematics. Thus we are in a good position to propose new projects to support further research and development.


Ian Ruthven