Themes
The user in the evaluation process
Information seeking tasks are at the centre of much computer-mediated work.
Sometimes these tasks are curiosity-driven, but most often they are spawned to
support a primary goal, perhaps to prepare a budgetary report, to fix a mistake
in a design drawing, or to resolve a misconception about a programming language
feature. Thus, the utility---the value gained for the effort expended---of IR
tools must ultimately be assessed within an overall work setting. The
important goal may be less to assign a utility coefficient than to understand
how IR tools are actually used. For example, ethnographic studies, the
description and analysis of workplace activity, may suggest that the most
important feature of an IR tool is how well it allows people to co-ordinate
their activities, goals, and sub-goals within a mixture of application displays
and programs. It is possible that people spend 10% of their time querying and
reformulating their queries, but 50% of their time on keeping track of what
they have done, what results they want to keep or investigate further, and what
needs to be done next. These monitoring tasks may be the bottleneck to overall
performance, yet the only way to discover such bottlenecks is to observe people
at work. People often use technology in unexpected ways. The spreadsheet, for
example, is an extremely effective tool for Computer-Supported Co-operative
Work. This surprise, however, could not have been discovered without a
commitment to investigating real work practice. One reason, then, for taking
an activity-centred view of IR evaluation is to test one's assumptions about
what aspects of a tool are important. Studying cognition in practice can
change a designer's assumptions about the strengths and weaknesses of the
artefacts he provides. Very few ethnographic studies of IR in practice exist,
especially ones involving people who are not library patrons, so we need to
look elsewhere for such insight.
Landauer (1985) has argued that "psychological research can be the mother of invention". This slogan means that empirical studies of human skill can provide designers with a non-trivial understanding of how people struggle to complete tasks. With this understanding comes insight and inspiration for design. As an example, several studies, in both print and electronic media, have shown that people have considerable trouble recognizing relevant material as relevant. Concretely, a person may retrieve a unit of material, read it, look at it, or listen to it, and yet inappropriately dismiss it as non-relevant. One can conclude that improved ranking schemes will not necessarily help many people unless users understand why an item is relevant. Even if superb designers can sometimes anticipate such problems, they may have considerable trouble convincing their colleagues. More often, such insights are found through empirical work, and it is harder for a design team to ignore what a videotape shows users to have actually done. So, like ethnographic research, user-centred empirical work can change design priorities and provide inspiration.
The changing nature of IR tasks and their evaluation
Traditionally, information retrieval has evolved as a solution to the
problem of finding documents in a corpus or collection, in response to queries
from users. The task was thus to match the text of a query against the text of
documents and retrieve the best-matched. This was adequate for applications
like bibliographic databases and other simple text processing niches. Now,
however, the data has changed, users have changed and the functionality has
changed.
Present-day multimedia technology, knowledge-based information processing and high-performance visualization technology enhance the development of advanced interactive IR systems. The diversity and complexity of the interlinked multimedia data offered in many applications require new indexing methods and enhanced retrieval functionality, as well as new interaction paradigms and metaphors compared with traditional IR systems. As the interaction possibilities open to users become increasingly complex, co-operative user interfaces aim at supporting and guiding the information retrieval process in every phase of the interaction. Supporting cognitively effective interaction is essential for effective information retrieval. This includes active assistance tailored to context, task, and user, which takes into account and exploits the development of the information-seeking interaction. Not only are new retrieval functionalities required and supported, but also a variety of global, interaction-oriented tasks (apart from the specific search and retrieval tasks), e.g., clarification and formulation of information needs, inspection of retrieved information, interactive relevance assessment, explanation of retrieval methods, etc. Cognitively adequate access to these new retrieval functionalities has to be ensured for naive users as well as for expert users. To achieve this goal, appropriate natural ways of interaction utilizing visualization and multimodal dialogue techniques can be exploited.
For an appropriate evaluation design, the range of tasks supported by a given system has to be identified and categorized in order to identify the relevant evaluation factors and dimensions. Given the growth in inter-connectivity and networking and the change in users and in data, we now see other functionalities required of information retrieval techniques besides ad hoc querying: routing, filtering, categorising, abstraction and summarisation, passage retrieval, sub-document retrieval, browsing, and data mining are just a few. Quantitative evaluation approaches which measured traditional IR applications do not transfer readily to these and other emerging applications. Qualitative evaluation of the new applications in terms of user needs, goals, and satisfaction has not been attempted as yet. Designing comparative evaluations of different systems or variants of a system (controlling one or more factors), where typical users are given realistic tasks to fulfil, may provide evidence about the use of the system-supported tasks. We assume that qualitative evaluation criteria and measures play an important role in the evaluation design.
Traditional evaluation methodologies
Information retrieval has developed a strong scientific evaluation
methodology over the last forty years. This methodology has served us well but
is now showing signs of being ready for revision and extension because of the
changing nature of information seeking tasks. Moreover this methodology is
beginning to be adopted in other disciplines without a deep understanding of
its foundations and limitations. Most IR researchers use the classical
measures of recall and precision for evaluating IR systems and methods.
However, these measures rest on underlying assumptions and simplifications
of which many people are not aware. As a result, the measures are applied
even in situations where they are not appropriate. In the following,
we briefly describe the major assumptions underlying the evaluation methods
currently in use.
Originally, these measures were developed for Boolean retrieval performed as batch runs. That is, the user's information need is transformed into exactly one query formulation, which, in turn, produces a set of documents as answer. For measuring the quality of this result, it is assumed that the user looks at all documents in the answer set. Although the quality of the answer set could be judged as a whole, single documents are instead taken as the basic elements for evaluation.
For characterizing the quality of a document with respect to the information need, the concept of relevance is introduced. Mostly, a binary relevance scale is used, but multivalued relevance scales have also been considered by some researchers. In order to arrive at a measure characterizing the answer set, it is assumed that the relevance of each document is independent of that of any other document, and that all relevant documents are of the same value with respect to the information need.
In order to arrive at a measure, one more step is needed: user perspectives have to be defined, where a perspective is described as a set of preferences characterizing the kind of results a user prefers. A measure can then be developed which reflects exactly these preferences; for example, if a user wants as many relevant documents as possible, then recall is the appropriate measure.
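The set-based measures described above can be sketched in a few lines of Python. This is a minimal illustration under the assumptions just listed (a single answer set, binary relevance, every relevant document of equal value); the function name and the document identifiers are hypothetical and do not come from any particular system.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall under binary relevance.

    retrieved: documents in the answer set (assumed to be fully inspected)
    relevant:  documents judged relevant to the information need
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical identifiers, for illustration only.
answer_set = {"d1", "d2", "d3", "d4"}
relevant_docs = {"d2", "d4", "d7", "d9", "d10"}

precision, recall = precision_recall(answer_set, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A user who wants as many relevant documents as possible would look at the recall value; a user who wants to avoid reading non-relevant material would look at precision.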
Since IR experiments are stochastic experiments, a set of user queries must be used in order to compute a measure for characterizing the quality of the system or method under investigation. Average values can then be computed, provided that the measure chosen is based on a ratio scale. For comparing two systems, it is not sufficient to look at the mean values produced by the two systems, since the means may be affected heavily by outliers. Instead, the statistical significance of a difference in performance should be checked. Since the assumptions underlying parametric tests often are not valid for the experimental data, non-parametric tests should be preferred.
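As a sketch of the comparison procedure just described, the following Python computes an exact two-sided sign test on paired per-query scores. The sign test is only one possible non-parametric test and is chosen here for brevity; the text does not prescribe a particular test. The scores are made-up values for illustration, and the function name is hypothetical.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Exact two-sided sign test on paired per-query scores.

    Ties are discarded (a common convention). Returns the p-value for the
    null hypothesis that neither system tends to beat the other.
    """
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # all queries tied: no evidence of a difference
    k = min(wins_a, wins_b)
    # P(at most k wins out of n) under a fair coin, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up per-query scores for two systems over the same eight queries.
scores_a = [0.50, 0.62, 0.41, 0.70, 0.55, 0.48, 0.66, 0.59]
scores_b = [0.42, 0.58, 0.43, 0.61, 0.50, 0.40, 0.60, 0.51]

mean_a = sum(scores_a) / len(scores_a)
mean_b = sum(scores_b) / len(scores_b)
print(f"mean A={mean_a:.3f} mean B={mean_b:.3f} p={sign_test(scores_a, scores_b):.4f}")
```

In this made-up example system A wins on seven of eight queries, yet the difference would not be significant at the conventional 5% level, which illustrates why comparing means alone can mislead.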
A major extension of the general approach described so far comes with ranked lists of answer documents. In order to apply the traditional measures, cut-off points have to be chosen. This is reasonable, assuming that a user looks through the ranked list from the beginning and then stops at a certain point. The choice of this cut-off point is also part of the user perspective. For the set of documents seen up to this point, retrieval measures can be applied in the same way as with Boolean retrieval. If an average value for several cut-off points is computed, this can be regarded as averaging over users with slightly different perspectives.
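The cut-off approach can be sketched as follows. This is an illustrative Python fragment, assuming binary relevance and a user who reads the ranked list from the top and stops after a fixed number of documents; the function names, ranking, and cut-off values are hypothetical.

```python
def precision_recall_at(ranking, relevant, cutoff):
    """Precision and recall over the first `cutoff` documents of a ranked list."""
    seen = ranking[:cutoff]  # documents the user is assumed to inspect
    hits = sum(1 for doc in seen if doc in relevant)
    precision = hits / len(seen) if seen else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_precision_over_cutoffs(ranking, relevant, cutoffs):
    """Average precision values over several cut-off points.

    This corresponds to averaging over users who stop at different ranks.
    """
    values = [precision_recall_at(ranking, relevant, c)[0] for c in cutoffs]
    return sum(values) / len(values)


# Hypothetical ranked output and relevance judgements.
ranking = ["d3", "d1", "d7", "d2", "d9", "d4"]
relevant = {"d1", "d2", "d4"}

for cutoff in (2, 4, 6):
    p, r = precision_recall_at(ranking, relevant, cutoff)
    print(f"cutoff={cutoff} precision={p:.2f} recall={r:.2f}")
print(mean_precision_over_cutoffs(ranking, relevant, [2, 4, 6]))
```

Each cut-off turns the ranked list into a set, so the set-based measures apply unchanged; only the choice of stopping point differs between user perspectives.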
From the explanations given above, it should be clear that all the implicit assumptions underlying current evaluation practice may be subject to revision. Since the characteristics of new IR applications may be very different from the classical ones, new evaluation methods and measures have to be developed. Of particular importance is that the metaphor of queries as a reflection of information need now seems limited: people engage in document-seeking activities not always as a result of an information need, but ever more often as a result of a generic need (e.g. the need to watch a movie, enjoy a picture or a photograph, listen to a piece of music, etc.). Evaluation of "information" retrieval services should probably be revised in the light of this.
Evaluation can be prescriptive of IR design
Information retrieval is a highly pragmatic art. Many existing
commercial systems have been developed in a purely ad hoc fashion, uninformed
either by theory or by experiment. However, there is now a lot of evidence
that system developers are beginning to take to heart at least some of the
results of 40 years of evaluation experiments -- and that they are benefiting
from this knowledge.
How can evaluation be prescriptive of IR design? The central question that needs to be asked when choosing an IR system design is: "Will this be the best solution for our users searching our database with their information needs during their task?" We need to know how system components and techniques perform in various contexts of use so that we can decide how to construct and parameterise an IR system for its intended contexts of use. IR systems and techniques have traditionally been evaluated in single, hypothetical use situations with the emphasis on the question 'does it work?'. It is becoming increasingly clear, however, that techniques do not simply work, or not work, but that most will exhibit extreme variations in performance across different use situations and with different parameters. We currently have little knowledge of why this is, even for well-worn techniques such as relevance feedback. We are in a situation where we have a technique that we know can perform well and can also perform badly, but we do not have much reason to hypothesise whether and with what parameters it will perform well in its intended context. Our evaluation efforts do not provide analyses of techniques that are prescriptive of IR system design. We need evaluation techniques that investigate mappings between parameters of IR systems and techniques and parameters occurring in their context of use.
There is a strong relation between theory and experiment. The theoretical basis for IR is quite rich but not generally very powerful. That is, IR theory draws from a variety of sources (e.g. cognitive psychology for interface design, linguistics for natural language processing, and logic, mathematics and statistics for matching). However, no single piece of theory is very prescriptive, even within its own province: all theories require filling out with experimental data. Furthermore, mixing ideas derived from different theoretical approaches can only be taken as a pragmatic task, to be guided by experiment.
Thus the role of experimental methodology, particularly but not exclusively evaluation methodology, in IR system design is stronger now than it has ever been. A good illustration of this is the increasing influence of the biggest experimental programme there has ever been in IR, the three-year-old US-based TREC programme.