{giorgio|mizzaro|tasso}@dimi.uniud.it Dipartimento di Matematica e Informatica, University of Udine Via delle Scienze, 206 Loc. Rizzi - 33100 Udine - ITALY
Abstract Designing good user interfaces to information retrieval systems is a complex activity. The design space is large and evaluation methodologies that go beyond the classical precision and recall figures are not well established. In this paper we present an evaluation of an intelligent interface that covers also the user-system interaction and measures user's satisfaction. More specifically, we describe an experiment that evaluates: (i) the added value of the semi-automatic query reformulation implemented in a prototype system; (ii) the importance of technical, terminological, and strategic supports and (iii) the best way to provide them. The interpretation of results leads to guidelines for the design of user interfaces to information retrieval systems and to some observations on the evaluation issue.
(*) paper presented at the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, Zurich, CH, August 18-22, 1996, pp.128-136.
We propose rules of thumb, designed to guide the evaluation of interactive information retrieval systems. These rules are intended to assist the IR community in shedding maximum light on the effectiveness of IR "components" while, at the same time, minimising the effort required to do so. By "component", we mean a focal concept, which the evaluator wishes to better understand, such as retrieval model, computational technique, interface dialog, visual display, or overall model for interaction. These rules can be thought of as a rallying cry for thinking about the utility of an IR evaluation.
During an `evaluation', we want to shed maximum light on effectiveness, yet do so economically. Thus, the title of this position paper, `Evaluation Light', is a play on two concepts `illumination' --- brightness rather than darkness --- and `economy' --- lightweight rather than heavyweight.
We present a set of rules of thumb we have devised to guide the evaluation of interactive information retrieval systems. We believe these rules can assist the IR community in shedding maximum light on the effectiveness of IR "components" while minimising the effort required to do so. We use the term "component" to stand for such things as retrieval models, computational techniques, interface dialogs, visual displays, and so on. It is important to note that in any particular evaluation, some components, or combination of components, will assume a prominence over others.
The list of rules is very much in the draft stage, and this position paper has been specifically written to provoke discussion on what constitutes useful evaluation. We are well aware of the importance of contextualising these rules with the work of others in HCI and IR, but have yet to do so. In our own work, these rules have assisted in the development of a light-weight plan for the evaluation of a system designed for the retrieval of pictures on the basis of spatial indexing features. (This is the second scenario presented in the Appendix.)
Don't do feel-good testing, do feel-right testing
In other words, don't do evaluation because it is expected of you,
or because you want the test users to say "what a lovely system".
Rather, do it because you want to gain insight into the component being
tested. For example, too often we see Precision-Recall graphs presented,
where components should be evaluated in their own terms.
Principled design drives principled evaluation
If you have a set of claims as to why your "component",
in particular, should improve an information retrieval task, it is much
easier to decide how to evaluate it. This idea underlies the development
approach where the manuals for the user interface are written before the
interface is prototyped.
Make claims/hypotheses for your system
If your system is designed with some goals, objectives or claims
clearly stated, then you should be in a better position to test the system
against these. Thus, it is important to state what the research or development
goals are.
Break down a large hypothesis or design conjecture into discrete,
smaller hypotheses or conjectures
All too often what one would like to learn is stated at too high
a level. Thus, it is important to discover smaller aspects of the problem
and to propose focused questions about these aspects. We believe that
this approach will lead to greater insight.
Evaluate using small, controlled, tasks directed at evaluating
a given claim or hypothesis
Focused trials are more likely to lead to specific insights. There
is plenty of evidence to suggest that small trials, properly run, can be
at least as effective as big ill-focused trials.
Know your intended users, and brief test users accordingly
This is important if the test users are not necessarily the same
as the intended users, or if the evaluation is being done "at a distance".
Inform test users of the system's intended user model through
the right cover story
The problem here is to decide what skills and point of view you
would like test users to have. It seems that instructions can make a large
difference on how test users approach an interface and thus what can be
learned from a test.
Minimise testing, maximise analysis of results
Conducting experiments, and particularly those involving human
subjects, can be very expensive. Moreover, analysing the results from such
experiments can itself be very expensive, and particularly if the analysis
is unfocused. Thus, it is important to plan your experiments and analysis
in order to discover the minimal amount of data required to test a conjecture
or hypothesis. Naturally, we would advocate collecting as much data as
possible from experiments; you or others may be able to re-use this data
for other purposes.
Consider micro-evaluation rather than macro-evaluation
Micro-evaluation looks at comparisons between individuals (users,
searches, etc.) rather than averaging performance across individuals. Potentially,
micro-evaluation may result in greater insights concerning the claims for
your system. There is a perceptible move away from macro-evaluation (e.g.
Precision-Recall graphs) in IR towards micro-evaluation. Micro-evaluation
will certainly require quite sophisticated data extraction and analysis
tools, and perhaps we in the IR community should be designing and building
these to support our experimentation.
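The contrast between the two views can be sketched in a few lines of Python. This is only an illustration: the two systems, the query identifiers and the per-query scores are hypothetical numbers invented for the example, not results from any experiment.

```python
from statistics import mean

# Hypothetical per-query effectiveness scores for two systems
# (illustrative numbers only, not experimental data).
scores_a = {"q1": 0.80, "q2": 0.10, "q3": 0.55, "q4": 0.40}
scores_b = {"q1": 0.30, "q2": 0.60, "q3": 0.55, "q4": 0.45}


def macro_view(scores):
    """Macro-evaluation: one averaged figure per system."""
    return mean(scores.values())


def micro_view(a, b):
    """Micro-evaluation: per-query difference (a - b), largest gaps first."""
    diffs = {q: round(a[q] - b[q], 2) for q in a}
    return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)


macro_a = macro_view(scores_a)
macro_b = macro_view(scores_b)
per_query = micro_view(scores_a, scores_b)

print(f"macro: A={macro_a:.3f}  B={macro_b:.3f}")
for query, diff in per_query:
    print(f"micro: {query}  A-B={diff:+.2f}")
```

With these made-up numbers the averages are nearly identical, while the per-query view shows that the systems behave very differently on individual queries, which is the kind of insight the averaged figure hides.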
Use other people's results where possible
You should have a plan for how your data could be shared with,
and used by, other experimenters. Also, you should consider how you might
reuse someone else's data.
Have a contingency/risk plan
If a major hypothesis is invalidated, it may be desirable to abandon/re-direct
the research, design or development effort. This plan can be based on the
set of conjectures/hypotheses you develop.
On reflection, the "Evaluation Light" rules seem self-evident. However, we think that by reminding ourselves of these "self-evident" rules, we may be able to design IR experiments which require less effort and which lead to greater insight into the component being tested. We would welcome suggestions for improving both the rules themselves, and the presentation of these rules.
These two scenarios are intended to prompt you to think about how the Evaluation Light rules could be used in practice.
The cognitive dimensions framework includes a dimension called "role expressiveness", which captures the extent to which an application (or interface) reveals causal relationships between the various data/artefacts being manipulated by the user.
We conjecture that if the causal relationship(s) between a query, and the documents retrieved in response to that query were revealed to a user, then the user would be able to judge the relevance of a retrieved document more accurately. This is becoming increasingly important in situations where documents take an appreciable time to fetch, load and indeed read, e.g. full-text WWW documents, and where typically some surrogate of the document, e.g. title, is displayed in the list of retrieved results. (Incidentally, it may support users who prefer to identify likely relevant document first before reading the documents in detail.)
Examples of the kinds of information that might be displayed to indicate query-document role expressiveness in a list of retrieved results are:
Using the Evaluation Light guidelines, design a set of experiments to test the above design conjecture.
We have designed an indexing and retrieval model for images. In the present work the images are photographs of landscape scenes in Scotland. The photographs have been indexed manually: people have identified salient objects, drawn a rectangle around them, and labelled each object with a keyword. The test collection consists of approximately 800 photographs.
A query comprises a set of spatial features (rectangular region plus label) and/or a set of text features (keyword which will be matched with text associated with the image). Let us concentrate on the use of the spatial indexing for querying. We suppose that a user builds up a "picture" of the image they wish to retrieve using labelled rectangles. This query is then matched against the indexed images, and a ranked list of matching images is presented to the user.
A major (and incomplete) design conjecture underlying this work is: "For some user groups, for some tasks, spatial indexing will result in more effective retrieval of photographs/images". Think about the user groups and tasks for which spatial indexing and retrieval might be useful. Then, consider how the Evaluation Light rules could be used to home in on a set of simple, but revealing, experiments.
A necessary step towards progress is the clarification and definition of evaluation frameworks and methods leading to models of the IR system and of user interaction with the system within the task domain, as well as to effective cost/benefit ratios.
A framework can help identify theories, methods and models appropriate for analysis of a given problem domain and data about it. A framework differs from theories, methods and models in the sense that it cannot be falsified or verified as can theories, methods and models. A framework can for example be evaluated in terms of criteria such as its completeness with respect to a) the problem domain that it should cover b) theories, methods and models addressing the problem domain.
The following aspects should be studied/explored:
Crucial issues will be:
One thing that is needed is a map from aspects/goals of evaluation to methods for data acquisition and analysis. A framework like the framework for user-work domain interaction presented by Annelise can for example be used to identify a set of methods that are appropriate alone or in combination for different evaluation goals. For example,
In addition, task scenarios envisaged within frameworks can help identify appropriate performance measures.
It may be useful to 1) look to other scientific disciplines for other theories and methods, such as experimental and cognitive psychology, sociology, HCI and usability inspection methods, human factors and ergonomics, cognitive systems engineering, hermeneutic approaches like discourse analysis and ethnographic studies, design in context/situated action/design studies, semiotics and natural decision making theories and studies; and 2) adopt more advanced technology such as evaluation work benches using multimedia recording techniques.
Group 5 addressed David Harper's Scenario A: "design empirical investigations of whether users would benefit from knowing how terms in their query related to the documents retrieved". We treated the exercise mainly as an evaluation of Harper's Evaluation Light framework, which was precirculated (see section 4.2). Our approach was to move through the headings of the Evaluation Light framework in what we hoped was a reasonable sequence, pausing to include other design steps that we felt were needed. We shall put the existing Evaluation Light steps in bold and the additions in bold italic red.
We initially assumed that the scenario was supposed to describe standard IR, but there were two views of how the user might benefit. In View 1, the user is expected to look at the set of retrievals and then to reformulate a refined request; the improvement would lie in a better reformulation of the request. In View 2, the user is supposed to look at the set of surrogates for retrievals and then to fetch some of them for full examination; the improvement is supposed to lie in being better able to separate relevant from non-relevant documents.
The difference in our understandings was not initially clear, leading to some confusion. We started with the Evaluation Light question: Has it been done before? Not to our knowledge. (Later, someone who claimed to be a fly on the wall, although some of us sometimes wondered if he was totally off the wall, told us that there was a study by Veerasamy and Belkin, pub. 1996, with some relevance.) Next we asked for intuitive or anecdotal examples and counter-examples, a step strongly recommended by the HCI-wise members of the group. Here the difference between the views first became apparent, as Steve Draper (who had interpreted the scenario according to View 1) argued that in modern full-text IR the user was if anything better off not even thinking about how the terms retrieved the documents, and should instead just cut and paste large relevant sections from the first set of retrievals and use those to form the refined query, leaving all the workings to the magical black box inside.
Our mixed interpretations led us into some difficulties at that stage and still more at the following stage, when we moved on to break the hypothesis down. Harper's idea was to avoid evaluating the whole IR process, but instead to isolate and test smaller hypotheses in reduced situations. For our scenario, we assumed the real-world process included the steps of
Those of us who had assumed View 2 argued that step 2 could be omitted from the evaluation, and instead we should concentrate on step 4, the winnowing of sheep from goats.
Later we performed what should be the first step in the revised Evaluation Light framework: Grill the originator of the request over tea until you find out what he or she REALLY means. After Steve had trapped David Harper in a corner, David revealed that the scenario was meant to describe searching the web using an engine like AltaVista. Experience had revealed that when, say, ten terms were put in, intending to narrow the search, the resulting documents included some that were retrieved by what David called 'bizarre subsets' of the terms. Harper's original hypothesis is that surrogates should additionally be tagged by which terms matched them, and the user could then by inspection dismiss surrogates with bizarre subsets, e.g. if the retrieval query was "elephants and image" then any surrogate with only "and image" but no "elephant" could be dismissed. This scenario lies somewhere between views 1 and 2; titles of web pages usually reveal their utter irrelevance, so the problem of discerning relevance from surrogates is less important than step 5, refining the query.
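The tagging idea can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of how AltaVista or any real engine works: the document titles and texts are hypothetical, matching is plain lower-cased whitespace tokenisation, and the function names are invented for the example.

```python
def matched_terms(query_terms, document_text):
    """Return the query terms that occur in the document (naive word match)."""
    words = set(document_text.lower().split())
    return [term for term in query_terms if term.lower() in words]


def tag_surrogates(query_terms, documents):
    """Pair each document title (the surrogate) with the terms that matched it."""
    return [(title, matched_terms(query_terms, text))
            for title, text in documents.items()]


# Hypothetical collection: title -> full text.
documents = {
    "Elephant gallery": "An image of elephants and their calves",
    "Image processing notes": "Edge detection and image filtering",
}

query = ["elephants", "and", "image"]
tagged = tag_surrogates(query, documents)

# A user (or the interface) can now dismiss surrogates that were retrieved
# without the term that really matters to the request.
dismissable = [title for title, terms in tagged if "elephants" not in terms]

for title, terms in tagged:
    print(f"{title}: matched {terms}")
```

Here the second surrogate is tagged with only "and" and "image", the kind of 'bizarre subset' that could be dismissed by inspection.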
Next in the Evaluation Light framework came finding a claim. The HCI-versed members were strongly in favour of this step but nevertheless had difficulty at first. What claim was there to find? So we moved on to consider possible manipulations of the task. (In retrospect, I believe this step is what Evaluation Light calls try gedanken experiments.) We listed several possible manipulations, including TileBars. Given our new understanding of the scenario, the manipulation we liked best was to present a histogram of how many documents had been retrieved by each combination of terms. (Possibly a bar of the histogram could then be queried to discover just which documents that combination had produced.) Users could then note that the combination of, say, terms 1 and 8 had fetched a disproportionate number of titles, and perhaps decide on reflection that it relied on a different sense of one of the words; they could refine the request accordingly.
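A rough sketch of how such a histogram might be computed is given below. It is only an assumption about one possible implementation: the query, the document texts, the naive word matching and the text rendering are all hypothetical and chosen for brevity.

```python
from collections import Counter


def combination_histogram(query_terms, documents):
    """Count retrieved documents by the exact combination of query terms they match."""
    counts = Counter()
    for text in documents:
        words = set(text.lower().split())
        combo = frozenset(term for term in query_terms if term in words)
        if combo:  # documents matching no term are not retrieved at all
            counts[combo] += 1
    return counts


def render(counts):
    """Render the histogram as text, one bar per term combination."""
    lines = []
    for combo, n in counts.most_common():
        label = "+".join(sorted(combo))
        lines.append(f"{label:<20} {'#' * n} ({n})")
    return "\n".join(lines)


# Hypothetical query and retrieved texts.
QUERY = ["jaguar", "speed", "habitat"]
DOCS = [
    "jaguar speed record",
    "jaguar habitat and diet",
    "top speed of the jaguar car",
    "unrelated text",
]

histogram = combination_histogram(QUERY, DOCS)
print(render(histogram))
```

A user seeing that "jaguar+speed" accounts for most of the retrievals might suspect that this combination relies on a different sense of "jaguar" and refine the request accordingly.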
We therefore decided to evaluate a design in which the usual list of titles was accompanied by a histogram like this:

Our claim for testing would be that reformulation would be improved. Now it was time to consider the experimental design. The obvious design would be to compare performance on the usual AltaVista display against performance with the enhanced display, but that is not an adequate design because performance with extra information is always likely to be at least as good as baseline performance, so random deviations will support the experimenter's hypothesis. Instead, our proposed design would include a third condition, performance with only the histogram, giving a design with three conditions: list of titles; histogram; titles plus histogram.
Further Evaluation Light steps included consider generalizability (would it work with multimedia? why not, we said) and consider intended users. Our proposed investigation would include the usual suspects, i.e. information specialists in one group, naive users in another.
It was agreed by all that before starting with real subjects, the experiment should be debugged by running think-aloud pilot subjects at an early design stage. Part of the point is to do some open-ended observation on the system and task before committing to an experiment. This isn't so much piloting a controlled experiment as investigating to see if there are new issues to address that might make the experiment beside the point.
Finally, we would add an explicit principle about seeking to demonstrate possibilities. E.g. although it is of interest to know what typical users do on average, it is also important to know what is and is not possible at all with a system. To know what the fastest typing speed on a keyboard is, as well as the average; to know that a document cannot possibly be retrieved by a particular system, as well as to know that the average user fails to find it. Discovering what is possible is a scientific enterprise in itself, but one seldom mentioned by experimentalists.
We believed that Evaluation Light was an excellent design process. Its headings were good discussion tools and it encouraged us to look for probing, investigative evaluations, rather than the style of 'build the whole system and see what happens'. Our suggested additions merely fill out the gaps; they do not change the framework.
The group worked on 2 main issues: the scenarios presented by Peter and the existence of different kinds of relevance.
The intention was to discuss the normalisations over groups of test persons, engines, use of simulated needs and real needs, and so on, in interactive performance experiments where the foremost entity for statistical validity is the needs/requests applied by users - not the users themselves.
Naturally, if the experiment is to observe the use and behaviour of real-life users applying two systems (or two interfaces, or two methods of representation), the requirement is, for instance, 125 users, each with their own need. However, if the goal is to measure the performance of two systems by users with or without real needs, the number and nature of the needs become crucial, more so than the number of persons in the groups testing the systems. This is the rationale behind the scenario and the idea of discussing the number of test queries, their "openness", relevance assessments, etc. We don't need 125 x 2 = 250 users but far fewer (as also shown by the Udine Team).
Anyway, to obtain a statistically significant experiment, we need a lot of users and a very large effort. This has almost never been done in the past, and now should be the right time to do it.
The relevance measured in TREC is a "low" one (in the terms of the "Stefanology of relevance"), while the relevance that other people (and Peter in his scenarios) are suggesting or trying to measure is a "higher" one (nearer to the user), both in that (i) it covers not only topic but also task, and (ii) it is relevance to the information need, not to the request.
The group agreed that we need some method for evaluating a relevance near to the user, and the discussions seemed to confirm the "Relevance indetermination phenomenon" (the more we try to measure a relevance near to the user, the less we can measure it), though somebody in the group did not agree with it. Further discussion is needed on this issue.
Finally, we also discussed, vaguely, the use of Tague's "Informativeness Measure" and its scenario applying task end product properties as entities for performance assessments (e.g. references actually used by searchers in their final articles compared to the ones retrieved from systems).
During the MIRA meeting in Padova, several very interesting models were presented. But apart from statements such as "we need a meta-model" (Thomas Green), "it might be desirable to be able to characterise (or classify) tasks, domains, and users, according to some agreed framework, so that at least we might be able to compare the results of experiments" (David Harper), etc., the group discussions seem to me :
and so :
In fact, I think that referring to common model(s) is the best (perhaps the only) way for each of us to :
Against the background of classical models, at least 5 contributions dealt with models during the meeting :
I won't go into the details of the models here, but will first try to give a synthetic view.
One good exercise would be to try to distribute the various delivered keywords or keyword patterns into the levels of -a- and to structure them in an intra-level and then an inter-level way.
The keywords might be : user, documents, surrogates, requests, answers, mechanism, user goals, evaluation function, evaluation goal, data collection model, system response, transaction logs, user questioning, think-aloud protocols, effectiveness, efficiency, ease of interaction, usability of interface, user satisfaction, experimental settings, information, real information need, perceived information need, expressed information need, query, topic, task, context, comprehensibility, novelty, choice of which relevance you need, "which relevance to judge ?", "which relevance judgement ?", money to earn, time, activity analysis, information overhead, "getting lost", viscosity, hidden dependencies, visibility and juxtaposibility, progressive evaluation, role expressiveness, consistency, choice of what relevance you need, micro-world, extra-information provided, mental image, etc.
Annelise Mark Pejtersen said that one design challenge for system evaluation consists in identifying stable structures (another being : constraining actions / possibilities). Thomas Green and Steve Draper highlighted the importance of abstraction.
I think that the kernel of IR problems stands in abstraction and abstraction level.
When each of us speaks of abstraction, we mean different facets of abstraction, according to who we are, which object is concerned at this moment, which present level of meta is concerned,... It would be useful to try to be precise on this point as often as possible. In particular, when we describe and use models, we necessarily make abstractions. The problem is to find the adequate level.
I won't go into details here. But I'll look at one kind of abstraction : abstraction from the operational viewpoint. And I suggest one somewhat personal track about it.
Let us imagine an (H)IRS session in progress.
Each constituent (object or action) taking part in this IRS session may be seen as the embodiment of a hierarchy of abstractions (or perhaps several hierarchies ; but let us ignore that). For example : the keyword "Stradivarius violin" may be the visible element of a hierarchy including, as we go up in the abstraction degree, violin, string instrument, instrument, music, art, ... (Here, we could introduce the notion of characteristic level to express the upper level where a mental image arises in the user's mind). This is true for surrogates, each surrogate constituent, request, matching, relevance judgement, etc., and also -with a great power- for elements of images.
At each step of the session, according to the state of the user, of the user's need, of the whole context, etc., we may conceive what the appropriate current level in the hierarchy is for each constituent ; let us call it its present abstraction point (PAP). For instance, for the keyword "violin", it may be sometimes "string instrument" and at other times "music". Another example is : the action of comparing 2 keywords may be considered as the visible element of a hierarchy including : "compare the keywords themselves", "compare their PAPs", etc. So one may define a PAP for this action at every session stage.
If you wish to compare a request word RW with a surrogate word SW, you can locate each of them in a common hierarchy, and then the comparison will pass through what we might call their youngest common ancestor (YCA) or, more precisely, the Youngest Common Ancestor of their respective PAPs.
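To make the YCA notion concrete, here is a minimal sketch, assuming the common hierarchy is available as a simple child-to-parent table. The violin chain follows the example given earlier; the "cello", "flute" and "wind instrument" entries, and the function names, are hypothetical additions for illustration only.

```python
# Hypothetical abstraction hierarchy, stored as child -> parent.
PARENT = {
    "Stradivarius violin": "violin",
    "violin": "string instrument",
    "cello": "string instrument",
    "string instrument": "instrument",
    "flute": "wind instrument",
    "wind instrument": "instrument",
    "instrument": "music",
    "music": "art",
}


def ancestors(node, parent=PARENT):
    """Return the chain from a node up to the root, starting with the node itself."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def youngest_common_ancestor(pap_a, pap_b, parent=PARENT):
    """Return the lowest node lying on both upward chains, or None if there is none."""
    seen = set(ancestors(pap_a, parent))
    for node in ancestors(pap_b, parent):
        if node in seen:
            return node
    return None


# The arguments are the PAPs of the request word and of the surrogate word.
print(youngest_common_ancestor("Stradivarius violin", "cello"))
print(youngest_common_ancestor("violin", "flute"))
```

The same function returns None when the two PAPs do not share a hierarchy, which corresponds to the case where the system cannot find a common hierarchy of abstractions.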
In other respects each action of the IRS may be considered in terms of the PAP of each of the involved elements (objects or actions). And we may modulate how we climb in the hierarchy from each element's PAP to adapt oneself to the context, to a micro-world, to the result of a previous action trial, to an evaluation goal, etc.
Steve Draper raised the problem of being lost after a jump in a hypertext. One way of helping the user may be to show him the YCA between his former position's PAP and the new one's PAP (this is possible only if the system is able to find a common hierarchy of abstractions).
Moreover, when Thomas Green spoke of the good way of representing music, it is linked to finding a good illustration, a good visual (or multimedia) surrogate for a PAP.
From another point of view, each defect in Thomas Green's Cognitive Dimension Framework may be viewed as the manifestation of some defect in the degree of abstraction adopted for at least one constituent.
Still another point of view. When we want to make comparisons between different experiments, what is comparable is probably made of the PAPs of the different problem constituents.
This could be easily connected with the propositions of Annelise Pejtersen, Stefano Mizzaro and probably other MIRAns and would perhaps aid applying David Harper's Rules of Thumb (e.g. for role expressiveness).
Finally, fixing some environment to a (H)IR session probably may be represented by keeping in mind (or in the system) a certain permanence in the underlying abstraction level -which may be modulated-, while the particular abstraction levels of the different objects may change rapidly.
In conclusion, I think that abstraction aspects may (and must ?) be the pivot of evaluation work. In particular, working on and in terms of models is necessary.
Final notice : this has been done with respect only to what was presented at the meeting in Padova (and what I understood). But, obviously, it must be complemented and made more precise in the light of existing work (e.g. Ingwersen, Croft, Chiaramella, .......).

