drawing of Villa Duodo, Monselice

Proceedings of the Second Mira Workshop

Edited by Mark Dunlop

University of Padua

Villa Duodo, Monselice, Italy

November 14-15, 1996



Preface

These proceedings are the result of the second workshop of the Mira working group. Mira is a European Union ESPRIT working group which is funded to hold regular meetings to work on evaluation of modern information retrieval systems (including very large scale systems, interactive systems and browsing based systems). The working group is guided by Keith van Rijsbergen and is composed of: The University of Glasgow (prime contractor), City University, Dortmund University, Dublin City University, GMD IPSI Institute, IEI-CNR Pisa, Joseph Fourier University, Nancy University, Padova University, Robert Gordon University, Danish Royal School of Librarianship, Swiss Federal Institute of Technology Zurich, Tampere University and The Union Bank of Switzerland.

This workshop was held in the very pleasant surroundings of the Villa Duodo, a 16th-century villa perched on the hillside above the village of Monselice near Padova (the villa is now in the hands of the University of Padova). This amazing location and the social programme (or should that read, eating programme) were arranged by the local organisers: Fabio Crestani and Silvia Gabrielli.

After the first meeting of the working group, in which all participants presented their work and views on evaluation in IR, it was decided to give more time in subsequent meetings for discussion. Steve Draper proposed a format based on short presentations and long group discussions - this was adopted for the workshop (see Attendees and Workshop Format for more details). I implemented this format (even down to a whistle to control the timing) and these proceedings are composed mainly of the reports of the working groups (written up shortly after the workshop).

Together with working group reports and short reports from presenters, the proceedings include papers giving an introduction to Thomas Green's Cognitive Dimensions, an overview of hypertext-IR evaluation by Jonathan Furner, an overview of work-place evaluation by Annelise Mark Pejtersen, and an introduction to Evaluation-Light by David Harper and David Hendry; they conclude with a short paper by Marion Crehange summarising the workshop.

To print these proceedings simply print each section in the contents list above - the proceedings are split into six long HTML pages to make printing easier. Hypertext links are provided within each page: please make sure a page is fully loaded before using these links. The symbol will take you to the top of the current page when clicked, from there links lead to each report/presentation within that page and to the contents listing (via the workshop title). When printing pages some amount of reduction is helpful; I find a two-to-a-page layout best. The proceedings are also available as a technical report by writing to "Research Report Request, Computing Science, University of Glasgow, Glasgow G12 8QQ, Scotland". The proceedings may be distributed and printed freely so long as this preface is included - the proceedings must not be charged for (except in printed form by Glasgow University). The proceedings are copyrighted University of Glasgow. The full citation for these proceedings is given below. Further details on the Mira working group are available at http://www.dcs.gla.ac.uk/mira/.

The superb location, good food and wine, and the discussion-based format led to an invigorating meeting and hopefully this comes through in these proceedings.

Mark Dunlop

Contents

Please note that some titles have been shortened to aid formatting of the printed version - if citing, please use the full title from within the proceedings.

0 Background to Mira and workshop format;

1 Session 1: Combining HCI and IR Evaluation;

2 Session 2: The TREC-5 Interactive Track Experience;

3 Session 3: Evaluation of Hypertext/Hypermedia Information Retrieval;

4 Session 4: New directions in the evaluation methodology.

Full Citation

Dunlop, M.D. (editor). Proceedings of the Second Mira Workshop (Monselice, Italy). University of Glasgow Computing Science Research Report TR-1997-2, http://www.dcs.gla.ac.uk/mira/workshops/padua_procs , November 1996.


University of Glasgow - - - - - - Universita degli Studi di Padova

0: Workshop Attendees and Format


0.1 Workshop format

The workshop was split into four half-day sessions: each session had a specific theme relating to the overall theme of Mira. Within a typical session there would be four short five-minute introductions to the topic, essentially giving an individual's personal view on what the main issues of that session were. These short talks would be followed by a longer keynote presentation. After around one hour of talks, the session would split into four or five groups which discussed the topic and the presentations for roughly one and a half hours before concluding. To form these proceedings all presenters were invited to write an abstract of their talks - in practice, most keynote presenters and some others wrote a short paper for their talks. Each working group for each session was required to write a short summary of their session (resulting in 32 contributions in all).

Together with lunch / coffee breaks, social events and time to walk in the grounds of Villa Duodo, this format led to a considerable amount of discussion which pushed our knowledge forward and tired most participants!

0.2 Workshop attendees

The attendees were:

Maristella Agosti
agosti@dei.unipd.it
Micheline Beaulieu
mmb@is.city.ac.uk
Nick Belkin
belkin@scils.rutgers.edu
Pia Borlund
pbj@db.dk
Giorgio Brajnik
brajnik@dimi.uniud.it
Marco Buchel
buechel@inf.ethz.ch
Marion Crehange
marion@loria.fr
Fabio Crestani
fabio@dei.unipd.it
Steve Draper
steve@dcs.gla.ac.uk
Mark Dunlop
mark@dcs.gla.ac.uk
Jonathan Furner
j.furner@rgu.ac.uk
Silvia Gabrielli
silvia@dafne.dei.unipd.it
Isabella Gagliardi
isabella@itim.mi.cnr.it
Thomas Green
thomas.green@mrc-apu.cam.ac.uk
Kai Grossjohann
grossjohann@charly.informatik.uni-dortmund.de
David Harper
djh@scms.rgu.ac.uk
David Hendry
d.hendry@scms.rgu.ac.uk
Peter Ingwersen
pi@db.dk
Kalervo Jarvelin
kalervo.jarvelin@uta.fi
Noriko Kando
noriko@db.dk
Massimo Melucci
massimo@dafne.dei.unipd.it
Stefano Mizzaro
mizzaro@dimi.uniud.it
Jane Reid
jane@dcs.gla.ac.uk
Eero Sormunen
lieeso@uta.fi
Carlo Tasso
tasso@dimi.uniud.it
Keith van Rijsbergen
keith@dcs.gla.ac.uk
Bruna Zonta
bruna@itim.mi.cnr.it


1: Combining HCI and IR Evaluation


Co-chaired by Keith van Rijsbergen and Steve Draper.

Short presentations by
Micheline Beaulieu, Steve Draper, Mark Dunlop and Stefano Mizzaro.

Full presentation by Thomas Green.

Working group reports by Steve Draper, David Harper, David Hendry and Kalervo Jarvelin.


1.1 Presentation summaries

1.1.1 IR and HCI - Micheline Beaulieu

The relationship between IR functionality and HCI issues has been the focus of several projects based on the Okapi system. The investigations have been concerned with the evaluation of automatic and interactive query expansion facilities implemented in different interface environments, including a VT100 character-based interface and two GUIs, one with overlapping windows and another with multiple panes. Operational field trials have raised three major questions regarding factors which contribute to the quality of interaction. Firstly, to what extent should the underlying functionality of the system, i.e. the relevance feedback and query expansion process, be made visible? Secondly, what is the cognitive loading if the user is required to select terms for query expansion? Thirdly, what is the resulting distribution of control between the user and the system?

The retrieval task can be viewed as two distinct sub tasks: one concerned with query construction and the other with viewing or browsing results. Interface features need to cater for each.

1.1.2 Evaluation in HCI and IR - Steve Draper

This session is about HCI and IR. Given the tradition of systematic and controlled experimentation in IR, I think the important lessons from HCI evaluation for IR are mainly about the importance of discovering surprises. In particular, users surprise you when you look at what they actually do as opposed to what you expect or what they might tell you if you interview them rather than observe them in action. First of all, we can't be sure of what the user's goals (information needs) are unless we investigate them too. For instance in HCI an easy mistake to make is to assume that every error message indicates a user error. However sometimes users are systematically experimenting with the machine, and using the error messages to get information. Similarly, it may be that some user sessions are not trying to retrieve any particular documents or information about the domain, but instead looking at what kind of thing the collection holds or what kind of search request is effective. Secondly, you can't always trust what users say, or what you expect, about user methods of doing a task. In one project I supervised, my student tested a system not only on the children it was intended for but also on the programmers who had designed it. The latter insisted that allowing boolean operators as well as free text queries was important, but in fact in doing the set tasks they did not use any of the operators they had insisted on implementing. Thirdly, features of the user interface that may not seem important even to the user may turn out to be when you observe closely. My student noticed that, of the two systems she tested, in the one that did not highlight and automatically scroll to matched terms in documents users often dismissed as irrelevant documents that were in fact important for them. All of these issues illustrate the importance of real user observations, as opposed to interviews or numerical measures.

On the other hand, my own interest in IR evaluation comes from the fact that I believe it shows the importance of benchmark measures that can be used apart from user testing: and the challenge is how to combine the use of both. This is fundamentally because the machine's functionality and performance is one (although only one) of the crucial factors affecting the user: for example in some cases at least, whether or not the machine delivers all the relevant documents and delivers them fast enough is very important to the user. This is in contrast to, say, word processors where functionality is usually not a key issue. Thus while open-ended observation of real user behaviour is crucial for spotting what the important issues are, engineers can only tune a system efficiently if they can concentrate for periods on optimising performance on some simple measure. If we don't use recall and precision, we will still need to identify measures like them, even though it is important also to do more open-ended user studies to tell us how such measures relate to the overall task performance.

1.1.3 The problems with precision and recall - Mark Dunlop

Although precision and recall are widely used in the IR community and provide a valuable tool for analysing the effectiveness of underlying IR engines, I see there being three key problems with respect to evaluating interactive IR systems:

They don't tell the whole story.

Precision - Recall graphs only show the retrieval performance of the underlying system and do not take into account many aspects of the user interface which impinge on the applied effectiveness of the IR system. In other words, the user interface and speed issues are not taken into account and, as a result, having two PR graphs of two different IR systems does not tell us which would perform better in a real setting for a particular set of tasks and a particular set of users.

In my talk I compared PR graphs with engine performance for cars: brake horse power, torque, fuel consumption and top speed can all be used to give a statement on how effective the car's engine is. However, very few people buy a car based purely on these features - there is little point in buying a two-seater sports car as the only car for a family of four, even if the engine is superb. PR graphs and engine performance figures share the same major benefit: as Cooper [1968] said for his Expected Search Length measure, they are useful "in a theoretical investigation of how varying a certain design parameter would affect retrieval effectiveness under certain assumed conditions regarding input queries and document collection makeup". Their main use is to improve the engine, which must now be viewed as a very small part of reviewing the system.
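As a rough illustration of the kind of engine-level measure Cooper's quotation refers to, the following Python sketch computes an expected search length for a weakly ordered output. It assumes the usual formulation of the measure (non-relevant documents in fully examined levels, plus i * s / (r + 1) for the final level); the function name and the representation of rank levels as (relevant, non-relevant) pairs are hypothetical choices made for this example, not taken from the talk.

```python
def expected_search_length(levels, wanted):
    """Expected number of non-relevant documents read before `wanted` relevant ones.

    `levels` is a list of (relevant, non_relevant) counts, one pair per group of
    tied documents, best group first (assumed representation for this sketch).
    Returns None when the output does not contain `wanted` relevant documents.
    """
    skipped_non_relevant = 0  # non-relevant documents in levels read completely
    found = 0                 # relevant documents found in those levels
    for relevant, non_relevant in levels:
        if found + relevant < wanted:
            # The whole level must be read and the search continues below it.
            skipped_non_relevant += non_relevant
            found += relevant
            continue
        # The search ends somewhere inside this level, whose order is random.
        still_needed = wanted - found
        return skipped_non_relevant + non_relevant * still_needed / (relevant + 1)
    return None


if __name__ == "__main__":
    # Two tied levels: (1 relevant, 2 non-relevant) then (2 relevant, 3 non-relevant).
    print(expected_search_length([(1, 2), (2, 3)], wanted=2))
```

Like a PR graph, the figure describes the engine's output only: it says nothing about the interface through which the user reads that output.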

Graphs are hard to read.

I personally find PR-graphs hard to read. As an example, Porter's stemming algorithm should improve recall at the cost of precision, but how does this relate to the graph? And what does that mean for typical user tasks? I am currently working on a hypothesis that a more task and user oriented graphing method would help interpretation of what system is better for what tasks. Although some users may have a feel for what percentage of a domain's documents they are looking for, I guess that this would be a small percentage of users. Most would be interested in finding, say, "2-5" or "10-20" documents on a topic. With ranked retrieval systems, a format which maps effectiveness to distance down the ranked list would also be closer to most users' interactions with the system. While such graphs are not available, we must attempt to translate PR-graphs into tasks to establish when such features are useful and if an improvement is actually likely to help end-users.
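To make the contrast concrete, here is a minimal Python sketch of the two views of the same ranked list: precision and recall at a cut-off, and the distance down the list a user must read to find a given number of relevant documents. The document identifiers, relevance judgements and function names are invented for illustration; this is one possible reading of a "distance down the ranked list" format, not a method proposed in the talk.

```python
def precision_recall_at(ranking, relevant, k):
    """Return (precision, recall) over the top-k documents of a ranked list."""
    top_k = ranking[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def rank_to_find(ranking, relevant, wanted):
    """Return the rank at which the `wanted`-th relevant document appears, or None."""
    found = 0
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            if found == wanted:
                return position
    return None


# Hypothetical ranked output and relevance judgements.
ranking = ["d3", "d7", "d1", "d9", "d4", "d2", "d8", "d5"]
relevant = {"d3", "d1", "d2", "d5"}

if __name__ == "__main__":
    for k in (2, 5, 8):
        p, r = precision_recall_at(ranking, relevant, k)
        print(f"top {k}: precision={p:.2f} recall={r:.2f}")
    for wanted in (1, 2, 4):
        print(f"{wanted} relevant document(s) found by rank",
              rank_to_find(ranking, relevant, wanted))
```

The second function answers the question a user looking for "2-5" documents is more likely to ask: how far do I have to read?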

Being required to plot them is holding back research.

Finally, I feel that the belief in having to plot recall and precision graphs to have a piece of IR work accepted may be holding back advances in interface design. It is simply too expensive to create a test collection for real tasks (and often too artificial to adapt test collections for interactive use).

We need alternatives....

Reference

Cooper, W.S. "Expected search length: a single measure of retrieval effectiveness based on the weak ordering action of retrieval systems", American Documentation, vol. 19, January 1968.

1.1.4 How many kinds of relevance in IR? - Stefano Mizzaro

(see a complete paper on this topic)

It is widely recognised that relevance is a (if not 'the') central concept of IR. I show that: (i) there are many kinds of relevance, not just one, and (ii) these kinds can be classified in a four dimensional space. It is commonly accepted that relevance is a relation between two entities of two groups. In the first group, we have one of the following three entities: a surrogate, a document, or the information received by the user.

In the second group, we have one of the following four entities: a query, a request, the perceived information need, or the "real" information need.

On this basis, a relevance can be seen as a relation between two entities, one from each group: the relevance of a surrogate to a query, or the relevance of the information received by the user to the information need, and so on. Therefore, a relevance seems a point in a two-dimensional space. But these are not all the possible relevances, since two more dimensions have to be taken into account. First, the above mentioned entities can be decomposed in the following three components (third dimension):

Therefore, a surrogate (a document, some information) is relevant to a query (request, perceived/"real" information need) with respect to one or more of these components. The fourth dimension is the time: a surrogate (a document, some information) may be not relevant to a query (request, perceived/"real" information need) at a certain point of time, and be relevant later, or vice versa. Summarising, each relevance can be seen as a point in a four-dimensional space, the values of each dimension being:

Finally, some problems/questions:


1.2 An Introduction to the Cognitive Dimensions Framework - Thomas Green

This talk was a shortened version of a talk given at Visual Languages 96. Some of the illustrations I used in that talk are available from my home page. From there you can also download a copy of a much more detailed paper by Green and Petre, which appeared in the Journal of Visual Languages and Computing 1996.

1. Introduction

The problem I address is, how to evaluate the usability of information-based artefacts and notations. More particularly, how to evaluate them cheaply. No laborious user-testing, no detailed analyses and modelling by HCI experts.

HCI folk have slowly learnt that expensive, time-consuming evaluative methodologies are not taken up by creators, for very good reasons. I propose something different, a framework of user-centred discussion tools.

We all have concepts that are vaguely known but unformulated. Discussion tools are elucidations of such concepts. If they resonate with your experience, they can promote a higher level of discourse amongst you, the designers and creators. They can create goals and aspirations, promote the reuse of good ideas in new contexts, and provide a basis for informed critique. Standard examples can become common currency and best of all, once concepts are named and exposed, their interrelationships can be appreciated.

The set I propose is called the cognitive dimensions framework, a still-unfinalised set of about a dozen terms such as 'viscosity', 'premature commitment', 'abstraction level'.

Background

The problem of usability evaluation has been attacked in three ways. One way is to perform user testing, in which users are watched while they use the system. This is expensive and somewhat artificial: users in a laboratory, performing specified tasks, are different from users in the wild. Moreover, it takes far too long. Designers cannot hang around for the necessary weeks, then make a small modification and repeat the cycle. Results from lab testing are valuable but in general they are too expensive and not 100% trustworthy.

The second way is to use predictive user models. This is much cheaper and has been extremely successful in some cases. Much the commonest technique is GOMS, in which all the users' core tasks are scrutinised in detail, possible methods are worked out, and the time required for the user actions to accomplish these methods is predicted in detail. Although GOMS has achieved considerable success as a cheaper and quicker alternative to user testing, it is still an expensive undertaking, and it requires the assistance of HCI experts. Moreover, it really needs to be performed on the finished system, since the time-predictions it deals in depend on physical and perceptual characteristics of the interface display. (See Olson and Olson, 1990, for one of many accounts of GOMS).
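The flavour of such a time prediction can be sketched with the keystroke-level model, the simplest member of the GOMS family. The Python below is only an illustration: the operator times are the approximate values commonly quoted for that model and should be treated as assumptions, and the two methods for starting a search are hypothetical, not taken from any real system.

```python
# Approximate keystroke-level operator times in seconds (assumed values):
# K = press a key or button, P = point with the mouse,
# H = move hand between keyboard and mouse, M = mental preparation.
OPERATOR_SECONDS = {"K": 0.2, "P": 1.1, "H": 0.4, "M": 1.35}


def predict_seconds(method):
    """Sum the operator times for a method written as a string such as 'MPK'."""
    return sum(OPERATOR_SECONDS[op] for op in method)


# Two hypothetical methods for starting a search.
menu_method = "HMPKMPK"   # reach for mouse, open a menu, choose "Search"
shortcut_method = "MKK"   # recall and press a two-key shortcut

if __name__ == "__main__":
    for name, method in (("menu", menu_method), ("shortcut", shortcut_method)):
        print(f"{name}: {predict_seconds(method):.2f} s")
```

Even this toy version shows why the approach depends on a finished interface: the operator sequence cannot be written down until the physical layout of menus and keys is known.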

I would certainly recommend designers of information retrieval systems to use GOMS rather than to perform user testing. But my purpose here is to describe my cognitive dimensions framework, one of a new generation of lightweight, approximate evaluation methods which constitute the third type of attack on the problem of usability evaluation.

Discussion Tools

The cognitive dimensions framework is not an analytic method. Rather, it is a set of discussion tools. My purpose is to provide a way in which some evaluation can be done by the designers themselves.

I believe what we need is to improve the quality of discussion. Experts make sophisticated judgements about systems, but they have difficulty talking about their judgements because they don't have a shared set of terms. Also, experts tend to make narrow judgements, based on their own needs of the moment and their guesses about what other people may need; and other experts don't always point out the omissions. Again, if they had a shared set of terms, and that set was fairly complete, it would prompt a more complete consideration.

In short, experts would be in a good position to make useful early judgements, if (i) they had better terms with which to think about the issues and discuss them, and (ii) there was some kind of checklist. The terms might or might not describe a new idea; most probably, the expert will recognise a concept as something that had been in his or her mind, but had never before been clearly articulated and named.

Discussion tools are good concepts, not too detailed and not too woolly, that capture enough important aspects of something to make it much easier to talk about that thing. They promote discussion and informed evaluation.

To be effective, they must be shared - you and I must have the same vocabulary if we are going to talk. And it is better still if we share some standard examples. And it is best of all if we know some of the pros and cons - the trade-offs between one thing and another.

They have many advantages:

What discussion tools do not need to do is describe novel ideas. If they just give a name to something you had often thought about but never thought about giving a name to, that's fine. But sometimes they might be ideas that are new to you, and that's fine, too.

Figure 1 illustrates a real-life discussion without the benefit of discussion tools; Figure 2 shows how it might have been if the participants had possessed shared concepts - shorter, more accurate, and less frustrating.

2. Cognitive Dimensions as discussion tools

The idea of 'dimensions'

The previous section illustrated the idea of discussion tools; the 'cognitive dimensions' framework, first introduced in Green (1989), is meant to provide discussion tools to help people who are not HCI experts in making quick but useful evaluations.

I believe that taken together, the cognitive dimensions describe enough aspects to give a fair idea of how users will get on with a system, and can help both designers and users think and talk about the system. Each dimension describes one aspect of a system, something that affects how users will manage. The framework contains about 12 dimensions, but in this document I shall not go into detail.

System = notation + environment

Something that is cognitively hard in one environment may be much easier in another. As an extreme example, writing a Pascal program over the phone is not recommended, even though Pascal has an easy syntax for writing on paper. The properties also depend on the tools available in a given environment; a word processor with no search-and-replace is a different system to one that has a powerful and easy-to-use tool.

No free lunches

You can fix any kind of difficulty, either by changing the notation or by changing the environment, but you usually pay for it with another kind. For example the search-and-replace tool fixes one form of 'viscosity' but it introduces a new and slightly higher level of abstract thought. Search tools that use regular expressions are an even better fix - but they introduce a very much higher level of abstract thought.
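The trade-off can be seen in a small Python sketch. A single regular-expression substitution restyles every level 2 heading at once, which removes the repetition, but the author now has to reason about a pattern rather than about individual headings. The sample document and the heading convention (a line starting with "digit.digit") are assumptions made for this example.

```python
import re

# Hypothetical plain-text document; level 2 headings start with "digit.digit ".
document = """\
1 Introduction
1.1 Background
Some text mentioning section 1.1 in passing.
1.2 Discussion tools
2 Cognitive dimensions
2.1 The idea of dimensions
"""

# The abstraction the user must now get right: which lines count as level 2 headings.
LEVEL_2_HEADING = re.compile(r"^(\d+\.\d+) (.+)$", flags=re.MULTILINE)


def restyle_level_2(text):
    """Upper-case the title of every level 2 heading in one operation."""
    return LEVEL_2_HEADING.sub(lambda m: f"{m.group(1)} {m.group(2).upper()}", text)


if __name__ == "__main__":
    print(restyle_level_2(document))
```

A slightly wrong pattern (for example, one that forgets the start-of-line anchor) would also rewrite the sentence that merely mentions "1.1", which is exactly the cost of the higher abstraction level.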

I shall illustrate some of these trade-off relationships below.

'Good' depends on context

The global spelling corrector might seem a good idea, but is it always? Sometimes you want to make sure the user looks at every single case separately.

In all the examples I shall give, you should remember that different circumstances might demand different designs.

Process, not just structure

A notation might be as good as you like, considered just as a static entity, but what needs to be evaluated is the whole process of using it: building or writing in it, debugging it, reading it, maintaining it over the years. Certain sorts of diagrammatic notations, such as diagrammatic query languages, are probably easier to understand than symbolic notations, but they are also harder to 'write' and harder to modify. These various aspects all need to be balanced out.

3. The Cognitive Dimensions

In its present form, the framework has 14 dimensions (Figure 3) although if I'm honest I think there are overlaps.

I obviously can't go through all of those here. On the other hand, to describe the trade-offs issue I have to at least give thumbnail sketches of three or four.

A viscous system resists change - you have to do a lot of work. For example, if you have produced a long and carefully formatted document and then someone tells you to change the style of all the level 2 headings, say a hundred of them, correcting each one individually is hard work ('repetition viscosity').

One solution is to create a 'style sheet' that defines a level 2 heading. By changing the definition, you can change all the headings.

In Green (1990) I distinguish repetition viscosity and 'knock-on' viscosity.

Change one thing, and who knows what might fall over? That's the hidden dependency problem. Spreadsheets are a fine example. So are some kinds of style sheets; if one style is defined in terms of another, changing the parent might give you an unpleasant surprise.

The key aspect is not the fact that A depends on B, but that the dependency is not made visible.
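A style tree makes a compact example. In the Python sketch below, which uses invented style names and attributes, a child style silently inherits its font size from an ancestor, so changing the ancestor changes the child. The `dependents` function shows one way a tool could make the dependency visible before the change is made; it is an illustration, not a description of any particular word processor.

```python
# Hypothetical style tree: each style names its parent and overrides some attributes.
styles = {
    "body": {"parent": None, "font_size": 10},
    "quotation": {"parent": "body", "indent": 2},
    "inset quotation": {"parent": "quotation", "space_after": 0},
}


def resolve(name, attribute):
    """Look up an attribute on a style, falling back to its ancestors."""
    while name is not None:
        style = styles[name]
        if attribute in style:
            return style[attribute]
        name = style["parent"]
    return None


def dependents(name):
    """List every style that inherits, directly or indirectly, from `name`."""
    result = []
    for child, style in styles.items():
        if style["parent"] == name:
            result.append(child)
            result.extend(dependents(child))
    return result


if __name__ == "__main__":
    print(resolve("inset quotation", "font_size"))  # inherited from "body"
    print("changing 'body' also affects:", dependents("body"))
    styles["body"]["font_size"] = 12
    print(resolve("inset quotation", "font_size"))  # changed without being edited
```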

I like to think of the abstraction level as the number of new high-level concepts that have to be learnt to make use of a system, such as 'style sheet' or 'regular expression'. Each new idea is a significant barrier to learning and acceptance. High-level ideas - that is, ideas that do not refer to easily-produced concrete instances - are very much harder still.

Let's continue with the word-processor example. If you decide to use styles, notice that you have to decide what styles you want and how they are related very early. Too soon, sometimes. Afterwards, you might wish that you had defined something like "inset block quotation [no space afterwards]" as a child of 'inset' rather than as a child of 'quotation' - but it's probably too late now; the viscosity of the system would make it too much work to redefine everything.

Premature commitment occurs in all sorts of places. Try drawing a map to guide someone to your house. What's the betting you start too close to one side of the paper, or start at the wrong scale, and the last few turnings are all scrunched up?

Or try working out a few formulae with a pocket calculator. How often do you get in a knot because you've started entering the formula in a way that makes the computation extra hard?

The Trade-Off Problem

In the examples I have given, I have repeatedly illustrated how fixing a problem in one dimension leads to a problem with another dimension. A sort of law of conservation of cussedness.

However, the designer can choose. He or she can fix the viscosity problem by increasing the abstraction level OR by changing to a different kind of notation. Using more abstractions is the commonest solution, but by no means the only one.

I like to compare the cussedness of information structures with the behaviour of ideal gases. Three quantities, temperature, pressure and volume, describe an ideal gas. If you want to increase the temperature, you can keep the pressure constant (but the volume must be allowed to increase) or you can keep the volume constant (but the pressure must be allowed to increase). Taken in pairs, these three dimensions are orthogonal. But you cannot raise the temperature while holding constant both the pressure and the volume. The parallels may not be exact, but they are intriguing.

Figure 4 illustrates some of the trade-off relationships that are frequently observed. We have seen how viscosity can be reduced by introducing more abstractions; but getting the abstractions right demands thinking ahead (i.e. there is a premature commitment problem). Viscosity increases the cost of premature commitment; if the abstractions themselves are viscous, then getting them wrong means you're in trouble. Furthermore, all too often abstractions introduce problems of hidden dependencies, because one abstraction is defined in terms of another.

Secondary notation and visibility were not discussed above.

4. Using the Cognitive Dimensions approach, and how it compares with other approaches

My approach is very easy to use.

What you get out of this approach is a rough and sketchy evaluation. As we saw back in Figure 1, it will correspond to what users talk about. And if you were to consider changing the design, it will alert you to some of the possible trade-off consequences.

What it will not do is give precise time estimates. For that you should use GOMS (see above).

Nor does the question of users' knowledge get much attention in my framework. A much more thorough approach to users' knowledge has been developed by Lewis et al. (1991). In their 'cognitive walkthrough' methodology, the emphasis is on how the user knows what to do next.

For best results, I think all three methods could be employed, since they address different facets of the problem.

5. Conclusion

In practice, the cognitive dimensions approach seems to have hit the right note for many people. It has been tried as a teaching tool and as a simple evaluation method, in both cases with success.

In the field of information retrieval not much has been done with it, but I think it would be a good way to make a preliminary evaluation of the usability of a system, rather than going straight for expensive user testing.

6. References

Green, T. R. G. (1989) Cognitive dimensions of notations. In A. Sutcliffe and L. Macaulay (Eds.) People and Computers V. Cambridge University Press.

Green, T. R. G. (1990) The cognitive dimension of viscosity: a sticky problem for HCI. In D. Diaper, D. Gilmore, G. Cockton and B. Shackel (Eds.) Human-Computer Interaction - INTERACT '90. Elsevier.

Green, T. R. G. and Petre, M. (1996) Usability analysis of visual programming environments. J. Visual Languages and Computing, 7, 131-174.

Lewis, C., Rieman, J. and Bell, B. (1991) Problem-centered design for expressiveness and facility in a graphical programming system. Human-Computer Interaction, 6 (3-4), 319-355.

Olson, J. R. and Olson, G. M. (1990) The growth of cognitive modeling in human-computer interaction since GOMS. Human-Computer Interaction, 5, 221-265.

Figure 1: An impoverished discussion

This example illustrates a discussion that would have been better if the discussants had shared appropriate concepts


Verbatim transcript from a newsgroup discussion (real words from real users). A's remark that starts the excerpt was the conclusion of an irritated message about how much work he or she had to do to keep identical formats for all components of a large project (a 'book' in Framemaker jargon).

NB: this discussion referred to a version of Framemaker that is now obsolete.


A: ALL files in the book should be identical in everything except body pages. Master pages, paragraph formats, reference pages, should be the same.

B: Framemaker does provide this ... File -> Use Formats allows you to copy all or some formatting categories to all or some files in the book.

A: Grrrrrrrrr ........ Oh People Of Little Imagination !!!!!!

Sure I can do this ... manually, every time I change a reference page, master page, or paragraph format .....

What I was talking about was some mechanism that automatically detected when I had made such a change. ( ..... ) Or better yet, putting all of these pages in a central database for the entire book ......

C: There is an argument against basing one paragraph style on another, a method several systems use. A change in a parent style may cause unexpected problems among the children. I have had some unpleasant surprises of this sort in Microsoft Word.

Figure 2: An improved discussion

A: Framemaker is too viscous.

B: With respect to what task?

A: With respect to updating components of a book. It needs to have a higher abstraction level, such as a style tree.

C: Watch out for the hidden dependencies of a style tree.

(further possible comments)

The abstraction level will be difficult to master; getting the styles right may impose lookahead.


In this version of the discussion, a number of new terms have been introduced:

The terms are part of the framework of cognitive dimensions presented in this document.

Figure 3: The Full List of Cognitive Dimensions

Dimension                       Thumbnail description
Viscosity                       resistance to change
Hidden Dependencies             important links between entities are not visible
Visibility and Juxtaposibility  ability to view components easily
Imposed Lookahead               constraints on order of doing things
Secondary Notation              extra information in means other than program syntax
Closeness of Mapping            representation maps to domain
Progressive Evaluation          ability to check while incomplete
Hard Mental Operations          operations that tax working memory
Diffuseness/Terseness           succinctness of language
Abstraction Gradient            amount of abstraction required, amount possible
Role-expressiveness             purpose of a component is readily inferred
Error-proneness                 syntax provokes slips
Perceptual mapping              important meanings conveyed by position, size, colour etc
Consistency                     similar semantics expressed in similar syntax

Figure 4: Some Trade-Offs among Cognitive Dimensions


1.3 Working group reports

1.3.1 Overall task measurement and sub-task measurements - Steve Draper

We felt we would eventually like to achieve an overall, universal evaluation method for IR systems: that is what we would like to work towards.

The ultimate criterion for evaluating IR systems is: the satisfaction of the user's overall task in their work context (typically their information need will be a subtask of this overall task).

We may be able to develop standard evaluation measures for retrieval: but which of these measures are important and relevant to a particular case will depend upon the TYPE of the information need involved in that case. We need different measures for different information needs, not single standard measures for systems independent of user needs and tasks; and a corollary is that a system optimised for one set of measures is most unlikely to be optimal or even good for all user tasks and needs.

It may be useful to divide the overall IR task into subtasks, and to have metrics or standard measures for each subtask. A first approximation to such a division for a typical modern IR system might be like this:

  1. Understanding the information need [User]
  2. Expressing the information need to the machine e.g. formulating a query, or deciding on and performing other input actions [User]
  3. Retrieving a set of documents [Machine]
  4. Selecting a document from a list of surrogates [User]
  5. Deciding the relevance of a document from inspecting it [User]

Note that understanding the information need better is an important output, not input, of most retrieval sessions. As the user interacts with the machine and performs successive retrievals, so their knowledge of their information need improves, and this is often an important and useful product of the activity.

Traditional experiments on recall and precision apply to only 1 of these 5 subtasks: the one the machine does. The other subtasks are done by the user, but note that features of the user interface strongly affect how easily and well the user can do these tasks e.g. the information displayed in the surrogate determines performance of (4), highlighting and scrolling to matched terms has a strong effect on (5).

It would be possible and sensible to do evaluation experiments on each subtask separately, and to devise performance measures for each separately.

In addition, overall performance of the combined user-machine system is a distinct issue. It should be measured in itself, i.e. as the net result of a complete retrieval session. Measures for this overall activity could be developed and standardised, e.g. precision, recall, satisfaction of the information need, costs to the user.

Overall performance is affected not only by performance on each subtask, but also by the connection and co-ordination of these subtasks. This again is itself influenced by the user interface e.g. if only the first 10 surrogates are displayed, then the selection (precision and recall) of the machine for its first 10 items matters, but not for more items; and probably the order within the 10 doesn't matter very much. In other words, a feature of the user interface may completely change what measures for subtasks are important.
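The effect of a display cutoff can be made concrete with a minimal sketch. It assumes a ranked list of document identifiers and a set of identifiers judged relevant; the function names and the sample data are illustrative assumptions, not taken from any system discussed at the workshop.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the first k ranked documents that are relevant."""
    shown = ranking[:k]
    if not shown:
        return 0.0
    return sum(1 for doc in shown if doc in relevant) / len(shown)


def recall_at_k(ranking, relevant, k=10):
    """Fraction of all relevant documents that appear in the first k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranking[:k] if doc in relevant) / len(relevant)


# Hypothetical data: 12 ranked documents, 4 of them judged relevant.
ranking = [f"d{i}" for i in range(1, 13)]
relevant = {"d2", "d5", "d9", "d12"}

# Same documents above the cutoff, but in reverse order.
reordered = list(reversed(ranking[:10])) + ranking[10:]

for name, run in (("original", ranking), ("reordered", reordered)):
    print(name, precision_at_k(run, relevant), recall_at_k(run, relevant))
```

Both runs give the same values, because these set-based measures ignore order within the displayed 10, while d12, ranked below the cutoff, never reaches the user however relevant it is.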

In summary, we are concerned with three levels:

Any measures used at lower levels are relevant only to the extent that they contribute to success and economy at achieving the task at the top level.

1.3.2 User, task and domain - David Harper

Initially, we considered what was meant by "Combining HCI and IR Evaluation". Two (of the possibly many) interpretations we thought about were: applying HCI evaluation techniques to the problem(s) of IR evaluation, and bridging the gap between the traditional IR component evaluation and HCI evaluation, where the latter involves task, domain, and possibly user modelling and analysis. We recognised that the IR community is already combining HCI and IR evaluation in the above senses, although there is much to learn from the HCI community (and this was amply demonstrated throughout the workshop).

There was some discussion on the position paper presented by Thomas Green, and specifically on the "Cognitive Dimensions Framework" (CDF). Some felt that CDF focused too much on the user interface, and in particular on the representation and presentation of information, and that the framework did not say enough about users, tasks, or domains. (My understanding is that CDF is independent of task, user or domain, in the sense that you choose the sets of tasks, users, and domains for which you are designing a system - note, not simply a user interface - and that the CDF can be used to explore the design space for such a system. Naturally, this observation was not made during the working session, but afterwards over a coffee!) We also wondered whether the CDF was based on some description or model of users' cognitive abilities. (I understand that it is, and that details can be found in papers on CDF.)

At this point, the focus shifted to looking at a recent large-scale investigation by Nick Belkin (and student) on the effectiveness, usability, and utility of relevance feedback in an IR system. Various important issues emerged during the description of, and discussions on, this investigation.

  1. The choice of measurable performance variables depends on the objectives or purposes of the evaluation. Different evaluation criteria will require different measures of performance, and indeed different data for computing the measures. Thus, retrieval effectiveness might be measured by Precision-Recall, which in turn requires that assessments of relevance be made. Usability might be variously measured by time taken for a search, number of actions performed, etc. User satisfaction might be best measured by questionnaires, for example. The message seems to be: choose the most appropriate measure(s) and collect data accordingly.
  2. The importance and difficulty of controlling experimental variables was highlighted. Some techniques for reducing variability were: task setting to establish a context for an experiment (and thereby reduce variability due to task variation and possibly domain variation); and appropriate training of test subjects in using the system under test (to reduce variability due to different users). It was acknowledged that controlled experiments were difficult to design and perform well for interactive systems, involving as they do actual users.
  3. Pilot experiments should be run to debug the experiment.
  4. The importance of discussion with the test subjects (users) in experiments was highlighted, even at the expense of removing some of the experimental control. An earlier speaker had emphasised how much can be learnt from observing and questioning users on their use of a system.

Perhaps the most difficult problem in the evaluation of interactive IR systems is how to generalise the findings of any particular experiment. Namely, given that an experiment is conducted in the context of some task(s), some domain(s), and some sets of users, how can the results be generalised, or at least applied, to different tasks, domains, and user groups? In this, it might be desirable to be able to characterise (or classify) tasks, domains, and users, according to some agreed framework, so that at least we might be able to compare the results of experiments. (Aside: The TREC experiments can be seen as an attempt to make experiments comparable by carefully defining the task(s), domain(s) and intended users.)

If we could find a way of classifying experiments, then we could develop a catalogue of experimental results which detailed, for each experimental investigation: task description, domain description, users, experiment(s) performed, findings of experiments, general findings (if possible), caveats, other related experiments, and supporting papers/publications.

We were reminded that real estate agents (realtors in the USA, I think), say that the three most important things about a property are: position, position and position. In the context of IR experiments, the three most important things may be: user, task and domain.

1.3.3 Three issues for any evaluation - David Hendry

We discussed a variety of recurrent issues on the evaluation of information systems. When carrying out an evaluation a stance is necessarily taken on certain important issues, if in some cases only by accident. Thus, pinning down these issues and being explicit about them during the planning stages of an evaluation should be beneficial.

Three issues that seem particularly important are:

In sum, the group decided that the goal of an evaluation should be to evaluate the functional use of the information system with the aim of collecting data to improve its functionality. There is no silver bullet for deciding how best to do this.

1.3.4 Issues in hypermedia system evaluation - Kalervo Jarvelin

The working group discussed the topic first generally and then on the basis of the given scenario concerning evaluating a hypertext-IR version of the British Highway Code. The aim was to develop an evaluation plan for the scenario, but the result of the discussion was rather a set of issues which need to be clarified in order to perform evaluation.

The group felt that evaluation must be built into the design of the system right at the start of the process.

Cost-benefit analysis process outline

The following four-phase process may be found in standard textbooks on cost-benefit analysis (CBA).

Phase 1: Analysis of the evaluation situation
  - problem definition
  - goal definition and setting the scope of the analysis
  - recognition (creation) of the alternatives to be evaluated
  - recognition of any limiting factors (e.g., cost) that must be met (e.g., not be exceeded)
  - identification of benefits, costs and other weaknesses to be taken into account in the evaluation

Phase 2: Clarification of evaluation methods
  - classifying benefits, costs and other weaknesses and identifying their place and time of occurrence
  - choosing scales of measurement for benefits, costs and other weaknesses
  - valuing benefits, costs and other weaknesses -- deciding on how to combine measurements in different dimensions
  - setting the basis of prediction (of long-term activities) -- e.g., how is the user population estimated to grow over the years?

Phase 3: Performing the analysis
  - data collection, e.g., in the form of test tasks
  - calculations
  - presentation of the results

Phase 4: Interpretation and decision-making
  This should take the whole process into account: were the alternatives and criteria identified correctly, was the combination of measurements done correctly, etc.
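The Phase 2 step of deciding how to combine measurements in different dimensions could, for instance, be sketched as a weighted sum. This is only one possible combination rule, chosen here for illustration; the dimension names, weights, alternatives and values below are all hypothetical, and the measurements are assumed to have already been placed on a common 0-1 scale.

```python
def combined_score(measurements, weights):
    """Weighted sum of measurements on a common 0-1 scale.

    Benefits carry positive weights; costs and other weaknesses
    carry negative weights. (Illustrative combination rule only.)
    """
    missing = set(weights) - set(measurements)
    if missing:
        raise ValueError(f"missing measurements: {sorted(missing)}")
    return sum(weights[name] * measurements[name] for name in weights)


# Hypothetical weights agreed during Phase 2.
weights = {"task_success": 0.5, "user_satisfaction": 0.3, "cost": -0.2}

# Hypothetical Phase 3 results for two alternatives under evaluation.
alternatives = {
    "alternative_a": {"task_success": 0.8, "user_satisfaction": 0.7, "cost": 0.6},
    "alternative_b": {"task_success": 0.6, "user_satisfaction": 0.5, "cost": 0.2},
}

scores = {name: combined_score(m, weights) for name, m in alternatives.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Making the weights explicit in this way also supports Phase 4: anyone interpreting the result can ask whether the combination of measurements was done appropriately, and rerun it with different weights.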

2: The TREC-5 Interactive Track Experience


Co-chaired by, and with presentations from, Nick Belkin and Micheline Beaulieu.

Working group reports by Maristella Agosti, David Harper, David Hendry, Jane Reid and Eero Sormunen.


2.1 Interactive evaluation at TREC - Micheline Beaulieu

The session provided an overview of the problems encountered in trying to conduct genuine interactive searching evaluation experiments in the context of TREC. The conflicts in the basic methods relate to the nature of the topics, the relevance judgements as well as the definition of the search task itself.

Firstly, lengthy topic specifications were found to favour automatic methods for query construction. In the five rounds of TREC the topics have evolved from rich discursive queries to brief questions.

Secondly, user relevance judgements not only could differ from those of the assessors but could also be made at different levels, e.g. the document may be pertinent to the topic but not meet all the specified criteria of the request. Moreover, in the case of relevance feedback, a document could be judged as a good source of terms but fall outside the topic criteria.

Thirdly, the task specification for interactive searching differed in each of the rounds, culminating in a separate interactive track in TREC-4. The ad hoc task was to find as many relevant documents as possible without too much rubbish in a limited time. Hence the basic laboratory experimental design was gradually modified to develop specialised methods for interactive searching within the TREC framework.

For TREC-5, attempts to make the interactive task compatible with the main TREC tasks were abandoned. Instead the task was defined as a browsing task, where a user would aim to retrieve items which covered as many different aspects of a topic as possible within a time limit. The total search output was assessed for "aspectual recall" instead of relevance at individual document level. The outcome of this approach and the experimental design has yet to be assessed.
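A minimal sketch of how such a measure might be computed follows. It assumes that assessors have listed the aspects of a topic and recorded which aspects each document covers; that data layout, the function name and the sample values are assumptions made for illustration, not the official TREC procedure.

```python
def aspectual_recall(saved_docs, doc_aspects, topic_aspects):
    """Fraction of a topic's aspects covered by at least one saved document."""
    if not topic_aspects:
        return 0.0
    covered = set()
    for doc in saved_docs:
        # Only aspects that belong to this topic count towards coverage.
        covered |= doc_aspects.get(doc, set()) & topic_aspects
    return len(covered) / len(topic_aspects)


# Hypothetical assessments for one topic with four aspects.
topic_aspects = {"a1", "a2", "a3", "a4"}
doc_aspects = {
    "d1": {"a1", "a2"},
    "d2": {"a2"},
    "d3": {"a4"},
    "d4": set(),
}

print(aspectual_recall(["d1"], doc_aspects, topic_aspects))
print(aspectual_recall(["d1", "d2"], doc_aspects, topic_aspects))
print(aspectual_recall(["d1", "d2", "d3"], doc_aspects, topic_aspects))
```

Saving d2 in addition to d1 does not raise the score, since d2 covers no new aspect. This is the sense in which the measure applies to the total search output rather than to individual documents.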

Mira participants should now consider what contribution could be made to addressing the interactive evaluation issues.


2.2 Working group reports

2.2.1 The positive and negative effects of TREC - Maristella Agosti

The group discussed different aspects related to the TREC initiative.

The group decided to concentrate first of all on an important and crucial distinction between

where the first one is related to the interest and focus on comparing systems, and the second one is useful for designers that intend to develop new systems.

The group was aware of the fact that sometimes this distinction is not made clear, and what is expected from an evaluation effort is different depending on the interpretation. (The distinction is discussed, in the context of educational software evaluation, in http://www.psy.gla.ac.uk/~steve/Eval.HE.html ). Once the group had made this distinction clear, it considered the positive and negative side effects of an effort such as TREC.

The positive effects include:

The negative effects include:

The conclusions of the group were:

So the group would like to see the MIRA working group address and decide on the possibility of acting as a spin-off for a European TREC. Why not plan something in the two and a half years of life of the MIRA working group?

2.2.2 TREC: A Way Forward for Mira - David Harper

Following the presentation on TREC, we identified what we thought were the main benefits of participation in TREC, namely:

It was observed that the main TREC activity is set up as a "competition" between the various participants, and each participant has a different research agenda, in general. Consequently, knowledge about IR tends to be developed in a "bottom-up" fashion as a result of the diversity of TREC experiments. We asked what might be achieved if TREC was set up as a non-competitive activity. It happens that the TREC mini-tracks are intended to foster this kind of collaborative activity, and the interactive track was held up as an example. Nevertheless, the element of competition remains. Could a TREC experiment be designed where some of the experimental variables were rigorously controlled in order to explore a particular aspect of the IR problem domain? For example, suppose the search engine was identical for a number of sites, and only the interface to the engine was changed. What could we learn about IR interface design from such an experiment?

We then considered what the Mira Working Group could do in respect of TREC. Various ideas emerged from our discussions:

We noted that TREC has at least another three years to run, and most of us felt that we would like to become involved in TREC. In some cases, lack of resources, both human and computer, was a barrier to involvement in TREC. We were reminded that there are a number of ways of being involved: A-type full participation, B-type participation using smaller datasets, and the various tracks. Perhaps we in Mira, together with the CEC, should explore establishing a European-based TREC resource to enable greater participation by European research groups in TREC.

2.2.3 The dynamic nature of information seeking - David Hendry

We discussed problems arising from the dynamic nature of information seeking. An information need is dynamic. Assessments about document relevance are dynamic. A searcher's knowledge of an information system, and of effective tactics for using it, is dynamic.

To better understand these elements of information-seeking, we decided that it is important to consider the whole task. So, for example, finding information in the British Highway Code [the given scenario for this session] is only one, perhaps relatively small, task for which the Code might be used. To explore this, we considered how the Code might be presented so that learners better acquire the knowledge they need for passing their driving test. This change in perspective, from that of searching to that of skill acquisition, opened up a range of possibilities and problems for how IR might apply to the Code.

One major difficulty in experimental evaluation is that changing the interface to a search engine is a gross change---many variables change at once. Thus, if differences are observed between the old and new interface it can be very difficult to know why.

Nevertheless, the group concluded that considering the embedded nature of information-seeking, with all its complexity, seems to be the most fruitful direction for discovering how to improve the effectiveness of information-seeking environments. Thus, making gross comparisons may teach us rather a lot, though perhaps not the exact reasons for any differences that emerge.

2.2.4 MIRA: the other side of TREC? - Jane Reid

The TREC experience has been very fruitful in some respects. For example, by recognising that induced information problems are necessary for repeatability of experiments, a very useful collection of such resources has been built up. However, TREC has two main shortcomings: firstly, it has focused on the retrieval system as the central concept in IR, and, secondly, it has proved to be primarily a competitive, rather than a collaborative, exercise.

It is time now to redress this balance. A parallel research community should be set up (led by MIRA participants?), focusing on interaction issues and user behaviour. It should make use of the TREC data, perhaps using small, limited subsets for different experiments. The long-term, co-operative aim of this community should be to draw general (i.e. cross-system) conclusions about IR as part of the overall process of information seeking. This requires agreement in advance on the goals, motivations and priorities of such work.

2.2.5 An approach of small scale experiments proposed for TREC-6 - Eero Sormunen

The group concentrated on two topics:

One of the problems of TREC is that it is strongly oriented to text and traditional database applications. It does not seem to meet well the challenges of multimedia and open network environments (like the Internet and especially the WWW). Some critical comments were also expressed on TREC's limited contribution to IR system design work. One of the problems is that the test topics are quite atypical. On the positive side, some new models and ideas have been introduced for the evaluation of interactive IR systems.

The group suggests that, for TREC-6, the theoretical and empirical issues of scalability (e.g. sampling and estimation) should be taken into more careful consideration. The focus of evaluations should also be changed. One possibility is to design experiments for fixed search engines, because major advances are not likely to take place on the engine side. An example experiment would be to focus on interface issues (e.g. visualisation of document spaces, query input technologies). The test collections should also be extended to cover hypertext documents (e.g. from the WWW).

The general approach of TREC, to run identical or closely similar experiments at all partner sites, was seen as a restrictive and mechanical model for co-operation. A suggestion was made that TREC-6 should adopt a common goal under which a family of small scale experiments could be designed. According to this model, each small scale experiment could be conducted by a set of (say, from three to six) research groups. As a result, a larger set of hypotheses could be tested without losing the possibility of comparing the results from parallel experiments.

3: Evaluation of Hypertext/Hypermedia Information Retrieval


Co-chaired by Maristella Agosti and Mark Dunlop.

Short presentations by Steve Draper, Jonathan Furner and Massimo Melucci.

Full presentation by Annelise Pejtersen.

Working group reports by Nick Belkin, Pia Borlund, Marion Crehange, Kai Grossjohann and David Hendry.


3.1 Presentation Reports

3.1.1 The evaluation of hypermedia IR systems: a statement of the problems - Jonathan Furner

Basic assumptions

In this discussion paper, I attempt to provide a brief statement of the problems involved in evaluating hypermedia systems. Before going on to consider IR systems in general, and hypermedia IR systems in particular, I examine a few of the basic assumptions that are commonly made in this context, and a few of the issues that are raised if those assumptions are accepted.

Evaluating hypermedia systems

1 Evaluation is for systems

The first of these assumptions is that evaluation is for systems, or in other words that it is systems that we should be evaluating.

One conventional model of a computer-based system identifies three principal elements:

Such a model makes explicit the principal function of the system, which is to assist or support the human user in their performance of some task or series of tasks. We could define a task, perhaps, as any activity or action that involves the manipulation of certain concepts or objects and that a person performs in order to attain a particular desirable goal or purpose -- that goal typically being to bring about a change in some state of affairs, problem space or domain, a change that has a specific result. The kind of support or assistance that a system provides to a user consists in enabling the user to perform certain of their tasks with greater ease, efficiency and effectiveness, and thus in facilitating the successful attainment of their goals.

One obvious result of accepting the first assumption is that, whatever type of system we are considering, we should not limit our evaluation to that of the operation of the mechanism. But equally, nor should we limit evaluation to that of the operation of the interface -- even though the importance of studying the interaction between user and computer has only been recognised relatively recently. Evaluation of the operation, the activity, of all three elements of the system will be important components of the evaluation of the whole.

2 Systems are for evaluation

The second assumption is that systems are for evaluation, or in other words that the research activity that we should be engaged in is one of evaluation.

Users might profitably undertake evaluation, so that they can compare one system with another, so that they can find out which is more appropriate for their needs. From the point of view of the system developer, the reason for conducting evaluation is to obtain some indication of the kind of changes that might have to be made to the system if its value or worth is to improve.

Evaluation, then, is essentially the activity of determining how well something does what it is supposed to do, or in other words measuring the quality of its performance of its function. So in the context of computer-based systems, whose function is to support their users' performance of certain goal-oriented tasks, evaluation is concerned with determining the level of success with which systems enable their users to achieve their goals. This is a significant point, and one to which I shall return.

3 Each system its evaluation

The third assumption might be summed up in the phrase `each system its evaluation'. This rather glib statement has two separate implications.

The first of these is that there exist systems of different types, supportive of different tasks, and intended to facilitate the attainment of different goals.

One fundamental characteristic that varies from system to system is degree of interactivity: in other words, the continuity of the control that the user is allowed to exercise over decision-making in the course of the interaction process. For instance, there are a number of ways, differing in level of interactivity, in which a user may select an option from the range of options that are available to them at any time and then communicate this selection to a computer. Allowing the user to directly manipulate visual symbols on-screen, to modify actions or strategies in the light of the ongoing responses of the computer, to enjoy constant access to user-selectable options enabling parameters to be altered and forms of display to be modified -- these are all marks of a highly interactive system.

The second implication of this statement is that there exist different methods of evaluation, and that certain of these are considered to be appropriate for application to systems of different types. At least, this has historically been the case, because the particular sorts of activity carried out in the name of evaluation have generally proceeded in accordance with particular conceptions of the intended function of individual systems and the goals of their users. A statement of the function of a system, couched in terms of the way in which it is intended to assist its users in performing certain tasks and therefore in attaining certain goals, is often used as the principal criterion on which evaluation is deemed to be based.

Once a decision has been made as to the criterion or criteria on whose basis evaluation is to proceed, decisions of three further kinds need to be taken, which follow to a greater or lesser extent from the initial selection of criterion. The first of these is the identification of appropriate measures to be used in the evaluation process; the second is the design of the data-collection method to be used in the observation or computation of the values of these measures; and the third is the decision as to whether data should be collected in a real-life, operational setting, or in the context of an experiment in which certain variables may be subject to some degree of control.

Measures

An important subclass of activity involved in evaluation is that of measurement. Measurement involves establishing, for the purposes of comparison between objects, the position, on a quantitative scale, of the value of some directly observable attribute of the object under consideration.

Unfortunately, however, it is often considered that the level of success with which a system enables its users to achieve their goals is not an attribute whose values may be directly observed. The need, therefore, is to determine what other attributes there are whose values (a) are directly observable and (b) are sufficiently strongly correlated with that level of success to be used as indicators or measures of it.

Data collection

The nature of the measures that are selected will obviously have a bearing on the nature of the method used to collect the raw data that is to be used in the calculation of the values of the measures. There are, of course, many different methods of data collection, each geared to providing data either of an objective or of a subjective nature and that lends itself either to quantitative or qualitative analysis.

Setting

Any combination of these methods may be implemented either in an operational setting, or in the context of a laboratory experiment in which certain variables may be controlled, and which in those respects does not reflect the complexity of real-world experience.

Conclusions

So, we can point to a few general conclusions that may be drawn if we accept these basic assumptions. These are: that whatever type of system we're concerned with, before we can begin to evaluate it, we need to:

The evaluation of IR systems

Function

Our research community is concerned with the design and evaluation of IR systems. So what is the function of IR systems, what is it that they are supposed to do? IR systems are designed and implemented in order to support the user in the performance of tasks that involve information-seeking activity. In engaging in such activity, the goal of the user is to bring about a change in a particular state of affairs, or situation, that is perceived by the user to be problematic, with the specific result that this problematic situation is resolved. It is implicitly assumed by the designer and the user of any IR system that at least part of the user's problem consists in their anomalous state of knowledge (ASK) in a particular respect or domain, and that such a state may be resolved through the acquisition of certain information. The user thus has a perceived need for information, and undertakes the task of identifying and retrieving information with the goal of satisfying that need, and resolving their anomalous state of knowledge.

But simply carrying out the task of identifying those documents that satisfy such a need for information: that's only part of the wider problem faced by the person in an ASK. Other aspects of the problem actually result from the user's decision to use assistance in their performance of this task. In order to call on the support of a computerised IR system, for example, the user needs to be able to figure out how to express or articulate their problem so that it may be communicated to the retrieval mechanism -- which may itself be difficult, especially if the user is unfamiliar with the terminology of the domain of their information need. And even before the user is in a position to specify their information need clearly, they may well need to learn more about the characteristics of the original problematic situation in which they find themselves, or indeed to come to the initial perception that such a problem exists. These cognitive tasks are also ones that an IR system might be expected to support, and it is the success with which the system allows the user to achieve their goals in the performance of multiple tasks that should be the criterion on whose basis the performance of the system should be judged or evaluated.

Criteria for evaluation

So, as evaluators, we need to consider:

but all of these criteria together, and more generally

Yet, the traditional model of the IR system lingers on, with

Two further assumptions here are

As a result, it is commonly determined that the system may be evaluated in terms of the relevance of the responses of the retrieval mechanism to single, static and specific queries -- i.e., in terms of the similarity between the retrieval set and an ideal set.

In real-world systems, however,

Other assumptions are:

The result is that a call is regularly made for such systems to be evaluated in terms of criteria other than the relevance of the documents in any individual retrieval set.

Measures

Discussion of the criteria that may be used in the evaluation of IR systems brings us on to the separate issue of which measures may be used as quantitative indicators of such criteria.

The standard measures of retrieval effectiveness are those of recall and precision, which are usually characterised as being system-centred; in terms of the terminology adopted here, they're mechanism-centred, in that they are used simply to quantify the degree of similarity between an ideal set of actually-relevant documents and a set of potentially-relevant documents retrieved through the operation of the retrieval mechanism. These measures were introduced at a time when the traditional model of IR was a more truthful representation of reality, and when systems were, for instance, far less supportive of a high level of interaction between user and mechanism. For a long time, however, there has been a need for complementary measures that take into account the changes in the design of IR systems that have taken place since.
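As an illustration of the mechanism-centred measures just described, the following minimal sketch computes recall and precision for a single retrieval set compared against an ideal set. The function name `recall_precision` and the document identifiers are hypothetical, chosen only for this example.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one retrieval set against an ideal set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical data: four actually-relevant documents, five retrieved ones.
ideal_set = {"d1", "d2", "d3", "d4"}
retrieval_set = {"d1", "d2", "d7", "d8", "d9"}

recall, precision = recall_precision(retrieval_set, ideal_set)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Both values depend only on the overlap between the two sets, which is why they say nothing about the interaction that led to the retrieval set.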

Measures of search efficiency are also well established, and typically relate (i) some conception of the utility, usefulness, worth or value of a search, to (ii) the time, cost or effort expended, or to the length of the search measured in terms of the number of commands issued by the user or the number of documents viewed by the user.

Measures of the usability of a user interface, or the ease and appropriateness with which a user may perform the tasks, communicate the requirements and display the responses that they wish to, are again well-known, and are commonly based on the speed with which a user performs a specific, often quite low-level task, or alternatively on the speed with which they learn or come to understand how to perform such a task.

Measures of user satisfaction seem, however, to be less widely used in the IR community, although they have a distinguished history in, amongst other fields, library science. Users may be asked to give their own individual, subjective views on their satisfaction with various specific aspects of the information-seeking process, including the completeness or the precision of searches, or to give a judgement as to the overall success of the operation of the system.

One interesting point to be made about measures of user satisfaction is that their values are often, perplexingly, not related to the values of other measures. Several studies exist, for example, of end-users' searches in CD-ROM databases, which show users often to be highly satisfied with the results of their searches, even though the effectiveness of those same searches as measured in terms of recall and precision is low. Furthermore, it might not immediately be apparent to the designer what the causes of such paradoxes are, and for this reason it is especially important that measurement of user satisfaction should be supported by the elicitation of qualitative data using questionnaire-based methods.

The implication is that, to obtain a full, user-centred picture of the value of a system, it is necessary

In turn, these guidelines have obvious implications for the selection of the methods to be used in the collection of data enabling the values of particular measures to be observed or computed.

The evaluation of hypermedia IR systems

Hypermedia IR systems are IR systems of a particular type, and pose particular problems for evaluators.

Structure

The traditional model of a hypermedia IR system, just like that of a conventional IR system, includes, amongst other elements:

In terms of structural elements, the basic difference between a conventional IR system and a hypermedia IR system lies in the explicitness with which the relationships between documents are represented.

In a conventional IR system, records are stored, and considered by the retrieval mechanism, independently of one another. This might be viewed as an inappropriate simplification of reality, since one document may be related to another in any of a number of different ways. But these relationships between documents are represented only implicitly, through the use of similar sets of terms in the indexing of documents. Such relationships could be identified and acted upon by the user only if a facility were made available to calculate values of document-document similarity on the basis of the co-occurrence of index terms in each pair of records.
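A facility of the kind mentioned above could be sketched as follows. The sketch assumes that each record is represented by a set of index terms and uses the Dice coefficient as the similarity measure; the choice of coefficient, the function name `dice` and the sample records are assumptions made for illustration, not part of any system described here.

```python
from itertools import combinations


def dice(terms_a, terms_b):
    """Dice coefficient between two sets of index terms (0.0 if both are empty)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical records, each indexed by a small set of terms.
records = {
    "doc1": {"hypertext", "browsing", "evaluation"},
    "doc2": {"hypertext", "links", "evaluation"},
    "doc3": {"recall", "precision"},
}

# Similarity for every pair of records, keyed by the (sorted) pair of identifiers.
similarity = {
    (p, q): dice(records[p], records[q])
    for p, q in combinations(sorted(records), 2)
}

for pair, value in similarity.items():
    print(pair, round(value, 2))
```

Pairs that share no index terms score zero, so the implicit relationships only become visible where the indexing vocabulary overlaps.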

In a hypermedia IR system, some of the relationships that exist between documents are explicitly represented and stored in the form of a network of links, which are simply ordered pairs of origin and target nodes, with the intention that, in the course of their information-seeking activity, users may exploit the additional information captured in this explicit representation of document-document relationships.

Function

In fact, it is through the implementation of links that a hypermedia system supports a particular kind of information-seeking activity or behaviour known as browsing. Instead of requiring the user to specify a query, which is then to be matched against every record in the database, the system allows the user to request specific, single nodes to be retrieved and displayed successively, by activating the link between a currently-displayed node and some target node. The level of interactivity of the retrieval process is thus particularly high.
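The idea of links as ordered pairs of origin and target nodes, and of browsing as the successive activation of links, can be sketched as below. The node names, the `browse` function and the representation of a user's choices as list positions are all hypothetical, intended only to make the structure concrete.

```python
from collections import defaultdict

# Each link is an ordered pair (origin node, target node); names are hypothetical.
links = [("n1", "n2"), ("n1", "n3"), ("n2", "n4"), ("n3", "n4")]

# Index the links by origin so the targets reachable from a node can be listed.
outgoing = defaultdict(list)
for origin, target in links:
    outgoing[origin].append(target)


def browse(start, choices):
    """Follow one link per step; each choice is the position of the link
    to activate among those leaving the currently displayed node."""
    path = [start]
    current = start
    for choice in choices:
        targets = outgoing.get(current, [])
        if not 0 <= choice < len(targets):
            break  # no such link: the browsing session stops here
        current = targets[choice]
        path.append(current)
    return path


path = browse("n1", [1, 0])
print(path)
```

No query is ever formulated in this sketch: the only input from the user is the sequence of links they choose to activate.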

Different implementations of hypertext systems vary greatly in the types of link that they support, and thus in their function. Links vary in several dimensions.

Current research issues

A few of the issues that are currently the concern of developers of hypermedia IR systems are:

The evaluation problem

In hypermedia IR systems, it is clear that the user is the primary actor, exercising a high degree of control over successive stages of an information-seeking process, and engaging in high-level interaction with the retrieval mechanism on the basis of information needs that the system does not require to be expressed in the form of clearly-defined queries. One conclusion that may be drawn, then, is that such systems should be evaluated in terms of mechanism-based criteria such as retrieval effectiveness.

Some of the problems faced by prospective evaluators of hypermedia IR systems, then, are as follows:

The urgency of the situation is fuelled by two immediate needs, for the design and evaluation of systems:


3.1.2 Experiences from the Irides project - Massimo Melucci

Evaluation of the prototype for the wide area dissemination of the on line version of "The Computer Journal" (http://www.dcs.gla.ac.uk/~mark3/Irides/prototype)

Irides was concerned with the design and implementation of a prototype offering the user sophisticated searching and browsing facilities. Through these facilities the user can retrieve relevant articles of "The Computer Journal" by combining query-based retrieval and hypertext browsing.

Evaluation was the final work package of the Irides project. The evaluation step was carried out by GMD-IPSI, Darmstadt. The main goal of evaluation was described by the project proposal as follows: "We will evaluate the goodness of the multimedia authoring of the hyper-structure and of the technique for metering the access for charging, preserving copyrights, and articles delivery". This brief communication aims to give some highlights of the evaluation of the automatically constructed hypertext employed to browse the articles.

Tests were carried out with a small number of users due to time constraints. The test users were six computer scientists, who were asked to interact with the system and concentrate on the design of the hypertext. After the users had been given, and had solved, some retrieval tasks, the following results were obtained:

- the browsing tool implemented by the ACM classification scheme anticipates the users' needs

- the network of index terms can be used in a complementary way to make the retrieval more precise

- the network of similar articles can be employed to make recall higher

- the interface is still a crucial part of such a system: users need more information on the hypertext, both to make retrieval more effective and to lower the risk of disorientation



3.2 EMPIRICAL WORK PLACE EVALUATION OF COMPLEX SYSTEMS[*]

Annelise Mark Pejtersen

Risø National Laboratory, Postbox 49, Denmark

Tel: 45-46775149 Fax: 45-46755170 Email: amp@risoe.dk

Keywords: Cognitive framework; empirical evaluation experiments; library system

Abstract: For a comprehensive experimental evaluation of a computer system, it is necessary to define a suitable evaluation sequence consisting of a set of boundary conditions to be able to evaluate more encompassing features of the system. This paper suggests a framework for identification of boundary conditions to be used in empirical evaluation experiments. The framework is briefly described and empirical evaluation at the boundary of the actual work environment is demonstrated by evaluation of a library system.

Introduction

A systematic evaluation of complex systems should be well structured and performed at several well defined levels of user-work place interaction. At each of these levels, evaluation should be performed either analytically or empirically, or both approaches should be applied. Issues related to the contents of the information and the functionality of the system can be evaluated analytically, while issues related to its form involve context, user experience and preferences, and therefore, very likely will need an empirical approach. An analytical evaluation depends on a structured comparison of work requirements as defined by a work analysis with the design specifications. In contrast, empirical evaluation involves experimental tests of the performance of a system with reference to design objectives, or with reference to its actual performance in a laboratory with test users, or its actual performance in the ultimate context of use in a real work place.

Framework for laboratory and field evaluation

Empirical Evaluation Approaches. For empirical evaluation experiments, it is necessary to establish an experimental work situation that creates a well defined boundary around the subject, and to study whether subjects' responses to this boundary lead to the mode of behaviour which was assumed as the design basis. For proper integration of the results, such experimental evaluation scenarios should be compatible with the structure of the work analysis underlying design as well as with the design specifications. To structure an empirical evaluation of the match between a new design and the work domain including the characteristics of its users, we need a comprehensive framework for description of work systems, see figure 1.

Figure 1. The figure demonstrates how different evaluation questions can and should be asked at the various levels described in the framework for work analysis. In addition, it is shown that a different ordering of the evaluation questions should be considered for an analytical and for an empirical approach to evaluation.

Both analytical and empirical evaluation should be considered for all the levels of figure 1, but the sequence of the levels considered will be different for an analytical and an empirical approach. For analytical evaluation of design objectives, a natural approach will be top-down from global system properties to detailed task functions. For empirical evaluation, a path from details to global features will be the best approach. Complex experiments involving, e.g., entire task situations will be meaningless if the system does not match user characteristics at the elementary level. This would be the case, for example, if experiments were planned to evaluate the functionality of a prototype in advance of a test of the interface readability. The various boundaries of evaluative experiments are summarised below with reference to figure 1. These boundaries "move" the context successively further from the actor to encompass more and more of the total work content in some kind of increasingly complete simulation and field evaluation.

1. Match with Users' Resources and Characteristics. This level addresses the sensory-motor characteristics as well as perceptual and cognitive resources. Evaluation of perceptual and cognitive resources will focus on the size of letters, the readability of the typography and the graphics of the displays, display composition, consistency, coherence, use of colours, icons, WYSIWYG interfaces etc. This level also addresses evaluation principles that are important for the understandability of the information flow in the communication between the system and the user.

2. Support of Task Strategies and Mental Models. At this level, the following questions are addressed: Does the system support several task strategies and can the user shift goals and tasks concurrently without losing support from the system? Does the system provide the mental representations of novices and experts, and is the user's mental model of the work domain supported by the interface - also during distributed decision making?

3. Support of Cognitive Decisions and Processes. A basic question to be asked at this level is: Does the system effectively support the cognitive decisions that have to be made during task performance? Does the system support the actor's decision making: are exploration, situation analysis, goal evaluation and planning supported for familiar as well as less familiar situations?

4. Support of Relevant Task Situations. The question is here whether the system supports the entire task repertoire - are the tools adequate, their functionality sufficient and does the information cover the complete work task space? Is its capacity adequate? Experiments may serve to evaluate whether information is available about the basic concepts of the system and its overall architecture. Is it possible to navigate among tasks, and to pursue several, different task related goals?

5. Adequate Representation of Work Environment. Evaluation experiments here will investigate the relationship between the actual use of the system and users' intellectual and emotional style and their personal problem solving habits in the total work place context. For this boundary, the evaluation must be based on actual work scenarios generated from an actual work analysis, and the aim is not a task simulation but a work place simulation and must include not only the system's effect on a complex work place situation.

6. Field Evaluation in Actual Work Environment. Evaluation in the actual work context will address the question: Does the system match organisational policies and employees' acceptance and development? What is its impact on the work context and the quality of the work situation? Does the system support several coherent work task activities and the co-operative co-ordination of activities among several users, maybe in different departments of the organisation, and does it support interaction and co-ordination with institutions outside the organisation? In other words, it will answer the question whether the design approach and the assumed work organisation match the performance criteria and preferences of the users. Will the system be used? And do the users like to use the system, and do they actually use it over a longer time span in the daily task situations for which it was designed, and to the degree that was expected?

Example of field evaluation in actual work environment

Evaluation in the actual work environment at boundary 6 is illustrated by a selection of evaluation tests from the evaluation of a full scale library system. The system design was based on a work analysis and then tested in laboratory experiments and evaluated at the work place within the framework boundaries of figure 1. Extensive experimental validations in the laboratory were performed to ascertain that the system could meet the work requirements before the evaluation of the system took place in the actual work context in a library. The subsequent evaluation of its use by the general public was thus an attempt to validate whether or not the system actually is the right design for supporting actual library users. The test took place over six months in a public library in order to evaluate whether the information system: 1) could be accessed and was accepted by the general public and professional librarians; 2) could provide the books asked for to the users' satisfaction; 3) would impact the library work in a way that was satisfying to the public and the professionals, and cost/effective to the organisation.

Users' resources and value criteria. One of the tests conducted to pursue the first goal was an evaluation of the iconic interface at boundaries 1 and 2. The efficiency of use, the comprehensibility of icons and the subjective user satisfaction were evaluated at the work place in a full scale prototype system by 1030 users, who responded to on-line questionnaires which appeared automatically on the screen after the user ended his/her session with the system. The questionnaire adapted to the individual user's navigation trajectory and displayed those icons which the user had met at the interface and actually employed during a search. It contained questions about the understandability of icons, which were used both as action buttons and as a means to express the topics contained in books. Fifteen different icons used as action buttons were displayed together with a textual list of action possibilities, and users were then asked to select the action that would match the icon. The associative relationship between the message of the icons and the contents of the books in the database was measured on a scale that expressed the users' perception of degree of match. Finally, users' subjective satisfaction with an icon based interface was evaluated relative to a similar text based interface. The result of the quantitative test at the work place was then tested qualitatively by 75 observations and interviews with library users after they had used the system.

Task strategies. Field studies of task strategies before the design showed that several different strategies were employed, such as analytical search by attributes, search by analogy and similarities with previous examples, browsing strategies etc. At the work place, 7100 on-line logs of all dialogue events (mouse clicks, etc.) tracked the users' strategy choice, and 220 questionnaires gave answers about their reason for choice of strategy, its ease of use, their strategy preference etc. During use of the system over a longer time span, the analytical strategy became the most popular strategy: users and librarians adapted their strategy choice to the most effective strategy in the new environment. Field studies before the system was introduced showed that the analytical strategies were very rarely used in a library due to their high demands on knowledge, time, and memory resources etc.

Decision task. The second goal was pursued by an evaluation at boundary 3 of users' subjective satisfaction with the books they had retrieved from the database by use of the classification scheme. The classification scheme used had been developed from extensive field studies, and now the support of this classification scheme, its keywords and book descriptions employed for retrieval of relevant books was evaluated from structured questionnaires by 120 end users based on their reading of books. The most important performance measure was the precision of retrieved books based on users' comparison of the database classification of book contents with their own estimation of the book content and its relevance in a use situation.

Work space. Another type of experiment was used to pursue the third goal and aimed at an evaluation of the impact of a new retrieval system on user behaviour and preferences, on the means and ends required and the impact on the total work situation. Professional intermediaries working with a new computer system in information retrieval and cultural mediation tasks reported in questionnaires and focus group interviews at boundary 4 how the system changed their roles and left more resources for co-operation and a thorough dialogue with the users. The system supported their cultural mediation strategies, and allowed a shift to the role of a consultant analysing task problems, evaluating the quality of alternative proposals, and assisting in the choice of solutions. Secondly, they reported how important it was for the professional image and pleasure of use that errors did not occur, as the system supported exploration of alternatives, and no error messages occurred.

Organisational work context. The possible positive or negative impact on quality of work and the system's potential deterioration of professional skills during changes in role allocation among users and librarians were evaluated at boundaries 5 and 6. The question was whether the new system would lead to a simplistic interpretation of users' needs, an impoverishment of their reading experiences and, as well, an impoverishment of the librarian's domain knowledge. A computer logging of librarians' and users' use of the system was implemented and combined with focus group interviews with the staff and user groups. Both types of data were compared with records of librarians' and users' book descriptions from earlier field studies, to make sure that the database information exceeded in number and breadth their book knowledge. This was done to make sure that both users and librarians through the use of the system would increase their competence and knowledge about the document collection.

The impact on cost/effectiveness was measured by the increase in number and distribution in loan of high quality books, as the ultimate institutional goal for public libraries is to promote education and cultural values. A more even distribution of book loans means more effective use of the book stock, which has economic implications for a library's costs for book acquisition.
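How evenly loans are spread over the book stock can be quantified in several ways. The sketch below uses the Gini coefficient, which is an assumption made here for illustration and not necessarily the measure used in the study; the loan counts are invented. A value of 0 means that every title is borrowed equally often, and larger values mean that loans are concentrated on fewer titles.

```python
def gini(loans):
    """Gini coefficient of a list of per-title loan counts (0.0 = perfectly even)."""
    values = sorted(loans)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the sorted counts, with ranks starting at 1.
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical loan counts for four titles, before and after a new system.
before = [0, 0, 0, 40]
after = [10, 10, 10, 10]

print(gini(before), gini(after))
```

Comparing the coefficient before and after the introduction of a system would give one rough indication of whether the stock is being used more evenly, alongside the change in the total number of loans.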

Conclusion

The framework suggested here has been developed to be used concurrently for analysis of user-work interaction in system design and evaluation both in the laboratory and in field studies of real work environments (Pejtersen and Rasmussen, 1997; Rasmussen, Pejtersen and Goodstein, 1994; Pejtersen, 1994, 1993, 1992, 1991; Goodstein and Pejtersen, 1989). Empirical evaluation poses special problems with respect to validity and generalization of results. Very often, evaluation questions jump among several levels of analysis with very different requirements for control of the boundary conditions. This makes it very difficult to generalise the evaluation results to other, similar work domains and support systems. To enable generalization and the transfer of findings among actual work analyses and different experimental designs, a consistent framework is necessary.

References

PEJTERSEN, A. M. and RASMUSSEN, J. (1997): Effectiveness testing of complex systems. In: Handbook of Human factors and Ergonomics. Ed. by G. Salvendy, Wiley. In Press.

PEJTERSEN, A. M. (1996): Empirical work place evaluation of complex systems. In: Advances in Applied Ergonomics. Proceedings of the 1st International Conference on Applied Ergonomics. (ICAE'96), Istanbul, Turkey, May, 21-24, 1996. (Eds.): Ozek, Ahmet F. and Salvendy, G. USA Publishing Corporation.

RASMUSSEN, J., PEJTERSEN, A. M. and GOODSTEIN, L. P. (1994). Cognitive systems engineering. (John Wiley, London).

PEJTERSEN, A.M. (1994): A Framework for Indexing and Representation of Information based on Work Domain Analysis: A Fiction Classification Example. In: Knowledge Organisation and Quality Management. Proceedings of Third International ISKO Conference, Copenhagen, 20-24 June 1994. Eds: Albrechtsen, H. and Ørnager, S. Indeks Verlag. Frankfurt. 1994. pp. 251-264.

PEJTERSEN, A. M. (1993): Designing Hypermedia Representations from Work Domain properties. In: Hypermedia. Proceedings der Internationalen Hypermedia Konferenz. (eds): Frei, H.P. and Schauble, P. Springer Verlag. Heidelberg.

PEJTERSEN, A. M. (1992). The Book House. An icon based database system for fiction retrieval in public libraries. In: The Marketing of Library and Information Services 2. Ed: Cronin, B., (ASLIB, London). pp. 572-591.

PEJTERSEN, A. M., (1992). New model for multimedia interfaces to online public access catalogues. The Electronic Library, the International Journal for Minicomputer, Microcomputer and Software Applications in Libraries. Vol. 10, No. 6.

PEJTERSEN, A. M. (1991): Interfaces based on Associative Semantics for Browsing in Information retrieval. Risø M-2794. p.143.

GOODSTEIN, L.P. and PEJTERSEN, A.M. (1989): The Book House. System Functionality and Evaluation. Risø National Laboratory, Risø-M-2793.

Appendix

With some overheads from the workshop.

Framework for work domain analysis and evaluation

In modern work, stable work procedures are not the norm. Many tasks are discretionary. Explicit consideration of goals and constraints and exploration of the boundaries of acceptable performance are often required to optimise effectiveness. For this reason, the object of modelling can no longer be the "task," but must include all the features of the work environment, and the interpretation of these features by the actors. The interaction of work environment and actors' resource constraints creates the task ad hoc. A wide variety of options is found with respect to when and how to approach a given task. Therefore, to understand why a particular piece of behaviour is preferred instead of another possible pattern, we have to understand how the action alternatives in a particular situation are eliminated so that one unique sequence of behaviour can manifest itself. Only then can we hope to predict how a new tool will change a present work practice and the users' interaction with a new system design.

In other words, we have to identify the constraints or boundaries within a work environment that shape the behaviour of users together with the subjective performance criteria they apply to optimise performance within the remaining action possibilities.

Levels of Work Analysis

A framework for analysis and evaluation must serve to represent the characteristics of both the physical work environment and the "situational" interpretation of this environment by the actors involved, depending on their physical, perceptual properties, and their skills, strategies and values. The analysis of the work domain activity includes the identification of behaviour-shaping constraints

Work Domain, Means-Ends Space

This dimension represents the landscape within which the work takes place and it serves to make explicit its goals, constraints, and productive resources. The representation presents an inventory of system elements and it is, in the short perspective, independent of particular situations and tasks. It identifies the functional elements and their means-ends relations or, in other words, the productive resources which are available for the actors to 'design' their local activity. The analysis is structured at several levels of functional abstraction and, in this way, includes representations of physical configuration and anatomy, of physical work processes, of general functions, of priority measures, and, finally, of system goals and constraints with reference to the environment.

An analysis within this dimension of the framework will identify the structure and general content of the global knowledge base of the work system which must be considered for design of work support systems.

Activity Analysis: Task Situation in Domain Terms

This dimension instantiates that subset of the basic means-ends network which is relevant for a particular task. Analysis should not be made in terms of work procedures but in terms of the objectives, functions and resources active in prototypical work situations, and the related information requirements. A set of such prototypical work situations can be used in various combinations to characterise a set of task situations to be considered for information system design.

Activity Analysis: Task Situation in Decision Terms

For the next dimension of analysis, a shift in representational language is made. For each of the activities defined, the relevant tasks are identified in terms of decision making functions, such as situation analysis, goal evaluation, planning, or actual execution. This representation breaks down work activities into subroutines which can be related to the cognitive activities of the involved people and which serve to identify the cognitive tasks that are the targets for support systems. The information gained in this analysis will identify the knowledge items from the work domain representation which are relevant in a particular situation. In addition, it assists in identifying the queries which are likely to be made by decision makers for retrieving information.

Activity Analysis in terms of Mental Strategies

A further analysis of the decision task requires another shift in language in order to be able to compare task requirements with the cognitive resources and subjective preferences of the individual actors. For this purpose, the mental strategies which can be used for each of the decision functions are identified by detailed analyses of the actual work performance (e.g., by protocol analysis). Each strategy is based on a particular kind of mental model, a set of tactical rules and a related mode of interpretation of observations. The characteristics of the various strategies are identified with reference to subjective performance criteria such as time needed, cognitive strain, amount of information required, cost of failure, etc. Knowledge about the available effective strategies is important for two purposes: 1) the design of the user-system dialogue and navigational paths to the database content, and 2) the user interface design, because it supplies the designer with several coherent sets of mental models, data formats, and tactical rule sets which can be used by actors with varying expertise and competence and, therefore, should be supported by several different, corresponding interface representations.

Analysis of Users' Cognitive Resources and Values

At this stage, the action possibilities in work performance of the individual have been delimited through an identification of the work-dependent behaviour-shaping constraints down to the level of mental strategies which can be employed for the decision functions allocated to each individual actor. In order to judge which strategy will actually be used, the resource requirements of the various strategies have to be compared to the cognitive resource profiles of the actors. Therefore, this perspective of analysis is focused on the background of the relevant user category and on the level of expertise and the performance criteria of the individual actors.

Analysis of Social Organisation, Role Allocation and Co-ordination of Work

In order to identify the actors actually involved in the prototypical task situations, it is necessary to find the principles and criteria governing the allocation of roles among the groups and individuals involved. This allocation of roles to actors is governed by the social organisation and management structure and is dynamically dependent upon the circumstances and criteria such as actor competency, access to information, minimising the communication needed for co-ordination, sharing of work load, and complying with regulations (e.g., union agreements).

A work analysis will not proceed as an orderly top-down progression through these perspectives as described above, starting with the work domain analysis and finishing with the users' cognitive resources and value criteria. The broader context of the entire work environment will be activated during the analysis of both task activity and user characteristics. In particular, the analysis of division and co-ordination of work and social organisation will be closely related to the analysis of the work domain and the task situation, and frequent iterations among the perspectives will be necessary.

Framework for analysis of human-work interaction mediated by computer

Framework for system evaluation

1. Define Boundaries

2. Generalization and transfer of results

Evaluation boundaries in experiments

Book house evaluation experiments

Some examples of laboratory and work place evaluations

Evaluation goals

* Compatibility with human sensory and anthropomorphic characteristics

* Understandability:

* Effectiveness:

Can system be used?

* Acceptability

Will the system be used?

Some evaluation problems/questions: Boundaries, objectives, measures and methods

1. What is to be evaluated: a product, a concept, a partial solution, a prototype with surface levels of the total functionality of the system, or a prototype with full functionality of only a part of the system?

2. Comparative: Should several system solutions be chosen for evaluation? Evaluations can be comparative if several systems are to be checked for a differential result to support choice among design alternatives.

3. Absolute: testing whether a single system will be able to or does in fact achieve a given goal and level of performance.

4. Boundaries: What are the (categories of) situations to be evaluated?

5. What constitutes an unambiguous definition of goals and objectives which can be transferred to the evaluation level (what is the evaluation supposed to establish)?

6. How will performance be defined and how will it be measured? What will be the link between evaluation objectives and measurable performance variables?

7. What are the effects of the intermediate variables (training, experience, task, environment, etc.)?

8. Who will participate in the evaluation? Real end users, test subjects, design team members, colleagues? In iterative design, a distinction should be made between use of subjects from the work place in a work situation, test users in a laboratory, and the testing done among the design team members and colleagues in the project group.

9. Where to perform the evaluation? In a laboratory or at the users' work place?

10. What evaluation data should be collected? Subjective user/expert judgements, qualitative

11. Objective, quantitative measures and objective performance criteria

12. What quantitative data should be collected? Quantitative measurements can be performed as objective measures or as subjective measures

13. What quantifiable performance measures are relevant? Quantifiable measurements may include time to do a task, error rate, number of features actually used, number of features never used

14. What methods to use to capture data? Synchronised audio recording and videotaping, questionnaires, interviews, logging of observational data, of the actual use of the varied functionality of a product. Automatic data logging

15. What methods to select for data analysis and data encoding to obtain reliability? What statistical methods and what data integration method for coding, sampling and analysis?

16. What methods to choose for qualitative analysis of case studies?

17. What methods to select for presentation of results for customers or test subjects? Will a summary of videotapes be effective?
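The quantifiable measures named in question 13 (time to do a task, error rate, number of features actually used and never used) could be derived from automatically logged data along the following lines. This is only a minimal sketch: the `TaskRecord` log format, the `summarise` function and the feature names are assumptions made for illustration and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """One logged task attempt (hypothetical log format)."""
    seconds: float       # time taken to do the task
    errors: int          # errors observed during the task
    features_used: set   # names of the system features invoked


def summarise(records, all_features):
    """Compute mean task time, errors per task and feature usage."""
    n = len(records)
    # Union of every feature that appears anywhere in the log.
    used = set().union(*(r.features_used for r in records))
    return {
        "mean_seconds": sum(r.seconds for r in records) / n if n else 0.0,
        "errors_per_task": sum(r.errors for r in records) / n if n else 0.0,
        "features_used": len(used & set(all_features)),
        "features_never_used": sorted(set(all_features) - used),
    }


# Illustrative data only; not results from any real evaluation.
log = [
    TaskRecord(120.0, 2, {"search", "browse"}),
    TaskRecord(80.0, 0, {"search"}),
]
summary = summarise(log, ["search", "browse", "save", "print"])
```

Such objective measures cover only part of the list above; subjective judgements and qualitative data would need to be collected separately.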



3.3 Working group reports

3.3.1 WWW search engine evaluation (1) - Nicholas Belkin

This was the working group on Evaluation of WWW Information Systems, which worked on the Scenario of the company which wanted to sell more of its internet computers by writing, and making available for free, systems which took greater advantage of the hypertextual structure of the Web than existing systems.

With respect to the general issue of evaluation of WWW Search Engines, the Working Group made the following points:

  1. Prior to any evaluation of WWW Search Engines and Systems, it will be necessary to do serious empirical studies which will identify the tasks, goals, and behaviours of WWW searchers, and to characterise the resources which they search.
  2. This in general seems to be a classic evaluation problem, in that, before we can decide on appropriate evaluation methods and measures, we need first to understand what the appropriate criteria are which the measures are supposed to reflect. There is no obvious reason to believe that current IR or HCI evaluation measures and methods are appropriate to the WWW situation.

With respect to the Scenario, we believe that this is also a classic case of an evaluation problem, namely, identifying the overall goal of the system, and then finding criteria and measures appropriate to that goal.

We tried to address this issue by considering various levels of evaluation.

Given that the overall goal of the system that the company intended to implement was to convince people to buy the computers that the company manufactures, the top level criterion is sales, and so the most appropriate evaluation measure for the system is whether, and how much, sales increased after introduction of the system.

Although obvious and realistic, we also see that this measure is somewhat facile. Therefore, a somewhat deeper level of analysis leads to attractiveness as the criterion for evaluation, based on the idea that if the system is attractive to users, they will want to buy the computers that the company makes. One candidate measure for this criterion is reuses of the system by individual users.

Attractiveness and reuse are perhaps better than sales, in that they might be applicable before the system is fully implemented, but they suffer from some measure of unrealism, with respect to the overall goal. We therefore suggested, as a next level of evaluation, the criterion of preference (of the proposed system to any other available alternatives), measured by willingness to pay for the use of the system.

Our overall approach to evaluation in this scenario, as mentioned above, was based on the assumption that one should always condition evaluation measures and methods to the goal of the system which is being evaluated. This led us to some quite non-traditional potential measures for evaluation of the proposed system. However, if the goal were changed, so that, for instance, the company was interested in designing and producing a "good" system, or in understanding how a system could be "good", or in knowing whether their ideas about the value of taking more account of the hypertextual structure of the Internet were "correct" or "useful", then the criteria that we would have suggested, and the associated measures, would have been quite different, and perhaps more like those which we normally associate with IR and HCI evaluation.


3.3.2 Benchmark testing hypermedia search engines - Pia Borlund

Initially, we had a broad discussion about the concept of hypermedia, problems concerning indexing of different types of media (images and sound) and ways of doing it: manual, automatic and semi-automatic.

It was confirmed that the discussion was supposed to consider evaluation at the engine level, not evaluation of the interface design of hypermedia systems.

The evaluation of hypermedia systems, in general, was discussed with reference to the talk by the invited guest speaker: Annelise Mark Pejtersen, Risoe, Denmark. By applying different evaluation methods for different aspects, parts and media within the hypermedia system, combined with the application of several methods to the same part of the system, one may benefit from comparative analysis of overlapping results.

The big question is: how to conduct the evaluation.

It was suggested that because of the subjective nature of the content of hypertext/hypermedia systems, a possible method of evaluating the system performance would be to apply work task situations, in the way this approach was presented by Pia Borlund and Peter Ingwersen at the first Mira Workshop in Glasgow, May 1996. This approach was discussed.

In order to focus on the evaluational aspects with reference to something more concrete, we turned towards the proposed scenario (a scenario about a local radio broadcast station, which searches various sources of information, such as images, sounds and text in different languages, to be used in news stories in the six o'clock news).

We never managed to come up with any finalised conclusions about how to conduct an optimal evaluation of hypermedia systems -- we even failed, when it came to separating the engine level from the interface level.


3.3.3 Building a hypermedia test collection - Marion Crehange

This group first tried to work on the scenario:

Very soon two questions arose:

  1. To what extent do we have to think about the future use of the test collection by Le Monde, rather than only about the test collection itself?
  2. What exactly comprises a HyperText (HT) test collection?

Question 1 results in ... nearly giving up the scenario. Question 2 is the starting point and the kernel of the following discussion.

Let us begin with the conclusion of this discussion: the group thinks that a test collection need not include a predefined set of accesses or a predefined set of evaluation schemes, but that it must include good tools for these tasks so that experiments can be documented and retrieved.

Another major point: in our view, building a HT test collection must not be affected by the objectives of benchmarking or comparative experiments.

The discussion may be described according to four aspects:

  1. Data - The collection must include tools for building micro-worlds, for instance courseware, and, more generally, to build various syntheses. It must also be capable of evolving. The database implementing the test collection must be very well structured. In particular:
  2. Toolset and interface - The test collection must include several types of tools. Access tools must be very friendly but also must offer to the user some built-in standard interfaces and some platforms. Access must include possibility of multi-lingual access and will include:
  3. Users may be
  4. Tasks - The HT test collection may be used for testing :

The general conclusion is that this discussion was pretty rich but would have been more constructive and reusable if we had referred to the models displayed by previous speakers.


3.3.4 Testing automatically constructed hypertexts - Kai Grossjohann

Because most of the participants of the working group didn't know much about the automatic construction of hypermedia (hypertext), Maristella first explained the main issues involved in automatically creating a hypertext. Some of the issues are of a more technical nature, such as how to store the data structure on disk in a DBMS. Other issues are of a more conceptual nature. We focused on the following issues: which types of links the hypertext contains (examples of link types are structural links according to the section/subsection hierarchy, citation links from a bibliography entry to the cited document, and similarity links from a node to semantically similar nodes); which types of nodes it contains (examples of nodes are documents or parts of documents, index terms, and concepts; the ACM classification schema is a good example of a graph where the nodes are concepts); whether it contains only intra-document links or inter-document links as well; how to visualise the structure of the hypertext (and other user interface issues); media types other than text; and how to update a hyperbase. We also mentioned the question of efficiency.

We then talked about the impact of these issues on evaluation. For example, with regard to the types of links, one could do user studies of how far a particular type of link is useful. This can be evaluated by comparing a system with that link type against a system without links of that type, or by evaluating the usage frequency of links of that type. A special case occurs if one includes manually constructed links: the performance of the automatic link creation can thus be assessed. Similar things can be evaluated with regard to the types of nodes. For instance, if the system provides different kinds of nodes and allows the users to navigate both between nodes of the same type and between nodes of dissimilar types (e.g. by associating a document with a concept), it could be tested whether a certain type of node facilitates better navigation or is confusing to the user.

With regard to the user interface issues, one could compare different ways of visualising the graph structure of the hyperbase. Also, it would be possible to test how much difference it makes to include a search facility as opposed to a browsing-only interface. An interesting question would be how to include continuity features (as mentioned by Steve Draper in his position statement) and how much that helps the users. Another issue with regard to visualisation is whether or not the weights assigned to links by the system should be displayed to the user.

Concerning updates to the hyperbase, we thought of two main things to evaluate. One question is of course efficiency. (Does the system need to rebuild the whole graph when a document is added?) Secondly, one could imagine systems which perform a quick and dirty insertion operation, after which the resulting hyperbase is incorrect in some sense, and then rebuild the hyperbase only once a week, say. For this kind of system it would be interesting to investigate the trade-off between correctness after insertion operations and the speed of the inserts. As there are some systems which build the graph of the hyperbase on the fly, at the time of posing a query, it is important to ascertain the efficiency of answering a query and the speed of link traversal.
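Two of the measurements discussed above, the usage frequency of each link type and the comparison of automatically created links with manually constructed ones, can be sketched as follows. The log format, the function names and the choice of precision and recall (with the manual links treated as the reference set) are assumptions made for this illustration; the group did not fix a particular measure.

```python
from collections import Counter


def usage_by_link_type(traversals):
    """Share of all traversals accounted for by each link type.

    `traversals` is a hypothetical log of (link_type, source, target).
    """
    counts = Counter(link_type for link_type, _src, _dst in traversals)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def compare_with_manual(automatic_links, manual_links):
    """Precision and recall of automatic links against manual links."""
    auto, manual = set(automatic_links), set(manual_links)
    agreed = auto & manual
    precision = len(agreed) / len(auto) if auto else 0.0
    recall = len(agreed) / len(manual) if manual else 0.0
    return precision, recall


# Illustrative data only.
log = [
    ("structural", "d1", "d1.2"),
    ("citation", "d1", "d7"),
    ("similarity", "d1", "d3"),
    ("structural", "d1.2", "d1.3"),
]
shares = usage_by_link_type(log)
precision, recall = compare_with_manual(
    automatic_links=[("d1", "d3"), ("d1", "d4")],
    manual_links=[("d1", "d3"), ("d2", "d5")],
)
```

Raw usage shares say nothing about how often each link type was on offer, so a fuller study would probably normalise by the number of links of each type shown to the user.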

Lastly, there is the question of different media types. As none of the participants of the working group was aware of the existence of any system that can cope with non-textual data in a non-trivial way, we took the easy way out and declared that there is nothing to evaluate, yet. One would have to at least have a look at one system that does this to learn about things to evaluate.


3.3.5 WWW search engine evaluation (2) - David Hendry

How does one evaluate a WWW search engine? We quickly realised that in order to even consider this question we would first have to think about the setting in which the search engine would be used. To do this we proposed the following question: "If you were a travel agent, how would a WWW search engine help your business?" To further focus discussion, we then considered Annelise Mark Pejtersen's model of design. The model raises five issues:

  1. Does presentation match sensory characteristics?
  2. Are all relevant strategies supported?
  3. Does the system support relevant decision tasks?
  4. Does system support task repertoire of a work situation?
  5. Does system support co-operative work co-ordination?

We found that these questions helped us direct discussion about the travel agent scenario, but we did not progress sufficiently to propose a definite plan.

We believe that Pejtersen's model might be used as a basis for formulating such a plan because it sets guidelines for the collection of quantitative and qualitative data. Perhaps more importantly, it could be used as a reference model for considering the merits of any evaluation methodology.

We were impressed by two features of the Pejtersen framework. First, it treats the system in its context of use, rather than restricting evaluation to individual users; and second, it encourages an iterative reflection on different aspects of the system, so that thoughts about 'relevant strategies' are related to 'co-operative work co-ordination', for example. Without such a framework, it is all too easy to concentrate on a few aspects to evaluate, possibly leading to developing systems that are easy to use but can only support tasks that are irrelevant to the real context.

A further good point about the Pejtersen framework for designing evaluation is that it reminds us to think about both directions of reasoning in evaluation: (A) from the widest context towards its consequences for user interface detail, and (B) from what happens in the details of a user's actions and errors out to wider features. She calls (A) "analytical evaluation" and (B) "empirical evaluation". (B) is to do with validation e.g. you run an experiment, and take surface measures of time and errors which you assume can stand as measures of whole tasks and functions and the utility of the whole device. (A) is to do with verification: with whether the implementation does actually derive from and satisfy the wide requirements, including implicit requirements for it to work in its context. For evaluation, this is in part to do with "illuminative evaluation": with open ended observation in the work place that can spot whether some issue has been missed in the design (and so is not only a problem, but will not be allowed for in the (B) type experiments).




4: New directions in the evaluation methodology.


Session co-chaired by, and short presentations from, Peter Ingwersen and David Harper.

Full presentation by Giorgio Brajnik, Stefano Mizzaro and Carlo Tasso.

Position paper from David Harper and David Hendry.

Working group reports by Giorgio Brajnik, Thomas Green and Stefano Mizzaro.

Closing paper by Marion Crehange.



4.1 Evaluating User Interfaces to Information Retrieval Systems: A Case Study on User Support (*)

Giorgio Brajnik, Stefano Mizzaro, Carlo Tasso

{giorgio|mizzaro|tasso}@dimi.uniud.it Dipartimento di Matematica e Informatica, University of Udine Via delle Scienze, 206 Loc. Rizzi - 33100 Udine - ITALY

Abstract Designing good user interfaces to information retrieval systems is a complex activity. The design space is large and evaluation methodologies that go beyond the classical precision and recall figures are not well established. In this paper we present an evaluation of an intelligent interface that covers also the user-system interaction and measures user's satisfaction. More specifically, we describe an experiment that evaluates: (i) the added value of the semi-automatic query reformulation implemented in a prototype system; (ii) the importance of technical, terminological, and strategic supports and (iii) the best way to provide them. The interpretation of results leads to guidelines for the design of user interfaces to information retrieval systems and to some observations on the evaluation issue.

(*) paper presented at the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, Zurich, CH, August 18-22, 1996, pp.128-136.



Position Paper

4.2 Evaluation Light - David Harper and David Hendry

Abstract

We propose rules of thumb, designed to guide the evaluation of interactive information retrieval systems. These rules are intended to assist the IR community in shedding maximum light on the effectiveness of IR "components" while, at the same time, minimising the effort required to do so. By "component", we mean a focal concept, which the evaluator wishes to better understand, such as retrieval model, computational technique, interface dialog, visual display, or overall model for interaction. These rules can be thought of as a rallying cry for thinking about the utility of an IR evaluation.

Introduction

During an `evaluation', we want to shed maximum light on effectiveness, yet do so economically. Thus, the title of this position paper, `Evaluation Light', is a play on two concepts `illumination' --- brightness rather than darkness --- and `economy' --- lightweight rather than heavyweight.

We present a set of rules of thumb we have devised to guide the evaluation of interactive information retrieval systems. We believe these rules can assist the IR community in shedding maximum light on the effectiveness of IR "components" while minimising the effort required to do so. We use the term "component" to stand for such things as retrieval models, computational techniques, interface dialogs, visual displays, and so on. It is important to note that in any particular evaluation, some components, or combination of components, will assume a prominence over others.

The list of rules is very much in the draft stage, and this position paper has been specifically written to provoke discussion on what constitutes useful evaluation. We are well aware of the importance of contextualising these rules with the work of others in HCI and IR, but have yet to do so. In our own work, these rules have assisted in the development of a light-weight plan for the evaluation of a system designed for the retrieval of pictures on the basis of spatial indexing features. (This is the second scenario presented in the Appendix.)

The Rules of Thumb

Don't do feel-good testing, do feel-right testing
In other words, don't do evaluation because it is expected of you, or because you want the test users to say "what a lovely system". Rather, do it because you want to gain insight into the component being tested. For example, too often we see Precision-Recall graphs presented, where components should be evaluated in their own terms.

Principled design drives principled evaluation
If you have a set of claims as to why your "component", in particular, should improve an information retrieval task, it is much easier to decide how to evaluate it. This idea underlies the development approach where the manuals for the user interface are written before the interface is prototyped.

Make claims/hypotheses for your system
If your system is designed with some goals, objectives or claims clearly stated, then you should be in a better position to test the system against these. Thus, it is important to state what the research or development goals are.

Break down a large hypothesis or design conjecture into discrete, smaller hypotheses or conjectures
All too often what one would like to learn is stated at too high a level. Thus, it is important to discover smaller aspects of the problem and to propose focused questions about these aspects. We believe that this approach will lead to greater insight.

Evaluate using small, controlled, tasks directed at evaluating a given claim or hypothesis
Focused trials are more likely to lead to specific insights. There is plenty of evidence to suggest that small trials, properly run, can be at least as effective as big ill-focused trials.

Know your intended users, and brief test users accordingly
This is important if the test users are not necessarily the same as the intended users, or if the evaluation is being done "at a distance".

Inform test users of the system's intended user model through the right cover story
The problem here is to decide what skills and point of view you would like test users to have. It seems that instructions can make a large difference on how test users approach an interface and thus what can be learned from a test.

Minimise testing, maximise analysis of results
Conducting experiments, and particularly those involving human subjects, can be very expensive. Moreover, analysing the results from such experiments can itself be very expensive, and particularly if the analysis is unfocused. Thus, it is important to plan your experiments and analysis in order to discover the minimal amount of data required to test a conjecture or hypothesis. Naturally, we would advocate collecting as much data as possible from experiments; you or others may be able to re-use this data for other purposes.

Consider micro-evaluation rather than macro-evaluation
Micro-evaluation looks at comparisons between individuals (users, searches, etc.) rather than averaging performance across individuals. Potentially, micro-evaluation may result in greater insights concerning the claims for your system. There is a perceptible move away from macro-evaluation (e.g. Precision-Recall graphs) in IR towards micro-evaluation. Micro-evaluation will certainly require quite sophisticated data extraction and analysis tools, and perhaps we in the IR community should be designing and building these to support our experimentation.

Use other people's results where possible
You should have a plan for how your data could be shared with, and used by, other experimenters. Also, you should consider how you might reuse someone else's data.

Have contingency/risk plan
If a major hypothesis is invalidated, it may be desirable to abandon/re-direct the research, design or development effort. This plan can be based on the set of conjectures/hypotheses you develop.
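The contrast drawn under "Consider micro-evaluation rather than macro-evaluation" can be illustrated with a small sketch. It assumes a hypothetical per-query effectiveness score for two systems; the function names and all scores are invented for illustration and are not results from any experiment.

```python
from statistics import mean


def macro_view(scores_a, scores_b):
    """Macro-evaluation: average each system's scores across queries."""
    return mean(scores_a.values()), mean(scores_b.values())


def micro_view(scores_a, scores_b):
    """Micro-evaluation: per-query difference (B minus A)."""
    return {q: scores_b[q] - scores_a[q] for q in scores_a if q in scores_b}


# Hypothetical per-query scores for two systems.
system_a = {"q1": 0.5, "q2": 0.5, "q3": 0.5, "q4": 0.5}
system_b = {"q1": 0.75, "q2": 0.75, "q3": 0.5, "q4": 0.0}

avg_a, avg_b = macro_view(system_a, system_b)
diffs = micro_view(system_a, system_b)
improved = sorted(q for q, d in diffs.items() if d > 0)
worsened = sorted(q for q, d in diffs.items() if d < 0)
```

In this constructed example the two averages are identical, so the macro view reports no difference, while the per-query view shows that system B helps on two queries and fails badly on one. Individual differences of this kind are what the rule suggests looking at.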

Conclusion

On reflection, the "Evaluation Light" rules seem self-evident. However, we think that by reminding ourselves of these "self-evident" rules, we may be able to design IR experiments which require less effort and which lead to greater insight into the component being tested. We would welcome suggestions for improving both the rules themselves, and the presentation of these rules.

Appendix

These two scenarios are intended to prompt you to think about how the Evaluation Light rules could be used in practice.

Scenario A

The cognitive dimensions framework includes a dimension called "role expressiveness", which captures the extent to which an application (or interface) reveals causal relationships between the various data/artefacts being manipulated by the user.

We conjecture that if the causal relationship(s) between a query and the documents retrieved in response to that query were revealed to a user, then the user would be able to judge the relevance of a retrieved document more accurately. This is becoming increasingly important in situations where documents take an appreciable time to fetch, load and indeed read, e.g. full-text WWW documents, and where typically some surrogate of the document, e.g. title, is displayed in the list of retrieved results. (Incidentally, it may support users who prefer to identify likely relevant documents first before reading the documents in detail.)

Examples of the kinds of information that might be displayed to indicate query-document role expressiveness in a list of retrieved results are:

Using the Evaluation Light guidelines, design a set of experiments to test the above design conjecture.

Scenario B

We have designed an indexing and retrieval model for images. In the present work the images are photographs of landscape scenes in Scotland. The photographs have been indexed manually: people have identified salient objects, drawn a rectangle around each of them, and labelled each object with a keyword. The test collection consists of approximately 800 photographs.

A query comprises a set of spatial features (rectangular region plus label) and/or a set of text features (keyword which will be matched with text associated with the image). Let us concentrate on the use of the spatial indexing for querying. We suppose that a user builds up a "picture" of the image they wish to retrieve using labelled rectangles. This query is then matched against the indexed images, and a ranked list of matching images is presented to the user.
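To make the matching step concrete, here is a minimal sketch of one way a query of labelled rectangles might be scored against indexed images. The scenario does not say how matching is done, so the overlap measure (intersection over union of same-labelled rectangles), the data layout and all names below are assumptions made purely for illustration.

```python
def overlap(a, b):
    """Intersection-over-union of two rectangles (x0, y0, x1, y1)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0  # rectangles do not overlap
    inter = w * h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def score(query, image_regions):
    """Sum, over query features, of the best same-label overlap."""
    total = 0.0
    for label, rect in query:
        same_label = [overlap(rect, r) for l, r in image_regions if l == label]
        total += max(same_label, default=0.0)
    return total


def rank(query, collection):
    """Image names ordered by descending score, ties broken by name."""
    scored = [(score(query, regions), name) for name, regions in collection.items()]
    return [name for _s, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical query and three indexed photographs.
query = [("loch", (0, 5, 10, 10)), ("mountain", (0, 0, 10, 5))]
collection = {
    "photo1": [("loch", (0, 5, 10, 10)), ("mountain", (0, 0, 10, 5))],
    "photo2": [("loch", (0, 0, 10, 5))],
    "photo3": [("mountain", (0, 0, 10, 10))],
}
ranking = rank(query, collection)
```

Here `photo1` matches both features exactly, `photo3` matches the mountain only partially, and `photo2` has a loch in the wrong place, which gives a simple case for thinking about the user groups and tasks for which position matters.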

A major (and incomplete) design conjecture underlying this work is: "For some user groups, for some tasks, spatial indexing will result in more effective retrieval of photographs/images". Think about the user groups and tasks for which spatial indexing and retrieval might be useful. Then, consider how the Evaluation Light rules could be used to home in on a set of simple, but revealing, experiments.



4.3 Working group reports

4.3.1 Evaluation frameworks and methods - Giorgio Brajnik

A necessary step towards progress is the clarification and definition of evaluation frameworks and methods, leading to models of the IR system and of user interaction with the system within the task domain, as well as to effective cost/benefit ratios.

FRAMEWORKS

A framework can help identify theories, methods and models appropriate for analysis of a given problem domain and data about it. A framework differs from theories, methods and models in the sense that it cannot be falsified or verified as they can. A framework can, for example, be evaluated in terms of criteria such as its completeness with respect to a) the problem domain that it should cover and b) the theories, methods and models addressing the problem domain.

The following aspects should be studied/explored:

Crucial issues will be:

METHODS

One thing that is needed is a map from aspects/goals of evaluation to methods for data acquisition and analysis. A framework like the framework for user-work domain interaction presented by Annelise can for example be used to identify a set of methods that are appropriate alone or in combination for different evaluation goals. For example,

In addition, task scenarios envisaged within frameworks can help identify appropriate performance measures.

It may be useful to 1) look into other scientific disciplines for other theories and methods, such as experimental and cognitive psychology, sociology, HCI and usability inspection methods, human factors and ergonomics, cognitive systems engineering, hermeneutic approaches like discourse analysis and ethnographic studies, design in context/situated action/design studies, semiotics and natural decision making theories and studies, and 2) adopt more advanced technology such as evaluation work benches using multimedia recording techniques.


4.3.2 A test-drive for Evaluation Light - Thomas Green

Group 5 addressed David Harper's Scenario A: "design empirical investigations of whether users would benefit from knowing how terms in their query related to the documents retrieved". We treated the exercise mainly as an evaluation of Harper's Evaluation Light framework, which was precirculated and which should be available at the WWW site. Our approach was to move through the headings of the Evaluation Light framework in what we hoped was a reasonable sequence, pausing to include other design steps that we felt were needed. We shall put the existing Evaluation Light steps in bold and the additions in bold italic red.

We initially assumed that the scenario was supposed to describe standard IR, but there were two views of how the user might benefit. In View 1, the user is expected to look at the set of retrievals and then to reformulate a refined request; the improvement would lie in a better reformulation of the request. In View 2, the user is supposed to look at the set of surrogates for retrievals and then to fetch some of them for full examination; the improvement is supposed to lie in being better able to separate relevant from non-relevant documents.

The difference in our understandings was not initially clear, leading to some confusion. We started with the Evaluation Light question: Has it been done before? Not to our knowledge. (Later, someone who claimed to be a fly on the wall, although some of us sometimes wondered if he was totally off the wall, told us that there was a study by Veerasamy and Belkin, pub. 1996, with some relevance.) Next we asked for intuitive or anecdotal examples and counter-examples, a step strongly recommended by the HCI-wise members of the group. Here the difference between the views became initially apparent, as Steve Draper (who had interpreted the scenario according to View 1) argued that in modern full-text IR the user was if anything better off not even thinking about how the terms retrieved the documents, and should instead just cut and paste large relevant sections from the first set of retrievals and use those to form the refined query, leaving all the workings to the magical black box inside.

Our mixed interpretations led us into some difficulties at that stage and still more at the following stage, when we moved on to break the hypothesis down. Harper's idea was to avoid evaluating the whole IR process, but instead to isolate and test smaller hypotheses in reduced situations. For our scenario, we assumed the real-world process included the steps of

  1. formulate a query
  2. do retrieval
  3. inspect the surrogates returned
  4. choose some subset of those surrogates and view their documents
  5. start again with a refined query ....

Those of us who had assumed View 2 argued that step 2 could be omitted from the evaluation, and instead we should concentrate on step 4, the winnowing of sheep from goats.

Later we performed what should be the first step in the revised Evaluation Light framework: Grill the originator of the request over tea until you find out what he or she REALLY means. After Steve had trapped David Harper in a corner, David revealed that the scenario was meant to describe searching the web using an engine like AltaVista. Experience had revealed that when, say, ten terms were put in, intending to narrow the search, the resulting documents included some that were retrieved by what David called 'bizarre subsets' of the terms. Harper's original hypothesis was that surrogates should additionally be tagged with the terms that matched them, so that the user could by inspection dismiss surrogates with bizarre subsets: e.g. if the retrieval query was "elephants and image" then any surrogate with only "and image" but no "elephant" could be dismissed. This scenario lies somewhere between views 1 and 2; titles of web pages usually reveal their utter irrelevance, so the problem of discerning relevance from surrogates is less important than step 5, refining the query.
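The tagging idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two toy documents, the whitespace tokenisation, and the rule that a surrogate is dismissable when its matched terms miss a term the user considers essential. It is not a description of how AltaVista or any other engine worked.

```python
def matched_terms(query_terms, document_text):
    """Return the query terms that occur in the document, in query order."""
    words = set(document_text.lower().split())
    return [term for term in query_terms if term in words]


def tag_surrogates(query_terms, documents):
    """Pair each title (the surrogate) with the query terms that matched its text."""
    return [(title, matched_terms(query_terms, text))
            for title, text in documents.items()]


def dismissable(tagged, required_terms):
    """Titles whose matched terms miss at least one term the user regards as essential."""
    required = set(required_terms)
    return [title for title, terms in tagged if not required <= set(terms)]


# Hypothetical query and collection, following the "elephants and image" example.
query = ["elephants", "and", "image"]
documents = {
    "Elephants of Kenya": "an image gallery of elephants and their calves",
    "Image processing notes": "filters and image transforms",
}

tagged = tag_surrogates(query, documents)
bizarre = dismissable(tagged, required_terms=["elephants"])

for title, terms in tagged:
    print(f"{title}: matched {terms}")
print("Could be dismissed:", bizarre)
```

In this sketch the user, not the system, decides which terms are essential; the display only makes the matching subset visible.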

Next in the Evaluation Light framework came finding a claim. The HCI-versed members were strongly in favour of this step but nevertheless had difficulty at first. What claim was there to find? So we moved on to consider possible manipulations of the task. (In retrospect, I believe this step is what Evaluation Light calls try gedanken experiments.) We listed several possible manipulations, including TileBars. Given our new understanding of the scenario, the manipulation we liked best was to present a histogram of how many documents had been retrieved by each combination of terms. (Possibly a bar of the histogram could then be queried to discover just which documents that combination had produced.) Users could then note that the combination of, say, terms 1 and 8 had fetched a disproportionate number of titles, and perhaps decide on reflection that it relied on a different sense of one of the words; they could refine the request accordingly.
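A rough sketch of such a histogram follows. The query, the five toy documents and the text rendering are all assumptions for illustration; a real display would be graphical and would work from the engine's own matching information.

```python
from collections import Counter


def combination_histogram(query_terms, documents):
    """Count how many documents were matched by each combination of query terms."""
    counts = Counter()
    for text in documents:
        words = set(text.lower().split())
        combination = tuple(term for term in query_terms if term in words)
        if combination:  # ignore documents that match no query term at all
            counts[combination] += 1
    return counts


def render(counts):
    """Draw one text bar per term combination, largest first."""
    lines = []
    for combination, n in counts.most_common():
        label = " + ".join(combination)
        lines.append(f"{label:<26} {'#' * n} ({n})")
    return "\n".join(lines)


# Hypothetical query in which one word ("jaguar") has two senses.
query = ["jaguar", "speed", "engine"]
documents = [
    "jaguar top speed engine review",
    "jaguar engine repair",
    "the jaguar hunts at speed",
    "jaguar speed in the rainforest",
    "engine oil",
]

counts = combination_histogram(query, documents)
print(render(counts))
```

Here the combination "jaguar + speed" collects the documents about the animal, which is the kind of pattern a user might notice and then use to refine the request.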

We therefore decided to evaluate a design in which the usual list of titles was accompanied by a histogram like this:

[Figure: histogram of the number of documents retrieved by each combination of query terms]

Our claim for testing would be that reformulation would be improved. Now it was time to consider the experimental design. The obvious design would be to compare performance on the usual AltaVista display against performance with the enhanced display, but that is not an adequate design because performance with extra information is always likely to be at least as good as baseline performance, so random deviations will support the experimenter's hypothesis. Instead, our proposed design would include a third condition, performance with only the histogram, giving a design with three conditions: list of titles; histogram; titles plus histogram.
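As an illustration only, subjects could be allocated to the three conditions at random and in equal numbers. The sketch below assumes a between-subjects design, which the group did not specify, and the subject labels and the seed are hypothetical.

```python
import random

# The three display conditions named in the proposed design.
CONDITIONS = ["titles", "histogram", "titles+histogram"]


def assign(subjects, seed=0):
    """Shuffle the subjects, then deal them round-robin into the three conditions."""
    rng = random.Random(seed)  # fixed seed so that the allocation can be reproduced
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {condition: [] for condition in CONDITIONS}
    for i, subject in enumerate(shuffled):
        groups[CONDITIONS[i % len(CONDITIONS)]].append(subject)
    return groups


subjects = [f"S{i}" for i in range(1, 13)]  # twelve hypothetical subjects
groups = assign(subjects)
for condition, members in groups.items():
    print(condition, members)
```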

Further Evaluation Light steps included consider generalizability (would it work with multimedia? why not, we said) and consider intended users. Our proposed investigation would include the usual suspects, i.e. information specialists in one group, naive users in another.

It was agreed by all that before starting with real subjects, the experiment should be debugged by running think-aloud pilot subjects at an early design stage. Part of the point is to do some open-ended observation on the system and task before committing to an experiment. This isn't so much piloting a controlled experiment as investigating to see if there are new issues to address that might make the experiment beside the point.

Finally, we would add an explicit principle about seeking to demonstrate possibilities. Although it is of interest to know what typical users do on average, it is also important to know what is and is not possible at all with a system: to know the fastest typing speed on a keyboard as well as the average, or to know that a document cannot possibly be retrieved by a particular system as well as that the average user fails to find it. Discovering what is possible is a scientific enterprise in itself, but one seldom mentioned by experimentalists.

Conclusion

We believed that Evaluation Light was an excellent design process. Its headings were good discussion tools and it encouraged us to look for probing, investigative evaluations, rather than the style of 'build the whole system and see what happens'. Our suggested additions merely fill the gaps; they do not change the framework.

4.3.3 Scenario based relevance - Stefano Mizzaro

The group worked on two main issues: the scenarios presented by Peter, and the existence of different kinds of relevance.

1. The scenarios presented by Peter.

The intention was to discuss the normalisations over groups of test persons, engines, use of simulated needs and real needs, and so on, in interactive performance experiments where the foremost entity for statistical validity is the needs/requests applied by users - not the users themselves.

Naturally, if the experiment is to observe the use and behaviour of real-life users applying two systems (or two interfaces, or two methods of representation), the requirement is, for instance, 125 users, each with their own need. However, if the goal is to measure the performance of two systems with users who may or may not have real needs, the number and nature of the needs become crucial, more so than the number of persons in the groups testing the systems. This is the rationale behind the scenario and the idea of discussing the number of test queries, their "openness", relevance assessments, etc. We do not need 125 x 2 = 250 users but far fewer (as also shown by the Udine Team).

Even so, a statistically significant experiment needs many users and a very large effort. This has almost never been done in the past, and now seems the right time to do it.

2. The existence of different kinds of relevance.

The relevance measured in TREC is a "low" (referring to the "Stefanology of relevance") one, while the relevance that other people (and Peter in his scenarios) are suggesting/trying to measure is a "higher" (nearer to the user) one, both in terms of (i) not only topic, but also task, and (ii) relevance to the information need, not to the request.

The group agreed that we need some method for evaluating a relevance near to the user, and the discussions seemed to confirm the "Relevance indetermination phenomenon" (the more we try to measure a relevance near to the user, the less we can measure it), although not everybody in the group agreed with it. Further discussion is needed on this issue.

Finally, we also briefly discussed the use of Tague's "Informativeness Measure" and its scenario, which applies properties of the task's end product as entities for performance assessment (e.g. the references actually used by searchers in their final articles, compared with those retrieved by the systems).


4.4 Short Paper About Models and Abstraction - Marion Crehange

Why this contribution?

During the MIRA meeting in Padova, several very interesting models were presented. But apart from remarks such as "we need a meta-model" (Thomas Green), "it might be desirable to be able to characterise (or classify) tasks, domains, and users, according to some agreed framework, so that at least we might be able to compare the results of experiments" (David Harper), etc., the group discussions seem to me :

and so :

In fact, I think that referring to common model(s) is the best (perhaps the only) way for each of us to :

The models presented

Against the background of classical models, at least 5 contributions dealt with models during the meeting :

I will not go into the details of the models here ; first I try to form a synthetic view.

Key words to distribute

One good exercise would be to try to distribute the various delivered keywords or keyword patterns into the levels of -a- and to structure them in an intra-level and then an inter-level way.

The keywords might be : user, documents, surrogates, requests, answers, mechanism, user goals, evaluation function, evaluation goal, data collection model, system response, transaction logs, user questioning, think-aloud protocols, effectiveness, efficiency, ease of interaction, usability of interface, user satisfaction, experimental settings, information, real information need, perceived information need, expressed information need, query, topic, task, context, comprehensibility, novelty, choice of which relevance you need, "which relevance to judge ?", "which relevance judgement ?", money to earn, time, activity analysis, information overhead, "getting lost", viscosity, hidden dependencies, visibility and juxtaposibility, progressive evaluation, role expressiveness, consistency, choice of what relevance you need, micro-world, extra-information provided, mental image, etc.

General considerations : abstraction

Annelise Mark Pejtersen said that one design challenge for system evaluation consists in identifying stable structures (another being : constraining actions / possibilities). Thomas Green and Steve Draper highlighted the importance of abstraction.

I think that the kernel of IR problems lies in abstraction and the level of abstraction.

When each of us speaks of abstraction, we mean different facets of abstraction, according to who we are, which object is concerned at that moment, which level of meta is currently concerned,... It would be useful to try to be precise on this point as often as possible. In particular, when we describe and use models, we necessarily make abstractions. The problem is to find the adequate level.

I will not go into details here. But I will look at one kind of abstraction : abstraction from the operational viewpoint. And I suggest one somewhat personal approach to it.

1) Abstraction in Hypertext IR System constituents

Let us imagine an (H)IRS session in progress.

Each constituent (object or action) taking part in this IRS session may be seen as the embodiment of a hierarchy of abstractions (or perhaps several hierarchies ; but let us ignore that). For example : the keyword "Stradivarius violin" may be the visible element of a hierarchy including, as we go up in degree of abstraction, violin, string instrument, instrument, music, art, ... (Here, we could introduce the notion of characteristic level to express the upper level where a mental image arises in the user's mind.) This is true for surrogates, each surrogate constituent, request, matching, relevance judgement, etc., and also, with great force, for elements of images.

At each step of the session, according to the state of the user, of the user's need, of the whole context, etc., we may conceive what the appropriate current level in the hierarchy is for each constituent ; let us call it its present abstraction point (PAP). For instance, for the keyword "violin", it may sometimes be "string instrument" and at other times "music". Another example : the action of comparing 2 keywords may be considered as the visible element of a hierarchy including : "compare the keywords themselves", "compare their PAPs", etc. So one may define a PAP for this action at every stage of the session.

2) Different uses of the hierarchies of abstractions

If you wish to compare a request word RW with a surrogate word SW, you can locate each of them in a common hierarchy, and then the comparison will pass through what we might call their youngest common ancestor (YCA) or, more precisely, the Youngest Common Ancestor of their respective PAPs.
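A minimal sketch of finding a YCA, assuming the common hierarchy is a simple tree stored as child-to-parent links. The entries "cello", "flute" and "wind instrument" are hypothetical additions to the violin example, and an element is treated here as its own youngest ancestor.

```python
# Child -> parent links of a small, partly hypothetical hierarchy of abstractions.
PARENT = {
    "Stradivarius violin": "violin",
    "violin": "string instrument",
    "cello": "string instrument",
    "string instrument": "instrument",
    "flute": "wind instrument",
    "wind instrument": "instrument",
    "instrument": "music",
    "music": "art",
}


def path_to_root(node):
    """List the node followed by its successive abstractions, up to the top."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def youngest_common_ancestor(pap_a, pap_b):
    """Return the lowest element shared by both upward paths, or None if there is none."""
    seen = set(path_to_root(pap_a))
    for node in path_to_root(pap_b):
        if node in seen:
            return node
    return None


print(youngest_common_ancestor("Stradivarius violin", "cello"))
print(youngest_common_ancestor("violin", "flute"))
```

The arguments are meant to be the PAPs of the request word and the surrogate word, so the same pair of words may yield a different YCA at different stages of a session.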

In other respects each action of the IRS may be considered in terms of the PAP of each of the involved elements (objects or actions). And we may modulate how we climb in the hierarchy from each element's PAP in order to adapt to the context, to a micro-world, to the result of a previous action trial, to an evaluation goal, etc.

Steve Draper raised the problem of being lost after a jump in a hypertext. One way of helping the user may be to show the YCA between the PAP of the former position and the PAP of the new one (this is possible only if the system is able to find a common hierarchy of abstractions).

Moreover, Thomas Green's remarks on a good way of representing music are linked to finding a good illustration, a good visual (or multimedia) surrogate for a PAP.

From another point of view, each defect in Thomas Green's Cognitive Dimensions framework may be viewed as the manifestation of some defect in the degree of abstraction adopted for at least one constituent.

From still another point of view, when we want to make comparisons between different experiments, what is comparable probably consists of the PAPs of the different problem constituents.

This could easily be connected with the proposals of Annelise Pejtersen, Stefano Mizzaro and probably other MIRAns, and would perhaps help in applying David Harper's Rules of Thumb (e.g. for role expressiveness).

Finally, fixing some environment for a (H)IR session may probably be represented by keeping in mind (or in the system) a certain permanence in the underlying abstraction level (which may be modulated), while the particular abstraction levels of the different objects may change rapidly.

Conclusions

In conclusion I think that abstraction aspects may (and must ?) be the pivot of evaluation work. In particular, working on models, and in terms of models, is necessary.

Final note : this is based only on what was presented at the meeting in Padova (and what I understood of it). Obviously, it must be complemented and made more precise in the light of existing work (e.g. Ingwersen, Croft, Chiaramella, .......).


University of Glasgow- - - - - -Universita degli Studi di Padova