The relationship between IR functionality and HCI issues has been the focus of several projects based on the Okapi system. The investigations have been concerned with the evaluation of automatic and interactive query expansion facilities implemented in different interface environments, including a VT100 character-based interface and two GUIs, one with overlapping windows and another with multiple panes. Operational field trials have raised three major questions regarding factors which contribute to the quality of interaction. Firstly, to what extent should the underlying functionality of the system, i.e. the relevance feedback and query expansion process, be made visible? Secondly, what is the cognitive loading if the user is required to select terms for query expansion? Thirdly, what is the resulting distribution of control between the user and the system?
The retrieval task can be viewed as two distinct subtasks: one concerned with query construction and the other with viewing or browsing results. Interface features need to cater for each.
This session is about HCI and IR. Given the tradition of systematic and controlled experimentation in IR, I think the important lessons from HCI evaluation for IR are mainly about the importance of discovering surprises. In particular, users surprise you when you look at what they actually do as opposed to what you expect or what they might tell you if you interview them rather than observe them in action. First of all, we can't be sure of what the user's goals (information needs) are unless we investigate them too. For instance, in HCI an easy mistake to make is to assume that every error message indicates a user error. However, sometimes users are systematically experimenting with the machine, and using the error messages to get information. Similarly, it may be that some user sessions are not trying to retrieve any particular documents or information about the domain, but are instead looking at what kind of thing the collection holds or what kind of search request is effective. Secondly, you can't always trust what users say, or what you expect, about how users go about a task. In one project I supervised, my student tested a system not only on the children it was intended for but also on the programmers who had designed it. The latter insisted that allowing Boolean operators as well as free text queries was important, but in fact in doing the set tasks they did not use any of the operators they had insisted on implementing. Thirdly, features of the user interface that may not seem important even to the user may turn out to be so when you observe closely. My student noticed that, of the two systems she tested, in the one that did not highlight and automatically scroll to matched terms in documents, users often dismissed as irrelevant documents that were in fact important for them. All of these issues illustrate the importance of real user observations, as opposed to interviews or numerical measures.
On the other hand, my own interest in IR evaluation comes from the fact that I believe it shows the importance of benchmark measures that can be used apart from user testing: and the challenge is how to combine the use of both. This is fundamentally because the machine's functionality and performance is one (although only one) of the crucial factors affecting the user: for example in some cases at least, whether or not the machine delivers all the relevant documents and delivers them fast enough is very important to the user. This is in contrast to, say, word processors where functionality is usually not a key issue. Thus while open-ended observation of real user behaviour is crucial for spotting what the important issues are, engineers can only tune a system efficiently if they can concentrate for periods on optimising performance on some simple measure. If we don't use recall and precision, we will still need to identify measures like them, even though it is important also to do more open-ended user studies to tell us how such measures relate to the overall task performance.
Although precision and recall are widely used in the IR community and provide a valuable tool for analysing the effectiveness of underlying IR engines, I see there being three key problems with respect to evaluating interactive IR systems:
Precision-recall (PR) graphs only show the retrieval performance of the underlying system and do not take into account many aspects of the user interface which impinge on the applied effectiveness of the IR system. In other words, the user interface and speed issues are not taken into account and, as a result, having two PR graphs of two different IR systems does not tell us which would perform better in a real setting for a particular set of tasks and a particular set of users.
In my talk I compared PR graphs with engine performance for cars: brake horse power, torque, fuel consumption and top speed can all be used to give a statement on how effective the car's engine is. However, very few people buy a car based purely on these features: there is little point in buying a two-seater sports car as the only car for a family of four, even if the engine is superb. PR graphs and engine performance figures share the same major benefit: as Cooper [1968] said of his Expected Search Length measure, they are useful "in a theoretical investigation of how varying a certain design parameter would affect retrieval effectiveness under certain assumed conditions regarding input queries and document collection makeup". Their main use is to improve the engine, which must now be viewed as a very small part of reviewing the system.
I personally find PR graphs hard to read. As an example, Porter's stemming algorithm should improve recall at the cost of precision, but how does this relate to the graph? And what does that mean for typical user tasks? I am currently working on a hypothesis that a more task- and user-oriented graphing method would help in interpreting which system is better for which tasks. Although some users may have a feel for what percentage of a domain's documents they are looking for, I guess that this would be a small percentage of users. Most would be interested in finding, say, "2-5" or "10-20" documents on a topic. With ranked retrieval systems, a format which maps effectiveness to distance down the ranked list would also be closer to most users' interactions with the system. While such graphs are not available, we must attempt to translate PR graphs into tasks to establish when such features are useful and whether an improvement is actually likely to help end-users.
Finally, I feel that the belief in having to plot recall and precision graphs to have a piece of IR work accepted may be holding back advances in interface design. It is simply too expensive to create a test collection for real tasks (and often too artificial to adapt test collections for interactive use).
We need alternatives....
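One way to picture the task-oriented format suggested above is to ask how far down a ranked list a user must read to find a given number of relevant documents. The following is only a minimal sketch under stated assumptions: the function name `rank_needed`, the document identifiers and the relevance judgements are all hypothetical, and the ranking is assumed to be a strict ordering with no ties.

```python
from typing import Optional, Sequence, Set


def rank_needed(ranking: Sequence[str], relevant: Set[str], wanted: int) -> Optional[int]:
    """Return how far down the ranked list a user must read to find
    `wanted` relevant documents, or None if the list holds fewer."""
    found = 0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            found += 1
            if found == wanted:
                return position
    return None


# Hypothetical ranked output and relevance judgements.
ranking = ["d7", "d2", "d9", "d4", "d1", "d8", "d3"]
relevant = {"d2", "d4", "d3", "d5"}

for wanted in (1, 2, 3, 4):
    print(wanted, rank_needed(ranking, relevant, wanted))
```

Plotting this depth against the number of documents wanted, for two systems, would show directly which one serves a "find 2-5 documents" task better, without the user having to think in percentages of the collection.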
Cooper, W.S. "Expected search length: a single measure of retrieval effectiveness based on the weak ordering action of retrieval systems", American Documentation, vol. 19, January 1968.
It is widely recognised that relevance is a (if not 'the') central concept of IR. I show that: (i) there are many kinds of relevance, not just one, and (ii) these kinds can be classified in a four-dimensional space. It is commonly accepted that relevance is a relation between two entities of two groups. In the first group, we have one of the following three entities: a surrogate, a document, or the information received by the user.
In the second group, we have one of the following four entities: a query, a request, a perceived information need, or a "real" information need.
On this basis, a relevance can be seen as a relation between two entities, one from each group: the relevance of a surrogate to a query, or the relevance of the information received by the user to the information need, and so on. Therefore, a relevance seems to be a point in a two-dimensional space. But these are not all the possible relevances, since two more dimensions have to be taken into account. First, the above mentioned entities can be decomposed into the following three components (third dimension):
Therefore, a surrogate (a document, some information) is relevant to a query (request, perceived/"real" information need) with respect to one or more of these components. The fourth dimension is time: a surrogate (a document, some information) may not be relevant to a query (request, perceived/"real" information need) at a certain point in time, and be relevant later, or vice versa. Summarising, each relevance can be seen as a point in a four-dimensional space, the values of each dimension being:
Finally, some problems/questions:
This talk was a shortened version of a talk given at Visual Languages 96. Some of the illustrations I used in that talk are available from my home page. From there you can also download a copy of a much more detailed paper by Green and Petre, which appeared in the Journal of Visual Languages and Computing 1996.
The problem I address is, how to evaluate the usability of information-based artefacts and notations. More particularly, how to evaluate them cheaply. No laborious user-testing, no detailed analyses and modelling by HCI experts.
HCI folk have slowly learnt that expensive, time-consuming evaluative methodologies are not taken up by creators, for very good reasons. I propose something different, a framework of user-centred discussion tools.
We all have concepts that are vaguely known but unformulated. Discussion tools are elucidations of such concepts. If they resonate with your experience, they can promote a higher level of discourse amongst you, the designers and creators. They can create goals and aspirations, promote the reuse of good ideas in new contexts, and provide a basis for informed critique. Standard examples can become common currency and best of all, once concepts are named and exposed, their interrelationships can be appreciated.
The set I propose is called the cognitive dimensions framework, a still-unfinalised set of about a dozen terms such as 'viscosity', 'premature commitment', 'abstraction level'.
The problem of usability evaluation has been attacked in three ways. One way is to perform user testing, in which users are watched while they use the system. This is expensive and somewhat artificial: users in a laboratory, performing specified tasks, are different from users in the wild. Moreover, it takes far too long. Designers cannot hang around for the necessary weeks, then make a small modification and repeat the cycle. Results from lab testing are valuable but in general they are too expensive and not 100% trustworthy.
The second way is to use predictive user models. This is much cheaper and has been extremely successful in some cases. Much the commonest technique is GOMS, in which all the users' core tasks are scrutinised in detail, possible methods are worked out, and the time required for the user actions to accomplish these methods is predicted in detail. Although GOMS has achieved considerable success as a cheaper and quicker alternative to user testing, it is still an expensive undertaking, and it requires the assistance of HCI experts. Moreover, it really needs to be performed on the finished system, since the time-predictions it deals in depend on physical and perceptual characteristics of the interface display. (See Olson and Olson, 1990, for one of many accounts of GOMS).
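As a rough illustration of the kind of time prediction involved, the sketch below sums per-action times for two alternative methods. It is a keystroke-level simplification, not a full GOMS analysis; the operator times are assumed, illustrative values and the two methods are hypothetical.

```python
# Illustrative operator times in seconds (assumed values, in the style of
# keystroke-level models; calibrate for a real interface and user group).
OPERATOR_SECONDS = {
    "K": 0.28,  # press a key or button
    "P": 1.10,  # point at a target with a mouse
    "H": 0.40,  # move hands between keyboard and mouse
    "M": 1.35,  # mental preparation
}


def predict_seconds(method: str) -> float:
    """Sum the operator times for a method written as a string such as 'MHPK'."""
    return sum(OPERATOR_SECONDS[op] for op in method)


# Two hypothetical methods for the same task: issuing a query-expansion command.
via_menu = "MHPKPK"    # think, reach for mouse, open menu, choose item
via_shortcut = "MKK"   # think, press a two-key shortcut

print(round(predict_seconds(via_menu), 2))
print(round(predict_seconds(via_shortcut), 2))
```

Even this toy version shows why the analysis depends on the finished interface: the prediction changes as soon as a menu is reorganised or a shortcut is added.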
I would certainly recommend designers of information retrieval systems to use GOMS rather than to perform user testing. But my purpose here is to describe my cognitive dimensions framework, one of a new generation of lightweight, approximate evaluation methods which constitute the third type of attack on the problem of usability evaluation.
The cognitive dimensions framework is not an analytic method. Rather, it is a set of discussion tools. My purpose is to provide a way in which some evaluation can be done by the designers themselves.
I believe what we need is to improve the quality of discussion. Experts make sophisticated judgements about systems, but they have difficulty talking about their judgements because they don't have a shared set of terms. Also, experts tend to make narrow judgements, based on their own needs of the moment and their guesses about what other people may need; and other experts don't always point out the omissions. Again, if they had a shared set of terms, and that set was fairly complete, it would prompt a more complete consideration.
In short, experts would be in a good position to make useful early judgements, if (i) they had better terms with which to think about the issues and discuss them, and (ii) there was some kind of checklist. The terms might or might not describe a new idea; most probably, the expert will recognise a concept as something that had been in his or her mind, but had never before been clearly articulated and named.
Discussion tools are good concepts, not too detailed and not too woolly, that capture enough important aspects of something to make it much easier to talk about that thing. They promote discussion and informed evaluation.
To be effective, they must be shared - you and I must have the same vocabulary if we are going to talk. And it is better still if we share some standard examples. And it is best of all if we know some of the pros and cons - the trade-offs between one thing and another.
They have many advantages:
What discussion tools do not need to do is describe novel ideas. If they just give a name to something you had often thought about but never thought about giving a name to, that's fine. But sometimes they might be ideas that are new to you, and that's fine, too.
Figure 1 illustrates a real-life discussion without the benefit of discussion tools; Figure 2 shows how it might have been if the participants had possessed shared concepts - shorter, more accurate, and less frustrating.
The previous section illustrated the idea of discussion tools; the 'cognitive dimensions' framework, first introduced in Green (1989), is meant to provide discussion tools to help people who are not HCI experts in making quick but useful evaluations.
I believe that taken together, the cognitive dimensions describe enough aspects to give a fair idea of how users will get on with a system, and can help both designers and users think and talk about the system. Each dimension describes one aspect of a system, something that affects how users will manage. The framework contains about 12 dimensions, but in this document I shall not go into detail.
Something that is cognitively hard in one environment may be much easier in another. As an extreme example, writing a Pascal program over the phone is not recommended, even though Pascal has an easy syntax for writing on paper. The properties also depend on the tools available in a given environment; a word processor with no search-and-replace is a different system to one that has a powerful and easy-to-use tool.
You can fix any kind of difficulty, either by changing the notation or by changing the environment, but you usually pay for it with another kind. For example the search-and-replace tool fixes one form of 'viscosity' but it introduces a new and slightly higher level of abstract thought. Search tools that use regular expressions are an even better fix - but they introduce a very much higher level of abstract thought.
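The trade-off between viscosity and abstraction in search tools can be seen in a small example. This is a sketch only; the sample text and the pattern are hypothetical, and Python's standard `re` module stands in for whatever search tool an editor provides.

```python
import re

text = "See fig. 3 and fig 12; compare Fig. 7."

# Literal search-and-replace: one pattern, one fix, misses the variants.
literal = text.replace("fig. ", "Figure ")

# Regular expression: one change covers every variant, but the user must
# now reason about character classes, optional parts and groups.
pattern = re.compile(r"\b[Ff]ig\.?\s+(\d+)")
regex = pattern.sub(r"Figure \1", text)

print(literal)
print(regex)
```

The literal replacement leaves two of the three references untouched, so the user must repeat the operation for each variant; the regular expression handles all three at once, at the price of a notation that many users find hard to write and harder to check.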
I shall illustrate some of these trade-off relationships below.
The global spelling corrector might seem a good idea, but is it always? Sometimes you want to make sure the user looks at every single case separately.
In all the examples I shall give, you should remember that different circumstances might demand different designs.
A notation might be as good as you like, considered just as a static entity, but what needs to be evaluated is the whole process of using it: building or writing in it, debugging it, reading it, maintaining it over the years. Certain sorts of diagrammatic notations, such as diagrammatic query languages, are probably easier to understand than symbolic notations, but they are also harder to 'write' and harder to modify. These various aspects all need to be balanced out.
In its present form, the framework has 14 dimensions (Figure 3) although if I'm honest I think there are overlaps.
I obviously can't go through all of those here. On the other hand, to describe the trade-offs issue I have to at least give thumbnail sketches of three or four.
A viscous system resists change - you have to do a lot of work. For example, if you have produced a long and carefully formatted document and then someone tells you to change the style of all the level 2 headings, say a hundred of them, correcting each one individually is hard work ('repetition viscosity').
One solution is to create a 'style sheet' that defines a level 2 heading. By changing the definition, you can change all the headings.
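A minimal sketch of the style-sheet idea follows, assuming a hypothetical document model in which each heading stores only the name of its style. None of the names below come from a real word processor.

```python
from dataclasses import dataclass


@dataclass
class Heading:
    text: str
    style: str  # name of a style, looked up in the style sheet


# Hypothetical style sheet: one definition shared by every level 2 heading.
style_sheet = {"heading2": {"font": "Helvetica", "size": 14, "bold": True}}

headings = [Heading(f"Section {n}", "heading2") for n in range(1, 101)]


def rendered_size(heading: Heading) -> int:
    """Look up the size a heading is displayed at, via its named style."""
    return style_sheet[heading.style]["size"]


# One edit to the definition restyles all hundred headings.
style_sheet["heading2"]["size"] = 12

print({rendered_size(h) for h in headings})
```

The hundred individual corrections collapse into a single change because the headings refer to the definition instead of each carrying its own copy of the formatting.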
In Green (1990) I distinguish repetition viscosity and 'knock-on' viscosity.
Change one thing, and who knows what might fall over? That's the hidden dependency problem. Spreadsheets are a fine example. So are some kinds of style sheets; if one style is defined in terms of another, changing the parent might give you an unpleasant surprise.
The key aspect is not the fact that A depends on B, but that the dependency is not made visible.
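The point about visibility can be sketched with a hypothetical set of styles in which each style names its parent. The `resolve` function shows where the dependency comes from; the `dependents` function is one assumed way of making it visible before a change is made.

```python
from typing import Dict, List

# Hypothetical styles; "based_on" names the parent style.
styles: Dict[str, dict] = {
    "body":      {"based_on": None,    "size": 11},
    "quotation": {"based_on": "body",  "indent": 2},
    "inset":     {"based_on": "body",  "indent": 4},
    "caption":   {"based_on": "inset", "italic": True},
}


def resolve(name: str) -> dict:
    """Merge a style with everything it inherits from its ancestors."""
    style = styles[name]
    inherited = resolve(style["based_on"]) if style["based_on"] else {}
    own = {k: v for k, v in style.items() if k != "based_on"}
    return {**inherited, **own}


def dependents(name: str) -> List[str]:
    """List every style that would change if `name` changed."""
    direct = [s for s, d in styles.items() if d["based_on"] == name]
    return direct + [t for s in direct for t in dependents(s)]


print(resolve("caption"))
print(dependents("body"))
```

A system that showed the output of something like `dependents` next to each style definition would still have the dependency, but it would no longer be hidden.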
I like to think of this as the number of new high-level concepts that have to be learnt to make use of a system, such as 'style sheet' or 'regular expression'. Each new idea is a significant barrier to learning and acceptance. High-level ideas - that is, ideas that do not refer to easily-produced concrete instances - are very much harder still.
Let's continue with the word-processor example. If you decide to use styles, notice that you have to decide what styles you want and how they are related very early. Too soon, sometimes. Afterwards, you might wish that you had defined something like "inset block quotation [no space afterwards]" as a child of 'inset' rather than as a child of 'quotation' - but it's probably too late now; the viscosity of the system would make it too much work to redefine everything.
Premature commitment occurs in all sorts of places. Try drawing a map to guide someone to your house. What's the betting you start too close to one side of the paper, or start at the wrong scale, and the last few turnings are all scrunched up?
Or try working out a few formulae with a pocket calculator. How often do you get in a knot because you've started entering the formula in a way that makes the computation extra hard?
In the examples I have given, I have repeatedly illustrated how fixing a problem in one dimension leads to a problem with another dimension. A sort of law of conservation of cussedness.
However, the designer can choose. He or she can fix the viscosity problem by increasing the abstraction level OR by changing to a different kind of notation. Using more abstractions is the commonest solution, but by no means the only one.
I like to compare the cussedness of information structures with the behaviour of ideal gases. Three quantities, temperature, pressure and volume, describe an ideal gas. If you want to increase the temperature, you can keep the pressure constant (but the volume must be allowed to increase) or you can keep the volume constant (but the pressure must be allowed to increase). Taken in pairs, these three dimensions are orthogonal. But you cannot raise the temperature while holding constant both the pressure and the volume. The parallels may not be exact, but they are intriguing.
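The constraint behind the analogy can be written out with the ideal gas law. The symbols $n$ (amount of gas) and $R$ (the gas constant) do not appear in the text above and are introduced here only to state the law; $T$ is absolute temperature, and subscripts 1 and 2 denote the states before and after the change.

```latex
\begin{align}
PV &= nRT \quad\Longrightarrow\quad T = \frac{PV}{nR} \\
\frac{V_2}{V_1} &= \frac{T_2}{T_1} \quad (\text{constant } P,\ n) \\
\frac{P_2}{P_1} &= \frac{T_2}{T_1} \quad (\text{constant } V,\ n)
\end{align}
```

For a fixed amount of gas, fixing both $P$ and $V$ fixes $T$, so any rise in temperature has to be paid for in one of the other two quantities.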
Figure 4 illustrates some of the trade-off relationships that are frequently observed. We have seen how viscosity can be reduced by introducing more abstractions; but getting the abstractions right demands thinking ahead (i.e. there is a premature commitment problem). Viscosity increases the cost of premature commitment; if the abstractions themselves are viscous, then getting them wrong means you're in trouble. Furthermore, all too often abstractions introduce problems of hidden dependencies, because one abstraction is defined in terms of another.
Secondary notation and visibility were not discussed above.
My approach is very easy to use.
What you get out of this approach is a rough and sketchy evaluation. As we saw back in Figure 1, it will correspond to what users talk about. And if you were to consider changing the design, it will alert you to some of the possible trade-off consequences.
What it will not do is give precise time estimates. For that you should use GOMS (see above).
Nor does the question of users' knowledge get much attention in my framework. A much more thorough approach to users' knowledge has been developed by Lewis et al. (1991). In their 'cognitive walkthrough' methodology, the emphasis is on how the user knows what to do next.
For best results, I think all three methods could be employed, since they address different facets of the problem.
In practice, the cognitive dimensions approach seems to have hit the right note for many people. It has been tried as a teaching tool and as a simple evaluation method, in both cases with success.
In the field of information retrieval not much has been done with it, but I think it would be a good way to make a preliminary evaluation of the usability of a system, rather than going straight for expensive user testing.
Green, T. R. G. (1989) Cognitive dimensions of notations. In A. Sutcliffe and L. Macaulay (Eds.) People and Computers V. Cambridge University Press.
Green, T. R. G. (1990) The cognitive dimension of viscosity: a sticky problem for HCI. In D. Diaper, D. Gilmore, G. Cockton and B. Shackel (Eds.) Human-Computer Interaction - INTERACT '90. Elsevier.
Green, T. R. G. and Petre, M. (1996) Usability analysis of visual programming environments. J. Visual Languages and Computing, 7, 131-174.
Lewis, C., Rieman, J. and Bell, B. (1991) Problem-centered design for expressiveness and facility in a graphical programming system. Human-Computer Interaction, 6 (3-4), 319-355.
Olson, J. R. and Olson, G. M. (1990) The growth of cognitive modeling in human-computer interaction since GOMS. Human-Computer Interaction, 5, 221-265.
This example illustrates a discussion that would have been better if the discussants had shared appropriate concepts.
NB: this discussion referred to a version of Framemaker that is now obsolete.
A: ALL files in the book should be identical in everything except body pages. Master pages, paragraph formats, reference pages, should be the same.
B: Framemaker does provide this ... File -> Use Formats allows you to copy all or some formatting categories to all or some files in the book.
A: Grrrrrrrrr ........ Oh People Of Little Imagination !!!!!!
Sure I can do this ... manually, every time I change a reference page, master page, or paragraph format .....
What I was talking about was some mechanism that automatically detected when I had made such a change. ( ..... ) Or better yet, putting all of these pages in a central database for the entire book ......
C: There is an argument against basing one paragraph style on another, a method several systems use. A change in a parent style may cause unexpected problems among the children. I have had some unpleasant surprises of this sort in Microsoft Word.
A: Framemaker is too viscous.
B: With respect to what task?
A: With respect to updating components of a book. It needs to have a higher abstraction level, such as a style tree.
C: Watch out for the hidden dependencies of a style tree.
(further possible comments)
The abstraction level will be difficult to master; getting the styles right may impose lookahead.
The terms are part of the framework of cognitive dimensions presented in this document.
| dimension | thumbnail description |
|---|---|
| Viscosity | resistance to change |
| Hidden Dependencies | important links between entities are not visible |
| Visibility and Juxtaposibility | ability to view components easily |
| Imposed Lookahead | constraints on order of doing things |
| Secondary Notation | extra information in means other than program syntax |
| Closeness of Mapping | representation maps to domain |
| Progressive Evaluation | ability to check while incomplete |
| Hard Mental Operations | operations that tax working memory |
| Diffuseness/Terseness | succinctness of language |
| Abstraction Gradient | amount of abstraction required, amount possible |
| Role-expressiveness | purpose of a component is readily inferred |
| Error-proneness | syntax provokes slips |
| Perceptual Mapping | important meanings conveyed by position, size, colour etc |
| Consistency | similar semantics expressed in similar syntax |

We felt we would eventually like to achieve an overall, universal evaluation method for IR systems: that is what we would like to work towards.
The ultimate criterion for evaluating IR systems is: the satisfaction of the user's overall task in their work context (typically their information need will be a subtask of this overall task).
We may be able to develop standard evaluation measures for retrieval: but which of these measures are important and relevant to a particular case will depend upon the TYPE of the information need involved in that case. We need different measures for different information needs, not single standard measures for software independent of user needs and tasks; and a corollary is that a system optimised for one set of measures is most unlikely to be optimal or even good for all user tasks and needs.
It may be useful to divide the overall IR task into subtasks, and to have metrics or standard measures for each subtask. A first approximation to such a division for a typical modern IR system might be like this:
Note that understanding the information need better is an important output, not input, of most retrieval sessions. As the user interacts with the machine and performs successive retrievals, so their knowledge of their information need improves, and this is often an important and useful product of the activity.
Traditional experiments on recall and precision apply to only 1 of these 5 subtasks: the one the machine does. The other subtasks are done by the user, but note that features of the user interface strongly affect how easily and well the user can do these tasks e.g. the information displayed in the surrogate determines performance of (4), highlighting and scrolling to matched terms has a strong effect on (5).
It would be possible and sensible to do evaluation experiments on each subtask separately, and to devise performance measures for each separately.
In addition, overall performance of the combined user-machine system is a distinct issue. It should be measured in itself, i.e. as the net result of a complete retrieval session. Measures for this overall activity could be developed and standardised, e.g. precision, recall, satisfaction of the information need, costs to the user.
Overall performance is affected not only by performance on each subtask, but also by the connection and co-ordination of these subtasks. This again is itself influenced by the user interface e.g. if only the first 10 surrogates are displayed, then the selection (precision and recall) of the machine for its first 10 items matters, but not for more items; and probably the order within the 10 doesn't matter very much. In other words, a feature of the user interface may completely change what measures for subtasks are important.
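The effect of a fixed display size on which measures matter can be sketched as follows. The function name, the document identifiers and the relevance judgements are hypothetical; the cut-off `k` plays the role of the number of surrogates the interface displays.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(ranking: Sequence[str], relevant: Set[str], k: int) -> Tuple[float, float]:
    """Precision and recall computed over only the first k displayed items."""
    shown = ranking[:k]
    hits = sum(1 for doc_id in shown if doc_id in relevant)
    precision = hits / len(shown) if shown else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgements and two hypothetical system outputs.
relevant = {"d1", "d4", "d8", "d15"}
system_a = [f"d{i}" for i in range(1, 21)]
# Same first ten items in a different order, and a different tail.
system_b = list(reversed(system_a[:10])) + list(reversed(system_a[10:]))

print(precision_recall_at_k(system_a, relevant, 10))
print(precision_recall_at_k(system_b, relevant, 10))
```

Both outputs score identically at a cut-off of 10, although their orderings differ everywhere. A measure chosen to match the interface ignores exactly the differences the user never sees.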
In summary, we are concerned with three levels:
Any measures used at lower levels are relevant only to the extent that they contribute to success and economy at achieving the task at the top level.
Initially, we considered what was meant by "Combining HCI and IR Evaluation". Two (of the possibly many) interpretations we thought about were: applying HCI evaluation techniques to the problem(s) of IR evaluation, and bridging the gap between the traditional IR component evaluation and HCI evaluation, where the latter involves task, domain, and possibly user modelling and analysis. We recognised that the IR community is already combining HCI and IR evaluation in the above senses, although there is much to learn from the HCI community (and this was amply demonstrated throughout the workshop).
There was some discussion on the position paper presented by Thomas Green, and specifically on the "Cognitive Dimensions Framework" (CDF). Some felt that CDF focused too much on the user interface, and in particular on the representation and presentation of information, and that the framework did not say enough about users, tasks, or domains. (My understanding is that CDF is independent of task, user or domain, in the sense that you choose the sets of tasks, users, and domains for which you are designing a system - note, not simply a user interface - and that the CDF can be used to explore the design space for such a system. Naturally, this observation was not made during the working session, but afterwards over a coffee!) We also wondered whether the CDF was based on some description or model of users' cognitive abilities. (I understand that it is, and that details can be found in papers on CDF.)
At this point, the focus shifted to looking at a recent large-scale investigation by Nick Belkin (and student) on the effectiveness, usability, and utility of relevance feedback in an IR system. Various important issues emerged during the description of, and discussions on, this investigation.
Perhaps the most difficult problem in the evaluation of interactive IR systems is how to generalise the findings of any particular experiment. Namely, given that an experiment is conducted in the context of some task(s), some domain(s), and some sets of users, how can the results be generalised, or at least applied, to different tasks, domains, and user groups? Here it might be desirable to be able to characterise (or classify) tasks, domains, and users according to some agreed framework, so that at least we might be able to compare the results of experiments. (Aside: The TREC experiments can be seen as an attempt to make experiments comparable by carefully defining the task(s), domain(s) and intended users.)
If we could find a way of classifying experiments, then we could develop a catalogue of experimental results which detailed, for each experimental investigation: task description, domain description, users, experiment(s) performed, findings of experiments, general findings (if possible), caveats, other related experiments, and supporting papers/publications.
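The catalogue entry described above could be represented by a simple record type. This is a sketch only: the class name, field names and the `comparable` rule are assumptions made for illustration, and the two entries contain placeholder text, not real experimental results.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentRecord:
    task: str
    domain: str
    users: str
    experiments: List[str]
    findings: List[str]
    general_findings: List[str] = field(default_factory=list)
    caveats: List[str] = field(default_factory=list)
    related_experiments: List[str] = field(default_factory=list)
    publications: List[str] = field(default_factory=list)


def comparable(a: ExperimentRecord, b: ExperimentRecord) -> bool:
    """Treat two records as directly comparable only if task, domain and users all match."""
    return (a.task, a.domain, a.users) == (b.task, b.domain, b.users)


# Placeholder entries, not real results.
first = ExperimentRecord(
    task="known-item search", domain="library catalogue", users="students",
    experiments=["placeholder experiment"], findings=["placeholder finding"],
)
second = ExperimentRecord(
    task="subject search", domain="library catalogue", users="students",
    experiments=["placeholder experiment"], findings=["placeholder finding"],
)

print(comparable(first, first), comparable(first, second))
```

A strict equality test is the crudest possible rule; an agreed classification of tasks, domains and users would allow something more graded.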
We were reminded that real estate agents (realtors in the USA, I think), say that the three most important things about a property are: position, position and position. In the context of IR experiments, the three most important things may be: user, task and domain.
We discussed a variety of recurrent issues on the evaluation of information systems. When carrying out an evaluation a stance is necessarily taken on certain important issues, if in some cases only by accident. Thus, pinning down these issues and being explicit about them during the planning stages of an evaluation should be beneficial.
Three issues that seem particularly important are:
In sum, the group decided that the goal of an evaluation should be to evaluate the functional use of the information system with the aim of collecting data to improve its functionality. There is no silver bullet for deciding how best to do this.
The working group discussed the topic first generally and then on the basis of the given scenario concerning evaluating a hypertext-IR version of the British Highway Code. The aim was to develop an evaluation plan for the scenario, but the outcome of the discussion was instead a set of issues which need to be clarified in order to perform an evaluation.
The group felt that evaluation must be built into the design of the system right at the start of the process.
The following four-phase process may be found in standard textbooks on cost-benefit analysis (CBA):
Phase 1: Analysis of the evaluation situation
- problem definition
- goal definition and setting the scope of the analysis
- recognition (creation) of the alternatives to be evaluated
- recognition of any limiting factors (e.g., cost) that must be met (e.g., not be exceeded)
- identification of benefits, costs and other weaknesses to be taken into account in the evaluation

Phase 2: Clarification of evaluation methods
- classifying benefits, costs and other weaknesses and identifying their place and time of occurrence
- choosing scales of measurement for benefits, costs and other weaknesses
- valuing benefits, costs and other weaknesses
- deciding on how to combine measurements in different dimensions
- setting basis of prediction (of long-term activities), e.g., how is the user population estimated to grow over the years?

Phase 3: Performing the analysis
- data collection, e.g., in the form of test tasks
- calculations
- presentation of the results

Phase 4: Interpretation and decision-making
This should take the whole process into account: were the alternatives and criteria identified correctly, was the combination of measurements done correctly, etc.

