In this discussion paper, I attempt to provide a brief statement of the problems involved in evaluating hypermedia systems. Before going on to consider IR systems in general, and hypermedia IR systems in particular, I examine a few of the basic assumptions that are commonly made in this context, and a few of the issues that are raised if those assumptions are accepted.
The first of these assumptions is that evaluation is for systems, or in other words that it is systems that we should be evaluating.
One conventional model of a computer-based system identifies three principal elements: the human user, the interface, and the mechanism.
Such a model makes explicit the principal function of the system, which is to assist or support the human user in their performance of some task or series of tasks. We could define a task, perhaps, as any activity or action that involves the manipulation of certain concepts or objects and that a person performs in order to attain a particular desirable goal or purpose -- that goal typically being to bring about a change in some state of affairs, problem space or domain, a change that has a specific result. The kind of support or assistance that a system provides to a user consists in enabling the user to perform certain of their tasks with greater ease, efficiency and effectiveness, and thus in facilitating the successful attainment of their goals.
One obvious result of accepting the first assumption is that, whatever type of system we are considering, we should not limit our evaluation to that of the operation of the mechanism. But equally, nor should we limit evaluation to that of the operation of the interface -- even though the importance of studying the interaction between user and computer has only been recognised relatively recently. Evaluations of the operation, the activity, of all three elements of the system will each be important components of the evaluation of the whole.
The second assumption is that systems are for evaluation, or in other words that the research activity that we should be engaged in is one of evaluation.
Users might profitably undertake evaluation, so that they can compare one system with another, so that they can find out which is more appropriate for their needs. From the point of view of the system developer, the reason for conducting evaluation is to obtain some indication of the kind of changes that might have to be made to the system if its value or worth is to improve.
Evaluation, then, is essentially the activity of determining how well something does what it is supposed to do, or in other words measuring the quality of its performance of its function. So in the context of computer-based systems, whose function is to support their users' performance of certain goal-oriented tasks, evaluation is concerned with determining the level of success with which systems enable their users to achieve their goals. This is a significant point, and one to which I shall return.
The third assumption might be summed up in the phrase `each system its evaluation'. This rather glib statement has two separate implications.
The first of these is that there exist systems of different types, supportive of different tasks, and intended to facilitate the attainment of different goals.
One fundamental characteristic that varies from system to system is degree of interactivity: in other words, the continuity of the control that the user is allowed to exercise over decision-making in the course of the interaction process. For instance, there are a number of ways, differing in level of interactivity, in which a user may select an option from the range of options that are available to them at any time and then communicate this selection to a computer. Allowing the user to directly manipulate visual symbols on-screen, to modify actions or strategies in the light of the ongoing responses of the computer, to enjoy constant access to user-selectable options enabling parameters to be altered and forms of display to be modified -- these are all marks of a highly interactive system.
The second implication of this statement is that there exist different methods of evaluation, and that certain of these are considered to be appropriate for application to systems of different types. At least, this has historically been the case, because the particular sorts of activity carried out in the name of evaluation have generally proceeded in accordance with particular conceptions of the intended function of individual systems and the goals of their users. A statement of the function of a system, couched in terms of the way in which it is intended to assist its users in performing certain tasks and therefore in attaining certain goals, is often used as the principal criterion on which evaluation is deemed to be based.
Once a decision has been made as to the criterion or criteria on whose basis evaluation is to proceed, decisions of three further kinds need to be taken, which follow to a greater or lesser extent from the initial selection of criterion. The first of these is the identification of appropriate measures to be used in the evaluation process; the second is the design of the data-collection method to be used in the observation or computation of the values of these measures; and the third is the decision as to whether data should be collected in a real-life, operational setting, or in the context of an experiment in which certain variables may be subject to some degree of control.
An important subclass of activity involved in evaluation is that of measurement. Measurement involves establishing, for the purposes of comparison between objects, the position, on a quantitative scale, of the value of some directly observable attribute of the object under consideration.
Unfortunately, however, it is often considered that the level of success with which a system enables its users to achieve their goals is not an attribute whose values may be directly observed. The need, therefore, is to determine what other attributes there are whose values (a) are directly observable and (b) are sufficiently strongly correlated with that level of success to be used as indicators or measures of it.
The nature of the measures that are selected will obviously have a bearing on the nature of the method used to collect the raw data that is to be used in the calculation of the values of the measures. There are, of course, many different methods of data collection, each geared to providing data that is either objective or subjective in nature, and that lends itself either to quantitative or to qualitative analysis.
Any combination of these methods may be implemented either in an operational setting, or in the context of a laboratory experiment in which certain variables may be controlled, and which in those respects does not reflect the complexity of real-world experience.
So, we can point to a few general conclusions that may be drawn if we accept these basic assumptions. These are: that whatever type of system we're concerned with, before we can begin to evaluate it, we need to:
Our research community is concerned with the design and evaluation of IR systems. So what is the function of IR systems, what is it that they are supposed to do? IR systems are designed and implemented in order to support the user in the performance of tasks that involve information-seeking activity. In engaging in such activity, the goal of the user is to bring about a change in a particular state of affairs, or situation, that is perceived by the user to be problematic, with the specific result that this problematic situation is resolved. It is implicitly assumed by the designer and the user of any IR system that at least part of the user's problem consists in their anomalous state of knowledge (ASK) in a particular respect or domain, and that such a state may be resolved through the acquisition of certain information. The user thus has a perceived need for information, and undertakes the task of identifying and retrieving information with the goal of satisfying that need, and resolving their anomalous state of knowledge.
But simply carrying out the task of identifying those documents that satisfy such a need for information: that's only part of the wider problem faced by the person in an ASK. Other aspects of the problem actually result from the user's decision to use assistance in their performance of this task. In order to call on the support of a computerised IR system, for example, the user needs to be able to figure out how to express or articulate their problem so that it may be communicated to the retrieval mechanism -- which may itself be difficult, especially if the user is unfamiliar with the terminology of the domain of their information need. And even before the user is in a position to specify their information need clearly, they may well need to learn more about the characteristics of the original problematic situation in which they find themselves, or indeed to come to the initial perception that such a problem exists. These cognitive tasks are also ones that an IR system might be expected to support, and it is the success with which the system allows the user to achieve their goals in the performance of multiple tasks that should be the criterion on whose basis the performance of the system is judged or evaluated.
So, as evaluators, we need to consider:
but all of these criteria together, and more generally
Yet, the traditional model of the IR system lingers on, with
Two further assumptions here are
As a result, it is commonly determined that the system may be evaluated in terms of the relevance of the responses of the retrieval mechanism to single, static and specific queries -- i.e., in terms of the similarity between the retrieval set and an ideal set.
In real-world systems, however,
Other assumptions are:
The result is that a call is regularly made for such systems to be evaluated in terms of criteria other than the relevance of the documents in any individual retrieval set.
Discussion of the criteria that may be used in the evaluation of IR systems brings us on to the separate issue of which measures may be used as quantitative indicators of such criteria.
The standard measures of retrieval effectiveness are those of recall and precision, which are usually characterised as being system-centred; in terms of the terminology adopted here, they're mechanism-centred, in that they are used simply to quantify the degree of similarity between an ideal set of actually-relevant documents and a set of potentially-relevant documents retrieved through the operation of the retrieval mechanism. These measures were introduced at a time when the traditional model of IR was a more truthful representation of reality, and when systems were, for instance, far less supportive of a high level of interaction between user and mechanism. For a long time, however, there has been a need for complementary measures that take into account the changes in the design of IR systems that have taken place since.
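As a minimal sketch of how these two measures quantify the similarity between an ideal set and a retrieval set, the following Python fragment computes both for a single query. The function name and the document identifiers are hypothetical and chosen only for illustration; they do not come from any particular system.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    retrieved: identifiers of the documents returned by the mechanism
    relevant:  identifiers of the documents in the ideal set
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    # Guard against empty sets so the ratios are always defined.
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical example: four relevant documents, three retrieved, two in common.
ideal_set = {"d1", "d2", "d3", "d4"}
retrieval_set = {"d2", "d3", "d7"}
recall, precision = recall_precision(retrieval_set, ideal_set)
print(recall, precision)
```

Both values depend only on the two sets, which is why the measures are described here as mechanism-centred: nothing about the user's interaction with the system enters the calculation.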
Measures of search efficiency are also well established, and typically relate (i) some conception of the utility, usefulness, worth or value of a search, to (ii) the time, cost or effort expended, or to the length of the search measured in terms of the number of commands issued by the user or the number of documents viewed by the user.
Measures of the usability of a user interface, or the ease and appropriateness with which a user may perform the tasks, communicate the requirements, display the responses that they wish to, are again well-known, and are commonly based on the speed with which a user performs a specific, often quite low-level task, or alternatively on the speed with which they learn or come to understand how to perform such a task.
Measures of user satisfaction seem, however, to be less widely used in the IR community, although they have a distinguished history in, amongst other fields, library science. Users may be asked to give their own individual, subjective views on their satisfaction with various specific aspects of the information-seeking process, including the completeness or the precision of searches, or to give a judgement as to the overall success of the operation of the system.
One interesting point to be made about measures of user satisfaction is that their values are often, perplexingly, not related to the values of other measures. Several studies exist, for example, of end-users' searches in CD-ROM databases, which show users often to be highly satisfied with the results of their searches, even though the effectiveness of those same searches as measured in terms of recall and precision is low. Furthermore, it might not immediately be apparent to the designer what the causes of such paradoxes are, and for this reason it is especially important that measurement of user satisfaction should be supported by the elicitation of qualitative data using questionnaire-based methods.
The implication is that, to obtain a full, user-centred picture of the value of a system, it is necessary
In turn, these guidelines have obvious implications for the selection of the methods to be used in the collection of data enabling the values of particular measures to be observed or computed.
Hypermedia IR systems are IR systems of a particular type, and pose particular problems for evaluators.
The traditional model of a hypermedia IR system, just like that of a conventional IR system, includes, amongst other elements:
In terms of structural elements, the basic difference between a conventional IR system and a hypermedia IR system lies in the explicitness with which the relationships between documents are represented.
In a conventional IR system, records are stored, and considered by the retrieval mechanism, independently of one another. This might be viewed as an inappropriate simplification of reality, since one document may be related to another in any of a number of different ways. But these relationships between documents are represented only implicitly, through the use of similar sets of terms in the indexing of documents. Such relationships could be identified and acted upon by the user only if a facility were made available to calculate values of document-document similarity on the basis of the co-occurrence of index terms in each pair of records.
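One way such a facility might work is sketched below in Python. The choice of the Jaccard coefficient as the similarity function is an assumption made for illustration, as are the document identifiers and index terms; the text above does not prescribe a particular coefficient.

```python
from itertools import combinations


def term_overlap(terms_a, terms_b):
    """Jaccard coefficient of two sets of index terms (an assumed choice)."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical records, each represented only by its set of index terms.
index = {
    "doc1": {"hypertext", "browsing", "evaluation"},
    "doc2": {"hypertext", "links", "browsing"},
    "doc3": {"recall", "precision"},
}

# Document-document similarity for every pair of records.
similarity = {
    (x, y): term_overlap(index[x], index[y])
    for x, y in combinations(sorted(index), 2)
}

for pair, value in similarity.items():
    print(pair, value)
```

The relationship between doc1 and doc2 is nowhere stored in the records themselves; it emerges only when the shared index terms are compared, which is the sense in which it is represented implicitly.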
In a hypermedia IR system, some of the relationships that exist between documents are explicitly represented and stored in the form of a network of links, which are simply ordered pairs of origin and target nodes. The intention is that, in the course of their information-seeking activity, users may exploit the additional information captured in this explicit representation of document-document relationships.
In fact, it is through the implementation of links that a hypermedia system supports a particular kind of information-seeking activity or behaviour known as browsing. Instead of requiring the user to specify a query, which is then to be matched against every record in the database, the system allows the user to request specific, single nodes to be retrieved and displayed successively, by activating the link between a currently-displayed node and some target node. The level of interactivity of the retrieval process is thus particularly high.
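The idea of links as ordered pairs, and of browsing as the successive activation of links, can be sketched in a few lines of Python. The node names and helper functions here are hypothetical, and a real hypermedia system would attach much more to a link than its two end points.

```python
# A network of links: each link is an ordered pair (origin node, target node).
links = {("n1", "n2"), ("n1", "n3"), ("n2", "n4")}


def targets_from(node, links):
    """Nodes reachable from the currently displayed node by one link."""
    return sorted(target for origin, target in links if origin == node)


def browse(start, choices, links):
    """Follow a sequence of user-selected links and return the nodes displayed."""
    path = [start]
    current = start
    for choice in choices:
        if (current, choice) not in links:
            raise ValueError(f"no link from {current} to {choice}")
        current = choice  # the target node is retrieved and displayed
        path.append(current)
    return path


print(targets_from("n1", links))
path = browse("n1", ["n2", "n4"], links)
print(path)
```

Note that no query is ever specified: at each step the user chooses among the links leaving the node currently displayed, and only that single target node is retrieved.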
Different implementations of hypertext systems vary greatly in the types of link that they support, and thus in their function. Links vary in several dimensions.
A few of the issues that are currently the concern of developers of hypermedia IR systems are:
In hypermedia IR systems, it is clear that the user is the primary actor, exercising a high degree of control over successive stages of an information-seeking process, and engaging in high-level interaction with the retrieval mechanism on the basis of information needs that the system does not require to be expressed in the form of clearly-defined queries. One conclusion that may be drawn, then, is that such systems should not be evaluated solely in terms of mechanism-based criteria such as retrieval effectiveness.
Some of the problems faced by prospective evaluators of hypermedia IR systems, then, are as follows:
The urgency of the situation is fuelled by two immediate needs, for the design and evaluation of systems:
Evaluation of the prototype for the wide area dissemination of the on line version of "The Computer Journal" (http://www.dcs.gla.ac.uk/~mark3/Irides/prototype)
Irides is concerned with the design and implementation of a prototype offering the user sophisticated searching and browsing facilities. Through these facilities the user can retrieve relevant articles of "The Computer Journal" by combining query-based retrieval and hypertext browsing.
Evaluation has been the final work package of the Irides project. The evaluation step has been implemented by the GMD-IPSI, Darmstadt. The main goal of evaluation has been described by the project proposal as follows: "We will evaluate the goodness of the multimedia authoring of the hyper-structure and of the technique for metering the access for charging, preserving copyrights, and articles delivery". This brief communication aims to give some highlights of the evaluation of the automatically constructed hypertext employed to browse the articles.
Tests have been carried out with a small number of users because of time constraints. The test users were six computer scientists, who were asked to interact with the system and to concentrate on the design of the hypertext. After the users had been given, and had solved, some retrieval tasks, the following results were obtained:
- the browsing tool implemented by the ACM classification scheme anticipates the users' needs
- the network of index terms can be used in a complementary way to make the retrieval more precise
- the network of similar articles can be employed to make recall higher
- the interface is still a crucial part of such a system: users need more information on the hypertext, both to make retrieval more effective and to lower the risk of disorientation
Risø National Laboratory, Postbox 49, Denmark
Tel: 45-46775149 Fax: 45-46755170 Email:amp@risoe.dk
Keywords: Cognitive framework; empirical evaluation experiments; library system
Abstract: For a comprehensive experimental evaluation of a computer system, it is necessary to define a suitable evaluation sequence consisting of a set of boundary conditions to be able to evaluate more encompassing features of the system. This paper suggests a framework for identification of boundary conditions to be used in empirical evaluation experiments. The framework is briefly described and empirical evaluation at the boundary of the actual work environment is demonstrated by evaluation of a library system.
A systematic evaluation of complex systems should be well structured and performed at several well defined levels of user-work place interaction. At each of these levels, evaluation should be performed either analytically or empirically, or both approaches should be applied. Issues related to the contents of the information and the functionality of the system can be evaluated analytically, while issues related to its form involve context, user experience and preferences, and therefore, very likely will need an empirical approach. An analytical evaluation depends on a structured comparison of work requirements as defined by a work analysis with the design specifications. In contrast, empirical evaluation involves experimental tests of the performance of a system with reference to design objectives, or with reference to its actual performance in a laboratory with test users, or its actual performance in the ultimate context of use in a real work place.
Empirical Evaluation Approaches. For empirical evaluation experiments, it is necessary to establish an experimental work situation that creates a well defined boundary around the subject, and to study whether subjects' responses to this boundary lead to the mode of behaviour which was assumed as the design basis. For proper integration of the results, such experimental evaluation scenarios should be compatible with the structure of the work analysis underlying design as well as with the design specifications. To structure an empirical evaluation of the match between a new design and the work domain including the characteristics of its users, we need a comprehensive framework for description of work systems, see figure 1.
Figure 1. The figure demonstrates how different evaluation questions can and should be asked at the various levels described in the framework for work analysis. In addition, it is shown that a different ordering of the evaluation questions should be considered for an analytical and for an empirical approach to evaluation.
Both analytical and empirical evaluation should be considered for all the levels of figure 1, but the sequence of the levels considered will be different for an analytical and an empirical approach. For analytical evaluation of design objectives, a natural approach will be top-down from global system properties to detailed task functions. For empirical evaluation, a path from details to global features will be the best approach. Complex experiments involving, e.g., entire task situations will be meaningless if the system does not match user characteristics at the elementary level, for example if experiments are planned to evaluate the functionality of a prototype in advance of a test of the interface readability. The various boundaries of evaluative experiments are summarised below with reference to figure 1. These boundaries "move" the context successively further from the actor to encompass more and more of the total work content in some kind of increasingly complete simulation and field evaluation.
1. Match with Users' Resources and Characteristics. This level addresses the sensory-motor characteristics as well as perceptual and cognitive resources. Evaluation of perceptual and cognitive resources will focus on the size of letters, the readability of the typography and the graphics of the displays, display composition, consistency, coherence, use of colours, icons, WYSIWYG interfaces etc. This level also addresses evaluation principles that are important for the understandability of the information flow in the communication between the system and the user.
2. Support of Task Strategies and Mental Models. At this level, the following questions are addressed: Does the system support several task strategies and can the user shift goals and tasks concurrently without losing support from the system? Does the system provide the mental representations of novices and experts, and is the user's mental model of the work domain supported by the interface - also during distributed decision making?
3. Support of Cognitive Decisions and Processes. A basic question to be asked at this level is: Does the system effectively support the cognitive decisions that have to be made during task performance? Does the system support the actor's decision making: are exploration, situation analysis, goal evaluation and planning supported for familiar as well as less familiar situations?
4. Support of Relevant Task Situations. The question is here whether the system supports the entire task repertoire - are the tools adequate, their functionality sufficient and does the information cover the complete work task space? Is its capacity adequate? Experiments may serve to evaluate whether information is available about the basic concepts of the system and its overall architecture. Is it possible to navigate among tasks, and to pursue several, different task related goals?
5. Adequate Representation of Work Environment. Evaluation experiments here will investigate the relationship between the actual use of the system and users' intellectual and emotional style and their personal problem solving habits in the total work place context. For this boundary, the evaluation must be based on actual work scenarios generated from an actual work analysis, and the aim is not a task simulation but a work place simulation and must include not only the system's effect on a complex work place situation.
6. Field Evaluation in Actual Work Environment. Evaluation in the actual work context will address the questions: Does the system match organisational policies and employees' acceptance and development? What is its impact on the work context and the quality of the work situation? Does the system support several coherent work task activities and the co-operative co-ordination of activities among several users, maybe in different departments of the organisation, and does it support interaction and co-ordination with institutions outside the organisation? In other words, it will answer the question of whether the design approach and the assumed work organisation match the performance criteria and preferences of the users. Will the system be used? Do the users like to use the system, and do they actually use it over a longer time span in the daily task situations for which it was designed, and to the degree that was expected?
Evaluation in the actual work environment at boundary 6 is illustrated by a selection of evaluation tests from the evaluation of a full scale library system. The system design was based on a work analysis and then tested in laboratory experiments and evaluated at the work place within the framework boundaries of figure 1. Extensive experimental validations in the laboratory were performed to ascertain that the system could meet the work requirements before the evaluation of the system took place in the actual work context in a library. The subsequent evaluation of its use by the general public was thus an attempt to validate whether or not the system actually is the right design for supporting actual library users. The test took place over six months in a public library in order to evaluate whether the information system: 1) could be accessed and was accepted by the general public and professional librarians; 2) could provide the books asked for to the users' satisfaction; 3) would impact the library work in a way that was satisfying to the public and the professionals, and cost/effective to the organisation.
Users' resources and value criteria. One of the tests conducted to pursue the first goal was an evaluation of the iconic interface at boundaries 1 and 2. The efficiency of use, the comprehensibility of icons and the subjective user satisfaction were evaluated at the work place in a full scale prototype system by 1030 users, who responded to on-line questionnaires which appeared automatically on the screen after the user ended his/her session with the system. The questionnaire adapted to the individual user's navigation trajectory and displayed those icons that the user had met at the interface and actually employed during a search. It contained questions about the understandability of icons, which were used both as action buttons and as a means to express the topics contained in books. Fifteen different icons used as action buttons were displayed together with a textual list of action possibilities, and users were then asked to select the action that would match the icon. The associative relationship between the message of the icons and the contents of the books in the database was measured on a scale that expressed the users' perception of degree of match. Finally, users' subjective satisfaction with an icon based interface was evaluated relative to a similar text based interface. The result of the quantitative test at the work place was then tested qualitatively by 75 observations and interviews with library users after they had used the system.
Task strategies. Field studies of task strategies before the design showed that several different strategies were employed, such as analytical search by attributes, search by analogy and similarities with previous examples, browsing strategies etc. At the work place, 7100 on-line logs of all dialogue events (mouse clicks, etc.) tracked the users' strategy choices, and 220 questionnaires gave answers about their reasons for choosing a strategy, its ease of use, their strategy preferences etc. During use of the system over a longer time span, the analytical strategy became the most popular strategy: users and librarians adapted their strategy choice to the most effective strategy in the new environment. Field studies before the system was introduced showed that the analytical strategies were very rarely used in a library due to their high demands on knowledge, time, and memory resources etc.
Decision task. The second goal was pursued by an evaluation at boundary 3 of users' subjective satisfaction with the books they had retrieved from the database by use of the classification scheme. The classification scheme used had been developed from extensive field studies, and now the support of this classification scheme, its keywords and book descriptions employed for retrieval of relevant books was evaluated from structured questionnaires by 120 end users based on their reading of books. The most important performance measure was the precision of retrieved books based on users' comparison of the database classification of book contents with their own estimation of the book content and its relevance in a use situation.
Work space. Another type of experiment was used to pursue the third goal, and aimed at an evaluation of the impact of a new retrieval system on user behaviour and preferences, on the means and ends required and on the total work situation. Professional intermediaries working with a new computer system in information retrieval and cultural mediation tasks reported in questionnaires and focus group interviews at boundary 4 how the system changed their roles and left more resources for co-operation and a thorough dialogue with the users. The system supported their cultural mediation strategies, and allowed a shift to the role of a consultant analysing task problems, evaluating the quality of alternative proposals, and assisting in the choice of solutions. Secondly, they reported how important it was for the professional image and pleasure of use that errors did not occur, as the system supported exploration of alternatives and no error messages occurred.
Organisational work context. The possible positive or negative impact on quality of work, and the system's potential deterioration of professional skills during changes in role allocation among users and librarians, was evaluated at boundaries 5 and 6. The question was whether the new system would lead to a simplistic interpretation of users' needs, an impoverishment of their reading experiences and, as well, an impoverishment of the librarian's domain knowledge. A computer logging of librarians' and users' use of the system was implemented and combined with focus group interviews with the staff and user groups. Both types of data were compared with records of librarians' and users' book descriptions from earlier field studies, to make sure that the database information exceeded in number and breadth their book knowledge. This was done to make sure that both users and librarians through the use of the system would increase their competence and knowledge about the document collection.
The impact on cost/effectiveness was measured by the increase in number and distribution in loan of high quality books, as the ultimate institutional goal for public libraries is to promote education and cultural values. A more even distribution of book loans means more effective use of the book stock, which has economic implications for a library's costs for book acquisition.
The framework suggested here has been developed to be used concurrently for analysis of user-work interaction in system design and evaluation both in the laboratory and in field studies of real work environments (Pejtersen and Rasmussen 1997, Rasmussen, Pejtersen and Goodstein, 1994, Pejtersen, 1994, 1993, 1992, 1991, Goodstein and Pejtersen, 1989). Empirical evaluation poses special problems with respect to validity and generalization of results. Very often, evaluation questions jump among several levels of analysis with very different requirements for control of the boundary conditions. This makes it very difficult to generalise the evaluation results to other, similar work domains and support systems. To enable generalization and the transfer of findings among actual work analyses and different experimental designs, a consistent framework is necessary.
PEJTERSEN, A. M. and RASMUSSEN, J. (1997): Effectiveness testing of complex systems. In: Handbook of Human Factors and Ergonomics. Ed. by G. Salvendy, Wiley. In Press.
PEJTERSEN, A. M. (1996): Empirical work place evaluation of complex systems. In: Advances in Applied Ergonomics. Proceedings of the 1st International Conference on Applied Ergonomics (ICAE'96), Istanbul, Turkey, May 21-24, 1996. Eds: Ozek, Ahmet F. and Salvendy, G. USA Publishing Corporation.
RASMUSSEN, J. PEJTERSEN, A. M. and GOODSTEIN, L. P. (1994). Cognitive systems engineering. (John Wiley, London).
PEJTERSEN, A.M. (1994): A Framework for Indexing and Representation of Information based on Work Domain Analysis: A Fiction Classification Example. In: Knowledge Organisation and Quality Management. Proceedings of Third International ISKO Conference, Copenhagen, 20-24 June 1994. Eds: Albrechtsen, H. and Ørnager, S. Indeks Verlag. Frankfurt. 1994. pp. 251-264.
PEJTERSEN, A. M. (1993): Designing Hypermedia Representations from Work Domain properties. In: Hypermedia. Proceedings der Internationalen Hypermedia Konferenz. (eds): Frei, H.P. and Schauble, P. Springer Verlag. Heidelberg.
PEJTERSEN, A. M. (1992). The Book House. An icon based database system for fiction retrieval in public libraries. In: The Marketing of Library and Information Services 2. Ed: Cronin, B., (ASLIB, London). pp. 572-591.
PEJTERSEN, A. M., (1992). New model for multimedia interfaces to online public access catalogues. The Electronic Library, the International Journal for Minicomputer, Microcomputer and Software Applications in Libraries. Vol. 10, No. 6.
PEJTERSEN, A. M. (1991): Interfaces based on Associative Semantics for Browsing in Information retrieval. Risø M-2794. p.143.
GOODSTEIN, L.P and PEJTERSEN, A.M. (1989): The Book House. System Functionality and Evaluation. Risø National Laboratory, Risø-M-2793.
With some overheads from the workshop.
In modern work, stable work procedures are not the norm. Many tasks are discretionary. Explicit consideration of goals and constraints and exploration of the boundaries of acceptable performance are often required to optimise effectiveness. For this reason, the object of modelling can no longer be the "task," but must include all the features of the work environment, and the interpretation of these features by the actors. The interaction of work environment and actors' resource constraints creates the task ad hoc. A wide variety of options is found with respect to when and how to approach a given task. Therefore, to understand why a particular piece of behaviour is preferred instead of another possible pattern, we have to understand how the action alternatives in a particular situation are eliminated so that one unique sequence of behaviour can manifest itself. Only then can we hope to predict how a new tool will change a present work practice and the users' interaction with a new system design.
In other words, we have to identify the constraints or boundaries within a work environment that shape the behaviour of users together with the subjective performance criteria they apply to optimise performance within the remaining action possibilities.
A framework for analysis and evaluation must serve to represent the characteristics of both the physical work environment and the "situational" interpretation of this environment by the actors involved, depending on their physical, perceptual properties, and their skills, strategies and values. The analysis of the work domain activity includes the identification of behaviour-shaping constraints
This dimension represents the landscape within which the work takes place and it serves to make explicit its goals, constraints, and productive resources. The representation presents an inventory of system elements and it is, in the short perspective, independent of particular situations and tasks. It identifies the functional elements and their means-ends relations or, in other words, the productive resources which are available for the actors to 'design' their local activity. The analysis is structured at several levels of functional abstraction and, in this way, includes representations of physical configuration and anatomy, of physical work processes, of general functions, of priority measures, and, finally, of system goals and constraints with reference to the environment.
An analysis within this dimension of the framework will identify the structure and general content of the global knowledge base of the work system which must be considered for design of work support systems.
This dimension instantiates that subset of the basic means-ends network which is relevant for a particular task. Analysis should not be made in terms of work procedures but in terms of the objectives, functions and resources active in prototypical work situations, and the related information requirements. A set of such prototypical work situations can be used in various combinations to characterise a set of task situations to be considered for information system design.
For the next dimension of analysis, a shift in representational language is made. For each of the activities defined, the relevant tasks are identified in terms of decision making functions, such as situation analysis, goal evaluation, planning, or actual execution. This representation breaks down work activities into subroutines which can be related to the cognitive activities of the involved people and which serve to identify the cognitive tasks that are the targets for support systems. The information gained in this analysis will identify the knowledge items from the work domain representation which are relevant in a particular situation. In addition, it assists in identifying the queries which are likely to be made by decision makers for retrieving information.
A further analysis of the decision task requires another shift in language in order to be able to compare task requirements with the cognitive resources and subjective preferences of the individual actors. For this purpose, the mental strategies which can be used for each of the decision functions are identified by detailed analyses of the actual work performance (e.g., by protocol analysis). Each strategy is based on a particular kind of mental model, a set of tactical rules and a related mode of interpretation of observations. The characteristics of the various strategies are identified with reference to subjective performance criteria such as time needed, cognitive strain, amount of information required, cost of failure, etc. Knowledge about the available effective strategies is important for two purposes: 1) the design of the user-system dialogue and navigational paths to the database content, and 2) the user interface design. It supplies the designer with several coherent sets of mental models, data formats, and tactical rule sets which can be used by actors with varying expertise and competence and which, therefore, should be supported by several different, corresponding interface representations.
At this stage, the action possibilities in work performance of the individual have been delimited through an identification of the work-dependent behaviour-shaping constraints down to the level of mental strategies which can be employed for the decision functions allocated to each individual actor. In order to judge which strategy will actually be used, the resource requirements of the various strategies have to be compared to the cognitive resource profiles of the actors. Therefore, this perspective of analysis is focused on the background of the relevant user category and on the level of expertise and the performance criteria of the individual actors.
In order to identify the actors actually involved in the prototypical task situations, it is necessary to find the principles and criteria governing the allocation of roles among the groups and individuals involved. This allocation of roles to actors is governed by the social organisation and management structure and is dynamically dependent upon the circumstances and criteria such as actor competency, access to information, minimising the communication needed for co-ordination, sharing of work load, complying with regulations (e.g., union agreements).
A work analysis will not proceed as an orderly top-down pass through these perspectives as described above, starting with the work domain analysis and finishing with the users' cognitive resources and value criteria. The broader context of the entire work environment will be activated during the analysis of both task activity and user characteristics. In particular, the analysis of division and co-ordination of work and social organisation will be closely related to the analysis of the work domain and the task situation, and frequent iterations among the perspectives will be necessary.
1. Define Boundaries
2. Generalization and transfer of results
Some examples of laboratory and work place evaluations
* Compatibility with human sensory and anthropometric characteristics
* Understandability:
The multimedia (text, icons, animation, sound etc.) of the interface provided functions
* Effectiveness:
Are task strategies supported?
Are decision tasks supported?
Do provided functions meet task requirements?
Does system support work situation?
Can system be used?
* Acceptability
Individual: does system match users' preferences?
Social: does system match work organisation and co-ordination?
Will the system be used?
1. What is to be evaluated: a product, a concept, a partial solution, a prototype with surface levels of the total functionality of the system, or a prototype with full functionality of only a part of the system?
2. Comparative: Should several system solutions be chosen for evaluation? Evaluations can be comparative if several systems are to be checked for a differential result to support choice among design alternatives.
3. Absolute: testing whether a single system will be able to or does in fact achieve a given goal and level of performance.
4. Boundaries: What are the (categories of) situations to be evaluated?
5. What constitutes an unambiguous definition of goals and objectives which can be transferred to the evaluation level (what is the evaluation supposed to establish)?
6. How will performance be defined and how will it be measured? What will be the link between evaluation objectives and measurable performance variables?
7. What are the effects of the intermediate variables (training, experience, task, environment, etc.)?
8. Who will participate in the evaluation? Real end users, test subjects, design team members, colleagues? In iterative design, a distinction should be made between use of subjects from the work place in a work situation, test users in a laboratory, and the testing done among the design team members and colleagues in the project group.
9. Where to perform the evaluation? In a laboratory or at the users' work place?
10. What evaluation data should be collected? Subjective user/expert judgements, qualitative
11. Objective, quantitative measures and objective performance criteria
12. What quantitative data should be collected? Quantitative measurements can be performed as objective measures or as subjective measures
13. What quantifiable performance measures are relevant? Quantifiable measurements may include time to do a task, error rate, number of features actually used, number of features never used
14. What methods to use to capture data? Synchronised audio recording and videotaping, questionnaires, interviews, logging of observational data, of the actual use of the varied functionality of a product. Automatic data logging
15. What methods to select for data analysis and data encoding to obtain reliability? What statistical methods and what data integration method for coding, sampling and analysis?
16. What methods to choose for qualitative analysis of case studies?
17. What methods to select for presentation of results for customers or test subjects? Will a summary of videotapes be effective?
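As a minimal sketch of the quantifiable measures named in item 13 (time to do a task, error rate, number of features actually used, number of features never used), the following assumes that automatic data logging yields one record per test session. The record layout, the field names and the definition of error rate as errors per logged action are illustrative assumptions, not part of the framework.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """One logged test session (hypothetical record layout)."""
    seconds: float                 # time taken to do the task
    errors: int                    # user errors observed
    actions: int                   # total user actions logged
    features_used: set = field(default_factory=set)


def summarise(sessions, all_features):
    """Aggregate simple performance measures over a list of sessions."""
    if not sessions:
        raise ValueError("at least one session is required")
    total_actions = sum(s.actions for s in sessions)
    used = set()
    for s in sessions:
        used |= s.features_used
    available = set(all_features)
    return {
        "mean_time_s": sum(s.seconds for s in sessions) / len(sessions),
        # Assumed definition: errors per logged action.
        "error_rate": (sum(s.errors for s in sessions) / total_actions
                       if total_actions else 0.0),
        "features_used": len(used & available),
        "features_never_used": sorted(available - used),
    }


log = [
    Session(60, 1, 10, {"search"}),
    Session(120, 3, 30, {"search", "browse"}),
]
print(summarise(log, ["search", "browse", "print"]))
```

Such objective measures would normally be read alongside the subjective and qualitative data mentioned in items 10 and 12 rather than in place of them.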
This was the working group on Evaluation of WWW Information Systems, which worked on the Scenario of the company which wanted to sell more of its internet computers by writing, and making available for free, systems which took greater advantage of the hypertextual structure of the Web than existing systems.
With respect to the general issue of evaluation of WWW Search Engines, the Working Group made the following points:
With respect to the Scenario, we believe that this is also a classic case of an evaluation problem, namely, identifying the overall goal of the system, and then finding criteria and measures appropriate to that goal.
We tried to address this issue by considering various levels of evaluation.
Given that the overall goal of the system that the company intended to implement was to convince people to buy the computers that the company manufactures, the top level criterion is sales, and so the most appropriate evaluation measure for the system is whether, and how much, sales increased after introduction of the system.
Although obvious and realistic, we also see that this measure is somewhat facile. Therefore, a somewhat deeper level of analysis leads to attractiveness as the criterion for evaluation, based on the idea that if the system is attractive to users, they will want to buy the computers that the company makes. One candidate measure for this criterion is reuses of the system by individual users.
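The group did not define how reuse would be counted. One possible reading, sketched here under stated assumptions, is the share of individual users who come back to the system on more than one distinct day. The log format of (user, day) pairs and the function name are hypothetical.

```python
from collections import defaultdict


def reuse_rate(visits):
    """Share of users seen on two or more distinct days.

    visits: iterable of (user_id, day) pairs from a usage log.
    """
    days_by_user = defaultdict(set)
    for user_id, day in visits:
        days_by_user[user_id].add(day)
    if not days_by_user:
        return 0.0
    returning = sum(1 for days in days_by_user.values() if len(days) > 1)
    return returning / len(days_by_user)


# Hypothetical log: user "a" returns on a second day; "b" and "c" do not.
log = [("a", "d1"), ("a", "d2"), ("b", "d1"), ("b", "d1"), ("c", "d3")]
print(reuse_rate(log))
```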
Attractiveness and reuse are perhaps better than sales, in that they might be applicable before the system is fully implemented, but they suffer from some measure of unrealism, with respect to the overall goal. We therefore suggested, as a next level of evaluation, the criterion of preference (of the proposed system to any other available alternatives), measured by willingness to pay for the use of the system.
Our overall approach to evaluation in this scenario, as mentioned above, was based on the assumption that one should always condition evaluation measures and methods to the goal of the system which is being evaluated. This led us to some quite non-traditional potential measures for evaluation of the proposed system. However, if the goal were changed, so that, for instance, the company was interested in designing and producing a "good" system, or in understanding how a system could be "good", or in knowing whether their ideas about the value of taking more account of the hypertextual structure of the Internet were "correct" or "useful", then the criteria that we would have suggested, and the associated measures, would have been quite different, and perhaps more like those which we normally associate with IR and HCI evaluation.
Initially, we had a broad discussion about the concept of hypermedia, problems concerning indexing of different types of media (images and sound) and ways of doing it: manual, automatic and semi-automatic.
It was confirmed that the discussion was meant to address evaluation at the engine level, not evaluation of the interface design of hypermedia systems.
The evaluation of hypermedia systems, in general, was discussed with reference to the talk by the invited guest speaker: Annelise Mark Pejtersen, Risoe, Denmark. By applying different evaluation methods for different aspects, parts and media within the hypermedia system, combined with the application of several methods to the same part of the system, one may benefit from comparative analysis of overlapping results.
The big question is: how to conduct the evaluation.
It was suggested that because of the subjective nature of the content of hypertext/hypermedia systems, a possible method of evaluating the system performance would be to apply work task situations, in the way this approach was presented by Pia Borlund and Peter Ingwersen at the first Mira Workshop in Glasgow, May 1996. This approach was discussed.
In order to focus on the evaluational aspects with reference to something more concrete, we turned towards the proposed scenario (a scenario about a local radio broadcast station, which searches various sources of information such as images, sounds and text in different languages, to be used in news stories in the six o'clock news).
We never managed to come up with any finalised conclusions about how to conduct an optimal evaluation of hypermedia systems -- we even failed when it came to separating the engine level from the interface level.
This group first tried to work on the scenario:
Le Monde wishes to raise its profile in the IR community and is particularly interested in work on hypertexts/hypermedias as they see this as a major future direction for newspapers. To start this work they are sponsoring the construction of Le Collection du Monde - a large hypermedia test collection. Design it...
Very soon two questions arose:
Question 1 results in ... nearly giving up the scenario. Question 2 is the starting point and the kernel of the following discussion.
Let us begin with the conclusion of this discussion: the group thinks that a test collection need not include a predefined set of accesses or a predefined set of evaluation schemes, but that it must include good tools for these tasks, so that experiments can be documented and retrieved.
Another major point: for us, building an HT test collection must not be affected by the objectives of benchmarking or comparative experiments.
The discussion may be described according to four aspects:
The general conclusion is that this discussion was pretty rich but would have been more constructive and reusable if we had referred to the models displayed by previous speakers.
Because most of the participants of the working group didn't know much about the automatic construction of hypermedia (hypertext), Maristella first explained the main issues involved in automatically creating a hypertext. Some of the issues are of a more technical nature, such as how to store the data structure on disk in a DBMS. Other issues are of a more conceptual nature. We focused on the following issues: which types of links the hypertext contains (examples of link types are structural links according to the section/subsection hierarchy, citation links from a bibliography entry to the cited document, and similarity links from a node to semantically similar nodes); which types of nodes it contains (examples of nodes are documents or parts of documents, index terms, and concepts; the ACM classification schema is a good example of a graph where the nodes are concepts); whether it contains only intra-document links or inter-document links as well; how to visualise the structure of the hypertext (and other user interface issues); media types other than text; and how to update a hyperbase. We also mentioned the question of efficiency.
We then talked about the impact of these issues on evaluation. For example, with regard to the types of links, one could do user studies of how useful a particular type of link is. This can be evaluated by comparing a system with that link type with a system without links of that type, or by evaluating the usage frequency of links of that type. A special case arises if one includes manually constructed links: the performance of the automatic link creation can then be assessed against them. Similar things can be evaluated with regard to the types of nodes. For instance, if the system provides different kinds of nodes and allows the users to navigate both between nodes of the same type and between nodes of dissimilar types (e.g. by associating a document with a concept), it could be tested whether a certain type of node facilitates better navigation or is confusing to the user. With regard to the user interface issues, one could compare different ways of visualising the graph structure of the hyperbase. Also, it would be possible to test how much difference it makes to include a search facility as opposed to a browsing-only interface. An interesting question would be how to include continuity features (as mentioned by Steve Draper in his position statement) and how much that helps the users. Another issue with regard to visualisation is whether or not the weights assigned to links by the system should be displayed to the user. Concerning updates to the hyperbase we thought of two main things to evaluate. One question is of course the efficiency. (Does the system need to rebuild the whole graph when a document is added?) Secondly, one could imagine that there are systems which perform a quick and dirty insertion operation, where the resulting hyperbase is incorrect in some sense, and then rebuild the hyperbase only once a week, say.
For this kind of system it would be interesting to investigate the trade-off between correctness after insertion operations and the speed of the inserts. As there are some systems which build the graph of the hyperbase on the fly, at the time of posing a query, it is important to ascertain the efficiency of answering a query and the speed of link traversal.
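Two of the link evaluations discussed above can be sketched in a few lines: the usage frequency of each link type, and a comparison of automatically created links with manually constructed ones. This is only an illustration. The log format, the function names and the choice to treat the manual links as the reference set (giving precision and recall) are assumptions that the working group did not specify.

```python
from collections import Counter


def link_usage_by_type(traversals):
    """Relative usage frequency per link type.

    traversals: iterable of link-type labels, one per followed link.
    """
    counts = Counter(traversals)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


def compare_with_manual(auto_links, manual_links):
    """Precision and recall of automatic links against manual links.

    Links are (source_node, target_node) pairs; the manual links are
    assumed to be the reference set.
    """
    auto, manual = set(auto_links), set(manual_links)
    common = auto & manual
    precision = len(common) / len(auto) if auto else 0.0
    recall = len(common) / len(manual) if manual else 0.0
    return precision, recall


# Hypothetical traversal log and link sets.
print(link_usage_by_type(["structural", "structural", "citation", "similarity"]))
print(compare_with_manual({("a", "b"), ("a", "c"), ("b", "c")},
                          {("a", "b"), ("c", "d")}))
```

Usage frequency alone does not show that a link type is useful, so counts of this kind would complement the comparative user studies mentioned in the text, not replace them.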
Lastly, there is the question of different media types. As none of the participants of the working group was aware of the existence of any system that can cope with non-textual data in a non-trivial way, we took the easy way out and declared that there is nothing to evaluate, yet. One would have to at least have a look at one system that does this to learn about things to evaluate.
How does one evaluate a WWW search engine? We quickly realised that in order to even consider this question we would first have to think about the setting in which the search engine would be used. To do this we posed the following question: "If you were a travel agent, how would a WWW search engine help your business?" To further focus discussion, we then considered Annelise Mark Pejtersen's model of design. These are five issues of the model:
We found that these questions helped us direct discussion about the travel agent scenario, but we did not progress sufficiently to propose a definite plan.
We believe that Pejtersen's model might be used as a basis for formulating such a plan because it sets guidelines for the collection of quantitative and qualitative data. Perhaps more importantly, it could be used as a reference model for considering the merits of any evaluation methodology.
We were impressed by two features of the Pejtersen framework. First, it treats the system in its context of use, rather than restricting evaluation to individual users; and second, it encourages an iterative reflection on different aspects of the system, so that thoughts about 'relevant strategies' are related to 'co-operative work co-ordination', for example. Without such a framework, it is all too easy to concentrate on a few aspects to evaluate, possibly leading to developing systems that are easy to use but can only support tasks that are irrelevant to the real context.
A further good point about the Pejtersen framework for designing evaluation is that it reminds us to think about both directions of reasoning in evaluation: (A) from the widest context towards its consequences for user interface detail, and (B) from what happens in the details of a user's actions and errors out to wider features. She calls (A) "analytical evaluation" and (B) "empirical evaluation". (B) is to do with validation e.g. you run an experiment, and take surface measures of time and errors which you assume can stand as measures of whole tasks and functions and the utility of the whole device. (A) is to do with verification: with whether the implementation does actually derive from and satisfy the wide requirements, including implicit requirements for it to work in its context. For evaluation, this is in part to do with "illuminative evaluation": with open ended observation in the work place that can spot whether some issue has been missed in the design (and so is not only a problem, but will not be allowed for in the (B) type experiments).

