Providing a structured method for integrating non-speech audio into human-computer interfaces.

All of the citations here are in my bibliography. The full thesis can be downloaded from my publications list page, where papers I have written on the topic can also be found.

Thesis Abstract

This thesis provides a framework for integrating non-speech sound into human-computer interfaces. Previously there was no structured way of doing this; it was done in an ad hoc manner by individual designers, which led to ineffective uses of sound. In order to add sounds that improve usability, two questions must be answered: What sounds should be used, and where is it best to use them? With these answers a structured method for adding sound can be created.

An investigation of earcons as a means of presenting information in sound was undertaken. A series of detailed experiments showed that earcons were effective, especially if musical timbres were used. Parallel earcons were also investigated (where two earcons are played simultaneously) and an experiment showed that they could increase sound presentation rates. From these results guidelines were drawn up for designers to use when creating usable earcons. These formed the first half of the structured method for integrating sound into interfaces.

An informal analysis technique was designed to investigate interactions to identify situations where hidden information existed and where non-speech sound could be used to overcome the associated problems. Interactions were considered in terms of events, status and modes to find hidden information. This information was then categorised in terms of the feedback needed to present it. Several examples of the use of the technique were presented. This technique formed the second half of the structured method.

The structured method was evaluated by testing sonically-enhanced scrollbars, buttons and windows. Experimental results showed that sound could improve usability by increasing performance, reducing time to recover from errors and reducing workload. There was also no increased annoyance due to the sound. Thus the structured method for integrating sound into interfaces was shown to be effective when applied to existing interface widgets.

 

1.1 INTRODUCTION

The combination of visual and auditory information at the human-computer interface is a natural step forward. In everyday life both senses combine to give complementary information about the world; they are interdependent. The visual system gives us detailed data about a small area of focus whereas the auditory system provides general data from all around, alerting us to things outside our peripheral vision. The combination of these two senses gives much of the information we need about our everyday environment. Dannenberg & Blattner ([23], pp xviii-xix) discuss some of the advantages of using this approach in multimedia/multimodal computer systems:

"In our interaction with the world around us, we use many senses. Through each sense we interpret the external world using representations and organisations to accommodate that use. The senses enhance each other in various ways, adding synergies or further informational dimensions".

They go on to say:

"People communicate more effectively through multiple channels. ... Music and other sound in film or drama can be used to communicate aspects of the plot or situation that are not verbalised by the actors. Ancient drama used a chorus and musicians to put the action into its proper setting without interfering with the plot. Similarly, non-speech audio messages can communicate to the computer user without interfering with an application".

These advantages can be brought to the multimodal human-computer interface. Whilst directing our visual attention to one task, such as editing a document, we can still monitor the state of other tasks on our machine. Currently, almost all information presented by computers uses the visual sense. This means information can be missed because of visual overload or because the user is not looking in the right place at the right time. A multimodal interface that integrated information output to both senses could capitalise on the interdependence between them and present information in the most efficient and natural way possible. This thesis aims to investigate the creation of such multimodal interfaces.

The classical uses of non-speech sound can be found in the human factors literature (see Deatherage [48] or McCormick & Sanders [116]). Here it is used mainly for alarms and warnings or monitoring and status information. Alarms are signals designed to interrupt the on-going task to indicate something that requires immediate attention. Monitoring sounds provide information about some on-going task. Buxton [38] extends these ideas and suggests that encoded messages could be used to pass more complex information in sound and it is this type of auditory feedback that will be considered here.

The use of sound to convey information in computers is not new. In the early days of computing, programmers used to attach speakers to a computer's bus or program counter [168]. The speaker would click each time the program counter changed. Programmers would get to know the patterns and rhythms of the sounds and could recognise what the machine was doing. Another everyday example is the sound of a hard disk. Users can often tell when a save or copy operation has finished by the noise their disk makes. This allows them to do other things whilst waiting for the copy to finish. Sound is therefore an important information provider, giving users information about things in their systems that they cannot see. It is time that sound was specifically designed into computer systems rather than being an add-on or an accident of design that the user can take advantage of. The aim of the research described here is to provide a method to do this.

As DiGiano & Baecker [55] suggest, non-speech audio is becoming a standard feature of most new computer systems. Next Computers [175] have had high quality sound input and output facilities since they were first brought out and Sun Microsystems and Silicon Graphics [154,185] have both introduced workstations with similar facilities. As Loy [110] says, MIDI interfaces are built in to many machines and are available for most others so that high quality music synthesisers are easily controllable. The hardware is therefore available but, as yet, it is unclear what it should be used for. The hardware manufacturers see it as a selling point but its only real use to date is in games or for electronic musicians. The powerful hardware plays no part in the everyday interactions of ordinary users. Another interesting point is made by DiGiano & Baecker [55]: "The computer industry is moving towards smaller, more portable computers with displays limited by current technology to fewer colours, less pixels, and slower update rates". They suggest that sound can be used to present information that is not available on the portable computer due to lack of display capability.

We have seen that users will take advantage of sounds in their computer systems and that there is sophisticated sound hardware available currently doing nothing. The next step that must be taken is to link these two together. The sound hardware should be put to use to enhance the everyday interactions of users with their computers. This is the area addressed by the research described in this thesis.

1.1.1 Research topics in auditory interface design

In 1989 Buxton, Gaver & Bly [39] suggested six topics that needed further research in the area of auditory interfaces, and these suggestions partly motivated the work in this thesis. The research topics are: the uses of non-speech sound at the interface; sound and its relation to graphical feedback; the mapping of information to sound; the structure of audio messages; system support for sound; and the user's manipulation of sounds.

These topics are presented here so that the description of the research in this thesis can be put in context. After the contents of the thesis have been described in section 1.6, the work in the thesis will be explained in terms of this research agenda.

One question that might be asked is: Why use sound to present information? A graphical method could be used instead. The drawback with this is that it puts an even greater load on the visual channel. Furthermore, sound has certain advantages. For example, it can be heard from all around, it does not disrupt the user's visual attention and it can alert the user to changes very effectively. It is for these reasons that this thesis suggests sound should be used to enhance the graphical user interface.

1.2 MOTIVATION FOR RESEARCH INTO AUDITORY INTERFACES

Some of the general advantages that can be gained from adding sound have been described, but what specific benefits does it offer? There are many reasons why it is important to use sound in user interfaces:

The area of auditory interfaces is growing as more and more researchers see the possibilities offered by sound because, as Hapeshi & Jones ( [89], p 94) suggest, "Multimedia provide an opportunity to combine the relative advantages of visual and auditory presentations in ways that can lead to enhanced learning and recall". There are several examples of systems that use sound and exploit some of its advantages. However, because the research area is still in its infancy, most of these systems have been content to show that adding sound is possible. There are very few examples of systems where sound has been added in a structured way and then formally evaluated to investigate the effects it had. This is one of the aims of this thesis.

1.3 WHAT SOUNDS SHOULD BE USED AND WHERE?

Section 1.2 showed that there are many compelling reasons for using sound at the interface. This brings up two fundamental questions: What sounds should be used at the interface, and where should they be used?

Prior to the work reported in this thesis there was no structured method a designer could use to add sound; it had to be done in an ad hoc manner for each interface. This led to systems where sound was used but gave no benefit, either because the sounds themselves were inappropriate or because they were used in inappropriate places. If sounds do not provide any advantages then there is little point in using them; they may even become an annoyance that the user will want to turn off. However, if the sounds provide information users need then they will not be turned off. The work described in this thesis answers these two questions and, from the answers, provides a structured method to allow a designer (not necessarily skilled in sound design) to add effective auditory feedback that will improve usability. The structured method provides a series of steps that the designer can follow to find out where to use sound and then to create the sounds needed.

There are several different methods for presenting information in sound and two of the main ones are: Auditory icons [74] and earcons [25]. Auditory icons use natural, everyday sounds to represent actions and objects within an interface. The sounds have an intuitive link to the thing they represent. For example, selecting an icon might make a tapping sound because the user presses on the icon with the cursor. Auditory icons have been used in several interfaces. Whilst they have been shown to improve usability [79], no formal evaluation has taken place. One drawback is that some situations in a user interface have no everyday equivalents and so there are no natural sounds that can be used. For example, there is no everyday equivalent to a database search, so a sound with an intuitive link could not be found.

Earcons are the other main method of presenting information in sound. They differ from auditory icons in the types of sounds they use. Earcons are abstract, synthetic tones that can be used in structured combinations to create sound messages to represent parts of an interface. Earcons are composed of motives, which are small sub-units that can be combined in different ways. They have no intuitive link to what is represented; it must be learned. Prior to the research described in this thesis, earcons had never been evaluated. The best ways to create them were not known. It was not even clear if users would be able to learn the structure of earcons or the mapping between the earcon and its meaning. This lack of knowledge motivated the investigation of earcons carried out in this thesis. When more was known about earcons a set of guidelines for their production could be created. These guidelines should also embody knowledge about the perception of sound so that a designer with no skill in sound design could create effective earcons.
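To make the structure of earcons concrete, the following sketch shows how an earcon might be represented as a combination of motives. This is purely illustrative: the data structures and the example error earcons are hypothetical, not taken from the thesis, which defines earcons musically rather than in code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Note:
    pitch: str       # e.g. "C3" in the pitch notation described in section 1.4.4
    duration: float  # in beats: 1.0 = quarter note, 0.5 = eighth note

@dataclass
class Motive:
    """A short rhythmic/melodic building block; earcons are built by combining motives."""
    timbre: str        # e.g. a musical instrument timbre such as "piano"
    notes: List[Note]

@dataclass
class Earcon:
    """An abstract, structured audio message representing part of an interface."""
    meaning: str           # the interface object or action the earcon stands for
    motives: List[Motive]

# A hypothetical family of error earcons: a shared motive identifies the family,
# and a varied second motive distinguishes the individual messages.
error_motive = Motive("piano", [Note("C3", 0.5), Note("C3", 0.5), Note("G3", 1.0)])
overflow  = Earcon("error: overflow",  [error_motive, Motive("piano", [Note("E3", 1.0)])])
underflow = Earcon("error: underflow", [error_motive, Motive("piano", [Note("A3", 1.0)])])

for e in (overflow, underflow):
    print(e.meaning, "->", [(n.pitch, n.duration) for m in e.motives for n in m.notes])
```

Because the structure is explicit, related messages can share motives while differing in timbre, rhythm or pitch, which is what allows families of earcons to be built up.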

Neither of the two sound presentation methods above gives any precise rules as to where in the interface the sounds should be used. The work on auditory icons proposed that they should be used in ways suggested by the natural environment. As discussed above, this can be a problem due to the abstract nature of computer systems; there may be no everyday equivalent of the interaction to which sound must be added. This work also only uses sound redundantly with graphical feedback. Sounds can do more than simply indicate errors or supply redundant feedback for what is already available on the graphical display. They should be used to present information that is not currently displayed (give more information) or to present existing information in a more effective way so that users can deal with it more efficiently. A method is needed to find situations in the interface where sound might be useful, and this thesis presents such a method. It should provide for a clear, consistent and effective use of non-speech audio across the interface. Designers will then have a technique for identifying where sound would be useful and for using it in a structured way, rather than it being an ad hoc decision.

In the research described in this thesis, sound is used to make explicit information that is hidden in the interface. Hidden information is an important source of errors: users often cannot operate the interface effectively when the information they need is not presented to them. There are many reasons why information might be hidden: It may not be available because of hardware limitations such as CPU power; it may be available but difficult to get at; there may be too much information, so that some is missed because of overload; or the small area of focus of the human visual system may mean that things are not seen. This thesis describes an informal analysis technique that can be used to find hidden information that can cause errors. This technique models an interaction in terms of event, status and mode information and then categorises this in terms of the feedback needed to present it.
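As an illustration only (the thesis presents this as an informal, paper-based analysis, not as code), the modelling step might be sketched as follows; the names and the simplified scrollbar example are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum

class Kind(Enum):
    EVENT = "event"    # happens at a point in time, e.g. a button press being registered
    STATUS = "status"  # the on-going state of the system, e.g. the position in a document
    MODE = "mode"      # a state that changes the meaning of user actions, e.g. insert vs. overwrite

@dataclass
class InfoItem:
    name: str
    kind: Kind
    hidden: bool  # is the information missing, hard to get at, or easily missed?

# A hypothetical, much-simplified analysis of a scrollbar interaction.
scrollbar = [
    InfoItem("thumb has reached the target page", Kind.EVENT,  hidden=True),
    InfoItem("current position in the document",  Kind.STATUS, hidden=False),
    InfoItem("dragging the thumb vs. paging",      Kind.MODE,   hidden=True),
]

# Items judged to be hidden are the candidates for auditory feedback.
for item in scrollbar:
    if item.hidden:
        print(f"candidate for sound: {item.name} ({item.kind.value})")
```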

Many uses of sound at the human-computer interface are never evaluated. One reason for this is that research into the area is very new, so example systems are few in number. Most of the interfaces developed have just aimed to show that adding sound is possible. However, for the research area to develop and grow it must be shown that sound can effectively improve usability. Therefore, formal testing of sonically-enhanced interfaces is needed. One aim of this research is to make sure that the effects of sound are fully investigated to discover its impact. In particular, annoyance is considered. This is often cited as one of the main reasons for not using sound at the interface. This research investigates whether sound is annoying for the primary user of the computer system.

The answers to the two questions of what sounds to use and where to use them are combined to produce a structured method for adding sound to user interfaces. The analysis technique is used to find where to add sound and then the earcon guidelines are used to create the sounds needed. This method is tested to make sure that the guidelines for creating sounds are effective, that the places to add sound suggested by the analysis technique are the right ones, and that usability is improved.

1.4 A DEFINITION OF TERMS

1.4.1 Usability

In the section above one of the aims of the thesis was shown to be creating a structured method for adding sound that would increase usability. What is meant by usability in this case? In ISO standard 9241-11 (reported in [19], p 135 and also described in [126]) it is defined as: "The effectiveness, efficiency and satisfaction with which specific users achieve specified goals in particular environments". Bevan & Macleod [19] suggest that effectiveness can be measured by accuracy, efficiency by time and satisfaction by subjective workload measures. This definition of usability will be used when measuring the effectiveness of the structured method for adding sound.
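As a rough sketch of how these three measures could be recorded in an evaluation (the names, the data and the NASA-TLX style workload rating below are illustrative assumptions, not the thesis's actual analysis):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    correct: int            # tasks completed successfully
    attempted: int          # tasks attempted
    time_taken_s: float     # total time spent on the tasks, in seconds
    workload_rating: float  # subjective workload, e.g. a NASA-TLX style score (lower is better)

def usability_summary(r: TaskResult) -> dict:
    """Effectiveness measured by accuracy, efficiency by time, satisfaction by workload."""
    return {
        "effectiveness (accuracy)": r.correct / r.attempted,
        "efficiency (time in s)": r.time_taken_s,
        "satisfaction (workload)": r.workload_rating,
    }

print(usability_summary(TaskResult(correct=18, attempted=20, time_taken_s=312.0, workload_rating=42.5)))
```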

1.4.2 Multimedia and multimodal systems

The research described in this thesis aims to create multimodal interfaces. What is a multimodal interface and how does it differ from a multimedia one? There are, as yet, no accepted definitions of the terms multimedia and multimodal, as Alty and Mayes both describe [3,115]. This thesis uses the definitions proposed by Mayes: in outline, a multimedia system is one that can present information through several different output media (such as text, graphics, video and sound), whereas a multimodal system is one that communicates with the user through more than one sensory modality.

Almost all computer systems are multimedia by this definition. They all have the ability to present information via different media such as graphics, text, video and sound. They are not all multimodal however. Most of the different media they use present information to the visual system. Very few systems make much of their capacity to produce sound. Errors are sometimes indicated by beeps but almost all interactions take place in the visual modality. The aim of this research is to broaden this and make everyday interactions with computers use the auditory modality as well as the visual.

1.4.3 Musical notation used in the thesis

Standard musical notation is used to describe the earcons in this thesis. In this very brief description only the parts of musical notation used by the sounds in the thesis are described. For a more detailed description of the notation used see Scholes [148]. The earcons used are based around the quarter note. Whole notes are four times as long as quarter notes, half notes twice as long, eighth notes half the length, etc. A quarter note rest is a period of silence for the length of a quarter note. These time divisions and their iconic notations are:

Figure: the standard symbols for the whole note, half note, quarter note and eighth note, and the quarter note rest.
The arrangement of notes on the stave (the series of horizontal lines) defines the rhythm of the earcon. An example earcon might consist of three quarter notes of increasing pitch.

A note with a `>' above it is accented (played slightly louder than normal); with a `<' above it, it is muted. A sequence of notes with a `<' underneath gets louder (a crescendo) and with a `>' underneath gets quieter (a decrescendo). The height of a note on the stave indicates its relative pitch. This is only a very simple overview of musical notation.
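To make the relative durations described above concrete, here is a small illustrative sketch (not from the thesis) converting note values into seconds at a given tempo, where the quarter note defines the beat:

```python
# Note values relative to the quarter note (one beat), converted to seconds at a given tempo.
NOTE_VALUES = {
    "whole": 4.0,         # four times as long as a quarter note
    "half": 2.0,          # twice as long
    "quarter": 1.0,       # the reference value (one beat)
    "eighth": 0.5,        # half as long
    "quarter rest": 1.0,  # silence lasting one quarter note
}

def duration_seconds(value: str, tempo_bpm: float) -> float:
    """Length in seconds of a note value, where the tempo is quarter notes (beats) per minute."""
    return NOTE_VALUES[value] * 60.0 / tempo_bpm

# At 120 beats per minute a quarter note lasts 0.5 s and a whole note 2 s.
for value in ("whole", "half", "quarter", "eighth"):
    print(value, duration_seconds(value, 120.0), "s")
```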

1.4.4 Pitch notation used in the thesis

In addition to describing the notes and rhythms used, the octave of the notes must be specified. There are eight octaves of seven notes in the western diatonic system [148]. There are many different systems for notating pitch; the one used in this thesis is described in Scholes. In this commonly used system a note, for example `C', is followed by an octave number, for example:

C1: 1046 Hz    C2: 523 Hz    C3 (middle C): 261 Hz    C4: 130 Hz    C5: 65 Hz
So A above middle C (440 Hz) would be A3. This system will be used throughout the thesis to express pitch values.
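As an illustration, the mapping from this notation to frequency can be sketched as follows. The octave frequencies are those in the table above (which rounds them to the nearest hertz); the equal-temperament arithmetic for notes within an octave, and the function itself, are assumptions added here for clarity rather than anything defined in the thesis:

```python
# Approximate frequency of a pitch written in the thesis's notation (middle C = C3).
SEMITONES_ABOVE_C = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
C_FREQUENCY = {1: 1046.5, 2: 523.3, 3: 261.6, 4: 130.8, 5: 65.4}  # C1..C5 as in the table above

def frequency(note: str, octave: int) -> float:
    """Equal-temperament frequency in Hz, counting upwards from the C of the given octave."""
    return C_FREQUENCY[octave] * 2 ** (SEMITONES_ABOVE_C[note] / 12)

print(round(frequency("A", 3)))  # A above middle C: prints 440
```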

1.5 THESIS AIMS

In this section the main aims of the thesis will be summarised. The overall aim of this research is to provide a structured method that designers can use to integrate sounds into human-computer interfaces. By doing this it is also hoped that sound will be shown to be effective at communicating information and able to increase the usability of systems. Before the method can be created two questions must be answered:

What sounds should be used at the human-computer interface? The main aims of this part of the work are:

Where should sound be used at the human-computer interface? The main aims of this part of the work are:

These two components will be brought together and the structured method will be evaluated. The aim of the evaluation will be:

1.6 CONTENTS OF THE THESIS

Figure 1.1 shows the structure of the thesis and how the chapters contribute to the two questions being investigated. Chapters 2 and 3 set the work in context, Chapters 4 and 5 investigate what sounds are the best to use, Chapter 6 shows where sound should be used and Chapter 7 brings all the work together to show the structured method in action. The following paragraphs give an overview of each chapter.

Figure 1.1: Structure of the thesis.

Chapter 2 gives an introduction to psychoacoustics, the study of the perception of sound. This is important when designing auditory interfaces because using sounds without regard for psychoacoustics may lead to the user being unable to differentiate one sound from another or being unable to hear the sounds. The main aspects of the area are dealt with including: Pitch and loudness perception, timbre, localisation and auditory pattern recognition. The chapter concludes by suggesting that a set of guidelines incorporating this information would be useful so that auditory interface designers would not need to have an in-depth knowledge of psychoacoustics.

Chapter 3 provides a background of existing research in the area of non-speech audio at the interface. It gives the psychological basis for why sound could be advantageously employed at the interface. It then goes on to give detailed information about the main systems that have used sound including: Soundtrack, auditory icons, earcons and auditory windows. The chapter highlights the fact that there are no effective methods in existence that enable a designer to find where to add sound to an interface. It also shows that none of the systems give any real guidance about designing the types of sounds that should be used. One of the main systems, earcons, has not even been investigated to find out if it is effective.

Chapter 4 describes a detailed experimental evaluation of earcons to see whether they are an effective means of communication. An initial experiment shows that earcons are better than unstructured bursts of sound and that musical timbres are more effective than simple tones. The performance of non-musicians is shown to be equal to that of trained musicians if musical timbres are used. A second experiment is then described which corrects some of the weaknesses in the pitches and rhythms used in the first experiment to give a significant improvement in recognition. These experiments formally show that earcons are an effective method for communicating complex information in sound. From the results some guidelines are drawn up for designers to use when creating earcons. These form the first half of the structured method for integrating sound into user interfaces.

Chapter 5 extends the work on earcons from Chapter 4. It describes a method for presenting earcons in parallel so that they take less time to play and can better keep pace with interactions in a human-computer interface. The two component parts of a compound earcon are played in parallel so that the time taken is only that of a single part. An experiment is conducted to test the recall and recognition of parallel compound earcons as compared to serial compound earcons. Results show that there are no differences in the rates of recognition between the two types. Non-musicians are again shown to be equal in performance to musicians. Parallel earcons are shown to be an effective means of increasing the presentation rates of audio messages without compromising recognition. Some extensions to the earcon creation guidelines of the previous chapter are proposed.
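The timing idea behind parallel earcons can be sketched as follows (a purely illustrative representation with hypothetical names; the thesis realised the sounds with MIDI synthesis, not with code like this):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TimedNote:
    start_beat: float
    pitch: str
    duration: float  # in beats

def serial(parts: List[List[Tuple[str, float]]]) -> List[TimedNote]:
    """Serial compound earcon: each part starts where the previous one ended."""
    notes, t = [], 0.0
    for part in parts:
        for pitch, dur in part:
            notes.append(TimedNote(t, pitch, dur))
            t += dur
    return notes

def parallel(parts: List[List[Tuple[str, float]]]) -> List[TimedNote]:
    """Parallel compound earcon: every part starts at the same time."""
    notes = []
    for part in parts:
        t = 0.0
        for pitch, dur in part:
            notes.append(TimedNote(t, pitch, dur))
            t += dur
    return notes

def total_length(notes: List[TimedNote]) -> float:
    return max(n.start_beat + n.duration for n in notes)

# Two hypothetical component parts (in practice they would differ in timbre or register
# so that they can be distinguished when heard together).
family = [("C3", 1.0), ("E3", 1.0)]   # identifies the family of messages
detail = [("G3", 0.5), ("G3", 0.5)]   # identifies the particular message

print("serial length (beats):  ", total_length(serial([family, detail])))    # 3.0
print("parallel length (beats):", total_length(parallel([family, detail])))  # 2.0
```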

Chapter 6 investigates the question of where to use sound. It describes an informal analysis technique that can be applied to an interaction to find where hidden information may exist and where non-speech sound might be used to overcome the associated problems. Information may be hidden for reasons such as: It is not available in the interface, it is hard to get at, or there is too much information so it cannot all be taken in. When information is hidden errors can occur because the user may not know enough to operate the system effectively. Therefore, the way this thesis suggests adding sound is to make this information explicit. To do this, interactions are modelled in terms of events, status and modes. When this has been done the information is categorised in terms of the feedback needed to present it. Four dimensions of feedback are used: Demanding versus avoidable, action-dependent versus action-independent, transient versus sustained, and static versus dynamic. This categorisation provides a set of predictions about the type of auditory feedback needed to make the hidden information explicit. In the rest of the chapter detailed analyses of many interface widgets are presented. This analysis technique, together with the earcon guidelines, forms the structured method for integrating sound into user interfaces.
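As an illustration of the categorisation step, the sketch below encodes the four dimensions and a crude feedback suggestion. The dimension names come from the thesis; the data structure, the prediction rule and the button example are hypothetical simplifications:

```python
from dataclasses import dataclass

@dataclass
class FeedbackCategory:
    """The four feedback dimensions used to categorise a piece of hidden information."""
    demanding: bool         # demanding (must be noticed) vs. avoidable
    action_dependent: bool  # produced by a user action vs. action-independent
    transient: bool         # short-lived vs. sustained
    dynamic: bool           # changing over time vs. static

def suggest_feedback(c: FeedbackCategory) -> str:
    """A crude, illustrative rule turning a categorisation into a hint about the sound needed."""
    attention = "attention-grabbing" if c.demanding else "quiet, easily ignorable"
    length = "short" if c.transient else "continuous or repeating"
    return f"{attention}, {length} sound"

# Hypothetical example: the user slips off a button before releasing the mouse,
# so the press is lost. This is an event that is easy to miss on screen.
slip_off_button = FeedbackCategory(demanding=True, action_dependent=True,
                                   transient=True, dynamic=False)
print(suggest_feedback(slip_off_button))
```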

Chapter 7 demonstrates the structured method in action. Three sonically-enhanced widgets are designed and tested based on the method. The chapter discusses the problem of annoyance due to sound and some ways in which it may be avoided. The first experiment tests a sonically-enhanced scrollbar. The results show that sound decreases mental workload, reduces the time taken to recover from errors and reduces the overall time taken in one task. Subjects also prefer the new scrollbar to the standard one. Sonically-enhanced buttons are tested next. They are also strongly preferred by the subjects and they too reduce the time taken to recover from errors. Finally, sonically-enhanced windows are tested. Due to a problem with the experiment it is not possible to say whether they improve usability. In none of the three experiments did subjects find the sounds annoying. The structured method for adding sound is therefore shown to be effective.

Chapter 8 summarises the contributions of the thesis, discusses its limitations and suggests some areas for further work.

1.6.1 The thesis in terms of the research topics in auditory interface design

How does the work in this thesis fit into the research agenda described in section 1.1.1? The investigation of earcons in Chapters 4 and 5 falls into three of these areas. It investigates the use of non-speech sound: the experiments investigate the best types of sounds to use; the best timbres, pitches, rhythms, etc. The work deals with the mapping of information to sound and how hard these mappings are to learn. Finally, these chapters look at the structure of sounds: earcons are investigated to find out if listeners can extract and learn their structure.

Chapter 6 investigates mapping information to sound. The agenda suggests that a method for translating events and data into sound is needed, and this is what the research provides. It gives an analysis technique that models hidden information and from this produces rules for creating sounds. The chapter also investigates sound in relation to graphics, suggesting that sound and graphics can be combined to create a coherent system.

Chapter 7 again looks at the use of non-speech sound and particularly at the annoyance due to sound. It considers sound in relation to graphical feedback. Sounds are shown complementing and replacing graphics.

The thesis does not investigate system support for sound, although the research does show what types of sounds are necessary in an interface. This knowledge could be used when deciding what hardware and software are needed to support sound in a computer system. The research also does not investigate user manipulation of sounds.

The work undertaken for this thesis has been shown to address many of the major research issues that Buxton et al. suggest are important for the future of auditory interfaces. The answers gained from this thesis will extend knowledge of how sounds can be used at the interface.

