ESSIR
what are the subjects?

Contents

Introduction
Architecture
Models
IR & databases
Natural language processing
Multimedia information retrieval
User interfaces
Evaluation in information retrieval
IR & hypermedia
IR & wide area networks
Intelligent retrieval

Introduction

Keith van Rijsbergen

This introduction will enable participants to reach an understanding of the science and engineering underlying information retrieval research and development. It aims to set the scene for those who want to

These needs will be met by giving brief presentations about key areas in IR, which will be pursued in greater depth by the other sessions during the week.

The introductory session will address the following questions:


Architecture

Peter Willett

This Workshop will discuss a range of topics relating to the development and use of novel forms of information retrieval system that eschew the traditional Boolean searching model in favour of best-match approaches that not only provide a higher level of retrieval performance but that are noticeably easier to use than conventional systems. After a brief introduction to this new generation of information retrieval systems, my sessions will focus on several key aspects of their implementation, as follows:

Comparison of Boolean and best-match searching. This session will briefly introduce the Boolean model, on the assumption that its use will already be familiar to most of the participants on the course, and then highlight some of the characteristics of the approach that make it inherently inappropriate for large-scale text-searching by end-users. The best-match (or probabilistic or vector-processing) model will be introduced as an alternative approach to information retrieval that requires much less user effort but that is still able to retrieve large amounts of material that is relevant to the user's query. The relative merits of the two approaches will then be discussed and their use illustrated using operational systems on the World Wide Web. In closing, the session will discuss the extent to which best-match approaches have recently become available from public online database hosts, in addition to the research laboratories that have previously been the largest users of such systems.
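
The contrast between the two models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-document collection, the query, and the function names are invented here, and the best-match score is a plain count of shared terms rather than any particular probabilistic or vector weighting.

```python
# Toy collection and query: illustrative assumptions, not course material.
docs = {
    "d1": "boolean searching of text databases",
    "d2": "best match searching ranks text documents",
    "d3": "ranking documents for end users",
}


def tokens(text):
    """Split text into lower-case word tokens."""
    return text.lower().split()


def boolean_and(query, docs):
    """Return ids of documents containing every query term (an unranked set)."""
    terms = set(tokens(query))
    return sorted(d for d, text in docs.items() if terms <= set(tokens(text)))


def best_match(query, docs):
    """Rank documents by how many distinct query terms they contain."""
    terms = set(tokens(query))
    scores = {d: len(terms & set(tokens(text))) for d, text in docs.items()}
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(d, s) for d, s in ranked if s > 0]


query = "ranking text documents"
print(boolean_and(query, docs))  # no document contains all three terms
print(best_match(query, docs))   # partial matches are still returned, in rank order
```

With this query the Boolean AND search returns nothing, because no document contains all three terms, while the best-match search still returns every partially matching document in ranked order.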

File structures for information retrieval. This session will briefly introduce the use of the serial file and the signature file for organising files of text and then focus on the inverted file structure, which is the most important implementation method for searching large files using both the Boolean and the best-match approaches. After outlining its use for Boolean searching, the session will focus on efficient inverted-file implementations of best-match searching, and other applications of the inverted file for the calculation of similarities in information retrieval systems. Finally, an overview will be given of methods for the implementation of clustered files, in which groups of documents are searched, rather than individual documents as in conventional file organisations.
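
As a rough illustration of why the inverted file serves both search models, the sketch below builds an in-memory index and runs a Boolean AND and a best-match query against it. The document texts, function names, and the use of raw term frequency as the score are assumptions made for this example; real systems use compressed on-disk postings and more refined weights.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to its postings: {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def boolean_and(index, terms):
    """Intersect the postings lists of all query terms."""
    postings = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


def best_match(index, terms):
    """Accumulate scores only for documents found in some query term's postings."""
    accumulators = defaultdict(int)
    for t in terms:
        for doc_id, tf in index.get(t, {}).items():
            accumulators[doc_id] += tf
    return sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical three-document collection.
docs = {
    "d1": "inverted file inverted index",
    "d2": "signature file",
    "d3": "serial file scan",
}
index = build_inverted_file(docs)
print(boolean_and(index, ["inverted", "file"]))
print(best_match(index, ["inverted", "file"]))
```

The point of the accumulator loop is that documents sharing no term with the query are never touched, which is what makes best-match searching over an inverted file efficient compared with a serial scan.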

Conflation methods for information retrieval. Free-text information retrieval systems are effective in operation only if it is possible to map the particular word forms in a user's query to the forms that occur in the documents in a database. This is effected by means of a conflation procedure and this session will discuss the two main classes of conflation techniques: stemming algorithms and string-similarity measures. These provide an effective means of conflating morphological, and other, variants but are appropriate only when there are substantial numbers of characters and character substrings in common in the words that are being compared. When this is not so, conflation can be achieved by automatic thesaural methods that identify related words on the basis of term co-occurrence information, and the last part of this session will discuss the effectiveness of such methods.
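
The two classes of conflation technique can be sketched as follows. Both functions are simplifications chosen for this example: the suffix stripper is a toy with an invented suffix list, not a published stemming algorithm, and the Dice coefficient over character bigrams is assumed here as one common string-similarity measure.

```python
def strip_suffix(word, suffixes=("ing", "ed", "es", "s")):
    """Toy suffix stripper (an assumption for illustration, not a real stemmer)."""
    for suffix in suffixes:
        # Keep at least three characters of stem to avoid over-stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bigrams(word):
    """Return the set of adjacent character pairs in a word."""
    return {word[i : i + 2] for i in range(len(word) - 1)}


def dice(a, b):
    """Dice coefficient over the two words' character-bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


print([strip_suffix(w) for w in ("searching", "searched", "searches")])
print(dice("retrieval", "retrieve"))  # morphological variants share many bigrams
print(dice("car", "automobile"))      # synonyms with no substrings in common
```

The last call shows the limitation noted above: related words with no characters in common score zero, which is the case that co-occurrence-based thesaural methods are meant to handle.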


Models

Norbert Fuhr

A retrieval model specifies the syntax and semantics of document representations and queries. Following van Rijsbergen's approach of regarding IR as uncertain inference, we can distinguish different models according to the expressiveness of the underlying logic and the way uncertainty is handled.

Classical retrieval models are based on propositional logic. Boolean retrieval ignores uncertainty, whereas fuzzy retrieval uses fuzzy logic for this purpose, and probabilistic retrieval is based on probability theory. In the vector space model, documents and queries are represented as vectors in a vector space spanned by the index terms, and uncertainty is modelled by considering geometric similarity.
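
A minimal sketch of the vector space idea, assuming raw term counts as vector components and the cosine of the angle between vectors as the geometric similarity. Both choices, and the function names, are assumptions for illustration; the lecture does not commit to a particular weighting or similarity function.

```python
import math
from collections import Counter


def vector(text):
    """Represent a text as a sparse vector of term counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


query = vector("information retrieval")
print(cosine(query, vector("information retrieval")))  # same direction
print(cosine(query, vector("information systems")))    # one of two terms shared
print(cosine(query, vector("speech recognition")))     # no terms shared
```

Documents are then ranked by decreasing similarity to the query, so the degree of match replaces the yes/no decision of Boolean retrieval.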

For IR applications dealing not only with texts, but also with multimedia or factual data, propositional logic is not sufficient. Therefore, advanced IR models use restricted forms of predicate logic as their basis. Terminological logic has its roots in semantic networks and terminological languages such as KL-ONE. Datalog uses function-free Horn clauses. Probabilistic versions of both approaches are able to cope with the intrinsic uncertainty of IR.

Certain aspects of IR models can be expressed best in probabilistic models. Acquisition of uncertain knowledge is a parameter estimation problem in simple cases, but more sophisticated models require probabilistic classification or learning methods. For the combination of probabilistic evidence, Bayesian inference networks are considered.


IR & databases

Yves Chiaramella

Though sharing many paradigms such as the management of large amounts of persistent data, high interactivity, and concurrent access, the domains of Databases and Information Retrieval have developed in separate ways for many years. Why? During this lecture it will be interesting to analyse the reasons for this long period of separation; this will in turn help in understanding what was fundamentally different between the two approaches and their best-known underlying models.

But in the past five years the situation has drastically changed in both domains: classical database models had reached their limits while database research was pushed ahead by new data, new applications and users. The powerful database community has already provided several answers to these challenges. Information retrieval systems are faced with the same evolutionary problems. The availability of larger and larger amounts of complex, multimedia data has at least two obvious consequences: the increasing need for powerful IR techniques to help users in accessing new types of data on the one hand; and the increasing limitations of current IR models and systems to face these problems on the other. These are similar to problems that faced database systems some years ago. Many would say in fact that the situation is worse considering the obsolescence of IR systems.

Although still partial and not yet fully integrated, these new approaches in the database domain deserve closer analysis, both in themselves and from the IR point of view: what are they, and to what extent could they be used as a basis for the development of future IR systems? Are the two domains converging?

These are some interesting questions that we will try to answer in this lecture. After a review of previous experiences in the use of databases in IR, a particular emphasis will be put on multimedia databases, object oriented databases, and deductive databases.


Natural language processing

Alan Smeaton

Natural language processing and computational linguistics are about trying to engineer solutions to problems by using computational techniques to analyse and process natural language texts. Since the earliest days of NLP research in the 1960s, information retrieval has been cited as a potential application for NLP techniques, yet after 30 years of NLP research we do not currently have NLP techniques embedded within IR applications. Why is this so?

In this lecture we shall examine the nature of text and information. We shall look at words, terms and concepts and examine their inter-relationships. We shall look at the basics of computational NLP techniques covering the levels of morphology, syntax, semantics and pragmatics, and for each level we shall look at the problems that NLP research wrestles with. We shall then look at how NLP tools and techniques can be used in indexing and retrieval covering, for example, base forms of words, word senses, phrases and the problems with handling phrases.

We will then look at using NLP resources rather than NLP processes and see what is available in the way of thesauri, dictionaries, knowledge bases, etc. Subsequently, we shall try to "paint the landscape" of the interaction between NLP and IR, looking at who is doing what kind of work and what progress is being made. Finally, we shall have a discussion on the prospects that NL tools and techniques hold for information retrieval processes.


Multimedia information retrieval

Peter Schauble

In the near future, digital libraries and other large multimedia databases will be commonly available, and sophisticated search techniques will be required to find relevant information in these large digital data repositories. Because of the dramatically increasing amount of multimedia data that is made available to everybody, there is a growing need for new search techniques that provide not only fewer bits but also the right bits to the users. This course on Multimedia Information Retrieval is aimed at bridging the gap between classic ranking of text documents (usually bibliographic references) and modern Information Retrieval where composite multimedia documents are searched for relevant information. In particular, we want to pave the way for speech retrieval. The search for information in speech recordings has only very recently become feasible. Unfortunately, Information Retrieval researchers usually do not have the necessary background in speech recognition. We therefore give a short introduction to speech recognition for those Information Retrieval researchers who want to understand the various speech retrieval approaches.


User interfaces

Peter Ingwersen

The first third of the course introduces fundamental models for IR interface design and analysis, i.e. from the Monstrat to the Mediator Model. Knowledge types necessary to be represented in interface mechanisms are discussed and exemplified. The distinction between intelligent and supportive mechanisms will be emphasized. The second part will go into more detail, in particular concerning domain, system, user and request model building, and stress the importance of conceptual feedback during retrieval for assessment purposes by the end-user. Icon and graphically based interfaces will be demonstrated, e.g. the Bookhouse, Dynamic Homefinder, Personal Librarian, and Envision. The third part will focus on methodologies of functionality and performance assessments. Functionality assessments involve methods of recording and protocol analysis and stress the ergonomic phenomena taking place during actual use of an interface. Performance assessments involve measuring the retrieval power of the interface in connection with retrieval engines and end-user searching. Such methodologies and experimental settings are discussed, e.g. in relation to "relative, partial and differentiated" recall and the use of pre-defined or real-life information requests.


Evaluation in information retrieval

Steve Robertson

The field now known as "information retrieval" long predates computer-based systems, and there has been a tradition of evaluating information retrieval systems for over 35 years (again predating computer-based systems). In part, this tradition is embedded in an empirical, laboratory-based paradigm, in which evaluation is a process of measurement, requiring the definition of quantitative measures of performance. This tradition is now best represented by the TREC programme.
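
To make "quantitative measures of performance" concrete, the sketch below computes precision and recall for a single query. These two measures are assumed here as the standard illustrative pair, since the abstract does not name specific measures, and the document identifiers are invented.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical system output and relevance judgements for one query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

In a laboratory setting such figures are averaged over many queries with fixed relevance judgements, which is what makes results comparable across systems.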

At the same time, a diversification of this tradition has taken place, and attempts are made to observe information retrieval activities in real-life environments, perhaps in a more qualitative fashion. There is some tension between the laboratory and real-life evaluation ideas, but they have complementary functions. Studies outside IR, of for example information-seeking behaviour, can also tell us something about IR systems.

The object of this session will be to give participants an overview of the principles of evaluation in IR, some idea of the difficulties of conducting experiments, and some of the results of past studies.


IR & hypermedia

Maristella Agosti

Hypertext systems are now commonly used to handle collections of documents; this presentation tackles and illustrates the various problems connected with using current hypertext systems as powerful information retrieval tools.

The different aspects to address in designing and using hypertext and hypermedia for information retrieval are presented together with new methodologies for the automatic authoring and construction of hypermedia bases for information retrieval.


IR & wide area networks

Pier Giorgio Marchetti

The Network Issue
In order to properly address the issue of networked information retrieval, the evolution and development of the Internet and the basic elements of communicating either on the Internet or via public networks are reviewed. Topics - the history of Internet, Internet and OSI, TCP/IP and FTP, Networking via PSDN/ISDN, SLIP.

Information Retrieval on Internet
The fundamental experiences of networked information retrieval are reviewed, together with the more evolved Z39.50 standard and the ISO search and retrieve protocol. An overview of the domain of application of these standards is given. Topics - the WAIS experiment, WAIS searching, indexing and loading. The ANSI Z39.50 protocol, ISO protocol.

Browsing on Internet
The more visible elements of networked browsing are reviewed in perspective and future evolutions of networked information retrieval are addressed. The World Wide Web is analysed from its origin to current applications including publishing and access to any kind of database. Topics - the HTTP protocol, The HTML format, WWW, Internet GUIs for WWW, The Common Gateway Interface, Publishing on Internet.


Intelligent retrieval

Ulrich Thiel

The process of finding information in a large amount of stored data involves a variety of tasks, ranging from problem definition to relevance assessment, which, if performed by humans, require semantic information processing and reasoning. Intelligent Information Retrieval (IIR) systems are intended to support the user in these tasks, and, over the last decade, a growing number of approaches to employing automatic reasoning techniques in IR systems have been proposed. The applicability of rule-based reasoning, which was demonstrated by the first working expert systems, stimulated a lot of experiments in IR. Most of these efforts took the 'expert' system approach literally and devised 'automatic search intermediaries'. Exploiting the semantic and, to some limited extent, the pragmatic knowledge about a specific problem domain, these systems are intentionally restricted to functions that simulate a human intermediary, while the data are assumed to be stored in a traditional IR system. More advanced systems enable the user to directly explore parts of the semantic representation of the document contents. The integrated retrieval engine can be based on probabilistic reasoning supporting relevance feedback and clustering, or on a spreading activation mechanism, or can employ a form of plausible inference as suggested by van Rijsbergen.

The course will first introduce the basic concepts of automated reasoning and knowledge representation which are used in IIR systems. Then we will outline the methods employed in prototypical systems and discuss the underlying assumptions and prerequisites. Finally, we will show the potential benefits of IIR systems, as well as their limitations.




Mark Sanderson