Despite 50 years of development, information retrieval is still fundamentally carried out through Boolean logic with dumb command-driven interfaces. In this Basic Research Action we propose to develop a Retrieval Model for a new generation of IR systems handling every kind of media in an integrated manner through interaction with a user. The query language will allow interrogation using a mixture of expressions drawn from different media. The syntax of such expressions is to be given a semantics appropriate for the job of retrieval and constrained by the pragmatics and ergonomics of interaction.
Information retrieval is a domain mainly devoted to the processing of content-oriented queries for large textual data bases or reference data bases. The field has been investigated intensively since the fifties: research and implementation activities have been numerous and fruitful in this domain. Though it is not possible to mention every contributor in the field during such a long period, contributions from G. Salton (Cornell University), C.J. van Rijsbergen (Glasgow University) and K. Sparck Jones (University of Cambridge) were preeminent in the design of Retrieval Models. The Boolean model was the most successful from a practical point of view: it underlies most of the available Information Retrieval Systems (IRSs), such as STAIRS from IBM, MISTRAL from BULL and GOLEM-PASSAT from SIEMENS. Though able to manage very large reference data bases, these systems proved to have somewhat limited qualitative performance. It seems obvious that these current limitations are largely due to the naivety of the underlying Retrieval Model, and in particular to the primitiveness of the semantic models used for documents and queries, and to the lack of a suitable matching function able to manage more elaborate semantic structures.
Available Information Retrieval Systems have been driven by an old technology based on naive retrieval models and unfriendly query languages as interfaces. Though often criticized for its lack of effectiveness, this technology served us well until fairly recently, when large scale multimedia data sets emerged on the scene. This old technology is even less suited to coping effectively with these new types of objects; users are increasingly impatient with the straitjacket of Boolean logic.
Therefore the research community in Information Retrieval has developed during the past few years a whole range of new IR approaches which may be classified into two main trends: a classical one which is based on the enhancement of existing Retrieval Models (as an example, the use of non-strict Boolean operators in queries [Sal89]), and approaches based on the design of new Retrieval Models. The latter trend is most promising from our point of view and ranges from new formal models such as the vector space model, the probabilistic model, the logic model, to attempts to integrate technologies such as Natural Language Processing (NLP), Artificial Intelligence (AI) and Data Bases (DB) into the information retrieval process.
The use of NLP techniques seems mandatory to enhance the indexing language for textual documents. Attempts range from the consideration of noun phrases (instead of keywords) to using a deep understanding of complete sentences. The goal is usually to recognise more elaborate concepts from documents and thus to define more representative semantic models for documents and queries (a first step in making the semantics of textual documents explicit according to a model) [Sme90][CDKB86]. Natural Language Interfaces are also being investigated, which could allow user needs to be expressed in a more natural way than through Boolean expressions (or any other artificial language). This area of research is very active (e.g. Sparck Jones, van Rijsbergen, Chiaramella, Smeaton), and most researchers are aware that specific approaches have to be defined to cope with the particular needs and constraints in Information Retrieval: the necessity to restrict the linguistic tools to manageable problems given the problem of natural language ambiguities, and the necessity to cope with huge amounts of text while ensuring scalability of the retrieval techniques [SS91][Fag87][Ber90].
AI techniques seem extremely promising in the field because the design of semantic models for documents and queries may be based on many existing results derived from current research on logics, reasoning, problem solving and knowledge representation. Several studies have been launched in the last seven years, some related to theoretical investigations about the design of new Retrieval Models based on logics [MSST93][vR86b], the others being aimed at the integration of AI techniques in the design of new Information Retrieval Systems: the I3R project [CR87], the IOTA and RIME projects [CN90][CDKB86], the IRUS system [BB83], the RUBRIC system [TAAC86]. Many approaches integrate expert systems and specific knowledge bases, but none of these has yet been implemented in realistic scale applications. Though very interesting results have already been obtained, strong limitations still remain which are due to the lack of a retrieval model able to provide a common framework for all these approaches, each of which addresses one aspect of the problem.
The problem of managing complex and/or multimedia objects is a major topic in current database research. We have found in our recent experience that the basic features of conceptual database models can be fruitfully employed in modelling the structure of multimedia documents [MRT91] and, at the same time, that there is an increasing need for information retrieval capabilities within these new systems (in office information systems in general or in software reuse platforms in particular e.g. the SIB in project ITHACA [Tsi90]). But we may observe that Database Management Systems (DBMSs) and IRSs are now more than ever bound to cope with ever more common paradigms such as the management of large amounts of multimedia complex objects (simply called documents in the IR terminology, even though in the early days of IR the term ``document'' was used in a more restrictive sense), security, confidentiality, shared access, distributed environments.
This convergence is a fast and natural evolution in our opinion: allowing users to design and store large amounts of complex, multimedia objects induces the need to retrieve them according to specific, often complex, users' mental schemes. These schemes cannot be fully foreseen or constrained by a predefined data model as it is in current DBMS. In other words, we think that the more complex the stored information is in its structure and content, various in its nature (text, images, and so on), flexible in its interrelations (related texts, images), the more the underlying semantics of this complexity and variety has to be made accessible, or explicit, to the users in order to allow them to properly retrieve and exploit this information. This leads for example to precision-oriented IRSs which are the only ones suitable to fulfill the information needs of highly specialized users, such as engineers or physicians, who use complex data and whose queries are often extremely technical and precise. Hence without any description of their semantic content, multimedia documents may be perceived by the users as raw, complex, hardly accessible information.
Making semantics explicit and accessible for retrieval purposes is the basic problem of information retrieval, and the design of effective retrieval models able to cope with complex objects now constitutes an important research trend in Information Retrieval. Considered separately, each non-textual medium has its specific features, and is available as raw digitized signals which we are able to store, display (or play) and alter in controlled ways. These media present specific features that are:
The other dimension of multimedia information is related to structural aspects. Talking about multimedia data often implicitly refers to the ``association'' of various media, and this in turn corresponds to structures that combine information of different types of media. This concerns both classical properties of structured documents (instances of predefined types of data models), which correspond to intra-document links, and more advanced aspects of hypertext and hypermedia, which correspond to inter-document links. Making explicit the semantics of these links will provide additional semantic information for improving the retrieval of composite multimedia objects.
The design of a new generation of Information Retrieval Systems able to manage this kind of data implies fundamental research on Retrieval Models providing a much needed common framework, which has unfortunately been missing in the past.
We are also prepared to devote effort to prototypically implement these new models within the scope of this Action and will relate our work to existing research projects carried out by the various partners of this proposal. Taken as a whole the range of tools for improving IRSs has developed in a rather uncoordinated way and it is still difficult to see how they would fit together. In other words a common framework has been missing. Many studies are concerned with monomedia data such as full text or images, the latter being the most recent to be investigated; but studies involving multimedia in IR are few and far between.
No IRS can be effectively designed without the previous definition of a suitable model which guides its fundamental design and specifies a framework for an integrated approach to various aspects of storage and retrieval leading to an efficient implementation. There are three main components to any Retrieval Model. Firstly there is a model for documents which provides the representational primitives for describing the structure and the content of each document. Secondly there is a model for users' queries which shares the same document representation as the document model and specifies the interface for issuing queries. Thirdly there is a matching function between document descriptions, and users' needs expressed through queries, which underlies the retrieval process: by using document and query models it estimates the system relevance for any document compared to the query.
In order to reduce silence (lack of recall) or noise (lack of precision) in system answers, contrary to the classical database approach, the matching of content descriptions cannot be exact. Hence the qualitative performance of Information Retrieval Systems is based on the concept of document relevance which measures the adequacy of the set of retrieved documents in response to a user's need. The way to define the computation of relevance and how to implement it in an efficient way on a large scale is the core problem of information retrieval.
Among the retrieval models, the Boolean model has been the most successful in practice. According to this model every document may be considered as a conjunction of attribute-value pairs (the external attributes) and of keywords which constitute the representation of the semantic content. Keywords are selected either manually or automatically during an Indexing Phase. Queries may be viewed as Boolean expressions of attribute-value pairs and keywords, using the classical AND, OR and EXCEPT logical operators. If documents and queries are considered as predicates, retrieved documents will be those which logically imply the query (as measured by a simple matching function). One limitation of this model arises from the inherent simplicity of the representation used for the semantic content of documents and queries: keywords by themselves cannot express elaborate concepts. A further limitation derives from the use of Boolean (two-valued) logic itself, which rejects equally a completely unmatched and a partially matching document and, conversely, retrieves equally an exactly matching and a partially matching document. In other words, this model does not provide any estimation of system relevance and acts only on a strict rejection/acceptance criterion. Various attempts have been made to overcome these limitations through, for example, the introduction of proximity operators or jokers (wildcards) to improve the expression of concepts, and through ``relaxed'' Boolean operators to smooth the rigidity of pure Boolean logic.
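As a minimal illustration of the strict acceptance/rejection behaviour described above, the following sketch evaluates a Boolean query over sets of keywords. The three-document collection, the identifiers d1 to d3 and the helper names are hypothetical and serve only as an example; external attribute-value pairs are omitted for brevity.

```python
from typing import Dict, Set

# Hypothetical toy collection: each document is reduced to its set of keywords.
DOCS: Dict[str, Set[str]] = {
    "d1": {"retrieval", "boolean", "logic"},
    "d2": {"retrieval", "multimedia", "image"},
    "d3": {"database", "multimedia"},
}


def posting(term: str) -> Set[str]:
    """Return the ids of all documents indexed by the given keyword."""
    return {doc_id for doc_id, keywords in DOCS.items() if term in keywords}


def AND(a: Set[str], b: Set[str]) -> Set[str]:
    """Documents satisfying both operands."""
    return a & b


def OR(a: Set[str], b: Set[str]) -> Set[str]:
    """Documents satisfying at least one operand."""
    return a | b


def EXCEPT(a: Set[str], b: Set[str]) -> Set[str]:
    """Documents satisfying the first operand but not the second."""
    return a - b


# Query: retrieval AND (multimedia OR logic) EXCEPT image
result = EXCEPT(
    AND(posting("retrieval"), OR(posting("multimedia"), posting("logic"))),
    posting("image"),
)
print(sorted(result))
```

The answer is an unordered set: a document either satisfies the query or it does not, and no degree of partial match is available, which is precisely the limitation noted above.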
The vector space model proposed by Salton [SM83] is much more elaborate. It proposes to consider both documents and queries as vectors in a space whose dimension is the cardinality of the indexing language (based on keywords), so that the coordinates of each vector in this space are the weights assigned to each of these terms for a given document or query. The matching function is a comparison between a document vector and a query vector based, for example, on their inner product. This model was among the first to introduce the computation of system relevance, whereby documents are retrieved in order of estimated relevance, from the best to the worst, and thus it is not subject to the strict rejection/acceptance process. But it also has its own intrinsic limitations, the most important of which concern the indexing language itself (the keywords with all their limitations) and the notion of indexing space, which presupposes independent dimensions for a proper evaluation of the matching function.
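The matching function of the vector space model can be sketched as follows, assuming sparse term-weight vectors and the inner product as the similarity measure. The weights, document identifiers and function names are illustrative assumptions, not values taken from any of the cited systems.

```python
from typing import Dict, List, Tuple

# A sparse vector: index term -> weight. Absent terms have weight 0.
Vector = Dict[str, float]


def inner_product(doc: Vector, query: Vector) -> float:
    """Sum of products of the weights of the terms shared by document and query."""
    return sum(weight * doc.get(term, 0.0) for term, weight in query.items())


def rank(docs: Dict[str, Vector], query: Vector) -> List[Tuple[str, float]]:
    """Return (document id, score) pairs ordered from best to worst match."""
    scored = [(doc_id, inner_product(vec, query)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weighted index and query.
DOC_VECTORS: Dict[str, Vector] = {
    "d1": {"retrieval": 0.8, "logic": 0.5},
    "d2": {"retrieval": 0.3, "image": 0.9},
    "d3": {"database": 0.7},
}
QUERY: Vector = {"retrieval": 1.0, "image": 0.5}

ranking = rank(DOC_VECTORS, QUERY)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Unlike the Boolean case, every document receives a score and a partially matching document is ranked rather than rejected. The sketch also makes the independence assumption visible: each term contributes to the score separately, with no account taken of relations between terms.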
The probabilistic model, originally proposed by Robertson and Sparck Jones [RSJ76] and van Rijsbergen [vR79][vR77], and based on the early work of Maron and Kuhns [MK60], comes closest to defining retrieval as inference, and is therefore akin to the new model proposed in this project. It computes the relationship between a document and a query by including estimates of the likelihood that a shared term indicates relevance. The emphasis is on finding out how index terms discriminate between relevant and non-relevant documents. The underlying retrieval process is adaptive and iterative: it learns about relevance through a process of belief revision, namely Bayesian conditionalisation, thereby computing the probability of relevance of each document. Recently this form of conditionalisation has been generalised by van Rijsbergen [vR86b] to handle conditional information in general. Nevertheless the underlying logic is not yet defined, nor has this form of adaptive retrieval been applied to documents and queries represented by complex semantic structures.
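One way to make the idea of term discrimination concrete is the relevance weight commonly associated with [RSJ76], shown here with the usual 0.5 correction for small samples. This is a sketch under stated assumptions rather than the exact formulation intended by the proposal: the function names, the collection statistics and the example terms are all hypothetical.

```python
import math
from typing import Dict, Iterable, Set, Tuple


def rsj_weight(N: int, n: int, R: int, r: int) -> float:
    """Relevance weight of a term, with the 0.5 correction.

    N: number of documents in the collection
    n: number of documents containing the term
    R: number of documents judged relevant so far
    r: number of judged-relevant documents containing the term
    """
    return math.log(
        ((r + 0.5) * (N - n - R + r + 0.5)) / ((n - r + 0.5) * (R - r + 0.5))
    )


def score(doc_terms: Set[str], query_terms: Iterable[str],
          stats: Dict[str, Tuple[int, int]], N: int, R: int) -> float:
    """Sum the weights of the query terms that occur in the document."""
    return sum(
        rsj_weight(N, stats[t][0], R, stats[t][1])
        for t in query_terms
        if t in doc_terms and t in stats
    )


# Hypothetical statistics after one round of relevance feedback:
# term -> (n, r), in a collection of N = 100 documents with R = 5 judged relevant.
STATS = {"multimedia": (10, 4), "system": (50, 0)}

good = rsj_weight(100, 10, 5, 4)   # rare term, frequent among relevant documents
poor = rsj_weight(100, 50, 5, 0)   # common term, absent from relevant documents
print(good > 0, poor < 0)
```

Each new relevance judgement changes R and r, so the weights, and hence the ranking, are revised from one iteration to the next; this corresponds to the adaptive character of the process described above.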
All these models have been primarily designed for the processing of bibliographic or reference databases, which store fairly simple kinds of objects compared to the ones considered by this project. Some have been applied recently to full text databases but without taking into account their structure and possible interrelationships. Basically they remain applications of old technology to new kinds of data the processing of which they were not designed for. The extensive work in AI, NLP and DB approaches for improving information retrieval systems, mentioned before, has not yet been related to formal modelling. They are informal tools which all participate in the modelling of document and query contents, and in the proof that a document matches a query or users' needs. Typical examples of these attempts led to prototype systems such as I3R [CR87] and IOTA [CD87]. Except for the RIME project [Nie90][Ber90], no attempts have been made to fully integrate all these heterogeneous aspects within a common model.
Over the last few years a new model for Information Retrieval has emerged based on formal logic and probability theory [vR86a]. The basic idea underlying this model is that for a document to be relevant to a given query, this document has to logically imply this query. The preliminary results from this recent research show that it is indeed possible to define a formal model within which the different approaches to IR can be expressed thus proving its generality compared to existing models [CN90]. This framework has as its basis the expression of objects and their entailment relations within a non-classical logic. The entailment is assumed to be uncertain and thus is measured in terms of probability theory. It is important to realise that this formulation is at a high level of abstraction: it allows the specification of different logics with their appropriate syntax and semantics, and allows for the measurement of uncertainty in different ways (although the preferred way is through probability theory).
There are still no complete available results in this area, but very promising approaches have been recently investigated by the members of the FERMI consortium [MSST93][SVR90][CN90].
In the last few years, IR techniques have been extended to deal with multimedia information, e.g. image, graphics, voice, video, etc. It is interesting and also important to realise that IR has this universality of application. Information retrieval systems like Rivage [Hal89], Kato's system [KK90] and Hirabayashi's system [HMK88] use IR approaches for retrieving images, but they rely on manual indexing of these documents. Other available systems are text retrieval systems that allow multimedia data, such as graphics and images, to be stored and retrieved by the system, but the information these data contain is not used in searching because it has not been indexed. Some of these systems allow images to be queried on the basis of manually inserted textual descriptors (captions) of the image content. This is due to the fact that these systems store multimedia information components as uninterpreted strings, making it difficult to capture their structure, properties and interrelationships.
A new research area is, thus, emerging, the Multimedia Information Retrieval (MIR) area, which is concerned with the development of models and techniques enabling the retrieval of information on multimedia documents. The major question currently being addressed in this area is whether significantly better retrieval effectiveness and efficiency can be obtained following a new approach to modelling retrieval.
An appropriate experimental methodology has to facilitate the statistically significant estimation of the effectiveness of the representation formalism developed. Since it is usually impossible to determine absolute quantities for each evaluation factor, relative quantities are determined instead, e.g. the effectiveness of a retrieval method compared with the effectiveness of another retrieval method.
The Information Retrieval research community has a strong tradition of evaluation by means of a well developed experimental methodology. This tradition goes back to the Cranfield experiments in the sixties [SJ81][CK66] and the SMART experiments in the early seventies [Sal72][SL71]. Numerous measures for retrieval effectiveness have been developed by Information Retrieval researchers. A more systematic foundation of these measures is given by the approach described in [CL70], which states that every measure is connected with a preference relation (or viewpoint) on retrieval outputs.
The most popular representations of retrieval effectiveness are based on recall and precision. The standard recall and precision graphs have several severe drawbacks [RBJ91][FMS91]. For overcoming these problems, these measures have to be combined with tests of significance. An alternative method is provided by the usefulness measure presented in [FS91] which compares two retrieval outputs, whereas other measures relate a retrieval output to the ideal one. As a major advantage, the usefulness measure comes with an error probability that expresses how reliable the derived usefulness value is.
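For reference, the two basic measures can be computed as in the following sketch, which also produces the points of a recall-precision graph from a ranked output. The ranking, the set of relevant documents and the function names are hypothetical; the usefulness measure of [FS91] and the significance tests mentioned above are not reproduced here.

```python
from typing import List, Sequence, Set, Tuple


def precision_recall(retrieved: Sequence[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that are retrieved."""
    hits = sum(1 for doc_id in retrieved if doc_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def precision_recall_points(ranking: Sequence[str], relevant: Set[str]) -> List[Tuple[float, float]]:
    """One (precision, recall) point per cutoff rank, as plotted in a recall-precision graph."""
    return [precision_recall(ranking[:k], relevant) for k in range(1, len(ranking) + 1)]


# Hypothetical ranked output and relevance judgements.
RANKING = ["d3", "d1", "d7", "d2"]
RELEVANT = {"d1", "d2", "d9"}

points = precision_recall_points(RANKING, RELEVANT)
for k, (p, r) in enumerate(points, start=1):
    print(k, round(p, 3), round(r, 3))
```

In the terminology used earlier in this section, low recall corresponds to silence and low precision to noise. A single query yields only a handful of points, which is one reason why comparisons between retrieval methods need to be averaged over many queries and tested for significance.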
Retrieval systems are usually evaluated by means of static test collections which have a rather moderate coverage [Fox83][CK66]. The construction of such test collections is a laborious process even when they contain relatively few documents, i.e. between 1,000 and 10,000 documents. The documents of these collections are extremely short (20 - 50 terms per document) and they are poorly structured (only title, author(s), abstract, and keywords). Hence, retrieval algorithms cannot be evaluated on real data collections at the present time.
The US project TIPSTER is a major project which is aimed at the evaluation of Information Retrieval techniques with very large test collections containing English and Japanese text documents. The English material is also used within the TREC (Text REtrieval Conference) initiative, where several research groups apply their text retrieval methods to this material [Har93].
Collections of multimedia documents often contain a variety of links between different documents. This contrasts with typical text retrieval applications, where documents are regarded as being independent of each other. This feature affects the evaluation of multimedia IR systems, since the standard evaluation criteria assume independent documents. So far, IR systems with interrelated documents can only be evaluated by regarding the interactive usage of the system, e.g. by measuring the time for solving a specific task [Les91]. An attempt to set up a list of evaluation criteria for interactive IR systems is described in [TS89]. A major disadvantage of this approach is that only the overall system performance can be measured, but it is difficult to estimate the contribution of specific system components (e.g. user interface design, representation of documents, retrieval model) to this performance.
The IR scientific community has gradually become aware that a significant breakthrough in the performance of IR systems can be achieved only by allowing these systems to handle representations of documents that capture more of the documents' content, and by embodying the notion of likely relevance in their retrieval capability. In other words, an IR system can perform significantly better than the current ones only if its underlying retrieval model provides a sophisticated representation of the documents' content, and an inferential process which is able to distinguish sharply the documents that are likely to be relevant to the users' requests from those that are not.
There are therefore three issues that must be addressed:
There is no doubt that a logic is the best candidate for this framework, in that:
This explains why the main objective of the project is the development of the MIR logic, and at the same time justifies the requirements that the MIR logic must satisfy. It is important to recognize that any significant advance in the modelling of multimedia information retrieval can be obtained only through a logic, and, therefore, that any significant advance in the performance of multimedia information retrieval systems can be obtained only via the adoption of a MIR logic as the formal foundation of the system operation.
The achievements of this approach will be demonstrated by means of an experimental evaluation. In order to evaluate MIR, two research issues have to be addressed:
As already highlighted in Section 2.2.2.3, all the members of the FERMI consortium have been involved in central research activities in the fields of Information Retrieval and Multimedia Information Systems, and have contributed significantly to the advance of these areas in terms of research papers, prototype implementations, organisation of international conferences and training activities, and editorial work in international journals. The FERMI consortium has the scientific and technical competence which is the basis for the success of its proposal. The proposed project is thus set in a scientific and technical context which is based on the previous research activities and achievements of the partners. In the following paragraphs we summarize the contribution of each member of the FERMI consortium to the state of the art, the knowledge gained through participation in previously funded ESPRIT projects and actions as well as other national or international initiatives, and the involvement in other proposals in the current framework.
The ``Istituto di Elaborazione della Informazione del Consiglio Nazionale delle Ricerche'' (CNR-IEI) team has contributed to the state of the art in the fields of the conceptual modelling of multimedia documents [MRT91], image indexing and retrieval [RS91b][RS91a] and logic-based information retrieval [MSST93].
The CNR-IEI team has gained useful experience from the results of related ESPRIT research projects. In the MMFS (Mixed Mode File Server, 1983-1984) ESPRIT pilot project and the MULTOS (Multimedia Office Server, 1985-1990) ESPRIT project, experience has been obtained in multimedia document modelling, mixed data and text query processing, and multimedia data (text, images) access methods. In the context of the ESPRIT BRA MIRO working group (No. 6576), preliminary studies have been conducted on the use of terminological logic for modelling information retrieval.
The same CNR-IEI research department is involved in another ESPRIT Basic Research Proposal in the same Priority Theme 6.4, i.e. HERMES (Foundations of High Performance Multimedia Information Management Systems). However, it is important to stress the diverse and complementary nature of the proposed contributions of the CNR-IEI team: in HERMES its contribution is mainly focussed on performance modelling and performance related research, while in FERMI it is focussed on the development of a model for multimedia information retrieval based on terminological logic.
The research group at the Computer Science Department of Glasgow University (GU-CS) has a long history in Information Retrieval research. For example, the first complete version of the Probabilistic Model was formulated in the seventies by members of the Glasgow team [vR79][vR77]. In the last few years IR research at Glasgow has concentrated on constructing a formal framework for specifying new IR models, thus opening the most promising research line in information retrieval [vR92][vR91][vR86a].
GU-CS has been involved in the ESPRIT funded projects Comandos (project 834), Comandos 2 (project 2071), FIDE (project 3070), FIDE 2 (project 6309), IMIS (project 6548), KWICK (project 2466), Semantique (project 3124), and SHAPE (project 5398). Glasgow is also involved in the MIRO (project 6576) and Semantique II (project 6809) working groups and the IDOMENEUS Network of Excellence (NoE 6606).
GU-CS is involved in proposals in the current round for three other BRAs, namely FADIVA (under Task 6.3), MIME (Task 6.3), and MIX (Task 6.4). However, it should be noted that the FERMI and the MIX proposals, although addressing the same Task, are entirely complementary: the former is aimed at the design of logic-based retrieval models whereas the latter is aimed at indexing multimedia objects. Since retrieval models are based on the features extracted by the indexing process, the topic of MIX and the topic of FERMI are disjoint and complement each other. If both proposals are accepted, cooperation will take place within the framework of the ongoing MIRO Working Group and of the IDOMENEUS Network of Excellence. Most of the participants of the two proposals are already participating in MIRO or in IDOMENEUS.
The ``Laboratoire de Genie Informatique'' (LGI-IMAG) of Université Joseph Fourier (Grenoble) has made substantial contributions to the domain of automatic indexing of textual information [Ber90] and to the development of logic-based retrieval models. A model based on fuzzy modal logic has been developed [CC92][CN90] and experimented with in the framework of projects that address the retrieval of multimedia medical information (RIME), and the retrieval of complex objects in software databases (ELEN).
LGI-IMAG has been a subcontractor of BULL within the framework of ESPRIT I project DOEOIS (No 231). Its contribution to this project was the design and implementation of signature techniques for indexing and retrieving textual information in the context of the highly dynamic corpora of Office Information Systems. Both hardware and software tools were developed to implement and test signature filtering algorithms and methods. LGI-IMAG is involved in the ESPRIT Working Group MIRO (WG No. 6576) and in the ESPRIT Network of Excellence IDOMENEUS (NoE No. 6606). Most of the current activities of LGI-IMAG in the domain are supported by PRC BD3 (Programme de Recherche Concerte "Bases de donnees de troisieme generation") funded by the French Ministry of Research and Technology.
Besides the FERMI proposal, LGI-IMAG is involved in the MIX proposal, addressing the same Task (6.4) of the present call. The complementarity of these two proposals has already been explained above.
The research group at the Computer Science Department of the University of Dortmund (UNIDO-CS) has obtained important results in the fields of probabilistic retrieval [Fuh92][FB91][FB89] and the integration of database and IR systems [Fuh90]. Currently, the group is participating in the TREC (Text Retrieval Conference) initiative, where indexing and retrieval with free text terms is performed for a large database with 2 GB of text.
Three members of the FERMI consortium, namely the CNR-IEI, GU-CS and LGI-IMAG teams are partners in the MIRO Working Group (n. 6576) funded by the ESPRIT BRA Programme. The MIRO Working Group is a useful forum for discussing the research lines of the single teams and exchanging the results individually obtained by the teams. However, it does not allow the effective cooperation of the teams in the pursuit of a common goal.
On the other hand, the development of a theory of multimedia information retrieval requires the exploration of several, in some cases orthogonal, research dimensions, and the integration of results of different nature into a unified, coherent formal system. No single research group possesses the different, sometimes complementary knowledge, experience and research skills required to carry out this development; the FERMI consortium, whose members are recognised experts in their own fields, does. Therefore it is felt that a project carried out by this consortium is the most appropriate context in which to pursue the development of the multimedia information retrieval theory.