Level 3 Proposals

Building an Operational Computer-based Essay Marking Software
Building a MetaSearch Engine on the Web
Mining the Web for Cross-Language Information Retrieval
Developing a Conceptual Graph API Platform

Building an Operational Computer-based Essay Marking Software

Supervisors: Iadh Ounis (Dept of Computing Science) and Andrzej Huczynski (Dept of Business and Management)

Roughly speaking, a computer-based essay marker is a program that, given a text in a given language, returns feedback to the author explaining how the text can be improved. Such a system can help students practise and get quick feedback. It can also give instructors and lecturers valuable statistical data on how well the students are doing. The aim of this project is to build a computer-based essay marker that integrates with Strathclyde University's Web sites, so that it can be used by large classes of students.

This project is quite challenging as it brings together several computing science disciplines, including information retrieval (for the marking and feedback functionalities), information management systems (for the storage and statistical functionalities), interactive systems (for ergonomic and user interaction functionalities) and Web technologies (for use on the WWW).

Last year, a group of students built software in Java that provides a basic computer-based essay marker. Their approach was to hold all the information needed to mark the essays and provide the feedback in an Oracle database. Given an essay, the software calculates the grade, provides feedback and finally updates the database with the relevant information.

This work needs to be taken forward in a number of ways:

The marking process needs further work, to see whether a better solution can be found. More specifically, the system should ensure that the "fairest" possible mark is produced as output.
The feedback needs to be improved so that it is more intuitive to use.
The system interface needs to be completed and improved.
The statistical analysis needs to be completed and optimised.
The current system is only available as a Java application. Some work must be done in order to migrate the application and make it available for use on the Web.
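
To make the marking and feedback steps more concrete, here is a minimal Python sketch of one way an information-retrieval-style marker could work. It is not the method used by last year's system, which is not described here. It assumes that a model answer and a list of key terms are available for each essay question, scores an essay by the cosine similarity of its term counts with the model answer, and reports the key terms the essay does not mention. The names `term_vector`, `cosine` and `mark_essay` are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Count lower-cased alphabetic words in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mark_essay(essay, model_answer, key_terms):
    """Return (mark out of 100, sorted list of key terms missing from the essay).

    Assumption: the mark is simply the similarity to one model answer,
    and the feedback is the list of missing key terms.
    """
    essay_vec = term_vector(essay)
    mark = round(100 * cosine(essay_vec, term_vector(model_answer)))
    missing = sorted(t for t in key_terms if t.lower() not in essay_vec)
    return mark, missing
```

A call such as `mark_essay(essay_text, model_text, ["supply", "demand"])` returns the mark together with the missing key terms, which could then be stored and aggregated for the statistical analysis. A fairer marker would probably need several model answers and term weighting; this sketch only shows the overall shape of the computation.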

Building a MetaSearch Engine on the Web

Supervisor: Iadh Ounis

There are many search engines on the Web (e.g. AltaVista, Google, Lycos). However, the quality of their results is often unsatisfactory, and their coverage of the Web (the proportion of the Web that each engine has collected and indexed) is poor. One solution to this problem is to use a metasearch engine: a Web server that sends a given query to several search engines, collects their answers and presents them to the user. This gives better coverage of the Web while keeping the interaction simple.

The aim of this project is to develop something similar to Vivisimo, but the results should be provided to the users using information visualisation techniques (rather than the traditional sequential list of relevant documents). The final system should include the following functionalities:

Query one or more search engines
Parse their result pages to extract the documents (titles, URLs and short descriptions)
Group (or cluster) the documents into some dynamic categories based on the above information
Order the groups and the documents within each group
Present the results in a suitable graphical way
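
The merging, grouping and ordering steps listed above can be sketched in Python as follows. This is only an illustration under stated assumptions: querying the engines and parsing their result pages are left out, each parsed result is assumed to be a dictionary with the hypothetical keys `url`, `title` and `snippet`, documents are scored by summing reciprocal ranks across engines, and each document is placed in a group labelled by the non-query term it shares with the most other documents. A real system would use a proper clustering method.

```python
import re
from collections import Counter, defaultdict

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokens(text):
    """Lower-cased alphabetic words, minus stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def merge_results(result_lists):
    """Merge ranked result lists from several engines, deduplicating by URL.

    Each document's score is the sum of 1/rank over the engines that returned it.
    """
    merged = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            entry = merged.setdefault(doc["url"], {**doc, "score": 0.0})
            entry["score"] += 1.0 / rank
    return sorted(merged.values(), key=lambda d: (-d["score"], d["url"]))


def group_results(docs, query):
    """Group documents under the non-query term shared with the most documents.

    Returns a list of (label, documents) pairs, largest group first.
    Documents keep their merged order within each group.
    """
    query_terms = set(tokens(query))
    doc_terms = [set(tokens(d["title"] + " " + d["snippet"])) - query_terms
                 for d in docs]
    doc_freq = Counter(t for terms in doc_terms for t in terms)
    groups = defaultdict(list)
    for doc, terms in zip(docs, doc_terms):
        # Most widely shared term wins; ties are broken alphabetically.
        label = min(terms, key=lambda t: (-doc_freq[t], t)) if terms else "other"
        groups[label].append(doc)
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
```

For example, `group_results(merge_results([engine_a, engine_b]), "jaguar")` would return the groups in display order, ready to be handed to whatever visualisation is chosen for the final step.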

Mining the Web for Cross-Language Information Retrieval

Supervisor: Iadh Ounis

Although the majority of Web content is in English, the Web also shows great promise as a source of multilingual content. Such multilingual data can be useful for Cross-Language Information Retrieval (CLIR) on the Web. In addition to the classical information retrieval tasks, CLIR requires that the query (or the documents) be translated from one language to another. CLIR is becoming a hot topic in the Web community: it is estimated that by 2005, 78% of Internet users will be non-English speakers. One of the objectives of a CLIR system is to remove the language barrier.

In this project, we will investigate a new translation approach based on the use of parallel texts. One of the aims of the project is therefore to find parallel translated documents on the Web automatically. One possible idea is to develop an intelligent agent to "mine" sites where bilingual text is known to be available. The Web is a great source of translation examples: many sites are bilingual, mostly in English and one other language. Automatically extracting parallel text from the Web is an interesting Web application in its own right. Translational equivalences between words could then be detected automatically from the parallel documents obtained and used for CLIR.

The aim of this project is to develop a Cross-Language Information Retrieval system based on parallel texts. The final system will be built on top of the available SMART information retrieval system and should include the following specific functionalities:

Mine the Web for parallel translated documents
Use the HTML tags of the Web documents to "align" the collected documents into parallel blocks of text
Build a bilingual dictionary using the above parallel blocks and classical information retrieval techniques
Use the above bilingual dictionary for CLIR purposes
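
The last two steps above can be illustrated with a small Python sketch. It assumes the parallel blocks have already been mined and aligned, and that each block is a pair of short texts in the source and target languages. As a stand-in for the "classical information retrieval techniques", it scores word pairs with the Dice coefficient over block co-occurrences and keeps the best-scoring translation for each source word. The function names `build_dictionary` and `translate_query` are hypothetical.

```python
from collections import Counter
from itertools import product


def build_dictionary(aligned_blocks):
    """Build a source -> target word dictionary from aligned text blocks.

    aligned_blocks is a list of (source_text, target_text) pairs.
    Each word pair is scored with the Dice coefficient
    2 * cooccurrences / (source_blocks + target_blocks).
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_text, tgt_text in aligned_blocks:
        src_words = set(src_text.lower().split())
        tgt_words = set(tgt_text.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))

    best = {}  # source word -> (negated score, target word)
    for (src, tgt), together in pair_count.items():
        dice = 2 * together / (src_count[src] + tgt_count[tgt])
        candidate = (-dice, tgt)  # higher score first, then alphabetical
        if src not in best or candidate < best[src]:
            best[src] = candidate
    return {src: tgt for src, (_, tgt) in best.items()}


def translate_query(query, dictionary):
    """Translate a query word by word; unknown words are kept unchanged."""
    return [dictionary.get(word, word) for word in query.lower().split()]
```

The translated query terms could then be passed to the retrieval system as an ordinary query in the target language. With realistic data, the dictionary would need frequency thresholds and several candidate translations per word, since single-best co-occurrence scores are noisy.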

Developing a Conceptual Graph API Platform

Supervisor: Iadh Ounis

Note: Mainly Suitable for SE students

XML is a well-established universal standard, used mainly for exchanging information between different applications and data repositories. However, while XML is fine for the purpose for which its ancestor SGML was originally designed, i.e. specifying the formats of documents, it is currently no more than a kludgy language for everything else. Indeed, for specifying anything other than document formats (e.g. semantics, mathematics, or anything else that has a rich set of operators), the syntax of LISP or Conceptual Graphs is vastly superior.

Conceptual Graphs (CGs) are a popular and simple knowledge representation formalism developed by John Sowa in the 1980s. A conceptual graph is a bipartite graph with two kinds of nodes, called concepts and conceptual relations; each arc links a concept to a conceptual relation. CGs have a great deal more to offer than XML: they come with a methodology for building larger structures of contexts that can express natural language semantics, Petri nets, UML-like diagrams and many other kinds of information in a way that is (1) more readable for humans and (2) more efficient for certain kinds of graph-processing algorithms.
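
As a rough illustration of this bipartite structure, the following Python sketch represents a conceptual graph as a list of concept nodes and a list of relation nodes, where every relation links two concepts that belong to the same graph. It is a simplification written for this page, not the NOTIO API or CGIF: relations are restricted to two arguments, and the class and method names (`Concept`, `Relation`, `ConceptualGraph`, `linear`) are hypothetical. The `linear` method prints each relation in an arrow notation similar to the linear display form commonly used for CGs.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity equality: two [Cat] nodes are distinct nodes
class Concept:
    type: str
    referent: str = ""  # optional individual, e.g. a name

    def __str__(self):
        if self.referent:
            return f"[{self.type}: {self.referent}]"
        return f"[{self.type}]"


@dataclass(eq=False)
class Relation:
    type: str
    source: Concept
    target: Concept


@dataclass
class ConceptualGraph:
    concepts: list = field(default_factory=list)
    relations: list = field(default_factory=list)

    def add_concept(self, type, referent=""):
        concept = Concept(type, referent)
        self.concepts.append(concept)
        return concept

    def add_relation(self, type, source, target):
        """Add a relation node; its arcs may only attach to this graph's concepts."""
        for node in (source, target):
            if not isinstance(node, Concept) or node not in self.concepts:
                raise ValueError("relations can only link concepts of this graph")
        relation = Relation(type, source, target)
        self.relations.append(relation)
        return relation

    def linear(self):
        """One line per relation, in an arrow notation."""
        return [f"{r.source}->({r.type})->{r.target}" for r in self.relations]
```

For the sentence "a cat is on a mat", one would add the concepts `Cat` and `Mat`, link them with an `On` relation, and `linear()` would give a single line showing the cat concept, the relation and the mat concept in that order.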

CGs were developed as a conceptual schema language for information and knowledge interchange between IT systems that require a structured representation for logic. CGIF (Conceptual Graph Interchange Format) is currently being developed as an ISO standard for implementers of IT systems that use CGs as an internal representation or as an external representation for interchange with other IT systems. The external (graphical) representations are readable by humans and may also be used in communication between humans or between humans and machines.

The goal of this project is to develop an API specification that provides an implementation-independent interface for manipulating conceptual graphs, i.e. a set of tools written in Java that exchange CGs using the CGIF standard file format, including:

Support for the four standard conceptual graph operators
Support for graph matching
Support to allow multiple, distinct knowledge bases (applications) with distinct type hierarchies, etc., running under the same virtual machine
Reader/Writer-based parser(s) that can be used on files, network connections or any other form of stream, and that support internationalisation
A graphical interface allowing the use of the above developed tools

A good starting point will be the excellent free NOTIO API, developed at the University of Waterloo by Finnegan Southey and demonstrated last August at ICCS 2001.

Iadh Ounis

Last modified: Mon Oct 22 20:14:48 BST 2001