Formalising TOPS cartoons,
the development of an interactive similarity search mechanism over TOPS databases and
a TOPS-based structure comparison method

David Gilbert

City University, and EPSRC Visiting Research Fellow at the EBI

David Westhead

Leeds University, and previously at the EBI

Aims and objectives

The aim of the research is to formalise protein topology TOPS cartoons and to design a system to enable interactive similarity searches to be made over the existing TOPS database at EBI.

TOPS cartoons are an optimal projection 3-dimensional structure of proteins in two dimensions, and there is a program which generates cartoons from PDB entries (Flores et al [FMJ94]). Over the last year the original program has been extensively modified by David Westhead at the EBI to improve its robustness and reliability over a much larger set of protein structures. The new program has been used to generate an atlas (database) of TOPS diagrams which are available over the WWW (http://www3.ebi.ac.uk/tops/). In addition a Java program has been produced which permits display and editing of the cartoons.

Our detailed aims include the design and implementation of:

Our proposed research falls at the `abstract' end of the spectrum of methods for the definition, representation and comparison of protein structural topology (for more details, see ``Background'' below). At the purely geometrical end are methods for representing proteins by backbone coordinates; in the centre are methods for representing protein by secondary structure elements which are vectors in 3d space (see [MARW89, GARW93] for example. Approaches related to ours include those by Rawlings et al [RTN+85, CSR91, CRS+93a, CRS+93b] Selbig and Koch [KKS92] or the deductive databases of Tsukamoto [TTS+97]. However, in contrast to the work by Rawlings, Clark et al where constraint logic programming was used in an attempt to predict protein topologies from raw protein database entries, our research aims to formalise and search over an existing topology database which has been created by the TOPS program [FMJ94].

Background

Even when as few as thirty protein structures were known it was apparent that many of them shared some main features while differing in less important geometrical details (see for example [Ric77]), opening the possibility of a topological description of the structures. Such structures can be viewed as comprising a protein backbone to which side chains are attached. The backbone can be divided into two different types of structural region: (i) the segments of backbone constituting secondary structure elements (alpha helices and beta sheets), and (ii) turn and irregular loop regions connecting the secondary structure elements and making up the N and C terminal tails. A sequential order and relative spatial position can be given to the secondary structure elements; in addition the beta strands have well defined hydrogen bond relationships grouping them into sheet and barrel structures and conferring a well defined relative direction. It has been observed that many protein structures shared secondary structure elements with the same sequential order, relative position, direction and hydrogen bond relationships, while differing in many other structural details such as side chain conformations, lengths and conformations of the connecting loop regions, and even the lengths of the secondary structure elements themselves. These observations have pointed the way to a topological description of protein structures including the conserved details pertaining to the secondary structure elements and shared by many structures differing in less important geometrical details.

Over the years many authors have studied protein structural topology and a number of definitions, notations and representations have emerged. Some of these representations are purely diagrammatical, and are designed to help the human observer understand the topology and topological relationships of protein folds, while others are more formal and mathematical, allowing greater possibility for automated comparison, and hence database searches of various types, and more rigorous attempts at enumeration of topological possibilities and ultimately topology prediction.

The simplest representation of protein topology is diagrammatical, and this is really no more than a schematic diagram of a protein fold illustrating secondary structure elements and their topological relationships. However, it is no less useful for its simplicity, and it permits the human observer to understand and compare protein folds much more quickly and easily than would be possible using full 3D representations of structure. Early work related to structural topology considered how it might be represented in hand drawn diagrammatic form [Lev76, ST77, Nag77]. These diagrams are two-dimensional schematic representations of protein folds in which particular symbols were used to represent helices and strands and their sequential connections. Although the diagrams did not imply any formal definition of protein topology and were not generated automatically, they were found to be a very valuable aid to understanding protein folds and their similarities and differences at the topological level. In particular the diagrams were shown to be good for representing topologies within beta structures and chiralities of certain sub-structures like the beta-alpha-beta unit. Automated generation of the diagrams was first attempted by Flores [FMJ94].

It is desirable to be able to compare topologies automatically in silico as well as by manual inspection. This is the necessary condition to be able to perform (sub-)topology database searches which enable the retrieval of topologically related structures. Such comparison of purely diagrammatical representation is difficult since the diagrams are not unique - for each structure there is more than one possible diagram - and all that would be possible in a hypothetical database of topological diagrams would be a "fuzzy" search matching diagrams according to a similarity criterion known to reflect the way in which diagrams of identical structures may vary. This has, to the knowledge of the author, never been attempted.

This problem lead to the need for the more formal representations of protein topology which have appeared in the literature. These vary from representation as mathematical graphs [GMR94], to string representations [KKS92, Flo95], and representations as rules and facts in deductive databases [TTS+97] and languages such as PROLOG [RTN+85] and constraint logic programming [CSR91, CRS+93a, CRS+93b]. These have all been used in searching applications to some extent, usually the application being the extraction of all examples of particular sub-topologies in the database. More often than not the Greek key sub-topology (super-secondary structure) has been taken as an example. While some methods do allow consideration of helical secondary structures, most do not, and even when a method could use helices it is rare to find real examples.

In the recent work of Koch and co-workers [KLE96, KL97] helices are included and a number of examples are presented. The work is aimed a maximal common sub-topology and thus at global structural comparison rather than sub-topology search, but modification to do the latter should be straightforward. A question to be addressed is whether a topological search really detects weaker structural similarities and are these significant. Many folds which are really quite unrelated topologically might end up equivalent in a method which only considers neighbour contact relationships, making no distinction between contacts involving helices and the much more well defined hydrogen bond relationships between strands, and ignores sequential connections. All these potential problems could be addressed within a more sophisticated version of the same basic graph theoretical method. The work could equally be compared with work like that done by Willett and co-workers [MARW89] Grindley et al [GARW93] and Artymuik et al [AGP+94b], which is similar in that graph theoretical methods are used, but differs in being carried out at the geometrical level using spatial vectors to represent secondary structure elements.

TOPS cartoons

Anyone who has viewed the full atomic representation of a protein structure, either on a computer screen or on paper, will agree that such diagrams can be complicated and difficult to interpret. There are a number of simplified representations which make interpretation easier. For instance, many programs for display of protein structures produce formats in which only the protein backbone is displayed, and even simpler formats where secondary structures are represented by arrows and cylinders (eg. Molscript [Kra91]). An even more simplified representation is TOPS cartoons which are two dimensional schematic diagrams of protein structures. Alpha helices are represented by circular symbols and beta strands by triangular ones. The lines connecting the symbols represent loops. The symbols are assigned a direction which is either up or down depending on the relative orientation of the secondary structure elements in the protein fold. The direction of a symbol can be deduced from its connecting loops: if a loop connects to the base of a secondary structure element then the connection is drawn to the boundary of the symbol, if it connects to the top the connection is drawn to the centre of the symbol. This information is duplicated in the case of beta strands where those with the direction up are drawn as up triangles and those with the direction down are drawn as down triangles. An example TOPS cartoon and associated Molscript diagram for CU/ZN superoxidide dismutase (1jcv) is shown in Figure 1.

 figure54
Figure 1: Molscript (left) and TOPS (right) diagrams for 1jcv 

At present the database of TOPS cartoons covers more than 2000 protein chains. It has an interface which only permits searches to be made for the structures of specific proteins, given their PDB (Protein Data Bank) codes and chains, for example 8fab01 . Users are returned some information on the chain:
INFORMATION FOR PROTEIN 8ABP
BINDING PROTEINS
L-*ARABINOSE-BINDING PROTEIN (MUTANT WITH MET 108 REPLACED BY LEU) (/M108L$) COMPLEX WITH D-*GALACTOSE

and can view the topology cartoon (see Figure 2).

 figure62
Figure 2: TOPS cartoon of 8abp 

TOPS diagrams and patterns

A TOPS diagram describes more information about a structure than the corresponding cartoon.  In addition to the Secondary Structure Elements  - their type,  position on the backbone, and orientation relative to the plane of the page, the diagram describes the H-bonds (parallel or anti-parallel) between strands, and the chiralities (right or left) between some of the SSEs.   For example, the TOPS diagram for 2bop is shown below; it comprises five strands, which form a bifurcated anti-parallel sheet, and three helices. There are two right-handed chirality connections, between strands 1 and 4, and between strands 6 and 8 respectively.

TOPS diagram for 2bop

TOPS diagram for 2bop

A TOPS pattern is like a TOPS diagram, and may describe several (none or more) TOPS diagrams. This is achieved by permitting insertions of some SSEs into the pattern to obtain a diagram. Each backbone connection between every pair of SSEs is annotated with a pair of integers standing for the minimum and maximum number of SSEs which can be inserted. The values of min and max can range from 0 to a large number (in practice 60) which we denote N.   For example, a plait pattern (or motif), which matches amongst others 2bop, is illustrated below:

Plait pattern

Plait pattern

This pattern describes, amongst others, 2bop. In order to illustrate this, consider the plait pattern and the diagram for 2bop to each be `stretched out' and laid side by side, as below:

TOPS diagram for 2bop
2bop diagram
Plait pattern
Plait pattern

Matching a pattern to a diagram

Informally, we match a pattern to a diagram by matching on the SSEs, the H-bonds and the chiralities. When matching on the SSEs we obtain a correspondence. This is a sequence of matching pairs of SSEs, one for each SSE in the pattern, where first member is an SSE from the diagram and the second  is an SSE from the pattern.  We can describe the result of matching by the the correspondence and also a list of the total number of inserts between adjacent members of the SSE sequence in the diagram. A pattern may match a diagram in none or more ways, and may match none or more diagrams in a given database of diagrams.

For example, there are two ways in which the plait motif can match the diagram for 2bop:

Progress to date

Motif-based searching

Tops cartoons have been formalised as a string constraint language (TOPS_c) and a representation in graph format (TOPS_g) defined. A compiler has been constructed from TOPS files to a TOPS_g database. There is one TOPS_g entry per domain (a TOPS file may contain more than one domain). The 2200 TOPS files in the TOPS Atlas (total 10MB) are compiled into into 2900 TOPS_g entries (total 1.9MB) in less than 3 minutes [dec-alpha]. The entire PDB (as of April 1998) has been converted, over 6500 TOPS files into over 14200 domain graphs, giving a TOPS_g database size 10.6MB. Loading overheads are approximately 3sec/MB, i.e. 6.4 sec for the Atlas files and 31 sec for the PDB database. The database is in relational format, and CATH numbers have been added for the domains (some domains are not in CATH and hence have missing numbers)

High-level descriptions have been written in TOPS_g for some well-known motifs: left-handed beta-alpha-beta, greek key, jelly rolls, Rossmann type NAD-binding domains, ig's, barrels and tim-barrels (perfect and distorted), trefoils, and propellers.

A fast search algorithm to match the high-level descriptions against TOPS_g entries has been designed and implemented. The match gives a similarity measure based on the number of inserts in the matched domain; results can be sorted on a variety of parameters (cath-number, domain-name, nearness-of-match, etc) and the node numbers of the sse's matching the description can also be output. Results can be optionally written to file. A Web interface to the system has also been constructed.

Typical search times over the TOPS Atlas database are e.g. 10 secs to find all matches to the jelly-roll type 1 description (96 hits out of 2900) on a dec-alpha. Some queries take advantage of precompiled topological information generated by the TOPS program; for example a search for Tim barrels takes less than 1 second to find 50 matches in the Atlas database. A domain can be tested against all motif descriptions to see which it matches and by how much. To run this battery test over all 2900 entries takes 51 sec (dec alpha).

The system (compiler to TOPS_g, and query engine plus motif descriptions) has been implemented as executable binaries using clp(FD) from INRIA on dec-alpha's, suns and linux platforms. A database paging mechanism has been developed in order to overcome memory restrictions so that a database of all of the PDB entries can be searched.

The system is being evaluated against the CATH database.

Pattern discovery

We have developed an efficient pattern-discovery method for TOPS patterns. It is essentially a pattern-driven approach based on discovering discovering patterns of H-bonds (and chiralities) based on the properties of sheets for TOPS diagrams; we also can derive T-patterns, i.e. the associated sequences of SSEs and insert sizes. Briefly, the algorithm attempts to discover a new sheet by finding, common to all the target set of diagrams, a (fresh) pair of strands, sharing an H-bond with a particular direction. Then it tries to extend the sheet by repeatedly inserting a fresh strand which is H-bonded to one of the existing strands in the (current) sheet. The algorithm then finds all further H-bonds between all the members of the current sheet. The entire process is repeated until no more sheets can be discovered; the result is the least general common pattern of H-bonds. The numbers of inserts between each strand in the pattern are then computed for all the patterns in the learning set, and the minimum and maximum size of the gaps in the corresponding insert positions in the pattern are thus found, and combine d with the SSE sequence to give the T-pattern.

Topology-based comparison

We have developed a topology-based structure comparison method which is not based on residues, but on secondary structure elements (helices and strands) plus `h-bonds' as in TOPS diagrams, where an `h-bond' refers to a ladder of individual hydrogen bonds between adjacent strands. This system is available on the Web at http://tops.ebi.ac.uk/tops/compare.html

The system discovers a common pattern between the target and the compared domain, and then uses this pattern to align their secondary structure elements. A a score is calculated as a measure of inserts (strands,helices and h-bonds), and can be thought of as a measure of the number of inserts-difference between the target and the diagram, normalised by the size of the pattern discovered.

The system outputs the domains in the representative set, sorted by distance from your target protein, annotated with their CATH codes (where known) and also with the distance measure. It also outputs your target domain, with a reference distance score of zero. Larger distance measures indicate more remote topological relationships. Distance scores computed using different target domains should not themselves be compared.

Software developed

The software developed during the project includes:

Future work

Future work planned includes

On-line articles

Papers

References

ABG+92
P. J. Artymiuk, P. A. Bath, H. M. Grindley, C. A. Pepperrell, A. R. Poirrette, D. W. Rice, D. A. Thorner, D. J. Wild, P. Willett, F. H. Allen, and R. Taylor. Similarity searching in databases of three-dimensional molecules and macromolecules. J. Chemical Information and Computer Sciences, 32:617-630, 1992.

AGP+94a
P. J. Artymiuk, H. M. Grindley, A. R. Poirrette, D. W. Rice, E. C. Ujah, and P. Willett. Identification of beta-sheet motifs, of psi-loops, and of patterns of acid residues in three-dimensional protein structures using a subgraph-isomorphism algorithm. J. Chemical Information and Computer Sciences, 34:54-62, 1994.

AGP+94b
P.J. Artymuik, H.M. Grindley, A.R. Poirette, D.W. Rice, E.C. Ujah, and P. Willett. Identification of beta sheet motifs, of Psi loops, and of patters of amino acid residues in three dimensional protein structures using a subgraph isomorphism algorithm. J. Chem. Inf. Comput. Sci., 34:54-62, 1994.

BG95
Alvis Brazma and David Gilbert. A pattern language for molecular biology. Technical Report TCU/CS/1995/11, Department of Computer Science, City University, 1995.

BJEG97
A. Brazma, I. Jonassen, I Eidhammer, and D. Gilbert. Approaches to the automatic discovery of patterns in biosequences. To be published by the Journal of Computational Biology, january 1997.

CRS+93a
D. A. Clark, C. J. R. Rawlings, J. Shirazi, L.-L. Li, K. Schuerman, M. Reeve, and A. Vernon. Solving large combinatorial problems in molecular biology using the ElipSys parallel constraint logic programming system. The Computer Journal, 36(8):690-701, 1993.

CRS+93b
D. A. Clark, C. J. R. Rawlings, J. Shirazi, A. Vernon, and M. Reeve. Protein topology prediction through parallel constraint logic programming. In Proceedings of the First International Conference on Intelligent Systems in Molecular Biology, pages 83-91, 1993.

CSR91
D. A. Clark, J. Shirazi, and C. J. R. Rawlings. Protein topology prediction through constraint-based search and evaluation of topological rules. Protein Engineering, 4(7):751-760, 1991.

Ehr78
H. Ehrig. Introduction to the algebraic theory of graph grammars. In H. Ehrig V. Claus and G. Rozenburg, editors, Graph-Grammars and Their Application to Computer Science and Biology, pages 1-69. Springer-Verlag LNCS 73, 1978.

Flo95
D.R. Flower. FOLD: Integrated analysis and display of protein secondary secondary structure. Journal of Molecular Graphics, 13:377-384, 1995.

FMJ94
T.P. Flores, D.M. Moss, and Thornton J.M. An algorithm for automatically generating protein topology cartoons. Protein Engineering, 7(1):31-37, 1994.

GARW93
H.M. Grindley, P.J. Artymuik, D.W. Rice, and P. Willett. Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. Journal of Molecular Biology, 229(707-721), 1993.

GEJ97a
David Gilbert, Ingvar Eidhammer, and Inge Jonassen. CBSDL: a constraint based structure description language for biosequences. In Workshop on Constraints and Bioinformatics/Biocomputing, held in conjunction with the Third International Conference on Principles and Practice of Constraint Programming (CP97), Oct 1997.

GEJ97b
David Gilbert, Ingvar Eidhammer, and Inge Jonassen. Structweb: Biosequence structure searching on the web using clp(fd). In Workshop on Constraint Reasoning on the Internet held in conjunction with the Third International Conference on Principles and Practice of Constraint Programming (CP97), Oct 1997.

GMR94
I.V. Grigoriev, A.A. Mironov, and A.B. Rakhmaninova. Interhelical contacts determining the architecture of alpha-helical globular proteins. Journal of Biomolecular Structure and Dynamics, 12(3):559-572, 1994.

Got86
H. Gottler. Graph grammars and diagram editing. In G. Rozenburg H. Ehrig, M. Nagel and A. Rosenfield, editors, Graph Grammars and their application to Computer Science, pages 216-231. Springer-Verlag LNCS 291, 1986.

KKS92
I. Koch, F. Kaden, and J. Selbig. Analysis of protein sheet topologies by graph theoretical methods. Proteins: Structure, function and genetics, 12:314-323, 1992.

KL97
I. Koch and T. Lengauer. Detection of distant structural similarities in a set of proteins using a fast graph-based method. In T. Gaasterland et al, editor, Proceedings of the 5th International Conference on Intelligent Systems in Molecular Biology, pages 167-178. AAAI Press, Jun 1997.

KLE96
I. Koch, T. Lengauer, and Wanke E. An algorithm for finding maximal common subtopologies in a set of protein structures. Journal of Computational Biology, 3(2):289-306, 1996.

Kra91
Per J. Kraulis. MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. Journal of Applied Crystallography, 24:946-950, 1991.

Lev76
M. Levitt. Structural patterns in globular proteins. Nature, 261:552-557, 1976.

MARW89
E.M. Mitchell, P.J. Artymuik, D.W. Rice, and P. Willett. Use of techniques derived from graph theory to compare secondary structure motifs in proteins. Journal of Molecular Biology, 212:151-166, 1989.

Nag77
K. Nagano. Logical analysis of the mechanism of protein folding. iv. super-secondary structures. Journal of Molecular Biology, 109(2):235-250, 1977.

Ric77
J.S Richardson. Beta-sheet topology and the relatedness of proteins. Nature, 268:495-500, 1977.

Ros86
A. Rosenfield. Array grammars. In G. Rozenburg H. Ehrig, M. Nagel and A. Rosenfield, editors, Graph Grammars and their application to Computer Science, pages 67-70. Springer-Verlag LNCS 291, 1986.

RTN+85
C.J. Rawlings, W.R. Taylor, J. Nyakairu, J. Fox, and M.J.E. Sternberg. Reasoning about protein topology using the logic programming language PROLOG. Journal of Molecular Graphics, 3(4):151-157, 1985.

ST77
M.J.E. Sternberg and J.M. Thornton. On the conformation of proteins: The handedness of the connection between parallel beta strands. Journal of Molecular Biology, 110:269-283, 1977.

TTS+97
Y. Tsukamoto, K. Takiguchi, K. Satou, T. Furuichi, E. Takagi, and S. Kuhara. Application of a deductive database system to search for topological and similar three-dimensional structures in protein. CABIOS, 13(2):183-190, 1997.

Ull76
J. R. Ullmann. An algorithm for subgraph isomorphism. Journal of the ACM, 23(1):31-42, January 1976.