Course Notes Compilers IV

Paul Cockshott

Lexical analysis

 

Lexical analysis is the process of recognizing the elementary words or lexemes of a programming language. Suppose we wish to recognise the reserved words begin or while or end.

A file which contained one or more of these words could be produced by a grammar of the form:

.*(begin | while | end).*

where

.  means any character

*  means an arbitrary number of repetitions

| means alternation

This type of grammar, which contains no recursive definitions, corresponds to the class 3 grammars described in the previous chapter and is also termed a regular grammar. It is known that to each regular grammar there corresponds a finite state machine that can act as a recognizer for that grammar. The grammar above would produce the set of strings recognised by the finite state machine shown in Figure 2.1. In this diagram each state of the machine has a number and is drawn as a circle. There are lines (or arcs, as they are called in graph theory) going between the states. The arcs are labelled with letters. If we start out in state 0 and get a b then we go to state 1.

The letters e g i will then take us through states 2 to 4. The final n puts us into the hit state indicating that we have found one of the words. If at any state we get the wrong letter the machine goes back to state 0.

Figure 2.1: A state machine recognizer


This class of finite state machines can be represented in computer memory in a number of different ways. One way would be with the Pascal record types shown in the listing below. To use this data structure you would need a pointer to the present state; then for each character in the file you were searching you would run down the list of transitions to see if any applied. If one did you would take the transition, otherwise you would jump back to the start state.

type
  transition = record
    letter: char;
    nextstate: ^state;
    other_transitions: ^transition;
  end;

  state = record
    hit: boolean;
    transitions: ^transition;
  end;
 

2.1  Tabular FSM representation

A simpler data structure to use would be a two dimensional array indexed by the current state and the incoming character. The recognition can then be performed by a very simple, fast algorithm. Given:

*      a state transition table T[S,C] indexed on a set of states

*      S = {0,P,h}

where

*      0 is the start state,

*      h the hit state

*      P the parse states

*      a character set C

*      a file of characters F with the operator

next(F) → C that returns the next character in a file

*      a current state s

then the table T may be interpreted as a finite state recognizer by the algorithm shown in figure 2.2.

1.       s ← 0

2.       s ← T[s, next(F)]

3.       if s ∈ {0,P} goto 2

4.       stop, the pattern has been recognised

Figure 2.2: Table driven finite state recognition algorithm
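The four steps of figure 2.2 can be sketched in Python (a hypothetical illustration, not the Toolbox code), with a dictionary standing in for the table T[S,C] and the single keyword end as the pattern:

```python
# A transition table for the pattern .*end.* as a Python dict indexed on
# (state, character), like T[S,C].  States: 0 start, 1 seen 'e', 2 seen 'en',
# 3 hit.  Characters with no entry send the machine back to the start state.

def make_table():
    table = {}
    for state in range(3):
        table[(state, "e")] = 1      # an 'e' anywhere may start a match
    table[(1, "n")] = 2
    table[(2, "d")] = 3
    return table

def recognise(text):
    """Steps 1-4 of figure 2.2: return True once the hit state is reached."""
    table = make_table()
    s = 0                            # 1. s <- 0
    for c in text:
        s = table.get((s, c), 0)     # 2. s <- T[s, next(F)]
        if s == 3:                   # 3. loop while s is in {0, P}
            return True              # 4. stop, pattern recognised
    return False
```

Note that every state maps an e back to state 1; without that, an input such as enend would slip past the recognizer.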

This algorithm provides the bare bones of a lexical analyzer. We can envisage coding it up in Pascal and getting something like the listing below.

We assume that the finite state automaton algorithm will be applied to a program held in a buffer. A function scan can be applied to this buffer and returns whether it found a lexeme. In a var parameter found it returns the second last state. In this simple algorithm we assume that this state is sufficient to distinguish the lexemes being looked for.

#1tabular FSM representation

type buffer = array[0..maxbuf] of char;

var buf:buffer;

function scan(var found:state; var hitpos:integer):boolean;
{ returns true if a hit is found in the buffer }
label 1,2;
var i:integer; S:state;
begin
  S:=startstate;   { assumes a distinguished start state value }
  for i:=hitpos to maxbuf do begin
    S:=table[S,buf[i]];
    if S = hitstate then goto 1;
    found:=S;
  end;
  scan:=false;
  goto 2;
1:
  scan:=true; hitpos:=i;
2:
end;

#1

2.2  Character Classes

In practice the task of a lexical analyzer is more complicated than this.

The lexemes of the language are not just made up of reserved words. There are variables and strings, comments and numbers to contend with. Although the reserved words make up a small finite set, all possible variables of a language make up a very large set indeed. For languages which allow arbitrary length variables, the set is infinite. There can be no question of setting up a transition table that would be capable of recognizing all the possible identifiers, all the possible distinct numbers and all the possible distinct strings.

We are better to handle the process in two stages. First a simple finite automaton which tells us: ‘we have a number’ or ‘we have an identifier’. Second, come other automatons that are invoked by the first to distinguish between different numbers or different identifiers, strings etc. This is the strategy used in the Compiler Writer’s Toolbox. Lexical analysis is split into two levels. First level lexical analysis splits the source program into broad categories of lexemes like identifiers, numbers and strings. Second level lexical analysis then distinguishes between different identifiers etc.

 

Figure 2.3: Two stage Lexical analysis


Picture Omitted

Consider the problem of recognizing Pascal identifiers and numbers. We might define them as follows:

number → digit digit*

identifier → letter alphanumeric*

alphanumeric → letter

alphanumeric → digit

digit → [0-9]

letter → [a-zA-Z]

By [0-9] we mean any one of the set of characters from 0 to 9. This is termed a character class. Using these character classes we can define state machines that will recognise either an identifier or a number.
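The grammar above can be sketched directly in Python (an illustrative helper, not part of the Toolbox): classify a complete lexeme as a number or an identifier using the two character classes.

```python
import string

# Character classes from the grammar: digit -> [0-9], letter -> [a-zA-Z].
def classify(lexeme):
    """Return 'number', 'identifier', or None, per the grammar above:
       number     -> digit digit*
       identifier -> letter alphanumeric*"""
    if not lexeme:
        return None
    if lexeme[0] in string.digits and all(c in string.digits for c in lexeme):
        return "number"
    if lexeme[0] in string.ascii_letters and all(
            c in string.ascii_letters + string.digits for c in lexeme):
        return "identifier"
    return None
```

A lexeme such as 1x satisfies neither rule, since a number may contain only digits and an identifier must start with a letter.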

Figure 2.4: Distinguishing identifiers from numbers


The graph in figure 2.4 has far fewer states than the one in figure 2.1, although that one could only recognise three reserved words and this one will recognise all numbers or identifiers. It is an indication of the simplification that can arise from dealing with classes of characters and classes of lexemes. Note that the reserved words recognised by the first state machine (begin, while, end) will all be categorized as identifiers by the state machine in figure 2.4. This does not matter: we can distinguish between reserved words and other identifiers with second level lexical analysis.

The Compiler Writer’s Toolbox is supposed to be easily configurable to handle different languages. An attempt has been made to isolate language dependent parts and make them readily alterable. The first level lexical analyzer has been made very general. It splits the input into numbers, strings, comments and lexemes. Lexemes are interpreted very liberally.

Anything which is not a string, number or comment is a lexeme. This means that things like the brackets [ ] { } ( ) or the operators + - = are treated as lexemes, as well as the more obvious alphanumeric strings. Although this may seem strange, you should realise that some languages allow sequences of operator symbols to be strung together, as in: <=, ++, +=, %=. The first level analyzer sees all of these as lexemes whose meaning will be sorted out later. All it is concerned with is deciding where a lexeme starts and finishes. Consider this example:

x1:=ab<=c;

Where do the individual lexemes start and finish here?

If we write it down with spaces between the lexemes it is obvious:

x1 := ab <= c ;

but the lexical analyzer cannot rely upon the programmer putting it down in this way. Without knowing what all the operators in the language are, it needs to be able to break the string up. This is where character classes come in handy. Using a recognizer like that in figure 2.4 you can pick out the identifiers, but that still leaves a variety of possible partitionings.

x1 : = ab < = c ;
x1 := ab < = c ;
x1 : = ab <= c ;
x1 := ab <= c ;
 
 
 

We can pick out the last one as the right one, because our experience of programming languages leads us to treat := and <= as single lexemes. We know that a string of operator symbols occurring together should be interpreted as a single lexeme.

We can add to our grammar the rules:

operator → opsym opsym*

opsym → [+ * - % : = / ]

Exactly which symbols go to make up the class of operator symbols will vary from language to language, but most languages do have a recognizable class of characters used in operators.
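The class-based splitting just described can be sketched in Python (a simplified stand-in for the Toolbox's first level analyzer; the class sets here are illustrative): a run of characters from one class is one lexeme.

```python
# Hypothetical first-level split of "x1:=ab<=c;" using character classes.
# A run of operator characters is one lexeme; separators stand alone.

OPSYM = set("+*-%:=/<>")
ALNUM = set("abcdefghijklmnopqrstuvwxyz"
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
SEPARATOR = set(";")

def first_level(text):
    lexemes, i = [], 0
    while i < len(text):
        c = text[i]
        if c in ALNUM:
            cls = ALNUM
        elif c in OPSYM:
            cls = OPSYM
        elif c in SEPARATOR:
            lexemes.append(c)   # separators are single lexemes
            i += 1
            continue
        else:                   # whitespace and anything else: skip
            i += 1
            continue
        j = i
        while j < len(text) and text[j] in cls:
            j += 1              # extend the lexeme over its whole class run
        lexemes.append(text[i:j])
        i = j
    return lexemes
```

Run on the example, `first_level("x1:=ab<=c;")` produces the partitioning argued for above, with := and <= each emerging as a single lexeme, whether or not the programmer inserted spaces.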

There is another class of symbols which have to be treated distinctly: brackets. We want (( to be two lexemes not one composite lexeme. We have certain broad classes of characters that are likely to be used in many languages : digits, letters, operators, brackets, spaces. The exact definition of these will vary from language to language. In some languages the underbar symbol ‘ _ ‘ can be part of an identifier, in others it is an operator.

2.2.1  The primitive state machine

The first level lexical analysis is performed by the unit FSM.PAS. The finite state machine is accessed via the function newlexeme, which returns an fsmstate. Each time this function is called the finite state machine processes a new lexeme and terminates in a state which indicates what class of lexeme has been found. The finite state machine program is encoded as a transition table: a two dimensional array indexed on the current state and the current character. In association with the transition table is another array, the action table, which is accessed in step with it and yields one of (last, skip, include). It provides instructions as to how to build up the lexeme. The finite state machine controls the actions of two pointers into a text buffer. One, called start, points at the start of a lexeme; the other, called finish, points at the current character. As each character is processed the finish pointer is advanced.

*      If the action specified is skip , the start pointer is advanced to the current character.

*      If the action specified is include , the start pointer remains where it is.

*      If the action specified is last , the machine stops and returns the pointers plus its current state to the second level lexical analyser.

As a subsidiary function, the finite state machine keeps a count of the current line that has been reached in the program. To configure the analyser for a different language, you would alter the mckctab program to produce a new set of character class definitions, and make any necessary changes to the transition and action tables.
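The two-pointer scheme can be sketched in Python (a minimal sketch of the idea only: whitespace stands in for the full class and action tables, and the function name and simplified actions are my own, not the Toolbox's):

```python
# Sketch of the start/finish pointer scheme.  Each character is either
# skipped (start advances past it), included (start stays put while finish
# advances), or ends the lexeme (the 'last' action).

def next_lexeme(buf, start):
    finish = start
    while finish < len(buf):
        c = buf[finish]
        if c.isspace():
            if finish > start:       # 'last': whitespace ends a lexeme
                return buf[start:finish], finish
            start = finish + 1       # 'skip': advance start past it
        # any other character is 'include': start stays put
        finish += 1
    return (buf[start:finish] or None), finish
```

Calling it repeatedly, feeding back the returned position, walks the buffer one lexeme at a time, exactly as repeated calls to newlexeme do.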

#1Finite state transducer module FSM

{ -----------------------------------------------------------------  
Module      : FSM.cmp  
Used in     : Compiler toolbox  
Author      : W P Cockshott  
Date        : 3 Oct 1986  
Version     : 1  
Function    : Finite State Transducer to perform first level lexical  
              analysis.  
              Splits up text in a buffer into lexemes  
Copyright © WP Cockshott & P Balch  
----------------------------------------------------------------  
}  
UNIT fsm;  
INTERFACE  USES  
  editdecl;  
‘sec ‘Interface to FSM‘  
IMPLEMENTATION  
var 
       include_sp :integer ;  
          NEWSTATE:fsmstate;  
       buffstack : array[1..includedepth] of textbuffer;  
  
procedure rewind;  
‘sec ‘Rewind‘ ;   
function the_line:integer;  
begin 
  the_line:=the_buffer.linenumber

end; 
function push_buffer:boolean
 ‘sec ‘Push buffer‘; 
function pop_buffer:boolean
 ‘sec ‘Pop buffer‘; 
function newlexeme(var B:textbuffer):fsmstate
 ‘sec ‘Transition Table‘ ; 
         label 1,99; 
         var 
            S:fsmstate
            C:charclass
            A:action
            T:textbuffer absolute the_buffer
            I:integer
         begin 
             1: 
               t.start:=t.finish
              if listprog then 
              { put the condition outside the loop to prevent things 
                being slowed down too much } 
                ‘sec‘Recognise and print‘ 
              else 
                ‘sec‘Recognise‘ ; 
              99: 
              newlexeme:=S; 
         end; 
  var i:integer
 begin  
       NEWSTATE:=startstate;  
       listprog:=false;  
       include_sp := 0;  
end. 

2.2.2  Class table

Table 2.1: char class table

char(s)          class            char(s)          class
^@ .. ^L         whitespace       ^M               separator
^N .. ^_         whitespace       space            whitespace
!                shriek           "                dquote
#                whitespace       $  %  &          operator
'                quote            (  )             bracket
*                bracket          +                operator
,                bracket          -                operator
.                digits           /                operator
0 .. 9           digits           :                operator
;                separator        <  =  >  ?  @    operator
A .. D           alpha            E                exponent
F .. Z           alpha            [                bracket
\                whitespace       ]                bracket
^  _  `          operator         a .. d           alpha
e                exponent         f .. z           alpha
{                bracket          |                operator
}                bracket          ~                operator
del              whitespace

Figure 2.5: Use of the class table to economise on the finite state machine


The FSM uses two enumerated types for its operation. Fsmstate lists the states that the machine can be in, whilst charclass specifies the classes into which the character set can be mapped. Between them, these types will act as indices into the state transition table for the low level Finite State Machine.  

type

  fsmstate = (startstate,opstate,idstate,numstate,
     expstate,commentstate,stringstate,escapestate,
     lastquotestate,sepstate,brackstate
   );

  charclass = (operator,bracket,alpha,digits,exponent,dquote,
     quote,shriek,separator,whitespace
   );

const classtab:array[0..127] of charclass = ( see table 2.1 );

2.2.3  McCtab program

The Compiler Writer’s Toolbox encodes the classes to which individual characters belong in the file CLASSTAB.CMP. This includes a definition of the type charclass, and an array classtab which maps from characters to this type. The definitions of this type and the array are given in table 2.1.

The file CLASSTAB.CMP is produced automatically by a program called MckCtab.

This is an example of a program generator: a program which produces another program. These are very useful tools when writing a compiler. In MckCtab the classes can be specified as sets of char; the program then generates the source declaration for an appropriate character class table.
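The idea of a program generator can be sketched in Python (a miniature stand-in for MckCtab: it uses only five of the ten classes and a made-up operator set, but produces Pascal source in the same shape):

```python
# A miniature program generator in the spirit of MckCtab: the character
# classes are given as Python sets, and the function emits Pascal source
# for a class table covering character codes 0..127.

def gen_classtab():
    alpha  = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
    digits = set("0123456789.")
    opsym  = set("+-*/:=<>?@&%$_|")       # illustrative subset only
    entries = []
    for code in range(128):
        c = chr(code)
        if c in "eE":                     # E is special-cased as exponent
            entries.append("exponent")
        elif c in alpha:
            entries.append("alpha")
        elif c in digits:
            entries.append("digits")
        elif c in opsym:
            entries.append("operator")
        else:
            entries.append("whitespace")
    # four entries to a line, as MckCtab does
    body = ",\n  ".join(", ".join(entries[i:i+4]) for i in range(0, 128, 4))
    return "const classtab:array[0..127] of charclass=(\n  " + body + ");"
```

Changing the sets and re-running the generator is all that is needed to retarget the table at a different language, which is the whole point of generating the source rather than writing it by hand.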

#1mckctab

{ -----------------------------------------------------------------
  Program     : Mkctab
  Used in     : Compiler Writer's Toolbox
  Author      : W P Cockshott
  Date        : 3 Oct 1986
  Version     : 1
  Function    : to build a class table include file for the
                lexical analyser
  Copyright © W P Cockshott & P Balch
}
 'sec 'Define Character Sets'
 var c:char;
 begin
    'sec 'Output Classes and Actions'
    writeln('const classtab:array[0..127] of charclass=(');
    for c:=chr(0) to chr(127) do begin
      'sec 'Assign Characters to Classes'
    end;
    writeln('); { end of classtab }');
 end.

2.2.4  Output Classes and Actions

The program outputs the type definitions that are to be used in the finite state machine module of the compiler. There are two types representing the state of the finite state machine and the character classes that are jointly used to index the state transition table.

An example of the output generated from this listing can be seen at the end of section 2.2.2.

#1

writeln('type fsmstate =(startstate,opstate,idstate,numstate,');
  writeln('           expstate,commentstate,stringstate,escapestate,');
  writeln('           lastquotestate,sepstate,brackstate);');
  writeln(' charclass=(operator,bracket,alpha,digits,exponent,dquote,');
  writeln('            quote,shriek,separator,whitespace);')

2.2.5  Assign Characters to Classes

The program writes out the contents of a constant array that maps characters to their classes. These are written 4 entries to a line, separated by commas. A somewhat prettified version of the output from this section was shown in table 2.1.

Each entry is made up of a comment followed by the value of the entry.

#1assign characters to classes

if (ord(c) mod 4) = 0 then writeln;

'sec 'Make Comment'

'sec 'Output Class'

2.2.5.1  Make comment

The comments for printable characters are the characters themselves. If they are non printable ( DEL or a control character ) they are output in mnemonic form.

For a control code the mnemonic is of the form ^ followed by the printable character whose code is 64 greater than the control code.

Thus the bell character (07) will appear in the comment as ^G.

The character ‘}’ has to be handled as a special case so as not to prematurely end the comment.

write(' { ');

if c < chr(32) then write('^',chr(64+ord(c)),'} ')

else if c = '}' then write('closing bracket} ')

else if c = chr(127) then write('del} ')

else write(c,'} ');

#1make comment

2.2.5.2  Output class

Certain characters have to be treated as special cases.

The letter E can occur either as part of an identifier or as the label for the exponent part of a floating point number. For this reason it is convenient to assign it to a special class EXPONENT. The end-of-line character (ASCII 13) is assigned to the class of separators; this is done explicitly because it cannot be given as a Pascal character literal in the definition of the set SEPSY.

Quotes and double quotes are single member character classes and so are more efficiently dealt with by simple equality tests here.

#1Output class

if c in opsys then write('operator,') else
if c in ['E','e'] then write('exponent,') else
if c in alpha then write('alpha,') else
if c in digits then write('digits,') else
if c in brackets then write('bracket,') else
if c = '!' then write('shriek,') else
if (c=chr(13)) or (c in sepsy) then write('separator,') else
if c = '''' then write('quote,') else
if c = '"' then write('dquote,') else
begin
  write('whitespace');
  if ord(c)<>127 then write(',') else writeln;
end;

Exercises  

1. Define the character classes that would be appropriate for first level lexical analysis of Pascal.

2. Draw a state transition diagram based upon the listing of file fsm.pas that describes the behavior of the first level lexical analyser.

3. Devise a regular grammar to describe Pascal floating point numbers.

4. Modify the files Mckctab.pas and fsm.pas to construct a first level

2.2.5.3  Define Character Sets

The characters are divided into several subsets depending upon where they occur in the source language. All are declared using the structured constant facility of Turbo Pascal.

#1define charactersets

const
  sepsy:set of char = [';'];
'sec 'Opsys';
'sec 'Brackets';
'sec 'Alphanumerics'

2.2.5.4  Opsys

These characters can be put together to form an operator. That is to say, if we have the characters '+' and '=' occurring next to one another they are to be treated as the composite operator '+='. An operator in the source language must be made up entirely of these operator characters.

#1Opsys

const opsys:set of char=[':', '-', '=', '+', '@', '&',
                        '^', '%', '`', '_', '$',
                        '?', '/', '>', '<', '~', '|'
                       ];

2.2.5.5  Brackets

These characters include the obvious bracket characters like ‘[’ or ‘]’ and also, less obviously ‘,’ and ‘*’. These are thrown in because all of the characters in the bracket set stand as single lexemes. They do not form part of a larger lexeme.

A '*' put next to another '*' must not be read as '**', nor must a pair of '('s be read as '(('.

#1Brackets

const brackets:set of char = [ ')', '(', '{', '}', ']', '[', '*', ',' ];

2.2.5.6  Alphanumerics

Alphanumeric characters can occur in two contexts: in numbers, and in identifiers. They are split into subsets according to whether they are starting characters of identifiers or of numbers. Note that the character '.' can occur in either an identifier or a (real) number.

#1Alphanumerics

const
  alpha:set of char = ['a'..'z','A'..'Z'];
  digits:set of char = ['0'..'9','.'];
  alphanum:set of char = ['a'..'z','A'..'Z','.','0'..'9','#'];
  spacechars:set of char = [' '];

2.2.6  Transition table

This table, which is stored as a two dimensional array, encodes the behaviour of the low level finite state machine for PS-algol lexical analysis. The behaviour of the machine is indicated by its state transition diagram.

const transtable:array [fsmstate,charclass] of fsmstate = ( ... );

The table's contents are shown below, one row per current state, one column per character class. Entries abbreviate the state names by dropping the "state" suffix (strt = startstate, op = opstate, brk = brackstate, id = idstate, num = numstate, exp = expstate, com = commentstate, str = stringstate, esc = escapestate, lq = lastquotestate, sep = sepstate); column heads abbreviate the classes (" = dquote, ' = quote, ! = shriek, ; = separator, ws = whitespace).

state \ class    op   brk  alph dig  exp  "    '    !    ;    ws
startstate       op   brk  id   num  id   str  strt com  sep  strt
opstate          op   brk  id   num  id   str  strt com  sep  strt
idstate          op   brk  id   id   id   str  strt com  sep  strt
numstate         op   brk  id   num  exp  str  strt com  sep  strt
expstate         num  brk  id   num  id   str  strt com  sep  strt
commentstate     com  com  com  com  com  com  com  com  sep  com
stringstate      str  str  str  str  str  lq   esc  str  str  str
escapestate      str  str  str  str  str  str  str  str  str  str
lastquotestate   op   brk  id   num  id   str  strt com  sep  strt
sepstate         op   brk  id   num  id   str  strt com  sep  strt
brackstate       op   brk  id   num  id   str  strt com  sep  strt

type action =(last,skip,include);

const emit:array [fsmstate,charclass] of action = ( ... );

The action table's contents are shown below in the same layout as the transition table (incl = include):

state \ class    op    brk   alph  dig   exp   "     '     !     ;     ws
startstate       skip  skip  skip  skip  skip  skip  skip  skip  skip  skip
opstate          incl  last  last  last  last  last  last  last  last  last
idstate          last  last  incl  incl  incl  last  last  last  last  last
numstate         last  last  last  incl  incl  last  last  last  last  last
expstate         incl  last  last  incl  last  last  last  last  last  last
commentstate     skip  skip  skip  skip  skip  skip  skip  skip  last  skip
stringstate      incl  incl  incl  incl  incl  incl  incl  incl  incl  incl
escapestate      incl  incl  incl  incl  incl  incl  incl  incl  incl  incl
lastquotestate   last  last  last  last  last  last  last  last  last  last
sepstate         last  last  last  last  last  last  last  last  last  last
brackstate       last  last  last  last  last  last  last  last  last  last

2.2.7  Recognise and print

This is the basic loop that recognises a lexeme. Between invocations the state of the FSM is stored in NEWSTATE.

#1recognise and print

repeat
  S:=NEWSTATE;
  'sec 'Get the next character'
  write(listfile,chr(I));
  if I = 10 then begin
    t.linenumber:=t.linenumber+1;
  end;
  'sec 'Compute new state'
  if A = skip then t.start:=t.finish;  { mark start of lexeme }
  'sec 'Handle end of buffer'
until ( A = last )

2.2.8  Recognise

This is the recognition loop when printing is disabled. Two variants of the loop are provided to prevent a test for printing being enabled within the inner loop.

#1recognise

repeat

S:=NEWSTATE;

sec‘Get the next character‘

if I= 10 then begin

t.linenumber:=t.linenumber+1;

end ;

‘sec ‘Compute new state‘

if A= skip then t.start:=t.finish ;  mark start of lexeme 

‘sec ‘Handle end of buffer‘

until( A=last)

2.3  Interface to FSM

These are the functions and variables exported from the finite state machine module to the second level lexical analyser. If Turbo Pascal allowed a more modular structure one would like the declaration of the type of the Textbuffer to be hidden.

This can not be done under Turbo Pascal V4.0.

#1interface to FSM

const
      includedepth=2;
type
      'sec 'textbuffer'
      string80 = string[80];
var
      the_buffer:textbuffer;
      stopline  :integer;
{-------------------------------------------------------------- }
{   THE_LINE                                                    }
{   return the current line number                              }
{-------------------------------------------------------------- }
function the_line:integer;
{-------------------------------------------------------------- }
{    REWIND                                                     }
{    moves the finite state recogniser back to the start        }
{    of the text buffer                                         }
{-------------------------------------------------------------- }
procedure rewind;
{-------------------------------------------------------------- }
{      Push and Pop buffers                                     }
{      this is used to implement include files                  }
{      push_buffer saves old text buffer                        }
{      pop_buffer restores old text buffer                      }
{---------------------------------------------------------------}
function push_buffer:boolean;
function pop_buffer:boolean;
{-------------------------------------------------------------- }
{  NEWLEXEME                                                    }
{  skips start and finish to point at new lexeme                }
{  returns type of lexeme                                       }
{ ------------------------------------------------------------- }
function newlexeme(var B:textbuffer):fsmstate;

2.3.0.1  Compute new state

This uses the class table to determine the class of the input character and then, using two dimensional array indexing, determines the new state and the action to perform given the current state.

#1compute new state

C:=classtab[I and 127];
NEWSTATE:= transtable[S,C];
A:=emit[S,C];

2.3.0.2  Handle end of buffer

Check whether the end of the lexeme goes past the end of the buffer. If it does, terminate the lexeme and pop the file buffer.

#1Handle end of buffer

if t.finish >= t.textlen then begin  
                { HANDLE RUNNING OUT OF THE BUFFER }  
                    if pop_buffer then goto 1 ;  
                    a:=last;  
                 end;

2.3.0.3  Get next character

Increment the finish pointer for the lexeme and then fetch the character at this position in the text buffer as an integer.

#1get next character

t.finish := t.finish+1;
I := ord(t.thetext^[t.finish]);

2.3.1  TEXTBUFFER

Program text is represented to the lexical analyser by the type TEXTBUFFER. This is a record whose fields are described below and declared in the listing that follows.

field        type      meaning
thetext      pCharSeq  pointer to an array of char
start        word      first character of lexeme
finish       word      character after the lexeme
textlen      word      number of chars in the buffer
linenumber   integer   the line number we are at

#1textbuffer

 
type textbuffer = record
thetext: pCharSeq;
start,finish,textlen: word;
linenumber :integer;
{point at the start and finish of lexeme and give length of buffer }
end;

2.3.2  Rewind

When a program is to be recompiled, for whatever reason, there needs to be a facility for the compiler to move its compilation point back to the start of the buffer. This sequence reinitialises the low level finite state machine. The constant textbuflo is declared in the editor package.

The variable include_sp is used to keep track of the level of nesting of ‘include’ files in the compilation process.

#1rewind

begin
NEWSTATE:=startstate;
the_buffer.start:=textbuflo;
the_buffer.finish:=textbuflo;
include_sp:=0;
end
 
 

2.3.3  Push buffer

This operation pushes the current text buffer onto a stack and advances the include_sp . This will be done whenever an include file is encountered and our position in a previous buffer has to be saved.

#1push buffer

begin        

if include_sp<includedepth then begin

push_buffer:=true;

include_sp:=include_sp+1;

buffstack[include_sp]:=the_buffer;

end else push_buffer:=false;

end

2.3.4  Pop buffer

This is the obverse operation to pushing a buffer. It is performed whenever an include file ends. The space occupied by the buffer can then be freed and the finite state machine switched back to obtaining its input from the old buffer.

#1pop buffer

begin
if include_sp>0 then begin
pop_buffer:=true;
with the_buffer do freemem(thetext,textlen);
the_buffer:=buffstack[include_sp];
include_sp:=include_sp-1;
end else pop_buffer:=false;
end

2.4  Identifier management

The first level of lexical analysis picks out the start and finish of lexemes. At this point the lexemes are sequences of characters. This is not a convenient form to deal with computationally. A common operation that has to be performed in the higher levels of a compiler is to compare two lexemes to see if they match. String comparison is slow. A representation more suited to fast manipulation is needed. Computers carry out atomic operations on words of binary data. The fastest type of comparison is between two binary words. This can generally be done in one or two instructions. For speed, the representation of lexemes should be changed from strings into words. Can this be done?

Can strings of perhaps a dozen characters be represented in a 32 bit or 16 bit word?

It all depends upon how many of them there are. An 8 character string contains 64 bits. It is only possible to represent it in a 16 bit word if the string actually contains no more than 16 bits of information. With a 64 bit binary number 2^64 different values can be represented, and there are 2^64 possible 8 character strings. If longer strings are considered, the number of possible strings rises exponentially with the length of the string. A 16 bit word can only encode 2^16 values. It seems impossible to cram into this all the possible strings.

In principle a program could be written that would use the billions of billions of different identifiers that are theoretically permitted in programming languages. In practice, in our finite world such a program will never occur. All we have to worry about is how many distinct identifiers will actually occur in a program. This number is unlikely to rise above a few hundreds, so a 16 bit word can easily range over them. One simply has to assign ascending integers to the lexemes in the order in which they are encountered. The second level lexical analyzer can be seen as a black box that takes in strings, numbers them and then outputs the numbers.
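The counting argument can be made concrete with a line of arithmetic (lower-case-only strings are used here just to keep the numbers simple):

```python
# The gap between possible strings and needed codes.  Even restricted to
# the 26 lower-case letters, 8-character strings vastly outnumber the
# 2**16 codes a 16-bit word provides; yet a program with a few hundred
# distinct identifiers fits comfortably into a 16-bit numbering.

possible_8_char_strings = 26 ** 8      # about 2.1e11
codes_in_16_bits = 2 ** 16             # 65536
assert possible_8_char_strings > codes_in_16_bits
```

So the trick is not to encode all possible identifiers, only the ones that actually occur, numbered in order of appearance.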

strings → 2nd level lexical analysis → numbers

In functional notation we can characterize it as a function

identify: string → number

such that

identify(A) = identify(B) iff A = B

In words this means that the mapping from strings into numbers preserves string equality. If we knew beforehand what strings were to be encountered this would be trivial. We need only construct a fixed table with two columns, strings in one and numbers in the other. When we needed to perform a conversion we would scan the first column for a match and output the corresponding number. For the reserved words of a language this is feasible, and it is indeed the technique used by some compilers: a fixed table is set up containing the reserved words, which is searched for a match each time a lexeme is produced. If you do this, however, you end up with the reserved words of the language embedded in your compiler. This may be a disadvantage when you have to convert the compiler to handle a new language.

Furthermore, a compiler will always have to handle additional symbols over and above the reserved words. It has to handle the user defined names of variables. These cannot be known beforehand, so a compiler is forced to have an extendible data structure - often called the symbol table - to deal with these. An alternative approach is to deal with all lexemes other than literal values (integers, reals etc.) in a homogeneous way. The lexical analyzer of the Compiler Toolbox contains no built in knowledge of what identifiers are used in the language. It works entirely with dynamic data structures. The simplest way to build a mapping from strings to numbers would be something like the algorithm shown in program SIMTRANS.

This algorithm is simple and reliable but suffers from two serious disadvantages. The first is excessive space consumption. The second is slowness.

In order to handle the longest identifier that you are going to encounter, the constant idmaxlen will have to be set to something like 40 or 50. The type idrange may have to go up to 1000 to handle the number of identifiers that might occur in a big program. Already you have reserved 40 to 50 kilobytes of store, and a compiler will need several other tables. Is it wise to use up so much space that may never be needed? Most identifiers will be closer to 8 characters than 40 characters in length, so perhaps 80 per cent of the table will be wasted. This kind of pressure tempts compiler writers to restrict the length of identifiers that the compiler will accept. When computers had less memory the temptation to allow only short identifiers was severe: Fortran allowed only 6 characters in identifiers, and early versions of C ignored all but the first 8 characters of a name. This kind of thing is no longer acceptable. Some sort of dynamic data structure should be chosen for the identifier table, one that does not restrict how long identifiers can be.

The simple table used in SIMTRANS will be slow to search. On average you will have to scan through half the table to find an identifier. This can mean hundreds of string comparisons for each symbol processed. In order to prevent identifier translation becoming a bottleneck it should be made as fast as possible.

Program SIMTRANS

program simtrans; 
 const idmax=1000; 
       idmaxlen=40; 
 type idrange=1..idmax; 
      idstring=string[idmaxlen]; 
 var idpntr:idrange; 
     idtab:array [idrange] of idstring; 
 procedure pushid(var id:idstring); 
 begin 
  idtab[idpntr]:=id; 
  idpntr:=succ(idpntr); 
 end; 
 function identify(var id:idstring):idrange; 
 label found; 
 var i:idrange; 
 begin 
  for i:=1 to pred(idpntr) do if id=idtab[i] then 
  begin 
   identify:=i; goto found 
  end; 
  pushid(id); 
  identify:=identify(id); 
  found: 
 end; 
 begin 
  idpntr:=1; 
 { rest of program goes here } 
 end.

2.5  Tries

There are a variety of dynamic data structures that could be used, based upon trees or hashing. What the Compiler Toolbox uses is a trie. This is a special sort of tree that makes use of the observation that many of the names used in a program will be similar to one another. Consider the user-defined identifiers in what follows.

idrange
idmax
idstring
idmaxlen
idpntr
idtab
pushid
id
identify
i
found
 
 

There is a great deal of commonality here. A trie takes advantage of the fact that many identifiers share common prefixes with other identifiers. We can reorganise the identifiers to show this in figure 2.6.

Figure 2.6: Identifiers organised as a trie

founD
ID entifY
   maX leN
   pntR
   rangE
   strinG
   taB
pushiD
 
  

In this arrangement the prefixes are only written down once. The last letter of an identifier is written in capitals. Although this arrangement is more compact, it contains all of the names. A trie achieves this arrangement in computer memory using a linked list.

2.5.1  Trie data structure

A possible record format for a trie is shown in figure 2.7.

Figure 2.7: the trie data structure


The type lexnode is a pointer to a delabrandis record. This has a character in it and two further lexnodes. The follower pointer corresponds to moving horizontally in figure 2.6, the alternates pointer corresponds to moving vertically. The number final will be non-zero if this is the last letter in a word, corresponding to the use of capital letters in figure 2.6. The identifiers are further indexed on their first character using the type lexarray.

type
 lexnode = ^delabrandis;
 lexarray = array[minchar..maxchar] of lexnode;
 delabrandis = record
  final: integer;
  pref: char;
  follower, alternates: lexnode;
 end;


2.5.2  Trie insertion

The dynamic trie structure places no limits on the number of characters that can be in an identifier. Despite this, no unnecessary space is allocated for short identifiers. Very short, one or two letter identifiers are likely to fall entirely within other longer ones, and effectively use no space. Search is fast. Using the index the algorithm restricts its search space to just those identifiers that start with the right letter. With each further letter matched, the search space narrows further.

Figure 2.8: Use of index to speed up trie access


The trie is manipulated by the function insert_token, which updates the trie as a side effect of searching it for an identifier. Because the trie is a recursive datatype the actual insertion is done by a recursive procedure ins. The function maintains the alternates in alphabetical order. Keeping them ordered reduces the number of unnecessary comparisons that have to be made when inserting a new string.

Note how access to the trie is accelerated by using an array n indexed on characters. This means that there is essentially a sub-trie for each possible first letter of the identifiers. This produces a cheap speedup in the search performance. If this were not the case the first letter would have to be matched using a linear search.

Function insert_token

function insert_token(var B:textbuffer; var n:lexarray):lextoken;
{  inserts the string into the tree }
{$S-}
var p      : lexnode;
    charno : integer;
    c      : char;
    hit, inserted : boolean;

 procedure newnode(var next:lexnode; c:char);
 { <Create a new node> }

 procedure ins(var n:lexnode; charno: word);
 var t:lexnode;
     c:char;
 { <Recursive insert> }

begin
{$r-}
 ins(n[B.thetext^[B.start]], B.start);
end;
 

2.5.2.1  Create a new node

A new node is created on the heap. Its character field is initialised to the current character. Its final field is set to 0 to indicate that this is not yet the last character of a word. It has no follower or alternates as yet.

begin
 new(next);
 with next^ do begin
  pref:=c; final:=0;
  follower:=nil;
  alternates:=nil;
 end;
end;

2.5.3  Recursive insert

The recursive insert procedure has to deal with three cases:

*      We are at a leaf node of the trie with a nil pointer.

*      We are at a node that matches the current character.

*      The current character is less than the character in the node.

 
begin
 c:=B.thetext^[charno];
 if charno < B.finish then with B do
  if n = nil then
   { <Add another letter> }
  else with n^ do
   if c = pref then begin
    if charno = finish-1 then
     { <Last letter of word> }
    else ins(follower, charno+1);
   end
   else if c > pref then ins(alternates, charno)
   else
    { <Char less than prefix> };
end;

2.5.3.1  Add another letter

A new node is attached to the currently nil node pointer. The insert procedure is invoked recursively to append any further characters in the word to the trie.

begin     
newnode(t,c);
n:=t;
ins(n,charno)
end

2.5.4  Last letter of a word

We are on the last character of the word. Either the word has been encountered before or it has not.

If it has been met previously then a token will have been stored for the word in the final field of the node. Alternatively, the word is new and the final field indicates this by holding 0. It is then replaced by the value of lasttoken, which is itself then incremented. After this the value in final must be the appropriate code for the word and can be returned from insert_token.

begin
{ a  hit }
hit:=true;
if final = 0 then
{ first time we have encountered this word}
begin
final:=lasttoken;
lasttoken:=succ(lasttoken);
end;
insert_token:=final;
end

2.5.4.1  Char less than prefix

A copy of the pointer to the current node is made in t . A new node is then attached to the current node pointer. The old node is then made the first alternative for the new node. Then the insert operation is invoked recursively.

begin
t:=n;
newnode(n,c);
n^.alternates:=t;
ins(n,charno);
end

2.6  Lexeme Definition

The identifier table is initially loaded with a list of all of the reserved symbols of the language. This includes not only the reserved words, but also the operators and brackets. At startup the procedure init_lexanal is invoked to read these in from a file: lexemes.def. Because the insertion procedure assigns ascending integers to the identifiers read in, the lexemes in the file will be given internal integer representations in the order in which they occur in the file. Within this file they are organised as a list of symbols, one per line.

Figure 2.9: A section of a lexeme definition file.

then
to
TRACEON
TRACEOFF
true
upb
vector
while
write
{
}
~
~=
\
..INT.LIT
..REAL.LIT
..STRING.LIT
..IDENTIFIER
 
 

Figure 2.10: Corresponding section of the type lexeme

 
THEN_SY,
TO_SY,
TRACEON_SY,
TRACEOFF_SY,
TRUE_SY,
UPB_SY,
VECTOR_SY,
WHILE_SY,
WRITE_SY,
CUR_SY,
LEY_SY,
TILDE_SY,
NEQ_SY,
SLASH_SY,
INT_LIT,
REAL_LIT,
STRING_LIT,
IDENTIFIER,
 
 

The entries in the lexeme definition file are put in one to one correspondence with an enumerated data type: lexeme. This type provides the interface between the lexical analyzer and the syntax analyzer. The context free grammar of the language will be defined in terms of these lexeme values.

2.7  Interface to syntax analyser


The lexical analyzer communicates with the syntax analyzer via a small group of procedures and variables. The most important of these is the procedure next_symbol. Each time it is called it processes one symbol. The finite state machine in the level one lexical analyzer is invoked to delimit the symbol. The terminating state of the machine determines whether the symbol was a name, an operator, a number or a string.

If a number is found then a numeric conversion function is invoked to convert the decimal representation of the number into a binary format. In Turbo Pascal this is easily achieved using the val function. This takes a string and returns an integer or real equivalent. If your compiler is in another dialect of Pascal, you may need to write your own numeric conversion functions.

Next_symbol leaves the results of lexical analysis in a set of global variables:

symbol     :lextoken;
lexsymbol  :lexeme;
the_string :stringv;
the_real   :real;
the_integer:integer;
 

Lexsymbol will hold the lexeme that has been matched. If it was an integer, real or string then it will take on the value INT_LIT, REAL_LIT or STRING_LIT, and the corresponding value of the literal will be stored in the_integer, the_real or the_string. If the lexeme is a reserved word then the corresponding enumerated type value is stored in lexsymbol. If the lexeme is a user-defined identifier, then the value of lexsymbol will be IDENTIFIER and the number associated with that identifier will be returned in symbol as a lextoken.

The writer of a syntax analyzer can choose to call next_symbol and perform tests on these global variables. In the process of analysis certain actions have to be carried out repeatedly. A common sequence is to test whether the current lexeme has a particular value and, if it has, to call next_symbol to see what follows. This combination of actions is bundled up in:

function have( t: lexeme) : boolean;

If the lexical analyzer has t then it says yes and grabs the next lexeme.

 
{ ------------------------------------------------------------------- }
{           HAVE                                                      }
{           conditionally matches a token                             }
{ ------------------------------------------------------------------- }
function have( t: lexeme ) : boolean;
begin
if t = lexsymbol then
begin next_symbol(the_buffer);
have:=true
end
else  have:=false;
end;

2.7.1  Mustbe

The stricter version of have is mustbe. This is called when the syntax stipulates the presence of a particular symbol. If the symbol is not found then an error is reported. The error message specifies what symbol was found and what was expected. This can generate such messages as ‘begin found instead of then’.

Note that as presently configured, mustbe will skip over newlines until it comes to a symbol. This is appropriate for most modern languages, but would not be appropriate for parsing assembly language for instance.

 
{ ------------------------------------------------------------------- }
{           MUSTBE                                                    }
{           compulsorily matches a token                              }
{ ------------------------------------------------------------------- }
procedure mustbe(t : lexeme );
begin
if not have(t) then  begin
if have(newline_sy) then mustbe(t) else syntax(t);
end;
end;
 

2.7.2  Syntax errors

 
 
{ ------------------------------------------------------------------- }
{          SYNTAX                                                     }
{          report error and stop                                      }
{ ------------------------------------------------------------------- }
procedure syntax( t : lexeme);
var m :stringv;
begin
m:= currentid + ' found instead of ' + psym(ord(t));
ReportError(m);
end;
 

2.7.3  Current Id

The code generator may need to know the current identifier’s printable form in order to be able to plant information for a linker. It is thus convenient to have a function that will return this. This avoids the code generator having to know anything about the data format of the text buffer.

 
{ -------------------------------------------------------------- }
{       CURRENTID                                                }
{       returns the identifier as a string                       }
{ -------------------------------------------------------------- }
 
function currentid : stringv;
var n: stringv;
    i: integer;
begin
 with the_buffer do begin
  n:='';
  for i:=start+1 to finish do
   n:=n+thetext^[i];
  currentid:=n;
 end;
end;


2.7.4  Converting lexemes to strings

For error reporting purposes it is often convenient to be able to convert from lexemes back into printable strings.

The technique that you choose for this will depend upon how frequently you want to make the conversions each way. If conversions back into strings are very rare - just when the compiler stops with an error - then it is not necessary for the conversion process to be very fast. In this case one could perform a depth first traversal of the trie looking for a node marked with the current lexeme. When one is found, one knows that this node represents the last character of the identifier. By tracing one's path back through the trie the original identifier can be extracted.

If speed is more important then it may be worth explicitly storing each identifier in RAM. This of course brings with it all the problems of space overheads mentioned earlier. If you use a 2 dimensional character array, you have to agree upon the maximum length of identifier that you can store.

An alternative storage mechanism, shown in figure 2.11, uses an array starts indexed on the lexemes to find the starting positions of identifiers in a one dimensional character array: pool. The lexeme l will then occupy positions

Figure 2.11: Use of a character pool


pool[starts[l]]..pool[starts[l+1]-1]

2.7.5  NextSymbol

The lowest level interface to a lexical analyser is provided by a procedure that gets a symbol. We call this procedure next_symbol. It reads in a lexeme and stores the token in the variable symbol.

If the symbol turns out to have been a literal value, for instance a number or a string, then the actual value of the literal must be stored for the subsequent stages of the compiler. The value can be stored in one of several global variables:

the_integer, the_real, the_string.

In order to do this some processing will be necessary to convert between source character strings and numbers or strings.

 
{ ------------------------------------------------------------------ }
{      NEXTSYMBOL                                                    }
{ ------------------------------------------------------------------ }
 
procedure next_symbol;
var S: fsmstate;

    function numconv: lexeme;
    var n: stringv;
        i, p: integer;
        isint: boolean;
    begin
     { <Convert string to number> }
    end;

    procedure printsymbol;
    var i: integer;
        c: char;
    begin
     with the_buffer do
      if lexsymbol <> newline_sy then begin
       for i:=start to finish-1 do write(thetext^[i]);
       write(' ');
      end else writeln;
    end;

    function stringconv: lexeme;
    var n: stringv;
        i, p: integer;
        escape: boolean;
        c: char;
     procedure append(c: char);
     begin if length(n) < MAXSTRING then n:=n+c; end;
    begin
     { <Convert string to internal form> }
    end;

 { <Type coercion operation> }

begin
 coerce.dummy:=0;
 compilercursor:=the_buffer.start;
 S:=newlexeme(the_buffer);
 with coerce do
  if S in [opstate, brackstate, idstate] then
   l1:=insert_token(the_buffer, predefined)
  else if S in [numstate, expstate] then l2:=numconv
  else if S in [stringstate, lastquotestate] then l2:=stringconv
  else symbol:=coerce.l1;
 if symbol > maxpredefined then lexsymbol:=identifier
 else lexsymbol:=coerce.l2;
 { <Detect run time error location> }
end;
 

2.7.5.1  Convert string to number

The string is searched to see if it contains any non-digit characters.

If it does not then the flag isint will be set. On the basis of this flag the Turbo Pascal function val is called to convert the string either into an integer or a real. Then the appropriate lexeme is returned from numconv.

isint:=true;
with the_buffer do begin
 n:='';
 for i:=start to finish-1 do begin
  n:=n+thetext^[i];
  isint:=isint and (thetext^[i] in ['0'..'9']);
 end;
 if isint then begin val(n,the_integer,p); numconv:=INT_LIT end
 else begin val(n,the_real,p); numconv:=REAL_LIT end;
end;

2.7.5.2  Convert string to internal form

Some characters that one may want to place in a string have no printable representation. Most of these are format characters like carriage return, tab, or backspace. Some computer languages provide special means of including these into a string by prefixing a printable character with an escape character. When so prefixed, the printable character means something else.

In C the escape prefix is \ and in S-algol it is ‘ .

The conversion from the printable to the internal form of a string can be performed by a simple finite state machine that has two states:

*      Processing normal characters.

*      Processing the character after an escape.

In the first case, the characters in the source string are just copied over to the internal form. In the second case, the translation rule appropriate to the language must be applied to get the internal non-printable character.

begin
 escape:=false;
 with the_buffer do begin
  n:='';
  for i:=start+1 to finish-2 do begin
   c:=thetext^[i];
   if not escape then begin
    escape:=classtab[ord(c) and 127]=quote;
    if not escape then append(c);
   end
   else begin
    { <Convert prefixed characters> }
    escape:=false;
   end;
  end;
  the_string:=n;
  stringconv:=string_lit;
 end;
end;
 

2.7.5.3  Convert prefixed characters

Most languages provide a mechanism for embedding control characters in strings, typically by providing a prefix character and then some code that follows. This example applies to the language S-algol, but other languages would require similar code. S-algol has the following mapping from escape sequences to non-printable characters:

‘n → LineFeed = 10

‘t → Tab = 9

‘o → CarriageReturn = 13

‘b → BackSpace = 8

‘p → VerticalTab = 11

Anything else preceded by a ‘ stands for itself. Hence ‘‘ stands for ‘ and ‘" stands for ".


case c of
 'n' : append(chr(NEWLINE));
 't' : append(chr(TAB));
 'o' : append(chr(CR));
 'b' : append(chr(BS));
 'p' : append(chr(VTAB));
 else append(c);
end;
 

2.7.5.4  Type coercion operation

We wish to have an enumerated type lexeme for the reserved words of the language. Subsequent identifiers are assigned ordinal values as their lexical tokens that start up where the predefined lexemes finish. What we obtain from our identifier conversion routine is a lextoken, which is an ordinal type. We have to convert this to a lexeme. There is no built-in way of converting an ordinal to an enumerated type in standard Pascal, so we cheat by using a variant record. We assign the lextoken to field l1 and then read it back as a lexeme from field l2.

var coerce: record
     case boolean of
      true:  (l1: lextoken);
      false: (l2: lexeme; dummy: byte);
    end;
 
 

2.7.6  Detect run time error location

If an error occurs during the execution of an S-algol program, the line number on which this occurred is passed back to the compiler. The program is then searched using the lexical analyser to find the appropriate position in the text where the error happened. The lexical analyser knows it has been called to find an error if the variable stopline is non-zero.

if stopline > 0 then
 if coerce.l2 = newline_sy then
 begin
  if the_line > stopline then
  begin
   if errorfree then
    error('Run time Error');
  end
 end;

2.8  Exercises

1. Write numeric conversion functions to convert from the decimal representation of integers and reals to binary.

2. Modify these to handle numbers in any base from 2 to 16. Use the notation defined by the grammar:

anybase → base # number

base → decimalnumber

number → anydigit anydigit*

decimalnumber → decimaldigit decimaldigit*

anydigit → [0-9A-F]

decimaldigit → [0-9]

An example in this notation might be

16#01A

3. Alter the string conversion procedure to handle Pascal strings. These are delimited by single quotes.

4. Modify the data type lexeme to correspond to the lexemes needed for pascal. Create a new version of the file lexemes.def called lexemes.pas that contains the pascal lexicon. Test the ability of your reconfigured lexical analyzer to accept pascal input. This may require you to write a driver program that will print the output of the lexical analyzer.

5. How would you make the lexical analyzer insensitive to the differences between upper and lower case keywords and identifiers whilst retaining the cases of letters in strings?

 

