Vector Pascal an Array Language for Multimedia code Paul Cockshott Imaging Faraday Partnership University of Glasgow, Glasgow G12 8QQ, Scotland e-mail: wpc@dcs.gla.ac.uk Main topics: image-processing, Pascal, Code Generator Technology, SIMD

post 16.1.06 hits
View My Stats

Introduction

The introduction of SIMD instruction sets  to Personal Computers potentially provides substantial performance increases, but the ability of most programmers to harness this performance is held back by two factors. The first is the limited availability of compilers that make effective use of these instruction sets in a machine independent manner. This remains the case despite the research efforts to develop compilers for multi-media instruction sets . The second is the fact that most popular programming languages were designed on the word at a time model of the classic von Neumann computer rather than the SIMD model.

Data parallelism in Vector Pascal

Vector Pascal incorporates Iverson's approach to data parallelism. Its aim is to provide a notation that allows the natural and elegant expression of data parallel algorithms within a base language that is already familiar to a considerable body of programmers and combine this with modern compilation techniques.

Assignment maps

Standard Pascal allows assignment of whole arrays. Vector Pascal extends this to allow consistent use of mixed rank expressions on the right hand side of an assignment. Given

r0:real; r1:array[0..7] of real; r2:array[0..7,0..7] of real

then we can write

r1:= r2; { in standard Pascal }

r1:= 1/2; { 0.5 to each element of r1 }

r2:= r1*3; { 1.5 to every element of r2}

r2:= r2+r1 ; { self explanatory }

The variable on the left hand side of an assignment defines an array context within which expressions on the right hand side are evaluated. Each array context has a rank given by the number of dimensions of the array on the left hand side. A scalar variable has rank 0. Variables occurring in expressions with an array context of rank r must have r or fewer dimensions. The n bounds of any n dimensional array variable, with n £ r occurring within an expression evaluated in an array context of rank r must match with the rightmost n bounds of the array on the left hand side of the assignment statement.

Where a variable is of lower rank than its array context, the variable is replicated to fill the array context. Because the rank of any assignment is constrained by the variable on the left hand side, no temporary arrays, other than machine registers, need be allocated to store the intermediate array results of expressions.

Maps are implicitly and promiscuously defined on both monadic operators and unary functions. If f is a function or unary operator mapping from type r to type t then if x is an array of r then a:=f(x) assigns an array of t such that a[i]=f(x[i]).

Functions can return any data type whose size is known at compile time, including arrays and records. A consistent copying semantics is used.

Operator Reduction

Maps take operators and arrays and deliver array results. The reduction abstraction takes a dyadic operator and an array and returns a scalar result. The functional form \ denotes it. Thus if a is an array, \ + a denotes the sum over the array. More generally \Fx for some dyadic operator F means x0F(x1F..(xnFi)) where i is the identity element for the operator and the type. The dot product of two vectors can thus be written as : x:=\+ y*z;

Reduction is always performed along the last array dimension of its argument.

Implicit indices

Assignment maps and reductions involve implicit indices. It can be useful to have access to these.

iota

The syntactic form  iota i returns the ith current implicit index. Thus given the definitions

v1: array[1..3]of integer; v2: array[0..4] of integer;

the program fragment v1:=iota 0; v2:=iota 0 *2;

would set v1 and v2 as follows:

v1=123

v2=02468

trans

The syntactic form trans x transposes a vector matrix, or tensor. It achieves this by cyclic rotation of the implicit indices. Thus if trans e is evaluated in a context with implicit indices

iota 0.. iota n

then the expression e is evaluated in a context with implicit indices

iota'0.. iota'n

where

iota'x = iota ( (x+1)mod n+1)

It should be noted that transposition is generalized to arrays of rank greater than 2.

Unary operations

The unary operators supported are +, -, *, /, div, not, round, sqrt, sin, cos, tan, abs, ln, ord, chr, succ, pred and @.

The dyadic operators are all extended to unary context by the insertion of an identity element under the operation. For sets the notation -s means the complement of the set s.

Dyadic operators supported are +, +:, -:, -, *, /, div, mod , **, pow, <, >, >=, <=, =, <>, shr, shl, and, or, in, min, max. All of these are consistently extended to operate over arrays. The operators **, pow denote exponentiation and raising to an integer power as in ISO Extended Pascal. The operators +: and -: exist to support saturated arithmetic on bytes as supported by the MMX instruction set.

Performance

As an example of Vector Pascal we will look at an image filtering algorithm. In particular we will look at applying a separable 3 element convolution kernel to an image. We shall initially present the algorithm in standard Pascal and then look at how one might re-express it in Vector Pascal.

Convolution of an image by a matrix of real numbers can be used to smooth or sharpen an image, depending on the matrix used. If A is an output image, K a convolution matrix, then if B is the convolved image

 By,x = å i å j Ay+i,x+jKi,j

A separable convolution kernel is a vector of real numbers that can be applied independently to the rows and columns of an image to provide filtering. It is a specialisation of the more general convolution matrix, but is algorithmically more efficient to implement. If k is a convolution vector, then the corresponding matrix K is such that Ki,j = kikj .

Given a starting image A as a two dimensional array of pixels, and a three element kernel c1,c2,c3 , the algorithm first forms a temporary array T whose whose elements are the weighted sum of adjacent rows Ty,x = c1Ay-1,x+c2Ay,x+c3Ay+1,x . Then in a second phase it sets the original image to be the weighted sum of the columns of the temporary array: Ay,x = c1Ty,x-1+c2Ty,x+c3Ty,x+1 . Clearly the outer edges of the image are a special case, since the convolution is defined over the neighbours of the pixel, and the pixels along the boundaries a missing one neighbour. A number of solutions are available for this, but for simplicity we will perform only vertical convolutions on the left and right edges and horizontal convolutions on the top and bottom lines of the image. The Standard Pascal implementation of the convolution was:

type

pixel = -128..127;

tplain = array[0..maxpix ,0..maxpix] of pixel;

procedure conv(var theim:tplain;c1,c2,c3:real);

var tim:array[0..maxpix,0..maxpix]of pixel;

temp:real;

i,j:integer;

begin

for i:=1 to maxpix-1 do

for j:= 0 to maxpix do begin

temp:= theim[i-1][j]*c1+theim[i][j]*c2+theim[i+1][j]*c3;

if temp>127 then temp :=127 else

if temp<-128 then temp:=-128;

tim[i][j]:=round(temp);

end;

for j:= 0 to maxpix do begin

tim[j]:=theim[j]; tim[maxpix][j]:=theim[maxpix][j];

end;

for i:=0 to maxpix do begin

for j:= 1 to maxpix-1 do begin

temp:= tim[i][j-1]*c1+tim[i][j+1]*c3+tim[i][j]*c2;

if temp>127 then temp :=127 else

if temp<-128 then temp:=-128;

tim[i][j]:=round(temp);

end;

theim[i]:=tim[i]; theim[i][maxpix]:=tim[i][maxpix];

end;

end;

` `

The pixel data type has to be explicitly introduced as the subrange -128..127. Explicit checks have to be in place to prevent range errors, since the result of a convolution may, depending on the kernel used, be outside the bounds of valid pixels. Arithmetic is done in floating point and then rounded.

The Vector Pascal implementation of the convolution was:

`procedure pconv(var theim:tplain;c1,c2,c3:real);`
`var tim:array[0..maxpix,0..maxpix]of pixel;`
`    p1,p2,p3:array[0..maxpix]of  pixel;`
`begin`
`   p1:=c1; p2:=c2; p3:=c3;        `
`   tim [1..maxpix-1] :=  theim[0..maxpix-2]*p1 +`
`                         theim[1..maxpix-1]*p2+theim[2..maxpix]*p3;`
`   tim:=theim; tim[maxpix]:=theim[maxpix];`
`   theim[][1..maxpix-1]:=tim[][0..maxpix-2]*p1+`
`                         tim[][2..maxpix]*p3+tim[][1..maxpix-1]*p2;`
`   theim[]:=tim[]; theim[][maxpix]:=tim[][maxpix];`
`end;`

Note that all explicit loops disappear in this version, being replaced by assignments of array slices. The first line of the algorithm initialises three vectors p1, p2, p3 of pixels to hold the replicated copies of the kernel coefficients c1, c2, c3 in fixed point format. These vectors are then used to multiply rows of the image to build up the convolution. The notation theim[][1..maxpix-1] denotes columns 1..maxpix-1 of all rows of the image. Because the built in pixel data type is used, all range checking is handled by the compiler. The type Pixel is a fixed point signed binary fraction 1>p>= -1 held to 8 bit precision. Since fixed point arithmetic is used throughout, there will be slight rounding errors not encountered with the previous algorithm, but these are acceptable in most image processing applications. Fixed point pixel arithmetic has the advantage that it can be efficiently implemented in parallel using multi-media instructions.

It is clear that the data-parallel implementation is somewhat more concise than the sequential one, 12 lines with 505 characters compared to 26 lines with 952 characters. It also runs considerably faster, as shown in table 1. This expresses the performance of different implementations in millions of effective arithmetic operations per second. It is assumed that the basic algorithm requires 6 multiplications and 6 additions per pixel processed. The data parallel algorithm runs 12 times faster than the serial one when both are compiled using Vector Pascal and targeted at the MMX instruction-set. The pconv also runs a third faster than conv when it is targeted at the 486 instruction-set, which in effect, serializes the code. For comparison conv was run on other Pascal Compilers. It can be seen that Vector Pascal is roughly comparable in speed to other compilers on sequential code, but when using the MMX instruction set it performs roughly 10 times faster. Further performance comparisons are given in table 2. The tests here involve vector arithmetic on vectors of length 640 and take the general form v1 = v2fv3 for some operator f and some vectors v1,v2,v3 . The exception being the dot product operation coded as

`r:= \ + r2*r3`

in Vector Pascal, and using conventional loops for the other compilers. Even targeted on a 486 the performance of the Vector Pascal compiler for vector arithmetic compares very favorably with that of other compilers. When the target machine is a K6 which incorporates both the MMX and the 3DNow SIMD instruction sets, the acceleration is broadly in line with the degree of parallelism offered by the architecture.  For data types where saturated arithmetic is used, the acceleration is most marked, a 21 fold acceleration being achieved for saturated unsigned byte additions and a 23 fold acceleration on pixel additions compared to the best commercial Pascal compiler tested. All measurements were on a 1GHz Athlon.

Table 1: Comparative Compiler Performances on Convolution

 Program Implementation Target Processor Million Ops Per Second conv Vector Pascal Pentium + MMX 61 Borland Pascal 286 + 287 5.5 Delphi 4 486 86 DevPascal 486 62 pconv Vector Pascal 486 80 Vector Pascal Pentium + MMX 817

Implementation

The Vector Pascal compiler is implemented in Java. It uses the ILCG(Intermediate Language for Code Generators) portable code generator generator system. A Vector Pascal program is translated into an ILCG abstract semantic tree implemented as a Java data-structure. The tree is passed to a machine generated Java class corresponding to the code generator for the target machine.   Output is assembler code which is assembled using the NASM assembler and linked using the gcc loader.

Code generators in Java are produced from formal specifications of the semantics of the instruction sets. Code generator classes currently exist for the Intel 486, Pentium with MMX, P3, P4 and also the AMD K6. The code generators follow the pattern matching approach described inand , and are automatically generated from machine specifications written in ILCG .  This describes the syntax and semantics of an instructionset as a collection of patterns, for instance:

pattern regindirf(reg r) means[ ­ (r) ] assembles[ r ];

pattern baseplusoffsetf(reg r, signed s) means[+( ­ (r) ,const s)]

assembles[ r '+' s ];

means[(ref int32 vector(2)m:=

(int32 vector(2))+( (int32 vector(2))(m),(int32 vector(2))(ma)))]

Here vector casts are used to specify that the result register will hold the type int32 vector(2), and to constrain the types of the arguments and results of the + operator. The ILCG semantics for such vector patterns have the concept of vector maps built in.

Selection of target machines is by a compile time switch, which causes the appropriate code generator class to be dynamically loaded. ILCG is a strongly typed language which supports vector data types and the mapping of operators over vectors. It is well suited to describing SIMD instruction sets. The code generator classes export from their interfaces details about the degree of parallelism supported for each data-type. This is used by the front end compiler to iterate over arrays longer than those supported by the underlying machine. Where supported parallelism is unitary, this defaults to iteration over the whole array. The translation process  shown in figure 1, goes through the following phases:

i.                     The source file (1) is parsed by a java class PascalCompiler.class (2) a hand written, recursive descent parser, and results in a Java data structure (3), an ILCG tree, which is basically a semantic tree for the program.

ii.                   The resulting tree is transformed (4) from sequential to parallel form and machine independent optimisations are performed. Since ILCG trees are java objects, they can contain methods to self-optimise. Each class contains for instance a method eval which attempts to evaluate a tree at compile time. Another method simplify applies generic machine independent transformations to the code. Thus the simplify method of the class For can perform loop unrolling, removal of redundant loops etc. Other methods allow tree walkers to apply context specific transformations.

iii.                  The resulting ILCG tree (7) is walked over by a class that encapsulates the semantics of the target machine's instruction-set (10); for example Pentium.class. During code generation the tree is further transformed, as machine specific register optimisations are performed. The output of this process is an assembler file (11).

iv.                 This is then fed through an appropriate assembler and linker, assumed to be externally provided to generate an executable program.

Given the declaration var v1,v2,v3:array[1..9]of integer; then the statement v1:=v2+v3;  would first be translated to the ILCG :

{ var  i;

for  i=1 to 9 step 1 do {

v1[^i]:= +(^(v2[^i]),^(v3[^i]));

};

}

The code generator is queried as to the parallelism available on the type int32 and, since it is a Pentium with MMX, returns 2. The loop is then split into two, a portion that can be executed in parallel and a residual sequential component, resulting in the ILCG :

{ var i;

for i=    1 to     8 step     2 do {

(ref int32 vector ( 2 ))mem(+(@v1,*(-(^i,1),4))):=

+(^((ref int32 vector ( 2 ))mem(+(@v2,*(-(^i,1),4)))),

^((ref int32 vector ( 2 ))mem(+(@v3,*(-(^i,1),4)))));

};

for i=    9 to     9 step     1 do {

v1[^i]:= +(^(v2[^i]),^(v3[^i]));

};

}

In the parallel part of the code, the array subscriptions have been replaced by explictly cast memory addresses. This coerces the locations from their original types to the type required by the vectorisation. Applying the simplify method of the For class the following generic transformations are performed:

1.       The second loop is replaced by a single statement.

2.       The parallel loop is unrolled twofold.

3.       The For class is replaced by a sequence of statements with explicit gotos.

The simplified vectorised ILCG tree is passed to an instance of the machine generated Java class Pentium, which performs pattern matching resolution on it to derive the assembly code for the parallel loop ( the residual code is elided):

`           mov DWORD ecx,     1`
`leb4b08729615: `
`           cmp DWORD ecx,      8`
`           jg near  leb4b08729616`
`           lea edi,[  ecx-1] `
`           movq MM1, [  ebp+edi* 4+     -1620]`
`           paddd MM1, [  ebp+edi* 4+     -1640]`
`           movq  [  ebp+edi* 4+     -1600],MM1`
`           lea ecx,[  ecx+     2]`
`           lea edi,[  ecx-1]`
`           movq MM1, [  ebp+edi* 4+     -1620]`
`           paddd MM1, [  ebp+edi* 4+     -1640]`
`           movq  [  ebp+edi* 4+     -1600],MM1`
`           lea ecx,[  ecx+     2]`
`           jmp  leb4b08729615`
`leb4b08729616:`

Figure 1. The translation process

Conclusions

Vector Pascal provides a new approach to providing a programming environment for multi-media instruction sets. It borrows abstraction mechanisms that have a long history of successful use in interpretive programming languages, combining these with modern compiler techniques to target SIMD instruction sets. It provides a uniform source language that can target multiple different processors without the programmer having to think about the target machine. Use of Java as the implementation language aids portability of the compiler across operating systems. Vector Pascal currently runs under Windows98, Windows2000 and Linux. Separate compilation using Turbo Pascal style units is supported. C calling conventions allow use of existing libraries.  Work is underway to port the BLAS library to Vector Pascal, and to develop an IDE and literate programming system for it.

References

 Advanced Micro Devices, 3DNow! Technology Manual, 1999.

 Aho, A.V., Ganapathi, M, TJiang S.W.K., Code Generation Using Tree Matching and Dynamic Programming, ACM Trans, Programming Languages and Systems 11, no.4, 1989, pp.491-516.

 Cattell R. G. G., Automatic derivation of code generators from machine descriptions, ACM Transactions on Programming Languages and Systems, 2(2), pp. 173-190, April 1980.

 Cheong, G., and Lam, M., An Optimizer for Multimedia Instruction Sets, 2nd SUIF Workshop, Stanford University, August 1997.

 Cockshott, Paul, Direct Compilation of High Level Languages for Multi-media Instruction-sets, Department of Computer Science, University of Glasgow, Nov 2000.

 Graham, S.L., Table Driven Code Generation, IEEE Computer, Vol 13, No. 8, August 1980, pp 25..37.

 Intel, Intel Architecture Software Developers Manual Volumes 1 and 2, 1999.

 Intel, Willamette Processor Software Developer's Guide, February, 2000.

 Krall, A., and Lelait, S., Compilation Techniques for Multimedia Processors, International Journal of Parallel Programming, Vol. 28, No. 4, pp 347-361, 2000.

 Leupers, R., Compiler Optimization for Media Processors, EMMSEC 99/Sweden 1999.

 Peleg, A., Wilke S., Weiser U., Intel MMX for Multimedia PCs, Comm. ACM, vol 40, no. 1 1997.

 Srereman, N., and Govindarajan, G., A Vectorizing Compiler for Multimedia Extensions, International Journal of Parallel Programming, Vol. 28, No. 4, pp 363-400, 2000.

Table 2: Pascal Compiler  Performances in Million Ops per Second on vector kernels

 TEST DevP TMT BP 286 DP 4 VP 486 VP K6 unsigned byte additions 71 80 46 166 333 2329 saturated unsigned byte additions 55 57 38 110 225 2329 32 bit integer additions 85 59 47 285 349 635 16 bit integer additions 66 74 39 124 367 1165 real additions 47 10 33 250 367 582 pixel additions 49 46 23 98 188 2330 pixel multiplications 67 14 39 99 158 998 real dot product 47 10 32 161 164 665 integer dot product 79 58 33 440 517 465

Abbreviations DevP - Dev Pascal version 1.9, TMT - TMT Pascal version 3, BP 286 - Borland Pascal compiler with 287 instructions enabled range checks off., DP 4 - Delphi version 4, VP 486 - Vector Pascal targeted at a 486 , VP K6 - Vector Pascal targeted at an AMD K6 . Measurements on a 1Ghz Athlon.