Vector Pascal: an Array Language for Multimedia Code
Paul Cockshott
Imaging Faraday Partnership, University of Glasgow, Glasgow G12 8QQ, Scotland
email: wpc@dcs.gla.ac.uk
Main topics: image processing, Pascal, code generator technology, SIMD
Introduction
The introduction of SIMD instruction sets [7][1][11][8] to Personal Computers potentially provides substantial performance increases, but the ability of most programmers to harness this performance is held back by two factors. The first is the limited availability of compilers that make effective use of these instruction sets in a machine-independent manner. This remains the case despite the research efforts to develop compilers for multimedia instruction sets [4][10][9][12]. The second is the fact that most popular programming languages were designed on the word-at-a-time model of the classic von Neumann computer rather than the SIMD model.
Data parallelism in Vector Pascal
Vector Pascal incorporates Iverson's approach to data parallelism. Its aim is to provide a notation that allows the natural and elegant expression of data parallel algorithms within a base language that is already familiar to a considerable body of programmers, and to combine this with modern compilation techniques.
Assignment maps
Standard Pascal allows assignment of whole arrays. Vector Pascal extends this to allow consistent use of mixed rank expressions on the right hand side of an assignment. Given
r0: real; r1: array[0..7] of real; r2: array[0..7,0..7] of real;
then we can write
r1:= r2[3]; { in standard Pascal }
r1:= 1/2; { 0.5 to each element of r1 }
r2:= r1*3; { 1.5 to every element of r2}
r2[3]:= r2[4]+r1 ; { self explanatory }
The variable on the left hand side of an assignment defines an array context within which expressions on the right hand side are evaluated. Each array context has a rank given by the number of dimensions of the array on the left hand side. A scalar variable has rank 0. Variables occurring in expressions with an array context of rank r must have r or fewer dimensions. The n bounds of any n dimensional array variable, with n ≤ r, occurring within an expression evaluated in an array context of rank r must match the rightmost n bounds of the array on the left hand side of the assignment statement.
Where a variable is of lower rank than its array context, the variable is replicated to fill the array context. Because the rank of any assignment is constrained by the variable on the left hand side, no temporary arrays, other than machine registers, need be allocated to store the intermediate array results of expressions.
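These replication rules can be modelled with a short Python sketch. This is an illustrative model of the assignment-map semantics only, not the compiler's implementation; the function name broadcast_to_rank2 is invented for the example.

```python
# Illustrative model of Vector Pascal's assignment maps (not the compiler).
# A scalar or lower-rank value on the right hand side is replicated to
# fill the array context defined by the left hand side.

def broadcast_to_rank2(value, rows, cols):
    """Replicate a scalar or 1-D list to fill a rows x cols context."""
    if not isinstance(value, list):              # rank 0: a scalar
        return [[value] * cols for _ in range(rows)]
    if not isinstance(value[0], list):           # rank 1: rightmost bounds must match
        assert len(value) == cols, "rightmost bounds must match"
        return [list(value) for _ in range(rows)]
    return value                                 # already rank 2

# r1 := 1/2  -- 0.5 to each element of the rank-1 context r1
r1 = broadcast_to_rank2(0.5, 1, 8)[0]
# r2 := r1*3 -- the rank-1 result is replicated across all 8 rows of r2
r2 = broadcast_to_rank2([x * 3 for x in r1], 8, 8)
print(r1[0], r2[7][3])   # 0.5 1.5
```

Because the context's rank is fixed by the left hand side, replication can be done register by register, which is why no temporary arrays are needed.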
Maps are implicitly and promiscuously defined on both monadic operators and unary functions. If f is a function or unary operator mapping from type r to type t then if x is an array of r then a:=f(x) assigns an array of t such that a[i]=f(x[i]).
Functions can return any data type whose size is known at compile time, including arrays and records. A consistent copying semantics is used.
Operator Reduction
Maps take operators and arrays and deliver array results. The reduction abstraction takes a dyadic operator and an array and returns a scalar result; it is denoted by the functional form \. Thus if a is an array, \+ a denotes the sum over the array. More generally, \F x for some dyadic operator F means x_{0} F (x_{1} F .. (x_{n} F i)) where i is the identity element for the operator and the type. The dot product of two vectors can thus be written as: x := \+ y*z;
Reduction is always performed along the last array dimension of its argument.
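A small Python sketch of this semantics (illustrative only; the helper reduce_last is an invented name) shows the fold along the last dimension and the dot product built from a map followed by a reduction:

```python
from functools import reduce
import operator

# Illustrative model of the \ reduction: fold a dyadic operator along the
# LAST dimension of its argument, starting from the operator's identity,
# i.e. x0 F (x1 F .. (xn F identity)).
def reduce_last(op, identity, a):
    if isinstance(a[0], list):                   # rank > 1: map over leading dims
        return [reduce_last(op, identity, row) for row in a]
    return reduce(op, reversed(a), identity)

y = [1.0, 2.0, 3.0]
z = [4.0, 5.0, 6.0]
# x := \+ y*z  -- elementwise map, then sum-reduction: the dot product
x = reduce_last(operator.add, 0.0, [yi * zi for yi, zi in zip(y, z)])
print(x)   # 32.0
```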
Implicit indices
Assignment maps and reductions involve implicit indices. It can be useful to have access to these.
The syntactic form iota i returns the ith current implicit index. Thus given the definitions
v1: array[1..3] of integer; v2: array[0..4] of integer;
the program fragment v1:=iota 0; v2:=iota 0 *2;
would set v1 and v2 as follows:
v1 = 1 2 3
v2 = 0 2 4 6 8
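The same effect can be sketched in Python (illustrative only): iota 0 stands for the current implicit index of the dimension being assigned, so every element receives an expression of its own index.

```python
# Illustrative model of iota: each element of the assignment context
# receives the value of the expression at its own implicit index.
v1 = [i for i in range(1, 4)]        # v1: array[1..3];  v1 := iota 0
v2 = [i * 2 for i in range(0, 5)]    # v2: array[0..4];  v2 := iota 0 * 2
print(v1)   # [1, 2, 3]
print(v2)   # [0, 2, 4, 6, 8]
```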
trans
The syntactic form trans x transposes a vector, matrix, or tensor. It achieves this by cyclic rotation of the implicit indices. Thus if trans e is evaluated in a context with implicit indices
iota 0.. iota n
then the expression e is evaluated in a context with implicit indices
iota'0.. iota'n
where
iota'x = iota((x+1) mod (n+1))
It should be noted that transposition is generalized to arrays of rank greater than 2.
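For the rank-2 case the index rotation is the familiar matrix transpose, which can be sketched in Python (an illustrative model, not the compiler's mechanism):

```python
# Illustrative model of trans for rank 2: cyclically rotating the two
# implicit indices swaps rows and columns, i.e. the matrix transpose.
def trans(m):
    rows, cols = len(m), len(m[0])
    return [[m[r][c] for r in range(rows)] for c in range(cols)]

m = [[1, 2, 3],
     [4, 5, 6]]
print(trans(m))   # [[1, 4], [2, 5], [3, 6]]
```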
Unary operations
The unary operators supported are +, -, *, /, div, not, round, sqrt, sin, cos, tan, abs, ln, ord, chr, succ, pred and @.
The dyadic operators are all extended to unary context by the insertion of an identity element under the operation. For sets, the notation -s means the complement of the set s.
Dyadic Operations
Dyadic operators supported are +, +:, -:, -, *, /, div, mod, **, pow, <, >, >=, <=, =, <>, shr, shl, and, or, in, min, max. All of these are consistently extended to operate over arrays. The operators ** and pow denote exponentiation and raising to an integer power, as in ISO Extended Pascal. The operators +: and -: provide saturated arithmetic on bytes, as supported by the MMX instruction set.
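Saturated arithmetic clamps results to the representable range rather than wrapping around, as a short Python sketch illustrates (the function names are invented for the example; MMX performs this clamping in hardware):

```python
# Illustrative model of saturated byte arithmetic (+: and -:):
# out-of-range results are clamped, not wrapped, for unsigned bytes.
def sat_add_u8(a, b):
    return min(a + b, 255)

def sat_sub_u8(a, b):
    return max(a - b, 0)

print(sat_add_u8(200, 100))   # 255, not the wraparound result 44
print(sat_sub_u8(10, 30))     # 0, not the wraparound result 236
```

Saturation is what makes these operators suitable for pixel arithmetic, where wraparound would turn a slightly-too-bright pixel into a dark one.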
Performance
As an example of Vector Pascal we will look at an image filtering algorithm: applying a separable 3 element convolution kernel to an image. We shall initially present the algorithm in standard Pascal and then look at how one might re-express it in Vector Pascal.
Convolution of an image by a matrix of real numbers can be used to smooth or sharpen an image, depending on the matrix used. If A is an input image and K a convolution matrix, then the convolved image B is given by

B_{y,x} = Σ_i Σ_j A_{y+i,x+j} K_{i,j}

where i, j range over the kernel.
A separable convolution kernel is a vector of real numbers that can be applied independently to the rows and columns of an image to provide filtering. It is a specialisation of the more general convolution matrix, but is algorithmically more efficient to implement. If k is a convolution vector, then the corresponding matrix K is such that K_{i,j} = k_{i}k_{j} .
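The equivalence between the full matrix K and the two one-dimensional passes can be checked numerically with a small Python sketch (illustrative only; a single interior output value over a 3x3 patch):

```python
# Numerical check that a separable kernel k gives the same result as the
# full matrix K with K[i][j] = k[i]*k[j]: convolving rows with k and then
# columns with k multiplies each term A[i][j] by k[i]*k[j].
k = [0.25, 0.5, 0.25]
K = [[ki * kj for kj in k] for ki in k]

A = [[1.0, 2.0, 3.0],      # a 3x3 image patch
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]

# Full 2-D convolution at the centre pixel:
full = sum(K[i][j] * A[i][j] for i in range(3) for j in range(3))

# Separable version: vertical pass, then horizontal pass:
rows = [sum(k[i] * A[i][j] for i in range(3)) for j in range(3)]
separable = sum(k[j] * rows[j] for j in range(3))

print(abs(full - separable) < 1e-12)   # True
```

The efficiency gain is that the separable form needs 2n multiplications per pixel instead of n² for an n-element kernel.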
Given a starting image A as a two dimensional array of pixels, and a three element kernel c_{1}, c_{2}, c_{3}, the algorithm first forms a temporary array T whose elements are the weighted sum of adjacent rows: T_{y,x} = c_{1}A_{y-1,x} + c_{2}A_{y,x} + c_{3}A_{y+1,x}. Then in a second phase it sets the original image to be the weighted sum of the columns of the temporary array: A_{y,x} = c_{1}T_{y,x-1} + c_{2}T_{y,x} + c_{3}T_{y,x+1}. Clearly the outer edges of the image are a special case, since the convolution is defined over the neighbours of a pixel, and the pixels along the boundaries are missing one neighbour. A number of solutions are available for this, but for simplicity we will perform only vertical convolutions on the left and right edges and horizontal convolutions on the top and bottom lines of the image. The Standard Pascal implementation of the convolution was:
type
  pixel = -128..127;
  tplain = array[0..maxpix, 0..maxpix] of pixel;
procedure conv(var theim: tplain; c1, c2, c3: real);
var tim: array[0..maxpix, 0..maxpix] of pixel;
    temp: real;
    i, j: integer;
begin
  for i := 1 to maxpix-1 do
    for j := 0 to maxpix do begin
      temp := theim[i-1][j]*c1 + theim[i][j]*c2 + theim[i+1][j]*c3;
      if temp > 127 then temp := 127 else
      if temp < -128 then temp := -128;
      tim[i][j] := round(temp);
    end;
  for j := 0 to maxpix do begin
    tim[0][j] := theim[0][j]; tim[maxpix][j] := theim[maxpix][j];
  end;
  for i := 0 to maxpix do begin
    for j := 1 to maxpix-1 do begin
      temp := tim[i][j-1]*c1 + tim[i][j+1]*c3 + tim[i][j]*c2;
      if temp > 127 then temp := 127 else
      if temp < -128 then temp := -128;
      theim[i][j] := round(temp);
    end;
    theim[i][0] := tim[i][0]; theim[i][maxpix] := tim[i][maxpix];
  end;
end;
The pixel data type has to be explicitly introduced as the subrange -128..127. Explicit checks have to be in place to prevent range errors, since the result of a convolution may, depending on the kernel used, be outside the bounds of valid pixels. Arithmetic is done in floating point and then rounded.
The Vector Pascal implementation of the convolution was:
procedure pconv(var theim: tplain; c1, c2, c3: real);
var tim: array[0..maxpix, 0..maxpix] of pixel;
    p1, p2, p3: array[0..maxpix] of pixel;
begin
  p1 := c1; p2 := c2; p3 := c3;
  tim[1..maxpix-1] := theim[0..maxpix-2]*p1 +
                      theim[1..maxpix-1]*p2 + theim[2..maxpix]*p3;
  tim[0] := theim[0]; tim[maxpix] := theim[maxpix];
  theim[][1..maxpix-1] := tim[][0..maxpix-2]*p1 +
                          tim[][2..maxpix]*p3 + tim[][1..maxpix-1]*p2;
  theim[][0] := tim[][0]; theim[][maxpix] := tim[][maxpix];
end;
Note that all explicit loops disappear in this version, being replaced by assignments of array slices. The first line of the algorithm initialises three vectors p1, p2, p3 of pixels to hold replicated copies of the kernel coefficients c1, c2, c3 in fixed point format. These vectors are then used to multiply rows of the image to build up the convolution. The notation theim[][1..maxpix-1] denotes columns 1..maxpix-1 of all rows of the image. Because the built-in pixel data type is used, all range checking is handled by the compiler. The type pixel is a fixed point signed binary fraction 1 > p >= -1 held to 8 bit precision. Since fixed point arithmetic is used throughout, there will be slight rounding errors not encountered with the previous algorithm, but these are acceptable in most image processing applications. Fixed point pixel arithmetic has the advantage that it can be efficiently implemented in parallel using multimedia instructions.
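The structure of the slice assignments can be mimicked in plain Python on a one-dimensional signal. This is only a sketch of the slicing idea, using ordinary floats rather than Vector Pascal's saturating fixed-point pixels, and the function name conv3 is invented for the example:

```python
# Illustrative 1-D analogue of pconv's slice assignments: the weighted
# sum of three shifted slices replaces the explicit inner loop, and the
# edge elements are copied through unchanged.
def conv3(signal, c1, c2, c3):
    n = len(signal)
    out = list(signal)                       # edges copied unchanged
    out[1:n-1] = [c1 * a + c2 * b + c3 * c
                  for a, b, c in zip(signal[0:n-2], signal[1:n-1], signal[2:n])]
    return out

print(conv3([0.0, 4.0, 0.0, 4.0, 0.0], 0.25, 0.5, 0.25))
# [0.0, 2.0, 2.0, 2.0, 0.0]
```

Each output element depends only on the original signal, never on already-updated neighbours, which is what makes the slice form safe to vectorise.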
It is clear that the data parallel implementation is somewhat more concise than the sequential one: 12 lines with 505 characters compared to 26 lines with 952 characters. It also runs considerably faster, as shown in table 1, which expresses the performance of different implementations in millions of effective arithmetic operations per second. It is assumed that the basic algorithm requires 6 multiplications and 6 additions per pixel processed. The data parallel algorithm runs 12 times faster than the serial one when both are compiled using Vector Pascal and targeted at the MMX instruction set. pconv also runs a third faster than conv when it is targeted at the 486 instruction set, which in effect serialises the code. For comparison conv was run on other Pascal compilers. It can be seen that Vector Pascal is roughly comparable in speed to other compilers on sequential code, but when using the MMX instruction set it performs roughly 10 times faster. Further performance comparisons are given in table 2. The tests here involve vector arithmetic on vectors of length 640 and take the general form v1 = v2 f v3 for some operator f and some vectors v1, v2, v3, the exception being the dot product operation coded as
r:= \ + r2*r3
in Vector Pascal, and using conventional loops for the other compilers. Even targeted on a 486, the performance of the Vector Pascal compiler for vector arithmetic compares very favourably with that of other compilers. When the target machine is a K6, which incorporates both the MMX and the 3DNow SIMD instruction sets, the acceleration is broadly in line with the degree of parallelism offered by the architecture. For data types where saturated arithmetic is used, the acceleration is most marked: a 21-fold acceleration is achieved for saturated unsigned byte additions and a 23-fold acceleration on pixel additions compared to the best commercial Pascal compiler tested. All measurements were on a 1GHz Athlon.
Table 1: Comparative Compiler Performances on Convolution

Implementation  Compiler        Target Processor  Million Ops Per Second
conv            Vector Pascal   Pentium + MMX       61
conv            Borland Pascal  286 + 287            5.5
conv            Delphi 4        486                 86
conv            DevPascal       486                 62
pconv           Vector Pascal   486                 80
pconv           Vector Pascal   Pentium + MMX      817
Implementation
The Vector Pascal compiler is implemented in Java. It uses the ILCG [5] (Intermediate Language for Code Generators) portable code generator generator system. A Vector Pascal program is translated into an ILCG abstract semantic tree implemented as a Java data structure. The tree is passed to a machine generated Java class corresponding to the code generator for the target machine. Output is assembler code which is assembled using the NASM assembler and linked using the gcc loader.
Code generators in Java are produced from formal specifications of the semantics of the instruction sets. Code generator classes currently exist for the Intel 486, Pentium with MMX, P3, P4 and also the AMD K6. The code generators follow the pattern matching approach described in [2], [3] and [6], and are automatically generated from machine specifications written in ILCG. Such a specification describes the syntax and semantics of an instruction set as a collection of patterns, for instance:
pattern regindirf(reg r) means[ (r) ] assembles[ r ];
pattern baseplusoffsetf(reg r, signed s) means[+( (r) ,const s)]
assembles[ r '+' s ];
pattern addrform means[baseplusoffsetf | regindirf];
pattern maddrmode(addrform f) means[mem(f) ] assembles[ '[' f ']' ];
instruction pattern PADDD(mreg m, maddrmode ma)
means[(ref int32 vector(2)m:=
(int32 vector(2))+( (int32 vector(2))(m),(int32 vector(2))(ma)))]
assembles ['paddd 'm ',' ma];
Here vector casts are used to specify that the result register will hold the type int32 vector(2), and to constrain the types of the arguments and results of the + operator. The ILCG semantics for such vector patterns have the concept of vector maps built in.
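The pattern matching idea can be sketched in Python (a deliberately tiny, illustrative matcher; the tuple shapes stand in for ILCG trees and are invented for the example):

```python
# Illustrative sketch of pattern-matched instruction selection: recognise
# an assign-of-add-of-two-vector-loads tree and emit a movq/paddd pair.
# Tree shape (hypothetical): ('assign', dst, ('add', ('mem', a), ('mem', b)))
def select(tree):
    if (tree[0] == 'assign' and tree[2][0] == 'add'
            and tree[2][1][0] == 'mem' and tree[2][2][0] == 'mem'):
        dst, (_, a), (_, b) = tree[1], tree[2][1], tree[2][2]
        return [f"movq {dst}, [{a}]",       # load first vector operand
                f"paddd {dst}, [{b}]"]      # packed 32-bit add from memory
    raise ValueError("no matching pattern")

code = select(('assign', 'MM1', ('add', ('mem', 'ebp+edi*4-1620'),
                                        ('mem', 'ebp+edi*4-1640'))))
print(code)
```

A real code generator matches many such patterns with costs and register constraints; this sketch only shows the shape-driven selection step.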
Selection of target machines is by a compile time switch, which causes the appropriate code generator class to be dynamically loaded. ILCG is a strongly typed language which supports vector data types and the mapping of operators over vectors. It is well suited to describing SIMD instruction sets. The code generator classes export from their interfaces details about the degree of parallelism supported for each data type. This is used by the front end compiler to iterate over arrays longer than those supported by the underlying machine. Where the supported parallelism is unitary, this defaults to iteration over the whole array. The translation process, shown in figure 1, goes through the following phases:
i. The source file (1) is parsed by a Java class PascalCompiler.class (2), a hand-written recursive descent parser, and results in a Java data structure (3), an ILCG tree, which is essentially a semantic tree for the program.
ii. The resulting tree is transformed (4) from sequential to parallel form and machine independent optimisations are performed. Since ILCG trees are Java objects, they can contain methods to self-optimise. Each class contains, for instance, a method eval which attempts to evaluate a tree at compile time. Another method, simplify, applies generic machine independent transformations to the code. Thus the simplify method of the class For can perform loop unrolling, removal of redundant loops, etc. Other methods allow tree walkers to apply context specific transformations.
iii. The resulting ILCG tree (7) is walked over by a class that encapsulates the semantics of the target machine's instruction set (10), for example Pentium.class. During code generation the tree is further transformed, as machine specific register optimisations are performed. The output of this process is an assembler file (11).
iv. This is then fed through an appropriate assembler and linker, assumed to be externally provided, to generate an executable program.
Given the declaration var v1,v2,v3:array[1..9]of integer; then the statement v1:=v2+v3; would first be translated to the ILCG :
{ var i;
for i=1 to 9 step 1 do {
v1[^i]:= +(^(v2[^i]),^(v3[^i]));
};
}
The code generator is queried as to the parallelism available on the type int32 and, since it is a Pentium with MMX, returns 2. The loop is then split into two, a portion that can be executed in parallel and a residual sequential component, resulting in the ILCG :
{ var i;
for i= 1 to 8 step 2 do {
(ref int32 vector ( 2 ))mem(+(@v1,*(-(^i,1),4))):=
+(^((ref int32 vector ( 2 ))mem(+(@v2,*(-(^i,1),4)))),
^((ref int32 vector ( 2 ))mem(+(@v3,*(-(^i,1),4)))));
};
for i= 9 to 9 step 1 do {
v1[^i]:= +(^(v2[^i]),^(v3[^i]));
};
}
In the parallel part of the code, the array subscriptions have been replaced by explicitly cast memory addresses. This coerces the locations from their original types to the type required by the vectorisation. Applying the simplify method of the For class, the following generic transformations are performed:
1. The second loop is replaced by a single statement.
2. The parallel loop is unrolled twofold.
3. The For class is replaced by a sequence of statements with explicit gotos.
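The loop-splitting step can be sketched in Python (an illustrative model of the transformation, not the compiler's code; split_loop is an invented name): a length-9 int32 loop on a machine with 2-way int32 parallelism yields a stride-2 parallel part and a one-element residue.

```python
# Illustrative sketch of the vectorising loop split: given a trip count n
# and the machine's vector width for the element type, produce the start
# indices of the parallel (stride = width) part and the scalar residue.
def split_loop(n, width):
    par_end = (n // width) * width
    return list(range(0, par_end, width)), list(range(par_end, n))

par, res = split_loop(9, 2)
print(par)   # [0, 2, 4, 6] -- each iteration processes `width` elements
print(res)   # [8]          -- leftover element handled sequentially
```

Unrolling then pairs up consecutive parallel iterations, which is why the generated assembler below contains two movq/paddd/movq groups per trip round the loop.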
The simplified vectorised ILCG tree is passed to an instance of the machine generated Java class Pentium, which performs pattern matching resolution on it to derive the assembly code for the parallel loop (the residual code is elided):
      mov DWORD ecx, 1
leb4b08729615:
      cmp DWORD ecx, 8
      jg near leb4b08729616
      lea edi, [ecx-1]
      movq MM1, [ebp+edi*4-1620]
      paddd MM1, [ebp+edi*4-1640]
      movq [ebp+edi*4-1600], MM1
      lea ecx, [ecx+2]
      lea edi, [ecx-1]
      movq MM1, [ebp+edi*4-1620]
      paddd MM1, [ebp+edi*4-1640]
      movq [ebp+edi*4-1600], MM1
      lea ecx, [ecx+2]
      jmp leb4b08729615
leb4b08729616:
Figure 1: The translation process
Conclusions
Vector Pascal provides a new approach to providing a programming environment for multimedia instruction sets. It borrows abstraction mechanisms that have a long history of successful use in interpretive programming languages, combining these with modern compiler techniques to target SIMD instruction sets. It provides a uniform source language that can target multiple different processors without the programmer having to think about the target machine. Use of Java as the implementation language aids portability of the compiler across operating systems. Vector Pascal currently runs under Windows 98, Windows 2000 and Linux. Separate compilation using Turbo Pascal style units is supported. C calling conventions allow use of existing libraries. Work is underway to port the BLAS library to Vector Pascal, and to develop an IDE and literate programming system for it.
References
[1] Advanced Micro Devices, 3DNow! Technology Manual, 1999.
[2] Aho, A.V., Ganapathi, M., Tjiang, S.W.K., Code Generation Using Tree Matching and Dynamic Programming, ACM Trans. Programming Languages and Systems, 11(4), 1989, pp. 491-516.
[3] Cattell, R.G.G., Automatic Derivation of Code Generators from Machine Descriptions, ACM Transactions on Programming Languages and Systems, 2(2), pp. 173-190, April 1980.
[4] Cheong, G., and Lam, M., An Optimizer for Multimedia Instruction Sets, 2nd SUIF Workshop, Stanford University, August 1997.
[5] Cockshott, Paul, Direct Compilation of High Level Languages for Multimedia Instruction Sets, Department of Computer Science, University of Glasgow, Nov 2000.
[6] Graham, S.L., Table Driven Code Generation, IEEE Computer, 13(8), August 1980, pp. 25-37.
[7] Intel, Intel Architecture Software Developer's Manual, Volumes 1 and 2, 1999.
[8] Intel, Willamette Processor Software Developer's Guide, February 2000.
[9] Krall, A., and Lelait, S., Compilation Techniques for Multimedia Processors, International Journal of Parallel Programming, 28(4), pp. 347-361, 2000.
[10] Leupers, R., Compiler Optimization for Media Processors, EMMSEC 99, Sweden, 1999.
[11] Peleg, A., Wilkie, S., Weiser, U., Intel MMX for Multimedia PCs, Comm. ACM, 40(1), 1997.
[12] Sreraman, N., and Govindarajan, R., A Vectorizing Compiler for Multimedia Extensions, International Journal of Parallel Programming, 28(4), pp. 363-400, 2000.
Table 2: Pascal Compiler Performances in Million Ops per Second on vector kernels

                                   DevP  TMT  BP 286  DP 4  VP 486  VP K6
unsigned byte additions              71   80      46   166     333   2329
saturated unsigned byte additions    55   57      38   110     225   2329
32 bit integer additions             85   59      47   285     349    635
16 bit integer additions             66   74      39   124     367   1165
real additions                       47   10      33   250     367    582
pixel additions                      49   46      23    98     188   2330
pixel multiplications                67   14      39    99     158    998
real dot product                     47   10      32   161     164    665
integer dot product                  79   58      33   440     517    465
Abbreviations: DevP - Dev Pascal version 1.9; TMT - TMT Pascal version 3; BP 286 - Borland Pascal compiler with 287 instructions enabled, range checks off; DP 4 - Delphi version 4; VP 486 - Vector Pascal targeted at a 486; VP K6 - Vector Pascal targeted at an AMD K6. Measurements on a 1GHz Athlon.