These notes report our work on the SICSA multicore challenge.
Input: a text file containing English text in ASCII encoding, and an integer N.
Output: for every sequence of words of length up to N occurring in the input
file, the number of occurrences of that sequence in the text, together with a
list of its start indices. Optionally, sequences with only one occurrence may
be omitted. For example, with N = 1 the word "the" occurs twice in the text
"the cat sat on the mat", so it would be reported with a count of 2 and its
two start positions.
The first step was to set up a good sequential version. I was
initially doubtful that this challenge would provide an effective
basis for parallelisation because it seemed such a simple problem.
Intuitively it seems like a problem that is likely to be of either
linear or at worst log-linear complexity, and for such problems,
especially ones involving text files, the time taken to read in the
file and print out the results can easily come to dominate the total
time taken. If a problem is disk bound, then there is little advantage
in expending effort to run it on multiple cores.
However, that was only a hypothesis and needed to be verified by
experiment.
In line with our school motto of programming to an interface not an
implementation, the interface above rather than the Haskell
implementation was chosen as the starting point. In order to get a bog
standard implementation, C was chosen as the implementation language.
The source code is linked to below.
The initial hypothesis was confirmed. The serial C implementation runs
sufficiently fast that there is no plausible reason to seek to
accelerate it by parallelisation.
It is significantly faster than the Haskell version. This is not
surprising, as one would expect C to be more efficient.
It appears to have a lower complexity order than the Haskell
version. This would indicate that the Haskell version is not a good
starting point for others attempting to produce a fast parallel version.
Its run time is dominated by the time to output the results to
file. This is shown by comparing the run times with printing enabled to
those with printing disabled.
The test files provided in the initial proposal were too short to
get a realistic estimate of performance, although they are useful for
programme debugging. We thus did our main tests using the World English
Bible, a link to whose source is given below.
The Linux implementation runs substantially faster than the Windows one on
the same processor, with gcc used in both cases. Probably file i/o
is worse under Windows.
The concordance problem is hard to parallelise efficiently. One cannot
simply split a book into two halves, prepare a concordance for each half
and then merge them, since we only have to print out repeated words and
phrases: a repeated word would be missed in this case if it was
mentioned once in the first half and once in the second half. A more
complicated approach is needed. We have chosen to parallelise by
getting all threads to read the entire book, since reading turns out to
be relatively fast. The words themselves are then divided into disjoint
sets; one obvious split would be into 26 sets on the first letter. We
then allocate each thread to do the concordance for a disjoint subset
of the words. A large part of the time is also taken up with output:
the printed concordance can be 5 times as large as the input file. If
distinct cores are producing this, there is an inevitable serial phase
in which the outputs of the different cores are merged into a single
file.
First Parallel Experiment
As a first parallel experiment a dual core version of the C programme
was produced using the pthreads library and it was tested on the same
dual processor machine as the original serial version of the algorithm.
The following conclusions are derived from the results shown above:
There was no gain from multithreading on Windows. It looks as if
the pthreads library under Windows simply multithreads operations on a
single core rather than using both cores.
On Linux there was a small gain in performance due to
multithreading - about 17% faster in elapsed time using 2 cores.
Since a large part of the program execution is spent writing the
results out, this poses a challenge for multicore. The first parallel
version adopted the strategy of allowing each thread to write its part
of the results to a different file. It is probably a good idea to
investigate pre-buffering the formatted output in memory before writing
to disk, as there is a danger of enforced serialisation during the
system calls to write parts of the results to disk if the individual
threads each write a small part of the data.
Second Parallel Experiment
This follows the same basic strategy as the previous one, but uses the
shell, rather than pthreads to fork parallel processes off and
communicates via files using the following command:
>WEB1.con &
wait
cat WEB1.con
This has the best performance of the lot, as seen in the attached results.
We have also tried the programme out on the Intel SCC.
The SCC is configured as a host processor, which is a conventional
modern Intel x86 chip. Attached to this is the experimental 48-core SCC
chip, each of whose cores runs a discrete copy of Linux. A major worry
here was the problem of file i/o for the multiple cores. The source
file and the output files were placed on a shared NFS system, and
accessed from there. Looking at the time for one core doing the full
task one can see that it is much slower than a single core on the host
doing the same task. It is unclear how much of this slowdown is due to
the slow access to files from the daughter copies of Linux and how much
is due to the poorer performance of the individual cores on the SCC.
The top 3 lines of the table show the effects of trying to do
portions of the workload in an individual core.
Timings of the SCC doing the concordance test
(elapsed time in seconds)
1 core doing the full concordance
1 core doing half the concordance
1 core doing 1/8 of the concordance
2 cores doing half each
8 cores doing 1/8 each
32 cores doing 1/32 each
host processor doing it all
host processor using 2 cores
We dispatched 32 tasks using the pssh command as follows:
# shell script to run on host to run
# concordance on 32 cores
rm -f /shared/stdout/*
pssh -t 800 -h hosts32 -o /shared/stdout /shared/sccConcordance32
cat /shared/stdout/* | sort > WEB.con
The first line removes any temporary output from a previous run. We
then use pssh to run the script sccConcordance32 in a shared directory,
sending the output to the /shared/stdout directory.
When all tasks have run, the output from all the tasks is concatenated
and sorted to yield the final file.
The script sccConcordance32 invokes the actual concordance task:
./l1concord WEB.txt 4 P 31 $(hostname)
The hostname is used to derive a process id, which is then used to
select which words will be handled by the task. The 4th parameter to
l1concord is the mask that is applied to give the number of significant
bits in the process id, 5 in this case.
It is clear that on a file i/o bound task like this, the SCC is at a
disadvantage.
This is the serial program I tested.
USAGE concordance textfile N [P]
The optional P indicates that printing is to be switched on.
concordance2.c: this is a dual-core
version of the above programme; it generates its output in two files
which later have to be concatenated.
l1concordance.c: this is a version
of the programme that is designed to be run from the shell on two
cores. This is also the version run on the SCC.
This is the World English Bible that I used as test data.
A hash table holds an entry for each unique word encountered in the file.
Strings were tokenised using a finite state recogniser. The prefix trees
were a modified form of trie,
with sorting at each level by hashcode.
The algorithm makes four passes:
1. Read the file into a buffer.
2. Produce a tokenised version of the buffer.
3. Build the hash table and prefix trees.
4. If the concordance is to be printed out, perform a traversal of
the trees, printing out the word sequences in the format suggested by
the Haskell version.