PDHRS Abstracts
Satnam Singh, Microsoft Research Cambridge
TBA
Mohamed Menaa, Birmingham University
Compositional Round Abstraction
Round abstraction is a technique devised by Alur and Henzinger within their specification language "Reactive Modules". It allows an arbitrary number of consecutive computational steps (rounds) to be viewed as a single macro-step.
We revisit this idea in a new light, as a solution to the problem of building low-latency synchronous systems from asynchronous specifications. Our approach is informed by game semantics and aims at studying the compositionality of round-abstracted systems.
We use a trace-semantic setting akin to Abramsky’s Interaction Categories, which is also a generalisation of pointer-free game semantic models. We define partial and total correctness for round abstraction relative to composition and note that in its most general case, round abstracted behaviours do not compose well. We then identify sufficient conditions to guarantee totally correct composition for round abstraction when applied to asynchronous behaviours. We apply this procedure to Ghica's Geometry of Synthesis, a technique for compiling higher-order imperative programming languages into digital circuits via game semantics, demonstrating how round abstraction is used to reduce latency.
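As a rough illustration of the macro-step view described above, the sketch below collapses a trace of micro-steps into rounds. The trace encoding and the `is_round_end` predicate are assumptions made for illustration only; they are not taken from Reactive Modules or from the talk.

```python
from typing import Callable, Iterable, List, TypeVar

Step = TypeVar("Step")


def abstract_rounds(
    trace: Iterable[Step], is_round_end: Callable[[Step], bool]
) -> List[List[Step]]:
    """Group consecutive micro-steps into macro-steps.

    A macro-step closes at each step for which is_round_end returns True.
    Trailing steps that never reach a round end are returned as a final,
    incomplete macro-step.
    """
    rounds: List[List[Step]] = []
    current: List[Step] = []
    for step in trace:
        current.append(step)
        if is_round_end(step):
            rounds.append(current)
            current = []
    if current:
        rounds.append(current)
    return rounds


# Hypothetical handshake trace: each "ack" closes a round.
trace = ["req", "compute", "ack", "req", "ack"]
macro_steps = abstract_rounds(trace, lambda s: s == "ack")
```

Here an observer of `macro_steps` sees two macro-steps instead of five micro-steps, which is the sense in which latency is reduced.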
Alastair Donaldson, Oxford University
Automatic Analysis of Scratch-Pad Memory Code for Heterogeneous Multicore Processors
Modern multicore processors, such as Cell, achieve high performance by equipping accelerator cores with scratch-pad memories. This comes at the expense of programming complexity: the programmer must manually orchestrate data movement using error-prone direct memory access (DMA) operations. We show how formal verification techniques based on Bounded Model Checking and k-induction can be used to detect DMA races in multicore software, or to prove their absence.
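To make the notion of a DMA race concrete, here is a minimal sketch that scans a single recorded sequence of operations. It is a plain trace check, not Bounded Model Checking or k-induction, and the `Dma` record, the `("wait", tag)` encoding and the local-store-only memory model are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Dma:
    kind: str   # "get" writes the local store, "put" reads it (assumed model)
    start: int  # first local-store byte touched
    size: int   # number of bytes transferred
    tag: int    # tag later used to wait for completion


Op = Union[Dma, Tuple[str, int]]  # a Dma, or ("wait", tag)


def overlaps(a: Dma, b: Dma) -> bool:
    """True if the two local-store byte ranges intersect."""
    return a.start < b.start + b.size and b.start < a.start + a.size


def find_races(ops: List[Op]) -> List[Tuple[Dma, Dma]]:
    """Return pairs of simultaneously pending, overlapping transfers
    where at least one of the two writes the local store."""
    pending: List[Dma] = []
    races: List[Tuple[Dma, Dma]] = []
    for op in ops:
        if isinstance(op, tuple):
            # ("wait", tag): all transfers with this tag have completed.
            pending = [p for p in pending if p.tag != op[1]]
            continue
        for p in pending:
            if overlaps(p, op) and "get" in (p.kind, op.kind):
                races.append((p, op))
        pending.append(op)
    return races


ops = [
    Dma("get", 0, 128, tag=1),
    Dma("put", 64, 128, tag=2),  # overlaps the pending get: a race
    ("wait", 1),
    Dma("get", 0, 64, tag=3),    # the first get has completed: no race
]
races = find_races(ops)
```

A verifier would have to establish the same property for every possible execution, which is what makes the model-checking formulation necessary.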
Luke Cartey, Oxford University
Implementing a Domain Specific Language for Hidden Markov Models on GPUs
GPUs can provide an impressive performance boost for suitable application areas. However, the time and skill required for implementation are often prohibitive for the domain-focused user. In this talk we will discuss the implementation of a domain specific language for one such area (HMMs), and the wider implications.
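For readers unfamiliar with the domain, the forward algorithm below is a typical HMM computation. It is a plain sequential sketch with made-up parameters, not the DSL from the talk and not GPU code.

```python
from typing import List, Sequence


def forward(
    init: Sequence[float],
    trans: Sequence[Sequence[float]],
    emit: Sequence[Sequence[float]],
    observations: Sequence[int],
) -> float:
    """Likelihood of a non-empty observation sequence under an HMM.

    init[s]     probability of starting in state s
    trans[s][t] probability of moving from state s to state t
    emit[s][o]  probability of emitting symbol o in state s
    """
    n = len(init)
    alpha: List[float] = [init[s] * emit[s][observations[0]] for s in range(n)]
    for obs in observations[1:]:
        # Each new entry depends only on the previous alpha vector, so the
        # entries for different states t can be computed independently.
        alpha = [
            emit[t][obs] * sum(alpha[s] * trans[s][t] for s in range(n))
            for t in range(n)
        ]
    return sum(alpha)


# Hypothetical two-state, two-symbol model.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
likelihood = forward(init, trans, emit, [0, 1, 0])
```

The independence of the per-state updates inside each step is the kind of data parallelism that makes such algorithms candidates for GPU execution.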
Ross Mcilroy, Microsoft Research Cambridge
Hera-JVM: A Runtime System for Heterogeneous Multi-Core Architectures
Heterogeneous multi-core architectures (HMAs), such as the Cell processor in the PlayStation 3, incorporate different core types on a single CPU. These HMAs have the potential to significantly increase application performance; however, they are notoriously difficult to exploit effectively, and thus their use is currently restricted to specialist domains.
This talk will present Hera-JVM, a runtime system aimed at enabling non-specialist programmers to exploit HMAs. Hera-JVM supports the execution of standard multi-threaded Java applications on the disparate core types of the Cell processor. The runtime system deals with migration of threads between cores with different instruction sets and the movement of data between different levels of the explicit memory hierarchy in the Cell processor.
George Russell, Codeplay
Programming Heterogeneous Multicore Systems using Threading Building Blocks
Intel’s Threading Building Blocks (TBB) provide a high-level abstraction which allows programmers to express parallelism in their applications without having to write multi-threaded code. However, TBB is only available for shared-memory, homogeneous multicore processors. Codeplay’s Offload C++ provides a single-source, POSIX threads-like approach to programming heterogeneous multicore devices where cores are equipped with private, local memories - code to move data between memory spaces is generated automatically. In this paper, we show that the strengths of TBB and Offload C++ can be combined, by implementing part of the TBB headers in Offload C++. This allows applications written using TBB to run, without source-level modifications, across the PPE and SPE cores of the Cell BE processor. We present experimental results applying our method to a set of TBB programs. To our knowledge, this work marks the first demonstration of TBB for a heterogeneous multicore architecture.
Ali Ahmadinia, Caledonian University
High Level Modelling and Automated Generation of Heterogeneous SoC Architectures with Optimized Custom Reconfigurable Cores and On-Chip Communication Media
In this talk, we propose a framework for modelling and automated generation of heterogeneous SoC architectures, with emphasis on reconfigurable component integration and optimized communication media. To facilitate rapid development of SoC architectures, we present communication-centric platforms for data-intensive applications, high-level modelling of reconfigurable components for quick simulation, and a tool for generating complete SoC architectures. Four communication-centric platforms are proposed, based on traditional bus, crossbar, hierarchical bus and novel hybrid communication media, to cater for the different communication requirements of future SoC architectures. A multi-standard telecommunication application is chosen as the target application domain. To demonstrate the effectiveness of our approach, the behaviour of the system with each communication platform is analyzed for its throughput and power characteristics under different reconfiguration scenarios.
Paul Keir, Glasgow University
A Parallel Array Compiler for the Cell Broadband Engine
Heterogeneous multicore systems are increasingly applied to scientific computing. However, good performance requires optimised temporal and spatial data locality. Array languages encourage users to engineer using elements which are inherently divisible, and often parallelisable. This talk will describe extensions to a Fortran array language, and their implementation within a new CellBE compiler.
Thomas Perry, Xilinx
A dataflow framework for Xilinx system designs
With the recent announcement by Xilinx of their new "Extensible Processing Platform" which integrates high-performance ARM cores and FPGA fabric, comes the question: "that's great, but how do we program it?"
One option is to adapt existing software design methodologies by, for example, transforming sequential C code to highly parallel HDL, but parallelising sequential code tends to be challenging.
What if, instead, we were to make use of existing libraries of highly optimised "IP cores" and design our software from the outset as a sequence of independent tasks that are driven by the arrival of data messages on well-defined interfaces? Then, united by a common model of computation, our software components and FPGA cores are interchangeable and can be interconnected in a heterogeneous system as 'equals'.
In this talk, I'll discuss my work on:
a framework that provides the scheduling, communication interfaces and data types that allow software components to interact according to the semantics of dataflow actors;
a system for automatically generating dataflow wrappers for existing software, driven by a high-level XML description of an actor's ports and the data types that are sent over them;
a (near-future) demonstration of this methodology through the production of an LTE telecommunications system.
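The firing rule behind the dataflow actors listed above can be sketched in a few lines. The `Actor` class, its port numbering and the `run` scheduler are hypothetical names invented for this illustration; they are not the framework's API.

```python
from collections import deque
from typing import Callable, List, Tuple


class Actor:
    """A dataflow actor that fires when every input port holds a token."""

    def __init__(self, name: str, func: Callable, n_inputs: int) -> None:
        self.name = name
        self.func = func
        self.inputs = [deque() for _ in range(n_inputs)]
        self.outputs: List[Tuple["Actor", int]] = []

    def connect(self, target: "Actor", port: int) -> None:
        """Route this actor's result to the given input port of target."""
        self.outputs.append((target, port))

    def can_fire(self) -> bool:
        # Actors without inputs never fire in this simplified model.
        return bool(self.inputs) and all(self.inputs)

    def fire(self) -> None:
        # Consume one token per input port, compute, forward the result.
        args = [queue.popleft() for queue in self.inputs]
        result = self.func(*args)
        for target, port in self.outputs:
            target.inputs[port].append(result)


def run(actors: List[Actor]) -> None:
    """Fire ready actors repeatedly until no actor can fire."""
    fired = True
    while fired:
        fired = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                fired = True


results: List[int] = []
adder = Actor("add", lambda x, y: x + y, 2)
doubler = Actor("double", lambda x: 2 * x, 1)
sink = Actor("sink", results.append, 1)
adder.connect(doubler, 0)
doubler.connect(sink, 0)

adder.inputs[0].extend([1, 2])
adder.inputs[1].extend([10, 20])
run([adder, doubler, sink])
```

Because an actor sees only the tokens on its ports, any one of these actors could in principle be replaced by a hardware core with the same interface, which is the interchangeability the abstract refers to.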
John O'Donnell, Glasgow University
Supporting Functional Arrays with Reconfigurable Hardware
Pure functional arrays are hard to implement efficiently on a von Neumann architecture. Even in functional languages, the usual approach to handling arrays is to make them single threaded, using monads or unique types, which allows imperative arrays to be used instead of functional ones. However, an alternative approach is a hardware memory architecture that uses massive parallelism to implement functional arrays with unit time access. A custom VLSI implementation of such a memory is not cost effective, but current reconfigurable chips provide a practical alternative. The functional array architecture will be presented, as well as a discussion of work underway on implementing it using GPU and FPGA chips.
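The cost that motivates this work can be seen in a naive software model of a functional array, sketched below with Python tuples. This is only an illustration of the semantics; it says nothing about the hardware architecture presented in the talk.

```python
from typing import Tuple, TypeVar

T = TypeVar("T")


def update(arr: Tuple[T, ...], index: int, value: T) -> Tuple[T, ...]:
    """Return a new array with arr[index] replaced by value.

    The original array is left untouched, so every earlier version stays
    readable. The price in this naive model is a full copy per update.
    """
    if not 0 <= index < len(arr):
        raise IndexError("index out of range")
    return arr[:index] + (value,) + arr[index + 1:]


a0 = (1, 2, 3)
a1 = update(a0, 1, 9)
a2 = update(a0, 2, 7)  # a second version derived from the same original
```

Both `a1` and `a2` coexist with `a0`, which is what single-threaded imperative arrays cannot offer and what the proposed memory architecture aims to provide with unit time access.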
Khaled Benkrid, Edinburgh University
Reconfigurable general purpose computing, has the time finally come?
In this talk, the speaker will evaluate the state-of-the-art of FPGA technology and assess its efficacy and efficiency in high performance general purpose computing. The talk will position FPGA technology in relation to other computer technologies and speculate on future directions of the technology.
Marcin Orczyk, Glasgow University
Design and Implementation of a General Purpose Language for GPU Computations
The focus of Moore's Law has shifted from increasing processor clock speed to increasing the number of processing cores on a chip. This trend is most visible in modern Graphics Processing Units (GPUs), which have become programmable data-parallel devices, frequently containing over a hundred processors. GPUs are now available in almost every desktop computer. However, techniques for programming such devices are still immature. In this talk I will present a general purpose language for programming such systems. It uses the JVM and OpenCL as its execution platforms and explores the applicability of the functional programming model to parallel (especially data parallel) computations, with the aim of finding a simpler and more intuitive approach to leverage the computational power of modern, parallel and heterogeneous architectures.
Dan Ghica, Birmingham University
Geometry of Synthesis: Semantics-directed hardware compilation
The problem of synthesis of gate-level descriptions of digital circuits from behavioural specifications written in higher-level programming languages (hardware compilation) has been studied for a long time, yet a definitive solution has not been forthcoming. In this talk I will describe a new technique based on recent advances in programming language theory: affine type systems, monoidal categories and game semantics. We argue that one of the major obstacles in the way of a useful and mature hardware compiler is the lack of a well defined function interface model, i.e. a canonical way in which functions can communicate with their arguments. We will show how digital circuits exhibit an inherent structure which is a model of affine higher-order type systems. Game semantics can provide interpretations for common language constants which are concretely representable in this setting, for both synchronous and asynchronous digital circuits. The key issue of sharing of resources and the additional structure required will also be addressed using game-semantic techniques. We will illustrate these theoretical considerations with a prototype compiler from Syntactic Control of Interference (an affine dialect of Idealised Algol) into digital circuits.
Wim Vanderbauwhede, Glasgow University
MORA-C++, High-Level FPGA Programming through a Soft Processor Network
MORA is a novel platform for high-level FPGA programming of streaming vector and matrix operations, aimed at multimedia applications. It consists of a soft array of pipelined low-complexity SIMD processors-in-memory (PIM). We present a Domain-Specific Language (DSL) for high-level programming of the MORA soft processor array. The DSL is embedded in C++, providing designers with a familiar language framework and the ability to compile designs using a standard compiler for functional testing before generating the FPGA bitstream using the MORA toolchain. The paper discusses the MORA-C++ DSL and the compilation route into the assembly for the MORA machine and provides examples to illustrate the programming model and performance.
Hans-Wolfgang Loidl, Heriot-Watt University
Managing Cluster Heterogeneity in Glasgow parallel Haskell
Heterogeneity is a key issue for parallelism on collections of Grid-enabled clusters. We investigate whether a programming language with high-level parallel coordination and a Distributed Shared Memory model (DSM) can deliver good, and scalable, performance on heterogeneous Grids. The high-level language, Glasgow parallel Haskell (GpH), abstracts over the architectural complexities of the computational Grid, and we have developed GridGUM2, a sophisticated grid-specific implementation of GpH, to produce the first high-level DSM parallel language implementation for computational Grids. We report a systematic performance evaluation of GridGUM2 on combinations of high- and low-latency, homogeneous and heterogeneous computational Grids.