A new kind of parallelism and its programming in the Explicitly Many-Processor Approach
Related papers
Multi- and Many-Cores, Architectural Overview for Programmers
2017
Parallelism has been used since the early days of computing to enhance performance. From the first computers to the most modern sequential processors (also called uniprocessors), the main concepts introduced by von Neumann [20] are still in use. However, the ever-increasing demand for computing performance has pushed computer architects toward implementing different techniques of parallelism. The von Neumann architecture was initially a sequential machine operating on scalar data with bit-serial operations [20]. Word-parallel operations were made possible by using more complex logic that could perform binary operations in parallel on all the bits of a computer word, and this was only the start of a long series of innovations in parallel computer architecture.
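The difference between bit-serial and word-parallel operation can be illustrated with a small software sketch (the function name and bit width below are ours, chosen for illustration): a bit-serial machine adds two words one bit position per step with a ripple carry, whereas a word-parallel ALU produces the same sum in a single operation.

```python
def bit_serial_add(a, b, width=8):
    """Add two words one bit per step, as an early bit-serial machine would."""
    carry, result = 0, 0
    for i in range(width):                       # one step per bit position
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        s = abit ^ bbit ^ carry                  # full-adder sum bit
        carry = (abit & bbit) | (carry & (abit ^ bbit))
        result |= s << i
    return result & ((1 << width) - 1)

# A word-parallel ALU computes the same result in one operation:
assert bit_serial_add(100, 55) == (100 + 55) & 0xFF
```

The bit-serial version needs `width` steps per addition; widening the datapath trades logic complexity for a `width`-fold reduction in steps, which is exactly the kind of parallelism the paragraph describes.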
Advanced Compilers, Architectures and Parallel Systems
1994
Abstract: Multithreaded node architectures have been proposed for future multiprocessor systems. However, some open issues remain: can efficient multithreading support be provided in a multiprocessor machine such that it is capable of tolerating the synchronization and communication latencies, without intruding on the performance of sequentially-executed code?
A microthreaded architecture and its compiler
2006
A different approach to ILP based on code fragmentation, first proposed some 10 years ago, is being used for novel CMP processor designs. The technique, called microthreading, enables binary compatibility across arbitrary schedules. Chip architectures have been proposed that contain many simple pipelines with hardware support for ultra-fast context switching. The concurrency described in the binary code is parametric, and a typical microthread is an iteration of a loop. The ISA contains instructions to create a family of microthreads, i.e., the collection of all loop iterations. When a microthread encounters a (possibly) long-latency operation (e.g., a load that may miss in the cache), this thread is switched out and another thread is switched in under program control. In this way, latencies can effectively be hidden, provided a sufficient number of threads is available. The creation of families of threads is the responsibility of the compiler. In this presentation, we give an overview of the microthreaded model of computation and we show by some small examples that it provides an efficient way of executing loops. Moreover, we show that this model has excellent scaling properties. Finally, we discuss the compiler support required and propose some compiler transformations that can be used to expose large families of threads.
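As a rough software analogy (not the hardware mechanism itself; all names and the 3-cycle latency below are ours), the switch-on-long-latency behaviour can be sketched with coroutines: each loop iteration is a microthread that yields when it issues a "load", so the scheduler can switch in another ready thread and the latency is overlapped with useful work.

```python
from collections import deque

def microthread(i, memory, results):
    """One loop iteration: issue a load, then suspend until the latency elapses."""
    yield 3                      # 'load' with a 3-cycle miss latency: switch out
    results[i] = memory[i] * 2   # resume after the load completes

def run_family(n):
    """Create a family of n microthreads and interleave them round-robin."""
    memory = list(range(n))
    results = [None] * n
    ready = deque((microthread(i, memory, results), 0) for i in range(n))
    cycle = 0
    while ready:
        thread, wake_at = ready.popleft()
        if cycle < wake_at:              # still waiting on its load
            ready.append((thread, wake_at))
            cycle += 1
            continue
        try:
            latency = next(thread)       # run until the next long-latency op
            ready.append((thread, cycle + latency))
        except StopIteration:
            pass                         # this iteration has finished
        cycle += 1
    return results

assert run_family(4) == [0, 2, 4, 6]
```

With four threads in the family, the three waiting cycles of each load are filled by issuing the other iterations' loads, which is the latency-hiding effect the abstract describes; a single iteration running alone would simply idle through them.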
Flexible instruction processors
2000
This paper introduces the notion of a Flexible Instruction Processor (FIP) for systematic customisation of instruction processor design and implementation. The features of our approach include: (a) a modular framework based on "processor templates" that capture various instruction processor styles, such as stack-based or register-based styles; (b) enhancements of this framework to improve functionality and performance, such as hybrid processor templates and superscalar operation; (c) compilation strategies involving standard compilers and FIP-specific compilers, and the associated design flow; (d) technology-independent and technology-specific optimisations, such as techniques for efficient resource sharing in FPGA implementations. Our current implementation of the FIP framework is based on a high-level parallel language called Handel-C, which can be compiled into hardware. Various customised Java Virtual Machines and MIPS-style processors have been developed using existing FPGAs to evaluate the effectiveness and promise of this approach.
An operating system accelerator
Journal of Systems Architecture, 1998
A RISC-style hardware accelerator for operating systems (OS), named the mechanism of multiprocessing (MMP) processor, is presented. The MMP processor implements the set of MMP primitives of the MMP mechanism with which the OS was enhanced. The architecture and organisation of this processor were devised to facilitate fast execution of the MMP primitives. The architecture adopted ensures the efficient mapping of typical operations and data structures used in the MMP primitives. The hardware resources were selected and interconnected in such a way that, with pipelined control, the organisation envisaged can support their fast execution. The prototype of the MMP processor was developed and put into operation with extensive usage of the interactive development and testing (INDAT) system, a specially designed tool for the development and testing of the MMP processor.
The Computer Journal, 2002
The instruction-level parallelism found in a conventional instruction stream is limited. Studies have shown the limits of processor utilization even for today's superscalar microprocessors. One solution is the additional utilization of more coarse-grained parallelism. The main approaches are the (single) chip multiprocessor and the multithreaded processor which optimize the throughput of multiprogramming workloads rather than single-thread performance. The chip multiprocessor integrates two or more complete processors on a single chip. Every unit of a processor is duplicated and used independently of its copies on the chip. In contrast, the multithreaded processor is able to pursue two or more threads of control in parallel within the processor pipeline. Unused instruction slots, which arise from pipelined execution of single-threaded programs by a contemporary microprocessor, are filled by instructions of other threads within a multithreaded processor.
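The slot-filling effect can be shown with a toy single-issue pipeline model (a sketch under invented assumptions: the two-cycle stall, the round-robin policy, and all names below are ours, not from the paper). In single-threaded mode, every stall cycle wastes an issue slot; in multithreaded mode, a stalled thread's empty slots are filled with instructions from another thread.

```python
from collections import deque

def run(threads, smt):
    """Toy single-issue pipeline. 'I' = useful instruction, '.' = stall cycle.
    Returns (used_slots, total_cycles)."""
    qs = [deque(t) for t in threads]
    used = total = 0
    if not smt:
        # Single-threaded execution: threads run back to back,
        # and every stall cycle leaves the issue slot empty.
        for q in qs:
            while q:
                total += 1
                if q.popleft() == 'I':
                    used += 1
        return used, total
    rr = 0
    while any(qs):
        total += 1
        issuer = None
        for k in range(len(qs)):             # round-robin pick of a ready thread
            q = qs[(rr + k) % len(qs)]
            if q and q[0] == 'I':
                issuer = q
                rr = (rr + k + 1) % len(qs)
                break
        if issuer is not None:
            issuer.popleft()
            used += 1                        # slot filled by some thread
        for q in qs:                         # other threads' stalls elapse
            if q is not issuer and q and q[0] == '.':
                q.popleft()
    return used, total

t = ['I', '.', '.', 'I']                     # a load miss stalls two cycles
print(run([t, t], smt=False))                # stalls waste half the slots
print(run([t, t], smt=True))                 # one thread's stalls hide behind the other
```

Both runs execute the same four instructions, but the multithreaded run finishes in fewer cycles with a higher fraction of issue slots used, which is the utilization argument the abstract makes for multithreaded processors over a purely sequential pipeline.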
How to extend the Single-Processor Paradigm to the Explicitly Many-Processor Approach
ArXiv, 2020
The computing paradigm invented for processing a small amount of data on a single segregated processor cannot meet the challenges set by the present-day computing demands. The paper proposes a new computing paradigm (extending the old one to use several processors explicitly) and discusses some questions of its possible implementation. Some advantages of the implemented approach, illustrated with the results of a loosely-timed simulator, are presented.