Seven-O'clock: a new distributed GVT algorithm using network atomic operations

ORCHESTRA: An asynchronous wait-free distributed GVT algorithm

2017

Taking advantage of the computing capabilities offered by modern parallel and distributed architectures is fundamental to running large-scale simulation models based on the Parallel Discrete Event Simulation (PDES) paradigm. By relying on this computing organization, it is possible to effectively overcome both the power wall and the memory wall, which are core aspects limiting the delivery of high-performance simulations. This is even more the case when relying on the speculative Time Warp synchronization protocol, which can be particularly memory-greedy. At the same time, some form of coordination, such as the computation of the Global Virtual Time (GVT), is required by Time Warp systems. These coordination points can easily become the bottleneck of large-scale simulations, hindering efficient exploitation of the computing power offered by large supercomputing facilities. In this paper we present ORCHESTRA, a coordination algorithm which is both wait-free and asynchronous. The nature of this ...
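For context across the entries below: the quantity all of these algorithms estimate is the standard lower bound from Jefferson's virtual-time framework (textbook background, not a formula from the ORCHESTRA paper itself):

    \mathrm{GVT}(t) = \min\Big(\ \min_i \mathrm{LVT}_i(t),\ \min_{m \in \mathrm{transit}(t)} \mathrm{ts}(m)\ \Big)

where LVT_i(t) is the local virtual time of logical process i at wall-clock time t and ts(m) is the timestamp of a message m still in transit at time t. No event with a timestamp below GVT can ever be rolled back, which makes GVT the safe horizon for fossil collection and other irrevocable actions.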

A fast asynchronous GVT algorithm for shared memory multiprocessor architectures

1995

The computation of Global Virtual Time (GVT) is of fundamental importance in Time Warp based Parallel Discrete Event Simulation systems. Shared memory multiprocessor architectures can support interprocess communication with much smaller overheads than distributed memory systems. This paper presents a new, completely asynchronous GVT algorithm which provides very fast and accurate GVT estimation with significantly lower overhead than previous approaches. The algorithm presented is able to support more efficient memory management, termination, and other global control mechanisms.
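As a rough illustration of why shared memory keeps this cheap, consider the following minimal sketch (illustrative only, not the algorithm from this paper): each worker publishes a monotonically increasing lower bound into a shared array, and any thread can estimate GVT by scanning the array, with no barriers or message exchanges.

    # Minimal sketch of asynchronous GVT on shared memory (hypothetical
    # structure, not the cited paper's algorithm). Each worker publishes
    # a lower bound on its future timestamps; any thread estimates GVT
    # by scanning the shared array.
    import threading

    N_WORKERS = 4
    published = [0.0] * N_WORKERS   # per-worker lower bound, only ever raised

    def worker(rank):
        lvt = 0.0
        for _ in range(100):
            lvt += rank + 1         # stand-in for processing one event
            published[rank] = lvt   # a plain store suffices in CPython

    def gvt_estimate():
        # Stale reads are safe: GVT is a lower bound, and each entry
        # only increases, so the scan can never overestimate.
        return min(published)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(N_WORKERS)]
    for t in threads: t.start()
    for t in threads: t.join()
    print("GVT estimate:", gvt_estimate())

A real implementation must also fold in-transit message timestamps into each published value; the point here is only that the collection step is a handful of cache-coherent loads rather than a message round.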

A hypercube algorithm for GVT computation and its application in optimistic parallel simulation

Proceedings of Simulation Symposium, 1995

In this paper we present an algorithm for computing the global virtual time (GVT) in an optimistic parallel discrete event simulation on the distributed-memory hypercube architecture. Our algorithm uses only 3N messages and runs in O(log N) time, where N is the number of logical processors (LPs) representing components of the simulation system. It is based on the construction of a spanning binomial tree in the hypercube. In most simulation systems, there is an LP designated for GVT computation, called the GVT-manager. Failure of the physical processor running this LP causes the simulation to stop, and in such a case a reorganization of LPs is necessary so that another LP takes the role of the GVT-manager. In our algorithm, any LP in the system can elect itself to be the GVT-manager, and hence such reorganization is not necessary. We show how our algorithm can be used for memory management and hierarchical load balancing in a hypercube machine, and suggest a new technique to handle transient messages.
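The O(log N) running time comes from dimension-by-dimension exchanges in the hypercube. The sketch below simulates the recursive-doubling variant of that pattern in plain Python; the paper's spanning-binomial-tree construction and GVT-manager election, which get the message count down to 3N, are not reproduced here.

    # Minimal sketch: min-reduction over a hypercube in log2(N) rounds.
    # In round d, each node exchanges values with the neighbor whose rank
    # differs in bit d and keeps the minimum; after all rounds every node
    # holds the global minimum (the GVT candidate).
    N = 8                                              # power of two
    local = [7.0, 3.5, 9.1, 4.2, 6.6, 2.8, 5.0, 8.3]   # per-LP local minima

    vals = local[:]
    d = 1
    while d < N:                                 # log2(N) rounds
        vals = [min(vals[rank], vals[rank ^ d])  # rank ^ d: neighbor along
                for rank in range(N)]            # the current dimension
        d <<= 1

    assert all(v == min(local) for v in vals)
    print("GVT known at every node:", vals[0])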

Scalable, accurate multicore simulation in the 1000-core era

IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2011

We present HORNET, a parallel, highly configurable, cycle-level multicore simulator based on an ingress-queued wormhole router NoC architecture. The parallel simulation engine offers both cycle-accurate and periodic synchronization; while preserving functional accuracy, this permits trading perfect timing accuracy for higher speed at still very good accuracy. When run on 6 separate physical cores on a single die, speedups exceed 5×, and when run on a two-die 12-core system with 2-way hyperthreading, speedups exceed 11×.

A use of matrix with GVT computation in optimistic time warp algorithm for parallel simulation

One of the most common optimistic synchronization protocols for parallel simulation is the Time Warp algorithm proposed by Jefferson. The Time Warp algorithm is based on the virtual time paradigm, which has the potential for greater exploitation of parallelism and, perhaps more importantly, greater transparency of the synchronization mechanism to the simulation programmer. It is widely believed that the optimistic Time Warp algorithm suffers from large memory consumption due to frequent rollbacks. In order to achieve optimal memory management, the Time Warp algorithm needs to periodically reclaim memory. In order to determine which event-messages have been committed and which portion of memory can be reclaimed, the computation of the global virtual time (GVT) is essential. Mattern [2] uses a distributed snapshot algorithm to approximate GVT which does not rely on first-in, first-out (FIFO) channels. Specifically, it uses a ring structure to establish two cuts, C1 and C2, to calculate the GVT and to distinguish between safe and unsafe event-messages. Although the distributed snapshot algorithm provides a straightforward way of computing GVT, more efficient solutions are desired for message acknowledgment and for the delaying of event-message sends while awaiting control messages. This paper studies the memory requirements and time complexity of GVT computation. The main objective of this paper is to combine a matrix structure with Mattern's original GVT algorithm to speed up GVT computation while at the same time reducing the memory requirement. Our analysis shows that the use of a matrix in GVT computation improves overall performance in terms of memory savings and latency.
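The invariant every such scheme must preserve is that GVT accounts for messages still in transit, not just the LPs' local clocks; Mattern's cuts and this paper's matrix are both devices for approximating that in-transit set without FIFO channels or per-message acknowledgments. A minimal sketch of the invariant itself (tracking the in-transit set exactly, which is precisely the bookkeeping the real algorithms avoid):

    # Minimal sketch: GVT as a lower bound over local virtual times and
    # in-transit message timestamps. Mattern approximates the in-transit
    # set with message coloring and counters between two cuts; here it is
    # tracked exactly, purely for illustration.
    from dataclasses import dataclass, field

    @dataclass
    class LP:
        lvt: float = 0.0                   # local virtual time

    @dataclass
    class System:
        lps: list = field(default_factory=list)
        in_transit: list = field(default_factory=list)  # timestamps of
                                                        # undelivered sends
        def send(self, ts):
            self.in_transit.append(ts)

        def receive(self, lp, ts):
            self.in_transit.remove(ts)
            lp.lvt = max(lp.lvt, ts)

        def gvt(self):
            return min([lp.lvt for lp in self.lps] + self.in_transit)

    s = System(lps=[LP(10.0), LP(12.5), LP(11.0)])
    s.send(9.0)                    # a straggler still in flight
    print(s.gvt())                 # 9.0: the in-transit message bounds GVT
    s.receive(s.lps[0], 9.0)
    print(s.gvt())                 # 10.0 once it is delivered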

Computing global virtual time in shared-memory multiprocessors

ACM Transactions on Modeling and Computer Simulation, 1997

...mechanism on a Kendall Square Research KSR-2 machine demonstrate that these techniques enable frequent GVT and fossil collections, e.g., every millisecond, without incurring a significant performance penalty.

Graphite: A distributed parallel simulator for multicores

2010

This paper introduces the Graphite open-source distributed parallel multicore simulator infrastructure. Graphite is designed from the ground up for exploration of future multicore processors containing dozens, hundreds, or even thousands of cores. It provides high performance for fast design space exploration and software development. Several techniques are used to achieve this, including direct execution, seamless multicore and multi-machine distribution, and lax synchronization. Graphite is capable of accelerating simulations by distributing them across multiple commodity Linux machines. When using multiple machines, it provides the illusion of a single process with a single, shared address space, allowing it to run off-the-shelf pthread applications with no source code modification.

Proceedings of the National Conference on High Performance Computing & Simulation 2013

2013

Cross-chip latencies now make multi-core architectures resemble distributed systems. The design of distributed protocols is notoriously error-prone, particularly when their analysis is based on the use of global time. Classical memory consistency models for parallel programming, such as linearizability, use such a global ordering. This talk examines the reformulation of these consistency models without global time.

Controlled Asynchronous GVT

Proceedings of the 48th International Conference on Parallel Processing, 2019

In this paper, we investigate the performance of Parallel Discrete Event Simulation (PDES) on a cluster of many-core Intel KNL processors. Specifically, we analyze the impact of different Global Virtual Time (GVT) algorithms in this environment and contribute three significant results. First, we show that it is essential to isolate the thread performing MPI communications from the task of processing simulation events; otherwise the simulation is significantly imbalanced and performs poorly. This applies to both synchronous and asynchronous GVT algorithms. Second, we demonstrate that a synchronous GVT algorithm based on barrier synchronization is a better choice for communication-dominated models, while asynchronous GVT based on Mattern's algorithm performs better for computation-dominated scenarios. Third, we propose the Controlled Asynchronous GVT (CA-GVT) algorithm, which selectively adds synchronization to Mattern-style GVT based on simulation conditions. We demonstrate that CA-GVT outperforms both barrier and Mattern's GVT and achieves about an 8% performance improvement on mixed computation-communication models. This is a reasonable improvement for a simple modification to a GVT algorithm.
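For contrast with the asynchronous schemes above, a barrier-style synchronous GVT round reduces to a single collective once each rank knows its local minimum. A minimal mpi4py sketch of that baseline (CA-GVT's selective-synchronization logic is not shown; variable names are illustrative):

    # Minimal sketch: synchronous, barrier-style GVT via one collective.
    # Each rank contributes min(local virtual time, timestamps of its
    # unacknowledged sends); the allreduce doubles as the barrier, and
    # every rank learns the new GVT at once.
    # Run with: mpiexec -n 4 python gvt_barrier.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    local_lvt = 100.0 + rank        # stand-in: min timestamp in event queue
    unacked = [105.0 + rank]        # stand-in: in-flight send timestamps

    local_min = min([local_lvt] + unacked)
    gvt = comm.allreduce(local_min, op=MPI.MIN)  # implicit global sync point

    if rank == 0:
        print("GVT:", gvt)          # safe horizon for fossil collection

The tradeoff the paper measures is exactly this collective's cost: cheap for communication-dominated models that stall on messaging anyway, but a forced pause for computation-dominated ones, which is where Mattern-style asynchronous GVT and CA-GVT's selective synchronization pay off.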