Alistair Rendell | The Australian National University
Papers by Alistair Rendell
and the good times ahead. Acknowledgments If you were successful, somebody along the line gave you some help. There was a great teacher somewhere in your life.
2007 IEEE International Conference on Cluster Computing, 2007
Lecture Notes in Computer Science, 2008
Lecture Notes in Computer Science, 2009
The Intel Cluster OpenMP (CLOMP) compiler and associated runtime environment offer the potential to run OpenMP applications over a few nodes of a cluster. This paper reports on our efforts to use CLOMP with the Gaussian quantum chemistry code. Sample results on a four-node, quad-core Intel cluster show reasonable speedups; in some cases it proves preferable to use multiple nodes rather than multiple cores within a single node. The performance of the different benchmarks is analyzed in terms of page faults and by using a critical path analysis technique.
Lecture Notes in Computer Science, 2005
The OpenMP shared memory programming paradigm has been widely embraced by the computational science community, as have distributed memory clusters. What are the prospects for running OpenMP applications on clusters? This paper gives an overview of the SCore cluster-enabled OpenMP environment, provides performance data for some of the fundamental underlying operations, and reports overall performance for a model computational science application (the finite difference solution of the 2D Laplace equation).
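For concreteness, the kernel below is a minimal C/OpenMP sketch of the kind of model application described: a Jacobi-style finite difference solver for the 2D Laplace equation. It is not the paper's benchmark code; the grid size and iteration count are illustrative, and the same directives are what a cluster-enabled OpenMP environment such as SCore would execute across nodes.

```c
/* Minimal sketch (not the paper's code): Jacobi iteration for the 2D
 * Laplace equation, parallelised with OpenMP. Grid size and iteration
 * count are illustrative. Compile with e.g. gcc -fopenmp. */
#include <stdio.h>

#define N 512

int main(void) {
    /* Two (N+2)x(N+2) grids; boundaries held at 1.0, interior starts at 0. */
    static double u[N + 2][N + 2], unew[N + 2][N + 2];
    for (int i = 0; i < N + 2; i++)
        u[i][0] = u[i][N + 1] = u[0][i] = u[N + 1][i] = 1.0;

    for (int iter = 0; iter < 1000; iter++) {
        /* Each thread updates a block of rows of the interior. */
        #pragma omp parallel for
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++)
                unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                     u[i][j - 1] + u[i][j + 1]);
        /* Copy the interior back for the next sweep. */
        #pragma omp parallel for
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++)
                u[i][j] = unew[i][j];
    }
    printf("u[N/2][N/2] = %f\n", u[N / 2][N / 2]);
    return 0;
}
```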
2009 Sixth IFIP International Conference on Network and Parallel Computing, 2009
uDAPL is a portable and platform-independent communication library that provides RDMA as well as send/recv operations. Several well-known software packages take advantage of uDAPL's portability, including Open MPI, MVAPICH2, Intel MPI, and Cluster OpenMP; however, network bandwidth can still be a bottleneck for applications built on them. Employing a multirail network is one way to bypass this limit. In this paper, we design a non-threaded and a threaded approach to improve the performance of uDAPL over multirail-configured clusters. The two approaches are evaluated on an InfiniBand cluster with different multirail configurations. The results show that the threaded approach improves uni-directional bandwidth by 33% and 148% on the multi-port and multi-HCA configured networks respectively, while the non-threaded approach improves uni-directional bandwidth by roughly 90% on the multi-HCA configured network. Similar improvements are achieved for bi-directional bandwidth.
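As a rough illustration of the threaded approach, the sketch below stripes a single message across rails with one POSIX thread per rail. The transport calls post_rdma_write() and wait_rail() are hypothetical stand-ins for uDAPL endpoint operations and are stubbed so the example compiles; the point is the striping and per-rail threading structure, not the uDAPL API.

```c
/* Illustrative sketch of the threaded multirail idea (not the paper's
 * uDAPL code). post_rdma_write() and wait_rail() are hypothetical
 * stand-ins for uDAPL endpoint calls. Compile with -pthread. */
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-rail transport calls (stubbed so this compiles). */
static void post_rdma_write(int rail, const char *buf, size_t len) {
    (void)rail; (void)buf; (void)len;
}
static void wait_rail(int rail) { (void)rail; }

struct stripe { int rail; const char *buf; size_t len; };

static void *send_stripe(void *arg) {
    struct stripe *s = arg;
    post_rdma_write(s->rail, s->buf, s->len); /* post on this rail */
    wait_rail(s->rail);                       /* wait for completion */
    return NULL;
}

/* Stripe one buffer across n_rails rails; if the stripes complete
 * concurrently, aggregate bandwidth approaches the sum of the rails. */
static void multirail_send(const char *buf, size_t len, int n_rails) {
    pthread_t tid[8];
    struct stripe st[8];
    size_t chunk = len / n_rails;
    for (int r = 0; r < n_rails; r++) {
        st[r].rail = r;
        st[r].buf  = buf + (size_t)r * chunk;
        st[r].len  = (r == n_rails - 1) ? len - (size_t)r * chunk : chunk;
        pthread_create(&tid[r], NULL, send_stripe, &st[r]);
    }
    for (int r = 0; r < n_rails; r++)
        pthread_join(tid[r], NULL);
}

int main(void) {
    static char buf[1 << 20];
    multirail_send(buf, sizeof buf, 2); /* two rails, e.g. dual-port HCA */
    return 0;
}
```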
J. W. Larson, P. E. Strazdins, M. Hegland, B. Harding, S. Roberts, L. Stals, A. P. Rendell, Md. M. Ali, and J. Southern
Journal of Computational Chemistry, Jan 30, 2014
Use of the modern parallel programming language X10 for computing long-range Coulomb and exchange interactions is presented. Because X10 is a partitioned global address space language with support for task parallelism and the explicit representation of data locality, the resolution of the Ewald operator can be parallelized in a straightforward manner, using both intranode and internode parallelism. We evaluate four different schemes for dynamic load balancing of the integral calculation using X10's work-stealing runtime, and report performance results for long-range HF energy calculations on large molecules with high-quality basis sets, running on up to 1024 cores of a high performance cluster.
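The paper's load balancing uses X10's work-stealing runtime; as a language-neutral analogue, the C/OpenMP sketch below hands out irregularly sized integral blocks on demand via schedule(dynamic). compute_block() and its cost profile are invented for illustration.

```c
/* C/OpenMP analogue (not the paper's X10 code) of dynamic load
 * balancing over integral blocks of uneven cost. Compile with
 * gcc -fopenmp and link -lm. */
#include <math.h>
#include <stdio.h>

#define NBLOCKS 1000

/* Stand-in for an integral block; cost varies by block index. */
static double compute_block(int b) {
    double s = 0.0;
    for (int k = 0; k < 1000 * (b % 17 + 1); k++)
        s += sin(b + k * 1e-3);
    return s;
}

int main(void) {
    double energy = 0.0;
    /* schedule(dynamic) lets idle threads grab the next block, so
     * irregular block costs do not serialise the computation. */
    #pragma omp parallel for schedule(dynamic) reduction(+:energy)
    for (int b = 0; b < NBLOCKS; b++)
        energy += compute_block(b);
    printf("energy contribution: %f\n", energy);
    return 0;
}
```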
High Performance Computing - HiPC 2006, 2006
Modern shared memory multiprocessor systems commonly have non-uniform memory access (NUMA), with asymmetric memory bandwidth and latency characteristics. Operating systems now provide application programmer interfaces that allow the user to perform specific thread and memory placement. To date, however, there have been relatively few detailed assessments of the importance of memory/thread placement for complex applications. This paper outlines a framework for performing memory and thread placement experiments on Solaris and Linux. Thread binding, location-specific memory allocation, and its verification are discussed and contrasted. Using the framework, the performance characteristics of serial versions of lmbench, Stream, and various BLAS libraries (ATLAS, GOTO, and ACML on Opteron/Linux; Sunperf on Opteron and UltraSPARC/Solaris) are measured on two different hardware platforms (UltraSPARC/FirePlane and Opteron/HyperTransport). A simple model describing performance as a function of memory distribution is proposed and assessed for both the Opteron and the UltraSPARC.
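On the Linux side, the kind of placement experiment described can be approximated with libnuma, as in the hedged sketch below: the calling thread is bound to one node, and buffers are allocated on a local and a remote node for comparison. Node numbers and buffer size are illustrative; this is not the paper's framework code.

```c
/* Sketch of a thread/memory placement experiment using libnuma on
 * Linux (link with -lnuma). Node numbers are illustrative. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    int node = 0;
    numa_run_on_node(node);                  /* bind this thread to node 0 */

    size_t bytes = 64 * 1024 * 1024;
    double *local  = numa_alloc_onnode(bytes, node);           /* local  */
    double *remote = numa_alloc_onnode(bytes, numa_max_node()); /* remote */

    memset(local, 0, bytes);                 /* touch to commit the pages */
    memset(remote, 0, bytes);
    /* ... time a STREAM-like kernel against each buffer here ... */

    numa_free(local, bytes);
    numa_free(remote, bytes);
    return 0;
}
```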
2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, 2010
Lecture Notes in Computer Science, 2004
2008 13th Asia-Pacific Computer Systems Architecture Conference, 2008
A key issue for cluster-enabled OpenMP implementations based on software Distributed Shared Memory (sDSM) systems is maintaining the consistency of the shared memory space. This forms the major source of overhead for these systems, and is driven by the detection and servicing of page faults. This paper investigates how application performance can be modelled based on the number of page faults. Two simple models are proposed: one based on the number of page faults along the critical path of the computation, and one based on the aggregated number of page faults. Two different sDSM systems are considered. The models are evaluated using the OpenMP NAS Parallel Benchmarks on an 8-node AMD-based Gigabit Ethernet cluster. Both models gave estimates accurate to within 10% in most cases, with the critical path model showing slightly better accuracy; accuracy is lost if the underlying page faults cannot be overlapped, or if the application makes extensive use of the OpenMP flush directive.
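One plausible reading of the two models' general shape is sketched below; the symbols and the treatment of aggregate faults are ours, not the paper's fitted expressions.

```latex
% One plausible form of the two page-fault models (symbols ours):
% T_comp is fault-free compute time, t_fault the mean fault service cost.
T_{\text{crit}} \;\approx\; T_{\text{comp}} + N_{\text{cp}}\, t_{\text{fault}},
\qquad
T_{\text{agg}} \;\approx\; T_{\text{comp}} + \frac{N_{\text{tot}}}{P}\, t_{\text{fault}}
```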
2011 IEEE International Parallel & Distributed Processing Symposium, 2011
The parallel performance of applications running on Non-Uniform Memory Access (NUMA) platforms is strongly influenced by the placement of memory pages relative to the threads that access them. As a consequence, Linux provides application programmer interfaces (APIs) to control this. For large parallel codes, however, it can be difficult to determine how and when to use these APIs. In this paper we introduce the NUMAgrind profiling tool, which can be used to simplify this process. It extends the Valgrind binary translation framework with a model that incorporates cache coherency, memory locality domains, and interconnect traffic for arbitrary NUMA topologies. Using NUMAgrind, cache misses can be mapped to memory locality domains, page access modes determined, and pages that are referenced by multiple threads quickly identified. We show how NUMAgrind can be used to guide the use of Linux memory and thread placement APIs in the Gaussian computational chemistry code. The performance of the code before and after the use of these APIs is also presented for three different commodity NUMA platforms.
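A concrete example of the page-placement information involved: on Linux, move_pages(2) called with a NULL target-node array only queries which NUMA node currently backs each page, which is one way to verify placement decisions of the kind a tool like NUMAgrind informs. The sketch is ours, not part of NUMAgrind.

```c
/* Query (not migrate) the NUMA node backing each page of a buffer via
 * move_pages(2). Link with -lnuma. */
#include <numaif.h>
#include <numa.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = 4;
    char *buf = numa_alloc_local(npages * page);
    for (size_t i = 0; i < npages * page; i++)
        buf[i] = 1;                       /* touch to fault the pages in */

    void *pages[4];
    int status[4];
    for (size_t i = 0; i < npages; i++)
        pages[i] = buf + i * page;

    /* NULL node array => query mode: status[i] gets the backing node. */
    if (move_pages(0 /* self */, npages, pages, NULL, status, 0) == 0)
        for (size_t i = 0; i < npages; i++)
            printf("page %zu is on node %d\n", i, status[i]);

    numa_free(buf, npages * page);
    return 0;
}
```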
Interval analysis is an alternative to conventional floating-point computation that offers guaranteed error bounds. Despite this advantage, interval methods have not gained widespread use in large-scale computational science applications. This paper addresses this issue from a performance perspective, comparing the performance of floating-point and interval operations for some small computational kernels. Particular attention is given to the Sun Fortran interval implementation, although the strategies introduced here to enhance performance are applicable to other interval implementations. Fundamental differences in the operation counts and memory reference requirements of interval and floating-point codes are discussed.
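The operation-count difference is easy to see in a minimal C sketch of interval multiplication: each bound takes four products under a directed rounding mode, plus min/max reductions and rounding-mode switches, where ordinary floating point needs a single multiply. This is a generic textbook formulation, not the Sun Fortran implementation; compile with strict FP semantics (e.g. gcc -frounding-math).

```c
/* Generic interval multiplication with directed rounding (C99 fenv). */
#include <fenv.h>
#include <stdio.h>

typedef struct { double lo, hi; } interval;

static double min4(double a, double b, double c, double d) {
    double m = a; if (b < m) m = b; if (c < m) m = c; if (d < m) m = d;
    return m;
}
static double max4(double a, double b, double c, double d) {
    double m = a; if (b > m) m = b; if (c > m) m = c; if (d > m) m = d;
    return m;
}

/* The result must enclose x*y for all x in a, y in b, so each bound
 * evaluates four products under its own rounding mode. */
static interval imul(interval a, interval b) {
    interval r;
    fesetround(FE_DOWNWARD);
    r.lo = min4(a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi);
    fesetround(FE_UPWARD);
    r.hi = max4(a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi);
    fesetround(FE_TONEAREST);
    return r;
}

int main(void) {
    interval a = { 1.0, 2.0 }, b = { -3.0, 0.5 };
    interval c = imul(a, b);
    printf("[%g, %g]\n", c.lo, c.hi);   /* encloses [-6, 1] */
    return 0;
}
```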
X10 is an emerging Partitioned Global Address Space (PGAS) language intended to significantly increase the productivity of developing scalable HPC applications. The language has now matured to a point where it is meaningful to consider writing large-scale scientific application codes in X10. This paper reports our experiences writing three codes from the chemistry/materials science domain entirely in X10: Fast Multipole Method (FMM), Particle Mesh Ewald (PME), and Hartree-Fock (HF). Performance results are presented for up to 256 places on a Blue Gene/P system. During the course of this work our experiences were shared with the X10 development team, so that application requirements could inform language design discussions just as language capabilities influenced algorithm design. This resulted in improvements in the language implementation and standard class libraries, including the design of the array API and support for complex math. Data constructs in X10 such as places and distributed arrays, and parallel constructs such as finish and async, simplify implementation of the applications in comparison with MPI. However, current implementation limitations in X10 2.1.2 make it difficult to achieve scalable performance using the most natural expressions of the algorithms. The most serious limitation is the use of point-to-point communication patterns, rather than collectives, to implement parallel constructs and array operations. This issue will be addressed in future releases of X10.
The use of ghost regions is a common feature of many distributed grid applications. A ghost region holds local read-only copies of remotely-held boundary data, which are exchanged and cached many times over the course of a computation. X10 is a modern parallel programming language intended to support productive development of distributed applications. X10 supports the "active message" paradigm, which combines data transfer and computation in one-sided communications. A central feature of X10 is the distributed array, which distributes array data across multiple places, providing standard read and write operations as well as powerful high-level operations. We used active messages to implement ghost region updates for X10 distributed arrays using two different update algorithms. Our implementation exploits multiple levels of parallelism and avoids global synchronization; it also supports split-phase ghost updates, which allow computation and communication to be overlapped. We compare the performance of these algorithms on two platforms: an Intel x86-64 cluster over QDR InfiniBand, and a Blue Gene/P system, using both stand-alone benchmarks and an example computational chemistry application code. Our results suggest that on a dynamically threaded architecture, a ghost region update using only pairwise synchronization exhibits superior scaling to an update that uses global collective synchronization.
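For readers more familiar with MPI, the sketch below shows a conventional two-sided analogue of a 1D ghost region update using only pairwise communication. The paper's actual implementation uses X10 active messages and one-sided transfers, so this illustrates the pattern rather than the implementation.

```c
/* MPI analogue of a 1D ghost region update: each rank owns NLOCAL
 * interior cells plus one ghost cell at each end, exchanged pairwise
 * with its neighbours. Compile with mpicc. */
#include <mpi.h>

#define NLOCAL 100

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double u[NLOCAL + 2];              /* u[0] and u[NLOCAL+1] are ghosts */
    for (int i = 1; i <= NLOCAL; i++) u[i] = rank;

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Pairwise exchange, no global barrier: send our boundary cell,
     * receive the neighbour's into our ghost cell. */
    MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                 &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 1,
                 &u[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```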
Procedia Computer Science, 2013
A key issue confronting petascale and exascale computing is the growth in the probability of soft and hard faults with increasing system size. A promising approach to this problem is the use of algorithms that are inherently fault tolerant. We introduce such an algorithm for the solution of partial differential equations, based on the sparse grid approach. Here, the solutions on multiple component grids are efficiently combined to achieve a solution on a full grid. The technique also lends itself to a (modified) MapReduce framework on a cluster of processors, with the map stage corresponding to allocating each component grid for solution over a subset of the processors, and the reduce stage corresponding to their combination. We describe how the sparse grid combination method can be modified to robustly solve partial differential equations in the presence of faults. This is based on a modified combination formula that can accommodate the loss of one or two component grids; we also discuss accuracy issues associated with this formula. We give details of a prototype implementation within a MapReduce framework using the dynamic process features and asynchronous message passing facilities of MPI. Results on a two-dimensional advection problem show that the errors after the loss of one or two sub-grids are within a factor of 3 of the fault-free sparse grid solution. They also indicate that the sparse grid technique with four times the resolution has approximately the same error as a full grid, while requiring (for sufficiently high resolution) much lower computation and memory. We finally outline a MapReduce variant capable of responding to faults in ways other than rescheduling failed tasks, and discuss the likely software requirements for such a flexible MapReduce framework, the requirements it would impose on users' legacy codes, and the system's runtime behavior.
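For reference, the classical two-dimensional combination formula that the fault-tolerant variant generalises is shown below in standard notation; the modified coefficients used after a grid loss are given in the paper.

```latex
% Standard 2D sparse grid combination at level n: component grid
% solutions u_{i,j} (resolution 2^i x 2^j) combine into u_n^c.
u_n^{c} \;=\; \sum_{i+j=n} u_{i,j} \;-\; \sum_{i+j=n-1} u_{i,j}
```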
Procedia Computer Science, 2011
This paper explores the use of a simple linear performance model, which determines execution time based on instruction and cache miss counts, for describing the behaviour of the two-electron integral evaluation algorithm in the Gaussian computational chemistry package. Four different microarchitecture platforms are considered, with a total of seven individual microprocessors. Both Hartree-Fock and hybrid Hartree-Fock/Density Functional Theory electronic structure methods are assessed. In most cases the model is found to be accurate to within 3%. Agreement is poorest for an Athlon64 system (with errors ranging from 1.8% to 6.5%) and for a periodic boundary computation on an Opteron, where errors of up to 6.8% are observed. These errors arise because the model does not account for the intricacies of out-of-order execution, on-chip write-back buffers, and the prefetch techniques that modern microprocessors implement. The parameters of the linear performance model are combined with instruction and cache miss counts obtained from functional cache simulation to predict the effect of cache modifications on total execution time. Variations in level 1 and level 2 line size and level 2 total size are considered; we find there is some benefit if line sizes are increased (L1: 8%, L2: 4%). Increasing the level 2 cache size is also predicted to be beneficial, although the cache blocking approach already implemented in the Gaussian integral evaluation code was found to be working well.
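The model's general shape is a weighted sum of event counts with per-platform fitted coefficients; the symbol names below are ours, not the paper's.

```latex
% Linear performance model (symbols ours): alpha, beta, gamma are
% fitted per platform; N terms are measured or simulated event counts.
T \;\approx\; \alpha\, N_{\text{inst}}
  \;+\; \beta\, N_{\text{L1miss}}
  \;+\; \gamma\, N_{\text{L2miss}}
```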
The Journal of the Acoustical Society of America, 2012
Model-based treatment planning and exposimetry for high-intensity focused ultrasound requires the numerical simulation of nonlinear ultrasound propagation through heterogeneous and absorbing media. This is a computationally demanding problem due to the large distances travelled by the ultrasound waves relative to the wavelength of the highest frequency harmonic. Here, the k-space pseudospectral method is used to solve a set of coupled partial differential equations equivalent to a generalised Westervelt equation. The model is implemented in C++ and parallelised using the message passing interface (MPI) for solving large-scale problems on distributed clusters. The domain is partitioned using a 1D slab decomposition, and global communication is performed using a sparse communication pattern. Operations in the spatial frequency domain are performed in transposed space to reduce the communication burden imposed by the 3D fast Fourier transform. The performance of the model is evaluated using grid sizes up to 4096 × 2048 × 2048 grid points, distributed over a cluster using up to 1024 compute cores. Given the global nature of the gradient calculation, the model shows good strong scaling behaviour, with a speed-up of 1.7× whenever the number of cores is doubled. This means large-scale simulations can be distributed across high numbers of cores on a cluster to minimise execution times with relatively small overhead. The efficacy of the model is demonstrated by simulating the ultrasound beam pattern for a high-intensity focused ultrasound sonication of the kidney.
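The transposed-space idea can be illustrated with FFTW's MPI interface, which also uses a 1D slab decomposition. The paper's solver is a custom C++ code, so the sketch below (grid size illustrative; compile with mpicc and link -lfftw3_mpi -lfftw3) only demonstrates skipping the transpose back after the forward transform.

```c
/* Slab-decomposed 3D FFT with transposed output, via FFTW's MPI
 * interface: TRANSPOSED_OUT leaves the forward result distributed over
 * the second dimension, so spectral operators are applied in transposed
 * space and one global transpose per round trip is avoided. */
#include <fftw3-mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    const ptrdiff_t N0 = 256, N1 = 256, N2 = 256;   /* illustrative */
    ptrdiff_t local_n0, local_0_start, local_n1, local_1_start;
    ptrdiff_t alloc = fftw_mpi_local_size_3d_transposed(
        N0, N1, N2, MPI_COMM_WORLD,
        &local_n0, &local_0_start, &local_n1, &local_1_start);
    fftw_complex *data = fftw_alloc_complex(alloc);

    fftw_plan fwd = fftw_mpi_plan_dft_3d(N0, N1, N2, data, data,
                                         MPI_COMM_WORLD, FFTW_FORWARD,
                                         FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);
    fftw_plan bwd = fftw_mpi_plan_dft_3d(N0, N1, N2, data, data,
                                         MPI_COMM_WORLD, FFTW_BACKWARD,
                                         FFTW_MEASURE | FFTW_MPI_TRANSPOSED_IN);

    fftw_execute(fwd);
    /* ... multiply by k-space operators in the transposed layout ... */
    fftw_execute(bwd);

    fftw_destroy_plan(fwd);
    fftw_destroy_plan(bwd);
    fftw_free(data);
    MPI_Finalize();
    return 0;
}
```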
The Journal of the Acoustical Society of America, 2012