J. Mohd-yusof - Academia.edu
Papers by J. Mohd-yusof
Parallel Computing, 2007
The first part of this paper surveys co-processor approaches for commodity-based clusters in general, not only with respect to raw performance, but also in view of their system integration and power consumption. We then extend previous work on a small GPU cluster by exploring the heterogeneous hardware approach for a large-scale system with up to 160 nodes. Starting with a conventional commodity-based cluster, we leverage the high bandwidth of graphics processing units (GPUs) to increase the overall system bandwidth, which is the decisive performance factor in this scenario. Thus, even the addition of low-end, out-of-date GPUs leads to improvements in both performance- and power-related metrics.
A numerical method is presented which can simulate flows in complex geometries with moving boundaries while still retaining all the advantages and the efficiency of solving the Navier-Stokes equations on cylindrical grids. The boundary conditions are applied independently of the grid by assigning body forces over surfaces that need not coincide with coordinate lines. The method has been validated by a large-eddy simulation of the flow in a motored axisymmetric piston-cylinder assembly for which experimental measurements are available. The comparison of the results has shown very good agreement for mean and rms velocity profiles, thus confirming the accuracy of the present approach. This numerical method, in addition, runs on a small PC-like workstation ten times faster than corresponding simulations on supercomputers. In large-eddy simulations the dynamic subgrid-scale model is very efficient in combination with the body force procedure because it automatically accounts for the wa...
Computer Physics Communications, 2014
In a common approach to multiscale simulation, an incomplete set of macroscale equations must be supplemented with constitutive data provided by fine-scale simulation. Collecting statistics from these fine-scale simulations is typically the overwhelming computational cost. We reduce this cost by interpolating the results of fine-scale simulation over the spatial domain of the macro-solver. Unlike previous adaptive sampling strategies, we do not interpolate on the potentially very high dimensional space of inputs to the fine-scale simulation. Our approach is local in space and time, avoids the need for a central database, and is designed to parallelize well on large computer clusters. To demonstrate our method, we simulate one-dimensional elastodynamic shock propagation using the Heterogeneous Multiscale Method (HMM); we find that spatial adaptive sampling requires only ≈ 50 × N^0.14 fine-scale simulations to reconstruct the stress field at all N grid points. Related multiscale approaches, such as Equation Free methods, may also benefit from spatial adaptive sampling.
We present a brief overview of the Roadrunner hybrid architecture, followed by a summary of the original code to be adapted. The coding challenges of porting the CFDNS compressible Navier-Stokes solver to the hybrid architecture will be discussed, along with the code modification predicated by the serial speedup on the CBE. Performance results from the initial stages, performed on the single-precision development system, through to the current full simulations, which are ongoing at the time of the submittal, are presented. We observe that the overall coding effort and benefits realized are considerable, but many of the global modifications to the code will be applicable to a variety of future directions in high-performance computing.
For fluid dynamics simulations, the primary issues are accuracy, computational efficiency, and the ability to handle complex geometries. Spectral methods offer the highest accuracy but are limited to relatively simple geometries. In order to accommodate more complex geometries, finite-difference or finite-element methods are generally used. However, these methods suffer from relatively low accuracy, requiring fine meshes to obtain good results. Finite element schemes, while able to handle complex geometries, often require significant computational time for grid generation. Spectral element methods can be used for complex geometries, but the grid stretching inherent in these methods leads to time-step limitations and clustering of grid-points in an inefficient manner. In general, any computational scheme which requires regridding to accommodate changes in geometry will incur significant penalties in simulating time-varying geometries. For relatively simple motions, it is possible to ...
Journal of Chemical Theory and Computation, Jan 13, 2015
We present an algorithm for the calculation of the density matrix that for insulators scales linearly with system size and parallelizes efficiently on multicore, shared memory platforms with small and controllable numerical errors. The algorithm is based on an implementation of the second-order spectral projection (SP2) algorithm [Niklasson, A. M. N. Phys. Rev. B 2002, 66, 155115] in sparse matrix algebra with the ELLPACK-R data format. We illustrate the performance of the algorithm within self-consistent tight binding theory by total energy calculations of gas phase poly(ethylene) molecules and periodic liquid water systems containing up to 15,000 atoms on up to 16 CPU cores. We consider algorithm-specific performance aspects, such as local vs nonlocal memory access and the degree of matrix sparsity. Comparisons to sparse matrix algebra implementations using off-the-shelf libraries on multicore CPUs, graphics processing units (GPUs), and the Intel many integrated core (MIC) arc...
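The SP2 recursion named in this abstract can be illustrated with a minimal dense-matrix sketch. It is not the authors' implementation: the paper works with sparse ELLPACK-R matrices, which are not reproduced here. The function names (`matmul`, `trace`, `sp2_density_matrix`), the Gershgorin estimate of the spectral bounds, the trace-based choice between the two projections, and the small 4-site test Hamiltonian are all assumptions made for illustration.

```python
def matmul(A, B):
    """Dense product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace(A):
    return sum(A[i][i] for i in range(len(A)))


def sp2_density_matrix(H, n_occ, tol=1e-12, max_iter=100):
    """Approximate the zero-temperature density matrix of a symmetric H.

    n_occ is the number of occupied states. A spectral gap between the
    occupied and unoccupied states is assumed (an insulator).
    """
    n = len(H)
    # Gershgorin circles give cheap bounds on the spectrum of H.
    radii = [sum(abs(H[i][j]) for j in range(n) if j != i) for i in range(n)]
    e_min = min(H[i][i] - radii[i] for i in range(n))
    e_max = max(H[i][i] + radii[i] for i in range(n))
    # Map the spectrum into [0, 1] with low energies close to 1.
    X = [[((e_max if i == j else 0.0) - H[i][j]) / (e_max - e_min)
          for j in range(n)] for i in range(n)]
    for _ in range(max_iter):
        X2 = matmul(X, X)
        tr_x, tr_x2 = trace(X), trace(X2)
        # Tr(X - X^2) vanishes once every eigenvalue is 0 or 1.
        if abs(tr_x - tr_x2) < tol:
            break
        # Pick the projection whose trace lands closer to n_occ:
        # X^2 lowers the trace, 2X - X^2 raises it.
        if abs(tr_x2 - n_occ) < abs(2.0 * tr_x - tr_x2 - n_occ):
            X = X2
        else:
            X = [[2.0 * X[i][j] - X2[i][j] for j in range(n)]
                 for i in range(n)]
    return X


# Hypothetical 4-site chain with alternating hopping, so a gap exists at half filling.
H = [[0.0, -1.0, 0.0, 0.0],
     [-1.0, 0.0, -0.5, 0.0],
     [0.0, -0.5, 0.0, -1.0],
     [0.0, 0.0, -1.0, 0.0]]
P = sp2_density_matrix(H, n_occ=2)
```

Each iteration costs one matrix product, which is why a sparse format with controlled fill-in matters for the linear scaling the abstract describes; the dense product above scales cubically and only shows the recursion itself.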
International Journal of Computational Science and Engineering, 2008
This article explores the coupling of coarse and fine-grained parallelism for Finite Element simulations based on efficient parallel multigrid solvers. The focus lies on both system performance and a minimally invasive integration of hardware acceleration into an existing software package, requiring no changes to application code. Because of their excellent price-performance ratio, we demonstrate the viability of our approach by using commodity graphics processors (GPUs) as efficient multigrid preconditioners. We address the issue of limited precision on GPUs by applying a mixed precision, iterative refinement technique. Other restrictions are also handled by a close interplay between the GPU and CPU. From a software perspective, we integrate the GPU solvers into the existing MPI-based Finite Element package by implementing the same interfaces as the CPU solvers, so that for the application programmer they are easily interchangeable. Our results show that we do not compromise any software functionality and gain speedups of two and more for large problems. Equipped with this additional option of hardware acceleration we compare different choices in increasing the performance of a conventional, commodity-based cluster by increasing the number of nodes, replacement of nodes by a newer technology generation, and adding powerful graphics cards to the existing nodes.
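The mixed precision, iterative refinement idea mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the low-precision inner solver here is a small Gaussian elimination whose intermediate results are rounded to single precision, standing in for the GPU multigrid solver of the paper, and all names (`to_single`, `solve_single`, `residual`, `refine`) are hypothetical.

```python
import struct


def to_single(x):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_single(A, b):
    """Gaussian elimination with partial pivoting, rounded to single precision."""
    n = len(b)
    M = [[to_single(v) for v in row] + [to_single(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = to_single(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = to_single(M[i][j] - to_single(f * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = to_single(s - to_single(M[i][j] * x[j]))
        x[i] = to_single(s / M[i][i])
    return x


def residual(A, x, b):
    """Residual b - A x, evaluated in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x))
            for row, bi in zip(A, b)]


def refine(A, b, tol=1e-12, max_iter=20):
    """Iterative refinement: double-precision residuals, single-precision corrections."""
    x = [0.0] * len(b)
    for _ in range(max_iter):
        r = residual(A, x, b)
        if max(abs(v) for v in r) < tol:
            break
        c = solve_single(A, r)                  # cheap low-precision solve
        x = [xi + ci for xi, ci in zip(x, c)]   # accumulate in double
    return x


# Hypothetical well-conditioned test system.
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
x_true = [1.0 / 3.0, 0.1, 0.7]
b = [sum(a * x for a, x in zip(row, x_true)) for row in A]
x = refine(A, b)
```

The design point is that only the residual and the accumulated solution need high precision; the expensive solve can run in the lower precision of the accelerator, and for a well-conditioned system a few outer iterations recover double-precision accuracy.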
A numerical method is presented which can simulate flows in complex geometries with moving boundaries while still retaining all the advantages and the efficiency of solving the Navier-Stokes equations on cylindrical grids. The boundary conditions are applied independently of the grid by assigning body forces over surfaces that need not coincide with coordinate lines. The method has been validated by
CTR annual research briefs, 1997
Journal of Computational Physics, 2000
A second-order accurate, highly efficient method is developed for simulating unsteady three-dimensional incompressible flows in complex geometries. This is achieved by using boundary body forces that allow the imposition of the boundary conditions on a given surface not coinciding with the computational grid. The governing equations, therefore, can be discretized and solved on a regular mesh thus retaining the advantages and the efficiency of the standard solution procedures. Two different forcings are tested showing that while the quality of the results is essentially the same in both cases, the efficiency of the calculation strongly depends on the particular expression. A major issue is the interpolation of the forcing over the grid that determines the accuracy of the scheme; this ranges from zeroth-order for the most commonly used interpolations up to second-order for an ad hoc velocity interpolation. The present scheme has been used to simulate several flows whose results have been validated by experiments and other results available in the literature. Finally in the last example we show the flow inside an IC piston/cylinder assembly at high Reynolds number; to our knowledge this is the first example in which the immersed boundary technique is applied to a full three-dimensional complex flow with moving boundaries and with a Reynolds number high enough to require a subgrid-scale turbulence model.
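The idea of imposing a boundary condition through a body force on a surface that does not coincide with the grid can be illustrated in one dimension. The sketch below is a toy under stated assumptions, not the scheme of the paper: it solves an explicit 1D diffusion problem, places a wall at a hypothetical off-grid position `x_wall`, and chooses the force at the first fluid node so that a linear interpolation between the wall value and the next fluid node is satisfied after the step. The function name `simulate` and all parameter values are made up for the example.

```python
def simulate(n=21, x_wall=0.23, u_wall=0.0, nu=1.0, steps=5000):
    """Explicit 1D diffusion on [0, 1] with an immersed wall at x_wall.

    Nodes with x < x_wall lie inside the body. The right end is held at u = 1.
    Returns the grid, the solution, and the index of the first fluid node.
    """
    dx = 1.0 / (n - 1)
    dt = 0.4 * dx * dx / nu                    # within the explicit stability limit
    x = [i * dx for i in range(n)]
    i_f = next(i for i in range(n) if x[i] > x_wall)   # first node in the fluid
    # Weight of the linear interpolation between the wall and node i_f + 1.
    w = (x[i_f] - x_wall) / (x[i_f + 1] - x_wall)
    u = [u_wall] * n
    u[-1] = 1.0
    for _ in range(steps):
        rhs = [0.0] * n
        for i in range(1, n - 1):
            rhs[i] = nu * (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (dx * dx)
        # Unforced update of the neighbour used by the interpolation.
        u_neighbour = u[i_f + 1] + dt * rhs[i_f + 1]
        force = [0.0] * n
        for i in range(i_f + 1):
            if i < i_f:
                target = u_wall                               # inside the body
            else:
                target = u_wall + w * (u_neighbour - u_wall)  # interpolated
            # Direct forcing: the force that makes the update hit the target.
            force[i] = (target - u[i]) / dt - rhs[i]
        for i in range(n - 1):                 # last node keeps its fixed value
            u[i] = u[i] + dt * (rhs[i] + force[i])
    return x, u, i_f


x, u, i_f = simulate()
```

In this toy problem the steady solution is linear in x, so the linear interpolation reproduces it exactly on the fluid side; the grid itself never has to be adapted to the wall position, which is the property the abstract emphasizes.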
International Journal of Computational Science and Engineering, 2009
Feast is a hardware-oriented MPI based Finite Element solver toolkit. With the extension FeastGPU the authors have previously demonstrated that significant speed-ups in the solution of the scalar Poisson problem can be achieved by the addition of GPUs as scientific co-processors to a commodity-based cluster. In this paper we put the more general claim to the test: Applications based on Feast, that ran only on CPUs so far, can be successfully accelerated on a co-processor enhanced cluster without any code modifications. The chosen solid mechanics code has higher accuracy requirements and a more diverse CPU/co-processor interaction than the Poisson example, and is thus better suited to assess the practicability of our acceleration approach. We present accuracy experiments, a scalability test and acceleration results for different elastic objects under load. In particular, we demonstrate in detail that the single precision execution of the co-processor does not affect the final accuracy. We establish how the local acceleration gains of factors 5.5 to 9.0 translate into 1.6- to 2.6-fold total speed-up. Subsequent analysis reveals which measures will increase these factors further.
Center for Turbulence Research Annual …, 1998
For fluid dynamics simulations, the primary issues are accuracy, computational efficiency, and the ability to handle complex geometries. Spectral methods offer the highest accuracy but are limited to relatively simple geometries. In order to accommodate more complex ...
… Technical Report No. LA- …, 2009
The discrete-time immersed boundary method is applied to the large-eddy simulation (LES) of complex flows with moving boundaries. The method is adapted to a staggered finite-difference numerical scheme incorporating the dynamic subgrid-scale model. A test case of an IC cylinder with a moving piston and a single fixed valve is simulated and the results are compared with experiments and