Function Based Benchmarks to Abstract Parallel Hardware and Predict Efficient Code Partitioning
Related papers
Genetic algorithms for parallel code optimization
2004
Determining the optimum data distribution, degree of parallelism and communication structure for a given algorithm on distributed-memory machines is not a straightforward task. Assuming that a parallel algorithm consists of consecutive stages, a Genetic Algorithm is proposed to find the best number of processors and the best data distribution method for each stage of the parallel algorithm. A steady-state genetic algorithm is compared with a transgenerational genetic algorithm using different crossover operators. Performance is evaluated in terms of the total execution time of the program, including communication and computation times. A computation-intensive, a communication-intensive and a mixed implementation are used in the experiments. The GA provides satisfactory results for these illustrative examples.
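A minimal sketch of how such a per-stage encoding might look, with a hypothetical cost model standing in for the paper's measured communication and computation times (the gene layout, estimate_time and mutate below are illustrative, not the paper's implementation):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Illustrative: each gene fixes the processor count and data distribution
// for one consecutive stage of the parallel algorithm.
enum class Dist { Row, Column, Block };

struct Gene {
    int procs;   // degree of parallelism for this stage
    Dist dist;   // data distribution method for this stage
};
using Chromosome = std::vector<Gene>;  // one gene per stage

// Hypothetical cost model: computation shrinks with more processors;
// changing the distribution between stages adds redistribution cost.
double estimate_time(const Chromosome& c) {
    double total = 0.0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        total += 1000.0 / c[i].procs;               // computation estimate
        if (i > 0 && c[i].dist != c[i - 1].dist)
            total += 50.0 * c[i].procs;             // communication estimate
    }
    return total;  // fitness = total execution time (lower is better)
}

// One-gene mutation: re-draw the processor count and distribution of a
// randomly chosen stage.
Chromosome mutate(Chromosome c, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, c.size() - 1);
    std::uniform_int_distribution<int> procs(1, 64);
    std::uniform_int_distribution<int> dist(0, 2);
    Gene& g = c[pick(rng)];
    g.procs = procs(rng);
    g.dist = static_cast<Dist>(dist(rng));
    return c;
}
```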
Comparative Study of Parallel Variants for a Particle Swarm Optimization
2009
The Particle Swarm Optimization (PSO) algorithm is a well-known alternative for global optimization based on a bio-inspired heuristic. PSO has good performance, low computational complexity and few parameters. Heuristic techniques have been widely studied in the last twenty years, and the scientific community is still interested in technological alternatives that accelerate these algorithms in order to apply them to bigger and more complex problems. This article presents an empirical study of some parallel variants of a PSO algorithm, implemented on a Graphics Processing Unit (GPU) with multi-thread support and using the most recent parallel programming model for such devices. The main idea is to show that, with the help of a multithreaded GPU, it is possible to significantly improve the performance of the PSO algorithm by means of simple and almost straightforward parallel programming, obtaining the computing power of a cluster in a conventional personal computer.
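For reference, the per-particle work that such GPU variants parallelize (typically one thread per particle) is the standard PSO velocity and position update; the sketch below uses common textbook parameter values, not necessarily the paper's:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Standard PSO update for one particle; on a GPU each thread would apply
// this to its own particle in parallel. w, c1, c2 are common defaults.
void update_particle(std::vector<double>& x, std::vector<double>& v,
                     const std::vector<double>& pbest,  // particle's best
                     const std::vector<double>& gbest,  // swarm's best
                     std::mt19937& rng) {
    const double w = 0.729, c1 = 1.494, c2 = 1.494;  // inertia, cognitive, social
    std::uniform_real_distribution<double> u(0.0, 1.0);
    for (std::size_t d = 0; d < x.size(); ++d) {
        v[d] = w * v[d]
             + c1 * u(rng) * (pbest[d] - x[d])
             + c2 * u(rng) * (gbest[d] - x[d]);
        x[d] += v[d];
    }
}
```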
Lecture Notes in Computer Science, 2012
GPU-based parallel implementations of algorithms are usually compared against the corresponding sequential versions compiled for a single-core CPU machine, without taking advantage of the multicore and SIMD capabilities of modern processors. This leads to unfair comparisons, in which speed-up figures are much larger than what could actually be obtained if the CPU-based version were properly parallelized and optimized. The availability of OpenCL, which compiles parallel code for both GPUs and multi-core CPUs, has made it much easier to compare the execution speed of different architectures while fully exploiting each architecture's best features. We tested our latest parallel implementations of Particle Swarm Optimization (PSO), compiled under OpenCL for both GPUs and multi-core CPUs, and separately optimized for the two hardware architectures. Our results show that, for PSO, a GPU-based parallelization is still generally more efficient than a multi-core CPU-based one. However, the speed-up obtained by the GPU-based version with respect to the CPU-based one is far lower than the orders-of-magnitude figures reported by papers which compare GPU-based parallel implementations to basic single-thread CPU code.
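The methodological point reduces to what the baseline timer measures: speed-up should be computed against an optimized parallel CPU version, not a single-thread one. A minimal timing sketch under that assumption (the workload runners named in the comment are hypothetical):

```cpp
#include <chrono>

// Illustrative: time any workload so a GPU implementation can be compared
// against a properly parallelized CPU baseline rather than naive serial code.
template <typename F>
double seconds(F&& work) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Usage (run_cpu_parallel / run_gpu are hypothetical workload runners):
//   double fair_speedup = seconds(run_cpu_parallel) / seconds(run_gpu);
```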
A Tool for Creating Parallel Swarm Algorithms Automatically on Multi-Core Computers
Learning & Non-Linear Models, 2019
Meta-heuristics are usually bio-inspired algorithms (based on genes or social behaviors) that are used for solving optimization problems in a variety of fields and applications. The basic principle of a meta-heuristic, such as genetic algorithms, differential evolution, particle swarm optimization, etc., is to simulate the pressure that the environment applies to individuals, resulting in the survival of the best ones. Regardless of which meta-heuristic is being used, the more complex the problem, the more time-consuming the algorithm. In this context, parallel computing represents an attractive way of tackling the need for computational power. On the other hand, parallel computing introduces new issues that programmers have to deal with, such as synchronization and the proper exploration of parallel algorithms/models. To avoid these problems and, at the same time, to enable fast development of parallel swarm algorithms, this work presents a tool for automatically generating parallel Particle Swarm Optimization (PSO) code in Java. The generator considers three models of parallelism: master-slave, island and hierarchical. Experiments on the generated code showed that a speedup of 5.3 could be reached with the island model at 2000 iterations using Griewank's function. Moreover, using a cost estimation model (COCOMO), we showed that our tool could save from 4.4 to 14.5 person-months of programming effort.
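Of the three models, master-slave is the simplest: the master runs the PSO loop and farms fitness evaluations out to workers. A sketch of that step (in C++ with OpenMP for brevity, whereas the tool itself generates Java; Griewank's function is the one named in the abstract):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Griewank's function: global minimum 0 at the origin.
double griewank(const std::vector<double>& x) {
    double sum = 0.0, prod = 1.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += x[i] * x[i] / 4000.0;
        prod *= std::cos(x[i] / std::sqrt(double(i + 1)));
    }
    return sum - prod + 1.0;
}

// Master-slave step: the parallel-for plays the role of the slaves,
// evaluating all particles' fitness concurrently.
void evaluate_swarm(const std::vector<std::vector<double>>& swarm,
                    std::vector<double>& fitness) {
    #pragma omp parallel for
    for (long i = 0; i < (long)swarm.size(); ++i)
        fitness[i] = griewank(swarm[i]);
}
```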
Evaluation of Parallel Particle Swarm Optimization Algorithms within the CUDA Architecture
Particle swarm optimization (PSO), like other population-based meta-heuristics, is intrinsically parallel and can be effectively implemented on Graphics Processing Units (GPUs), which are, in fact, massively parallel processing architectures. In this paper we discuss possible approaches to parallelizing PSO on graphics hardware within the Compute Unified Device Architecture (CUDA™), a GPU programming environment by nVIDIA™ which supports the company’s latest cards. In particular, two different ways of exploiting GPU parallelism are explored and evaluated. The execution speed of the two parallel algorithms is compared, on functions which are typically used as benchmarks for PSO, with a standard sequential implementation of PSO (SPSO), as well as with recently published results of other parallel implementations. An in-depth study of the computation efficiency of our parallel algorithms is carried out by assessing speed-up and scale-up with respect to SPSO. Also reported are some results about the optimization effectiveness of the parallel implementations with respect to SPSO, in cases when the parallel versions introduce some possibly significant difference with respect to the sequential version.
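Two of the "functions which are typically used as benchmarks for PSO" are Sphere and Rastrigin; they are shown below for concreteness (the abstract does not list the paper's exact suite, so these are representative choices):

```cpp
#include <cmath>
#include <vector>

// Sphere: the simplest unimodal benchmark; global minimum 0 at the origin.
double sphere(const std::vector<double>& x) {
    double s = 0.0;
    for (double xi : x) s += xi * xi;
    return s;
}

// Rastrigin: a highly multimodal benchmark; global minimum 0 at the origin.
double rastrigin(const std::vector<double>& x) {
    const double pi = 3.14159265358979323846;
    double s = 10.0 * x.size();
    for (double xi : x) s += xi * xi - 10.0 * std::cos(2.0 * pi * xi);
    return s;
}
```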
GAPS: Genetic Algorithm Optimised Parallelisation
1998
The compilation of FORTRAN programs for SPMD execution on parallel architectures often requires the application of program restructuring transformations such as loop interchange, loop distribution, loop fusion, loop skewing and statement reordering. Determining the optimal transformation sequence that minimises execution time for a given program is an NP-complete problem. The hypothesis of the research described here is that genetic algorithm (GA) techniques can be used to determine the...
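A sketch of the search-space encoding this hypothesis implies: a chromosome is a sequence of restructuring transformations, and fitness is the run time of the transformed program (apply_and_time is a hypothetical stand-in for the compile-and-benchmark step; the operators are illustrative):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// A chromosome is an ordered sequence of the transformations named in the
// abstract; the GA searches over sequences rather than single choices.
enum class Xform { Interchange, Distribute, Fuse, Skew, Reorder };
using Sequence = std::vector<Xform>;

// Hypothetical: applies the sequence, compiles, runs, returns seconds.
double apply_and_time(const Sequence& s);

// One-point crossover over transformation sequences.
Sequence crossover(const Sequence& a, const Sequence& b, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> cut(0, a.size());
    std::size_t k = cut(rng);
    Sequence child(a.begin(), a.begin() + k);
    if (k < b.size())
        child.insert(child.end(), b.begin() + k, b.end());
    return child;
}
```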
International Journal of Research in Engineering and Science, 2022
This work presents a comparative study of the performance of three different implementations of the same problem. Initially, a sequential implementation was developed; it served as the basis for the following two, which used the OpenMP and POSIX Threads libraries to parallelize the algorithm. An interesting fact, which may explain the low performance of the Pthread version, is that the sequential algorithm used 100% of one core, the OpenMP version used close to 80% of each core, whereas the Pthread version used around 60% of each allocated core.
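For illustration, the same loop parallelized both ways; the paper's actual workload is not specified in the abstract, so a generic array operation stands in here:

```cpp
#include <cstddef>
#include <pthread.h>
#include <vector>

// OpenMP: the runtime creates threads and splits the iteration space.
void scale_omp(std::vector<double>& a, double k) {
    #pragma omp parallel for
    for (long i = 0; i < (long)a.size(); ++i) a[i] *= k;
}

// POSIX threads: the split and thread lifecycle are managed by hand, which
// is where scheduling and core-utilization differences can creep in.
struct Chunk { double* p; std::size_t n; double k; };

void* scale_chunk(void* arg) {
    Chunk* c = static_cast<Chunk*>(arg);
    for (std::size_t i = 0; i < c->n; ++i) c->p[i] *= c->k;
    return nullptr;
}

void scale_pthreads(std::vector<double>& a, double k, int nthreads) {
    std::vector<pthread_t> tid(nthreads);
    std::vector<Chunk> chunk(nthreads);
    std::size_t per = a.size() / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        std::size_t lo = t * per;
        std::size_t n = (t == nthreads - 1) ? a.size() - lo : per;
        chunk[t] = {a.data() + lo, n, k};
        pthread_create(&tid[t], nullptr, scale_chunk, &chunk[t]);
    }
    for (int t = 0; t < nthreads; ++t) pthread_join(tid[t], nullptr);
}
```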
Evaluation of Parallel Particle Swarm Optimization Algorithms within the CUDA(TM) Architecture
Information Sciences, 2010
Particle Swarm Optimization (PSO), like other population-based meta-heuristics, is intrinsically well suited for parallel implementation on Graphics Processing Units (GPUs), which are, in fact, massively parallel processing architectures. In this paper we discuss possible approaches to parallelizing PSO on graphics hardware by means of the Compute Unified Device Architecture (CUDA), a GPU programming environment by nVIDIA which supports its latest cards. In particular, two different ways of exploiting GPU parallelism are explored and evaluated. The execution speed of the parallel algorithms is compared with a standard sequential implementation of PSO (SPSO), as well as with recently published results of other parallel implementations, on functions which are typically used as benchmarks for PSO. An in-depth study of the computational efficiency of our parallel algorithms is made by analyzing speed-up and scale-up with respect to sequential SPSO. Some results on the optimization effectiveness of the parallel implementations with respect to SPSO, in cases where the parallel versions introduce some possibly significant difference with respect to SPSO, are also reported.
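The speed-up and scale-up measures referenced here are the standard ones, stated below for reference (notation mine, not the paper's):

```latex
% Speed-up of the parallel algorithm on p processing elements, relative
% to sequential SPSO, and the corresponding parallel efficiency:
S(p) = \frac{T_{\mathrm{SPSO}}}{T_{\mathrm{par}}(p)}, \qquad
E(p) = \frac{S(p)}{p}
% Scale-up asks how run time behaves when the problem size grows along
% with the available parallel resources, ideally remaining near constant.
```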
On the use of a genetic algorithm in high performance computer benchmark tuning
2008
The High-Performance Linpack (HPL) package is a reference benchmark used worldwide to evaluate high-performance computing platforms. Adjusting HPL's seventeen tuning parameters to achieve maximum performance is a time-consuming task that must otherwise be performed by hand. In this paper, we show how a genetic algorithm may be exploited to automatically determine the parameters that maximize the benchmark's result. We propose a GA-based approach without committing to a particular GA: our investigation relies on the Acovea framework, which managed repeated runs of the benchmark to explore the very large space of parameter combinations on the test-case cluster. This work opens the possibility of creating a fully automatic benchmark tuning tool.
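A sketch of the encoding such a tuner implies: an individual holds HPL.dat parameters and its fitness is the achieved performance of one benchmark run (N, NB, P and Q are real HPL parameters, but this struct, run_hpl and the mutation are hypothetical stand-ins, not Acovea's representation):

```cpp
#include <random>

// A GA individual: a handful of the seventeen HPL.dat tuning parameters.
struct HplConfig {
    int n;    // problem size N
    int nb;   // block size NB
    int p, q; // process grid P x Q (p * q = number of MPI ranks)
};

// Hypothetical: writes HPL.dat, runs the benchmark once, returns Gflop/s.
double run_hpl(const HplConfig& c);

// Perturb one parameter; a real tuner would vary all seventeen.
HplConfig mutate(HplConfig c, std::mt19937& rng) {
    std::uniform_int_distribution<int> nb(32, 256);
    c.nb = nb(rng);
    return c;
}
```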