OpenMP Research Papers - Academia.edu
This paper presents a parallel implementation of geodesic distance transform using OpenMP. We show how a sequential chamfer distance algorithm can be executed on parallel processing units with shared memory, such as the multiple cores of a modern CPU. Experimental results show that a speedup of 2.6 times can be achieved on a quad-core machine without loss of accuracy. This work forms part of a C implementation for geodesic superpixel segmentation of natural images.
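A band-parallel sketch of the idea in C (an assumption about the scheme; the paper's exact decomposition may differ). Each thread sweeps a horizontal band with the usual forward chamfer mask; because distances only ever decrease, the sweeps can be repeated until a pass reports no change, letting information cross band borders:

```c
#include <stdbool.h>
#include <omp.h>

/* One band-parallel iteration of the forward chamfer pass.
 * dist is row-major H x W, initialised to 0 on seed pixels and a
 * large value elsewhere. Returns true if any value changed.
 * Racy reads across band borders are tolerated in this sketch:
 * values only decrease, and sweeping repeats until a pass with no
 * change, so the fixpoint is still reached. */
static bool chamfer_forward_banded(float *dist, int H, int W)
{
    bool changed = false;
    #pragma omp parallel reduction(||: changed)
    {
        int nt = omp_get_num_threads(), t = omp_get_thread_num();
        int r0 = (int)((long)H * t / nt), r1 = (int)((long)H * (t + 1) / nt);
        for (int i = r0; i < r1; i++)
            for (int j = 0; j < W; j++) {
                float d = dist[i * W + j];   /* mask weights 1 and ~sqrt(2) */
                if (j > 0            && dist[i*W + j-1]     + 1.0f     < d) d = dist[i*W + j-1]     + 1.0f;
                if (i > 0            && dist[(i-1)*W + j]   + 1.0f     < d) d = dist[(i-1)*W + j]   + 1.0f;
                if (i > 0 && j > 0   && dist[(i-1)*W + j-1] + 1.41421f < d) d = dist[(i-1)*W + j-1] + 1.41421f;
                if (i > 0 && j < W-1 && dist[(i-1)*W + j+1] + 1.41421f < d) d = dist[(i-1)*W + j+1] + 1.41421f;
                if (d < dist[i * W + j]) { dist[i * W + j] = d; changed = true; }
            }
    }
    return changed;
}
```

A matching backward sweep (bottom-right to top-left) alternates with this one until neither reports a change.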
In this paper, we present OMP2MPI, a tool that automatically generates MPI source code from OpenMP. With this transformation the original program can exploit a larger number of processors by surpassing the limits of the node level on large HPC clusters. The transformation is also useful for adapting the source code to run on distributed-memory many-cores with message-passing support. In addition, the resulting MPI code can be used as a starting point that can be further optimized by software engineers. The transformation process focuses on detecting OpenMP parallel loops and distributing them in a master/worker pattern. A set of micro-benchmarks has been used to verify the correctness of the transformation and to measure the resulting performance. Surprisingly, not only is the automatically generated code correct by construction, but it also often performs faster when executed with MPI.
- by David Castells-Rufas and +1
- Parallel Computing, Compilers, Parallel Programming, MPI
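To make the transformation concrete, here is a hand-written before/after sketch (an illustration, not OMP2MPI output: the tool emits a master/worker pattern, whereas this sketch uses a simpler static block split over ranks):

```c
#include <mpi.h>
#include <stdlib.h>

#define N 1024   /* assumed divisible by the number of ranks */

/* Original shared-memory loop:
 *     #pragma omp parallel for
 *     for (int i = 0; i < N; i++) y[i] = 2.0 * x[i];
 */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int chunk = N / size;

    double *x = NULL, *y = NULL;
    if (rank == 0) {                      /* master owns the full arrays */
        x = malloc(N * sizeof *x);
        y = malloc(N * sizeof *y);
        for (int i = 0; i < N; i++) x[i] = i;
    }
    double *xl = malloc(chunk * sizeof *xl);
    double *yl = malloc(chunk * sizeof *yl);

    MPI_Scatter(x, chunk, MPI_DOUBLE, xl, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    for (int i = 0; i < chunk; i++)       /* the former loop body */
        yl[i] = 2.0 * xl[i];
    MPI_Gather(yl, chunk, MPI_DOUBLE, y, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```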
Nowadays, typical desktop computer processors have four or more independent CPU cores; these are called multi-core processors. Parallel programming therefore comes into play to execute instructions concurrently on multi-core architectures, here using OpenMP. Users rely on cryptographic algorithms to encrypt and decrypt data in order to send it securely over an unsafe environment such as the internet. This paper describes the parallel implementation and test results of the Caesar cipher and RSA cryptographic algorithms using the OpenMP API 3.1 standard, together with a performance analysis. According to our test results, the parallel design of the security algorithms exhibits improved performance over the sequential approach in terms of execution time.
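As a flavor of the loop-level parallelism involved, a minimal Caesar-cipher kernel (a sketch, not the paper's code; the shift is assumed to be in [0, 25]):

```c
#include <stddef.h>

/* Caesar shift over a byte buffer: every character is independent,
 * so the loop parallelises directly. */
void caesar_encrypt(unsigned char *buf, size_t n, int shift)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        if (buf[i] >= 'A' && buf[i] <= 'Z')
            buf[i] = 'A' + (buf[i] - 'A' + shift) % 26;
        else if (buf[i] >= 'a' && buf[i] <= 'z')
            buf[i] = 'a' + (buf[i] - 'a' + shift) % 26;
}
```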
In this work, we have implemented the classic algorithms to compute the eigenvalues of an n x n symmetric matrix A that is sparse and whose order exceeds the thousands, which makes it necessary to use an efficient structure for handling sparse matrices and to adapt the classic numerical methods to this structure; the algorithms obtain all the eigenvalues and, where possible, the eigenvectors of the matrix. Parallelism is used in the implementation of the algorithms to reduce execution times. Francis's QR and QL algorithms use similarity transformations to bring the matrix to diagonal form, so the eigenvalues are preserved in each iteration; they are also robust methods for computing eigenvalues and their associated eigenvectors. In this context, the study is carried out on modern supercomputers that can execute more than one instruction and process multiple data simultaneously. Together with the UCSparseLib library of the Universidad de Carabobo, which already provides the necessary and efficient structures for handling sparse matrices, we seek to improve the execution time of the serial algorithm by applying multithreading with the OpenMP library.
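The abstract does not show UCSparseLib's internals; as a generic illustration, the sparse kernel at the heart of such eigensolvers, the CSR matrix-vector product, parallelizes cleanly over rows:

```c
/* CSR sparse matrix-vector product y = A*x (generic sketch; the
 * UCSparseLib data layout may differ). Rows are independent, and
 * dynamic scheduling absorbs the uneven row lengths of a sparse
 * matrix. */
void csr_spmv(int n, const int *rowptr, const int *col,
              const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            s += val[k] * x[col[k]];
        y[i] = s;   /* each thread writes distinct rows: no race */
    }
}
```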
This book, “Multi-Core Architectures and Programming”, is an introductory conceptual treatment of multicore processors, their architecture, and programming using the OpenMP API. It outlines multicore architecture and its functional blocks, such as interconnects, cache, and memory. It explains how operating-system process scheduling is performed on a multicore processor. Shared-memory programming on multicore processors using the OpenMP API and its libraries for the C language is also discussed.
- by P Krishna Sankar and +1
- OpenMP, Multicore processors, Multicore Programming
It is important for users and application developers to obtain the results of methods used in solving scientific and engineering problems rapidly. Parallel programming techniques have been developed alongside serial programming because the importance of performance has been increasing day by day in computer applications. Various methods such as the Gauss Elimination (GE) Method, the Gauss-Jordan Elimination (GJE) Method, and the Thomas Method have been used in the solution of Linear Equation Systems (LES). In this study, a performance comparison is made between Open Multi-Processing (OpenMP) and the Compute Unified Device Architecture (CUDA) for an n x n matrix via the GJE Method. The GJE Method is a variant of GE used in solving linear systems of equations (Ax = B). Each step of the GJE solution algorithm is independent of the others, so the method is well suited to a parallel computing structure; therefore, it was chosen for this study. An application coded in the C programming language was developed using OpenMP and CUDA. OpenMP is an Application Program Interface that allows parallel programming using compiler directives on the Central Processing Unit (CPU). CUDA is NVIDIA's parallel computing architecture; it enables significant increases in computing performance by using the Graphics Processing Unit (GPU). The application was run on an Intel Core 2 Quad Q8200 2.33 GHz processor and a GeForce 9500 GT graphics card. It is observed that the application using the Grid-Block-Thread structure and optimized with CUDA displays higher performance than OpenMP in terms of time.
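A minimal OpenMP version of the GJE step the paper exploits (a sketch without pivoting, which real code should add): for a fixed pivot k, the updates to the other rows are mutually independent, which is exactly the property that maps to both CPU threads and GPU threads.

```c
/* Gauss-Jordan elimination on the augmented matrix a (n x (n+1));
 * on exit, column n holds the solution of Ax = B. */
void gauss_jordan(int n, double a[n][n + 1])
{
    for (int k = 0; k < n; k++) {
        double piv = a[k][k];                         /* assumed nonzero */
        for (int j = k; j <= n; j++) a[k][j] /= piv;  /* normalise pivot row */

        #pragma omp parallel for
        for (int i = 0; i < n; i++) {                 /* independent rows */
            if (i == k) continue;
            double f = a[i][k];
            for (int j = k; j <= n; j++) a[i][j] -= f * a[k][j];
        }
    }
}
```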
A sequential genetic algorithm for solving the Container Loading Problem uses only one processor even when run on a multicore system. The goal of this research is to optimize the performance of multicore systems; to do so, the genetic algorithm for solving the Container Loading Problem must be parallelized. In this research, shared-memory parallelization using OpenMP is employed; parallelization in OpenMP is done by inserting parallel OMP pragmas. The serial genetic algorithm for the Container Loading Problem is parallelized with OpenMP into a parallel genetic algorithm, after which the algorithm's execution time is measured and the speedup computed. The results show that the computation time of the parallel algorithm is smaller than that of the sequential algorithm before parallelization with OpenMP. This smaller computation time indicates that the parallel algorithm is more efficient than the sequential one; the improvement can also be identified from the growing parallelization speedup. The author's conclusion from this research is that parallelizing the genetic algorithm for solving the Container Loading Problem improves computation-time efficiency by exploiting the multicore system.
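A sketch of where the inserted OMP pragma typically goes in such a genetic algorithm: fitness evaluation, since individuals are scored independently (Chromosome and eval_packing are hypothetical stand-ins for the thesis code):

```c
#define POP_SIZE 256

typedef struct {
    int perm[64];        /* hypothetical box-order encoding */
} Chromosome;

/* hypothetical stand-in for the container-loading fitness function */
double eval_packing(const Chromosome *c);

void evaluate_population(const Chromosome *pop, double *fitness)
{
    /* individuals are independent, so this loop carries the inserted
     * pragma; dynamic scheduling absorbs uneven evaluation costs */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < POP_SIZE; i++)
        fitness[i] = eval_packing(&pop[i]);
}
```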
Current multi-core architectures have become popular due to performance and the efficient processing of multiple tasks simultaneously. Today's parallel algorithms focus on multi-core systems. The design of parallel algorithms and the measurement of their performance are major issues in a multi-core environment. If one wishes to execute a single application faster, the application must be divided into subtasks or threads to deliver the desired result. Numerical problems, especially the solution of linear systems of equations, have many applications in science and engineering. This paper describes and analyzes parallel algorithms for computing the solution of a dense system of linear equations, and for approximately computing the value of π, using the OpenMP interface. The performance (speedup) of the parallel algorithms on a multi-core system is presented. The experimental results on a multi-core processor show that the proposed parallel algorithms achieve good performance (speedup) compared to the sequential ones.
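The π part of the study is presumably the classic reduction example; a minimal sketch (assuming the midpoint-rule integral formulation, which the abstract does not spell out):

```c
#include <stdio.h>

/* pi = integral of 4/(1+x^2) on [0,1], approximated by the midpoint
 * rule; the reduction clause gives each thread a private partial sum. */
int main(void)
{
    const long n = 100000000;
    const double h = 1.0 / n;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) * h;
        sum += 4.0 / (1.0 + x * x);
    }
    printf("pi ~ %.12f\n", h * sum);
    return 0;
}
```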
In this document, I present progress toward self-consistent evolutions for Extreme Mass Ratio Inspirals of binary black hole systems, using the scalar self-force approximation with an effective source. Self-consistent evolutions are a critical step toward calculating gravitational wave templates for LISA, in order to resolve late times in the inspiral, where the timescale of the orbit's evolution is shorter than a period as the small black hole approaches the plunge.
First I describe the preliminary development of a code base in C++ based on Peter Diener's Fortran scalar self-force simulation [20]. I have successfully implemented the wave equation in flat spacetime, reproduced the wave equation in Schwarzschild spacetime without a source, and reproduced circular orbits in Schwarzschild spacetime with an effective source at roundoff precision. However, the effort to implement an independent, state-of-the-art code base was abandoned in favor of working together on Peter Diener's existing simulation when the time scale for development of the C++ code proved too long.
In the second half of this document, I present preliminary results for estimates of the errors from Peter Diener's and Barry Wardell's time domain scalar self-force simulation for eccentric orbits using Niels Warburton's frequency domain initial conditions. First I perform a first order Richardson extrapolation to obtain the self-force at infinite Discontinuous Galerkin order, F∞, then sum the spherical harmonic modes to obtain the total self-force as a function of time. Preliminary best choice Discontinuous Galerkin evolution orders, 36 to 40, are estimated based on the absolute and relative differences of the radial self-force for each DG order and F∞ over one radial oscillation. Preliminary best choice mode-fit parameters, lmin = 14 and lmax = 25, are estimated based on residuals of the l-mode fit and on discontinuities, attributed to roundoff error, in a surface plot of the total radial self-force as a function of lmin and lmax. The contributions to the error from the number of terms in the l-mode fit, the choice of end points in the l-mode fit, and the inability to calculate the Richardson extrapolation in real time are comparable, at the level of one part in 10^4. The contribution to the error from the use of constant weight factors in the fit is insignificant. These conclusions are preliminary and will need to be evaluated more thoroughly.
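For the extrapolation step, one simple realization, assuming geometric convergence of F with DG order n, i.e. F_n ~ F∞ + C q^n with |q| < 1 (this is the Aitken Delta-squared form; the document's exact first-order Richardson prescription may differ):

```c
#include <math.h>

/* Extrapolate a quantity computed at three consecutive DG orders
 * (f_nm1, f_n, f_np1) to its infinite-order limit under the
 * geometric-convergence assumption stated above. */
double extrap_inf_order(double f_nm1, double f_n, double f_np1)
{
    double d1 = f_n   - f_nm1;
    double d2 = f_np1 - f_n;
    double denom = d2 - d1;
    if (fabs(denom) < 1e-300) return f_np1;   /* already converged */
    return f_np1 - d2 * d2 / denom;           /* Aitken Delta-squared */
}
```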
The paper presents a parallel implementation of the Dynamic Itemset Counting (DIC) algorithm for many-core systems, where DIC is a variation of the classical Apriori algorithm. We propose a bit-based internal layout for transactions and itemsets, with the assumption that such a representation of the transaction database fits in main memory. This technique reduces the memory needed to store the transaction database and also simplifies support counting and candidate itemset generation via logical bitwise operations. The implementation uses OpenMP and thread-level parallelism. Experimental evaluation on Intel Xeon CPU and Intel Xeon Phi coprocessor platforms with a large synthetic database showed good performance and scalability of the proposed algorithm.
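A sketch of the bit-based support counting the abstract describes (the exact layout and names are assumptions; popcount uses the GCC/Clang builtin):

```c
#include <stdint.h>

/* Bit layout: transaction t's membership in item j's column is bit
 * (t % 64) of col[j][t / 64]. Support of a k-itemset then reduces
 * to AND-ing the item columns and popcounting, parallelised over
 * 64-transaction words. */
long support(const uint64_t *const *col, const int *items, int k,
             long nwords)
{
    long cnt = 0;
    #pragma omp parallel for reduction(+:cnt)
    for (long w = 0; w < nwords; w++) {
        uint64_t m = col[items[0]][w];
        for (int i = 1; i < k; i++)
            m &= col[items[i]][w];          /* rows containing all items */
        cnt += __builtin_popcountll(m);     /* GCC/Clang builtin */
    }
    return cnt;
}
```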
In this paper, we present a new parallel granularity called "tiling" to parallelize the H.264 codec. The new parallel granularity, which has the same granularity level as the parallel slice-level H.264 codec, is based on decomposing the entire video frame into tiles by utilizing a new inherently parallel 2D domain decomposition method. To assess the proposed approach, its parallel scalability, bit rate, and parallel impact on visual quality (peak signal-to-noise ratio) are compared with those of other approaches. Empirical results show significant improvements in encoding time as compared to the serial and the parallel slice-level approaches. In terms of peak signal-to-noise ratio and bit rate, certain results improved, a few were comparable, while a few were discouraging, when compared to the results of the other approaches. However, addressing the limitations of the proposed method is highlighted as future work.
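The tile-level parallelism maps naturally onto a collapsed OpenMP loop; a sketch (encode_tile is a hypothetical stand-in for the codec's per-tile entry point):

```c
/* hypothetical per-tile encoder call */
void encode_tile(int tx, int ty);

/* Each tile of the 2D decomposition is encoded independently;
 * collapse(2) flattens the tile grid into one iteration space and
 * dynamic scheduling balances tiles of unequal complexity. */
void encode_frame(int tiles_x, int tiles_y)
{
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int ty = 0; ty < tiles_y; ty++)
        for (int tx = 0; tx < tiles_x; tx++)
            encode_tile(tx, ty);
}
```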
In this work, two-dimensional numerical simulations are carried out to investigate the unsteady mixed convection heat transfer in a laminar cross-flow from two equal-sized isothermal in-line cylinders confined inside a vertical channel. The governing equations are solved using the vorticity-stream function formulation of the incompressible Navier–Stokes and energy equations using the control-volume method on a non-uniform orthogonal Cartesian grid. The numerical scheme is validated for the standard case of a symmetrically confined isothermal circular cylinder in a plane channel. Calculations are performed for flow conditions with a Reynolds number of ReD = 200, a fixed value of the Prandtl number of Pr = 0.744, values of the buoyancy parameter (Richardson number) in the range −1 ≤ Ri ≤ 4, and a blockage ratio of BR = D/H = 0.3. All possible flow regimes are considered by setting the pitch-to-diameter ratios (σ = L/D) to 2, 3 and 5. The interference effects and complex flow features are presented in the form of mean and instantaneous velocity, vorticity and temperature distributions. In addition, separation angles, time traces of velocity fluctuation, Strouhal number, characteristic times of flow oscillation, phase-space relation between the longitudinal and transverse velocity signals, wake structure, and recirculation length behind each cylinder have been determined. Local and space-averaged Nusselt numbers for the upstream and downstream cylinders have also been obtained. The results reported herein demonstrate how the flow and heat transfer characteristics are significantly modified by the wall confinement, tube spacing, and thermal effects for a wide range in the parametric space.
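For flavor, the innermost work in such a solver is stencil sweeps; below is a generic uniform-grid Jacobi sweep for the stream-function Poisson equation ψxx + ψyy = −ω (the paper's control-volume scheme on a non-uniform grid is more involved):

```c
/* One Jacobi sweep on a uniform grid with spacing h; interior points
 * only. Reads psi, writes psi_new, so the parallel rows are
 * race-free. */
void jacobi_sweep(int nx, int ny, double h,
                  const double *psi, double *psi_new, const double *omega)
{
    #pragma omp parallel for
    for (int i = 1; i < ny - 1; i++)
        for (int j = 1; j < nx - 1; j++)
            psi_new[i*nx + j] = 0.25 * (psi[(i-1)*nx + j] + psi[(i+1)*nx + j]
                                      + psi[i*nx + j-1]  + psi[i*nx + j+1]
                                      + h * h * omega[i*nx + j]);
}
```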
Photonic crystals are a novel class of optical materials that, almost certainly, will underpin major advances in future communication and computer technology. In a photonic crystal, the periodic distribution of refractive index gives rise to interferometric action which leads to band gaps, or frequency ranges for which light cannot propagate. Material or structural defects in the crystal can give rise to localised states, or field modes, that are the analogues of impurity modes in semiconductors, changing the radiation dynamics of the crystal and providing the ability to mould the flow of light in a variety of ways. The radiation dynamics are characterised by the local density of states (LDOS) and, in this paper, we describe a new, highly accurate and efficient technique based on field multipole methods for computing the LDOS. Its implementation on SMP systems using the OpenMP and MPI protocols is discussed and we illustrate its applicability in studies of ordered and disordered crystals.
This report is the result of a study of computational improvement in the DD3IMP software package through High Performance Computing. DD3IMP is a software package for numerical simulation based on the Finite Element Method (FEM). The program simulates the forming processes of sheet metals and elastoplasticity through deep drawing using FEM. The performance of this software depends directly on the performance of the linear-equation-system solver it uses. In the current version of DD3IMP, the main solver is the Direct Sparse Solver (DSS) from the Intel Math Kernel Library (MKL). This is an optimized solver which has shown the best performance for solving the linear system of equations in DD3IMP on machines based on Intel processors. The entire program is written in the Fortran programming language, with about 500 routines and 60k lines of code, and is already parallelized in the shared-memory paradigm using OpenMP directives. In this work we explore the DD3IMP program, use profiling tools to detect where the program is most computationally expensive, and explore the possibilities for increasing its performance. The program is analysed on SeARCH Cluster nodes based on Intel Xeon processors with the Ivy Bridge microarchitecture, and on a team laptop with an Intel Core i7 based on the Haswell microarchitecture.

I. The DD3IMP package (Deep-Drawing 3D Implicit FE Solver)

DD3IMP is a software package for the simulation of forming and elastoplasticity of sheet metals through deep drawing using finite element methods. The program was developed and implemented in Fortran 95 and has more than 500 routines and 60k lines of code. The part of DD3IMP which performs the most work is the repeated solution of a linear equation system. Solving a linear equation system can be a computationally intensive task, and since DD3IMP solves such systems many times, their resolution can become a bottleneck for performance scalability. As the most computationally heavy regions of this software correspond to solving a linear equation system, the global performance is directly affected by the solver used. The system involved in DD3IMP is the linear system Ax = b, where A is a non-symmetric sparse matrix (symmetric in structure but non-symmetric in values) in CSR format that represents the mesh structure, x is the displacement vector, and b is the vector of external forces. The solver currently implemented in DD3IMP is DSS (Direct Sparse Solver) from Intel's Math Kernel Library (MKL). The previous one was a solver based on the conjugate gradient method, the conjugate gradient squared (CGS) combined with an ILU preconditioner, which was replaced by DSS for performance reasons. However, both solvers are currently available in the DD3IMP package and the user can select which one to use. The main difference between them is that CGS is an iterative method while DSS is a direct method. Iterative methods are commonly known for computational efficiency and fast convergence; however, previous studies of this software revealed that DSS was the faster solver on Intel-processor-based machines in the OpenMP implementation of DD3IMP. This optimized library is particularly efficient for large problems, and the scalability of DSS allows the program to scale almost linearly, as we will see.

II. Starting Point and Case Studies

The starting point is DD3IMP in its sequential form and a parallel version with OpenMP directives. As shown in the profiling section, in the current parallelized version more than 97% of DD3IMP's execution time runs in parallel. We also have three different case studies. Since DD3IMP is a finite element method package, the program uses numerical techniques to find approximate solutions to boundary-value problems with differential equations.
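For reference, the DSS call sequence such a code drives looks roughly like this (a sketch based on MKL's documented DSS interface; option constants and error handling should be checked against the MKL manual):

```c
#include "mkl_dss.h"

/* Solve the CSR system Ax = b, symmetric in structure but
 * non-symmetric in values, as in DD3IMP. rowIndex/columns are the
 * 1-based CSR arrays; error checks omitted for brevity. */
void solve_with_dss(MKL_INT n, MKL_INT nnz, MKL_INT *rowIndex,
                    MKL_INT *columns, double *values,
                    double *b, double *x)
{
    _MKL_DSS_HANDLE_t handle;
    MKL_INT opt  = MKL_DSS_DEFAULTS;
    MKL_INT sym  = MKL_DSS_SYMMETRIC_STRUCTURE;  /* structure only */
    MKL_INT type = MKL_DSS_INDEFINITE;
    MKL_INT nrhs = 1;

    dss_create(handle, opt);
    dss_define_structure(handle, sym, rowIndex, n, n, columns, nnz);
    dss_reorder(handle, opt, 0);              /* fill-reducing ordering */
    dss_factor_real(handle, type, values);    /* factorisation; MKL runs
                                                 this in parallel */
    dss_solve_real(handle, opt, b, nrhs, x);  /* forward/back substitution */
    dss_delete(handle, opt);
}
```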
Single-ISA heterogeneous multicore systems are emerging as a promising direction to achieve a more suitable balance between performance and energy consumption. However, proper utilization of these architectures is essential to realize the energy benefits. In this paper, we demonstrate the ineffectiveness of popular OpenMP scheduling policies when executing the Rodinia benchmark suite on the Exynos 5 Octa (5422) SoC, which integrates the ARM big.LITTLE architecture.
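A sketch of how such scheduling-policy comparisons are usually run: compile once with schedule(runtime) and vary OMP_SCHEDULE between runs (work() is a hypothetical kernel, not the paper's harness):

```c
#include <omp.h>

void work(int i);   /* hypothetical per-iteration kernel */

/* schedule(runtime) defers the policy to the OMP_SCHEDULE
 * environment variable, so one binary can be timed under static,
 * dynamic or guided scheduling. */
void run(int n)
{
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++)
        work(i);
}

/* e.g.  OMP_SCHEDULE="dynamic,16" OMP_NUM_THREADS=8 ./a.out
 * On big.LITTLE, a static split gives slow cores as many iterations
 * as fast ones; dynamic lets fast cores take more. */
```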
esys.escript is a python-based environment for implementing mathematical models, in particular those based on coupled, non-linear, time-dependent partial differential equations. It consists of four major components:
- the esys.escript core library
- the finite element solver esys.finley (which uses fast vendor-supplied solvers or our paso linear solver library)
- the meshing interface esys.pycad
- a model library
The current version supports parallelization through both MPI for distributed memory and OpenMP for shared memory.
Transient laminar opposing mixed convection in a gravity driven downward flow confined inside a vertical rectangular channel has been investigated, with both walls suddenly subjected to symmetrical isothermal heat sources over a finite portion of the channel walls. The unsteady two-dimensional Navier-Stokes and energy equations have been solved numerically for a wide parametric set. Studies are carried out for Reynolds numbers of 100 and 200 and several values of buoyancy strength or Richardson number. The effect of Reynolds number and opposing buoyancy on the temporal evolution of the overall flow structure, temperature field, and Nusselt number from the heated surfaces is investigated using fixed geometrical parameters and considering heat losses to the channel walls. In this parameter space, for a given Reynolds number and relatively small values of the buoyancy parameter, the transient process leads to a final symmetric or asymmetric steady state. However, as the value of buoyancy strength increases, the flow and temperature fields become more complex, and an oscillatory flow with a fundamental frequency sets in when a critical value of the Richardson number is reached. Numerical predictions show that the critical value of the Richardson number between the two regimes strongly depends on the value of the Reynolds number, and the time scales, natural frequencies, and phase-space portraits of flow oscillation are presented and discussed in detail. Stability of the symmetric response has been analyzed. The results include the effects of Prandtl number and heat losses to the channel walls on the evolution of the final flow and thermal responses.
In this paper, we study various parallelization schemes for the Variable Neighborhood Search (VNS) metaheuristic on a CPU-GPU system via OpenMP and OpenACC. A hybrid parallel VNS method is applied to recent benchmark problem instances for the multi-product dynamic lot sizing problem with product returns and recovery, which appears in reverse logistics and is known to be NP-hard. We report our findings regarding these parallelization approaches and present promising computational results.
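One coarse-grained CPU-side option for parallelizing VNS is to explore several neighborhoods of the incumbent concurrently and keep the best move (a sketch with hypothetical Solution/explore_neighborhood stand-ins; the paper additionally offloads work to the GPU via OpenACC):

```c
typedef struct { double cost; /* ... encoding ... */ } Solution;

/* hypothetical: best solution found in neighbourhood k of base */
Solution explore_neighborhood(Solution base, int k);

Solution parallel_vns_step(Solution incumbent, int K)
{
    Solution best = incumbent;
    #pragma omp parallel
    {
        Solution local = incumbent;          /* per-thread best */
        #pragma omp for nowait
        for (int k = 0; k < K; k++) {
            Solution s = explore_neighborhood(incumbent, k);
            if (s.cost < local.cost) local = s;
        }
        #pragma omp critical                 /* merge thread results */
        if (local.cost < best.cost) best = local;
    }
    return best;
}
```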
- by Samuel Thibault, François Broquedis, Brice Goglin, Raymond Namyst, and Pierre-André Wacrenier (INRIA Futurs - LaBRI, Talence, France)
This study details the acceleration techniques and associated performance gains in the time integration of coupled poromechanical problems using the Discrete Element Method (DEM) and a Pore-scale Finite Volume (PFV) scheme in the Yade open-source DEM software. Specifically, the time integration is accelerated by reducing the frequency of costly matrix factorizations (matrix factor reuse), moving the matrix factorizations to background POSIX threads (multithreaded factorization), factorizing the matrix on a GPU (accelerated factorization), and running PFV pressure and force calculations in parallel to the DEM interaction loop using OpenMP threads (parallel task management). Findings show that these four acceleration techniques combine to accelerate the numerical poroelastic oedometer solution by 170x, which enables more frequent triangulation of large-scale time-dependent DEM+PFV simulations (356 thousand+ particles, 2.1 million DOFs).
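A sketch of the multithreaded-factorization idea: a POSIX thread refreshes the factor in the background while the DEM loop keeps stepping (factorize and dem_step are hypothetical stand-ins; real code must double-buffer the factor so the stepping loop never reads one being rebuilt):

```c
#include <pthread.h>

#define REFACTOR_EVERY 100          /* hypothetical refresh interval */

void factorize(void *matrix);       /* hypothetical: rebuilds the factor */
void dem_step(int step);            /* hypothetical: OpenMP-parallel DEM step */

static void *factor_worker(void *arg) { factorize(arg); return NULL; }

void run(int nsteps, void *matrix)
{
    pthread_t th;
    int inflight = 0;
    for (int step = 0; step < nsteps; step++) {
        if (step % REFACTOR_EVERY == 0) {
            if (inflight) pthread_join(th, NULL);    /* adopt new factor */
            pthread_create(&th, NULL, factor_worker, matrix);
            inflight = 1;
        }
        dem_step(step);   /* keeps using the previous factor meanwhile */
    }
    if (inflight) pthread_join(th, NULL);
}
```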
Multicore embedded systems are rapidly emerging. Hardware designers are packing more and more features into their designs. Introducing heterogeneity into these systems, i.e. adding cores of varying types, provides opportunities to solve problems in different ways. However, this presents several challenges to embedded system programmers, since software is still not mature enough to efficiently exploit the capabilities of the emerging hardware rich with cores of varying types. Programmers still rely on understanding and using low-level, hardware-specific APIs. This approach is not only time-consuming but also tedious and error-prone. Moreover, the solutions developed are closely tied to a particular piece of hardware, raising significant concerns about software portability. What we need is an industry standard that will enable better programming practices for both current and future embedded systems. To that end, in our project, we have explored the possibility of using existing standards such as OpenMP, which provides portable high-level programming constructs, along with another industry-driven standard for multicore systems, MCA. For our work, we have considered the GNU compiler, since it is the compiler most used in the embedded system domain, facilitating open-source development. We target a platform consisting of twelve PowerPC e6500 64-bit dual-threaded cores. We create a portable software solution by studying the GNU OpenMP runtime library and extending it to incorporate the MCA libraries. The solution abstracts the low-level details of the target platform, and the results show that the additional MCA layer does not incur any overhead. The results are competitive when compared with a proprietary toolchain.
- by Peng Sun
- Embedded Systems, OpenMP, MRAPI
As programmers, we aspire to solve ever larger, more memory-intensive problems, or simply to solve problems with greater speed than is possible on a sequential computer. A programmer can turn to parallel programming and parallel computers to satisfy these needs. Parallel programming methods on parallel computers give access to greater memory and Central Processing Unit (CPU) resources than are available on sequential computers. This paper discusses the benefits of developing 2D and 3D convex hull algorithms as mixed-mode MPI/OpenMP applications on both single and clustered SMPs. In this experiment, for the purpose of optimizing the 3D convex hull, we merged the MPI and OpenMP libraries, which yields a mixed-mode programming method with optimized results. The job is divided into sub-jobs that are submitted to a cluster of SMP nodes using MPI, and these sub-jobs are computed in parallel using OpenMP threads on the SMP nodes. Experiments on sequential, MPI, OpenMP and hybrid programming models ...
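The mixed-mode structure the paper describes, reduced to a skeleton (hull computation and data movement elided):

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* MPI splits the job across SMP nodes; OpenMP threads cooperate on
 * each node's sub-job. */
int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank would receive its block of points here (MPI_Scatterv) */

    #pragma omp parallel
    {
        #pragma omp single
        printf("rank %d/%d using %d threads\n",
               rank, size, omp_get_num_threads());
        /* threads cooperate on the local hull computation here */
    }

    /* local hulls would be gathered and merged on rank 0 (MPI_Gatherv) */
    MPI_Finalize();
    return 0;
}
```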
This paper studies the performance and energy consumption of several multi-core, multi-CPUs and many-core hardware platforms and software stacks for parallel programming. It uses the Multimedia Multiscale Parser (MMP), a computationally demanding image encoder application, which was ported to several hardware and software parallel environments as a benchmark. Hardware-wise, the study assesses NVIDIA's Jetson TK1 development board, the Raspberry Pi 2, and a dual Intel Xeon E5-2620/v2 server, as well as NVIDIA's discrete GPUs GTX 680, Titan Black Edition and GTX 750 Ti. The assessed parallel programming paradigms are OpenMP, Pthreads and CUDA, and a single-thread sequential version, all running in a Linux environment. While the CUDA-based implementation delivered the fastest execution, the Jetson TK1 proved to be the most energy efficient platform, regardless of the used parallel software stack. Although it has the lowest power demand, the Raspberry Pi 2 energy efficiency is hindered by its lengthy execution times, effectively consuming more energy than the Jetson TK1. Surprisingly, OpenMP delivered twice the performance of the Pthreads-based implementation, proving the maturity of the tools and libraries supporting OpenMP.
Multicore embedded systems are being widely used in telecommunication systems, robotics, medical applications and more. While they offer a high-performance, low-power solution, programming them efficiently is still a challenge. In order to exploit the capabilities that the hardware offers, software developers are expected to handle many of the low-level details of programming, including utilizing DMA, ensuring cache coherency, and inserting synchronization primitives explicitly. The state of the art involves solutions where the software toolchain is vendor-specific, tying the software to particular hardware and leaving no room for portability. In this paper we present a runtime system exploring the mapping of a high-level programming model, OpenMP, onto multicore embedded systems. A key feature of our scheme is that, unlike existing approaches that largely rely on POSIX threads, our approach leverages the Multicore Association (MCA) APIs as an OpenMP translation layer. The MCA APIs are a set of low-level APIs handling resource management, inter-process communication and task scheduling for multicore embedded systems. By deploying the MCA APIs, our runtime is able to capture the characteristics of multicore embedded systems more effectively than POSIX threads. Furthermore, the MCA layer enables our runtime implementation to be portable across architectures. Thus programmers need to maintain only a single OpenMP code base, compatible with various compilers, while the code remains portable across different types of platforms. We have evaluated our runtime system using several embedded benchmarks. The experiments demonstrate promising and competitive performance compared to the native approach for the platform.
In this paper we report our experiences in porting the FEASTFLOW software infrastructure to the Intel Xeon Phi coprocessor. Our efforts involved both the evaluation of programming models including OpenCL, POSIX threads and OpenMP and typical optimization strategies like parallelization and vectorization. Since the straightforward porting process of the already existing OpenCL version of the code encountered performance problems that require further analysis, we focused our efforts on the implementation and optimization of two core building block kernels for FEASTFLOW: an axpy vector operation and a sparse matrix-vector multiplication (spmv). Our experimental results on these building blocks indicate the Xeon Phi can serve as a promising accelerator for our software infrastructure.
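Of the two building blocks, axpy is the simpler; its OpenMP form, with a simd hint for the Phi's wide vector units, looks as below (a structural sketch, not the paper's tuned kernel; the spmv kernel has the same shape as the CSR example shown earlier in this list):

```c
/* y = a*x + y: omp parallel spreads chunks over the Phi's hardware
 * threads, and the simd clause asks the compiler to vectorise each
 * chunk for the 512-bit vector units. */
void axpy(long n, double a, const double *x, double *y)
{
    #pragma omp parallel for simd
    for (long i = 0; i < n; i++)
        y[i] += a * x[i];
}
```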
The computational efficiency of a finite element simulation program is strongly dependent on the algorithms and on the method used to solve the system of equations. This is particularly important in implicit programs, as is the case of the quasi-static implicit code DD3IMP, which is the subject of this study. This work describes the procedure adopted to identify the main computational bottlenecks, as well as the improvements made and the application of OpenMP directives to improve its computational efficiency. The different versions of the program were tested on a well-known example of plastic forming of a square cup, considering different discretizations of the blank. The analysis of the results in terms of computation time shows that adopting High Performance Computing (HPC) techniques, through the use of a direct method for solving the system of equations and of OpenMP directives, makes it possible to: (i) solve a problem in a shorter time than with the initial sequential program; (ii) solve a larger problem in the same time as a smaller problem solved with the sequential program and, consequently, (iii) obtain a more accurate solution in a given computational time; and (iv) achieve a speed-up close to the number of cores (in shared memory, and without considering the overhead associated with managing the parallelization threads).
In this work we present a highly efficient implementation of OpenMP tasks. It is based on a runtime infrastructure architected for data locality, a crucial prerequisite for exploiting the NUMA nature of modern multicore multiprocessors. In addition, we employ fast work-stealing structures, based on a novel, efficient and fair blocking algorithm. Synthetic benchmarks show up to a 6-fold increase in throughput (tasks completed per second), while for a task-based OpenMP application suite we measured up to 87% reduction in execution times, as compared to other OpenMP implementations.
- by Nikolaos D Kallimanis and +1
- Compilers, Scheduling, OpenMP
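The canonical shape of the task-parallel programs such a runtime schedules (standard OpenMP tasking, not code from the paper):

```c
#include <stdio.h>

/* Recursive Fibonacci: each recursive call becomes a task that any
 * idle thread may steal; taskwait joins the two children. */
long fib(long n)
{
    if (n < 2) return n;
    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait
    return a + b;
}

int main(void)
{
    long r;
    #pragma omp parallel
    #pragma omp single      /* one thread seeds the task pool */
    r = fib(30);
    printf("%ld\n", r);
    return 0;
}
```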
The emergence of multi-core processors has led to the expansion of parallel programming in all areas. OpenMP appears to be one of the most suitable APIs for new processor architectures. This choice is justified by its ease of use compared to other parallel programming alternatives. However, due to many factors, developing efficient OpenMP programs is a challenging task. In this work, we present a new performance model for OpenMP programs on multi-core machines. Experimental results obtained on a matrix-matrix product demonstrate the simplicity and accuracy of the performance predicted by the model.
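The validation kernel is stated to be a matrix-matrix product; in its naive OpenMP form it reads as below (a sketch; the paper's measured variant may differ in loop order or blocking):

```c
/* C = A*B for n x n row-major matrices; collapse(2) flattens the
 * (i, j) iteration space so all threads get work even for small n. */
void matmul(int n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}
```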
OpenMP has gained wide popularity as an API for parallel programming on shared memory and distributed shared memory platforms. It is also a promising candidate to exploit the emerging multi-core, multi-threaded processors. In addition, there is an increasing trend to port OpenMP to more specific architectures like General Purpose Graphic Processor Units (GPGPUs). However, these ccNUMA (cache coherent Non-Uniform Memory Access) architectures may present several hierarchical memory levels, which represent a serious performance issue for OpenMP applications. In this work, we present the initial results from our effort to quantify and model the impact of memory access heterogeneity on the performance of the applications. Using a simplified performance model, we show how to identify a "performance signature" for a given platform, which allows us to predict the performance of sample applications.
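A crude way to expose the heterogeneity being modeled: place an array on one node via first touch, then time streaming reads from every thread (a probe sketch, not the paper's methodology; concurrent reading adds contention, so treat the numbers as indicative only):

```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long n = 1L << 24;                      /* ~128 MB of doubles */
    double *a = malloc(n * sizeof *a);
    for (long i = 0; i < n; i++) a[i] = 1.0;   /* first touch: placed
                                                  near the initial thread */
    #pragma omp parallel
    {
        double t0 = omp_get_wtime(), s = 0.0;
        for (long i = 0; i < n; i++) s += a[i];
        double dt = omp_get_wtime() - t0;
        /* on ccNUMA, threads far from the placing node report lower
           bandwidth; printing s keeps the loop from being optimised out */
        printf("thread %d: %.2f GB/s (s=%g)\n", omp_get_thread_num(),
               n * sizeof(double) / dt / 1e9, s);
    }
    free(a);
    return 0;
}
```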
During the last decade, heterogeneous systems have been emerging for high performance computing [1]. In order to achieve high performance computing (HPC), existing technologies and programming models aim at rapid growth in intra-node parallelism [2]. Current high-computation systems and applications demand massive computational power. In the last few years, the graphics processing unit (GPU) has been introduced as an alternative to the conventional CPU for highly parallel computing applications, both for general purpose and for graphics processing. Rather than coding algorithms serially for a single CPU in the traditional way, many multithreading programming models have been introduced, such as CUDA, OpenMP, and MPI, to enable parallel processing on multiple cores. These parallel programming models support the data-driven multithreading (DDM) principle [3]. In this paper, we present a performance-based preliminary evaluation of these programming models and compare them with a conventional single-CPU serial processing system. We implemented a massive computational operation for performance evaluation, namely a complex matrix multiplication operation. We used a data-driven multithreaded HPC system for the evaluation and present the results with a comprehensive analysis of these parallel programming models for HPC parallelism.
Exascale computing refers to a computing system capable of at least one exaflop, expected within the next couple of years. Many new programming models, architectures and algorithms have been introduced to attain the objective of an exascale computing system; the primary objective is to enhance system performance. In modern supercomputers, the GPU is used to attain high computing performance, and the proposed technologies and programming models share much the same objective of making the GPU more powerful. These technologies still face a number of challenges, including parallelism, scale and complexity, among others, that must be addressed to make computing systems more powerful and efficient. In this paper, we present a testing tool architecture for a parallel programming approach using two programming models, CUDA and OpenMP. Both CUDA and OpenMP can be used to program shared memory and GPU cores. The object of this architecture is to identify static errors in a program that are introduced while writing the code and that cause an absence of parallelism. Our architecture encourages developers to write feasible code through which essential errors in the program can be avoided, so that it runs successfully.
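Two classic static mistakes that cause exactly the "absence of parallelism" and correctness errors such a tool targets (illustrative guesses at the error classes, not the paper's actual checks):

```c
/* 1. Orphaned worksharing: without an enclosing parallel region,
 *    "omp for" runs on a team of one thread, so the loop silently
 *    stays sequential. */
void bug1(int n, double *x)
{
    #pragma omp for            /* should be: #pragma omp parallel for */
    for (int i = 0; i < n; i++) x[i] *= 2.0;
}

/* 2. Data race: sum needs reduction(+:sum); otherwise all threads
 *    update the shared variable concurrently and the result is wrong. */
double bug2(int n, const double *x)
{
    double sum = 0.0;
    #pragma omp parallel for   /* missing reduction(+:sum) */
    for (int i = 0; i < n; i++) sum += x[i];
    return sum;
}
```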