Load balancing in a heterogeneous world: CPU-Xeon Phi co-execution of data-parallel kernels

Energy efficiency of load balancing for data-parallel applications in heterogeneous systems

The Journal of Supercomputing, 2016

The use of heterogeneous systems in supercomputing is on the rise as they improve both performance and energy efficiency. However, the programming of these machines requires considerable effort to get the best results in massively data-parallel applications. Maat is a library that enables OpenCL programmers to efficiently execute single data-parallel kernels using all the available devices on a heterogeneous system. It offers a set of load balancing methods, which perform the data partitioning and distribution among the devices, exploiting more of the performance of the system and consequently reducing execution time. Until now, however, a study of the implications of these on the energy consumption has not been made. Therefore, this paper analyses the energy efficiency of the different load balancing methods compared to a baseline system that uses just a single GPU. To evaluate the impact of the heterogeneity of the system, the GPUs were set to different frequencies. The obtained results show that in all the studied cases there is at least one load balancing method that improves energy efficiency.

Simplifying programming and load balancing of data parallel applications on heterogeneous systems

Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit, 2016

Heterogeneous architectures have developed rapidly thanks to their excellent cost/performance ratio and low power consumption. But heterogeneity significantly complicates both programming and efficient use of the resources. As a result, programmers have ended up using fixed roles for each kind of device: CPUs for sequential and management tasks and GPUs for parallel work. This is a waste of computing power. Maat is a library for OpenCL programmers that allows for the efficient execution of a single data-parallel kernel using all the available devices. It provides the programmer with an abstract view of the system to enable the management of heterogeneous environments regardless of the underlying architecture, and a set of load balancing methods, which perform data distribution. With Maat, programmers only need to develop a data-parallel kernel, select a load balancing method, and run it on the whole system. Experimental results show that Maat efficiently utilizes all the resources, independently of their number and nature. Provided the most appropriate method is selected, Maat is able to achieve a speedup of up to 1.97 using two GPUs with respect to a single GPU and even over 2 when the CPUs, which are much less performant, come into play. CCS Concepts: • Computer systems organization → Heterogeneous (hybrid) systems; • Software and its engineering → Parallel programming languages;

Auto-tuned OpenCL kernel co-execution in OmpSs for heterogeneous systems

Journal of Parallel and Distributed Computing, 2018

The emergence of heterogeneous systems has been very notable recently. Still, programming them is a complex task. The co-execution of a single OpenCL kernel on several devices is a challenging endeavour that requires considering the different computing capabilities of the devices and the behaviour of the application. OmpSs is a framework for task-based parallel applications that does not support co-execution between several devices. This paper presents an extension of OmpSs that solves two main issues. The first is the automatic distribution of datasets and the management of device memory address spaces. The second is the implementation of a set of load balancing algorithms that adapt to the particularities of applications and systems. All this is accomplished with negligible impact on the programming. Experimental results reveal that the use of all the devices in the system is beneficial in terms of performance and energy consumption. Also, the Auto-Tune algorithm gives the best overall results without requiring manual parameter tuning.

A load balance multi-scheduling model for OpenCL kernel tasks in an integrated cluster

Soft Computing, 2020

Nowadays, embedded systems are built on heterogeneous multi-core architectures, i.e., CPUs and GPUs. If an application is mapped to an appropriate processing core, these architectures provide many performance benefits. Typically, programmers map sequential applications to the CPU and parallel applications to the GPU. Task mapping becomes challenging because CPU- and GPU-based architectures are complex and keep evolving. This paper presents an approach to map OpenCL applications to a heterogeneous multi-core architecture by determining the application suitability and processing capability. The classification is achieved by developing a machine learning-based device suitability classifier that predicts which processor has the highest computational compatibility to run OpenCL applications. In this paper, 20 distinct features are proposed that are extracted by using the developed LLVM-based static analyzer. In order to select the best subset of features, fe...

Static multi-device load balancing for OpenCL

This paper presents the Load Balancing for OpenCL (lbcl) library, designed to automatically solve load balancing issues in both multi-platform and heterogeneous environments. Using this library, a single kernel can be executed on a set of heterogeneous devices, giving each device an amount of work proportional to its computing power. A wrapper has been developed so that the library can balance the workload of an existing application without any changes to its source code and without any recompilation stage. A general OpenCL profiler has also been developed to simplify detailed profiling of the obtained results.
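
The proportional split described above can be illustrated with a minimal sketch. This is not the lbcl API: the function name `static_partition`, the `powers` weights, and the `granularity` parameter (standing in for a work-group size) are all hypothetical, and the sketch assumes the relative computing power of each device is already known.

```python
def static_partition(total_items, powers, granularity=1):
    """Split `total_items` work-items into contiguous (offset, size) ranges,
    one per device, sized in proportion to the hypothetical `powers` weights.

    Every size except the last is rounded down to a multiple of
    `granularity`; the last device absorbs whatever remains.
    """
    if not powers or any(p <= 0 for p in powers):
        raise ValueError("powers must be non-empty and strictly positive")
    total_power = sum(powers)
    ranges, offset = [], 0
    for i, power in enumerate(powers):
        if i == len(powers) - 1:
            size = total_items - offset  # remainder goes to the last device
        else:
            share = int(total_items * power / total_power)
            size = share // granularity * granularity
            size = min(size, total_items - offset)  # never overshoot
        ranges.append((offset, size))
        offset += size
    return ranges


if __name__ == "__main__":
    # Hypothetical system: one device three times as powerful as the other.
    print(static_partition(1000, [3, 1], granularity=64))
```

A static split like this has no runtime overhead, but it is only as good as the power estimates it is given; irregular kernels usually motivate the dynamic methods mentioned in the other abstracts.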

HetroOMP: OpenMP for Hybrid Load Balancing Across Heterogeneous Processors

2019

The OpenMP accelerator model enables an efficient method of offloading computation from host CPU cores to accelerator devices. However, it leaves it up to the programmer to utilize the CPU cores while computation is offloaded to an accelerator. In this paper, we propose HetroOMP, an extension of the OpenMP accelerator model that supports a new clause, hetro, which enables computation to execute simultaneously across both host and accelerator devices using standard tasking and work-sharing pragmas.

Parallelizing Workload Execution in Embedded and High-Performance Heterogeneous Systems

2018

In this paper, we introduce a software-defined framework that enables the parallel utilization of all the programmable processing resources available in a heterogeneous system-on-chip (SoC), including FPGA-based hardware accelerators and programmable CPUs. Two platforms with different architectures are considered, and a single C/C++ source code is used in both of them for the CPU and FPGA resources. Instead of simply using the hardware accelerator to offload a task from the CPU, we propose a scheduler that dynamically distributes the tasks among all the resources to fully exploit all computing devices while minimizing load imbalance. The multi-architecture study compares an ARMv7 and an ARMv8 implementation with different numbers and types of CPU cores and also different FPGA micro-architectures and sizes. We measure that both platforms benefit from having the CPU cores assist FPGA execution at the same level of energy requirements.
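
The abstract does not describe the scheduler's internals, so the following is only a generic sketch of dynamic self-scheduling, the general technique of handing fixed-size chunks to whichever resource becomes idle first. The function name, the device names, and the throughput figures are hypothetical, and the execution is simulated rather than run on real hardware.

```python
import heapq


def simulate_dynamic(total_items, chunk, throughput):
    """Simulate dynamic self-scheduling of `total_items` work-items.

    `throughput` maps a (hypothetical) device name to items per second.
    Returns (makespan_in_seconds, items_done_per_device).
    """
    # Min-heap of (time_when_idle, name); every device is idle at t = 0.
    idle = [(0.0, name) for name in sorted(throughput)]
    heapq.heapify(idle)
    done = {name: 0 for name in throughput}
    remaining = total_items
    makespan = 0.0
    while remaining > 0:
        now, name = heapq.heappop(idle)    # earliest idle device
        size = min(chunk, remaining)       # last chunk may be smaller
        remaining -= size
        finish = now + size / throughput[name]
        done[name] += size
        makespan = max(makespan, finish)
        heapq.heappush(idle, (finish, name))
    return makespan, done


if __name__ == "__main__":
    # Hypothetical figures: the accelerator is four times faster than the CPU.
    print(simulate_dynamic(1000, 100, {"cpu": 100.0, "fpga": 400.0}))
```

In this simulation the faster device naturally ends up with more chunks without any prior knowledge of device speeds, which is the main appeal of dynamic distribution; the chunk size trades scheduling overhead against imbalance at the end of the run.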

Improving CPU Performance For Heterogeneous Computing Using OpenCL

International journal of engineering research and technology, 2013

With the rapid development of smartphones, tablets, and pads, there has been widespread adoption of Graphics Processing Units (GPUs). The hand-held market is now seeing an ever-increasing rate of development of computationally intensive applications, which require significant amounts of processing resources. To meet this challenge, GPUs can be used for general-purpose processing. We are moving towards a future where devices will be more connected and integrated. This will allow applications to run on handheld devices, while offloading computationally intensive tasks to other available compute units. There is a growing need for a general programming framework which can utilize heterogeneous processing units such as GPUs and DSPs. OpenCL, a widely used programming framework, has been a step towards integrating these different processing units on desktop platforms. This paper presents a thorough study of the factors that impact the performance of OpenCL on CPUs. Specifically, starting from the two main arc...

EngineCL: Usability and Performance in Heterogeneous Computing

Future Generation Computer Systems

Heterogeneous systems have become one of the most common architectures today, thanks to their excellent performance and energy consumption. However, their heterogeneity makes them very complex to program, and achieving performance portability across different devices is even harder. This paper presents EngineCL, a new OpenCL-based runtime system that greatly simplifies the co-execution of a single massive data-parallel kernel on all the devices of a heterogeneous system. It performs a set of low-level tasks regarding the management of devices, their disjoint memory spaces, and the scheduling of the workload between the system devices, while providing a layered API. EngineCL has been validated on two compute nodes (an HPC and a commodity system) that combine six devices with different architectures. Experimental results show that it has excellent usability compared with OpenCL; a maximum overhead of 2.8% compared to the native version under loads of less than a second of execution, with a tendency towards zero for longer execution times; and an average efficiency of up to 0.89 when balancing the load.
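
The abstract reports an efficiency figure without defining the metric here. As a rough illustration, one definition sometimes used for co-execution (an assumption, and possibly not the paper's exact metric) divides the measured speedup over the fastest single device by the maximum speedup achievable if every device contributed in proportion to its standalone speed. The function name and all timings below are hypothetical.

```python
def balancing_efficiency(single_device_times, coexec_time):
    """Ratio of measured speedup to maximum achievable speedup.

    single_device_times: time for each device to run the whole kernel alone.
    coexec_time: measured time when all devices co-execute the kernel.
    Assumed definition: S_real / S_max, both relative to the fastest device.
    """
    fastest = min(single_device_times)
    s_real = fastest / coexec_time
    # With perfect balance the device throughputs add up, so
    # S_max = sum over devices of (T_fastest / T_i).
    s_max = sum(fastest / t for t in single_device_times)
    return s_real / s_max


if __name__ == "__main__":
    # Hypothetical timings in seconds: a fast GPU and a CPU four times slower.
    print(balancing_efficiency([10.0, 40.0], coexec_time=9.0))
```

Under this assumed definition a value of 1.0 means the co-execution finished as fast as perfectly balanced devices could, and anything lower reflects imbalance plus runtime overhead.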

Static Mapping for Opencl Workloads in Heterogeneous Computer Systems

2018

Today, heterogeneous computer systems (HCS) commonly rely on CPUs and GPUs as processing elements and on OpenCL as the programming framework. In an HCS, a workload should execute on its best processor to achieve its best speedup. OpenCL currently leaves the selection of the best-fit processor, termed workload mapping, entirely to programmers. However, the NP-completeness of the workload mapping task means that it is not trivial for programmers to do manually, so effective computational approaches are necessary. This research proposes a static mapping method for OpenCL workloads that automatically selects the best-fit processor for each workload. The method accepts static features of a workload and uses the K-Nearest Neighbor algorithm to classify the workload as either CPU or GPU. The static features are collected using the LLVM/Clang compiler framework. To increase the accuracy of classification while maintaining the physical meaning of features, the features are reduce...
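
The classification step can be illustrated with a minimal k-nearest-neighbour sketch. The two features used here (arithmetic operations per memory access and branch fraction) and every training point are hypothetical; they are not the feature set or data of the paper, and no feature reduction is performed.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    train: list of (feature_vector, label) pairs; query: feature vector.
    Distance is plain Euclidean distance on unscaled features.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical static features per kernel:
# (arithmetic operations per memory access, fraction of branch instructions)
TRAIN = [
    ((8.0, 0.05), "GPU"), ((6.5, 0.10), "GPU"), ((7.2, 0.08), "GPU"),
    ((1.0, 0.40), "CPU"), ((1.5, 0.35), "CPU"), ((0.8, 0.50), "CPU"),
]

if __name__ == "__main__":
    print(knn_classify(TRAIN, (7.0, 0.10)))
    print(knn_classify(TRAIN, (1.2, 0.45)))
```

In practice the features would be normalized before computing distances, since unscaled features with large ranges otherwise dominate the vote.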