An Adaptive Task-Core Ratio Load Balancing Strategy for Multi-core Processors

Task Migration for Dynamic Power and Performance Characteristics on Many-Core Distributed Operating Systems

Spatial locality of task execution will become more important on future hardware platforms since the number of cores is steadily increasing. The large number of cores requires more intelligent power management due to the notion of spatial locality, and the high chip density requires increased thermal awareness in order to avoid thermal hotspots on the chip. At the same time, high CPU performance is only achieved by parallelizing tasks over the chip in order to fully utilize the hardware. This paper presents a task migration mechanism for distributed operating systems running on many-core platforms. In this work, we evaluate the performance and energy efficiency of an implemented task migration mechanism. This is shown by parallelizing tasks when the performance of a single core is not sufficient, and by collecting tasks onto as few cores as possible when CPU load is low. The task migration mechanism is implemented as a library for FreeRTOS using 1300 lines of code, and introduces a total task migration overhead of 100 ms on a shared memory platform. With the presented task migration mechanism, we intend to improve the dynamism of power and performance characteristics in distributed many-core operating systems.
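The spread-or-consolidate behaviour described in this abstract can be illustrated with a minimal sketch. It is not the paper's FreeRTOS library: the function name `target_active_cores`, the per-core capacity of 1.0, and the 0.8 utilization target are assumptions chosen only for illustration.

```python
import math


def target_active_cores(task_loads, num_cores, per_core_capacity=1.0, target_util=0.8):
    """Return how many cores should stay active for the given task loads.

    Hypothetical policy: use the fewest cores that keep each active core
    at or below `target_util` of its capacity, never exceeding `num_cores`.
    """
    total = sum(task_loads)
    if total <= 0:
        return 1  # assumption: one core always stays awake
    needed = math.ceil(total / (per_core_capacity * target_util))
    return max(1, min(num_cores, needed))


# Low load: tasks can be collected onto a single core.
print(target_active_cores([0.1, 0.1, 0.1], num_cores=4))
# High load: tasks are spread over every available core.
print(target_active_cores([0.9, 0.9, 0.9, 0.9], num_cores=4))
```

A real mechanism would also have to weigh the migration overhead (100 ms in the abstract above) against the expected gain before moving a task.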

Evaluating the impact of task migration in multi-processor systems-on-chip

2010

This paper presents a Multi-Processor System-on-Chip platform which is capable of load balancing at run-time. The system is purely distributed in the sense that each processor is capable of making decisions on its own, without relying on any central unit. All the management is ensured by a very tiny preemptive RTOS (run-time operating system) running on every processor, which is mainly responsible for running and distributing tasks among the processing elements (PEs). The goal of such a strategy is to improve the performance of the system while ensuring scalability of the design. In order to validate the concepts, we have conducted some experiments with a widely used multimedia application: the MJPEG (Motion JPEG) decoder. The results obtained show that the overhead caused by the task migration mechanism is amortized by the gain in terms of performance.

Adaptive and Aggressive Low Power Load Balancing for Multicore Systems

Modern embedded systems commonly adopt a multicore processor to provide high performance. In a multi-core processor, the operating system scheduler is in charge of migrating tasks across the multiple cores to maximize the resource utilization while minimizing the power consumption. In order to reduce power consumption further, the operating system scheduler often provides a power-saving mode which can be triggered according to some predetermined threshold value. However, it turns out that the statically fixed threshold is set to a low value which hardly triggers the power-saving mode. In this paper, we propose an adaptive and aggressive low power load balancer which makes cores enter a power-saving mode more frequently based on system condition. The proposed load balancer adjusts the threshold value for the power-saving mode taking some load balancing overhead into account. Experimental results show that the proposed load balancer enters the power-saving mode frequently to reduce power consumption by 6.81% compared to a conventional load balancer.

Operating system scheduling on heterogeneous core systems

Proc. Workshop on Op. …, 2007

We make a case that a thread scheduler for heterogeneous multicore systems should target three objectives: optimal performance, core assignment balance and response time fairness. Performance optimization via optimal thread-to-core assignment has been explored in the past; in this paper we demonstrate the need for balanced core assignment. We show that unbalanced core assignment results in completion time jitter and inconsistent priority enforcement; we then present a simple fix to the Linux scheduler that eliminates these problems. The second part of the paper addresses the problem of building the HMC scheduler that balances all three objectives. This is a difficult optimization problem. We introduce a definition of this scheduling problem in terms of these three objectives and present a blueprint for a self-tuning algorithm based on reinforcement learning that maximizes a performance function that is an arbitrary weighted sum of these three objectives. Implementing and evaluating this algorithm is the subject of our future work.
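The weighted-sum performance function mentioned in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the three scores are assumed to be normalized to [0, 1], and the names `weighted_objective` and `best_candidate` and the default weights are hypothetical. The reinforcement-learning part of the blueprint is not shown.

```python
def weighted_objective(performance, balance, fairness, weights=(0.5, 0.25, 0.25)):
    """Combine three normalized scores into one value via a weighted sum."""
    w_perf, w_bal, w_fair = weights
    return w_perf * performance + w_bal * balance + w_fair * fairness


def best_candidate(candidates, weights=(0.5, 0.25, 0.25)):
    """Return the label of the candidate with the highest weighted objective.

    `candidates` maps a label to a (performance, balance, fairness) tuple.
    """
    return max(
        candidates,
        key=lambda name: weighted_objective(*candidates[name], weights=weights),
    )


candidates = {
    "fast_but_unbalanced": (1.0, 0.0, 0.0),
    "balanced": (0.5, 1.0, 1.0),
}
print(best_candidate(candidates))                           # default weights
print(best_candidate(candidates, weights=(1.0, 0.0, 0.0)))  # performance only
```

Changing the weights changes which assignment wins, which is why the abstract calls the weighting arbitrary.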

Scaling software on multi-core through co-scheduling of related tasks

Ever-increasing demand for more processing power, coupled with problems in designing higher-frequency chips, is forcing CPU vendors to take the multi-core route. IBM® introduced the first multi-core processor with its POWER4® in 2001, which had two cores in a chip and also 4 chips in a package. Other CPU vendors have followed the trend, with dual and quad-core processors becoming increasingly common. It is estimated that by year 2021, there will be chips with 1024 cores on them. Such platforms pose a huge challenge for how software can effectively utilize so many cores. One problem of interest is how tasks are scheduled on such platforms. The existing Linux scheduler attempts to distribute tasks equally among all CPU chips. It does not optimize this task placement by taking into consideration that tasks need not all be equal with respect to their use of shared CPU resources (like L2 cache). In this paper, we look at how misplacement of tasks across CPU chips can significantly affect performance and how the existing Linux interface for solving that problem is inflexible. We present a new interface which can be used by applications to hint which threads share data closely and thus should be co-scheduled on neighbouring CPUs to the extent possible by the OS scheduler. We present several results showing the inflexibility of the existing interface and how the suggested interface solves those problems.

E-OSched: a load balancing scheduler for heterogeneous multicores

The Journal of Supercomputing, 2018

The contemporary multicore era has embraced heterogeneous computing devices as one of the proficient platforms for executing compute-intensive applications. These heterogeneous devices are based on CPUs and GPUs. OpenCL is deemed one of the industry standards for programming heterogeneous machines. Conventional application scheduling mechanisms allocate most applications to GPUs while leaving the CPU device underutilized. This underutilization of slower devices (such as the CPU) often causes sub-optimal performance of data-parallel applications in terms of load balance, execution time, and throughput. Moreover, multiple scheduled applications on a heterogeneous system further aggravate the problem of performance inefficiency. This paper is an attempt to address the aforementioned deficiencies by introducing a novel scheduling strategy named OSched. An enhancement to OSched named E-OSched is also part of this study. OSched performs resource-aware assignment of jobs to both CPUs and GPUs while ensuring a balanced load. The load balancing is achieved by considering the computational requirements of jobs and the computing potential of a device. The load-balanced execution is beneficial in terms of lower execution time, higher throughput, and improved utilization. E-OSched reduces the magnitude of main memory contention during the concurrent job execution phase. The mathematical model of the proposed algorithms is evaluated by comparing simulation results with different state-of-the-art scheduling heuristics. The results revealed that the proposed E-OSched performed significantly better than the state-of-the-art scheduling heuristics, obtaining up to 8.09% improved execution time and up to 7.07% better throughput.
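The idea of matching the computational requirements of jobs to the computing potential of a device can be illustrated with a simple greedy sketch. This is not the OSched or E-OSched algorithm, whose details are not given in the abstract: the function `assign_jobs`, the abstract "work units", and the earliest-finish-time rule are all assumptions made for illustration.

```python
def assign_jobs(job_costs, device_speeds):
    """Greedily place each job (largest first) on the device that finishes it earliest.

    job_costs: dict mapping job name -> work units (hypothetical measure).
    device_speeds: dict mapping device name -> work units per second.
    Returns (assignment, finish_times).
    """
    finish = {device: 0.0 for device in device_speeds}
    assignment = {}
    for job in sorted(job_costs, key=job_costs.get, reverse=True):
        # Estimated completion time of this job on each device.
        best = min(finish, key=lambda d: finish[d] + job_costs[job] / device_speeds[d])
        finish[best] += job_costs[job] / device_speeds[best]
        assignment[job] = best
    return assignment, finish


jobs = {"j1": 8, "j2": 4, "j3": 2}
devices = {"cpu": 1.0, "gpu": 4.0}  # assumed: the GPU is 4x faster than the CPU
assignment, finish = assign_jobs(jobs, devices)
print(assignment)
print(finish)
```

In this example the slower CPU still receives the smallest job instead of sitting idle, which is the underutilization problem the abstract describes.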

Performance-Aware Task Management and Frequency Scaling in Embedded Systems

2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing, 2014

Due to the dissemination of smartphones and tablets, a constant complexity growth can be observed for both embedded systems and mobile applications. However, this results in an increase in energy consumption. To guarantee longer battery life cycles, it is fundamental to develop system-level strategies that guarantee the applications' required quality of service by managing the available system resources. In this paper, a new task management framework is proposed that controls, in real time, the execution of multi-threaded applications in order to meet their performance targets. For this, we amend the Linux CFS scheduler decisions to efficiently control the shared resource utilization of parallel applications. The proposed framework relies on runtime performance modeling of both the underlying architecture and the running applications to scale the system resource allocation and frequency. As a result, efficient application execution is achieved not only in terms of performance, but also in energy consumption. Experimental results show that the proposed approach satisfies the applications' required performance level by decreasing the relative performance error from 2.801 to 0.168, while achieving 49% energy savings.

MLB: A Memory-aware Load Balancing for Mitigating Memory Contention

Most current CPUs have not a single core but multiple cores integrated in the Symmetric MultiProcessing (SMP) architecture, which share resources such as the Last Level Cache (LLC) and the Integrated Memory Controller (IMC). On SMP platforms, contention for these resources may lead to huge performance degradation. To mitigate the contention, various methods have been developed and used; most of these methods focus on finding which tasks share the same resource, assuming that a task is the sole owner of a CPU core. However, less attention has been paid to methods that consider the multitasking case. To mitigate contention for memory subsystems, we have devised a new load balancing method, Memory-aware Load Balancing (MLB). MLB dynamically recognizes contention by using simple contention models and performs inter-core task migration. We have evaluated MLB on an Intel i7-2600 and a Xeon E5-2690, and found that our approach can be effectively taken in an adaptive manner, leading to noticeable performance improvements of memory-intensive tasks on the different CPU platforms. We have also compared MLB with the state-of-the-art method in the multitasking case, Vector Balancing & Sorted Co-scheduling (VBSC), and found that MLB can lead to performance improvements compared with VBSC without modification of the timeslice mechanism and is more effective in allowing I/O-bound applications to be performed. In contrast to VBSC, it can also effectively handle the worst case, where many memory-intensive tasks are co-scheduled when non-memory-intensive ones terminate. In addition, MLB can achieve performance improvements in CPU-GPU communication in discrete GPU systems.

Operating system support to an online hardware-software co-design scheduler for heterogeneous multicore architectures

2014 IEEE 20th International Conference on Embedded and Real-Time Computing Systems and Applications, 2014

This paper aims at designing and implementing a scheduler model for heterogeneous multiprocessor architectures based on software and hardware. As a proof of concept, the scheduler model was applied to the Linux operating system running on the SPARC Leon3 processor. To this end, performance monitors have been implemented within the processors, which identify the demands of processes in real time. For each process, its demand is projected onto the other processors in the architecture, and balancing is then performed to maximize the total system performance by distributing processes among processors. The Hungarian maximization algorithm used in the balancing scheduler was developed in hardware, which provides greater parallelism and performance in the execution of the algorithm. The scheduler has been validated through the parallel execution of several benchmarks, resulting in decreased execution times compared to the scheduler without the heterogeneity support.
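The balancing step in this abstract is an assignment problem: choose one processor per process so that the total projected performance is maximal. The sketch below finds that optimum by brute force over all permutations. It is a stand-in, not the Hungarian algorithm and not the hardware implementation from the paper, and it is only practical for very small matrices. The function name `best_assignment` and the example matrix are hypothetical.

```python
from itertools import permutations


def best_assignment(perf):
    """Return (total, mapping) maximizing total projected performance.

    perf[i][j] is the projected performance of process i on processor j
    (square matrix). mapping[i] is the processor chosen for process i.
    Brute force: O(n!) time, so suitable for illustration only.
    """
    n = len(perf)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(perf[i][perm[i]] for i in range(n))
        if best_total is None or total > best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


# Hypothetical projected-performance matrix for 3 processes on 3 processors.
perf = [
    [9, 2, 7],
    [6, 4, 3],
    [5, 8, 1],
]
total, mapping = best_assignment(perf)
print(total, mapping)
```

The Hungarian algorithm reaches the same optimum in polynomial time, which is what makes it a candidate for the hardware acceleration the abstract describes.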

An infrastructure for embedded systems using task scheduling

Microprocessors and Microsystems

Task scheduling in heterogeneous environments such as cloud data centers is considered to be an NP-complete problem. Efficient task scheduling balances the load on the virtual machines (VMs), thereby achieving effective resource utilization. Hence there is a need for a new scheduling framework that performs load balancing while considering multiple quality of service (QoS) metrics such as makespan, response time, execution time, and task priority. It is difficult for a multi-core Web server to achieve dynamic balance during remote dynamic request scheduling, so the traditional scheduling algorithm needs to be improved to enhance its practical effect. This article studies the multi-core Web server, focusing on its queuing model. On this basis, the author identifies the drawbacks of the multi-core Web server's remote dynamic request scheduling algorithm and improves the traditional algorithm through demand analysis. The improved algorithm not only overcomes the drawbacks of traditional algorithms, but also keeps the system threads carrying the same amount of tasks and keeps the server in dynamic balance. On this basis, it achieves an effective solution to customer requests.