Hierarchical work stealing on manycore clusters
Related papers
Scalable work stealing
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09, 2009
Irregular and dynamic parallel applications pose significant challenges to achieving scalable performance on large-scale multicore clusters. These applications often require ongoing, dynamic load balancing to maintain efficiency. Scalable dynamic load balancing on large clusters is a challenging problem that can be addressed with distributed dynamic load balancing systems. Work stealing is a popular approach to distributed dynamic load balancing; however, its performance on large-scale clusters is not well understood. Prior work on work stealing has largely focused on shared memory machines. In this work we investigate the design and scalability of work stealing on modern distributed memory systems. We demonstrate high efficiency and low overhead when scaling to 8,192 processors for three benchmark codes: a producer-consumer benchmark, the unbalanced tree search benchmark, and a multiresolution analysis kernel.
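The core of any such scheduler is the steal-from-a-random-victim loop. Below is a minimal shared-memory sketch of that loop; the paper itself targets distributed memory with one-sided communication, so the Worker, Task, and acquire names here are illustrative assumptions, not its API.

    // Minimal work-stealing sketch (C++17): the owner works on the back of
    // its deque, thieves take from the front; an idle worker probes random
    // victims. A mutex stands in for the lock-free deques real runtimes use.
    #include <deque>
    #include <mutex>
    #include <optional>
    #include <random>
    #include <vector>

    struct Task { int payload; };

    struct Worker {
        std::deque<Task> deque;   // owner pushes/pops at the back
        std::mutex lock;          // thieves take from the front

        void push(Task t) { std::lock_guard<std::mutex> g(lock); deque.push_back(t); }

        std::optional<Task> pop() {                  // owner side
            std::lock_guard<std::mutex> g(lock);
            if (deque.empty()) return std::nullopt;
            Task t = deque.back(); deque.pop_back(); return t;
        }

        std::optional<Task> steal() {                // thief side
            std::lock_guard<std::mutex> g(lock);
            if (deque.empty()) return std::nullopt;
            Task t = deque.front(); deque.pop_front(); return t;
        }
    };

    // When the local deque runs dry, probe random victims until a steal succeeds.
    std::optional<Task> acquire(Worker& self, std::vector<Worker*>& all, std::mt19937& rng) {
        if (auto t = self.pop()) return t;
        std::uniform_int_distribution<size_t> pick(0, all.size() - 1);
        for (int attempt = 0; attempt < 32; ++attempt) {
            Worker* victim = all[pick(rng)];
            if (victim == &self) continue;
            if (auto t = victim->steal()) return t;
        }
        return std::nullopt;  // likely quiescence; fall back to termination detection
    }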
On the Merits of Distributed Work-Stealing on Selective Locality-Aware Tasks
2013 42nd International Conference on Parallel Processing, 2013
Improving the performance of work-stealing load-balancing algorithms in distributed shared-memory systems is challenging. These algorithms need to overcome the high costs of contention among workers, of communication and remote data references between nodes, and of their impact on the locality preferences of tasks. Prior research focuses on stealing from the victim that best exploits data locality, and on using special deques that minimize contention between local and remote workers.
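As a rough illustration of locality-aware victim selection, the heuristic below prefers same-node victims and longer queues, falling back to a random remote victim so load still spreads globally. The VictimInfo fields and scoring weights are assumptions for exposition, not the paper's design.

    // Locality-aware victim scoring sketch: same-node victims get a large
    // bonus; among equals, longer (possibly stale) queues win.
    #include <cstdlib>
    #include <vector>

    struct VictimInfo {
        int node;        // NUMA/cluster node hosting this worker
        int queued;      // approximate queue length (may be stale)
    };

    int pick_victim(int my_node, const std::vector<VictimInfo>& v) {
        int best = -1, best_score = -1;
        for (int i = 0; i < (int)v.size(); ++i) {
            if (v[i].queued == 0) continue;
            int score = v[i].queued + (v[i].node == my_node ? 1000 : 0);
            if (score > best_score) { best_score = score; best = i; }
        }
        if (best < 0) return std::rand() % (int)v.size();  // nothing visible: guess
        return best;
    }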
Bandwidth and Locality Aware Task-Stealing for Manycore Architectures with Bandwidth-Asymmetric Memory
ACM Transactions on Architecture and Code Optimization
Parallel computers are beginning to adopt bandwidth-asymmetric memory architectures that combine traditional DRAM with new High Bandwidth Memory (HBM) for higher memory bandwidth. However, existing task schedulers suffer from low bandwidth usage and poor data locality on bandwidth-asymmetric memory architectures. To solve these two problems, we propose a Bandwidth and Locality Aware Task-stealing (BATS) system, which consists of an HBM-aware data allocator, a bandwidth-aware traffic balancer, and a hierarchical task-stealing scheduler. Leveraging compile-time code transformation and run-time data distribution, the data allocator enables HBM usage automatically without user intervention. According to data access hotness, the traffic balancer migrates data to balance memory traffic across memory nodes in proportion to their bandwidth. The hierarchical scheduler improves data locality at runtime without a priori program knowledge. Experiments on an Intel Knights Landing server with bandwidth-asymmetric memory show that BATS reduces the execution time of memory-bound programs by up to 83.5% compared with traditional task-stealing schedulers.
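The hierarchical part of such a scheduler can be pictured as a two-pass steal: exhaust victims on the thief's own memory node before crossing nodes. The sketch below shows that generic pattern, not BATS's actual code; steal_from abstracts the per-deque steal operation.

    // Two-level hierarchical steal sketch: same-node victims first (cheap,
    // locality-preserving), then remote nodes when the local tier is empty.
    #include <functional>
    #include <optional>
    #include <vector>

    struct Task { int id; };

    std::optional<Task> hierarchical_steal(
            int self, const std::vector<int>& node_of,
            const std::function<std::optional<Task>(int)>& steal_from) {
        // Pass 1: victims sharing our memory node.
        for (int w = 0; w < (int)node_of.size(); ++w)
            if (w != self && node_of[w] == node_of[self])
                if (auto t = steal_from(w)) return t;
        // Pass 2: remote nodes; load balance beats locality when starving.
        for (int w = 0; w < (int)node_of.size(); ++w)
            if (w != self && node_of[w] != node_of[self])
                if (auto t = steal_from(w)) return t;
        return std::nullopt;
    }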
Effective Task Binding in Work Stealing Runtimes for NUMA Multi-Core Processors
2017
Modern server processors in high performance computing integrate multiple on-chip memory controllers and are NUMA in nature. Many user-level runtime systems such as OpenMP, Cilk, and TBB provide a task construct for programming multicore processors. A task body may access both task-local data and shared data. Most of the shared data is scattered across virtual memory pages, and these pages may be mapped to the memory banks of different sockets by Linux's first-touch policy. The user-level runtime must therefore ensure that tasks run as close as possible to the memory banks holding their shared data, while also balancing the load among the cores, which many runtimes do via work stealing. Often there is a tradeoff between these two requirements: object locality and load balancing. In this paper, we address th...
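On Linux, a runtime can discover where first-touch placed a page and bind the executing thread accordingly. Below is a minimal libnuma-based sketch (link with -lnuma); home_node_of and run_near_data are hypothetical helper names, and error handling is trimmed.

    // Query the NUMA node backing a task's shared data, then run the task
    // on that node. move_pages with a null target array only reports
    // placement; it moves nothing.
    #include <numa.h>
    #include <numaif.h>

    int home_node_of(void* addr) {
        void* pages[1] = { addr };
        int status[1] = { -1 };
        if (move_pages(0, 1, pages, nullptr, status, 0) != 0) return -1;
        return status[0];
    }

    template <typename Fn>
    void run_near_data(void* shared_data, Fn task_body) {
        int node = home_node_of(shared_data);
        if (node >= 0) numa_run_on_node(node);  // bind this thread to the node
        task_body();                            // accesses are now local
    }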
Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages
ACM Transactions on Architecture and Code Optimization, 2014
We present a joint scheduling and memory allocation algorithm for efficient execution of task-parallel programs on non-uniform memory architecture (NUMA) systems. Task and data placement decisions are based on a static description of the memory hierarchy and on runtime information about intertask communication. Existing locality-aware scheduling strategies for fine-grained tasks have strong limitations: they are specific to some class of machines or applications, they do not handle task dependences, they require manual program annotations, or they rely on fragile profiling schemes. By contrast, our solution makes no assumption on the structure of programs or on the layout of data in memory. Experimental results, based on the OpenStream language, show that locality of accesses to main memory of scientific applications can be increased significantly on a 64-core machine, resulting in a speedup of up to 1.63× compared to a state-of-the-art work-stealing scheduler.
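One simple placement rule consistent with this approach routes a ready task to the NUMA node holding the plurality of its input bytes, using the dependence information the runtime tracks anyway. A sketch with hypothetical names, not OpenStream's actual algorithm:

    // Pick the node owning the most input bytes; ties go to the lowest id.
    #include <cstddef>
    #include <vector>

    struct Input { int home_node; std::size_t bytes; };

    int preferred_node(const std::vector<Input>& inputs, int num_nodes) {
        std::vector<std::size_t> weight(num_nodes, 0);
        for (const Input& in : inputs)
            if (in.home_node >= 0 && in.home_node < num_nodes)
                weight[in.home_node] += in.bytes;
        int best = 0;
        for (int n = 1; n < num_nodes; ++n)
            if (weight[n] > weight[best]) best = n;
        return best;
    }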
Improving data locality for irregular partitioned global address space parallel programs
Proceedings of the 50th Annual Southeast Regional Conference on - ACM-SE '12, 2012
This paper describes a technique for improving the data reference locality of parallel programs using the Partitioned Global Address Space (PGAS) model of computation. One of the principal challenges in writing PGAS parallel applications is maximizing communication efficiency. This work describes an online technique based on run-time data reference profiling that organizes fine-grained data elements into locality-aware blocks suitable for coarse-grained communication. The technique is applicable to parallel applications with large, irregular, pointer-based data structures. The described system can perform automatic data relayout using the locality-aware mapping, either incrementally in iterative (timestep-based) applications or as a collective relayout operation. An empirical evaluation shows that the technique increases data reference locality and improves performance by 10-17% on the SPLASH-2 Barnes-Hut tree benchmark.
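The profiling step can be pictured as building an affinity graph of co-accessed elements, which a later pass partitions into communication blocks. The sketch below shows only the counting side, as a generic heuristic rather than the paper's exact algorithm; Profiler and record are illustrative names.

    // Count how often pairs of remote elements are dereferenced together;
    // a blocking pass would later co-locate high-affinity pairs.
    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <utility>
    #include <vector>

    using Elem = int;  // global element id

    struct Profiler {
        std::map<std::pair<Elem, Elem>, std::size_t> affinity;

        // Call once per task/timestep with the elements it touched.
        void record(const std::vector<Elem>& touched) {
            for (std::size_t i = 0; i < touched.size(); ++i)
                for (std::size_t j = i + 1; j < touched.size(); ++j)
                    ++affinity[{std::min(touched[i], touched[j]),
                                std::max(touched[i], touched[j])}];
        }
    };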
Scalable Task Parallelism for NUMA
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation - PACT '16, 2016
Dynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing and locality. These promises, however, are undermined by non-uniform memory access (NUMA). We show that using NUMA-aware task and data placement, it is possible to preserve the uniform hardware abstraction of contemporary task-parallel programming models for both computing and memory resources with high data locality. Our data placement scheme guarantees that all accesses to task output data target the local memory of the accessing core. The complementary task placement heuristic improves the locality of accesses to task input data on a best-effort basis. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability by eliminating false dependences and enabling fine-grained dynamic control over data placement. The algorithms are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences readily available in the runtime system, together with placement information from the operating system. On a 192-core system with 24 NUMA nodes, our optimizations achieve above 94% locality (fraction of local memory accesses), up to 5× better performance than NUMA-aware hierarchical work stealing, and up to 5.6× better than static interleaved allocation. Finally, we show that state-of-the-art dynamic page migration by the operating system cannot keep up with frequent affinity changes between cores and data, and thus fails to accelerate task-parallel applications.
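In its simplest form, the output-locality guarantee reduces to allocating a task's output buffer on the node of the core that will write it, so every output access is local by construction. A minimal Linux/libnuma sketch (link with -lnuma), not OpenStream's implementation:

    // Allocate output storage on the executing core's NUMA node.
    #include <numa.h>
    #include <sched.h>
    #include <cstddef>

    void* alloc_output_local(std::size_t bytes) {
        int node = numa_node_of_cpu(sched_getcpu());  // node of this core
        return numa_alloc_onnode(bytes, node);        // free with numa_free(p, bytes)
    }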
An extensible global address space framework with decoupled task and data abstractions
2006
Although message passing using MPI is the dominant model for parallel programming today, the significant effort required to develop high-performance MPI applications has prompted the development of several more convenient parallel programming models. Models such as Co-Array Fortran, Global Arrays, Titanium, and UPC provide a global view of data, but face significant challenges in delivering high performance across a range of applications. It is particularly challenging to achieve high performance using global-address-space languages for unstructured applications with irregular data structures.
Scalable locality-conscious multithreaded memory allocation
Proceedings of the 5th …, 2006
We present Streamflow, a new multithreaded memory manager designed for low-overhead, high-performance memory allocation while transparently favoring locality. Streamflow enables low-overhead simultaneous allocation by multiple threads and adapts to sequential allocation at speeds comparable to those of custom sequential allocators. It favors the transparent exploitation of temporal and spatial object access locality, and reduces allocator-induced cache conflicts and false sharing, all using a unified design based on segregated heaps. Streamflow introduces an innovative design that uses only synchronization-free operations in the most common case of local allocations and deallocations, while requiring minimal, non-blocking synchronization in the less common case of remote deallocations. Spatial locality at the cache and page level is favored by eliminating headers on small objects, reducing allocator-induced conflicts via contiguous allocation of page blocks in physical memory, reducing allocator-induced false sharing through segregated heaps, and achieving better TLB performance and fewer page faults via superpages. Combining these locality optimizations with a drastic reduction in synchronization and latency overhead allows Streamflow to perform comparably with optimized sequential allocators, and to outperform four state-of-the-art multiprocessor allocators by sizeable margins in our experiments on a shared-memory system with four two-way SMT processors. The allocation-intensive sequential and parallel benchmarks used in our experiments represent a variety of behaviors, including mostly-local allocation-deallocation patterns and producer-consumer allocation-deallocation patterns.
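The remote-deallocation path can be approximated with a lock-free LIFO that other threads push freed blocks onto and the owner drains in batches; Streamflow's real design (segregated page blocks, headerless small objects) is considerably more involved. A sketch:

    // Remote-free list sketch: a Treiber-style lock-free stack. The owner
    // frees locally with no synchronization; remote threads push here.
    #include <atomic>

    struct FreeBlock { FreeBlock* next; };

    struct RemoteFreeList {
        std::atomic<FreeBlock*> head{nullptr};

        void push(FreeBlock* b) {                       // remote thread
            FreeBlock* old = head.load(std::memory_order_relaxed);
            do { b->next = old; }
            while (!head.compare_exchange_weak(old, b, std::memory_order_release,
                                               std::memory_order_relaxed));
        }

        FreeBlock* drain() {                            // owner thread, batched
            return head.exchange(nullptr, std::memory_order_acquire);
        }
    };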