RDMS: A hardware task scheduling algorithm for Reconfigurable Computing (original) (raw)

Hardware task scheduling optimizations for reconfigurable computing

2008 Second International Workshop on High-Performance Reconfigurable Computing Technology and Applications, 2008

Reconfigurable Computers (RC) can provide significant performance improvement for domain applications. However, wide acceptance of today's RCs among domain scientist is hindered by the complexity of design tools and the required hardware design experience. Recent developments in hardware/software co-design methodologies for these systems provide the ease of use, but they are not comparable in performance to manual co-design. This paper aims at improving the overall performance of hardware tasks assigned to FPGA. Particularly the analysis of inter-task communication as well as data dependencies among tasks are used to reduce the number of configurations and to minimize the communication overhead and task processing time. This work leverages algorithms developed in the RC and Reconfigurable Hardware (RH) domains to address efficient use of hardware resources to propose two algorithms, Weight-Based Scheduling (WBS) and Highest Priority First-Next Fit (HPF-NF). However, traditional resource based scheduling alone is not sufficient to reduce the performance bottleneck, therefore a comprehensive algorithm is necessary. The Reduced Data Movement Scheduling (RDMS) algorithm is proposed to address dependency analysis and inter-task communication optimizations. Simulation shows that compared to WBS and HPF-NF, RDMS is able to reduce the amount of FPGA configurations to schedule random generated graphs with heavy weight nodes by 30% and 11% respectively. Additionally, the proof-of-concept implementation of a complex 13-node example task graph on the SGI RC100 reconfigurable computer shows that RDMS is not only able to trim down the amount of necessary configurations from 6 to 4 but also to reduce communication overhead by 48% and the hardware processing time by 33%. * One instantiation of a FPGA configuration is denoted to the process of loading the corresponding bitstream into the device, configuring it, executing the tasks in the configuration, and then releasing the device.

Reconfiguration and Communication-Aware Task Scheduling for High-Performance Reconfigurable Computing

2010

High-performance reconfigurable computing involves acceleration of significant portions of an application using reconfigurable hardware. When the hardware tasks of an application cannot simultaneously fit in an FPGA, the task graph needs to be partitioned and scheduled into multiple FPGA configurations, in a way that minimizes the total execution time. This article proposes the Reduced Data Movement Scheduling (RDMS) algorithm that aims to improve the overall performance of hardware tasks by taking into account the reconfiguration time, data dependency between tasks, intertask communication as well as task resource utilization. The proposed algorithm uses the dynamic programming method. A mathematical analysis of the algorithm shows that the execution time would at most exceed the optimal solution by a factor of around 1.6, in the worst-case. Simulations on randomly generated task graphs indicate that RDMS algorithm can reduce interconfiguration communication time by 11% and 44% respectively, compared with two other approaches that consider data dependency and hardware resource utilization only. The practicality, as well as efficiency of the proposed algorithm over other approaches, is demonstrated by simulating a task graph from a real-life application - N-body simulation - along with constraints for bandwidth and FPGA parameters from existing high-performance reconfigurable computers. Experiments on SRC-6 are carried out to validate the approach.

Online Task Scheduling for the FPGA-Based Partially Reconfigurable Systems

2009

Given the FPGA-based partially reconfigurable systems, hardware tasks can be configured into (or removed from) the FPGA fabric without interfering with other tasks running on the same device. In such systems, the efficiency of task scheduling algorithms directly impacts the overall system performance. By using previously proposed 2D scheduling model, existing algorithms could not provide an efficient way to find all suitable allocations. In addition, most of them ignored the single reconfiguration port constraint and inter-task dependencies. Further more, to our best knowledge there is no previous work investigating in the impact on the scheduling result by reusing already placed tasks. In this paper, we focus on online task scheduling and propose task scheduling solution that takes the ignored constraints into account. In addition, a novel "reuse and partial reuse" approach is proposed. The simulation results show that our proposed solution achieves shorter application completion time up to 43.9% and faster single task response time up to 63.8% compared to the previously proposed stuffing algorithm.

Hardware Task Scheduling for Partially Reconfigurable FPGAs

Partial reconfiguration (PR) of FPGAs can be used to dynamically extend and adapt the functionality of computing systems, swapping in and out HW tasks. To coordinate the on-demand task execution, we propose and implement a run time system manager for scheduling software (SW) tasks on available processor(s) and hardware (HW) tasks on any number of reconfigurable regions of a partially reconfigurable FPGA. Fed with the initial partitioning of the application into tasks, the corresponding task graph, and the available task mappings, the RTSM considers the runtime status of each task and region, e.g. busy, idle, scheduled for reconfiguration/execution etc., to execute tasks. Our RTSM supports task reuse and configuration prefetching to minimize reconfigurations, task movement among regions to efficiently manage the FPGA area, and RR reservation for future reconfiguration and execution. We validate its correctness using our RTSM to execute an image processing application on a ZedBoard platform. We also evaluate its features within a simulation framework, and find that despite the technology limitations, our approach can give promising results in terms of quality of scheduling. 1 Introduction Reconfiguration can dynamically adapt the functionality of hardware systems by swapping in and out HW tasks. To select the proper resource for loading and triggering HW task reconfiguration and execution in partially reconfigurable systems with FPGAs, efficient and flexible runtime system support is needed [6]. In this paper we propose and implement a Run-Time System Manager (RTSM) incorporating efficient scheduling mechanisms that balance effectively the execution of HW and SW tasks and the use of physical resources. We aim to execute as fast as possible a given application, without exhausting the physical resources. Our motivation during the development of RTSM was to find ways to overcome the strict technology restrictions imposed by the Xilinx PR flow [8]:  Static partitioning of the reconfigurable surface in reconfigurable regions (RR).

A hybrid design-time/run-time scheduling flow to minimise the reconfiguration overhead of FPGAs

Microprocessors and Microsystems, 2004

Current multimedia applications are characterized by highly dynamic and non-deterministic behaviour as well as high-performance requirements. Potentially, partially reconfigurable fine-grain configurable architectures like FPGAs can be reconfigured at run-time to match the dynamic behaviour. However, the lack of programming support for dynamic task placement as well as the large configuration overhead has prevented their use for highly dynamic applications. To cope with these two problems, we have adopted an FPGA model with specific support for task allocation. On top of this model, we have applied an existing hybrid design-time/run-time scheduling flow initially developed for multiprocessor systems. Finally, we have extended this flow with specific modules that greatly reduce the reconfiguration overhead making it affordable for current multimedia applications. q

A Communication Aware Online Task Scheduling Algorithm for FPGA-Based Partially Reconfigurable Systems

2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2010

In this paper, we propose an efficient online task scheduling algorithm which targets 2D FPGA area partitioning model and takes into account the data dependency and the data communications 1) among hardware tasks and 2) between hardware tasks and external devices which have not been explicitly investigated in previous work. In the experiment with 10000 workloads, the evaluation result shows that our proposed scheduling algorithm is about 20x faster than the comparable approach.

A Methodology for Automating Co-Scheduling for Reconfigurable Computing Systems

2007 5th IEEE/ACM International Conference on Formal Methods and Models for Codesign (MEMOCODE 2007), 2007

A formal methodology for automatic hardwaresoftware partitioning and co-scheduling between the µP and the FPGA has not yet been established. Current work in automatic task partitioning and scheduling for reconfigurable systems strictly addresses the FPGA hardware, and does not take advantage of the synergy between the microprocessor and the FPGA. In this work, we consider the problem of co-scheduling task graphs on reconfigurable systems. The target systems have an execution model which allows any subtask that can run on the FPGA to also run on the microprocessor, and allows reconfigurability of the FPGA (subject to area, performance, resource, and timing constraints). In this paper, we introduce a methodology for automatic coscheduling using a proposed heuristic algorithm for hardware/software co-scheduling, ReCoS. It will be shown that the proposed algorithm provides up to an order of magnitude improvement in scheduling and execution times when compared with hardware/software co-schedulers found in related fields such as embedded systems, heterogeneous systems, and reconfigurable hardware systems.

Efficient Mapping of Hardware Tasks on Reconfigurable Computers Using Libraries of Architecture Variants

2009 17th IEEE Symposium on Field Programmable Custom Computing Machines, 2009

Scheduling and partitioning of task graphs on reconfigurable hardware needs to be carefully carried out in order to achieve the best possible performance. In this paper, we demonstrate that a significant improvement to the total execution time is possible by incorporating a library of hardware task implementations, which contains multiple architectural variants for each hardware task reflecting tradeoffs between the resources utilization and the task execution throughput. We develop a genetic algorithm based mapping approach, which considers both task graph and target platform, and present results for an N-body simulation application using estimated numbers for resource utilization for the constituent tasks and based on actual architectural constraints from different reconfigurable platforms. The results demonstrate improvements of up to 85.3% in the execution time, compared to choosing a fixed implementation variant for each task while keeping a reasonable searching time. * . Inspired by the Telescoping Languages by Kennedy et al. 2009 17th IEEE Symposium on Field Programmable Custom Computing Machines 978-0-7695-3716-0/09 $25.00

System-level parallelism and throughput optimization in designing reconfigurable computing applications

18th International Parallel and Distributed Processing Symposium, 2004. Proceedings., 2004

Reconfigurable Computers (RCs) can leverage the synergism between conventional processors and FPGAs to provide low-level hardware functionality at the same level of programmability as general-purpose computers. In a large class of applications, the total I/O time is comparable or even greater than the computations time. As a result, the rate of the DMA transfer between the microprocessor memory and the on-board memory becomes the performance bottleneck even on RCs. In this paper, we perform a theoretical and experimental study of this specific performance limitation. The mathematical formulation of the problem has been experimentally verified on the state-of-the art reconfigurable platform, SRC-6E. We demonstrate and quantify the possible solution to this problem that exploits the system-level parallelism within reconfigurable machines.