ViPar: High-Level Design Space Exploration for Parallel Video Processing Architectures

Real-Time Image and Video Processing Using High-Level Synthesis (HLS)

Advances in Multimedia and Interactive Technologies

Implementing high-performance, low-cost hardware accelerators for computationally intensive image and video processing algorithms has attracted a lot of attention over the last 20 years. Most recent research efforts have sought new design automation methods to close the gap between the ability to realize efficient accelerators in hardware and the tight performance requirements of complex image processing algorithms. High-level synthesis (HLS) is a design automation method that transforms a high-level algorithmic description into digital hardware while satisfying the design constraints. This chapter evaluates the suitability of HLS as a tool for accelerating the most demanding image and video processing algorithms in hardware. It discusses the gained benefits and current limitations, recent academic and commercial tools, the compiler's optimization techniques, and four case studies.
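As a hedged illustration (not taken from the chapter itself), the sketch below shows the kind of C++ kernel an HLS flow can accept, with pipeline and unroll directives standing in for the compiler optimization techniques the chapter discusses; the function name, frame width, and filter taps are assumptions made for the example.

```cpp
// Hypothetical 3-tap horizontal blur written for an HLS flow.
// The pragmas ask the HLS compiler to pipeline the pixel loop and
// fully unroll the small tap loop; a plain C++ compiler ignores them.
#include <cstdint>

constexpr int WIDTH = 640;   // assumed frame width
constexpr int TAPS  = 3;     // assumed filter length

void blur_row(const uint8_t in[WIDTH], uint8_t out[WIDTH]) {
    const uint8_t coeff[TAPS] = {1, 2, 1};   // simple low-pass taps
    for (int x = 1; x < WIDTH - 1; ++x) {
#pragma HLS PIPELINE II=1        // target: one output pixel per clock
        uint16_t acc = 0;
        for (int k = 0; k < TAPS; ++k) {
#pragma HLS UNROLL               // unroll the 3-tap multiply-accumulate
            acc += coeff[k] * in[x + k - 1];
        }
        out[x] = static_cast<uint8_t>(acc >> 2);   // normalize by 4
    }
}
```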

Daedalus: Toward composable multimedia MP-SoC design

Proceedings of the …, 2008

Daedalus is a system-level design flow for the design of multiprocessor system-on-chip (MP-SoC) based embedded multimedia systems. It offers a fully integrated tool-flow in which design space exploration (DSE), system-level synthesis, application mapping, and system prototyping of MP-SoCs are highly automated. In this paper, we describe our first industrial deployment experiences with the Daedalus framework. Daedalus is currently being deployed in the early stages of the design of an image compression system for very high resolution cameras targeting medical appliances. In this context, we performed a DSE study with a JPEG encoder application, which exploits both task and data parallelism. This application was mapped onto a range of different MP-SoC architectures. We achieved a performance speed-up of up to 20x compared to a single processor system. In addition, the results show that the Daedalus high-level MP-SoC models accurately predict the overall system performance, i.e., the performance error is around 5%.

Architecture driven memory allocation for FPGA based real-time video processing systems

2011 VII Southern Conference on Programmable Logic (SPL), 2011

In this paper, we present an approach that uses information about the FPGA architecture to achieve optimized allocation of embedded memory in real-time video processing systems. A cost function defined in terms of the required memory sizes and the available block- and distributed-RAM resources is used to motivate the allocation decision. This work is a high-level exploration that generates VHDL RTL modules and synthesis constraint files to specify the memory allocation. Results show that the proposed approach achieves an appreciable reduction in block RAM usage over a previous logic-to-memory mapping approach, at a negligible increase in logic usage.
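The abstract does not give the cost function itself; the sketch below is only one assumed form of such a function, scoring a candidate allocation by how the required buffer sizes fit the available block-RAM and distributed-RAM pools (all names and weights are hypothetical, not the paper's formulation).

```cpp
// Hypothetical cost function for assigning buffers to block RAM or
// distributed (LUT) RAM; fields and weights are illustrative only.
#include <cstddef>
#include <vector>

struct Buffer { std::size_t bits; bool to_block_ram; };

struct FpgaResources {
    std::size_t block_ram_bits;        // total BRAM capacity
    std::size_t distributed_ram_bits;  // total LUT-RAM capacity
};

// Returns a penalty: lower is better. Allocations exceeding either
// pool are heavily penalized; otherwise the cost reflects pressure
// on the scarcer block-RAM resource plus the logic cost of LUT RAM.
double allocation_cost(const std::vector<Buffer>& bufs,
                       const FpgaResources& res) {
    std::size_t bram_used = 0, dram_used = 0;
    for (const Buffer& b : bufs)
        (b.to_block_ram ? bram_used : dram_used) += b.bits;

    double cost = 0.0;
    if (bram_used > res.block_ram_bits)       cost += 1e9;  // infeasible
    if (dram_used > res.distributed_ram_bits) cost += 1e9;  // infeasible
    cost += static_cast<double>(bram_used) / res.block_ram_bits;
    cost += 0.25 * static_cast<double>(dram_used) / res.distributed_ram_bits;
    return cost;
}
```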

Video Encoding Analysis for Parallel Execution on Reconfigurable Architectures

Performance improvement on heterogeneous reconfigurable architectures depends on application analysis for parallel execution. This paper describes a performance analysis methodology for video encoding applications to estimate the expected performance of parallel execution on reconfigurable architectures. We formulate the performance estimation of a video encoding application on a target architecture as an equation that exposes the overhead factors hindering the speed-up of parallel execution.
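The abstract does not reproduce the equation; a plausible form, written here only as an assumption, expresses parallel execution time as the ideal division of sequential work plus the overhead terms that limit speed-up:

```latex
% Assumed form of the performance-estimation model (not the paper's exact equation):
% T_par : estimated parallel execution time on N processing elements
% T_seq : sequential execution time of the video encoder
% T_comm, T_sync, T_cfg : communication, synchronization, and reconfiguration overheads
\[
  T_{\mathrm{par}} = \frac{T_{\mathrm{seq}}}{N}
    + T_{\mathrm{comm}} + T_{\mathrm{sync}} + T_{\mathrm{cfg}},
  \qquad
  S = \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}}.
\]
```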

A Case Study of Design Space Exploration for Embedded Multimedia Applications on SoCs

Seventeenth IEEE International Workshop on Rapid System Prototyping (RSP'06), 2006

Embedded real-time multimedia applications usually imply data-parallel processing. SIMD processors embedded in SoCs are cost-effective for exploiting the underlying parallelism. However, programming applications for SIMD targets requires data placement and operation scheduling, which have been proven to be NP-complete problems and remain beyond the abilities of present compilers.

A templated programmable architecture for highly constrained embedded HD video processing

Journal of Real-Time Image Processing, 2018

The implementation of a video reconstruction pipeline is required to improve the quality of images delivered by highly constrained devices. These algorithms require high computing capacity: several dozen GOPS for real-time HD 1080p video streams. Today's embedded design constraints impose limitations on both silicon budget and power consumption, typically 2 mm² for half a watt. This paper presents the eISP architecture, which reaches 188 MOPS/mW with 94 GOPS/mm² and 378 GOPS/mW using TSMC 65-nm integration technology. This fully programmable and modular architecture is based on an analysis of video-processing algorithms. Synthesizable VHDL is generated taking different parameters into account, which simplifies architecture sizing and characterization.
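Taking the quoted figures at face value, a short sanity check relates the stated budgets to the required throughput (the pairing of the numbers is my own reading, not the paper's):

```latex
% Back-of-the-envelope check using the quoted efficiency figures:
\[
  0.5~\mathrm{W} \times 188~\mathrm{MOPS/mW} = 94~\mathrm{GOPS},
  \qquad
  2~\mathrm{mm^2} \times 94~\mathrm{GOPS/mm^2} = 188~\mathrm{GOPS},
\]
% both above the "several dozen GOPS" needed for a real-time
% HD 1080p reconstruction pipeline.
```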

Application specific instruction-set processor generation for video processing based on loop optimization

Motion estimation is the most demanding operation of a video encoder, accounting for at least 80% of the overall computational cost. With the proliferation of portable handheld devices that support digital video coding, data-adaptive motion estimation algorithms have been required to dynamically configure the search pattern, not only to avoid unnecessary computations and memory accesses but also to save energy. This paper proposes an application-specific instruction-set processor (ASIP) to implement data-adaptive motion estimation algorithms, characterized by a specialized data-path and a minimal, optimized instruction set. Owing to its low-power nature, this architecture is especially well suited to motion estimators for portable, mobile, and battery-supplied devices. A cycle-accurate simulator was also developed for the proposed ASIP, and fast data-adaptive search algorithms have been implemented, namely the four-step search and the motion vector field adaptive search algorithms. Based on the proposed ASIP and the considered adaptive algorithms, several motion estimators were synthesized in 0.13 µm CMOS technology. Experimental results show that very low-power adaptive motion estimators have been achieved for encoding QCIF video sequences.
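As a rough illustration of one of the algorithms mentioned, the sketch below outlines the classic four-step search (4SS) over a ±7 search window using SAD block matching; the block size, frame dimensions, and helper names are assumptions, and the ASIP's specialized data-path is not modelled.

```cpp
// Minimal four-step search (4SS) block-matching sketch: SAD over
// 16x16 blocks within a +/-7 window on a QCIF luma frame.
#include <cstdint>
#include <cstdlib>
#include <limits>

constexpr int W = 176, H = 144;   // assumed QCIF luma dimensions
constexpr int B = 16;             // assumed block size

static int sad(const uint8_t* cur, const uint8_t* ref,
               int bx, int by, int dx, int dy) {
    int sum = 0;
    for (int y = 0; y < B; ++y)
        for (int x = 0; x < B; ++x) {
            int rx = bx + x + dx, ry = by + y + dy;
            if (rx < 0 || ry < 0 || rx >= W || ry >= H)
                return std::numeric_limits<int>::max();   // candidate leaves the frame
            sum += std::abs(cur[(by + y) * W + bx + x] - ref[ry * W + rx]);
        }
    return sum;
}

// Finds the motion vector (mvx, mvy) for the block at (bx, by):
// up to three passes on a 5x5 pattern (step 2), then a 3x3 refinement.
void four_step_search(const uint8_t* cur, const uint8_t* ref,
                      int bx, int by, int& mvx, int& mvy) {
    int cx = 0, cy = 0;                         // current search centre
    int best = sad(cur, ref, bx, by, 0, 0);
    int step = 2;
    for (int pass = 0; pass < 4; ++pass) {
        if (pass == 3) step = 1;                // force the final refinement pass
        int bestDx = 0, bestDy = 0;
        for (int dy = -step; dy <= step; dy += step)
            for (int dx = -step; dx <= step; dx += step) {
                if (dx == 0 && dy == 0) continue;
                int s = sad(cur, ref, bx, by, cx + dx, cy + dy);
                if (s < best) { best = s; bestDx = dx; bestDy = dy; }
            }
        cx += bestDx; cy += bestDy;
        if (step == 1) break;                   // refinement pass completed
        if (bestDx == 0 && bestDy == 0) step = 1; // centre wins: refine next pass
    }
    mvx = cx; mvy = cy;
}
```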

Strategy for power-efficient design of parallel systems

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2000

Application studies in the areas of image and video processing indicate that between 50% and 80% of the power cost in these systems is due to data storage and transfers. This is especially true for multiprocessor realizations, because conventional parallelization methods ignore the power cost and focus only on performance. However, power consumption also depends heavily on the way a system is parallelized. To reduce this dominant cost, we propose to address the system-level storage organization for the multi-dimensional signals as a first step in mapping these applications, before the parallelization or partitioning decisions (in particular, before the SW/HW partitioning, which is traditionally done too early in the design trajectory). Our methodology is illustrated on a parallel QSDPCM video codec.