Rapid prototyping for an optimized MPEG4 decoder implementation over a parallel heterogeneous architecture

A low-cost media-processor based real-time MPEG-4 video decoder

IEEE Transactions on Consumer Electronics, 2003

Owing to rapid advances in semiconductor technology, general-purpose computers can easily process digital video, audio and graphics. This functionality has now shifted from the general-purpose computer to the so-called media-processor. Since such processors target mass-produced consumer electronics devices, price and ease of use are the main concerns. The latest MPEG-4 video coding standard, which requires a relatively low bit rate yet yields acceptable quality, meets these targets well. In this paper, by realizing an MPEG-4 codec on both a general-purpose processor and a media-processor, we investigate several issues related to using a media-processor to satisfy various multimedia compression and processing requirements. The issues investigated include speeding up the MPEG-4 decoder on a media-processor DSP chip, the use of a multimedia co-processor or hardware accelerator, and the Very Long Instruction Word (VLIW) and Single Instruction Multiple Data (SIMD) programming techniques. A media-processor based MPEG-4 solution for set-top box applications is also presented.

Mpeg-4 and the New Multimedia Architectural Challenges

INFORMATION technology issues & challenges, 2009

The recent development of multimedia applications has made them one of the most popular and most demanding types of workloads. To meet the new requirements and to map all the new multimedia functionality onto systems with restricted resources, a pressing need for visual data compression standards arose. This paper discusses the performance requirements of the MPEG standards and outlines some approaches to meeting them. The focus is on the first representative of the latest generation of visual data compression standards, MPEG-4. Its computational requirements and new architectural demands are analyzed. An innovative reconfigurable µ-architectural approach is presented as a new concept for a flexible and cost-effective implementation of multimedia processors. At the expense of three new instructions, the proposed mechanisms allow instructions, entire pieces of code, or a combination of the two to execute in a reconfigurable manner. These three instructions are proposed as an ISA extension of a superscalar processor to illustrate the advantages of this new concept.

Multicore System-on-Chip Architecture for MPEG-4 Streaming Video

Circuits and Systems for …, 2002

Mladen Bereković (M'96) received the Dipl.-Ing. degree in electrical engineering from the University of Hannover, Hannover, Germany, in 1995. Since then, he has been a Research Assistant with the Institute of Microelectronic Circuits and Systems, University of Hannover. His current research interests include VLSI architectures for video signal processing, MPEG-4, system-on-chip designs, and simultaneously multithreaded processor architectures.

Parallel H.264 Decoding on an Embedded Multicore Processor

Lecture Notes in Computer Science, 2009

In previous work the 3D-Wave parallelization strategy was proposed to increase the parallel scalability of H.264 video decoding. This strategy is based on the observation that inter-frame dependencies have a limited spatial range. The previous results, however, investigate application scalability on an idealized multiprocessor. This work presents an implementation of the 3D-Wave strategy on a multicore architecture composed of NXP TriMedia TM3270 embedded processors. The results show that the parallel H.264 implementation scales very well, achieving a speedup of more than 54 on a 64-core processor. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase since there can be many frames in flight, and that the latencies of some frames might increase. To address these drawbacks, policies to reduce the number of frames in flight and the frame latency are also presented. The results show that our policies combat memory and latency issues with a negligible effect on the performance scalability.
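
The abstract above does not spell out the dependency pattern behind the wavefront. As a minimal sketch, assume the commonly described intra-frame case in which each macroblock depends on its left, upper-left, upper and upper-right neighbours. The function name `wavefront_schedule` and the grid sizes are hypothetical, and the cross-frame part that gives 3D-Wave its name is not modelled here.

```python
from collections import defaultdict


def wavefront_schedule(width_mb, height_mb):
    """Group macroblocks (x, y) by the earliest step at which they can run.

    Assumption (not taken from the abstract): a macroblock depends on its
    left, upper-left, upper and upper-right neighbours. Under that pattern,
    macroblock (x, y) becomes ready at step x + 2 * y.
    """
    slots = defaultdict(list)
    for y in range(height_mb):
        for x in range(width_mb):
            slots[x + 2 * y].append((x, y))
    # Return the groups in step order; each group can be decoded in parallel.
    return [slots[step] for step in sorted(slots)]


def dependencies(x, y, width_mb):
    """Return the in-frame neighbours that (x, y) is assumed to depend on."""
    candidates = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
    return [(cx, cy) for cx, cy in candidates if 0 <= cx < width_mb and cy >= 0]


if __name__ == "__main__":
    schedule = wavefront_schedule(120, 68)  # 1920x1088 pixels in 16x16 blocks
    print(len(schedule), max(len(group) for group in schedule))
```

For a 120 x 68 macroblock grid this yields 254 steps with at most 60 macroblocks ready at once, which illustrates why parallelism inside a single frame is limited and why looking across frames can expose more of it.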

A Thread and Data-Parallel MPEG-4 Video Encoder for a System-On-Chip Multiprocessor

2005 IEEE International Conference on Application-Specific Systems, Architecture Processors (ASAP'05), 2005

We studied the dynamic instruction count reduction for a single-thread, vectorized and a multi-threaded, non-vectorized, MPEG-4 video encoder. Results indicate a maximum improvement of the order of 88% for 22 CPU contexts for the multi-threaded case whereas the single-thread, vectorized version demonstrates an 85% improvement for a vector register file length of 24 bytes, over the scalar case. We present VLSI macrocells of a vector accelerator implementing a subset of the MPEG-4 vector ISA and a 2-way, parametric, bus-based, cache coherent, SoC multi-processor.

Algorithmic and architectural enhancements for real-time MPEG-1 decoding on a general purpose RISC workstation

IEEE Transactions on Circuits and Systems for Video Technology, 1995

Traditional video decoders require use of specially designed video decompression processors. We present novel algorithmic and architectural enhancements that allowed for the first time the real-time decompression of MPEG-1 video and audio streams on a low-end, general purpose RISC processor. For video decompression, efficient algorithmic implementations were derived by examining the Huffman decoder, the inverse quantizer and the inverse DCT as a single system. For audio decompression, a new DCT based implementation of the subband filtering operation yields 30% speed improvement in the audio decoding process and 17% speed improvement in overall audio and video decoding. Besides algorithmic enhancements, a new set of "multimedia" instructions and minor changes in the design of a traditional RISC ALU allowed increased parallelism of pixel-based operations with minimal design and control overhead. Experimental results show that with the synergistic combination of algorithmic and architectural enhancements a multimedia-enhanced RISC processor can achieve higher decoding rates than generic RISC and CISC processors, even when these processors operate at higher clock rates and have larger instruction and data caches.

I. INTRODUCTION

Multimedia applications combine many different forms of information, including text, graphics, audio, and video. Video in particular has the potential to become just another data type. To the computer community, this usually implies that video will be digitally encoded so that it can be manipulated, stored, and transmitted along with other digital data types on standard computing platforms, storage devices and networks. Processing of raw video results in data rates that overwhelm the storage and interconnect capacities of today's networked computer systems. Compression algorithms can reduce this data rate to manageable levels.
A strong demand by the computer industry's customer base for open and interoperable systems has contributed significantly to the adoption of MPEG as the compression standard for motion-video compression. The first Moving Pictures Expert Group (MPEG-1) specification [1] defines a bit stream syntax for synchronized compressed video and audio with a maximum bit rate of

Manuscript received January 24, 1995; revised May 23, 1995. This paper was recommended by Guest Editor B. Ackland. V. Bhaskaran and K. Konstantinides are with Hewlett-Packard Laboratories, Palo Alto, CA 94304 USA. R. B. Lee is with the Hewlett-Packard Computer Systems Organization, Cupertino, CA 95014 USA. J. Beck is with DOME Imaging Systems, Waltham, MA 02154 USA. IEEE Log Number 9414346.

Dr. Lee has served as program chair of the Hot Chips conference and is an editorial board member of IEEE Micro, IEEE Spectrum, and HP Journal. She is a member of the Phi Beta Kappa and Alpha Lambda Delta honoraries and the ACM. John P. Beck has worked with Digital Equipment Corporation and Apollo on high-performance CPUs and workstations. In 1989, he joined HP's Apollo division, where he worked on digital video conferencing and graphics accelerators. He is currently with DOME Imaging Systems.

Mapping and optimization of the AVS video decoder on a high performance chip multiprocessor

2010 IEEE International Conference on Multimedia and Expo, 2010

Modern multimedia workloads provide increased levels of quality and compression efficiency at the expense of substantially increased computational complexity. It is important to leverage emerging off-the-shelf multi-core processor architectures and exploit all levels of parallelism in such workloads in order to achieve real-time functionality at a reasonable cost. This paper presents the implementation, optimization and characterization of the AVS video decoder on Intel Core i7, a quad-core, hyper-threaded chip multiprocessor (CMP). AVS (Audio Video Standard), a new compression standard from China, is competing with H.264 to potentially replace MPEG-2, mainly in the Chinese market. We show that it is necessary to perform a series of software optimizations and exploit parallelism at different levels in order to achieve Full HD real-time functionality. The input-dependent variability of execution time per work chunk is addressed using dynamic scheduling to allocate work to each thread. Moreover, we evaluate the interaction of the application with the i7 CMP architecture using both high- and low-level performance metrics. Finally, we evaluate a new feature of Intel's i7 micro-architecture called Turbo Boost, which dynamically varies the frequencies of non-idling cores to optimize performance.
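
The dynamic scheduling mentioned in the abstract above can be illustrated with a shared work queue: each thread takes the next chunk as soon as it finishes its current one, so slow chunks do not stall a fixed partition. This is only a sketch of the general idea, not the paper's implementation; `run_dynamic` and the per-chunk function passed to it are hypothetical names.

```python
import queue
import threading


def run_dynamic(chunks, work_fn, num_threads=4):
    """Process chunks with dynamic scheduling over a shared queue.

    Each worker repeatedly pulls the next unprocessed chunk, so threads
    that draw cheap chunks simply come back for more work sooner.
    Results are returned in the original chunk order.
    """
    tasks = queue.Queue()
    for index, chunk in enumerate(chunks):
        tasks.put((index, chunk))
    results = [None] * len(chunks)

    def worker():
        while True:
            try:
                index, chunk = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            results[index] = work_fn(chunk)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    # Stand-in workload: the "cost" of a chunk is just its value.
    print(run_dynamic([5, 1, 3, 2], lambda cost: cost * cost))
```

In CPython this shows the allocation policy rather than a real speedup for CPU-bound work; a native decoder would apply the same pattern with threads that run truly in parallel.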

Thread-Parallel MPEG-2 and MPEG-4 Encoders for Shared-Memory System-On-Chip Multiprocessors

International Journal of Computers and Applications, 2007

This work focuses on speeding up MPEG-2 and MPEG-4 encoding by using thread-parallelism for shared-memory, System-On-Chip multiprocessors. The performance improvement of the MPEG encoders is demonstrated by reducing the dynamic instruction count at multiple processor contexts and then mapping the encoders onto a configurable SoC multiprocessor. The resulting reduction in the dynamic instruction count of the parallelized MPEG-2 TM5 encoder reaches a maximum of 95% for 32 processor contexts, and that of the MPEG-4 XViD encoder a maximum of 83% for 16 processor contexts, both compared to the sequential encoder. To realize the parallelized encoders we present a configurable, N-way, extensible, bus-based, cache-coherent SoC multiprocessor, augmented with data-parallel coprocessors, and we give the VLSI implementation for the 2-way and 4-way configurations.
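
The percentages above are reductions in dynamic instruction count, not measured speedups. If one assumes, purely for illustration, that execution time scales with that count, a reduction r corresponds to a factor of 1 / (1 - r). The helper name `reduction_to_factor` is hypothetical.

```python
def reduction_to_factor(reduction):
    """Convert a fractional reduction (e.g. 0.95) into a ratio old/new.

    Assumption: the reduced quantity is a proxy for execution time, which
    the abstract does not claim; it reports instruction counts only.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    for label, reduction in [("MPEG-2 TM5, 32 contexts", 0.95),
                             ("MPEG-4 XViD, 16 contexts", 0.83)]:
        print(f"{label}: about {reduction_to_factor(reduction):.1f}x fewer instructions")
```

Under that assumption the 95% figure corresponds to roughly a 20-fold reduction and the 83% figure to a little under 6-fold.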

High level H.264/AVC video encoder parallelization for multiprocessor implementation

2009

H.264/AVC (Advanced Video Codec) is a new video coding standard developed through a joint effort of the ITU-T VCEG and ISO/IEC MPEG. This standard provides higher coding efficiency than former standards at the expense of higher computational requirements. Implementing the H.264 video encoder for an embedded System-on-Chip (SoC) is a big challenge. For an efficient implementation, we motivate the use of multiprocessor platforms for the execution of a parallel model of the encoder. In this paper, we propose a high-level, target-architecture-independent parallelization methodology for the development of an optimized parallel model of an H.264/AVC encoder (i.e. a process network model balanced in communication and computation workload).