High-performance implementation of wavelet algorithms on a standard PC

Wavelet Transform for Large Scale Image Processing on Modern Microprocessors

Lecture Notes in Computer Science, 2003

In this paper we discuss several issues relevant to the vectorization of a 2-D Discrete Wavelet Transform on current microprocessors. Our research builds on previous studies of the efficient exploitation of the memory hierarchy, which has a tremendous impact on performance. We have extended this work with a more detailed analysis based on hardware performance counters and with a study of vectorization; in particular, we have used the Intel Pentium SSE instruction set. Most of our optimizations are performed at source code level to allow automatic vectorization, though some compiler intrinsic functions have been introduced to enhance performance. Taking into account the level of abstraction at which the optimizations are performed, the results obtained on an Intel Pentium III microprocessor are quite satisfactory, even though further improvement can be obtained by a more extensive use of compiler intrinsics.

On the Design of Fast Wavelet Transform Algorithms With Low Memory Requirements

IEEE Transactions on Circuits and Systems for Video Technology, 2000

In this paper, a new algorithm to efficiently compute the two-dimensional wavelet transform is presented. This algorithm aims at low memory consumption and reduced complexity, meeting these requirements by means of line-by-line processing. In this proposal, we use recursion to automatically determine the order in which the wavelet transform is computed. This way, we solve some synchronization problems that have not been tackled by previous proposals. Furthermore, unlike other similar proposals, our proposal can be straightforwardly implemented from the algorithm description. To this end, a general algorithm is given, which is further detailed to allow its implementation with a simple filter bank or with the more efficient lifting scheme. We also include a new fast run-length encoder to be used along with the proposed wavelet transform for fast image compression and reduced memory consumption. When a 5-megapixel image is transformed, experimental results show that the proposed wavelet transform requires 200 times less memory and is five times faster than the regular one. If we consider the whole coding system, numerical results show that it achieves state-of-the-art performance with very low memory requirements and fast execution, making it an interesting solution for resource-constrained devices such as mobile phones, digital cameras, and PDAs.

EFFICIENT IMPLEMENTATIONS OF WAVELET TRANSFORMS–A ROADMAP

2001

The two major implementation methods for the discrete, two-dimensional binary-tree wavelet decomposition are presented. They are proposed in the context of efficient coupling with coding algorithms of compression standards, namely JPEG-2000 and MPEG-4. When implemented in software or hardware systems, they are capable of producing in real-time the binary-tree decomposition of the entire input image with a higher sample-rate. This is achieved by dividing and localizing the processing into small blocks of data. These blocks can efficiently be handled by a cache hierarchy in a programmable processor or by a custom-hardware design.

Performance and Power Comparative Study of Discrete Wavelet Transform on Programmable Processors

2002

Discrete Wavelet Transforms (DWT) are data-intensive algorithms. When programmable platforms are considered, the energy dissipation and execution time of such algorithms depend heavily on the performance of the data memory hierarchy. Existing filtering operations for the 1D-DWT employ different levels of data-access locality. However, locality of data references usually comes at the expense of complex control and addressing operations. In this paper, the two main scheduling techniques for the 1D-DWT are compared in terms of energy consumption and performance. Additionally, the effect of an in-place mapping scheme, which minimizes memory requirements and improves locality of data references for the 1D-DWT, is described and evaluated. Two commercially available general-purpose processors are used as the execution platform.
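As a rough illustration of what an in-place mapping can look like, the sketch below overwrites each sample with the coefficient computed from it, so no second buffer is needed. The abstract does not say which filters the paper uses; the integer Haar (S-transform) lifting steps here are a stand-in chosen for brevity, and the function names `haar_inplace_forward` and `haar_inplace_inverse` are hypothetical. After level j, approximation coefficients sit at indices that are multiples of 2^j and detail coefficients at the remaining positions.

```python
def haar_inplace_forward(x, levels):
    """Multi-level integer Haar lifting that overwrites the list x in place.

    Assumption: Haar is used only as a simple stand-in filter.
    """
    n = len(x)
    if levels < 0 or n % (1 << levels):
        raise ValueError("len(x) must be a multiple of 2**levels")
    for level in range(levels):
        stride = 1 << level
        for i in range(0, n, 2 * stride):
            x[i + stride] -= x[i]        # detail = odd sample - even sample
            x[i] += x[i + stride] >> 1   # approximation = even + floor(detail / 2)
    return x


def haar_inplace_inverse(x, levels):
    """Undo haar_inplace_forward, again without any extra buffer."""
    n = len(x)
    if levels < 0 or n % (1 << levels):
        raise ValueError("len(x) must be a multiple of 2**levels")
    for level in reversed(range(levels)):
        stride = 1 << level
        for i in range(0, n, 2 * stride):
            x[i] -= x[i + stride] >> 1   # recover the even sample
            x[i + stride] += x[i]        # recover the odd sample
    return x


if __name__ == "__main__":
    data = [4, 6, 10, 12, 8, 6, 5, 5]
    print(haar_inplace_forward(list(data), 2))
```

The cost of this layout is the one the abstract points to: each further level touches samples that are `stride` elements apart, so addressing becomes more involved even though memory use stays at a single buffer.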

A Wavelet Toolbox for Large Scale Image Processing

Lecture Notes in Computer Science, 1999

The wavelet transform has proven to be a valuable tool for image processing applications such as image compression and noise reduction. In this paper we present a scheme to process very large images that do not fit in the memory of a single computer, based on the software library WAILI (Wavelets with Integer Lifting). Such images are divided into blocks that are processed quasi-independently, allowing efficient parallel programming. The blocking is almost completely transparent to the user.

Implementing the 2-D Wavelet Transform on SIMD-Enhanced General-Purpose Processors

IEEE Transactions on Multimedia, 2000

The 2-D Discrete Wavelet Transform (DWT) consumes up to 68% of the JPEG2000 encoding time. In this paper, we develop efficient implementations of this important kernel on general-purpose processors (GPPs), in particular the Pentium 4 (P4). Efficient implementations of the 2-D DWT on the P4 must address three issues. First, the P4 suffers from a problem known as 64K aliasing, which can degrade performance by an order of magnitude. We propose two techniques to avoid 64K aliasing which improve performance by a factor of up to 4.20. Second, a straightforward implementation of vertical filtering incurs many cache misses. Cache performance can be improved by applying loop interchange, but there will still be many conflict misses if the filter length exceeds the cache associativity. Two methods are proposed to reduce the number of conflict misses which provide an additional performance improvement of up to 1.24. To show that these methods are general, results for the P3 and Opteron are also provided. Third, efficient implementations of the 2-D DWT must exploit the SIMD instructions supported by most GPPs, including the P4, and we present MMX and SSE implementations of horizontal and vertical filtering which provide a maximum speedup of 3.39 and 6.72, respectively.
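The loop interchange mentioned for vertical filtering can be sketched as follows, assuming a row-major image stored in a flat list. This is not the paper's code: the function names are hypothetical, the filter is applied without downsampling or border extension, and Python will not reproduce the cache behaviour. The sketch only shows how the access order changes. In the column-wise version consecutive reads are `width` elements apart; in the interchanged version the innermost loop walks along a row, so consecutive reads are adjacent in memory.

```python
def vertical_filter_columnwise(img, width, height, taps):
    """Naive order: one column at a time, so accesses jump by `width`."""
    k = len(taps)
    out_h = height - k + 1
    out = [0.0] * (out_h * width)
    for c in range(width):
        for r in range(out_h):
            acc = 0.0
            for t in range(k):
                acc += taps[t] * img[(r + t) * width + c]
            out[r * width + c] = acc
    return out


def vertical_filter_interchanged(img, width, height, taps):
    """Interchanged order: the innermost loop runs along a row."""
    k = len(taps)
    out_h = height - k + 1
    out = [0.0] * (out_h * width)
    for r in range(out_h):
        dst = r * width
        for t in range(k):
            coeff = taps[t]
            src = (r + t) * width
            for c in range(width):
                out[dst + c] += coeff * img[src + c]
    return out


if __name__ == "__main__":
    width, height = 5, 4
    img = [float(i * i % 7) for i in range(width * height)]
    taps = [0.25, 0.5, 0.25]
    a = vertical_filter_columnwise(img, width, height, taps)
    b = vertical_filter_interchanged(img, width, height, taps)
    print(a == b)
```

Both versions add the filter taps in the same order for every output sample, so they return identical values. The 64K aliasing and conflict-miss techniques described in the abstract are separate optimizations that this sketch does not attempt.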

Low Complexity Hardware Architectures for Wavelet Transforms: A Survey

IOP Conference Series: Materials Science and Engineering, 2018

At present, the major focus is on developing techniques that reduce hardware cost and hardware complexity while still meeting the requirements of a real-time system. Improving hardware models of the Discrete Wavelet Transform (DWT) is still a relatively new research topic. Open problems include developing effective hardware acceleration for the DWT of the JPEG2000 standard, constructing a practical model, and dealing with the computational and communication energy limitations of the image compression system. This paper presents a comprehensive survey of solutions that improve implementations of the computation-intensive DWT, particularly for low-power image compression applications. It focuses on the major factors in lowering the energy of the DWT's principal energy-consuming phase, considered relative to the energy consumption of the whole wavelet-based image compression system. These factors may include hardware-based features such as basic coding features, low memory requirements, and low computational load. Other related research areas are also being investigated.

Parallel implementation of the discrete wavelet transform on graphics processing units

2014 1st International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), 2014

The discrete wavelet transform (DWT) has diverse applications in signal and image processing. In this paper, we have implemented the lifting "Le Gall 5/3" algorithm on a low-cost NVIDIA GPU (graphics processing unit) with MATLAB to speed up computation. The efficiency of our GPU-based implementation is measured and compared with CPU-based algorithms. Our experimental results show that the GPU improves performance by a factor of more than 1.52 over the CPU for an image of size 3072x3072.
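For readers unfamiliar with the Le Gall 5/3 lifting steps, a minimal one-dimensional CPU sketch is given below. It is not the authors' MATLAB/GPU implementation. It assumes the reversible integer form of the 5/3 filter used in JPEG 2000, with symmetric extension at the borders and an even-length input, and the function names are hypothetical. A 2-D transform would apply the same steps along rows and then along columns.

```python
def legall53_forward(x):
    """One level of reversible Le Gall 5/3 lifting on an even-length list of ints.

    Returns (s, d): approximation and detail coefficients.
    """
    n = len(x)
    if n < 2 or n % 2:
        raise ValueError("expected an even-length signal with at least 2 samples")
    half = n // 2
    even, odd = x[0::2], x[1::2]
    # Predict step: subtract the floor-average of the two even neighbours.
    # The missing right neighbour at the border is mirrored.
    d = [odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1)
         for i in range(half)]
    # Update step: add a rounded quarter of the two neighbouring details.
    # The missing left detail at the border is mirrored.
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2)
         for i in range(half)]
    return s, d


def legall53_inverse(s, d):
    """Invert legall53_forward by undoing the lifting steps in reverse order."""
    half = len(s)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2)
            for i in range(half)]
    odd = [d[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1)
           for i in range(half)]
    x = [0] * (2 * half)
    x[0::2] = even
    x[1::2] = odd
    return x


if __name__ == "__main__":
    signal = [3, 7, 1, 8, 2, 9, 4, 6]
    s, d = legall53_forward(signal)
    print(legall53_inverse(s, d) == signal)
```

Because every lifting step is undone exactly by its mirror step, the integer transform reconstructs the input without loss. Within each step the outputs do not depend on one another, which is what makes a data-parallel GPU mapping plausible.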