The Parallel Algorithm for the 2-D Discrete Wavelet Transform

Accelerating Discrete Wavelet Transforms on Parallel Architectures

2017

The 2-D discrete wavelet transform (DWT) is at the heart of many image-processing algorithms. Recently, several studies have compared the performance of this transform on various shared-memory parallel architectures, especially on graphics processing units (GPUs). All of these studies, however, considered only separable calculation schemes. We show that the corresponding separable parts can be merged into non-separable units, which halves the number of steps. In addition, we introduce an optional optimization approach that reduces the number of arithmetic operations. The discussed schemes were adapted to the OpenCL framework and to pixel shaders, and then evaluated using GPUs from the two biggest vendors. We demonstrate the performance of the proposed non-separable methods by comparing them with existing separable schemes. The non-separable schemes outperform their separable counterparts on numerous setups, especially when considering the pixel shaders.
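
To make the separable-versus-non-separable distinction concrete, here is a minimal sketch (an illustration under assumed simplifications, not the paper's code) of a single lifting predict step with a hypothetical coefficient ALPHA. The separable version sweeps the image twice, once along rows and once along columns; the fused version computes the same coefficients in a single non-separable sweep whose stencil contains the ALPHA-squared cross terms produced by merging the two passes.

```python
import numpy as np

ALPHA = -0.5  # hypothetical coefficient of a single lifting predict step

def predict_separable(x):
    """Row pass followed by column pass: two sweeps over the image."""
    a = np.asarray(x, dtype=float).copy()
    rows, cols = a.shape
    for i in range(rows):                       # horizontal predict
        for j in range(1, cols - 1, 2):
            a[i, j] += ALPHA * (a[i, j - 1] + a[i, j + 1])
    for j in range(cols):                       # vertical predict
        for i in range(1, rows - 1, 2):
            a[i, j] += ALPHA * (a[i - 1, j] + a[i + 1, j])
    return a

def predict_fused(x):
    """Single non-separable sweep producing the same coefficients."""
    x = np.asarray(x, dtype=float)
    b = x.copy()
    rows, cols = x.shape
    for i in range(rows):
        for j in range(cols):
            odd_i = (i % 2 == 1) and (i < rows - 1)
            odd_j = (j % 2 == 1) and (j < cols - 1)
            if odd_j:                           # horizontal neighbours
                b[i, j] += ALPHA * (x[i, j - 1] + x[i, j + 1])
            if odd_i:                           # vertical neighbours
                b[i, j] += ALPHA * (x[i - 1, j] + x[i + 1, j])
            if odd_i and odd_j:                 # cross terms from merging the passes
                b[i, j] += ALPHA * ALPHA * (x[i - 1, j - 1] + x[i - 1, j + 1]
                                            + x[i + 1, j - 1] + x[i + 1, j + 1])
    return b

img = np.random.rand(8, 8)
assert np.allclose(predict_separable(img), predict_fused(img))
```

On a GPU, each sweep typically corresponds to one kernel launch or synchronization barrier, which is why fusing the passes roughly halves the number of steps.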

Speeding-up the discrete wavelet transform computation with multicore and GPU-based algorithms

Parallel Computing, 2011

In this work we propose several parallel algorithms to compute the two-dimensional discrete wavelet transform (2D-DWT), exploiting the available hardware resources. In particular, we explore OpenMP-optimized versions of the 2D-DWT on a multicore platform, and we also develop CUDA-based 2D-DWT algorithms able to run on GPUs (graphics processing units). The proposed algorithms are based on several 2D-DWT computation approaches, namely (1) filter-bank convolution, (2) the lifting transform and (3) matrix convolution, so that we can determine which of them adapts best to our parallel versions. All proposed algorithms are based on the Daubechies 9/7 filter, which is widely used in image/video compression.
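
For reference, the Daubechies (CDF) 9/7 transform mentioned above factors into four lifting steps plus a final scaling. The sketch below is a plain 1-D illustration with the commonly quoted coefficient values; the scaling convention varies between implementations, so treat that last step as an assumption rather than a normative definition.

```python
import numpy as np

# Commonly quoted CDF 9/7 lifting coefficients (Daubechies-Sweldens factorization)
ALPHA = -1.586134342
BETA  = -0.052980118
GAMMA =  0.882911076
DELTA =  0.443506852
ZETA  =  1.149604398   # scaling constant; conventions differ between codecs

def dwt97_1d(signal):
    """One level of the 1-D CDF 9/7 forward transform via lifting.
    Returns (approximation, detail) for an even-length input."""
    x = np.asarray(signal, dtype=float).copy()
    n = x.size

    def mirror(i):                 # whole-sample symmetric boundary extension
        if i < 0:
            return -i
        if i >= n:
            return 2 * (n - 1) - i
        return i

    # two predict steps (odd samples) interleaved with two update steps (even samples)
    for coeff, start in ((ALPHA, 1), (BETA, 0), (GAMMA, 1), (DELTA, 0)):
        for i in range(start, n, 2):
            x[i] += coeff * (x[mirror(i - 1)] + x[mirror(i + 1)])

    return ZETA * x[0::2], x[1::2] / ZETA        # scaled approximation, detail
```

Applying dwt97_1d first to every row and then to every column of an image yields the separable 2D-DWT with the usual LL, HL, LH and HH subbands.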

Vectorization and Parallelization of 2-D Wavelet Lifting

With the widespread use of the discrete wavelet transform in image processing, the need for its efficient implementation has become increasingly important. This work presents several novel SIMD-vectorized algorithms for the 2-D discrete wavelet transform using the lifting scheme. First, a stand-alone core of the previously known single-loop approach is extracted. This core is then simplified by an appropriate reorganization of operations. Furthermore, the influence of the CPU cache on the 2-D processing order is examined. Finally, SIMD vectorizations and parallelizations of the proposed approaches are evaluated. The best of the proposed algorithms scale almost linearly with the number of threads. For all of the platforms used in the tests, these algorithms are significantly faster than other known methods, as shown in the experimental sections of the paper.

A parallel implementation of the 2-D discrete wavelet transform without interprocessor communications

IEEE Transactions on Signal Processing, 1999

The discrete wavelet transform is currently attracting much interest among researchers and practitioners as a powerful tool for a wide variety of digital signal and image processing applications. This correspondence presents an efficient approach to computing the two-dimensional (2-D) discrete wavelet transform in standard form on parallel general-purpose computers. The approach does not require transposition of intermediate results and avoids interprocessor communication. Since it is based on matrix-vector multiplication, our technique does not introduce any restriction on the size of the input data or on the transform parameters. Complete use of the available processor parallelism, modularity, and scalability are achieved. Theoretical and experimental evaluations and comparisons are given with respect to traditional parallelization.
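
As a toy illustration of the matrix-based view (using the orthogonal Haar wavelet for brevity rather than the paper's filters, and without its processor partitioning), one level of the separable 2-D transform can be written as two matrix products, Y = W X W^T, where W stacks the low-pass rows above the high-pass rows; distributing blocks of rows of X over processors then amounts to distributing independent pieces of a matrix product.

```python
import numpy as np

def haar_analysis_matrix(n):
    """Single-level orthogonal Haar analysis matrix for an even size n
    (a stand-in for the paper's filters, chosen so the sketch stays short)."""
    w = np.zeros((n, n))
    c = 1.0 / np.sqrt(2.0)
    for k in range(n // 2):
        w[k, 2 * k], w[k, 2 * k + 1] = c, c                     # low-pass rows
        w[n // 2 + k, 2 * k], w[n // 2 + k, 2 * k + 1] = c, -c   # high-pass rows
    return w

n = 8
x = np.random.rand(n, n)
w = haar_analysis_matrix(n)
y = w @ x @ w.T                      # top-left n/2 x n/2 block of y is the LL subband
assert np.allclose(w.T @ y @ w, x)   # perfect reconstruction (W is orthogonal)
```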

Parallel Wavelet Transform for Large Scale Image Processing

2002

In this paper we discuss several issues relevant to the parallel implementation of a 2-D Discrete Wavelet Transform (DWT) on general-purpose multiprocessors. Our interest in this transform is motivated by its use in an image fusion application that has to manage large image sizes, making parallel computing highly advisable. We have also paid much attention to memory hierarchy exploitation, since it has a tremendous impact on performance due to the lack of spatial locality when the DWT processes image columns.
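
The locality problem mentioned above arises because the column pass of a row-major image strides through memory by a full row per access. One standard remedy, sketched here with a hypothetical single vertical predict step rather than the paper's actual kernels, is to interchange the loops so that the innermost loop walks contiguous memory; in a compiled implementation this is also what makes SIMD vectorization of the column pass practical.

```python
import numpy as np

ALPHA = -0.5  # hypothetical predict coefficient

def vertical_predict_strided(x):
    """Naive column pass: the inner loop jumps a full row between accesses."""
    a = np.asarray(x, dtype=float).copy()
    rows, cols = a.shape
    for j in range(cols):
        for i in range(1, rows - 1, 2):
            a[i, j] += ALPHA * (a[i - 1, j] + a[i + 1, j])
    return a

def vertical_predict_contiguous(x):
    """Loop-interchanged column pass: the inner loop over j touches
    consecutive elements of each row, which is cache-friendly."""
    a = np.asarray(x, dtype=float).copy()
    rows, cols = a.shape
    for i in range(1, rows - 1, 2):
        for j in range(cols):
            a[i, j] += ALPHA * (a[i - 1, j] + a[i + 1, j])
    return a

img = np.random.rand(64, 64)
assert np.allclose(vertical_predict_strided(img), vertical_predict_contiguous(img))
```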

Parallel Performance of Fast Wavelet Transforms

International Journal of High Speed Computing, 2000

We present a parallel 2D wavelet transform algorithm with modest communication requirements. Data are transmitted between nearest neighbors only and the amount is independent of the problem size as well as the number of processors. An analysis of the theoretical performance shows that the algorithm is scalable, approaching perfect speedup as the problem size is increased. This performance is realized in practice on the IBM SP2 as well as on the Fujitsu VPP300, where it will form part of the Scientific Software Library.

Parallel Implementation of the 2D Discrete Wavelet Transform on Graphics Processing Units: Filter Bank versus Lifting

IEEE Transactions on Parallel and Distributed Systems, 2008

The widespread usage of the discrete wavelet transform (DWT) has motivated the development of fast DWT algorithms and their tuning on all sorts of computer systems. Several studies have compared the performance of the most popular schemes, known as Filter Bank Scheme (FBS) and Lifting Scheme (LS), and have always concluded that LS is the most efficient option. However, there is no such study on streaming processors such as modern Graphics Processing Units (GPUs). Current trends have transformed these devices into powerful stream processors with enough flexibility to perform intensive and complex floating-point calculations. The opportunities opened up by these platforms, as well as the growing popularity of the DWT within the computer graphics field, make a new performance comparison of great practical interest. Our study indicates that FBS outperforms LS in current-generation GPUs. In our experiments, the actual FBS gains range between 10 percent and 140 percent, depending on the problem size and the type and length of the wavelet filter. Moreover, design trends suggest higher gains in future-generation GPUs.
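
To make the terminology concrete, the Filter Bank Scheme computes each subband by convolution followed by downsampling. Below is a minimal 1-D sketch with the CDF (Le Gall) 5/3 analysis filters; boundaries are zero-padded for brevity, whereas codecs normally use symmetric extension, so the edge coefficients here are only illustrative. The Lifting Scheme obtains the same coefficients with fewer arithmetic operations via in-place predict and update steps, as in the 5/3 lifting sketch after the last abstract below.

```python
import numpy as np

# CDF (Le Gall) 5/3 analysis filters
LOWPASS  = np.array([-1.0, 2.0, 6.0, 2.0, -1.0]) / 8.0
HIGHPASS = np.array([-1.0, 2.0, -1.0]) / 2.0

def fbs_analysis_1d(x):
    """Filter Bank Scheme: convolve, then keep every second output sample.
    The approximation is aligned to even input positions, the detail to odd ones."""
    x = np.asarray(x, dtype=float)
    n = x.size
    lo = np.convolve(x, LOWPASS)[2::2][: n // 2]
    hi = np.convolve(x, HIGHPASS)[2::2][: n // 2]
    return lo, hi
```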

A scalable parallel 2D wavelet transform algorithm

1997

We present a new parallel 2D wavelet transform algorithm with minimal communication requirements. Data are transmitted between nearest neighbors only and the amount is independent of the problem size as well as the number of processors. An analysis of the theoretical performance shows that our algorithm is highly scalable, approaching perfect speedup as the problem size is increased. This performance is realized in practice on the IBM SP2 as well as on the Fujitsu VPP300, where it will form part of the Scientific Software Library.
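
The constant, size-independent communication volume described above is essentially a halo (ghost-sample) exchange: with a short filter, each processor needs only a fixed number of border samples from its neighbour before it can transform its own block independently. The toy single-process sketch below illustrates this with the 5/3 predict step on a 1-D signal split into two blocks; the actual data distribution and filters of the paper may differ.

```python
import numpy as np

HALO = 1  # the 5/3 predict step needs exactly one sample beyond each block edge

def predict53(block, right_halo):
    """Detail coefficients of one block, given its constant-size right halo."""
    ext = np.concatenate([block, right_halo])
    return ext[1:-1:2] - 0.5 * (ext[0:-2:2] + ext[2::2])

x = np.arange(16, dtype=float)
left, right = x[:8], x[8:]
d_left  = predict53(left,  right[:HALO])      # ghost sample received from the neighbour
d_right = predict53(right, right[-2:-1])      # symmetric extension at the global edge
d_full  = predict53(x, x[-2:-1])              # same transform computed in one piece
assert np.allclose(np.concatenate([d_left, d_right]), d_full)
```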

2-D Wavelet Transform Enhancement on General-Purpose Microprocessors: Memory Hierarchy and SIMD Parallelism Exploitation

Lecture Notes in Computer Science, 2002

This paper addresses the implementation of a 2-D Discrete Wavelet Transform on general-purpose microprocessors, focusing on both memory hierarchy and SIMD parallelization issues. The two topics are related, since SIMD extensions are only useful if the memory hierarchy is efficiently exploited. In this work, locality has been significantly improved by means of a novel approach called pipelined computation, which complements previous techniques based on loop tiling and non-linear layouts. As experimental platforms we have employed Pentium-III (P-III) and Pentium-4 (P-4) microprocessors. Our SIMD-oriented tuning, however, has been performed exclusively at the source-code level: essentially, we have reordered some loops and introduced modifications that allow automatic vectorization. Taking into account the abstraction level at which the optimizations are carried out, the speedups obtained on the investigated platforms are quite satisfactory, even though further improvement could be obtained by dropping to a lower level of abstraction (compiler intrinsics or assembly code).

Parallel implementation of the discrete wavelet transform on graphics processing units

2014 1st International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), 2014

The discrete wavelet transform (DWT) has diverse applications in signal and image processing. In this paper, we implement the lifting-based Le Gall 5/3 algorithm on a low-cost NVIDIA GPU (graphics processing unit) using MATLAB to achieve a speedup in computation. The efficiency of our GPU-based implementation is measured and compared with CPU-based algorithms. Our experimental results show a GPU performance improvement of more than a factor of 1.52 over the CPU for an image of size 3072x3072.
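
For completeness, the reversible Le Gall 5/3 transform referred to above (as used for lossless coding in JPEG 2000) consists of one integer predict step and one integer update step. The sketch below is a plain 1-D illustration with symmetric boundary handling, not the paper's MATLAB/GPU implementation.

```python
import numpy as np

def legall53_forward_1d(signal):
    """One level of the reversible Le Gall 5/3 lifting transform (JPEG 2000
    convention). Returns integer (approximation, detail) coefficients."""
    x = np.asarray(signal, dtype=np.int64).copy()
    n = x.size

    def mirror(i):               # whole-sample symmetric boundary extension
        if i < 0:
            return -i
        if i >= n:
            return 2 * (n - 1) - i
        return i

    # predict: odd samples become detail coefficients
    for i in range(1, n, 2):
        x[i] -= (x[mirror(i - 1)] + x[mirror(i + 1)]) // 2
    # update: even samples become approximation coefficients
    for i in range(0, n, 2):
        x[i] += (x[mirror(i - 1)] + x[mirror(i + 1)] + 2) // 4
    return x[0::2], x[1::2]
```

Applying the routine to every row and then to every column of an image yields one 2-D decomposition level; on a GPU these row and column passes are the natural units for parallelization.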