Pramod Kumar Meher - Academia.edu

Papers by Pramod Kumar Meher

2017 IEEE International Symposium on Nanoelectronic and Information Systems (iNIS), 2017

In this work, we present a low-power multi-channel finite impulse response (FIR) filter using a look-up table (LUT) approach. We reduce the LUT size by a factor of 8 relative to conventional LUTs using Booth recoding, odd-multiple storage, and asymmetric product techniques. Compared with a recently proposed multi-channel filter design, the proposed design has a 22%, 30%, and 33% shorter cycle period and consumes 14%, 32%, and 35% less energy per sample for tap sizes of 4, 8, and 16, respectively. The design should therefore be useful for implementing multi-channel filters in applications such as software-defined radio systems.

2015 IEEE International Symposium on Circuits and Systems (ISCAS), 2015

Multiple constant multiplication (MCM) is widely used in digital signal processing applications. Recently, pipelining techniques have been applied to accelerate the computation of MCM blocks. Existing pipelining techniques consider adder-stage pipelining, i.e., inserting registers between two adjacent adder stages, to reduce the adder depth. However, the critical path may still be long even when the adder depth is minimized. In this paper, adder-stage pipelining is analyzed at the bit level, and a novel method is proposed for pipelining the adders in the MCM block. Experimental results show that the proposed method reduces the critical path by nearly 32% on average over traditional adder-stage pipelining for several benchmark MCM blocks, while area and power consumption are maintained.
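As a rough illustration of what an MCM block computes, the sketch below multiplies one input by several constants using only shifts, additions, and subtractions, and shares the intermediate product 7x between the outputs. The constant set {7, 21, 105} and the function name `mcm_block` are assumptions chosen for this example and are not taken from the paper; the comments mark the two adder stages between which adder-stage pipelining would place registers.

```python
def mcm_block(x):
    """Return (7x, 21x, 105x) using shifts and adds only.

    Hypothetical constant set, chosen so that 7x can be shared.
    The block uses three adders/subtractors and has adder depth 2.
    """
    # Adder stage 1: 7x = 8x - x
    x7 = (x << 3) - x
    # Adder stage 2: both outputs reuse the shared term 7x
    x21 = (x7 << 1) + x7    # 21x = 2*(7x) + 7x
    x105 = (x7 << 4) - x7   # 105x = 16*(7x) - 7x
    return x7, x21, x105


if __name__ == "__main__":
    print(mcm_block(3))
```

In a software model the stage boundary is only a comment; in hardware it is where a pipeline register would sit, and the abstract's point is that the bit-level carry chains inside each stage can still leave a long critical path.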

Low-Latency, Low-Area, and Scalable Systolic-Like Modular Multipliers for GF(2^m) Based on Irreducible All-One Polynomials

IEEE Transactions on Circuits and Systems I: Regular Papers, 2017

In this paper, an efficient recursive formulation is suggested for systolic implementation of canonical basis finite field multiplication over GF(2^m) based on an irreducible all-one polynomial (AOP). We derive a recursive algorithm for the multiplication and use it to design a regular and localized bit-level dependence graph (DG) for systolic computation. The bit-level regular DG is converted into a fine-grained DG by node-splitting, which is then mapped to a parallel systolic architecture. Unlike most existing structures, it does not involve any global communication for modular reduction. The proposed bit-parallel systolic structure has the same cycle time as the best existing bit-parallel systolic structure [1], but involves significantly fewer registers. The proposed bit-parallel design has a scalable latency of l + log_2 s + 1 cycles, which is considerably lower than that of existing systolic designs. Moreover, the proposed time-multiplexed structure is designed specifically for scalability of throughput and hardware complexity, to meet the area-time trade-off in resource-constrained applications while maintaining or reducing the overall latency. The ASIC synthesis report shows that the proposed bit-parallel structure offers nearly 30% saving in area and nearly 38% saving in power consumption over the best of the existing AOP-based systolic finite field multipliers.
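For readers unfamiliar with AOP-based fields, the following is a minimal software reference model of multiplication in GF(2^m) modulo the all-one polynomial x^m + ... + x + 1. It is not the recursive systolic formulation of the paper. It relies only on the standard fact that the AOP divides x^(m+1) + 1, so a product can first be folded cyclically modulo x^(m+1) + 1 and then corrected once. Polynomials are stored as Python integers (bit i is the coefficient of x^i); the function names are assumptions made for this sketch.

```python
def clmul(a, b):
    """Carry-less (GF(2)) polynomial product of two bit-vector integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def aop_mul(a, b, m):
    """Multiply a and b in GF(2^m) defined by the AOP of degree m."""
    n = m + 1
    mask = (1 << n) - 1          # n ones: also the bit pattern of the AOP
    prod = clmul(a, b)
    # Step 1: x^(m+1) = 1 modulo x^(m+1) + 1, so fold n-bit chunks together.
    folded = 0
    while prod:
        folded ^= prod & mask
        prod >>= n
    # Step 2: x^m = x^(m-1) + ... + x + 1, so clear bit m by adding the AOP.
    if (folded >> m) & 1:
        folded ^= mask
    return folded


def poly_mod_mul(a, b, poly, m):
    """Reference check: schoolbook product followed by long division."""
    prod = clmul(a, b)
    for i in range(prod.bit_length() - 1, m - 1, -1):
        if (prod >> i) & 1:
            prod ^= poly << (i - m)
    return prod


if __name__ == "__main__":
    # m = 4: x^4 + x^3 + x^2 + x + 1 is irreducible over GF(2).
    print(bin(aop_mul(0b0010, 0b1000, 4)))  # x * x^3 = x^4
```

The cyclic fold in step 1 is what makes AOP reduction cheap: it needs no data-dependent long division, only a fixed XOR pattern.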

2019 IEEE International Symposium on Circuits and Systems (ISCAS)

Low-complexity systolic multipliers for GF(2^m) are required in several high-performance cryptographic systems. In this paper, we propose a novel design strategy to derive an efficient systolic multiplier for GF(2^m) based on the Toeplitz matrix-vector product (TMVP) approach. The proposed work is carried out through two coherent interdependent stages. (i) A novel multiplication algorithm based on the TMVP method is first proposed to obtain subquadratic space complexity. (ii) The proposed algorithm is then mapped onto a novel and efficient architecture, which is optimized further to derive a low-complexity systolic structure. The complexity analysis and comparison show that the proposed design outperforms the existing work. The proposed design can thus be used in many practical cryptosystems.
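The subquadratic complexity mentioned above comes from splitting a Toeplitz matrix-vector product into three half-size products instead of four. The sketch below shows the generic two-way TMVP split over GF(2), not the specific algorithm of the paper. It assumes the vector length n is a power of two and that the n x n Toeplitz matrix is given by a list `t` of 2n - 1 bits with T[i][j] = t[i - j + n - 1]; this indexing convention and the function names are assumptions of the example.

```python
def xor_vec(a, b):
    """Element-wise addition over GF(2)."""
    return [x ^ y for x, y in zip(a, b)]


def tmvp(t, v):
    """Toeplitz matrix-vector product over GF(2) by recursive two-way split.

    T = [[T1, T0], [T2, T1]], V = [V0, V1], and
    T*V = [P0 + P2, P1 + P2] with
    P0 = (T0 + T1) V1, P1 = (T1 + T2) V0, P2 = T1 (V0 + V1).
    """
    n = len(v)
    if n == 1:
        return [t[0] & v[0]]
    h = n // 2
    t0, t1, t2 = t[:2 * h - 1], t[h:3 * h - 1], t[2 * h:]
    v0, v1 = v[:h], v[h:]
    p0 = tmvp(xor_vec(t0, t1), v1)
    p1 = tmvp(xor_vec(t1, t2), v0)
    p2 = tmvp(t1, xor_vec(v0, v1))
    return xor_vec(p0, p2) + xor_vec(p1, p2)


def tmvp_direct(t, v):
    """Quadratic reference: row i is the XOR of t[i - j + n - 1] & v[j]."""
    n = len(v)
    out = []
    for i in range(n):
        bit = 0
        for j in range(n):
            bit ^= t[i - j + n - 1] & v[j]
        out.append(bit)
    return out


if __name__ == "__main__":
    t = [1, 0, 1, 1, 0, 0, 1]   # defines a 4 x 4 Toeplitz matrix
    v = [1, 1, 0, 1]
    print(tmvp(t, v), tmvp_direct(t, v))
```

Each level of recursion replaces four half-size products by three, which gives a bit-multiplication count of n^(log2 3) for power-of-two n.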

The optimization of shift-and-add network for constant multiplications is found to have great potential for reducing the area, delay, and power consumption of implementation of multiplications in s ...

2015 IEEE International Symposium on Circuits and Systems (ISCAS), 2015

The Toeplitz matrix-vector product (TMVP) approach is a special case of the Karatsuba algorithm used to design subquadratic multipliers in GF(2^m). In binary extension fields, the shifted polynomial basis (SPB) is a variable basis representation that has been widely studied. SPB multiplication using the coordinate transformation technique can be transformed into TMVP formulas; however, this approach has been applied only to fields constructed by all trinomials or a special class of pentanomials. For this reason, we present a new modified SPB multiplication for an arbitrary irreducible pentanomial, and the proposed multiplication scheme takes the form of a TMVP formula.

IEEE Transactions on Multi-Scale Computing Systems, 2018

Digit-serial systolic multipliers over GF(2^m) based on the National Institute of Standards and Technology (NIST) recommended trinomials play a critical role in the real-time operations of cryptosystems. Systolic multipliers over GF(2^m) involve a large number of registers of size O(m^2), which results in a significant increase in area complexity. In this paper, we propose a novel low register-complexity digit-serial trinomial-based finite field multiplier. The proposed architecture is derived through two novel coherent interdependent stages: (i) derivation of an efficient hardware-oriented algorithm based on a novel input-operand feeding scheme and (ii) appropriate design of a novel low register-complexity systolic structure based on the proposed algorithm. The extension of the proposed design to a Karatsuba algorithm (KA)-based structure is also presented. The proposed design is synthesized for FPGA implementation, and it is shown that the design based on the regular multiplication process could achieve more than 12.1 percent saving in area-delay product and nearly 2.8 percent saving in power-delay product. To the best of the authors' knowledge, the register complexity of the proposed structure is so far the least among the competing designs for trinomial-based systolic multipliers (for the same type of multiplication algorithm).

JSTS:Journal of Semiconductor Technology and Science, 2016

Reconfigurable finite impulse response (FIR) filters whose filter coefficients and filter order change dynamically at run-time play an important role in software-defined radio (SDR) systems, multi-channel filters, and digital up/down converters. However, there are not many reports on reconfigurable designs that can support dynamic variation of both the filter order and the filter coefficients. The purpose of this paper is to provide an architectural solution for FIR filters that supports run-time variation of the filter order and filter coefficients. First, two straightforward designs, namely, (i) a single-MAC-based design and (ii) a full-parallel design, are presented. For large variation of the filter order, two designs based on (iii) a folded structure and (iv) the fast FIR algorithm are presented. Finally, we propose (v) a high-throughput design which provides a significant advantage in terms of hardware and/or time complexity over the other designs. We compare the complexities of all five structures and provide synthesis results for verification.

In this paper, we present the design optimization of one- and two-dimensional fully-pipelined computing structures for area-delay-power-efficient implementation of finite impulse response (FIR) filters by systolic decomposition of distributed arithmetic (DA)-based inner-product computation. The systolic decomposition scheme is found to offer a flexible choice of the address length of the look-up tables (LUTs) for DA-based computation, to decide on a suitable area-time trade-off. It is observed that using smaller address lengths for DA-based computing units reduces the memory size, but on the other hand increases the adder complexity and the latency. For efficient DA-based realization of FIR filters of different orders, the flexible linear systolic design is implemented on a Xilinx Virtex-E XCV2000E FPGA using a hybrid combination of Handel-C and parameterizable VHDL cores. Various key performance metrics such as number of slices, maximum usable f...
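The address-length trade-off described above can be made concrete with a small software model of DA-based inner-product computation. This is a generic bit-serial DA sketch, not the systolic structure of the paper. It assumes integer coefficients and unsigned `width`-bit samples (signed samples would need the usual sign-bit correction, omitted here); `build_lut`, `da_inner_product`, and the parameter `addr_len` are names invented for the example. Splitting N taps into groups of `addr_len` taps uses (N / addr_len) LUTs of 2^addr_len words each, at the cost of one extra addition per additional group in every bit cycle.

```python
def build_lut(coeffs):
    """LUT[addr] = sum of coeffs[i] over every bit i that is set in addr."""
    lut = [0] * (1 << len(coeffs))
    for addr in range(1, len(lut)):
        low = addr & -addr                  # lowest set bit of addr
        i = low.bit_length() - 1
        lut[addr] = lut[addr ^ low] + coeffs[i]
    return lut


def da_inner_product(coeffs, samples, width, addr_len):
    """Bit-serial DA inner product of coeffs and unsigned width-bit samples."""
    # One LUT per group of addr_len taps (the last group may be shorter).
    groups = [(build_lut(coeffs[g:g + addr_len]), samples[g:g + addr_len])
              for g in range(0, len(coeffs), addr_len)]
    acc = 0
    for bit in range(width):
        partial = 0
        for lut, xs in groups:
            # The LUT address is the current bit of each sample in the group.
            addr = 0
            for i, x in enumerate(xs):
                addr |= ((x >> bit) & 1) << i
            partial += lut[addr]            # adder tree across the groups
        acc += partial << bit               # shift-accumulate
    return acc


if __name__ == "__main__":
    h = [3, -1, 4, 1, -5, 9, 2, -6]
    x = [17, 200, 3, 255, 0, 128, 64, 99]
    print(da_inner_product(h, x, width=8, addr_len=4))
```

With 8 taps, `addr_len=8` needs one 256-word LUT and no group additions, while `addr_len=2` needs four 4-word LUTs and three additions per bit cycle, which mirrors the memory-versus-adder trade-off stated in the abstract.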

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2020

We present a new image integration technique for a flash and long-exposure image pair to capture a dark scene without incurring blurring or noisy artifacts. Most existing methods require well-aligned images for the integration, which is often a burdensome restriction in practical use. We address this issue by locally transferring the colors of the flash images using a small fraction of the corresponding pixels in the long-exposure images. We formulate the image integration as a convex optimization problem with the local linear model. The proposed method makes it possible to integrate the color of the long-exposure image with the detail of the flash image without harming its contrast, and, by virtue of our new integration principle, it does not need perfect alignment between the images. We show that our method outperforms the state of the art in image integration and reference-based color transfer for challenging misaligned data sets.

Journal of Biomedical Science and Engineering, 2011

DNA gel electrophoresis is an important experimental technique in biology, and DNA sequencing can be carried out with it. Traditionally, examining gel images by eye is time-consuming for biologists and prone to human error. Automatic analysis of gel images could therefore provide information that is usually overlooked by human experts. However, basic tasks such as identifying the lanes in a gel image, which human experts do easily, can be difficult to perform automatically. In this paper, we design an automatic procedure to analyze DNA gel images using various image processing algorithms. First, we employ an enhanced fuzzy c-means algorithm to extract the useful information from DNA gel images and exclude the undesired background. Then, a Gaussian function is used to estimate automatically the location of each of the A, T, C, and G lanes on the gel images. Finally, the location of each band on the gel image is detected accurately by tracing lanes, restoring lost bands, and eliminating repeated bands.

IEE Proceedings - Circuits, Devices and Systems, 1996

Two different linear systolic arrays have been suggested for the computation of the discrete cosine transform (DCT). The proposed linear arrays are complementary to each other in the sense that the output of the linear arrays of one type may be fed as the input for the linear arrays of the other type. This feature of the proposed linear arrays has been utilised for designing a bilayer structure for computing the prime-factor DCT. It is interesting to note that the proposed structure does not require any hardware/time for transposition of the intermediate results. The desired transposition is achieved by orthogonal alignment of the linear arrays of the upper layer with respect to those of the lower layer. The proposed structures provide high throughput of computation due to fully pipelined processing and the massive parallelism employed in the bilayer architecture.

IEEE Transactions on Biomedical Circuits and Systems, 2017

Integration, the VLSI Journal, 2016

A pathway from one vertex of a quiver to another is a reduced path. We modify the classical definition of quiver representations and we prove that semi-invariant polynomials for filtered quiver representations come from diagonal entries if and only if the quiver has at most two pathways between any two vertices. Such a class of quivers includes finite ADE-Dynkin quivers, affine ADE-Dynkin quivers, and star-shaped and comet-shaped quivers. Next, we explicitly write all semi-invariant generators for filtered quiver representations for framed quivers with at most two pathways between any two vertices; this result may be used to study constructions analogous to Nakajima's affine quotient and quiver varieties, which are, in special cases, $M_0^{F^\bullet}(n,1) := \mu_B^{-1}(0)/\!/B$ and $M^{F^\bullet}(n,1) := \mu_B^{-1}(0)^s/B$, respectively, where $\mu_B : T^*(\mathfrak{b} \times \mathbb{C}^n) \to \mathfrak{b}^* \cong \mathfrak{gl}_n^*/\mathfrak{u}$, $B$ is the set of invertible upper triangular $n \times n$ complex matrices, $\mathfrak{b} = \mathrm{Lie}(B)$, and $\mathfrak{u} \subseteq \mathfrak{b}$ is the biggest unipotent subalgebra.

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2015

International Journal of Computers and Applications, 2013


2019 IEEE International Symposium on Circuits and Systems (ISCAS), 2019

Distributed arithmetic (DA)-based architectures are popularly used for inner-product computation in various applications. Existing literature shows that the use of approximate DA architectures in error-resilient applications provides a significant improvement in the overall efficiency of the system. Based on precise error analysis, we find that the existing methods introduce a large truncation error in the computation of the final inner product. Therefore, to achieve a suitable trade-off between the overall hardware complexity and the truncation error, a weight-dependent truncation approach is proposed in this paper. The overall efficiency of the structure is further enhanced by incorporating an input truncation strategy in the proposed method. It is observed that the area, time, and energy efficiency of the proposed designs are superior to those of the existing designs, with significantly lower truncation error. An evaluation for a noisy image smoothing application is also presented.

VLSI Architectures for Future Video Coding, 2019

In spite of the recent advances in telecommunication standards, communication networks still have limited bandwidth and storage capacity. Video compression has therefore grown in importance as high-resolution video content has become more widely used in various fields. These requirements raise the need for high-performance video-compression technologies able to reduce the amount of data to be transmitted or stored by compressing the input video signal into a bitstream file. Improving the coding efficiency has always been one of the crucial issues for the various compression standards, which aim to obtain the most compact representation of the reconstructed video with a high subjective quality. High-efficiency video coding (HEVC) responds to these requirements. However, the increased consumption of high-quality multimedia content has pushed international communication companies to put much effort into further enhancing video-coding techniques. In this perspective, an upcoming video-coding standard to be known as versatile video coding (VVC) has emerged, aiming to improve the coding efficiency of the current HEVC codec. The improvements in rate-distortion (RD) performance that came with both HEVC and VVC have brought increased complexity in the majority of the coding modules, which makes them difficult to implement on hardware systems with real-time encoding. This chapter focuses on the transform coding stage as one of the most computationally demanding modules. For HEVC, efficient approximation algorithms and a reconfigurable and scalable architecture have been developed in order to decrease the computational complexity of the transform module. The main objectives are to meet low-power and real-time processing constraints while maintaining the compression gain and a satisfactory video quality. Field programmable gate array (FPGA) implementation results and comparisons with existing works confirm the efficiency of the proposed approximations, since they contribute to reducing time and power consumption, optimizing the hardware resources, and improving the peak signal-to-noise ratio (PSNR) as well. Similarly, for the adaptive multiple transform (AMT) introduced in the transform module of VVC, approximations were made for the discrete cosine transform (DCT)-II and discrete sine transform (DST)-VII, since they are statistically the most used among the five predefined types. Bitrate (BR) reduction with a slight degradation of video quality and reduced use of hardware resources are the main contributions of the proposed approximations.
