Papers by Kamakoti Veezhinathan
IEEE Transactions on Computers, 2020
Bloom filter (BF), when used by an online application, experiences monotonically increasing false-positive errors. The decay of stale elements can control false-positives. Existing mechanisms for decay require unreasonable storage and computation. Inexpensive methods reset the BF periodically, resulting in inconsistent guarantees and performance issues in the underlying computing system. In this article, we propose Fading Bloom filter (FadingBF), which can provide inexpensive yet safe decay of elements. FadingBF requires neither additional storage nor computation to achieve this but instead exploits the underlying storage medium’s intrinsic properties, i.e., DRAM capacitor characteristics. We realize FadingBF by implementing the BF on a DRAM memory module with its periodic refresh disabled. Consequently, the capacitors holding the data elements that are not accessed frequently will predictably lose charge and naturally decay. The retention time of capacitors guarantees against premature deletion. However, some capacitors may store information longer than required due to the FadingBF’s software and hardware variables. Using an analytical model of the FadingBF, we show that carefully tuning its parameters can minimize such cases. For a surveillance application, we demonstrate that FadingBF achieves better guarantees through graceful decay, consumes 57 percent less energy, and imposes a lower system load than the standard BF.
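A minimal software sketch of the decay idea, not the paper's DRAM-based FadingBF: each set bit carries an expiry time that emulates a DRAM cell losing charge after its retention window. The hash construction and the `retention_s` parameter are assumptions made for the illustration.

```python
# Illustrative decaying Bloom filter model (software stand-in for FadingBF's
# hardware decay); hash derivation and retention_s are assumptions.
import hashlib
import time

class DecayingBloomFilter:
    def __init__(self, m_bits=1024, k_hashes=4, retention_s=60.0):
        self.m = m_bits
        self.k = k_hashes
        self.retention = retention_s
        self.expiry = [0.0] * m_bits      # per-bit "charge" expiry timestamp

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        # Derive k bit positions from slices of one digest (an assumption).
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, item):
        now = time.time()
        for p in self._positions(item):
            self.expiry[p] = now + self.retention   # "recharge" the cell

    def might_contain(self, item):
        now = time.time()
        # A bit counts as set only while its emulated charge has not decayed.
        return all(self.expiry[p] > now for p in self._positions(item))

bf = DecayingBloomFilter()
bf.add("flow-1234")
print(bf.might_contain("flow-1234"))   # True until retention_s elapses
print(bf.might_contain("flow-9999"))   # almost certainly False
```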
Proceeding of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays - FPGA '04, 2004
Spring 2007 co-op at AMD. Worked under Pat Conway in the Northbridge performance team. Developed and implemented stochastic processor models for current and upcoming AMD processors.
Journal of Low Power Electronics, 2006
Power consumption and delay are two of the most important constraints in current-day on-chip bus design. The two major sources of dynamic power dissipation on a bus are the self capacitance and the coupling capacitance. As technology scales, the interconnect resistance increases due to shrinking wire width. At the same time, spacing between the interconnects decreases, resulting in an increase in the coupling capacitance. This, in turn, leads to stronger crosstalk effects between the interconnects. In Deep Sub-Micron technology the coupling capacitance exceeds the self capacitance, which causes more power consumption and delay on the bus. Recently, interest has also shifted to minimizing peak power dissipation, because higher peak power leads to an undesired increase in switching noise, metal electromigration problems and operation-induced variations due to non-uniform temperature on the die. Thus, minimizing power consumption and delay are the most important design objectives for on-chip buses. Several bus encoding schemes have been proposed in the literature for reducing crosstalk. Most of these encoding techniques use spatial redundancy that requires additional transmission wires on the bus. In this paper, a new temporal encoding scheme is proposed, which uses self-shielding memory-less codes to completely eliminate worst-case crosstalk effects and hence significantly minimizes power consumption and delay of the bus. A major advantage of the proposed temporal redundancy based encoding scheme is the reduction in the number of wires of the on-chip bus. This reduction facilitates extra spacing between the bus wires, when compared with the normal bus, for a given area. This, in turn, leads to reduced crosstalk effects between the wires. The proposed encoding scheme is tested with the SPEC2000 CINT benchmarks. The experimental results, when compared to transmission over a normal bus, show that on average the proposed technique leads to a reduction in the peak-power consumption by 51% (28%), 51% (29%) and 52% (30%) on the data (address) bus for 90nm, 65nm and 45nm technologies, respectively. For a bus length of 10mm the proposed technique also achieves 17%, 31% and 37% reduction in the bus delay for 90nm, 65nm and 45nm technologies, respectively, when compared to what is incurred by data transmission on a normal bus.
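To make the crosstalk cost concrete, here is a toy accounting of bus activity between two consecutive words, counting self-transitions and the worst-case coupling events (adjacent wires switching in opposite directions). This is an illustration of the cost model, not the paper's encoder; the bus width and the pattern used are arbitrary choices.

```python
# Illustrative crosstalk accounting for an on-chip bus (not the paper's scheme).
def bus_activity(prev_word, next_word, width=32):
    prev = [(prev_word >> i) & 1 for i in range(width)]
    nxt = [(next_word >> i) & 1 for i in range(width)]
    self_toggles = sum(p != n for p, n in zip(prev, nxt))
    opposing = 0
    for i in range(width - 1):
        d_i, d_j = nxt[i] - prev[i], nxt[i + 1] - prev[i + 1]
        if d_i * d_j == -1:      # adjacent wires toggling in opposite directions
            opposing += 1
    return self_toggles, opposing

print(bus_activity(0b0101, 0b1010, width=4))   # (4, 3): worst-case coupling pattern
```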
Lecture Notes in Computer Science, 1995
Email: rangan@iitm.ernet.in. Abstract. We present a new data structure, the Leafary tree, for designing an efficient randomized algorithm for the Closest Pair Problem. Using this data structure, we show that the Closest Pair of n points in D-dimensional space, where D ≥ 2 is a fixed constant, can be found in O(n log n / log log n) expected time. The algorithm does not employ hashing. Keywords: Closest pair, Computational Geometry, Randomized Algorithms. 1 Introduction. The Closest Pair Problem (CPP) is to find a closest pair in a given set of n points. It is well known that this problem requires Ω(n log n) time in the algebraic computation tree model [6] and optimal algorithms already exist. However, if the model of computation is changed, then Ω(n log n) is no longer a lower bound. We summarize the major results and the corresponding models.
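For reference, a naive O(n²) baseline that only makes the problem statement concrete; the paper's Leafary-tree algorithm, which achieves O(n log n / log log n) expected time without hashing, is not reproduced here.

```python
# Brute-force closest pair: problem definition only, not the paper's algorithm.
from itertools import combinations
from math import dist

def closest_pair_bruteforce(points):
    return min(combinations(points, 2), key=lambda pq: dist(*pq))

pts = [(0.0, 0.0), (3.0, 4.0), (0.5, 0.4), (7.0, 1.0)]
print(closest_pair_bruteforce(pts))   # ((0.0, 0.0), (0.5, 0.4))
```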
Journal of Electrical and Computer Engineering, 2012
Buffers in on-chip networks constitute a significant proportion of the power consumption and area of the interconnect, and hence reducing them is an important problem. Application-specific designs have nonuniform network utilization, thereby requiring a buffer-sizing approach that tackles the nonuniformity. Also, congestion effects that occur during network operation need to be captured when sizing the buffers. Many NoCs are designed to operate in multiple voltage/frequency islands, with inter-island communication taking place through frequency converters. To this end, we propose a two-phase algorithm to size the switch buffers in networks-on-chip (NoCs) considering support for multiple-frequency islands. Our algorithm considers both the static and dynamic effects when sizing buffers. We analyze the impact of placing frequency converters (FCs) on a link, as well as pack and send units that effectively utilize network bandwidth. Experiments on many realistic system-on-chip (SoC) benchmarks…
Proceedings of the 16th ACM Great Lakes symposium on VLSI, 2006
Proceedings 1997 International Conference on Parallel and Distributed Systems
The paper presents efficient, scalable algorithms for performing Prefix (PC) and General Prefix (GPC) Computations on a Distributed Shared Memory (DSM) system, with applications.
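A minimal block-wise prefix-sum sketch in the spirit of distributed prefix computation: each "processor" scans its local block, then block offsets are combined in a second pass. This is an illustration of the general pattern, not the paper's DSM algorithm; the block count is an assumption.

```python
# Two-phase prefix sums: local scans per block, then offset combination.
def prefix_sums(values, n_blocks=4):
    size = (len(values) + n_blocks - 1) // n_blocks
    blocks = [values[i:i + size] for i in range(0, len(values), size)]
    local_scans, block_totals = [], []
    for block in blocks:                       # phase 1: independent local scans
        acc, scan = 0, []
        for v in block:
            acc += v
            scan.append(acc)
        local_scans.append(scan)
        block_totals.append(acc)
    result, offset = [], 0
    for scan, total in zip(local_scans, block_totals):   # phase 2: apply offsets
        result.extend(offset + s for s in scan)
        offset += total
    return result

print(prefix_sums([1, 2, 3, 4, 5, 6, 7, 8]))   # [1, 3, 6, 10, 15, 21, 28, 36]
```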
IEEE Computer Society Annual Symposium on VLSI, 2003. Proceedings.
Designing chips for low-power applications is one of the most important challenges faced by VLSI designers. Since the power consumed by the I/O pins of a CPU is a significant source of power consumption, work has been done on developing encoding schemes for reducing switching activity on external buses. In this paper, we propose a new coding technique, namely, the Dynamic Coding Scheme, for low-power data buses. Our method considers two logical groupings of the bus lines, each being a permutation of the bus lines, and dynamically selects the grouping which yields the minimum number of transitions.
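A hedged toy model of the selection step: given the previously transmitted line values and two candidate permutations of the data bits onto physical lines, pick the permutation that produces fewer line toggles. The actual encoder (including how the choice is signalled) may differ from this sketch.

```python
# Toy grouping selection in the spirit of the Dynamic Coding Scheme above.
def encode(data_bits, prev_lines, perm_a, perm_b):
    """Return (chosen_perm_id, new_line_values) minimizing line toggles."""
    candidates = []
    for perm_id, perm in enumerate((perm_a, perm_b)):
        lines = [data_bits[perm[i]] for i in range(len(perm))]
        toggles = sum(l != p for l, p in zip(lines, prev_lines))
        candidates.append((toggles, perm_id, lines))
    toggles, perm_id, lines = min(candidates)
    return perm_id, lines

prev = [0, 0, 1, 1]
data = [1, 1, 0, 0]
identity = [0, 1, 2, 3]
swapped = [2, 3, 0, 1]          # pairs the bus halves the other way around
print(encode(data, prev, identity, swapped))   # (1, [0, 0, 1, 1]): zero toggles
```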
Design, Automation and Test in Europe
With the growing computational needs of many real-world applications, frequently changing specifications of standards, and the high design and NRE costs of ASICs, an algorithm-agile FPGA-based co-processor has become a viable alternative. In this article, we report on the general design of an algorithm-agile co-processor and a proof-of-concept implementation.
This paper proposes a novel hybrid pattern based branch predictor (H-Pattern) which uses a dynamic learning approach to find patterns in the execution of conditional branches. H-Pattern comprises two branch predictors: our proposed nBPAT (N-bit pattern) predictor and an alternate predictor (henceforth referred to as AltPred) that can be any other predictor such as GShare, TAGE or ISL-TAGE. The local nBPAT predictor aims to capture patterns in branch behavior. If the pattern predictor is in its learning phase, the AltPred predictor is used. A performance-based selection is carried out between the nBPAT predictor and AltPred when both are available. On implementing H-Pattern with GShare on the CBP simulator with all 40 traces, we achieved 3.8, 4.7 and 6.4 mispredictions per kilo instructions (MPKI) for the unlimited, 32KB and 4KB storage budgets, respectively. On implementing H-Pattern with TAGE, we achieved 2.134, 2.644 and 3.712 MPKI, and with ISL-TAGE we achieved 2.058, 2.542 and 3.691 MPKI, for the unlimited, 32KB and 4KB storage budgets, respectively.
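The hybrid idea can be sketched as a per-branch pattern detector that defers to a simple fallback predictor while it is still learning. This toy model is loosely inspired by the description above; the history length, the pattern-detection rule, and the 2-bit-counter fallback are assumptions, not the nBPAT/AltPred design.

```python
# Toy hybrid pattern/fallback branch predictor (illustrative only).
from collections import defaultdict

HIST_LEN = 8

class ToyHybridPredictor:
    def __init__(self):
        self.history = defaultdict(list)          # per-branch outcome history
        self.counter = defaultdict(lambda: 2)     # 2-bit fallback counter (0..3)

    def _pattern_predict(self, pc):
        h = self.history[pc]
        if len(h) < HIST_LEN:
            return None                           # still learning: defer to fallback
        for period in range(1, HIST_LEN // 2 + 1):
            if all(h[-i] == h[-i - period] for i in range(1, HIST_LEN - period + 1)):
                return h[-period]                 # continue the detected pattern
        return None

    def predict(self, pc):
        p = self._pattern_predict(pc)
        return p if p is not None else (self.counter[pc] >= 2)

    def update(self, pc, taken):
        self.history[pc] = (self.history[pc] + [taken])[-HIST_LEN:]
        c = self.counter[pc]
        self.counter[pc] = min(3, c + 1) if taken else max(0, c - 1)

pred = ToyHybridPredictor()
for outcome in [True, True, False] * 5:           # taken, taken, not-taken, ...
    pred.predict(0x400)
    pred.update(0x400, outcome)
print(pred.predict(0x400))                        # follows the T-T-N pattern: True
```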
Among the various network protocols that can be used to stream video data, RTP over UDP is best suited to real-time streaming of H.264-based video streams. Videos transmitted over a communication channel are highly prone to errors, which can become critical when UDP is used. In such cases real-time error concealment becomes an important aspect. A subclass of error concealment is motion vector recovery, which is used to conceal errors at the decoder side. Lagrange interpolation is the fastest and a popular technique for motion vector recovery. This paper proposes a new system architecture which enables RTP/UDP-based real-time video streaming as well as Lagrange interpolation based real-time motion vector recovery in H.264 coded video streams. A completely open-source H.264 video codec, FFmpeg, is chosen to implement the proposed system. The proposed implementation was tested against different standard benchmark video sequences, and the quality of the recovered videos was measured at the decoder side using various quality measurement metrics. Experimental results show that the real-time motion vector recovery does not introduce any noticeable difference or latency during display of the recovered video.
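For illustration, Lagrange interpolation applied to one component of a lost motion vector, using the vectors of neighbouring blocks as sample points. The block layout, neighbour choice, and sample values are assumptions made for this sketch; the paper's recovery step operates inside the H.264 decoder.

```python
# Lagrange-interpolation recovery of a lost motion-vector component.
def lagrange_interpolate(samples, x_lost):
    """samples: list of (x, value) pairs; returns interpolated value at x_lost."""
    result = 0.0
    for i, (xi, yi) in enumerate(samples):
        term = yi
        for j, (xj, _) in enumerate(samples):
            if i != j:
                term *= (x_lost - xj) / (xi - xj)
        result += term
    return result

# Motion vectors (one component) of blocks above and below the lost block,
# indexed by row position; the lost block sits at row 2.
known = [(0, 3.0), (1, 4.0), (3, 8.0), (4, 11.0)]
print(round(lagrange_interpolate(known, 2), 3))   # ≈ 5.667 for this toy data
```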
2007 Design, Automation & Test in Europe Conference & Exhibition, 2007
With the increasing use of low-cost wire-bond packages for mobile devices, excessive dynamic IR-drop may cause tests to fail on the tester. Identifying and debugging such scan test failures is a very complex and effort-intensive process. A better solution is to generate correct-by-construction "power-safe" patterns. Moreover, with glitch power contributing a significant component of dynamic power, pattern generation needs to be timing-aware to minimize glitching. In this paper, we propose a timing-based, power- and layout-aware pattern generation technique that minimizes both global and localized switching activity. Techniques are also proposed for power-profiling and optimizing an initial pattern set to obtain a power-safe pattern set with the addition of minimal patterns. The proposed technique also comprehends irregular power grid topologies for constraints on localized switching activity. Experiments on ISCAS benchmark circuits reveal the effectiveness of the proposed scheme.
Proceedings Design, Automation and Test in Europe Conference and Exhibition
Modern-day Field Programmable Gate Arrays (FPGAs) include, in addition to Look-Up Tables, reasonably large configurable Embedded Memory Blocks (EMBs) to cater to the on-chip memory requirements of systems/applications mapped onto them. While mapping applications onto such FPGAs, some of the EMBs may be left unused. This paper presents a methodology to utilize such unused EMBs as large look-up tables to map multi-output combinational sub-circuits of the application, which would otherwise be mapped onto a number of small Look-Up Tables (LUTs) available on the FPGA. This, in turn, leads to a large reduction in the FPGA area utilized for mapping an application. Experimental results show that our proposed methodology, when employed on popular benchmark circuits, can lead to an additional 50% reduction in the area utilized when compared with other methodologies reported in the literature.
18th International Conference on VLSI Design held jointly with 4th International Conference on Embedded Systems Design
The primary advantage of a 3D-FPGA over a 2D-FPGA is that the vertical stacking of active layers reduces the Manhattan distance between components compared to a 2D placement. This results in a considerable reduction in total interconnect length. Reduced wire length eventually leads to a reduction in delay and hence improved performance and speed. The design of an efficient placement and routing algorithm for 3D-FPGAs that fully exploits the above-mentioned advantage is a problem of deep research and commercial interest. In this paper, an efficient placement and routing algorithm is proposed for 3D-FPGAs which yields better results in terms of total interconnect length and channel width. The proposed algorithm employs two important techniques, namely, Reinforcement Learning (RL) and Support Vector Machines (SVMs), to perform the placement. The proposed algorithm is implemented and tested on standard benchmark circuits and the results obtained are encouraging. This is one of the very few instances where reinforcement learning is used for solving a problem in the area of VLSI.
18th International Parallel and Distributed Processing Symposium, 2004. Proceedings., 2004
Modern-day Field Programmable Gate Arrays (FPGAs) include, in addition to Look-Up Tables, reasonably large configurable Embedded Memory Blocks (EMBs) to cater to the on-chip memory requirements of systems/applications mapped onto them. While mapping applications onto such FPGAs, some of the EMBs may be left unused. This paper presents a methodology to utilize such unused EMBs as large look-up tables to map multi-output combinational sub-circuits of the application, with depth minimization as the main objective along with area minimization in terms of the number of LUTs used. Depth minimization is an important goal while mapping performance-driven circuits. Experimental results show that our proposed methodology, when employed on popular benchmark circuits, leads to up to 14% reduction in depth compared with DAG-Map, along with a comparable reduction in area.
2006 49th IEEE International Midwest Symposium on Circuits and Systems, 2006
Reversible logic has been gaining interest in the recent past due to its low heat-dissipation characteristics. It has been proved that any Boolean function can be implemented using reversible gates. In this paper we propose a set of basic sequential elements that could be used for building large reversible sequential circuits, leading to logic and garbage reduction by a factor of 2 to 6 when compared with existing reversible designs reported in the literature.
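As background, reversibility simply means the gate's truth table is a bijection. The quick check below demonstrates this for the well-known Fredkin (controlled-swap) gate; it is generic illustration, not one of the sequential elements proposed in the paper.

```python
# Sanity check that a candidate gate is reversible (its truth table is a bijection).
from itertools import product

def fredkin(c, a, b):
    return (c, b, a) if c else (c, a, b)      # swap a and b when the control is 1

outputs = {fredkin(*bits) for bits in product((0, 1), repeat=3)}
print(len(outputs) == 8)                      # True: 8 distinct outputs => reversible
```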
25th IEEE VLSI Test Symposium (VTS'07), 2007
The problem of peak power estimation in CMOS circuits is essential for analyzing the reliability and performance of circuits under extreme conditions. The dynamic power dissipated is directly proportional to the switching activity (the number of gate outputs that toggle, i.e., change state) in the circuit. The power virus problem involves finding input vectors that cause maximum dynamic power dissipation (maximum toggles) in circuits. Since the power virus problem is NP-complete, gate-level techniques scale poorly with increasing design size and produce less optimal vectors. In this paper, an approach for power virus generation using behavioral models of digital circuits is presented. The proposed technique automatically converts the given behavioral model to an integer (word-level) constraint model and employs an integer constraint solver to generate the required power virus vectors. Experiments with the proposed technique on ISCAS behavioral-level benchmark circuits and the standard DLX processor model show that it is fast and yields higher-quality results than the known gate-level techniques. Interestingly, the paper also attempts to generate an assembly program that causes the maximum dynamic power dissipation on the given DLX processor model. To the best of our knowledge, the proposed technique is the first reported that considers power virus generation using behavioral-level models.
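To illustrate the objective only, the brute-force search below finds a pair of successive input vectors that maximizes output toggles for a tiny word-level model (a 4-bit adder with carry-out, chosen arbitrarily). The paper instead formulates the search as integer constraints and uses a solver; nothing here reflects that formulation.

```python
# Brute-force toy of the power-virus objective: maximize output switching activity.
from itertools import product

def adder_outputs(a, b):
    s = a + b
    return [(s >> i) & 1 for i in range(5)]          # 4 sum bits + carry-out

def toggles(v1, v2):
    return sum(x != y for x, y in zip(adder_outputs(*v1), adder_outputs(*v2)))

pairs = product(product(range(16), repeat=2), repeat=2)
best = max(pairs, key=lambda p: toggles(*p))
print(best, toggles(*best))                          # maximum of 5 output toggles
```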
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2008
IEEE Transactions on Broadcasting, 2010
H.264 encoded video is highly sensitive to the loss of motion vectors during transmission. Several statistical techniques have been proposed for recovering such lost motion vectors. These use only the motion vectors belonging to the macroblocks horizontally or vertically adjacent to the lost macroblock to recover the latter. Intuitively, this is one of the main reasons why these techniques yield inferior solutions in scenarios involving non-linear motion. This paper proposes B-Spline based statistical techniques that comprehensively address the motion vector recovery problem in the presence of different types of motion, including slow, fast/sudden, continuous and non-linear movements. Testing the proposed algorithms with different benchmark video sequences shows an average improvement of up to 2 dB in the Peak Signal-to-Noise Ratio (PSNR) of some of the recovered videos over existing techniques. A 2 dB improvement in PSNR is very significant from an application point of view.
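For context on the quality metric, the standard PSNR definition for 8-bit frames is computed below; a 2 dB gain corresponds to roughly a 37% reduction in mean squared error (10^(2/10) ≈ 1.585). The sample pixel values are made up for the illustration.

```python
# Standard PSNR computation for 8-bit samples (quality metric referenced above).
import math

def psnr(original, recovered, peak=255.0):
    mse = sum((o - r) ** 2 for o, r in zip(original, recovered)) / len(original)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

frame_a = [120, 125, 130, 140]
frame_b = [121, 124, 133, 138]
print(round(psnr(frame_a, frame_b), 2))   # ≈ 42.39 dB for this toy data
```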
Applied Soft Computing, 2011
This work presents a hardware implementation of an FIR filter that is self-adaptive; that responds to arbitrary frequency-response landscapes; that has built-in coefficient error tolerance capabilities; and that has minimal adaptation latency. The hardware design is based on a heuristic genetic algorithm. Experimental results show that the proposed design is more efficient than non-evolutionary designs even for arbitrary-response filters. As a byproduct, the paper also presents a novel flow for the complete hardware design of what is termed an Evolutionary System on Chip (ESoC). With the inclusion of an evolutionary process, the ESoC is a new paradigm in modern System-on-Chip (SoC) designs. The ESoC methodology could be a very useful structured FPGA/ASIC implementation alternative in many practical applications of FIR filters.
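A minimal software sketch of the evolutionary adaptation idea: a mutation-only genetic algorithm evolving FIR coefficients toward a target magnitude response. The tap count, population size, fitness function and GA operators here are assumptions for the sketch; the paper's hardware ESoC flow differs.

```python
# Mutation-only GA evolving FIR coefficients toward an ideal low-pass response.
import numpy as np

rng = np.random.default_rng(0)
N_TAPS, POP, GENS = 9, 40, 200
freqs = np.linspace(0, np.pi, 64)
target = (freqs < np.pi / 2).astype(float)            # ideal low-pass magnitude

def magnitude(coeffs):
    # |H(e^{jw})| sampled on the frequency grid
    e = np.exp(-1j * np.outer(freqs, np.arange(N_TAPS)))
    return np.abs(e @ coeffs)

def fitness(coeffs):
    return -np.mean((magnitude(coeffs) - target) ** 2)  # higher is better

pop = rng.normal(0, 0.2, size=(POP, N_TAPS))
for _ in range(GENS):
    scores = np.array([fitness(c) for c in pop])
    parents = pop[np.argsort(scores)[-POP // 2:]]        # keep the better half
    children = parents[rng.integers(0, POP // 2, POP // 2)] \
        + rng.normal(0, 0.05, (POP // 2, N_TAPS))        # mutated offspring
    pop = np.vstack([parents, children])                 # elitism + offspring

best = pop[np.argmax([fitness(c) for c in pop])]
print(np.round(best, 3), fitness(best))
```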