Subhrajit Bhattacharya | IIT Kharagpur

Papers by Subhrajit Bhattacharya

Anomaly Detection Using Program Control Flow Graph Mining From Execution Logs

Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016

We focus on the problem of detecting anomalous run-time behavior of distributed applications from their execution logs. Specifically, we mine templates and template sequences from logs to form a control flow graph (CFG) spanning distributed components. This CFG represents the baseline healthy system state and is used to flag deviations from the expected behavior in runtime logs. The novelty of our work stems from the new techniques employed to: (1) overcome the instrumentation requirements or application-specific assumptions made in prior log mining approaches, (2) improve the accuracy of mined templates and the CFG in the presence of long parameters and a high amount of interleaving, respectively, and (3) improve by orders of magnitude the scalability of the CFG mining process in terms of the volume of log data that can be processed per day. We evaluate our template and CFG mining approaches using (a) synthetic log traces and (b) multiple real-world log datasets collected at different layers of the application stack. Results demonstrate that the template mining, CFG mining, and anomaly detection algorithms have high accuracy. The distributed implementation of our pipeline is highly scalable and can process more than 500 GB/day of log data even on a (Spark + Hadoop) cluster of 10 low-end VMs.
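The general idea of checking runtime logs against a mined graph of template transitions can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the regex-based `to_template` masking, the single non-interleaved log stream, and all function names are assumptions made for illustration, and the paper's own techniques for long parameters and interleaving are not reproduced here.

```python
import re
from collections import defaultdict

def to_template(line):
    """Crude template extraction (assumption): mask decimal and hex tokens."""
    return re.sub(r"\b(0x[0-9a-fA-F]+|\d+)\b", "<*>", line.strip())

def mine_cfg(healthy_lines):
    """Record which template may follow which, using a healthy trace."""
    edges = defaultdict(set)
    templates = [to_template(line) for line in healthy_lines]
    for src, dst in zip(templates, templates[1:]):
        edges[src].add(dst)
    return edges

def find_anomalies(edges, runtime_lines):
    """Flag templates never seen in the baseline and transitions absent from it."""
    known = set(edges) | {dst for dsts in edges.values() for dst in dsts}
    templates = [to_template(line) for line in runtime_lines]
    anomalies = []
    for i, tmpl in enumerate(templates):
        if tmpl not in known:
            anomalies.append((i, "unknown template", tmpl))
        elif i > 0 and templates[i - 1] in known \
                and tmpl not in edges.get(templates[i - 1], set()):
            anomalies.append((i, "unexpected transition", tmpl))
    return anomalies

# Hypothetical log lines, invented for this example.
healthy = ["open conn 12", "send 40 bytes", "close conn 12",
           "open conn 13", "send 8 bytes", "close conn 13"]
runtime = ["open conn 7", "close conn 7", "reboot now"]
report = find_anomalies(mine_cfg(healthy), runtime)
```

Here `report` contains one unexpected transition (a close directly after an open) and one unknown template. A real deployment would need per-component streams and a tolerance for benign reordering, which this sketch ignores.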

An RTL methodology to enable low overhead combinational testing

Proceedings European Design and Test Conference. ED & TC 97

Methods and arrangements for automatic synthesis of systems-on-chip

Power Gating Techniques Able to Have Data Retention And Variability Immunity Properties

Hardware synthesis and analysis of control-intensive designs from high level specifications

The goal of high level synthesis is to automate the synthesis of a VLSI circuit from its algorithmic description. Successful high level synthesis methodologies exist for arithmetic-intensive designs. This dissertation targets issues in the synthesis of control-intensive designs, which may contain nested conditional statements and nested loops. Major tasks in high level synthesis include allocation, scheduling, assignment, and register transfer level (RTL) circuit generation. The objective of performing these tasks efficiently is to produce circuits optimized for cost functions such as delay, area and testability. The time the synthesized circuit takes to execute the algorithmic specification is the product of the number of clock cycles required to execute the specification and the clock period of the circuit. We propose techniques which target efficient scheduling of operations in loops to minimize the number of clock cycles required to execute the specification (algorithm LDS). To produce circuits with a small clock period, traditional techniques use more resources or faster resources. We demonstrate that even when the resource allocation is fixed, the clock period of the circuit can be minimized using assignment techniques (algorithm ClkMin). An RTL circuit is generated following the scheduling and assignment phases. The RTL designs produced may not be optimized for area, and may not be completely testable at the gate level. We propose a general methodology to generate RTL designs optimized for area which uses the hierarchy of the RTL design and the interaction between control and data path (algorithm WONDER). The optimized RTL circuits are completely testable under full-scan at the gate level. In general, there may be multiple solutions to the scheduling, assignment and RTL generation tasks. Hence fast and accurate estimation tools are essential to evaluate the quality of different possible implementations. Good techniques exist for area estimation. In this dissertation, we develop efficient techniques for estimating the number of clock cycles required to execute a specification and the clock period of the implementation. We apply Markov chain techniques for computing the expected number of clock cycles required by a schedule to execute the complete algorithmic specification for various possible inputs (algorithm PERSIS). The clock period estimation technique uses high level information about scheduling and assignment to compute the true delay as opposed to the topological delay of the implementation (algorithm FEST). The algorithms developed in this dissertation have been integrated in the SECONDS high level synthesis system. SECONDS has successfully produced circuit descriptions from control-intensive algorithmic specifications. Experiments with SECONDS demonstrate that high level synthesis can create competitive designs while reducing design time significantly.
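To make the Markov chain idea concrete, the following Python sketch computes the expected number of clock cycles for a small schedule modeled as an absorbing Markov chain. It is a generic textbook calculation, not the PERSIS algorithm itself: the state names, branch probabilities, and the 5 ns clock period are hypothetical, and each state is assumed to take exactly one clock cycle.

```python
from fractions import Fraction

def expected_cycles(transitions, start, done):
    """Expected cycles to reach `done` from `start`.

    transitions maps each non-final state to {next_state: probability}.
    Solves E[s] = 1 + sum_t P(s, t) * E[t] with E[done] = 0.
    """
    states = [s for s in transitions if s != done]
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    # Augmented matrix for the linear system (I - Q) E = 1.
    A = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for s in states:
        i = idx[s]
        A[i][i] += 1
        A[i][n] = Fraction(1)
        for t, p in transitions[s].items():
            if t != done:
                A[i][idx[t]] -= Fraction(p)
    # Gauss-Jordan elimination with exact rational arithmetic.
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [x / pv for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return A[idx[start]][n]

# Hypothetical schedule: S0 always enters S1; S1 is a loop body that
# repeats with probability 9/10 and exits with probability 1/10.
schedule = {
    "S0": {"S1": Fraction(1)},
    "S1": {"S1": Fraction(9, 10), "DONE": Fraction(1, 10)},
}
cycles = expected_cycles(schedule, "S0", "DONE")
clock_period_ns = 5  # assumed value for illustration
execution_time_ns = cycles * clock_period_ns
```

For this toy schedule the loop state contributes 10 expected cycles and the entry state one more, so `cycles` is 11 and the expected execution time, as the product of cycle count and clock period, is 55 ns.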

Low cost testing method for register transfer level circuits

Modeling and simulating a powergated hierarchical element

System for Using Partitioned Masks to Build a Chip

Method for using partitioned masks to build a chip

Influence-based circuit design

Effects of Resource Sharing on Circuit Delay: An Assignment Algorithm for Clock Period Optimization

This paper analyzes the effect of resource sharing and assignment on the clock period of the synthesized circuit. The assignment phase assigns or binds operations of the scheduled behavioral description to a set of allocated resources. We focus on control-flow intensive descriptions, characterized by the presence of mutually exclusive paths due to the presence of nested conditional branches and loops. We show that clustering of multiple operations in the same state of the schedule, possibly leading to chaining of functional units (FUs) in the RTL circuit, is an effective way to minimize the total number of clock cycles and hence the total execution time. We present an assignment algorithm which is particularly effective for such design styles, by minimizing the data chaining and hence the clock period of the circuit, thereby leading to further reduction in the total execution time. Existing resource sharing and assignment approaches for reducing the clock period of the resulting circuit either increase the resource allocation or use faster modules, both leading to larger area requirements. In this paper we show that even when the type of available resource units and the number of resource units of each type is fixed, different assignments may lead to circuits with significant differences in clock period. We provide a comprehensive analysis of how resource sharing and assignment introduces long paths in the circuit. Based on the analysis, we develop an assignment algorithm which uses a high-level delay estimator to assign operations to a fixed set of available resources so as to minimize the clock period of the resultant circuit, with no or minimal effect on the area of the circuit. Experimental results on several conditional-intensive designs demonstrate the effectiveness of the assignment algorithm.

Bridging Behavioral and Register-Transfer Synthesis

The 24th Southeastern Symposium on System Theory and The 3rd Annual Symposium on Communications, Signal Processing, Expert Systems, and ASIC VLSI Design, 1992

This paper considers register-transfer synthesis and optimization from a control-data flowgraph specification. In contrast to scheduling under resource constraints, we derive a register-transfer (RT) description without imposing constraints on resources. The initial RT-level description may seem to have an excessive number of functional units and multiplexors; however, it will typically also exhibit high signal reconvergence. We demonstrate with non-trivial benchmark examples that regions of high signal reconvergence offer high resynthesis and optimization potential at RT-level as well, producing standard cell realizations that are comparable and competitive with alternate approaches in all aspects: layout area, path delay and gate-level testability. Tradeoffs in resource allocation are examined at RT-level only after optimizing the initial description. The RT-level description we generate serves as a top-level input to OASIS, which expands it to the required data-path components, synthesizes all control specifications, performs test generation and global optimization by redundancy removal and submits the final standard cell netlist for automatic placement and routing. This work is performed at MCNC and Duke University with support of a benchmark grant from ACM/SIGDA. Author affiliation: Franc Brglez, Centre for Microelectronics, MCNC, Research Triangle Park, N.C. 27709.

A Mask Reuse Methodology for Reducing System-on-a-Chip Cost

Sixth International Symposium on Quality of Electronic Design (ISQED'05), 2005

Today's System-on-a-Chip (SoC) design methodology provides an efficient way to develop highly integrated systems on a single chip by utilizing pre-designed intellectual property (IP) or "cores". However, once the cores are assembled, the physical design and manufacturing process that follows does not benefit from their reuse. We propose an alternative Mask Reuse Methodology (MRM) where most cores are provided with hardened layouts, significantly reducing the number of components for chip-level processing and the associated turnaround time. In addition, each core has a preverified mask set, which can be re-used to significantly reduce the overall mask cost and mask manufacturing time. Since mask cost and design and verification times are rapidly becoming prohibitive for low or even medium volume ASIC parts, the proposed MRM can help reduce the barrier for ASIC starts. We provide details of the methodology, as well as an assessment of its impact on design time and design cost with an example of a network processor SoC.

RT-level transformations for gate-level testability

1993 European Conference on Design Automation with the European Event in ASIC Design, 1993

The authors introduce a technique to transform a given RT-level design into a functionally equivalent, minimized design which is 100% testable under full-scan at the gate level. The proposed optimization technique uses the RT-level structure and exploits the interaction between the control and the data path. The approach maintains the design hierarchy while performing RT-level transformations of initially specified data

Keeping hot chips cool

Proceedings of the 42nd Design Automation Conference, 2005

With 90nm CMOS in production and 65nm testing in progress, power has been pushed to the forefront of design metrics. This paper will outline practical techniques that are used to reduce both leakage and active power in a standard-cell library based high-performance design flow. We will discuss the design and cost issues of using different power saving techniques such as power gating to reduce leakage, multiple and hybrid threshold libraries for leakage reduction, and multiple supply voltage based design. In addition, techniques to reduce clock tree power will be presented, as power consumed in clocks accounts for a significant portion of total chip power. Practical aspects of implementing these techniques will also be discussed.

Seas

Proceedings of the 1st IEEE/ACM/IFIP international conference on Hardware/software codesign & system synthesis - CODES+ISSS '03, 2003

Automating the design of SOCs using cores

IEEE Design & Test of Computers, 2001

... can help by automating many of the design tasks. Reinaldo A. Bergamaschi, Subhrajit Bhattacharya, Ronaldo Wagner, Colleen Fellenz, and Michael Muhlada (IBM); William R. Lee (Cisco Systems); Foster White (Intel); Jean-Marc Daveau (ST Microelectronics) ...

Early analysis tools for system-on-a-chip design

IBM Journal of Research and Development, 2002

The paper describes the need for early analysis tools to enable developers of today's system-on-a-chip (SoC) designs to take advantage of pre-designed components, such as those found in the IBM Blue Logic® Library, and rapidly explore high-level design alternatives to meet their system requirements. We report on a new approach for developing high-level performance models for these SoC designs and outline how this performance analysis capability can be integrated into an overall environment for efficient SoC design.

Blue Gene/L compute chip: Memory and Ethernet subsystem

IBM Journal of Research and Development, 2005

The Blue Gene/L compute chip is a dual-processor system-on-a-chip capable of delivering an arithmetic peak performance of 5.6 gigaflops. To match the memory speed to the high compute performance, the system implements an aggressive three-level on-chip cache hierarchy. The implemented hierarchy offers high bandwidth and integrated prefetching on cache hierarchy levels 2 and 3 (L2 and L3) to reduce memory access time. A Gigabit Ethernet interface driven by direct memory access (DMA) is integrated in the cache hierarchy, requiring only an external physical link layer chip to connect to the media. The integrated L3 cache stores a total of 4 MB of data, using multibank embedded dynamic random access memory (DRAM). The 1,024-bit-wide data port of the embedded DRAM provides 22.4 GB/s bandwidth to serve the speculative prefetching demands of the two processor cores and the Gigabit Ethernet DMA engine. To reduce hardware overhead due to cache coherence intervention requests, memory coherence is maintained by software. This is particularly efficient for regular highly parallel applications with partitionable working sets. The system further integrates an on-chip double-data-rate (DDR) DRAM controller for direct attachment of main memory modules to optimize overall memory performance and cost. For booting the system and low-latency interprocessor communication and synchronization, a 16-KB static random access memory (SRAM) and hardware locks have been added to the design.
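The relation between the quoted port width and bandwidth can be checked with a few lines of Python. This is only back-of-the-envelope arithmetic on the two figures in the abstract, assuming 1 GB = 10^9 bytes and one full-width transfer per cycle; the function name is hypothetical and the resulting rate is an implied value, not a figure stated in the abstract.

```python
def implied_transfer_rate(bandwidth_bytes_per_s, port_width_bits):
    """Transfers per second needed to sustain the bandwidth on the given port."""
    bytes_per_transfer = port_width_bits // 8
    return bandwidth_bytes_per_s // bytes_per_transfer

# Figures from the abstract: 1,024-bit port, 22.4 GB/s (assuming 1 GB = 10**9 bytes).
rate = implied_transfer_rate(22_400_000_000, 1024)
```

Under these assumptions the port moves 128 bytes per transfer, so 22.4 GB/s corresponds to 175 million full-width transfers per second.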

Methods and arrangements for automatically interconnecting cores in systems-on-chip
