Aviral Shrivastava - Academia.edu
Papers by Aviral Shrivastava
Proceedings of the 55th Annual Design Automation Conference
Coarse-grained reconfigurable arrays (CGRAs) are a promising solution that can accelerate even non-parallel loops. The acceleration achieved through CGRAs critically depends on the quality of the mapping of loop operations onto the PEs of the CGRA, and in particular on the compiler's ability to route the dependencies among operations. Previous works have explored several mechanisms to route data dependencies, including routing through other PEs, registers, memory, and even re-computation. All these routing options change the graph to be mapped onto the PEs (often by adding new operations), and without rescheduling it may be impossible to map the new graph. However, existing techniques explore these routing options inside the Place and Route (P&R) phase of the compilation process, which is performed after the scheduling step. As a result, they either fail to find a mapping or obtain poor results. Our method, RAMP, explicitly and intelligently explores the various routing options before the scheduling step, improving both mappability and mapping quality. Evaluating the top performance-critical loops of the MiBench benchmarks over 12 architectural configurations, we find that RAMP accelerates loops by 23× over sequential execution, achieving a geomean speedup of 2.13× over the state of the art.
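To make the routing idea concrete, the sketch below (an illustrative assumption, not the RAMP implementation) shows how a compiler might rewrite a data-flow graph before scheduling: when a producer and consumer of a value would sit more than one hop apart on a mesh CGRA, explicit routing operations are inserted so that the modified graph is what the scheduler sees. The function name, mesh-distance heuristic, and placement hints are all hypothetical.

```python
# Illustrative sketch only: insert explicit routing nodes into a data-flow
# graph (DFG) *before* scheduling, in the spirit of pre-scheduling routing
# exploration. The DFG and placement model are simplified assumptions.

from itertools import count

def manhattan(p, q):
    """Hop distance between two PE coordinates on a mesh CGRA."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def insert_routing_nodes(dfg, placement):
    """dfg: {node: [successor, ...]}, placement: {node: (x, y)} tentative PE hints.
    Returns a new DFG in which any edge longer than one hop is broken into a
    chain of 'route' nodes, so the scheduler sees the extra operations up front."""
    new_id = count()
    out = {n: list(s) for n, s in dfg.items()}
    for src, succs in dfg.items():
        for dst in succs:
            hops = manhattan(placement[src], placement[dst])
            if hops <= 1:
                continue  # neighbouring PEs can forward the value directly
            out[src].remove(dst)          # replace the long edge ...
            prev = src
            for _ in range(hops - 1):     # ... with hop-by-hop routing ops
                r = f"route{next(new_id)}"
                out.setdefault(r, [])
                out[prev].append(r)
                prev = r
            out[prev].append(dst)
    return out

# Tiny example: a dependence between operations placed two hops apart.
dfg = {"a": ["b"], "b": []}
placement = {"a": (0, 0), "b": (2, 0)}
print(insert_routing_nodes(dfg, placement))
# {'a': ['route0'], 'b': [], 'route0': ['b']}
```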
ACM Transactions on Cyber-Physical Systems
IEEE Transactions on Reliability
ACM Transactions on Embedded Computing Systems
International Journal of Pharma and Bio Sciences, Jun 1, 2013
Proceedings of the Conference on Design Automation and Test in Europe, 2007
Customizing the bypasses in pipelined processors is an effective and popular means to perform power, performance, and complexity trade-offs in embedded systems. However, existing techniques are unable to automatically generate test patterns to functionally validate a partially bypassed processor. Manually specifying directed test sequences to validate a partially bypassed processor is not only a complex and cumbersome task, but is also highly error-prone. In this paper we present an automatic directed test generation technique to verify a partially bypassed processor pipeline using a high-level processor description. We define a fault model and coverage metric for a partially bypassed processor pipeline and demonstrate that our technique can fully cover all the faults using 107,074 tests for the Intel XScale processor within 40 minutes. In contrast, randomly generated tests achieve 100% coverage only after 2 million tests and half a day. Furthermore, we demonstrate that our technique is able to generate tests for all possible bypass configurations of the Intel XScale processor.
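As a rough illustration of the kind of enumeration such a framework performs, the sketch below (a simplification under assumed names, not the paper's tool or fault model) generates one fault per (producer stage, consumer operand) bypass combination of a hypothetical pipeline and emits a short directed instruction sequence that exercises it. The stage names, operand labels, and instruction template are assumptions.

```python
# Simplified sketch of directed-test enumeration for a partially bypassed
# pipeline. The fault model here is one fault per (producer stage, consumer
# operand) bypass path; a real framework would derive both the pipeline and
# the fault model from a high-level processor description.

PRODUCER_STAGES = ["EX", "MEM", "WB"]      # assumed pipeline stages
CONSUMER_OPERANDS = ["src1", "src2"]       # operands that may be bypassed

def directed_test(stage, operand, reg="r1"):
    """Return an instruction sequence whose producer-consumer distance forces
    the value of `reg` to be forwarded from `stage` into `operand`."""
    distance = PRODUCER_STAGES.index(stage) + 1   # producer stage when read
    filler = ["nop"] * (distance - 1)
    consumer = (f"add r2, {reg}, r3" if operand == "src1"
                else f"add r2, r3, {reg}")
    return [f"add {reg}, r4, r5"] + filler + [consumer]

tests = [((s, o), directed_test(s, o))
         for s in PRODUCER_STAGES for o in CONSUMER_OPERANDS]
for fault, seq in tests:
    print(fault, seq)
```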
2009 Design Automation Test in Europe Conference Exhibition, 2009
With continuous technology scaling, soft errors are becoming an increasingly important design concern even for earth-bound applications. While compiler approaches have the potential to mitigate the effect of soft errors with minimal runtime overheads, static vulnerability estimation, an essential part of compiler approaches, is lacking due to its inherent complexity. This paper presents a static analysis approach for Register File (RF) vulnerability estimation. We decompose the vulnerability of a register into intrinsic and conditional basic-block vulnerabilities. This decomposition allows us to develop a fast, yet reasonably accurate, linear equation-based RF vulnerability estimation mechanism. We demonstrate its practical application to compiler optimizations. Our experimental results on benchmarks from the MiBench suite indicate that not only is our static RF vulnerability estimation fast and accurate, but the compiler optimizations enabled by our static estimation can also achieve very cost-effective protection of register files against soft errors.
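The decomposition can be pictured with a small model: each basic block contributes an intrinsic term (vulnerable cycles incurred whenever the block runs) plus a conditional term that applies only when the register is live across the block's boundary; weighting by block execution counts gives a linear estimate. The sketch below is an assumed toy model of that idea, not the paper's analysis or its actual equations.

```python
# Toy linear model of register-file vulnerability: per basic block, a register
# contributes an "intrinsic" term (always incurred when the block runs) and a
# "conditional" term (incurred only if the register is live across the block).
# Total vulnerability is a weighted sum over block execution counts.

def rf_vulnerability(blocks, exec_counts, live_through):
    """blocks: {bb: {reg: (intrinsic_cycles, conditional_cycles)}}
    exec_counts: {bb: number of times the block executes}
    live_through: {bb: set of registers live across the block}
    Returns {reg: estimated vulnerable cycles}."""
    total = {}
    for bb, regs in blocks.items():
        n = exec_counts[bb]
        for reg, (intrinsic, conditional) in regs.items():
            v = intrinsic
            if reg in live_through[bb]:
                v += conditional
            total[reg] = total.get(reg, 0) + n * v
    return total

# Example: r1 is live through bb1 but not bb2.
blocks = {"bb1": {"r1": (3, 10)}, "bb2": {"r1": (2, 8)}}
exec_counts = {"bb1": 100, "bb2": 40}
live_through = {"bb1": {"r1"}, "bb2": set()}
print(rf_vulnerability(blocks, exec_counts, live_through))  # {'r1': 1380}
```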
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Nov 1, 2009
Recently, coarse-grained reconfigurable architectures (CGRAs) have drawn increasing attention due to their efficiency and flexibility. While many CGRAs have demonstrated impressive performance improvements, the effectiveness of CGRA platforms ultimately hinges on the compiler. Existing CGRA compilers do not model the details of the CGRA, and thus they (i) are unable to map applications even though a mapping exists, and (ii) use too many processing elements (PEs) to map an application. In this paper, we model several CGRA details, e.g., irregular CGRA topologies, shared resources, and routing PEs, in our compiler and develop a graph-drawing-based approach, Split-Push Kernel Mapping (SPKM), for mapping applications onto CGRAs. On randomly generated graphs our technique can map on average 4.5 more applications than the previous approach, while generating mappings of better quality in terms of utilized CGRA resources. Utilizing fewer resources directly translates into increased opportunities for novel power and performance optimization techniques. Our technique shows lower power consumption in 71 cases and shorter execution cycles in 66 cases out of 100 synthetic applications, with minimal mapping time overhead. We observe similar results on a suite of benchmarks collected from the Livermore loops, Mediabench, Multimedia, Wavelet, and DSPStone benchmarks. SPKM is not an algorithm customized to one specific CGRA template, which we demonstrate by exploring various PE interconnection topologies and shared resource configurations with SPKM.
... This type of virus can evade detection from naive users as well as system administrators who ... scanned. 1.5.5. Heuristics. Virus writers slowly started using techniques such as entry ... a change in virus detection technology. Programmers came up with decision support ...
NAND flash memories require Garbage Collection (GC) and Wear Leveling (WL) operations to be carried out by the Flash Translation Layer (FTL) that oversees flash management. Owing to expensive erasures and data copying, these two operations essentially determine application response times. Since file systems do not share any file deletion information with the FTL, dead data is treated as valid by the FTL, resulting in significant WL and GC overheads. In this work, we propose a novel method to dynamically interpret and treat dead data at the FTL level so as to reduce the above overheads and improve application response times, without necessitating any changes to existing file systems. We demonstrate that our resource-efficient approach can improve application response times and memory write access times by 22% and reduce erasures by 21.6% on average.
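A minimal way to see the benefit is a toy FTL model in which garbage collection skips pages the host has hinted are dead (e.g., via a TRIM-like notification), so they are neither copied nor treated as valid at erasure time. The block layout, page counts, and `trim` interface below are illustrative assumptions, not the proposed FTL.

```python
# Toy FTL sketch: pages marked dead via a TRIM-like hint are not copied by
# garbage collection, reducing write amplification and freeing blocks sooner.

class ToyFTL:
    PAGES_PER_BLOCK = 4

    def __init__(self, num_blocks):
        # Each physical page is None (free), ("valid", lpn) or ("dead", lpn).
        self.blocks = [[None] * self.PAGES_PER_BLOCK for _ in range(num_blocks)]
        self.l2p = {}        # logical page number -> (block, page)
        self.copies = 0      # pages copied during GC

    def write(self, lpn, block, page):
        if lpn in self.l2p:                      # old copy becomes dead
            ob, op = self.l2p[lpn]
            self.blocks[ob][op] = ("dead", lpn)
        self.blocks[block][page] = ("valid", lpn)
        self.l2p[lpn] = (block, page)

    def trim(self, lpn):
        """File-deletion hint: mark the logical page's data dead."""
        if lpn in self.l2p:
            b, p = self.l2p.pop(lpn)
            self.blocks[b][p] = ("dead", lpn)

    def gc(self, victim):
        """Erase a victim block, copying out only still-valid pages."""
        survivors = [e for e in self.blocks[victim] if e and e[0] == "valid"]
        self.copies += len(survivors)
        self.blocks[victim] = [None] * self.PAGES_PER_BLOCK
        return survivors     # caller rewrites these elsewhere

ftl = ToyFTL(num_blocks=2)
for i in range(4):
    ftl.write(lpn=i, block=0, page=i)
ftl.trim(1)                              # file system deleted logical page 1
print(len(ftl.gc(0)), "pages copied")    # 3 instead of 4
```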
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2016
Agricultural Engineering Today, 2010
IET Computers & Digital Techniques, 2016
Software Programmable Memories, or SPMs, are raw on-chip memories that are managed not implicitly by the processor hardware but explicitly by software. For example, while caches fetch data from memory automatically and maintain coherence with other caches, SPM-based systems manage data movement between memories and other SPMs explicitly through software instructions. SPMs make the design of on-chip memories simpler, more scalable, and more power efficient, but they also place an additional burden on programming SPM-based processors. Traditionally, SPMs have been utilized in embedded systems, especially multimedia and gaming systems, but recently research on SPM-based systems has seen increased interest as a means to solve the memory scaling challenges of manycore architectures. This article presents an overview of the state of the art in SPM management techniques in manycore processors, summarizes some recent research on SPM-based systems, and outlines future research directions in this field.
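The difference from a cache is easiest to see in code: with an SPM, the program itself stages a tile of data into the fast memory, works on it there, and writes it back. The sketch below models that overlay-style management in plain Python; the lists standing in for DRAM and the SPM, and the tile size, are arbitrary assumptions.

```python
# Sketch of explicit, software-managed data movement for an SPM-based system.
# Lists stand in for DRAM and the on-chip SPM; in a real system the copies
# would be DMA transfers or explicit load/store instructions inserted by the
# compiler or runtime.

SPM_WORDS = 4  # assumed SPM capacity, in array elements

def scale_array(dram, factor):
    """Multiply every element of `dram` by `factor`, processing one
    SPM-sized tile at a time."""
    spm = [0] * SPM_WORDS
    for base in range(0, len(dram), SPM_WORDS):
        tile = dram[base:base + SPM_WORDS]
        spm[:len(tile)] = tile                         # copy in: DRAM -> SPM
        for i in range(len(tile)):                     # compute out of the SPM
            spm[i] *= factor
        dram[base:base + SPM_WORDS] = spm[:len(tile)]  # copy out: SPM -> DRAM

data = list(range(10))
scale_array(data, 3)
print(data)   # [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
```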
Indian Journal of Genetics and Plant Breeding, 2013
With advances in process technology, soft errors are becoming an increasingly critical design concern. Soft errors are manifested as a toggle in Boolean logic, which may result in failure of the system functionality. Owing to their large area, high density, and low operating voltages, caches are hit hardest by soft errors. Although Error Correction Code (ECC) based mechanisms have been suggested to protect the data in caches, they have high performance and power overheads. We observe that in multimedia applications, not all data require the same amount of protection from soft errors. In fact, an error in the multimedia data itself does not result in a failure, but often just in a slight loss of quality of service. Thus, it is possible to trade off the power and performance overheads of soft error protection against quality of service. To this end, we propose a Partially Protected Cache (PPC) architecture, in which there are two caches, one protected and the other unprotected, at the same level of the memory hierarchy. We demonstrate that, compared to existing unprotected cache architectures, PPC architectures can provide a 47× reduction in failure rate, at only 1% runtime and 3% power overheads. In addition, we observe that the failure rate reduction obtained by PPCs is very sensitive to the PPC cache configuration. Therefore, there is scope for further improving the solution by correctly parameterizing the PPC configurations. Consequently, we develop Design Space Exploration (DSE) strategies to find the best PPC configuration. Our DSE technique can reduce the exploration time by more than 6× compared to the exhaustive approach.
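The key software decision in a PPC system is which data goes to which cache: failure-critical data (control variables, pointers, tables) to the protected, ECC-equipped cache, and error-tolerant multimedia data to the unprotected one. The sketch below is an assumed illustration of such a placement policy; the classification rule and structure names are hypothetical, not the paper's scheme.

```python
# Illustrative placement policy for a Partially Protected Cache (PPC):
# error-tolerant data (e.g., pixel/sample buffers) is mapped to the
# unprotected cache, everything else to the protected (ECC) cache.

from dataclasses import dataclass

@dataclass
class DataObject:
    name: str
    size_bytes: int
    error_tolerant: bool   # an error only degrades output quality

def assign_cache(objects):
    """Return {object name: 'protected' | 'unprotected'}."""
    return {o.name: ("unprotected" if o.error_tolerant else "protected")
            for o in objects}

workload = [
    DataObject("frame_buffer", 1 << 20, error_tolerant=True),
    DataObject("huffman_tables", 4 << 10, error_tolerant=False),
    DataObject("loop_counters", 64, error_tolerant=False),
]
print(assign_cache(workload))
# {'frame_buffer': 'unprotected', 'huffman_tables': 'protected',
#  'loop_counters': 'protected'}
```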
2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2015
Center for Embedded Computer Systems, University of California, Irvine, Irvine, CA 92697-3425, USA (949) 824-8059 ... {kyoungwl, minyounk, dutt, nalini}@ics.uci.edu, Aviral.Shrivastava@asu.edu http://www.cecs.uci.edu/ ... Cross-Layer Interactions of Error ...
ACM Transactions on Design Automation of Electronic Systems, Oct 1, 2013
Incessant and rapid technology scaling has brought us to a point where today's and future transistors are susceptible to transient errors induced by energy-carrying particles, called soft errors. Within a processor, the sheer size of the caches and the nature of the data they hold make them the most vulnerable to such errors: data in the cache can be corrupted for as long as it remains actively unused there. Write-through and early-write-back [Li et al. 2004] cache configurations reduce the time data remains vulnerable in the cache, at the cost of increased memory writes and thereby energy. We propose a smart cache cleaning methodology that copies only specific vulnerable cache blocks into memory at chosen times, thereby ensuring data cache protection with minimal memory writes. In this work, we first propose a hybrid (software-hardware) methodology. We then propose an improved software solution that utilizes the cache write-back functionality available in commodity processors, thereby reducing the hardware overhead required to implement smart cache cleaning on such systems. The parameters involved in the implementation of our Smart Cache Cleaning (SCC) technique provide a means for customizable, energy-efficient soft error reduction in the L1 data cache. Given the system requirements of reliability, power budget, and runtime priority of the application, the SCC parameters can be tuned to trade off power consumption against L1 data cache reliability. Our experiments over the LINPACK and Livermore benchmarks demonstrate a 26% reduction in the energy-vulnerability product (energy-efficient vulnerability reduction) compared to hardware-based cache reliability techniques. Our software-only solution achieves the same level of reliability with an additional 28% performance improvement.
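The idea behind selective cleaning can be modelled with a small simulation: a dirty cache block accumulates vulnerable cycles while it sits without being written back, and a periodic cleaning pass forces a write-back only for blocks whose vulnerable time exceeds a threshold, instead of writing everything through. The threshold, cleaning interval, and trace format below are illustrative assumptions, not the SCC parameters from the paper.

```python
# Toy model of selective cache cleaning: periodically write back only those
# dirty blocks whose accumulated vulnerable time exceeds a threshold, instead
# of writing through on every store.

def run(accesses, clean_interval=100, threshold=50):
    """accesses: list of (cycle, block_id, is_write), in cycle order.
    Returns (memory_writes, total vulnerable cycles of dirty blocks)."""
    dirty_since = {}          # block -> cycle it last became dirty
    writes = vulnerable = 0
    last_clean = 0

    for cycle, block, is_write in accesses:
        if is_write:
            dirty_since.setdefault(block, cycle)
        # Periodic cleaning pass over long-dirty blocks only.
        if cycle - last_clean >= clean_interval:
            for b, since in list(dirty_since.items()):
                if cycle - since >= threshold:
                    vulnerable += cycle - since   # exposure ends at write-back
                    writes += 1
                    del dirty_since[b]
            last_clean = cycle

    # Any remaining dirty blocks stay vulnerable until the end of the trace.
    end = accesses[-1][0] if accesses else 0
    vulnerable += sum(end - since for since in dirty_since.values())
    return writes, vulnerable

trace = [(c, c % 3, c % 2 == 0) for c in range(0, 400, 10)]
print(run(trace))
```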
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Sep 1, 2009
With advances in process technology, soft errors are becoming an increasingly critical design concern. Soft errors are manifested as a toggle in Boolean logic, which may result in failure of the system functionality. Owing to their large area, high density, and low operating voltages, caches are hit hardest by soft errors. Although Error Correction Code (ECC) based mechanisms have been suggested to protect the data in caches, they have high performance and power overheads. We observe that in multimedia applications, not all data require the same amount of protection from soft errors. In fact, an error in the multimedia data itself does not result in a failure, but often just in a slight loss of quality of service. Thus, it is possible to trade off the power and performance overheads of soft error protection against quality of service. To this end, we propose a Partially Protected Cache (PPC) architecture, in which there are two caches, one protected and the other unprotected, at the same level of the memory hierarchy. We demonstrate that, compared to existing unprotected cache architectures, PPC architectures can provide a 47× reduction in failure rate, at only 1% runtime and 3% power overheads. In addition, we observe that the failure rate reduction obtained by PPCs is very sensitive to the PPC cache configuration. Therefore, there is scope for further improving the solution by correctly parameterizing the PPC configurations. Consequently, we develop Design Space Exploration (DSE) strategies to find the best PPC configuration. Our DSE technique can reduce the exploration time by more than 6× compared to the exhaustive approach.