Abdel-Hameed A. Badawy - Academia.edu

Papers by Abdel-Hameed A. Badawy

SprBlk cache: enabling fault resilience at low voltages

This paper proposes a novel cache architecture that uses spare cache blocks as backup blocks in a set-associative cache, which can operate reliably at voltages well below the manufacturing-induced minimum operating voltage (Vccmin). We detect errors in all cache lines at low voltage and tag each line as either faulty or fault-free. At runtime, we bypass the faulty words using adder and shifter circuitry. Furthermore, we develop a fault model to find the cache set failure probability at low voltage. At 485 mV, the SprBlk cache operates with a 16.7% lower bit failure probability than a conventional cache operating at 782 mV, while reducing power consumption by 1% when SprBlk is implemented in the L1 data cache only, by 75% when implemented in the L2 cache only, and by 76% when implemented in both L1 and L2 caches. SprBlk cache is 15% more area efficient than the previously proposed Bit-Fix mechanism. Additionally, SprBlk provides a ∼73% reduction in EPI (energy per instruction).
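The kind of fault model the abstract describes can be illustrated with a simple binomial sketch. The block size, associativity, and spare count below are illustrative assumptions, not the paper's parameters: a set is taken to fail when more blocks contain faulty bits than the spares can back up.

```python
from math import comb

def block_fail_prob(p_bit, bits_per_block=512):
    """Probability that at least one bit in a block is faulty,
    assuming independent per-bit failures with probability p_bit."""
    return 1.0 - (1.0 - p_bit) ** bits_per_block

def set_fail_prob(p_bit, ways=8, spares=1, bits_per_block=512):
    """A set fails when more than `spares` of its `ways` blocks are
    faulty -- a binomial tail over the per-block failure probability."""
    p = block_fail_prob(p_bit, bits_per_block)
    return sum(comb(ways, k) * p**k * (1 - p) ** (ways - k)
               for k in range(spares + 1, ways + 1))
```

Lowering the supply voltage raises `p_bit`, and this model shows how quickly the set failure probability grows with it, which is the trade-off the paper quantifies.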

Al-Imam Muhammad Ibn Saud Islamic University

Domain of Research: Research and development of wideband access routers for hybrid fibre-coaxial (HFC) cable networks and passive optical networks (PON).

Supervising Communication SoC for Secure Operation Using Machine Learning

2019 IEEE 62nd International Midwest Symposium on Circuits and Systems (MWSCAS)

Manufacturers normally buy and/or fabricate communication chips from third-party suppliers, which are then integrated into a complex hardware-software stack with a variety of potential vulnerabilities. This work proposes a compact supervisory circuit to classify the operation of a Bluetooth (BT) SoC at low frequencies by monitoring the input power and the radio frequency (RF) output of the BT chip passed through an envelope detector. The idea is to inexpensively fabricate an envelope detector, a power supply current monitor, and a classification algorithm on a custom low-frequency integrated circuit in a trusted legacy technology. When the supervisory circuit detects unexpected behavior, it can shut off power to the BT SoC. In this preliminary work, we prototype the supervisory circuit using off-the-shelf components. We extract simple yet descriptive features from the envelope of the RF output signal. Then, we train machine learning (ML) models to classify different BT operation modes, such as BT advertising and transmit modes. Our results show ∼100% classification accuracy.
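A minimal sketch of the envelope-feature approach described above. The feature set (mean, spread, duty cycle) and the nearest-centroid classifier are illustrative stand-ins, not the paper's actual features or models:

```python
import statistics

def envelope_features(samples):
    """Simple descriptive features of an RF envelope trace:
    mean level, spread, and the fraction of time above half the peak
    (a rough 'duty cycle' that separates bursty from continuous modes)."""
    peak = max(samples)
    mean = statistics.fmean(samples)
    spread = statistics.pstdev(samples)
    duty = sum(s > 0.5 * peak for s in samples) / len(samples)
    return (mean, spread, duty)

def nearest_centroid(feature_vec, centroids):
    """Classify by squared Euclidean distance to per-mode centroids."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda mode: dist(feature_vec, centroids[mode]))
```

An advertising-like envelope is bursty (low duty cycle), while a transmit-like envelope stays high, so even these crude features separate the two modes.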

Joint security and performance improvement in multilevel shared caches

Evaluating the Fault Tolerance Performance of HDFS and Ceph

Proceedings of the Practice and Experience on Advanced Research Computing

Large-scale distributed systems are collections of loosely coupled computers interconnected by a communication network. They are now an integral part of everyday life with the development of large web applications, social networks, peer-to-peer systems, wireless sensor networks, and many more. Because each disk by itself is prone to failure, one key challenge in designing such systems is their ability to tolerate faults. Hence, fault tolerance mechanisms such as replication are widely used to provide data availability at all times. On the other hand, many systems are increasingly supporting a newer mechanism called erasure coding (EC), on the premise that EC provides high reliability at a lower storage cost than replication. However, this comes at the cost of performance. Our goal in this paper is to compare the performance and storage requirements of these two data reliability techniques for two open-source systems, HDFS and Ceph, especially since the Apache Software Foundation released Apache Hadoop 3.0.0, which now supports EC, and Ceph added support for EC with its Firefly release (May 2014). We tested replication vs. EC in both systems using several benchmarks shipped with these systems. Results show that there are trade-offs between replication and EC in terms of performance and storage requirements.
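The storage-cost side of the replication-vs-EC trade-off reduces to simple arithmetic. A sketch, using 3-way replication and an RS(6,3)-style layout as representative parameters (the exact schemes benchmarked in the paper may differ):

```python
def replication_overhead(replicas=3):
    """Raw bytes stored per user byte under n-way replication."""
    return float(replicas)

def erasure_coding_overhead(data_blocks=6, parity_blocks=3):
    """Raw bytes stored per user byte under a Reed-Solomon-style (k, m)
    scheme: k data blocks plus m parity blocks per stripe."""
    return (data_blocks + parity_blocks) / data_blocks
```

With these defaults, replication costs 3.0x the user data while RS(6,3) costs only 1.5x, which is the storage saving EC promises; the performance cost comes from the more complicated encode/recovery path.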

Machine Learning Bluetooth Profile Operation Verification via Monitoring the Transmission Pattern

2019 53rd Asilomar Conference on Signals, Systems, and Computers

Manufacturers often buy and/or license communication ICs from third-party suppliers. These communication ICs are then integrated into a complex computational system, resulting in a wide range of potential hardware-software security issues. This work proposes a compact supervisory circuit to classify the Bluetooth profile operation of a Bluetooth System-on-Chip (SoC) at low frequencies by monitoring the radio frequency (RF) output power of the Bluetooth SoC. The idea is to inexpensively manufacture an RF envelope detector to monitor the RF output power, together with a profile classification algorithm, on a custom low-frequency integrated circuit in a low-cost legacy technology. When the supervisory circuit observes unexpected behavior, it can shut off power to the Bluetooth SoC. In this preliminary work, we prototype the supervisory circuit using off-the-shelf components to collect a data set sufficient to train 11 different machine learning models. We extract descriptive time-domain features from the envelope of the RF output signal. Then, we train the machine learning models to classify three different Bluetooth operation profiles: sensor, hands-free, and headset. Our results demonstrate ∼100% classification accuracy with low computational complexity.

A Scalable Analytical Memory Model for CPU Performance Prediction

Lecture Notes in Computer Science

As the US Department of Energy (DOE) invests in exascale computing, performance modeling of physics codes on CPUs remains a challenge in computational co-design due to the complex design of processors, including memory hierarchies, instruction pipelining, and speculative execution. We present the Analytical Memory Model (AMM), a model of cache hierarchies embedded in the Performance Prediction Toolkit (PPT), a suite of discrete-event-simulation-based co-design hardware and software models. AMM enables PPT to significantly improve the quality of its runtime predictions of scientific codes. AMM uses a computationally efficient, stochastic method to predict reuse distance profiles, where reuse distance is a hardware-architecture-independent measure of the patterns of virtual memory accesses. AMM relies on a stochastic, static basic-block-level analysis of reuse profiles measured from the memory traces of applications on small instances. The analytical reuse profile is used to estimate the effective latency and throughput of memory accesses, which in turn are used to predict the overall runtime of an application. Our experimental results demonstrate the scalability of AMM, where we report the error rates of three benchmarks on two different hardware models.
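For readers unfamiliar with reuse distance, here is a minimal exact reference implementation: for each access, count the distinct addresses touched since the previous access to the same address. This O(n²) version is only a definition sketch; AMM's contribution is predicting these profiles stochastically without full traces.

```python
def reuse_distance_profile(trace):
    """Exact reuse distance for each access in an address trace.
    First-time accesses get infinity (a cold miss in any cache size)."""
    last_pos = {}
    profile = []
    for i, addr in enumerate(trace):
        if addr in last_pos:
            # distinct addresses between the two accesses to `addr`
            profile.append(len(set(trace[last_pos[addr] + 1 : i])))
        else:
            profile.append(float("inf"))
        last_pos[addr] = i
    return profile
```

Because the measure counts distinct addresses, it is independent of any particular hardware: an access with reuse distance d hits in a fully associative LRU cache of more than d blocks.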

DAdHTM: Low overhead dynamically adaptive hardware transactional memory for large graphs: a scalability study

2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), 2017

With the availability of multicore and manycore systems with large main memories, Symmetric Multiprocessors (SMPs) can process large-scale problems, such as graphs spanning millions of nodes and billions of edges, that used to be the domain of large clusters. Due to the shared memory architecture of SMPs, we can apply fast, novel policies such as Transactional Memory (TM) to speed up the processing of such problems. Interestingly, most applications in bioinformatics, social networks, and cybersecurity can be represented as large graphs. However, these graphs are sparse in nature. Therefore, TM as a synchronization policy for critical sections can provide better performance, since it performs better in low-conflict scenarios. Many TM variants exist, including software, hardware, and combinations of both. Furthermore, a TM can adapt to an application's behavior. In this paper, we introduce DAdHTM, a Dynamically Adaptive Hardware TM designed to adapt the HTM to the application's...

Fault Tolerance Performance Evaluation of Large-Scale Distributed Storage Systems HDFS and Ceph Case Study

2018 IEEE High Performance extreme Computing Conference (HPEC)

Large-scale distributed systems are collections of loosely coupled computers interconnected by a communication network. They are now an integral part of everyday life with the development of large web applications, social networks, peer-to-peer systems, wireless sensor networks, and many more. At such a scale, hardware components by themselves are prone to failure. Therefore, one key challenge in designing distributed storage systems is how to tolerate faults. To this end, fault tolerance mechanisms such as replication have been widely used to provide high availability for decades. More recently, many systems have started supporting erasure coding for fault tolerance, which is expected to achieve high reliability at a lower storage cost compared to replication. However, the reduced storage overhead comes at the cost of more complicated recovery, which hurts performance. In this paper, we study the fault tolerance mechanisms of two representative distributed file systems, HDFS and Ceph. In addition to traditional replication, both HDFS and Ceph support erasure coding in their latest versions. We evaluate the replication and erasure coding implementations in both systems using standard benchmarks and fault injection, and quantitatively measure the performance and storage overhead. Our results demonstrate the trade-offs between replication and erasure coding techniques, and serve as a foundation for building optimal storage systems with high availability as well as high performance.

A Brief History of HPC Simulation and Future Challenges

High-Performance Computing (HPC) systems have gone through many changes during the past two decades in their architectural design to satisfy the increasingly large-scale scientific computing demand. Accurate, fast, and scalable performance models and simulation tools are essential for evaluating alternative architecture design decisions for massive-scale computing systems. This paper recounts some of the influential work in modeling and simulation for HPC systems and applications, identifies some of the major challenges, and outlines future research directions which we believe are critical to the HPC modeling and simulation community.

Novel flexible buffering architectures for 3D-NoCs

Sustainable Computing: Informatics and Systems

In the conventional router architecture of Networks-on-Chip (NoCs), each input port employs a set of dedicated flit buffers to store incoming flits. This mechanism distributes flits unevenly among router buffers, which in turn leads to higher packet blocking rates and underutilization of buffers. In this paper, we address this problem by proposing two novel buffering mechanisms, and their corresponding architectures, to share flit buffers efficiently among several ports of a router. Our first proposed mechanism, called Minimum-First buffering, distributes flits among the buffers of input ports based on the number of free buffer slots available in each port, giving priority to the least occupied buffers. This approach increases the utilization of underutilized buffers by allowing them to store flits of other input ports. The second mechanism, Inverse-Priority buffering, is a lighter yet still efficient and flexible buffering technique that employs a simple priority order for each buffer. According to our analysis, prioritizing specific ports over others balances the traffic load between router buffers and thus yields higher throughput. Both mechanisms lead to lower waiting times in the router and higher utilization of hardware resources. After studying all possible scenarios and analyzing corner cases, we designed two router architectures equipped with the proposed buffering mechanisms. Moreover, a hardware optimization technique is introduced to reduce the area overhead of the Minimum-First router architecture. The proposed architectures show significant improvements in the performance of 3D-NoCs in terms of average network throughput, average delay, and the total number of blocked packets compared to state-of-the-art and baseline router architectures.
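The two selection policies above can be sketched as behavioral models. The function names and the occupancy-list representation are illustrative assumptions, not the paper's RTL:

```python
def minimum_first_select(buffer_occupancy):
    """Minimum-First buffering: pick the buffer with the fewest occupied
    slots, so lightly used buffers absorb flits from busy input ports.
    Returns the index of the chosen buffer."""
    return min(range(len(buffer_occupancy)), key=lambda i: buffer_occupancy[i])

def inverse_priority_select(buffer_occupancy, priority_order, capacity):
    """Inverse-Priority buffering: walk a fixed per-buffer priority order
    and take the first buffer with a free slot (None if all are full).
    Cheaper than Minimum-First because no global minimum is computed."""
    for i in priority_order:
        if buffer_occupancy[i] < capacity:
            return i
    return None
```

The sketch makes the trade-off visible: Minimum-First needs a comparison across all buffers per decision, while Inverse-Priority is a short priority scan, which is why the paper adds a hardware optimization to shrink the Minimum-First router's area.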

Performance Evaluation of Mesh-based 3D NoCs

Proceedings of the 10th International Workshop on Network on Chip Architectures, 2017

Advances in 3D circuit integration have reignited the idea of processing-in-memory (PIM). In this paper, we evaluate 3D mesh-based NoC designs for 3D-PIM systems. We study the stacked mesh (S-Mesh), a mesh-bus hybrid architecture for 3D NoCs that connects vertically stacked 2D meshes through buses. Previous S-Mesh studies have not addressed the problems and modifications needed at the building blocks of the network. We explain in detail the internal structure of the S-Mesh, as well as the problems and solutions of connecting 2D meshes using vertical buses. Also, we evaluate the performance of 3D NoC designs using two traffic patterns, one of which is a novel traffic pattern that better measures the performance of 3D-PIM systems. Our results show a 15% improvement in zero-load packet latency for the S-Mesh, with a negligible decrease in saturation throughput.

Energy Efficient Tri-State CNFET Ternary Logic Gates

Traditional silicon binary circuits continue to face challenges such as high leakage power dissipation and the large area of interconnections. Multiple-Valued Logic (MVL) and nano-devices are two feasible solutions to overcome these problems. In this paper, we present a novel method to design ternary logic circuits based on Carbon Nanotube Field Effect Transistors (CNFETs). The proposed designs use the unique properties of CNFETs, e.g., adjusting the Carbon Nanotube (CNT) diameters to obtain the desired threshold voltage and to have the same mobility for P-FET and N-FET transistors. Each of our designed logic circuits implements a logic function and its complement via a control signal. Also, these circuits have a high-impedance state, which saves power while the circuits are not in use. We show a more detailed application of our approach by designing a two-digit adder-subtractor circuit. We simulate the proposed ternary circuits using HSPICE via standard 32nm CN...
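A behavioral sketch of what "ternary logic with a tri-state output" means, using the standard ternary inverter (STI) as the example gate. The function and signal names are illustrative, and the high-impedance marker is a modeling convention, not the paper's circuit:

```python
HIGH_Z = "Z"  # models the high-impedance state used to save power when idle

def standard_ternary_inverter(x):
    """STI over the ternary values {0, 1, 2}: 0 -> 2, 1 -> 1, 2 -> 0."""
    return 2 - x

def tri_state_sti(x, enable, invert=True):
    """Behavioral tri-state gate: when enabled it outputs either the
    inverter function or the identity (its complement path), selected
    by a control signal; when disabled the output floats (high-Z)."""
    if not enable:
        return HIGH_Z
    return standard_ternary_inverter(x) if invert else x
```

This captures the abstract's two claims at the behavioral level: one circuit realizes a function and its complement under a control signal, and the disabled (high-Z) state draws no switching activity.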

Probabilistic Monte Carlo simulations for static branch prediction

2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC)

Conditional branch instructions have a significant effect on microprocessor performance and throughput. Accurate branch prediction is crucial for reducing control hazards and improving microprocessor performance. Modern microprocessors accurately predict branch outcomes using advanced prediction techniques. Estimating branch misprediction rates accurately helps to improve overall performance by saving CPU cycles and power. In general, we run application programs on cycle-accurate hardware simulators such as GEM5 [4] to collect branch prediction statistics. This method is time-consuming and does not scale. We present a novel Monte Carlo simulation framework that produces the branch prediction rate statically, without actually running the application on the hardware. Our framework mimics the execution behavior of the real hardware. It uses one of three different branch prediction schemes to calculate the branch prediction statistics. It also reports the prediction rates of individual branches. Results suggest that the conditional prediction rates for four scientific applications are similar to the results from the GEM5 [4] simulator.
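The Monte Carlo idea can be illustrated with a toy model: estimate a predictor's accuracy by drawing branch outcomes from a probability rather than executing the program. The 2-bit saturating counter and the i.i.d. outcome model below are illustrative stand-ins for the paper's three prediction schemes:

```python
import random

def simulate_2bit_predictor(p_taken, n=100_000, seed=0):
    """Monte Carlo estimate of a 2-bit saturating-counter predictor's
    accuracy on a branch whose outcome is taken with probability p_taken
    (outcomes drawn i.i.d. -- a simplification of real branch behavior)."""
    rng = random.Random(seed)
    counter = 2          # states 0..3; predict taken when counter >= 2
    correct = 0
    for _ in range(n):
        taken = rng.random() < p_taken
        correct += (counter >= 2) == taken
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return correct / n
```

A heavily biased branch (`p_taken` near 1) is predicted almost perfectly, while a 50/50 branch hovers near coin-flip accuracy, matching the intuition that misprediction cost concentrates in hard-to-predict branches.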

Exploring Energy-Efficient Ternary Inexact Multipliers Using CNT Transistors

Optimizing locality in graph computations using reuse distance profiles

2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC), 2017

This work tries to answer the question of whether we should write code differently when the underlying chip is a multicore processor. We use a set of three graph benchmarks, each with three different input problems varying in size and connectivity, to characterize the importance of how we partition the problem space among cores and how that partitioning can happen at multiple levels of the cache, leading to better performance. We explore a design space represented by different parallelization schemes and different graph partitionings. This provides a large and complex space that we characterize using detailed simulation results to see how much gain we can obtain over a baseline legacy parallelization technique with a partition sized to fit in the L1 cache. We show that the legacy parallelization is not the best alternative in most cases, and other parallelization techniques perform better. We use a PIN-computed reuse distance profile to build a...

DyAdHyTM: A Low Overhead Dynamically Adaptive Hybrid Transactional Memory on Big Data Graphs

ArXiv, 2017

Big data is a buzzword used to describe massive volumes of data that provide opportunities for exploring new insights through data analytics. However, big data is mostly structured but can be semi-structured or unstructured. It is normally so large that it is not only difficult but also slow to process using traditional computing systems. One of the solutions is to format the data as graph data structures and process them on shared-memory architectures to use fast, novel policies such as transactional memory. In most graph applications in big-data-type problems, such as bioinformatics, social networks, and cybersecurity, graphs are sparse in nature. Due to this sparsity, we have the opportunity to use Transactional Memory (TM) as the synchronization policy for critical sections to speed up applications. At low conflict probability, TM performs better than most synchronization policies due to its inherent non-blocking characteristics. TM can be implemented in software, hardware, or a hybrid of the two...

Local memory store (LMStr): A hardware controlled shared scratchpad for multicores

2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), 2017

We present an on-chip memory store called the Local Memory Store (LMStr). The LMStr can be used with a regular cache hierarchy or solely as a redesigned scratchpad memory (SPM). The LMStr is a special kind of SPM shared among the cores of a multicore processor. The LMStr is hardware-controlled in terms of managing the store itself, yet compiler support is instrumental in deciding which data items/types should live in the store. Critical data is stored in the LMStr according to its type (i.e., local, global, static, or temporary). The programmer can provide, at will, hints to the compiler to place certain data items in the LMStr. We evaluate our design using a matrix multiplication micro-application and multiple Mantevo mini-applications. Our results show that LMStr reduces data movement by up to 21% compared to cache alone with a mere 3% area overhead. LMStr also improves cycles per memory access by up to 40%.

FPGA-Accelerated Decision Tree Classifier for Real-Time Supervision of Bluetooth SoC

2019 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2019

Wireless communication protocols are used in all smart devices and systems. This work proposes an FPGA-accelerated supervisory system that classifies the operation of a communication system-on-chip (SoC). In this work, the selected communication protocol is Bluetooth (BT). The input supply current to the transceiver block of the SoC is monitored and sampled at 50 kHz. We extract simple descriptive features from the transceiver input power signal and use them to train a machine learning (ML) model to classify two different BT operation modes. We implemented ADC sampling, feature extraction, and a real-time decision tree classifier on an Intel MAX 10 FPGA. The measured classification accuracy is 97.4%.
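Decision trees map well onto FPGAs because each node is just a threshold comparator. The sketch below shows the shape of such a classifier; the feature names, thresholds, and mode labels are hypothetical, not the paper's trained tree:

```python
def classify_bt_mode(mean_mA, peak_mA):
    """Two-level threshold tree over supply-current features -- the kind
    of structure that synthesizes to a pair of comparators and a mux.
    Thresholds here are illustrative placeholders."""
    if peak_mA >= 12.0:                      # strong RF bursts present
        # continuous activity implies transmit; sparse bursts, advertising
        return "transmit" if mean_mA >= 6.0 else "advertising"
    return "advertising"
```

Because the whole classifier is a handful of fixed-point compares, it can run every sample period at 50 kHz with trivial logic utilization, which is what makes real-time supervision on a small FPGA practical.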

Spare block cache (SprBlk): Fault resilience and reliability at low voltages

2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), 2017

This paper proposes a novel cache architecture that uses spare cache blocks as backup blocks in a set-associative cache, which can operate reliably at voltages well below the manufacturing-induced minimum operating voltage (Vccmin). We detect errors in all cache lines at low voltage (i.e., persistent errors) and tag each line as either faulty or fault-free. At runtime, we bypass the faulty words using adder and shifter circuitry. Furthermore, we develop a fault model to find the cache set failure probability at low voltage. At 485 mV, the SprBlk cache operates with a 16.7% lower bit failure probability than a conventional cache operating at 782 mV. Additionally, SprBlk reduces power consumption by 1% when implemented in the L1 data cache only, by 75% when implemented in the L2 cache only, and by 76% when implemented in both caches.

Research paper thumbnail of SprBlk cache: enabling fault resilience at low voltages

This paper proposes a novel cache architecture that uses spare cache blocks to work as back up bl... more This paper proposes a novel cache architecture that uses spare cache blocks to work as back up blocks in a set associative cache, which can operate reliably at voltages well below the manufacturing induced operating voltage (Vccmin). We detect errors in all cache lines at low voltage, tag them as either faulty or fault-free. At runtime, we bypass the faulty words. To bypass faulty words, we use adder and shifter circuitry. Furthermore, we develop a fault model to find the cache set failure probability at low voltage. At 485mV, SprBlk cache operates with a 16.7% lower bit failure probability compared to a conventional cache operating at 782mV while reducing power consumption by 1% when SprBlk is implemented in the L1 data cache only, by 75% when implemented in the L2 cache only, and by 76% when implemented in both L1 and L2 caches. SprBlk cache is 15% more area efficient than the previously proposed Bit-Fix mechanism. Additionally, SprBlk provides ∼ 73% reduction in EPI (energy per i...

Research paper thumbnail of Al-Imam Muhammad Ibn Saud Islamic University

Domain of Research: Research and development of wideband access routers for hybrid fibre-coaxial ... more Domain of Research: Research and development of wideband access routers for hybrid fibre-coaxial (HFC) cable networks and passive optical networks (PON)

Research paper thumbnail of Supervising Communication SoC for Secure Operation Using Machine Learning

2019 IEEE 62nd International Midwest Symposium on Circuits and Systems (MWSCAS)

Manufacturers normally buy and/or fabricate communication chips using third-party suppliers, whic... more Manufacturers normally buy and/or fabricate communication chips using third-party suppliers, which are then integrated into a complex hardware-software stack with a variety of potential vulnerabilities. This work proposes a compact supervisory circuit to classify the operation of a Bluetooth (BT) SoC at low frequencies by monitoring the input power and radio frequency (RF) output of the BT chip passed through an envelope detector. The idea is to inexpensively fabricate an envelope detector, power supply current monitor, and classification algorithm on a custom low-frequency integrated circuit in a trusted legacy technology. When the supervisory circuit detects unexpected behavior, it can shut off power to the BT SoC. In this preliminary work, we proto-type the supervisory circuit using off-the-shelf components. We extract simple yet descriptive features from the envelope of the RF output signal. Then, we train machine learning (ML) models to classify different BT operation modes, such as BT advertising and transmit modes. Our results show ∼100% classification accuracy.

Research paper thumbnail of Joint security and performance improvement in multilevel shared caches

Research paper thumbnail of Evaluating the Fault Tolerance Performance of HDFS and Ceph

Proceedings of the Practice and Experience on Advanced Research Computing

Large-scale distributed systems are a collection of loosely coupled computers interconnected by a... more Large-scale distributed systems are a collection of loosely coupled computers interconnected by a communication network. They are now an integral part of everyday life with the development of large web applications, social networks, peer-to-peer systems, wireless sensor networks and many more. Because each disk by itself is prone to failure, one key challenge in designing such systems is their ability to tolerate faults. Hence, fault tolerance mechanisms such as replication are widely used to provide data availability at all times. On the other hand, many systems now are increasingly supporting new mechanism called erasure coding (EC), claiming that using EC provides high reliability at lower storage cost than replication. However, this comes at the cost of performance. Our goal in this paper is to compare the performance and storage requirements of these two data reliability techniques for two open source systems: HDFS and Ceph especially that the Apache Software Foundation had released a new version of Hadoop, Apache Hadoop 3.0.0, which now supports EC. In addition, with the Firefly release (May 2014) Ceph added support for EC as well. We tested replication vs. EC in both systems using several benchmarks shipped with these systems. Results show that there are trade-offs between replication and EC in terms of performance and storage requirements.

Research paper thumbnail of Machine Learning Bluetooth Profile Operation Verification via Monitoring the Transmission Pattern

2019 53rd Asilomar Conference on Signals, Systems, and Computers

Manufacturers often buy and/or license communication ICs from third-party suppliers. These commun... more Manufacturers often buy and/or license communication ICs from third-party suppliers. These communication ICs are then integrated into a complex computational system, resulting in a wide range of potential hardware-software security issues. This work proposes a compact supervisory circuit to classify the Bluetooth profile operation of a Bluetooth System-on-Chip (SoC) at low frequencies by monitoring the radio frequency (RF) output power of the Bluetooth SoC. The idea is to inexpensively manufacture an RF envelope detector to monitor the RF output power and a profile classification algorithm on a custom low-frequency integrated circuit in a low-cost legacy technology. When the supervisory circuit observes unexpected behavior, it can shut off power to the Bluetooth SoC. In this preliminary work, we proto-type the supervisory circuit using off-the-shelf components to collect a sufficient data set to train 11 different Machine Learning models. We extract smart descriptive time-domain features from the envelope of the RF output signal. Then, we train the machine learning models to classify three different Bluetooth operation profiles: sensor, hands-free, and headset. Our results demonstrate 100% classification accuracy with low computational complexity.∼

Research paper thumbnail of A Scalable Analytical Memory Model for CPU Performance Prediction

Lecture Notes in Computer Science

As the US Department of Energy (DOE) invests in exascale computing, performance modeling of physi... more As the US Department of Energy (DOE) invests in exascale computing, performance modeling of physics codes on CPUs remain a challenge in computational co-design due to the complex design of processors including memory hierarchies, instruction pipelining, and speculative execution. We present Analytical Memory Model (AMM), a model of cache hierarchies, embedded in the Performance Prediction Toolkit (PPT)-a suite of discrete-event-simulation-based co-design hardware and software models. AMM enables PPT to significantly improve the quality of its runtime predictions of scientific codes. AMM uses a computationally efficient, stochastic method to predict the reuse distance profiles, where reuse distance is a hardware architecture-independent measure of the patterns of virtual memory accesses. AMM relies on a stochastic, static basic block-level analysis of reuse profiles measured from the memory traces of applications on small instances. The analytical reuse profile is useful to estimate the effective latency and throughput of memory access, which in turn are used to predict the overall runtime of an application. Our experimental results demonstrate the scalability of AMM, where we report the error-rates of three benchmarks on two different hardware models.

Research paper thumbnail of DAdHTM: Low overhead dynamically adaptive hardware transactional memory for large graphs: a scalability study

2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), 2017

With the availability of multicore and manycore systems with large main memories, Symmetric Multiprocessors (SMPs) can process large-scale problems such as graphs spanning millions of nodes and billions of edges, which used to be the domain of large clusters. Due to the shared memory architecture of SMPs, we can apply fast, novel policies such as Transactional Memory (TM) to speed up the processing of such problems. Interestingly, most applications in bioinformatics, social networks, and cybersecurity can be represented as large graphs. However, these graphs are sparse in nature. Therefore, TM as a synchronization policy for critical sections can provide better performance since it performs better in low-conflict scenarios. Many TM variants exist, including software, hardware, and combinations of both. Furthermore, a TM can adapt to an application's behavior. In this paper, we introduce DAdHTM, a Dynamically Adaptive Hardware TM designed to adapt the HTM to the application...

Research paper thumbnail of Fault Tolerance Performance Evaluation of Large-Scale Distributed Storage Systems: HDFS and Ceph Case Study

2018 IEEE High Performance extreme Computing Conference (HPEC)

Large-scale distributed systems are a collection of loosely coupled computers interconnected by a communication network. They are now an integral part of everyday life with the development of large web applications, social networks, peer-to-peer systems, wireless sensor networks, and many more. At such a scale, hardware components by themselves are prone to failure. Therefore, one key challenge in designing distributed storage systems is how to tolerate faults. To this end, fault tolerance mechanisms such as replication have been widely used to provide high availability for decades. More recently, many systems have started supporting erasure coding for fault tolerance, which is expected to achieve high reliability at a lower storage cost compared to replication. However, the reduced storage overhead comes at the cost of more complicated recovery, which hurts performance. In this paper, we study the fault tolerance mechanisms of two representative distributed file systems: HDFS and Ceph. In addition to traditional replication, both HDFS and Ceph support erasure coding in their latest versions. We evaluate the replication and erasure coding implementations in both systems using standard benchmarks and fault injection, and quantitatively measure the performance and storage overhead. Our results demonstrate the trade-offs between replication and erasure coding techniques, and serve as a foundation for building storage systems with high availability as well as high performance.
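The storage-cost difference the abstract mentions follows from simple arithmetic: k-way replication stores k copies of every byte, while an erasure code with d data and p parity fragments stores (d + p)/d bytes per logical byte. The parameter values in the sketch below (3-way replication; a Reed-Solomon 6+3 layout) are common defaults used for illustration, not necessarily the configurations measured in the paper.

```python
def storage_overhead(scheme, *, replicas=None, data=None, parity=None):
    """Raw-to-logical storage ratio for the two fault-tolerance schemes.

    For k-way replication the ratio is k; for an erasure code with
    `data` data fragments and `parity` parity fragments it is
    (data + parity) / data.
    """
    if scheme == "replication":
        return float(replicas)
    if scheme == "erasure":
        return (data + parity) / data
    raise ValueError(f"unknown scheme: {scheme}")
```

So 3-way replication costs 3x raw storage, while a 6+3 erasure code costs only 1.5x yet still tolerates three lost fragments, at the price of the more expensive recovery noted above.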

Research paper thumbnail of A Brief History of HPC Simulation and Future Challenges

High-Performance Computing (HPC) systems have gone through many changes during the past two decades in their architectural design to satisfy the increasingly large-scale scientific computing demand. Accurate, fast, and scalable performance models and simulation tools are essential for evaluating alternative architecture design decisions for massive-scale computing systems. This paper recounts some of the influential work in modeling and simulation for HPC systems and applications, identifies some of the major challenges, and outlines future research directions which we believe are critical to the HPC modeling and simulation community.

Research paper thumbnail of Novel flexible buffering architectures for 3D-NoCs

Sustainable Computing: Informatics and Systems

In the conventional router architecture of Network-on-Chips (NoCs), each input port employs a set of dedicated flit buffers to store incoming flits. This mechanism unevenly distributes flits among router buffers, which in turn leads to higher packet blocking rates and underutilization of buffers. In this paper, we address this problem by proposing two novel buffering mechanisms and their corresponding architectures to share flit buffers among several ports of a router efficiently. Our first proposed mechanism is called Minimum-First buffering. This mechanism distributes flits among the buffers of input ports based on the number of free buffer slots available in each port, giving priority to the minimum-occupied buffers. This approach increases the utilization of underutilized buffers by allowing them to store flits of other input ports. The second mechanism (so-called Inverse-Priority buffering) is a lighter yet efficient, flexible buffering technique that employs a simple priority order for each buffer. According to our analysis, prioritizing specific ports over others balances the traffic loads between router buffers and thus yields higher throughput. Both mechanisms lead to lower waiting times in the router and higher utilization of hardware resources. After studying all possible scenarios and analyzing corner cases, we have designed two router architectures equipped with the proposed buffering mechanisms. Moreover, a hardware optimization technique is introduced to reduce the area overhead of the Minimum-First router architecture. The proposed architectures show significant improvements in the performance of 3D-NoCs in terms of average network throughput, average delay, and the total number of blocked packets compared to state-of-the-art and baseline router architectures.
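The Minimum-First selection rule described above can be sketched as a one-line policy: steer a flit to the shared buffer with the most free slots, so underutilized buffers absorb traffic from busy ports. The port names below are illustrative, not taken from the paper's router design.

```python
def minimum_first_assign(buffers):
    """Pick the input-port buffer with the most free slots.

    A sketch of the Minimum-First policy: `buffers` maps a port name
    to a (occupied, capacity) pair, and the least-occupied buffer
    (largest free count) wins.
    """
    return max(buffers, key=lambda p: buffers[p][1] - buffers[p][0])
```

The actual hardware, of course, implements this comparison with counters and a priority network rather than software, which is where the area-optimization technique mentioned above comes in.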

Research paper thumbnail of Performance Evaluation of Mesh-based 3D NoCs

Proceedings of the 10th International Workshop on Network on Chip Architectures, 2017

Advances in 3D circuit integration have reignited the idea of processing-in-memory (PIM). In this paper, we evaluate 3D mesh-based NoC designs for 3D-PIM systems. We study the stacked mesh (S-Mesh), a mesh-bus hybrid architecture for 3D NoCs that connects vertically stacked 2D meshes through buses. Previous S-Mesh studies have not addressed the problems and modifications needed at the building blocks of the network. We explain in detail the internal structure of the S-Mesh, as well as the problems and solutions of connecting 2D meshes using vertical buses. Also, we evaluate the performance of 3D NoC designs via two traffic patterns, one of which is a novel traffic pattern that better measures 3D-PIM system performance. Our results show a 15% performance improvement for the S-Mesh in zero-load packet latency while having a negligible decrease in saturation throughput.

Research paper thumbnail of Energy Efficient Tri-State CNFET Ternary Logic Gates

Traditional silicon binary circuits continue to face challenges such as high leakage power dissipation and large interconnection area. Multiple-Valued Logic (MVL) and nano-devices are two feasible solutions to overcome these problems. In this paper, we present a novel method to design ternary logic circuits based on Carbon Nanotube Field Effect Transistors (CNFETs). The proposed designs use the unique properties of CNFETs, e.g., adjusting the Carbon Nanotube (CNT) diameters to obtain the desired threshold voltage and the same mobility for P-FET and N-FET transistors. Each of our designed logic circuits implements a logic function and its complement via a control signal. Also, these circuits have a high-impedance state, which saves power while the circuits are not in use. We show a more detailed application of our approach by designing a two-digit adder-subtractor circuit. We simulate the proposed ternary circuits using HSPICE via standard 32nm CN...
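As a minimal behavioral sketch of the tri-state ternary gates described above: ternary logic uses three levels {0, 1, 2}, and the textbook standard ternary inverter (STI) maps x to 2 - x. The high-impedance state is modeled here as a 'Z' output when a control signal disables the gate. This is the conventional logic-level encoding, not the paper's transistor-level CNFET design.

```python
def sti(x, enable=True):
    """Standard ternary inverter with a tri-state enable.

    Logic levels are {0, 1, 2}; output is 2 - x. When `enable` is
    False, the gate is disconnected and presents high impedance ('Z'),
    modeling the power-saving idle state described in the abstract.
    """
    if not enable:
        return "Z"   # high impedance: output floats, no static current
    if x not in (0, 1, 2):
        raise ValueError("ternary input must be 0, 1, or 2")
    return 2 - x
```

The paper's gates implement a function and its complement under one control signal; the sketch shows only the inverter half of that pairing.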

Research paper thumbnail of Probabilistic Monte Carlo simulations for static branch prediction

2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC)

Conditional branch instructions have a significant effect on microprocessor performance and throughput. Accurate branch prediction is crucial in reducing control hazards and improving microprocessor performance. Modern microprocessors accurately predict branch outcomes using advanced prediction techniques. Estimating branch misprediction rates accurately helps improve overall performance by saving CPU cycles and power. In general, we run application programs on cycle-accurate hardware simulators such as GEM5 [4] to collect branch prediction statistics. This method is time-consuming and not scalable. We present a novel Monte Carlo simulation framework that produces branch prediction rates statically, without actually running the application on the hardware. Our framework mimics the execution behavior of the real hardware. It uses one of three different branch prediction schemes to calculate branch prediction statistics. It also reports prediction rates for individual branches. Results suggest that the conditional prediction rates for four scientific applications are similar to the results from the GEM5 [4] simulator.
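The core idea of a Monte Carlo approach to branch prediction can be illustrated with a classic 2-bit saturating counter: instead of replaying a real trace in a cycle-accurate simulator, draw branch outcomes from a per-branch taken probability (assumed known from static analysis) and run them through the predictor. The specific predictor and sampling scheme below are illustrative, not the paper's exact framework.

```python
import random

def montecarlo_prediction_rate(taken_prob, trials=10_000, seed=0):
    """Estimate a 2-bit saturating counter's prediction rate by Monte Carlo.

    `taken_prob` is the (assumed) probability that the branch is taken
    on any given execution; outcomes are sampled i.i.d. from it.
    """
    rng = random.Random(seed)
    state = 2                      # 0..1 predict not-taken, 2..3 predict taken
    correct = 0
    for _ in range(trials):
        taken = rng.random() < taken_prob
        if (state >= 2) == taken:  # prediction matched the sampled outcome
            correct += 1
        # saturating counter update
        state = min(3, state + 1) if taken else max(0, state - 1)
    return correct / trials
```

A strongly biased branch (taken_prob near 0 or 1) is predicted almost perfectly, while a 50/50 branch hovers near a 50% prediction rate, matching the intuition behind per-branch prediction statistics.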

Research paper thumbnail of Exploring Energy-Efficient Ternary Inexact Multipliers Using CNT Transistors

Research paper thumbnail of Optimizing locality in graph computations using reuse distance profiles

2017 IEEE 36th International Performance Computing and Communications Conference (IPCCC), 2017

This work tries to answer the question of whether we should write code differently when the underlying chip microarchitecture is powered by a multicore processor. We use a set of three graph benchmarks, each with three different input problems varying in size and connectivity, to characterize the importance of how we partition the problem space among cores and how that partitioning can happen at multiple levels of the cache, leading to better performance. We explore a design space represented by different parallelization schemes and different graph partitionings. This provides a large and complex space that we characterize using detailed simulation results to see how much gain we can obtain over a baseline legacy parallelization technique with a partition sized to fit in the L1 cache. We show that the legacy parallelization is not the best alternative in most cases and that other parallelization techniques perform better. We use a PIN-computed reuse distance profile to build a...

Research paper thumbnail of DyAdHyTM: A Low Overhead Dynamically Adaptive Hybrid Transactional Memory on Big Data Graphs

ArXiv, 2017

Big data is a buzzword used to describe massive volumes of data that provide opportunities for exploring new insights through data analytics. However, big data is mostly structured but can be semi-structured or unstructured. It is normally so large that it is not only difficult but also slow to process using traditional computing systems. One of the solutions is to format the data as graph data structures and process them on shared memory architectures to use fast, novel policies such as transactional memory. In most graph applications in big data problems such as bioinformatics, social networks, and cybersecurity, graphs are sparse in nature. Due to this sparsity, we have the opportunity to use Transactional Memory (TM) as the synchronization policy for critical sections to speed up applications. At low conflict probability, TM performs better than most synchronization policies due to its inherent non-blocking characteristics. TM can be implemented in software, hardware, or a ...

Research paper thumbnail of Local memory store (LMStr): A hardware controlled shared scratchpad for multicores

2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), 2017

We present an on-chip memory store called the “Local Memory Store” (LMStr). The LMStr can be used with a regular cache hierarchy or solely as a redesigned scratchpad memory (SPM). The LMStr is a special kind of SPM shared among the cores in a multicore processor. The LMStr is hardware-controlled in terms of management of the store itself. Yet, compiler support is instrumental in deciding which data items/types should live in the store. Critical data should be stored in the LMStr according to its type (i.e., local, global, static, or temporary). The programmer can provide, at will, hints to the compiler to place certain data items in the LMStr. We evaluate our design using a matrix multiplication micro-application and multiple Mantevo mini-applications. Our results show that LMStr improves data movement by up to 21% compared to cache alone with a mere 3% area overhead. Moreover, LMStr improves cycles per memory access by up to 40%.

Research paper thumbnail of FPGA-Accelerated Decision Tree Classifier for Real-Time Supervision of Bluetooth SoC

2019 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2019

Wireless communication protocols are used in all smart devices and systems. This work proposes an FPGA-accelerated supervisory system that classifies the operation of a communication system-on-chip (SoC). In this work, the selected communication protocol is Bluetooth (BT). The input supply current to the transceiver block of the SoC is monitored and sampled at 50 kHz. We extract simple descriptive features from the transceiver input power signal and use them to train a machine learning (ML) model to classify two different BT operation modes. We implemented ADC sampling, feature extraction, and a real-time decision tree classifier on an Intel MAX 10 FPGA. The measured classification accuracy is 97.4%.
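A real-time decision tree classifier like the one above reduces, at inference time, to a handful of threshold comparisons, which is precisely why it maps well onto an FPGA. The sketch below shows the shape of such a tree; the feature names, thresholds, and class labels are hypothetical, since the abstract does not disclose the trained tree.

```python
def classify_bt_mode(mean_current_ma, peak_current_ma):
    """Illustrative hard-coded decision tree for two BT operation modes.

    All thresholds and labels here are made up for illustration; a real
    deployment would use the thresholds learned during training. In
    hardware, each `if` becomes a fixed-point comparator.
    """
    if peak_current_ma > 12.0:          # hypothetical threshold (mA)
        return "transmit"
    if mean_current_ma < 4.0:           # hypothetical threshold (mA)
        return "idle"
    return "transmit"
```

Because each node is a single compare, the whole tree evaluates in a few clock cycles, comfortably keeping up with the 50 kHz sampling rate.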

Research paper thumbnail of Spare block cache (SprBlk): Fault resilience and reliability at low voltages

2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), 2017

This paper proposes a novel cache architecture that uses spare cache blocks as backup blocks in a set-associative cache, which can operate reliably at voltages well below the manufacturing-induced minimum operating voltage (Vccmin). We detect errors in all cache lines at low voltage (i.e., persistent errors) and tag lines as either faulty or fault-free. At runtime, we bypass the faulty words using adder and shifter circuitry. Furthermore, we develop a fault model to find the cache set failure probability at low voltage. At 485 mV, the SprBlk cache operates with a 16.7% lower bit failure probability compared to a conventional cache operating at 782 mV. Additionally, SprBlk reduces power consumption by 1% when implemented in the L1 data cache only, by 75% when implemented in the L2 cache only, and by 76% when implemented in both caches.
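To make the "cache set failure probability" concrete, a rough i.i.d. binomial model can be sketched: a block is faulty if any of its bits fails, and a set fails when more blocks are faulty than there are spares to back them up. This simplification is illustrative only; the paper's fault model is finer-grained (it bypasses faulty words rather than whole blocks), and the parameter defaults below are assumptions.

```python
from math import comb

def set_failure_prob(p_bit, bits_per_block=512, blocks_per_set=8, spares=1):
    """Rough cache-set failure probability under i.i.d. bit faults.

    p_bit: per-bit failure probability at the given voltage.
    A block is faulty if any of its bits fails; the set fails when
    strictly more than `spares` blocks are faulty.
    """
    p_block = 1 - (1 - p_bit) ** bits_per_block
    # P(k faulty blocks) summed over all k that exceed the spare count
    return sum(
        comb(blocks_per_set, k)
        * p_block ** k
        * (1 - p_block) ** (blocks_per_set - k)
        for k in range(spares + 1, blocks_per_set + 1)
    )
```

As expected, the failure probability is zero for fault-free bits and grows monotonically with p_bit, which is the shape of trade-off the paper's voltage-vs-reliability numbers quantify.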