Magnus Själander | Norwegian University of Science and Technology

Papers by Magnus Själander

Research paper thumbnail of Poster: Approximation: A New Paradigm also for Wireless Sensing

While for sensor networks energy-efficiency has always been one of the major design goals, energy-efficiency has, for the past decade, also become increasingly important in other disciplines. An emerging class of applications does not require a perfect result; rather, an approximate result is sufficient. Approximate computing enables more energy-efficient design of computer systems, reducing, for example, the energy dissipation in data centers. In this poster abstract we argue for embracing the concept of approximate computing in the sensor networking community.

Research paper thumbnail of Energy Efficient Computing (EEC14)

Research paper thumbnail of A Power-Efficient and Versatile Modified-Booth Multiplier

Research paper thumbnail of Selectively Delaying Instructions to Prevent Microarchitectural Replay Attacks

ArXiv, 2021

MicroScope, and microarchitectural replay attacks in general, take advantage of the characteristics of speculative execution to trap the execution of the victim application in an infinite loop, enabling the attacker to amplify a side-channel attack by executing it indefinitely. Due to the nature of the replay, it can be used to effectively attack security-critical trusted execution environments (secure enclaves), even under conditions where a side-channel attack would not be possible. At the same time, unlike speculative side-channel attacks, MicroScope can be used to amplify the correct path of execution, rendering many existing speculative side-channel defences ineffective. In this work, we generalize microarchitectural replay attacks beyond MicroScope and present an efficient defence against them. We make the observation that such attacks rely on repeated squashes of so-called “replay handles” and that the instructions causing the side-channel must reside in the same reorder buff...

Research paper thumbnail of FlexTools: Design Space Exploration Tool Chain from C to Physical Implementation

The complexity of hardware-software co-design continues to grow despite the relentless efforts of the EDA community. This makes the task of producing an optimal, yet functionally correct, design even more challenging. To make the situation worse, the applications that a particular design is optimized for play a vital role in determining the competitiveness of the design. It is therefore imperative that a generic design is tailored prior to manufacturing. Encouragingly, as design complexity increases, innovative methods of alleviating these problems evolve. In this paper, we address the issues related to hardware adaptation for a specific suite of applications.

Research paper thumbnail of Redesigning a tagless access buffer to require minimal ISA changes

2016 International Conference on Compilers, Architectures, and Synthesis for Embedded Systems (CASES), 2016

Energy efficiency is a first-order design goal for nearly all classes of processors, but it is particularly important in mobile and embedded systems. Data caches in such systems account for a large portion of the processor's energy usage, and thus techniques to improve the energy efficiency of the cache hierarchy are likely to have high impact. Our prior work reduced data cache energy via a tagless access buffer (TAB) that sits at the top of the cache hierarchy. Strided memory references are redirected from the level-one data cache (L1D) to the smaller, more energy-efficient TAB. These references need not access the data translation lookaside buffer (DTLB), and they can avoid unnecessary transfers from lower levels of the memory hierarchy. The original TAB implementation requires changing the immediate field of load and store instructions, necessitating substantial ISA modifications. Here we present a new TAB design that requires minimal instruction set changes, gives software m...

Research paper thumbnail of Area Efficient High Speed and Low Power MAC Unit

With the growing importance of electronic products in day-to-day life, the need for portable electronics with low power consumption is increasing. In this paper, an area-efficient, high-speed, and low-power multiply-accumulate (MAC) unit with a carry look-ahead adder (CLA) as the final adder is designed. In the same MAC architecture, a carry-save adder (CSA), a carry-select adder (CSLA), and a carry-skip adder (CSKPA) are also used in the final adder stage in place of the CLA to compare power and performance. These MAC designs were simulated and synthesized using Xilinx 8.1. The simulation results show that the MAC design with the CLA reduces area by 16.7%, increases speed by 1.95%, and reduces power consumption by 0.5%.
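The final-adder idea can be illustrated with a behavioural Python sketch (our own model for illustration, not the synthesized design): a carry look-ahead adder derives all carries from per-bit generate/propagate signals, and the MAC feeds the product into it as the final addition.

```python
def cla_add(a, b, width=16):
    """Behavioural model of a carry look-ahead adder (CLA)."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]  # generate
    p = [((a >> i) & 1) | ((b >> i) & 1) for i in range(width)]  # propagate
    c = [0] * (width + 1)
    for i in range(width):
        # In hardware these carries are produced in parallel from the
        # g/p signals; here we evaluate the recurrence serially.
        c[i + 1] = g[i] | (p[i] & c[i])
    s = 0
    for i in range(width):
        s |= (((a >> i) & 1) ^ ((b >> i) & 1) ^ c[i]) << i
    return s  # carry-out beyond 'width' is discarded

def mac(acc, x, y, width=16):
    """Multiply-accumulate with the CLA as the final adder."""
    return cla_add(acc, (x * y) & ((1 << width) - 1), width)
```

Swapping `cla_add` for a model of a CSA, CSLA, or CSKPA final adder would mirror the paper's comparison.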

Research paper thumbnail of An Efficient FFT Engine Based on Twin-Precision Computation

Applications for Systems-on-Chip have very different requirements regarding operand precision. We propose a twin-precision FFT engine that can efficiently perform either the common full-precision operation or two simultaneous and independent half-precision operations in the same hardware block. We provide an evaluation of energy, delay, and area for synthesized 0.13-µm butterfly circuits, and show how many lower-precision operations are needed for the twin-precision FFT butterfly to perform better than a conventional, dedicated FFT butterfly.
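The twin-precision principle can be sketched with a toy Python model (parameter names and widths are ours, for illustration only): one full-precision multiplier produces two independent half-precision products in a single pass. In hardware the two products share one partial-product array with the cross-term partial products gated off; the model below simply computes the halves separately and packs them side by side.

```python
HALF = 8  # half-precision operand width in bits (illustrative)

def twin_precision_mul(a_hi, a_lo, b_hi, b_lo):
    """One 'full-precision' pass yielding two independent
    half-precision products, packed into a single result word."""
    mask = (1 << (2 * HALF)) - 1
    p_lo = (a_lo * b_lo) & mask
    p_hi = (a_hi * b_hi) & mask
    return (p_hi << (2 * HALF)) | p_lo

def unpack(p):
    """Split the packed result into (high product, low product)."""
    mask = (1 << (2 * HALF)) - 1
    return (p >> (2 * HALF)) & mask, p & mask
```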

Research paper thumbnail of "It's a Trap!" - How Speculation Invariance Can Be Abused with Forward Speculative Interference

ArXiv, 2021

Side-channel attacks based on speculative execution access sensitive data and use transmitters to leak such data during wrong-path execution. Speculative side-channel defenses have been proposed to prevent such information leakage. In one class of defenses, speculative instructions are considered unsafe and are delayed until they become non-speculative. However, not all speculative instructions are unsafe: recent work demonstrates that speculation-invariant instructions are independent of a speculative control-flow path and are guaranteed to eventually execute and commit, regardless of the outcome of the performed speculation. Compile-time information coupled with run-time mechanisms can then selectively lift defenses for speculation-invariant instructions, regaining some of the performance lost to “delay” defenses. Unfortunately, speculative invariance can be easily mishandled with speculative interference to leak information through a new side-channel that we introduce in this paper....

Research paper thumbnail of On Value Recomputation to Accelerate Invisible Speculation

Recent architectural approaches that address speculative side-channel attacks aim to prevent software from exposing the microarchitectural state changes of transient execution. The Delay-on-Miss technique is one such approach, which simply delays loads that miss in the L1 cache until they become non-speculative, resulting in no transient changes in the memory hierarchy. However, this costs performance, prompting the use of value prediction (VP) to regain some of the lost performance. However, the problem cannot be solved by simply introducing a new kind of speculation (value prediction). Value-predicted loads have to be validated, and validation cannot commence until the load becomes non-speculative. Thus, value-predicted loads occupy the same amount of precious core resources (e.g., reorder buffer entries) as Delay-on-Miss. The end result is that VP yields only marginal benefits over Delay-on-Miss. In this paper, our insight is that we can achieve the same goal as VP (increasing performance by pro...
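The Delay-on-Miss decision rule described in the abstract can be summarised in a few lines of illustrative Python (a sketch of the policy as stated above, not the authors' implementation):

```python
def delay_on_miss(is_speculative, l1_hit):
    """Speculative loads that hit in the L1 may proceed, since no new
    state is installed in the memory hierarchy; speculative misses are
    delayed until the load becomes non-speculative."""
    if not is_speculative:
        return "issue"  # non-speculative loads always issue
    return "issue" if l1_hit else "delay"
```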

Research paper thumbnail of Exposed Datapath for Efficient Computing

We introduce FlexCore, the first exemplar of a processor based on the FlexSoC processor paradigm. The FlexCore utilizes an exposed datapath for increased performance. Microbenchmarks yield a performance boost of a factor of two over a traditional five-stage pipeline with the same functional units as the FlexCore. We describe our approach to compiling for the FlexCore. A flexible interconnect allows the FlexCore datapath to be dynamically reconfigured as a consequence of code generation. Additionally, specialized functional units may be introduced and utilized within the same architecture and compilation framework. The exposed datapath requires a wide control word. Our evaluation of two microbenchmarks confirms that this increases the instruction bandwidth and memory footprint, which calls for efficient instruction decoding, as proposed in the FlexSoC paradigm.

Research paper thumbnail of Efficient Reconfigurable Multipliers Based on the Twin-Precision Technique

During the last decade of integrated electronic design, ever more functionality has been integrated onto the same chip, paving the way for having a whole system on a single chip. The drive for ever more functionality increases the demands on circuit designers, who have to provide the foundation for all this functionality. The desire for increased functionality, and an associated capability to adapt to changing requirements, has led to the design of reconfigurable architectures. With the increased interest in and use of reconfigurable architectures, there is a need for flexible and reconfigurable computational units that can meet the demands of high speed, high throughput, low power, and area efficiency. Multiplications are complex to implement, and they continue to give designers headaches when trying to efficiently implement multipliers in hardware. Multipliers are therefore interesting to study when investigating how to design flexible and reconfigurable computational units. In this the...

Research paper thumbnail of Improving Error-Resilience of Emerging Multi-Value Technologies

There exist extensive ongoing research efforts on emerging technologies that have the potential to become an alternative to today's CMOS technologies. A common feature among the investigated techno ...

Research paper thumbnail of Efficient and Flexible Embedded Systems and Datapath Components

The comfort of our daily lives has come to rely on a vast number of embedded systems, such as mobile phones, anti-spin systems for cars, and high-definition video. Improving the end-user experience under often stringent requirements, in terms of high performance, low power dissipation, and low cost, makes these systems complex and nontrivial to design. This thesis addresses design challenges in three different areas of embedded systems. The presented FlexCore processor intends to improve the programmability of heterogeneous embedded systems while maintaining the performance of application-specific accelerators. This is achieved by integrating accelerators into the datapath of a general-purpose processor, in combination with a wide control word consisting of all control signals in a FlexCore's datapath. Furthermore, a FlexCore processor utilizes a flexible interconnect, which, together with the expressiveness of the wide control word, improves its performance. When designing new embedde...

Research paper thumbnail of Twig: Multi-Agent Task Management for Colocated Latency-Critical Cloud Services

2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)

Many of the important services running in data centres are latency-critical, time-varying, and demand strict user satisfaction. Stringent tail-latency targets for colocated services and increasing system complexity make it challenging to reduce the power consumption of data centres. Data centres typically sacrifice server efficiency to maintain tail-latency targets, resulting in an increased total cost of ownership. This paper introduces Twig, a scalable quality-of-service (QoS)-aware task manager for latency-critical services co-located on a server system. Twig leverages deep reinforcement learning to characterise tail latency using hardware performance counters and to drive energy-efficient task management decisions in data centres. We evaluate Twig on a typical data centre server managing four widely used latency-critical services. Our results show that Twig outperforms prior work, reducing energy usage by up to 38% while achieving up to a 99% QoS guarantee for latency-critical services.

Research paper thumbnail of Clearing the Shadows

Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques

Out-of-order processors rely heavily on speculation to achieve high performance, allowing instructions to bypass other, slower instructions in order to fully utilize the processor's resources. Speculatively executed instructions do not affect the correctness of the application, as they never change the architectural state, but they do affect the micro-architectural behavior of the system. Until recently, these changes were considered safe, but with the discovery of new security attacks that misuse speculative execution to leak secret information through observable micro-architectural changes (so-called side-channels), this is no longer the case. To solve this issue, a wave of software and hardware mitigations have been proposed, the majority of which delay and/or hide speculative execution until it is deemed safe, trading performance for security. These newly enforced restrictions change how speculation is applied and where the performance bottlenecks appear, forcing us to rethink how we design and optimize both the hardware and the software. We observe that many of the state-of-the-art hardware solutions targeting memory systems operate on a common scheme: the visible execution of loads or their dependents is blocked until they become safe to execute. In this work we propose a generally applicable hardware-software extension that focuses on removing the causes of loads' unsafety, which generally stem from control and memory dependence speculation. As a result, we make more loads safe to execute at an early stage, which enables us to schedule more loads at a time, overlap their delays, and improve performance. We apply our techniques to the state-of-the-art Delay-on-Miss hardware defense and show that we reduce the performance gap to the unsafe baseline by 53% on average.

Research paper thumbnail of Ghost loads

Proceedings of the 16th ACM International Conference on Computing Frontiers

Speculative execution is necessary for achieving high performance on modern general-purpose CPUs but, starting with Spectre and Meltdown, it has also been proven to cause severe security flaws. In the case of a misspeculation, the architectural state is restored to ensure functional correctness, but a multitude of microarchitectural changes (e.g., cache updates) caused by the speculatively executed instructions are commonly left in the system. These changes can be used to leak sensitive information, which has led to a frantic search for solutions that can eliminate such security flaws. The contribution of this work is an evaluation of the cost of hiding speculative side-effects in the cache hierarchy, making them visible only after the speculation has been resolved. For this, we compare (for the first time) two broad approaches: i) waiting for loads to become non-speculative before issuing them to the memory system, and ii) eliminating the side-effects of speculation, a solution consisting of invisible loads (Ghost loads) and performance optimizations (Ghost Buffer and Materialization). While previous work, InvisiSpec, has proposed a solution similar to our latter approach, it has done so with only a minimal evaluation and at a significant performance cost. The detailed evaluation of our solutions shows that: i) waiting for loads to become non-speculative is no more costly than the previously proposed InvisiSpec solution, albeit much simpler, non-invasive in the memory system, and stronger security-wise; ii) hiding speculation with Ghost loads (in the context of a relaxed memory model) can be achieved at the cost of 12% performance degradation and a 9% energy increase, which is significantly better than the previous state-of-the-art solution.

Research paper thumbnail of Techniques for modulating error resilience in emerging multi-value technologies

Proceedings of the ACM International Conference on Computing Frontiers - CF '16, 2016

There exist extensive ongoing research efforts on emerging atomic-scale technologies that have the potential to become an alternative to today's CMOS technologies. A common feature among the investigated technologies is that of multi-value devices, in particular the possibility of implementing quaternary logic and memory. However, multi-value devices tend to be more sensitive to interference and thus have reduced error resilience. We present an architecture based on multi-value devices where we can trade energy efficiency against error resilience. Important data are encoded in a more robust binary format, while error-tolerant data are encoded in a quaternary format. We show, for eight benchmarks, an average energy reduction of 14%, 20%, and 32% for the register file, level-one data cache, and main memory, respectively, and for three integer benchmarks, an energy reduction for arithmetic operations of up to 28%. We also show that for a quaternary technology to be viable, a raw bit error rate of one error in 100 million or better is required.

Research paper thumbnail of Static Instruction Scheduling for High Performance on Limited Hardware

IEEE Transactions on Computers

Complex out-of-order (OoO) processors have been designed to tolerate outstanding long-latency misses, at the cost of increased energy consumption. Simple, limited OoO processors are a compromise in terms of energy consumption and performance, as they have fewer hardware resources to tolerate the penalties of long-latency loads. In the worst case, these loads may stall the processor entirely. We present Clairvoyance, a compiler-based technique that generates code able to hide memory latency and better utilize simple OoO processors. By clustering loads found across basic-block boundaries, Clairvoyance overlaps the outstanding latencies to increase memory-level parallelism. We show that these simple OoO processors, equipped with the appropriate compiler support, can effectively hide long-latency loads and achieve performance improvements for memory-bound applications. To this end, Clairvoyance tackles (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure. Clairvoyance achieves a geomean execution-time improvement of 14% for memory-bound applications, on top of standard -O3 optimizations, while maintaining the high performance of compute-bound applications.
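As a toy illustration of the load-clustering idea (a simplified model of our own, not the Clairvoyance compiler pass), loads whose operands are not produced by earlier, non-hoisted instructions can be moved to the front of a straight-line region so their miss latencies overlap:

```python
def hoist_independent_loads(instrs):
    """instrs: list of (op, dst, deps) tuples in program order.
    Loads whose dependencies are not produced by an earlier
    non-hoisted instruction move to the front; relative order is
    otherwise preserved, so dependences remain intact."""
    pending = set()            # values produced by non-hoisted instrs
    hoisted, rest = [], []
    for op, dst, deps in instrs:
        if op == "load" and not (set(deps) & pending):
            hoisted.append((op, dst, deps))   # miss latency can overlap
        else:
            rest.append((op, dst, deps))
            pending.add(dst)
    return hoisted + rest
```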

Research paper thumbnail of Practical Way Halting by Speculatively Accessing Halt Tags

Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016

Conventional set-associative data cache accesses waste energy, since the tag and data arrays of several ways are accessed simultaneously to sustain pipeline speed. Different access techniques that avoid activating all cache ways have previously been proposed in an effort to reduce energy usage. However, a problem that many of these access techniques have in common is that they need to access different cache memory portions in a sequential manner, which is difficult to support with standard synchronous SRAM. We propose the speculative halt-tag access (SHA) approach, which accesses the low-order tag bits, i.e., the halt tag, in the address-generation stage instead of the SRAM-access stage to eliminate accesses to cache ways that cannot possibly contain the data. The key feature of our SHA approach is that it determines which tag and data arrays need to be accessed early enough for conventional SRAMs to be used. We evaluate the SHA approach using a 65-nm processor implementation running MiBench benchmarks and find that it reduces data access energy by 25.6% on average.
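The way-filtering step can be sketched as follows (field widths and names are illustrative assumptions, not those of the 65-nm implementation): the low-order halt-tag bits of the access address are compared against the halt tags stored per way, and only matching ways are activated.

```python
HALT_BITS = 4    # low-order tag bits used as the halt tag (illustrative)
TAG_SHIFT = 12   # bit position where the tag field starts (illustrative)

def ways_to_enable(addr, stored_halt_tags):
    """Only ways whose stored halt tag matches the halt tag of the
    access address can possibly hold the data; all other ways are
    halted (not activated) before the SRAM access stage."""
    halt = (addr >> TAG_SHIFT) & ((1 << HALT_BITS) - 1)
    return [w for w, t in enumerate(stored_halt_tags) if t == halt]
```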

Research paper thumbnail of Poster: Approximation: A New Paradigm also for Wireless Sensing

While for sensor networks energy-efficiency has always been one of the major design goals, energy... more While for sensor networks energy-efficiency has always been one of the major design goals, energy-efficiency has, for the past decade, also become increasingly important in other disciplines. An emerging class of applications do not require a perfect result. Rather, an approximate result is sufficient. Approximate computing enables more energyefficient design of computer systems, reducing, for example, the energy dissipation in data centers. In this poster abstract we argue for embracing the concept of approximation computing in the sensor networking community.

Research paper thumbnail of Energy Efficient Computing (EEC14)

Research paper thumbnail of A Power-Efficient and Versatile Modified-Booth Multiplier

Research paper thumbnail of Selectively Delaying Instructions to Prevent Microarchitectural Replay Attacks

ArXiv, 2021

MicroScope, and microarchitectural replay attacks in general, take advantage of the characteristi... more MicroScope, and microarchitectural replay attacks in general, take advantage of the characteristics of speculative execution to trap the execution of the victim application in an infinite loop, enabling the attacker to amplify a side-channel attack by executing it indefinitely. Due to the nature of the replay, it can be used to effectively attack security critical trusted execution environments (secure enclaves), even under conditions where a side-channel attack would not be possible. At the same time, unlike speculative side-channel attacks, MicroScope can be used to amplify the correct path of execution, rendering many existing speculative side-channel defences ineffective. In this work, we generalize microarchitectural replay attacks beyond MicroScope and present an efficient defence against them. We make the observation that such attacks rely on repeated squashes of so-called “replay handles” and that the instructions causing the side-channel must reside in the same reorder buff...

Research paper thumbnail of FlexTools: Design Space Exploration Tool Chain from C to Physical Implementation

The complexity of the hardware-software co-design continues to grow despite the relentless effort... more The complexity of the hardware-software co-design continues to grow despite the relentless efforts of the EDA community. This makes the task of producing an optimal, yet functionally correct design even more challenging. To make the situation worse, the applications that a particular design is optimized for plays a vital role in determining the competitiveness of the design. It is therefore imperative that a generic design is tailored prior to manufacturing. Encouragingly, as the design complexity increases, the innovative methods of alleviating these problems evolve. In this paper, we address the issues related to hardware adaptation for a specific suite of applications.

Research paper thumbnail of Redesigning a tagless access buffer to require minimal ISA changes

2016 International Conference on Compliers, Architectures, and Sythesis of Embedded Systems (CASES), 2016

Energy efficiency is a first-order design goal for nearly all classes of processors, but it is pa... more Energy efficiency is a first-order design goal for nearly all classes of processors, but it is particularly important in mobile and embedded systems. Data caches in such systems account for a large portion of the processor's energy usage, and thus techniques to improve the energy efficiency of the cache hierarchy are likely to have high impact. Our prior work reduced data cache energy via a tagless access buffer (TAB) that sits at the top of the cache hierarchy. Strided memory references are redirected from the level-one data cache (L1D) to the smaller, more energy-efficient TAB. These references need not access the data translation lookaside buffer (DTLB), and they can avoid unnecessary transfers from lower levels of the memory hierarchy. The original TAB implementation requires changing the immediate field of load and store instructions, necessitating substantial ISA modifications. Here we present a new TAB design that requires minimal instruction set changes, gives software m...

Research paper thumbnail of Area Efficient High Speed and Low Power MAC Unit

With the growing importance of electronic products in day-to-day life, the need for portable elec... more With the growing importance of electronic products in day-to-day life, the need for portable electronic products with low power consumption largely increases. In this paper, an area efficient high speed and low power Multiply Accumulator unit (MAC) with carry look-ahead adder (CLA) as final adder is being designed. In the same MAC architecture design in final adder stage of partial product unit the carry save adder(CSA), carry select adder(CSLA) and carry skip adder(CSKPA) are also used instead of CLA to compare the power and performance. These MAC designs were simulated and synthesized using Xilinx 8. 1. The simulation result shows that the MAC design with CLA has area reducing by 16. 7%, speed increase by 1. 95% and the consumed power reducing by 0. 5%.

Research paper thumbnail of An Efficient FFT Engine Based on Twin-Precision Computation

Applications for System-on-Chips have very differ- ent requirements regarding operand precision. ... more Applications for System-on-Chips have very differ- ent requirements regarding operand precision. We propose a twin-precision FFT engine that can efficiently perform either the common full-precision operation or two simultaneous and independent half-precision operations in the same hardware block. We provide an evaluation on energy, delay and area for synthesized 0.13-"m butterfly circuits, and show how many lower- precision operations that are needed for the twin-precision FFT butterfly to perform better than a conventional, dedicated FFT butterfly.

Research paper thumbnail of It's a Trap!"-How Speculation Invariance Can Be Abused with Forward Speculative Interference

ArXiv, 2021

Side-channel attacks based on speculative execution access sensitive data and use transmitters to... more Side-channel attacks based on speculative execution access sensitive data and use transmitters to leak such data during wrongpath execution. Speculative side-channel defenses have been proposed to prevent such information leakage. In one class of defenses, speculative instructions are considered unsafe and are delayed until they become non-speculative. However, not all speculative instructions are unsafe: Recent work demonstrates that speculative invariant instructions are independent of a speculative control-flow path and are guaranteed to eventually execute and commit, regardless of the outcome of the performed speculation. Compile time information coupled with run-time mechanisms can then selectively lift defenses for Speculative Invariant instructions, regaining some of the performance lost to “delay” defenses. Unfortunately, speculative invariance can be easily mishandled with Speculative Interference to leak information using a new side-channel that we introduce in this paper....

Research paper thumbnail of On Value Recomputation to Accelerate Invisible Speculation

Recent architectural approaches that address speculative side-channel attacks aim to prevent soft... more Recent architectural approaches that address speculative side-channel attacks aim to prevent software from exposing the microarchitectural state changes of transient execution. The Delay-on-Miss technique is one such approach, which simply delays loads that miss in the L1 cache until they become non-speculative, resulting in no transient changes in the memory hierarchy. However, this costs performance, prompting the use of value prediction (VP) to regain some of the delay. However, the problem cannot be solved by simply introducing a new kind of speculation (value prediction). Value-predicted loads have to be validated, which cannot be commenced until the load becomes non-speculative. Thus, value-predicted loads occupy the same amount of precious core resources (e.g., reorder buffer entries) as Delay-on-Miss. The end result is that VP only yields marginal benefits over Delay-on-Miss. In this paper, our insight is that we can achieve the same goal as VP (increasing performance by pro...

Research paper thumbnail of Exposed Datapath for Efficient Computing

We introduce FlexCore, which is the first exemplar of a processor based on the FlexSoC processor ... more We introduce FlexCore, which is the first exemplar of a processor based on the FlexSoC processor paradigm. The FlexCore utilizes an exposed datapath for increased performance. Microbenchmarks yield a performance boost of a factor of two over a traditional five-stage pipeline with the same functional units as the FlexCore. We describe our approach to compiling for the FlexCore. A flexible interconnect allows the FlexCore datapath to be dynamically reconfigured as a consequence of code generation. Additionally, specialized functional units may be introduced and utilized within the same architecture and compilation framework. The exposed datapath requires a wide control word. The conducted evaluation of two micro benchmarks confirms that this increases the instruction bandwidth and memory footprint. This calls for an efficient instruction decoding as proposed in the FlexSoC paradigm.

Research paper thumbnail of Efficient Reconfigurable Multipliers Based on the Twin-Precision Technique

During the last decade of integrated electronic design ever more functionality has been integrate... more During the last decade of integrated electronic design ever more functionality has been integrated onto the same chip, paving the way for having a whole system on a single chip. The strive for ever more functionality increases the demands on circuit designers that have to provide the foundation for all this functionality. The desire for increased functionality and an associated capability to adapt to changing requirements, has led to the design of reconfigurable architectures. With an increased interest and use of reconfigurable architectures there is a need for flexible and reconfigurable computational units that can meet the demands of high speed, high throughput, low power, and area efficiency. Multiplications are complex to implement and they continue to give designers headaches when trying to efficiently implement multipliers in hardware. Multipliers are therefore interesting to study, when investigating how to design flexible and reconfigurable computational units. In this the...

Research paper thumbnail of Improving Error-Resilience of Emerging Multi-Value Technologies

There exist extensive ongoing research efforts on emerging technologies that have the potential to become an alternative to today’s CMOS technologies. A common feature among the investigated techno ...

Research paper thumbnail of Efficient and Flexible Embedded Systems and Datapath Components

The comfort of our daily lives has come to rely on a vast number of embedded systems, such as mobile phones, anti-spin systems for cars, and high-definition video. Improving the end-user experience under often stringent requirements, in terms of high performance, low power dissipation, and low cost, makes these systems complex and nontrivial to design. This thesis addresses design challenges in three different areas of embedded systems. The presented FlexCore processor intends to improve the programmability of heterogeneous embedded systems while maintaining the performance of application-specific accelerators. This is achieved by integrating accelerators into the datapath of a general-purpose processor in combination with a wide control word consisting of all control signals in a FlexCore’s datapath. Furthermore, a FlexCore processor utilizes a flexible interconnect, which together with the expressiveness of the wide control word improves its performance. When designing new embedde...

Research paper thumbnail of Twig: Multi-Agent Task Management for Colocated Latency-Critical Cloud Services

2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)

Many of the important services running on data centres are latency-critical, time-varying, and demand strict user satisfaction. Stringent tail-latency targets for colocated services and increasing system complexity make it challenging to reduce the power consumption of data centres. Data centres typically sacrifice server efficiency to maintain tail-latency targets, resulting in an increased total cost of ownership. This paper introduces Twig, a scalable quality-of-service (QoS)-aware task manager for latency-critical services co-located on a server system. Twig successfully leverages deep reinforcement learning to characterise tail latency using hardware performance counters and to drive energy-efficient task management decisions in data centres. We evaluate Twig on a typical data centre server managing four widely used latency-critical services. Our results show that Twig outperforms prior works in reducing energy usage by up to 38% while achieving up to a 99% QoS guarantee for latency-critical services.

Research paper thumbnail of Clearing the Shadows

Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques

Out-of-order processors heavily rely on speculation to achieve high performance, allowing instructions to bypass other, slower instructions in order to fully utilize the processor's resources. Speculatively executed instructions do not affect the correctness of the application, as they never change the architectural state, but they do affect the micro-architectural behavior of the system. Until recently, these changes were considered safe, but with the discovery of new security attacks that misuse speculative execution to leak secret information through observable micro-architectural changes (so-called side channels), this is no longer the case. To solve this issue, a wave of software and hardware mitigations have been proposed, the majority of which delay and/or hide speculative execution until it is deemed safe, trading performance for security. These newly enforced restrictions change how speculation is applied and where the performance bottlenecks appear, forcing us to rethink how we design and optimize both the hardware and the software. We observe that many of the state-of-the-art hardware solutions targeting memory systems operate on a common scheme: the visible execution of loads or their dependents is blocked until they become safe to execute. In this work we propose a generally applicable hardware-software extension that focuses on removing the causes that make loads unsafe, generally control and memory dependence speculation. As a result, we manage to make more loads safe to execute at an early stage, which enables us to schedule more loads at a time to overlap their delays and improve performance. We apply our techniques on the state-of-the-art Delay-on-Miss hardware defense and show that we reduce the performance gap to the unsafe baseline by 53% (on average).

Research paper thumbnail of Ghost loads

Proceedings of the 16th ACM International Conference on Computing Frontiers

Speculative execution is necessary for achieving high performance on modern general-purpose CPUs but, starting with Spectre and Meltdown, it has also been proven to cause severe security flaws. In case of a misspeculation, the architectural state is restored to assure functional correctness, but a multitude of microarchitectural changes (e.g., cache updates), caused by the speculatively executed instructions, are commonly left in the system. These changes can be used to leak sensitive information, which has led to a frantic search for solutions that can eliminate such security flaws. The contribution of this work is an evaluation of the cost of hiding speculative side-effects in the cache hierarchy, making them visible only after the speculation has been resolved. For this, we compare (for the first time) two broad approaches: i) waiting for loads to become non-speculative before issuing them to the memory system, and ii) eliminating the side-effects of speculation, a solution consisting of invisible loads (Ghost loads) and performance optimizations (Ghost Buffer and Materialization). While previous work, InvisiSpec, has proposed a similar solution to our latter approach, it has done so with only a minimal evaluation and at a significant performance cost. The detailed evaluation of our solutions shows that: i) waiting for loads to become non-speculative is no more costly than the previously proposed InvisiSpec solution, albeit much simpler, non-invasive in the memory system, and stronger security-wise; ii) hiding speculation with Ghost loads (in the context of a relaxed memory model) can be achieved at the cost of 12% performance degradation and a 9% energy increase, which is significantly better than the previous state-of-the-art solution.
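As a rough illustration of the second approach, the following toy model (a sketch under our own assumptions; the class and method names are hypothetical, not taken from the paper) keeps speculative fills in a side buffer and only materializes them into the cache on commit:

```python
class GhostCache:
    """Toy model of invisible speculative loads: speculative fills go to a
    side buffer (the 'Ghost Buffer') and are materialized into the cache
    only when the load commits; a squash leaves no visible trace."""

    def __init__(self):
        self.cache = set()  # committed, architecturally visible lines
        self.ghost = set()  # speculative lines, invisible to side channels

    def speculative_load(self, line):
        if line not in self.cache:
            self.ghost.add(line)  # no visible cache update yet

    def commit(self, line):
        # Speculation resolved as correct: make the side effect visible.
        self.ghost.discard(line)
        self.cache.add(line)

    def squash(self, line):
        # Misspeculation: discard the fill; the cache state is unchanged.
        self.ghost.discard(line)
```

In this model an attacker probing the cache after a squash sees exactly the pre-speculation state, which is the property the Ghost-loads approach aims for.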

Research paper thumbnail of Techniques for modulating error resilience in emerging multi-value technologies

Proceedings of the ACM International Conference on Computing Frontiers - CF '16, 2016

There exist extensive ongoing research efforts on emerging atomic-scale technologies that have the potential to become an alternative to today's CMOS technologies. A common feature among the investigated technologies is that of multi-value devices, in particular, the possibility of implementing quaternary logic and memory. However, multi-value devices tend to be more sensitive to interferences and, thus, have reduced error resilience. We present an architecture based on multi-value devices where we can trade energy efficiency against error resilience. Important data are encoded in a more robust binary format while error-tolerant data are encoded in a quaternary format. We show for eight benchmarks an average energy reduction of 14%, 20%, and 32% for the register file, level-one data cache, and main memory, respectively, and for three integer benchmarks, an energy reduction for arithmetic operations of up to 28%. We also show that for a quaternary technology to be viable, a raw bit error rate of one error in 100 million or better is required.
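The binary/quaternary trade-off can be sketched numerically: a quaternary cell stores two bits, so error-tolerant data needs half the cells, while important data pays for the more robust binary encoding with more cells. The sketch below is illustrative only; the function names and the encoding-choice API are our own, not the paper's.

```python
def to_digits(value, base):
    """Decompose a non-negative integer into base-`base` digits (LSB first)."""
    digits = []
    while value:
        digits.append(value % base)
        value //= base
    return digits or [0]

def storage_cells(value, robust):
    """Hypothetical encoding choice: robust (important) data is stored in
    binary cells, error-tolerant data in denser quaternary cells."""
    base = 2 if robust else 4
    return len(to_digits(value, base))

# A 16-bit value occupies 16 binary cells but only 8 quaternary cells,
# which is where the density (and energy) advantage comes from.
```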

Research paper thumbnail of Static Instruction Scheduling for High Performance on Limited Hardware

IEEE Transactions on Computers

Complex out-of-order (OoO) processors have been designed to overcome the restrictions of outstanding long-latency misses at the cost of increased energy consumption. Simple, limited OoO processors are a compromise in terms of energy consumption and performance, as they have fewer hardware resources to tolerate the penalties of long-latency loads. In the worst case, these loads may stall the processor entirely. We present Clairvoyance, a compiler-based technique that generates code able to hide memory latency and better utilize simple OoO processors. By clustering loads found across basic-block boundaries, Clairvoyance overlaps the outstanding latencies to increase memory-level parallelism. We show that these simple OoO processors, equipped with the appropriate compiler support, can effectively hide long-latency loads and achieve performance improvements for memory-bound applications. To this end, Clairvoyance tackles (i) statically unknown dependencies, (ii) insufficient independent instructions, and (iii) register pressure. Clairvoyance achieves a geomean execution-time improvement of 14% for memory-bound applications, on top of standard O3 optimizations, while maintaining the high performance of compute-bound applications.
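The intuition behind clustering loads can be sketched with a back-of-the-envelope latency model (all numbers and function names are illustrative assumptions, not measurements from the paper):

```python
import math

MISS_LATENCY = 100  # cycles per long-latency miss (assumed, for illustration)

def serial_stall(n_loads):
    """Loads issued one at a time on a simple core: miss latencies add up."""
    return n_loads * MISS_LATENCY

def clustered_stall(n_loads, outstanding):
    """Clairvoyance-style clustering: up to `outstanding` independent loads
    are hoisted together, so their misses overlap and are paid in batches."""
    return math.ceil(n_loads / outstanding) * MISS_LATENCY
```

With four independent misses and four outstanding-miss slots, the clustered schedule pays one miss latency instead of four, which is the memory-level parallelism the compiler tries to expose.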

Research paper thumbnail of Practical Way Halting by Speculatively Accessing Halt Tags

Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016

Conventional set-associative data cache accesses waste energy since tag and data arrays of several ways are simultaneously accessed to sustain pipeline speed. Different access techniques to avoid activating all cache ways have been previously proposed in an effort to reduce energy usage. However, a problem that many of these access techniques have in common is that they need to access different cache memory portions in a sequential manner, which is difficult to support with standard synchronous SRAM memory. We propose the speculative halt-tag access (SHA) approach, which accesses the low-order tag bits, i.e., the halt tag, in the address-generation stage instead of the SRAM-access stage to eliminate accesses to cache ways that cannot possibly contain the data. The key feature of our SHA approach is that it determines which tag and data arrays need to be accessed early enough for conventional SRAMs to be used. We evaluate the SHA approach using a 65-nm processor implementation running MiBench benchmarks and find that it reduces data access energy by 25.6% on average.
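The way-halting idea behind SHA can be sketched as follows: compare a few low-order tag bits (the halt tag) early, and activate only the ways that still match. The toy model below uses an assumed cache geometry; all field widths and names are illustrative, not the paper's implementation.

```python
OFFSET_BITS, INDEX_BITS, HALT_BITS = 5, 6, 3  # assumed cache geometry

def split(addr):
    """Split an address into set index and tag."""
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return index, tag

def ways_to_access(halt_tag_arrays, addr):
    """Return only the ways whose stored halt tag matches the request's
    low-order tag bits; all other ways are halted and never activated."""
    index, tag = split(addr)
    halt = tag & ((1 << HALT_BITS) - 1)
    return [way for way, tags in enumerate(halt_tag_arrays)
            if tags[index] == halt]
```

Because the halt tag comes from address bits available in the address-generation stage, this filtering can happen one cycle before the SRAM access, which is what lets SHA work with conventional synchronous SRAMs.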