Colin Blundell | University of Pennsylvania

Papers by Colin Blundell

Session 1P: Instruction-Level Parallelism

Software transactional memory: Why is it only a research toy?

STM is sometimes touted as the way forward for developing concurrent software, but is it ready for use in real-world applications? The authors built an STM runtime system and compiler framework, the IBM STM, and compared its performance to similar products from Intel and Sun. They conclude that, from both performance and productivity standpoints, STM still has a long way to go before it can be viable in the real world.

Improved Sequence-based Speculation Techniques for Implementing Memory Consistency

This work presents BMW, a new design for speculative implementations of memory consistency models in shared-memory multiprocessors. BMW matches the performance of prior proposals while avoiding several of their undesirable attributes: non-scalable structures, per-word valid bits in the data cache, modifications to the cache coherence protocol, and global arbitration.

Adding Token Counting to Directory-Based Cache Coherence

The coherence protocol is a first-order design concern in multicore designs. Directory protocols are naturally scalable, as they place no restrictions on the interconnect and have minimal bandwidth requirements; however, this scalability comes at the cost of increased sharing latency due to indirection. In contrast, broadcast-based systems such as snooping protocols and token coherence reduce the latency of sharing misses by sending requests directly to other processors.

Token tenure and PATCH: a predictive/adaptive Token-counting hybrid

Traditional coherence protocols present a set of difficult trade-offs: the reliance of snoopy protocols on broadcast and ordered interconnects limits their scalability, while directory protocols incur a performance penalty on sharing misses due to indirection. This work introduces PATCH (Predictive/Adaptive Token-Counting Hybrid), a coherence protocol that provides the scalability of directory protocols while opportunistically sending direct requests to reduce sharing latency.

RETCON: transactional repair without replay

Over the past decade there has been a surge of academic and industrial interest in optimistic concurrency, i.e., the speculative parallel execution of code regions that have the semantics of isolation. This work analyzes scalability bottlenecks of workloads that use optimistic concurrency. We find that one common bottleneck is updates to auxiliary program data in otherwise non-conflicting operations, e.g., reference count updates and hashtable occupancy field increments.
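
To make this bottleneck concrete, here is a minimal C sketch (names are invented for illustration and the `atomic { ... }` region is only indicated in comments): two inserts into different hash-table buckets would not conflict on the payload data, yet both increment the shared occupancy counter, so speculation detects a conflict and serializes them anyway.

```c
#include <stddef.h>

/* Illustrative sketch of the auxiliary-data bottleneck described above.
 * The speculative region is marked with comments; names are invented for
 * illustration and do not come from the paper. */
struct table {
    int buckets[1024];
    int occupancy;              /* auxiliary metadata shared by every insert */
};

void insert(struct table *t, size_t bucket, int value) {
    /* atomic {  -- code region with the semantics of isolation */
    t->buckets[bucket] = value; /* usually touches a thread-private bucket    */
    t->occupancy++;             /* always read-modify-writes shared metadata,
                                   so otherwise-independent inserts conflict  */
    /* } */
}
```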

Token tenure: PATCHing token counting using directory-based cache coherence

Traditional coherence protocols present a set of difficult tradeoffs: the reliance of snoopy protocols on broadcast and ordered interconnects limits their scalability, while directory protocols incur a performance penalty on sharing misses due to indirection. This work introduces PATCH (Predictive/Adaptive Token Counting Hybrid), a coherence protocol that provides the scalability of directory protocols while opportunistically sending direct requests to reduce sharing latency.

Unrestricted transactional memory: Supporting I/O and system calls within transactions

Hardware transactional memory has great potential to simplify the creation of correct and efficient multithreaded programs, enabling programmers to exploit the soon-to-be-ubiquitous multi-core designs. Transactions are simply segments of code that are guaranteed to execute without interference from other concurrently-executing threads. The hardware executes transactions in parallel, ensuring non-interference via abort/rollback/restart when conflicts are detected.

Mechanisms for unbounded, conflict-robust hardware transactional memory

With shared-memory multiprocessing becoming the norm in contexts ranging from webservers to mobile devices, the task of developing high-performance parallel programs is being faced by more programmers than ever before. One key challenge in developing such programs is the need to synchronize accesses to shared memory made by different threads. Implementing synchronization that is both (1) correct and (2) not a performance bottleneck has historically been a challenging task.

A constraint-based approach to open feature verification

Feature-oriented software architectures provide a powerful model for building product-line systems. Each component corresponds to an individual feature, and a composition of features yields a product. Feature-oriented verification methodologies must be able to analyze individual features and to compose the results into results on products; features are hence a form of open systems. In prior work, Li, Fisler and Krishnamurthi proposed a feature verification methodology based on 3-valued model checking.

Relaxing Synchronization for Performance and Insight

Synchronization overhead is a major bottleneck in scaling parallel applications to a large number of cores. This continues to be true in spite of the various synchronization-reduction techniques that have been proposed. Previously studied synchronization-reduction techniques tacitly assume that all synchronizations specified in a source program are essential to guarantee the quality of the results produced by the program.

InvisiFence: performance-transparent memory ordering in conventional multiprocessors

A multiprocessor's memory consistency model imposes ordering constraints among loads, stores, atomic operations, and memory fences. Even for consistency models that relax ordering among loads and stores, ordering constraints still induce significant performance penalties due to atomic operations and memory ordering fences.
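
For context, the fences in question are the orderings a programmer must still request even under relaxed models; below is a minimal C11 sketch of the classic store-buffering idiom (InvisiFence itself is a hardware mechanism, so this only shows the software-visible fences whose stall cost it targets).

```c
#include <stdatomic.h>

atomic_int x = 0, y = 0;

/* Store-buffering litmus test: without the fences, both threads may read 0
 * under a relaxed memory model; with seq_cst fences, at least one thread is
 * guaranteed to observe the other's store. These fences are exactly the kind
 * of ordering operation whose stall cost the paper aims to hide. */
int thread_1(void) {
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&y, memory_order_relaxed);
}

int thread_2(void) {
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&x, memory_order_relaxed);
}
```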

Hardbound: architectural support for spatial safety of the C programming language

The C programming language is at least as well known for its absence of spatial memory safety guarantees (i.e., lack of bounds checking) as it is for its high performance. C's unchecked pointer arithmetic and array indexing allow simple programming mistakes to lead to erroneous executions, silent data corruption, and security vulnerabilities. Many prior proposals have tackled enforcing spatial safety in C programs by checking pointer and array accesses. However, existing software-only proposals have significant drawbacks that may prevent wide adoption, including: unacceptably high runtime overheads, lack of completeness, incompatible pointer representations, or need for non-trivial changes to existing C source code and compiler infrastructure.
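
As a reminder of what "unchecked" means in practice, the minimal C fragment below (purely illustrative, not taken from the paper) type-checks and runs, yet silently writes past the end of an array; a spatial-safety scheme of the kind proposed here would instead flag the access.

```c
#include <stdio.h>

int main(void) {
    int buf[4] = {0, 0, 0, 0};
    int *p = buf;

    /* Unchecked pointer arithmetic: the write below is out of bounds, but C
     * performs no bounds check at run time, so it may silently corrupt
     * adjacent memory rather than fault. A spatial-safety mechanism would
     * detect that p + 6 lies outside buf's bounds and trap before the store. */
    p[6] = 42;

    printf("buf[0] = %d\n", buf[0]);
    return 0;
}
```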

Making the fast case common and the uncommon case simple in unbounded transactional memory

Hardware transactional memory has great potential to simplify the creation of correct and efficient multithreaded programs, allowing programmers to exploit more effectively the soon-to-be-ubiquitous multi-core designs. Several recent proposals have extended the original bounded transactional memory to unbounded transactional memory, a crucial step toward transactions becoming a general-purpose primitive. Unfortunately, supporting the concurrent execution of an unbounded number of unbounded transactions is challenging, and as a result, many proposed implementations are complex.

Subtleties of Transactional Memory Atomicity Semantics

Transactional memory has great potential for simplifying multithreaded programming by allowing programmers to specify regions of the program that must appear to execute atomically. Transactional memory implementations then optimistically execute these transactions concurrently to obtain high performance. This work shows that the same atomic guarantees that give transactions their power also have unexpected and potentially serious negative effects on programs that were written assuming narrower scopes of atomicity. We make four contributions: (1) we show that a direct translation of lock-based critical sections into transactions can introduce deadlock into otherwise correct programs, (2) we introduce the terms strong atomicity and weak atomicity to describe the interaction of transactional and non-transactional code, (3) we show that code that is correct under weak atomicity can deadlock under strong atomicity, and (4) we demonstrate that sequentially composing transactional code can also introduce deadlocks. These observations invalidate the intuition that transactions are strictly safer than lock-based critical sections, that strong atomicity is strictly safer than weak atomicity, and that transactions are always composable.
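
The deadlock in contribution (1) arises from cross-thread handoffs of the kind sketched below in pthreads C (a hedged illustration with invented names, in the spirit of the paper's examples): under two distinct locks, thread B's store to `ready` becomes visible while thread A spins, but if each critical section is translated directly into a transaction, thread A's region can never observe another transaction's write without breaking atomicity, so the wait never completes.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hedged illustration with invented names. Under locks this program
 * terminates: the two critical sections use different locks, so thread B's
 * store to `ready` becomes visible while thread A waits. If each critical
 * section were rewritten as a transaction, thread A's transaction could not
 * observe another transaction's write mid-execution without violating
 * atomicity, so the wait loop would never exit. */
static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool ready = false;

static void *thread_a(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock_a);
    while (!atomic_load(&ready)) { /* wait for thread B's handoff */ }
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

static void *thread_b(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock_b);
    atomic_store(&ready, true);    /* the handoff */
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```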

Making the fast case common and the uncommon case simple in unbounded transactional memory

Hardware transactional memory has great potential to simplify the creation of correct and efficient multithreaded programs, enabling programmers to exploit the soon-to-be-ubiquitous multi-core designs. Transactions are simply segments of code that are guaranteed to execute without interference from other concurrently-executing threads. The hardware executes transactions in parallel, ensuring non-interference via abort/rollback/restart when conflicts are detected. Transactions thus provide both a simple programming interface and a highly-concurrent implementation that serializes only on data conflicts. A progression of recent work has broadened the utility of transactional memory by lifting the bound on the size and duration of transactions (so-called unbounded transactions). Nevertheless, two key challenges remain: (i) I/O and system calls cannot appear in transactions and (ii) existing unbounded transactional memory proposals require complex implementations. We describe a system for fully unrestricted transactions (i.e., they can contain I/O and system calls in addition to being unbounded in size and duration). We achieve this via two modes of transaction execution: restricted (which limits transaction size, duration, and content but is highly concurrent) and unrestricted (which is unbounded and can contain I/O and system calls but has limited concurrency because there can be only one unrestricted transaction executing at a time). Transactions transition to unrestricted mode only when necessary. We introduce unoptimized and optimized implementations in order to balance performance and design complexity.

Assume-guarantee testing

Verification techniques for component-based systems should ideally be able to predict properties of the assembled system through analysis of individual components before assembly. This work introduces such a modular technique in the context of testing. Assume-guarantee testing relies on the (automated) decomposition of key system-level requirements into local component requirements at design time. Developers can verify the local requirements by checking components in isolation; failed checks may indicate violations of system requirements, while valid traces from different components compose via the assume-guarantee proof rule to potentially provide system coverage. These local requirements also form the foundation of a technique for efficient predictive testing of assembled systems: given a correct system run, this technique can predict violations by alternative system runs without constructing those runs. We discuss the application of our approach to testing a multi-threaded NASA application, where we treat threads as components.

Deconstructing Transactional Semantics: The Subtleties of Atomicity

Researchers have recently proposed software and hardware support for transactions as a replacement for the traditional lock-based synchronization most common in multithreaded programs. Transactions allow the programmer to specify a region of the program that should appear to execute atomically, while the hardware and runtime system optimistically execute the transactions concurrently to obtain high performance. The transactional abstraction is thus a promising approach for creating both faster and simpler multithreaded programs.

Parameterized Interfaces for Open System Verification of Product Lines

The second phase discharges the constraints upon composition of features into a product. We present the technique as well as the results of a case study on an email protocol suite.
