Transactional Memory Research Papers - Academia.edu

2025, IEEE Transactions on Parallel and Distributed Systems

Transactional contention management policies show considerable variation in relative performance with changing workload characteristics. Consequently, incorporation of fixed-policy Transactional Memory (TM) in general-purpose computing systems is suboptimal by design and renders such systems susceptible to pathologies. Of particular concern are Hardware TM (HTM) systems, where traditional designs have hardwired policies in silicon. Adaptive HTMs hold promise, but pose major challenges in terms of design and verification costs. In this paper, we present the ZEBRA HTM design, which lays down a simple yet high-performance approach to implementing adaptive contention management in hardware. Prior work in this area has associated contention with transactional code blocks. However, we discover that by associating contention with the data (cache blocks) accessed by transactional code rather than with the code block itself, we achieve a neat match in granularity with that of the cache coherence protocol. This leads to a design that is very simple and yet able to track closely, or exceed, the performance of the best-performing policy for a given workload. ZEBRA therefore brings together the inherent benefits of traditional eager HTMs (parallel commits) and lazy HTMs (good optimistic concurrency without deadlock-avoidance mechanisms), combining them into a low-complexity design.
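The per-block policy idea above can be illustrated with a small sketch. This is my own simplified model, not the paper's hardware design: class names, the conflict threshold, and the block size are assumptions chosen for illustration. Contention is recorded against the cache block an address falls in, and the block's history selects an eager or lazy handling policy.

```python
# Illustrative sketch (not ZEBRA's actual hardware): contention is tracked
# per cache block, and each block's conflict history picks a policy.

BLOCK_SIZE = 64  # bytes per cache block (a typical size; an assumption here)

class ContentionTracker:
    """Maps cache-block addresses to conflict counts; picks a policy per block."""
    def __init__(self, threshold=2):
        self.conflicts = {}        # block base address -> observed conflicts
        self.threshold = threshold # illustrative cutoff between policies

    def block_of(self, addr):
        return (addr // BLOCK_SIZE) * BLOCK_SIZE

    def record_conflict(self, addr):
        b = self.block_of(addr)
        self.conflicts[b] = self.conflicts.get(b, 0) + 1

    def policy_for(self, addr):
        # Low-contention blocks take the eager (in-place, parallel-commit)
        # path; highly contended blocks fall back to lazy versioning.
        n = self.conflicts.get(self.block_of(addr), 0)
        return "lazy" if n >= self.threshold else "eager"

t = ContentionTracker()
t.record_conflict(0x1000)
t.record_conflict(0x1010)    # falls in the same 64-byte block as 0x1000
print(t.policy_for(0x1008))  # "lazy": this block has seen repeated conflicts
print(t.policy_for(0x2000))  # "eager": no conflicts recorded for this block
```

The point of the granularity match is visible even in this toy: the policy table is keyed exactly like a coherence directory, by block address.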

2025, arXiv (Cornell University)

For a distributed last-level cache (LLC) in a large multicore chip, the access time to one LLC bank can significantly differ from that to another due to the difference in physical distance. In this paper, we successfully demonstrated a new distance-based side-channel attack by timing the AES decryption operation and extracting part of an AES secret key on an Intel Knights Landing CPU. We introduce several techniques to overcome the challenges of the attack, including the use of multiple attack threads to ensure LLC hits, to detect vulnerable memory locations, and to obtain fine-grained timing of the victim operations. While operating as a covert channel, this attack can reach a bandwidth of 205 kbps with an error rate of only 0.02%. We also observed that the side-channel attack can extract 4 bytes of an AES key with 100% accuracy with only 4000 trial rounds of encryption.

2025, Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming

Transactional Memory (TM) is on its way to becoming the programming API of choice for writing correct, concurrent, and scalable programs. Hardware TM (HTM) implementations are expected to be significantly faster than pure software TM (STM); however, full hardware support for true closed and open nested transactions is unlikely to be practical. This paper presents a novel mechanism, the split hardware transaction (SpHT), that uses minimal software support to combine multiple segments of an atomic block, each executed using a separate hardware transaction, into one atomic operation. The idea of segmenting transactions can be used for many purposes, including nesting, local retry, orElse, and user-level thread scheduling; in this paper we focus on how it allows linear closed and open nesting of transactions. SpHT overcomes the limited expressive power of best-effort HTM while imposing overheads dramatically lower than STM and preserving useful guarantees such as strong atomicity provided by the underlying HTM.
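The segmenting idea can be modeled in a few lines. This is a rough sketch of my own, not the SpHT mechanism itself: each segment (which in SpHT would run as its own best-effort hardware transaction) writes into a shared software buffer, and only the final segment publishes the whole buffer, so the combined block appears as one atomic operation.

```python
# Rough model of a split transaction: segments share a speculative write
# buffer; nothing becomes visible to memory until the final commit.
# Class and method names are illustrative assumptions.

class SplitTransaction:
    def __init__(self, memory):
        self.memory = memory
        self.buffer = {}   # speculative writes carried across segments

    def segment(self, fn):
        """Run one segment; in SpHT each call would be a separate HW txn."""
        fn(self)

    def read(self, addr):
        # Reads see the transaction's own uncommitted writes first.
        return self.buffer.get(addr, self.memory.get(addr, 0))

    def write(self, addr, value):
        self.buffer[addr] = value

    def commit(self):
        self.memory.update(self.buffer)  # final segment publishes everything

mem = {'x': 1}
tx = SplitTransaction(mem)
tx.segment(lambda t: t.write('x', t.read('x') + 1))
tx.segment(lambda t: t.write('y', t.read('x') * 10))  # sees uncommitted x
print(mem)   # {'x': 1}: nothing is visible before commit
tx.commit()
print(mem)   # {'x': 2, 'y': 20}: both segments appear as one atomic step
```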

2025, Electronic Notes in Theoretical Computer Science

We extend the notion of Store Atomicity [4] to a system with atomic transactional memory. This gives a fine-grained, graph-based framework for defining and reasoning about transactional memory consistency. The memory model is defined in terms of thread-local Instruction Reordering axioms and Store Atomicity, which describes inter-thread communication via memory. A memory model with Store Atomicity is serializable: there is a unique global interleaving of all operations which respects the reordering rules and serializes all the operations in a transaction together. We extend Store Atomicity to capture this ordering requirement by requiring dependencies which cross a transaction boundary to point into the initiating instruction or out from the committing instruction. We sketch a weaker definition of transactional serialization which accounts for the ability to interleave transactional operations that touch disjoint memory. We give a procedure for enumerating the behaviors of a transactional program, noting that a safe enumeration procedure permits only one transaction to read from memory at a time. We show that more realistic models of transactional execution require speculative execution. We define the conditions under which speculation must be rolled back, and give criteria to identify which instructions must be rolled back in these cases.
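The graph-based reasoning above has a classical concrete counterpart: a transactional history is serializable iff its conflict graph is acyclic. The following is a minimal illustration of that check (my own construction, using the standard conflict-graph formulation rather than the paper's axioms).

```python
# Serializability via conflict graph: add an edge t1 -> t2 whenever an
# operation of t1 conflicts with a later operation of t2 (same location,
# at least one write); the history serializes iff the graph is acyclic.

def conflict_graph(history):
    """history: list of (txn, op, location), op is 'r' or 'w'."""
    edges = set()
    for i, (t1, op1, loc1) in enumerate(history):
        for t2, op2, loc2 in history[i + 1:]:
            if t1 != t2 and loc1 == loc2 and 'w' in (op1, op2):
                edges.add((t1, t2))
    return edges

def is_serializable(history):
    edges = conflict_graph(history)
    nodes = {t for t, _, _ in history}
    # Kahn's algorithm: a topological order exists iff there is no cycle.
    indeg = {n: 0 for n in nodes}
    for _, dst in edges:
        indeg[dst] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while queue:
        n = queue.pop()
        seen += 1
        for src, dst in edges:
            if src == n:
                indeg[dst] -= 1
                if indeg[dst] == 0:
                    queue.append(dst)
    return seen == len(nodes)

# T1 and T2 each read what the other later overwrites: a dependency cycle.
bad = [('T1', 'r', 'x'), ('T2', 'r', 'y'), ('T1', 'w', 'y'), ('T2', 'w', 'x')]
ok  = [('T1', 'r', 'x'), ('T1', 'w', 'x'), ('T2', 'r', 'x'), ('T2', 'w', 'x')]
print(is_serializable(bad))  # False
print(is_serializable(ok))   # True
```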

2025, Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures

Transactional memory (TM) eliminates many problems associated with lock-based synchronization. Over recent years, much progress has been made in software and hardware implementation techniques for TM. However, before transactional memory can be integrated into mainstream programming languages, we must precisely define its meaning in the context of these languages. In particular, TM semantics should address the advanced features present in existing software TM implementations, such as interactions between transactions and locks, explicit user-level abort, and support for legacy code. In this paper, we address these topics from both theoretical and practical points of view. We give precise formulations of several popular TM semantics for the domain of sequentially consistent executions and show that some of these semantics are equivalent for C++ programs that do not contain other forms of synchronization. We show that lock-based semantics, such as Single Global Lock Atomicity (SLA) or Disjoint Lock Atomicity (DLA), do not actually guarantee atomicity for race-free programs, and propose a new semantics, Race-Free Atomicity (RFA), that gives such a guarantee. We compare these semantics from the programmer and implementation points of view and explain why supporting non-atomic transactions is useful. Finally, we propose a new set of language constructs that let programmers explicitly specify whether transactions should be atomic and describe how these constructs interact with user-level abort and legacy code.
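The SLA semantics mentioned above can be stated operationally in a few lines: every atomic block behaves as if it acquired one process-wide lock. The sketch below (names are mine) shows both sides of the coin: atomicity is trivially preserved, but transactions on disjoint data still serialize on the single lock.

```python
# Toy model of Single Global Lock Atomicity (SLA): one process-wide lock
# guards every atomic block.

import threading
from contextlib import contextmanager

_global_lock = threading.Lock()

@contextmanager
def atomic():
    """An atomic block under SLA: all transactions contend on one lock."""
    with _global_lock:
        yield

counter = 0

def increment(n):
    global counter
    for _ in range(n):
        with atomic():
            counter += 1   # read-modify-write is atomic under the block

threads = [threading.Thread(target=increment, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000: atomicity holds, but all blocks ran serialized
```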

2025, Distributed and Parallel Databases

We present a linguistic construct to define concurrency control for the objects of an object database. This construct, called concurrent behavior, allows the programmer to define a concurrency control specification for each object type in the database; in a sense, it can be seen as a type extension. The concurrent behavior consists of two parts: the first, called the commutativity specification, is a set of conditional rules by which the programmer specifies when two methods do not conflict with each other. The second part, the constraint specification, is a set of guarded regular expressions, called constraints, by which the programmer defines the allowed sequences of method calls. At any time during an actual execution, a subset of constraints may be active, thus limiting the external behavior of the object. A constraint becomes active when its guard is verified, where a guard consists of the occurrence of some method call m along with the verification of a boolean expression on the object state and the actual parameters of m. A constraint dies when a string of the language corresponding to the regular expression has been recognized. While the commutativity specification describes how the external behavior of an object is influenced by the existence of concurrent transactions in the system, the constraint specification defines the behavior of the object independently of the transactions. Since the two parts of the concurrent behavior are syntactically distinct and, moreover, each of them consists of a set of independent rules, modularity in specifying the objects is enhanced compared with a single monolithic specification. We outline an implementation of the construct, which is based on a look-ahead policy: at each method execution, we foresee the admissible successive behaviors of the object, instead of checking the admission of each request at the time it is actually made.
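A commutativity specification of the kind described above can be illustrated with a small table of conditional rules. The methods, rules, and the conservative default below are invented for this summary, not taken from the paper; the point is only the shape of such a specification: a per-type table saying when two calls may run concurrently.

```python
# Illustrative commutativity specification: conditional rules stating when
# two method calls on the same object do not conflict. All rule contents
# here are hypothetical examples.

def commute(call_a, call_b):
    """Each call is (method, arg). Returns True if the calls commute."""
    (ma, aa), (mb, ab) = call_a, call_b
    rules = {
        ('deposit', 'deposit'): lambda a, b: True,       # order never matters
        ('balance', 'balance'): lambda a, b: True,       # reads always commute
        ('insert', 'insert'):   lambda a, b: a != b,     # distinct keys commute
    }
    key = tuple(sorted((ma, mb)))
    rule = rules.get(key)
    # Pairs without a rule are conservatively treated as conflicting.
    return bool(rule(aa, ab)) if rule else False

print(commute(('deposit', 10), ('deposit', 5)))     # True
print(commute(('insert', 'k1'), ('insert', 'k2')))  # True
print(commute(('insert', 'k1'), ('insert', 'k1')))  # False: same key
print(commute(('deposit', 10), ('balance', None)))  # False: no rule given
```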

2025

Nowadays, most computer manufacturers offer chip multiprocessors (CMPs) due to ever-increasing chip density. These CMPs have a broad range of characteristics, but all of them support the shared memory programming model. As a result, every CMP implements a coherence protocol to keep local caches coherent. Coherence protocols consume a significant fraction of power to determine which coherence action to perform. Specifically, on CMPs with write-through local caches, a shared cache, and a directory-based coherence protocol implemented as a duplicate of the local caches' tags, we have observed that energy is wasted in the directory for two main reasons. First, a significant fraction of directory lookups are useless, because the target block is not located in any local cache. The power consumed by the directory could be reduced by filtering out useless directory lookups. Second, useful directory lookups (where there are local copies of the target block) are performed on target blocks that are shared by a small number of processors. The directory power consumption could be reduced by limiting the lookups to only the directory entries that hold a copy of the block. In this thesis we propose two filtering mechanisms, each focused on one of the problems described above: our first proposal focuses on reducing the number of directory lookups performed, while our second proposal aims at reducing the associativity of directory lookups. Several implementations of both filtering approaches have been proposed and evaluated, all of them with very limited hardware complexity. Our results show that the power consumed by the directory can be reduced by as much as 30%.
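The first filtering idea can be sketched with a toy model. This is my own simplification, not the thesis's mechanism: a small counting structure tracks how many local caches may hold each block, and a zero count means the directory lookup (and its energy cost) can be skipped entirely.

```python
# Toy directory-lookup filter: skip lookups for blocks that cannot be in
# any local cache. Names and structure are illustrative assumptions.

BLOCK = 64  # bytes per cache block (a typical size)

class DirectoryFilter:
    def __init__(self):
        self.copies = {}   # block index -> count of local copies

    def on_fill(self, addr):
        """A local cache brings the block in."""
        b = addr // BLOCK
        self.copies[b] = self.copies.get(b, 0) + 1

    def on_evict(self, addr):
        """A local cache drops the block."""
        self.copies[addr // BLOCK] -= 1

    def needs_lookup(self, addr):
        """False means the directory lookup would be useless: filter it out."""
        return self.copies.get(addr // BLOCK, 0) > 0

f = DirectoryFilter()
f.on_fill(0x1000)
print(f.needs_lookup(0x1020))  # True: same block as 0x1000 has a local copy
print(f.needs_lookup(0x4000))  # False: useless lookup filtered out
f.on_evict(0x1000)
print(f.needs_lookup(0x1020))  # False once the only copy is gone
```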

2025, Proceedings of the 5th European conference on Computer systems

In this paper, we study the parallelization of multiplayer games using software Transactional Memory (STM) support. We show that the STM provides not only ease of programming, but also better performance than that achievable with state-of-the-art lock-based programming, for this realistic, high-impact application. For this purpose, we use a game benchmark, SynQuake, that extracts the main data structures and the essential features of the popular game Quake. SynQuake can be driven with a synthetic workload generator that flexibly emulates client game actions and various hot-spot scenarios in the game world. We implement, evaluate and compare the STM version of SynQuake with a state-of-the-art lock-based parallelization of Quake, which we ported to SynQuake. While in STM-SynQuake support for maintaining the consistency of each complex game action is automatic, conservative locking of surrounding objects within a bounding box for the duration of the game action is inherently needed in lock-based SynQuake. This leads to higher scalability of STM-SynQuake versus lock-based SynQuake, due to a higher degree of false sharing in the latter. Task assignment to threads has a second-order effect on the scalability of STM-SynQuake, due to its impact on the application's true sharing patterns. We show that a dynamic locality-aware task assignment to threads provides the best trade-off between load balancing and conflict reduction.

2025

This work addresses the problem of parallelizing multiplayer games using software Transactional Memory (STM) support. Using a realistic, high-impact application, we show that STM provides not only ease of programming, but also better performance than that achievable with state-of-the-art lock-based programming. Towards this goal, we use SynQuake, a game benchmark which extracts the main data structures and the essential features of the popular multiplayer game Quake, but can be driven with a synthetic workload generator that flexibly emulates client game actions and various hot-spot scenarios in the game world. We implement, evaluate and compare the STM version of SynQuake with a state-of-the-art lock-based parallelization of Quake, which we ported to SynQuake. While in STM-SynQuake support for maintaining the consistency of each potentially complex game action is automatic, conservative locking of surrounding objects within a bounding box for the duration of the game action is inherently needed in lock-based SynQuake. This leads to a higher scalability factor of STM-SynQuake versus lock-based SynQuake, due to a higher degree of false sharing in the latter.

2025

Although this report was prepared by a task force commissioned by the National Science Foundation, all opinions, findings, and recommendations expressed within it are those of the task force and do not necessarily reflect the views of the National Science Foundation. Preface: The Software for Science and Engineering (SSE) Task Force commenced in June 2009 with a charge that consisted of the following three elements:
• Identify specific needs and opportunities across the spectrum of scientific software infrastructure. Characterize the specific needs and analyze technical gaps and opportunities for NSF to meet those needs through individual and systemic approaches.
• Design responsive approaches. Develop initiatives and programs led (or co-led) by NSF to grow, develop, and sustain the software infrastructure needed to support NSF's mission of transformative research and innovation leading to scientific leadership and technological competitiveness.
• Address issues of institutional ...

2025, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06)

Transactional memory provides a concurrency control mechanism that avoids many of the pitfalls of lock-based synchronization. Researchers have proposed several different implementations of transactional memory, broadly classified into software transactional memory (STM) and hardware transactional memory (HTM). Both approaches have their pros and cons: STMs provide rich and flexible transactional semantics on stock processors but incur significant overheads. HTMs, on the other hand, provide high performance but implement restricted semantics or add significant hardware complexity. This paper is the first to propose architectural support for accelerating transactions executed entirely in software. We propose instruction set architecture (ISA) extensions and novel hardware mechanisms that improve STM performance. We adapt a high-performance STM algorithm supporting rich transactional semantics to our ISA extensions (called hardware-accelerated software transactional memory, or HASTM). HASTM accelerates fully virtualized nested transactions, supports language integration, and provides both object-based and cache-line-based conflict detection. We have implemented HASTM in an accurate multi-core IA32 simulator. Our simulation results show that (1) HASTM single-thread performance is comparable to a conventional HTM implementation; (2) HASTM scaling is comparable to an STM implementation; and (3) HASTM is resilient to spurious aborts and can scale better than HTM in a multi-core setting. Thus, HASTM provides the flexibility and rich semantics of STM, while giving the performance of HTM.

2025

The Go runtime, as well as the most recently proposed changes to it, draw from previous work to improve scalability and performance. In this paper we explore several examples of previous research, some that have actively influenced the Go runtime, and others that are based on similar guiding principles. We propose additional extensions to the runtime based on contention aware scheduling techniques. We also discuss how such changes would not only leverage the proposed improvements currently in the works, but how they can potentially improve the effectiveness of the runtime’s scheduling algorithm.

2025, The Vldb Journal

Optimistic concurrency control, or OCC, can achieve excellent performance on uncontended workloads for main-memory transactional databases. Contention causes OCC's performance to degrade, however, and recent concurrency control designs, such as hybrid OCC/locking systems and variations on multiversion concurrency control (MVCC), have claimed to outperform the best OCC systems. We evaluate several concurrency control designs under varying contention and varying workloads, including TPC-C, and find that implementation choices unrelated to concurrency control may explain much of OCC's previously reported degradation. When these implementation choices are made sensibly, OCC performance does not collapse on high-contention TPC-C. We also present two optimization techniques, commit-time updates and timestamp splitting, that can dramatically improve the high-contention performance of both OCC and MVCC. Though these techniques are known, we apply them in a new context and highlight their potency: when combined, they lead to performance gains of 3.4× for MVCC and 3.6× for OCC in a TPC-C workload.
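The OCC protocol at the center of this evaluation can be sketched generically: reads record the version of each record they observe, and commit re-validates those versions before installing writes. The following single-threaded illustration is my own, not code from the evaluated systems.

```python
# Generic OCC sketch: record versions on read, validate them at commit,
# and bump versions when installing writes.

class Record:
    def __init__(self, value):
        self.value, self.version = value, 0

class OCCTransaction:
    def __init__(self):
        self.read_set = {}    # record -> version observed at first read
        self.write_set = {}   # record -> new value (buffered until commit)

    def read(self, rec):
        if rec in self.write_set:
            return self.write_set[rec]   # read-your-own-writes
        self.read_set.setdefault(rec, rec.version)
        return rec.value

    def write(self, rec, value):
        self.write_set[rec] = value

    def commit(self):
        # Validation: abort if any record read has changed since we read it.
        for rec, seen in self.read_set.items():
            if rec.version != seen:
                return False
        for rec, value in self.write_set.items():
            rec.value, rec.version = value, rec.version + 1
        return True

x = Record(10)
t1, t2 = OCCTransaction(), OCCTransaction()
t1.write(x, t1.read(x) + 1)
t2.write(x, t2.read(x) + 1)
print(t1.commit())  # True: first committer validates and wins
print(t2.commit())  # False: t2's read of x is now stale, so it must retry
print(x.value)      # 11
```

The paper's commit-time updates move work like `x + 1` into the commit step itself, shrinking the window in which validation can fail.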

2025

Transactional memory (TM) facilitates the development of concurrent applications by letting the programmer designate certain code blocks as atomic. Programmers using a TM often would like to access the same data both inside and outside transactions, e.g., to improve performance or to support legacy code. In this case, programmers would ideally like the TM to guarantee strong atomicity, where transactions can be viewed as executing atomically also with respect to non-transactional accesses. Since guaranteeing strong atomicity for arbitrary programs is prohibitively expensive, researchers have suggested guaranteeing it only for certain data-race-free (DRF) programs, particularly those that follow the privatization idiom: from some point on, threads agree that a given object can be accessed non-transactionally. Supporting privatization safely in a TM is nontrivial, because this often requires correctly inserting transactional fences, which wait until all active transactions complete. Unfortunately, there is currently no consensus on a single definition of transactional DRF, in particular, because no existing notion of DRF takes into account transactional fences. In this paper we propose such a notion and prove that, if a TM satisfies a certain condition generalizing opacity and a program using it is DRF assuming strong atomicity, then the program indeed has strongly atomic semantics. We show that our DRF notion allows the programmer to use privatization idioms. We also propose a method for proving our generalization of opacity and apply it to the TL2 TM. • Theory of computation → Concurrency; • Software and its engineering → Software verification;

2025, Springer eBooks

One of the main challenges in stating the correctness of transactional memory (TM) systems is the need to provide guarantees on the system state observed by live transactions, i.e., those that have not yet committed or aborted. A TM correctness condition should be weak enough to allow flexibility in implementation, yet strong enough to disallow undesirable TM behavior, which can lead to run-time errors in live transactions. The latter feature is formalized by observational refinement between TM implementations, stating that properties of a program using a concrete TM implementation can be established by analyzing its behavior with an abstract TM, serving as a specification of the concrete one. We show that a variant of transactional memory specification (TMS), a TM correctness condition, is equivalent to observational refinement for the common programming model in which local variables are rolled back upon a transaction abort and, hence, is the weakest acceptable condition for this case. This is challenging due to the nontrivial formulation of TMS, which allows different aborted and live transactions to have different views of the system state. Our proof reveals some natural, but subtle, assumptions on the TM required for the equivalence result.

2025

Existing concurrent priority queues do not allow updating the priority of an element after its insertion. As a result, algorithms that need this functionality, such as Dijkstra's single-source shortest-path algorithm, resort to cumbersome and inefficient workarounds. We report on a heap-based concurrent priority queue which allows changing the priority of an element after its insertion. We show that the enriched interface allows expressing Dijkstra's algorithm in a more natural way, and that its implementation, using our concurrent priority queue, outperforms existing algorithms. A priority queue data structure maintains a collection (multiset) of items which are ordered according to a priority associated with each item. Priority queues are amongst the most useful data structures in practice, and can be found in a variety of applications ranging from graph algorithms [21, 5] to discrete event simulation [8] and modern SAT solvers [4]. The importance of priority queues has m...
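The workaround the paper criticizes can be seen in the standard sequential idiom: without a real change-priority operation, Dijkstra's algorithm emulates decrease-key with "lazy deletion", pushing a fresh heap entry and skipping stale ones on pop. The sketch below shows that common technique (not the paper's concurrent algorithm).

```python
# Dijkstra with the lazy-deletion workaround: each logical decrease_key
# pushes a new heap entry; popped entries are skipped if a smaller
# distance has since been recorded.

import heapq

def dijkstra(graph, src):
    """graph: {node: [(neighbor, weight), ...]}. Returns shortest distances."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue               # stale entry: its key was decreased later
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd       # the logical decrease_key
                heapq.heappush(heap, (nd, v))
    return dist

g = {'a': [('b', 4), ('c', 1)], 'c': [('b', 1)], 'b': [('d', 1)]}
print(dijkstra(g, 'a'))  # {'a': 0, 'c': 1, 'b': 2, 'd': 3}
```

A queue with a first-class change-priority operation removes the stale entries and the skip logic entirely, which is the interface improvement the paper argues for.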

2025, Proceedings of the 2013 ACM symposium on Principles of distributed computing

Transactional memory (TM) has been hailed as a paradigm for simplifying concurrent programming. While several consistency conditions have been suggested for TM, they fall short of formalizing the intuitive semantics of atomic blocks, the interface through which a TM is used in a programming language. To close this gap, we formalize the intuitive expectations of a programmer as observational refinement between TM implementations: a concrete TM observationally refines an abstract one if every user-observable behavior of a program using the former can be reproduced if the program uses the latter. This allows the programmer to reason about the behavior of a program using the intuitive semantics formalized by the abstract TM; the observational refinement relation implies that the conclusions will carry over to the case when the program uses the concrete TM. We show that, for a particular programming language and notions of observable behavior, a variant of the well-known consistency condition of opacity is sufficient for observational refinement, and its restriction to complete histories is furthermore necessary. Our results suggest a new approach to evaluating and comparing TM consistency conditions. They can also reduce the effort of proving that a TM implements its programming language interface correctly, by only requiring its developer to show that it satisfies the corresponding consistency condition.

2025, Linking Theory and Practice of Digital Libraries 27th International Conference on Theory and Practice of Digital Libraries, TPDL 2023 Zadar, Croatia, September 26–29, 2023 Proceedings

2025, Proceedings of the 21st international conference on Parallel architectures and compilation techniques

Blue Gene/Q's (BG/Q) unique transactional memory system provides hardware isolation, atomicity and consistency for memory locations while leaving the details of the transactional programming system to software layers above the hardware. This design allows for complex systems implemented as part of the software runtime. Here a profiling extension to the software runtime is presented, which allows for in-depth analysis of the actions of the transactional memory runtime system, as well as giving insight into the behaviour of the program being profiled.

2025, arXiv (Cornell University)

Multiversioning is widely used in databases, transactional memory, and concurrent data structures. It can be used to support read-only transactions that appear atomic in the presence of concurrent update operations. Any system that maintains multiple versions of each object needs a way of efficiently reclaiming them. We experimentally compare various existing reclamation techniques by applying them to a multiversion tree and a multiversion hash table. Using insights from these experiments, we develop two new multiversion garbage collection (MVGC) techniques. These techniques use two novel concurrent version list data structures. Our experimental evaluation shows that our fastest technique is competitive with the fastest existing MVGC techniques, while using significantly less space on some workloads. Our new techniques provide strong theoretical bounds, especially on space usage. These bounds ensure that the schemes have consistent performance, avoiding the very high worst-case space usage of other techniques. • Computing methodologies → Concurrent algorithms.
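The setting above can be made concrete with a minimal sequential model (my own construction, far simpler than the paper's concurrent version lists): each object keeps a timestamped version list, readers use a snapshot timestamp, and collection reclaims every version that no reader at or above the oldest active snapshot can still observe.

```python
# Minimal multiversioning model: timestamped version lists with reclamation
# of versions shadowed for every possible active reader.

class MVObject:
    def __init__(self, value):
        self.versions = [(0, value)]   # (timestamp, value), ascending by ts

    def write(self, ts, value):
        self.versions.append((ts, value))

    def read(self, snapshot_ts):
        # Latest version whose timestamp <= the reader's snapshot.
        for ts, value in reversed(self.versions):
            if ts <= snapshot_ts:
                return value
        raise KeyError("no visible version")

    def collect(self, min_active_ts):
        """Reclaim versions invisible to all readers at ts >= min_active_ts:
        a version is dead once a newer version's ts is <= min_active_ts."""
        keep = []
        for i, (ts, value) in enumerate(self.versions):
            newer = self.versions[i + 1][0] if i + 1 < len(self.versions) else None
            if newer is None or newer > min_active_ts:
                keep.append((ts, value))
        self.versions = keep

obj = MVObject("v0")
obj.write(5, "v1")
obj.write(9, "v2")
print(obj.read(7))             # "v1": a snapshot at 7 sees the ts=5 version
obj.collect(min_active_ts=6)   # oldest active reader snapshot is >= 6
print(len(obj.versions))       # 2: the ts=0 version is unreachable, reclaimed
print(obj.read(7))             # still "v1": visible versions are untouched
```

The space bounds the paper targets are exactly about how long such dead versions can linger before a scheme reclaims them.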

2025, Journal of Parallel and Distributed Computing

2025, Bulletin of The European Association for Theoretical Computer Science

2025, Bull. EATCS

hosted a celebration which included several technical presentations about Maurice's work by colleagues and friends. This column includes a summary of some of these presentations, written by the speakers themselves. In the first article, Vassos Hadzilacos overviews and highlights the impact of Maurice's seminal paper on wait-free synchronization. Then, Tim Harris provides a perspective on hardware trends and their impact on distributed computing, mentioning several interesting open problems and making connections to Maurice's work. Finally, Michael Scott gives a concise retrospective on transactional memory, another area where Maurice has been a leader. This is a joint column with the Distributed Computing Column of ACM SIGACT News (June 2015 issue), edited by Jennifer Welch. Many thanks to Vassos, Tim, and Michael for their contributions!

2025

Efficient management of concurrent access to shared resources is crucial in modern multi-threaded systems to avoid race conditions and performance bottlenecks. Traditional locking mechanisms, such as standard read-write locks, often introduce substantial overhead in read-heavy workloads due to their blocking nature. To address these challenges, we introduce the LRW lock: a lightweight read-write lock. It allows concurrent read access, ensures exclusive write access, and leverages atomic operations to track active readers and writers efficiently. This paper initially presents algorithms to acquire read and write locks using an LRW locking object. It then provides the design of the non-blocking methods tryReadLock() and tryWriteLock() for read and write operations, which offer flexibility for time-sensitive applications. To assess the efficiency of the LRW lock, we consider different concurrent data structures and a state-of-the-art locking object. Experimental results show that the implementation using the LRW lock outperforms the state-of-the-art locking object while leaving a smaller memory footprint.
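The abstract does not give the lock's internals, but a common way to build such a lock is a single word of state updated with compare-and-swap, which can be sketched as follows (class name, state encoding, and the simulated CAS are assumptions for illustration; a real implementation would use hardware atomics):

```python
class LRWLockSketch:
    """One word of state: -1 = writer held, 0 = free, n > 0 = n readers.
    The CAS below is simulated, so this single-threaded sketch shows
    the acquisition logic, not true atomicity."""

    def __init__(self):
        self._state = 0

    def _cas(self, expected, new):
        # Stand-in for an atomic compare-and-swap instruction.
        if self._state == expected:
            self._state = new
            return True
        return False

    def try_read_lock(self):
        s = self._state
        # Succeed only if no writer holds the lock (state >= 0).
        return s >= 0 and self._cas(s, s + 1)

    def try_write_lock(self):
        # Succeed only if the lock is completely free.
        return self._cas(0, -1)

    def read_unlock(self):
        self._state -= 1  # atomic decrement in a real lock

    def write_unlock(self):
        self._state = 0
```

The non-blocking try* methods return immediately on failure, which is what makes them suitable for time-sensitive callers that cannot afford to block.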

2025, 2010 International Conference on Information Retrieval & Knowledge Management (CAMP)

Knowledge transfer has attracted much attention from researchers and practitioners in recent years, since it is considered a critical determinant of an organization's capacity to confer sustainable competitive advantage. Despite extensive research on knowledge transfer issues, there is a dearth of research that has explicitly focused on the role of transactive memory in enabling intra-organizational knowledge transfer in the information technology (IT) outsourcing context, particularly e-government IT outsourcing. Although the information systems literature has recently acknowledged the role transactive memory plays in improving knowledge processes, most of the research still treats transactive memory at the level of basic concepts, emphasizing individual communication rather than integrating those concepts with existing organizational memory systems. Moreover, it remains at a conceptual level rather than offering practical actions for a firm to take. Therefore, this paper attempts to fill this gap by examining the factors that have been cited as significant influences on the ability to transfer knowledge from the vendor to the client organizations in the context of e-government IT outsourcing, and by examining the role of transactive memory systems in enabling an effective knowledge transfer process between organizations. Drawing on several theoretical streams, this paper proposes an integrated conceptual framework of inter-organizational knowledge transfer that can be used for further research.

2025, Journal of Scientific and Engineering Research

Low-code platforms provide rapid deployment at the expense of the performance and scalability requirements of mission-critical enterprise applications. Traditional approaches, on the other hand, offer more control over system design, algorithms, and resource utilization, which is essential to low latency, high throughput, and efficient resource usage in performance-critical applications. An extensive technical analysis must be performed to identify the best approach for a specific project. Such analysis usually finds that applications developed using traditional methods outperform low-code applications on query performance, network latency, and handling multiple requests concurrently. Low-code platforms struggle with fine-grained concurrency and scalable usage of resources, which can introduce performance bottlenecks and additional infrastructure expenses. While low-code shines in fast prototyping and simpler business processes, it falls short in latency-sensitive, compute-intensive enterprise applications. Where performance, scalability, and resource utilization for mission-critical systems are given top priority by organizations, traditional development is the more solid and adaptable option. Ultimately, a decision between low-code and traditional development must be made after a detailed technical evaluation of the individual project needs, performance requirements, and long-term scalability requirements.

2025, ACM SIGPLAN Notices

This paper presents a software transactional memory system that introduces first-class C++ language constructs for transactional programming. We describe new C++ language extensions, a production-quality optimizing C++ compiler that translates and optimizes these extensions, and a high-performance STM runtime library. The transactional language constructs support C++ language features including classes, inheritance, virtual functions, exception handling, and templates. The compiler automatically instruments the program for transactional execution and optimizes TM overheads. The runtime library implements multiple execution modes and features a novel STM algorithm that supports both optimistic and pessimistic concurrency control. The runtime switches a transaction's execution mode dynamically to improve performance and to handle calls to precompiled functions and I/O libraries. We present experimental results on 8 cores (two quad-core CPUs) running a set of 20 non-trivial paral...

2025

Lock- and wait-free data structures can be constructed in a generic way. However, when complex operations are involved, their practical use is rather limited due to high performance overheads and, in some settings, object lifecycles that are difficult to fulfil. While working on a synchronous inter-processor communication (IPC) path for multicore systems, we stumbled over a clever piece of code that fulfilled most of the properties that this path requires for its send queue. Unfortunately, this piece of code was by no means a data-structure publication or somehow related to send queues. Reporting on our experience in translating Krieger's MCS-style reader-writer lock into a send queue for cross-processor IPC, we would like to make the point that sometimes searching for code can end up in a valuable treasure chest, even for largely different areas.

2025, Proceedings 19th IEEE Symposium on Reliable Distributed Systems SRDS-2000

One way to implement a fault-tolerant service is by replicating it at sites that fail independently. One replication technique is active replication, where each request is executed by all the replicas. Thus, the effects of failures can be completely masked, resulting in an increase in service availability. In order to preserve consistency among replicas, replicas must exhibit deterministic behavior, which has traditionally been achieved by restricting replicas to be single-threaded. However, this approach cannot be applied in some setups, like transactional systems, where it is not admissible to process transactions sequentially. In this paper, we present a deterministic scheduling algorithm for multithreaded replicas in a transactional framework. To ensure replica determinism, requests to replicated servers are submitted by means of reliable and totally ordered multicast. Internally, a deterministic scheduler ensures that all threads are scheduled in the same way at all replicas, which guarantees replica consistency. (This work has been partially funded by the Spanish Research Council (CICYT), contract number TIC98-1032-C03-01, and the Madrid Regional Research Council (CAM), contract number CAM-07T/0012/1998.) This paper is organized as follows. First, the system model is described in Section 2. Then, the different sources of non-determinism are identified in Section 3. Section 4 presents a deterministic scheduling algorithm for multithreaded replicas (MTRDS). The correctness of the MTRDS algorithm is proven in Section 5. Section 6 presents some implementation issues to support the algorithm. Finally, we compare our approach to other works and present our conclusions. The system consists of a set of nodes interconnected by means of a network. We assume the nodes to be fail-silent and that there are no network partitions. We consider neither malicious failures (i.e., Byzantine failures) nor the non-determinism introduced by software interrupts.
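The core invariant can be sketched in a few lines (the function, the request format, and the round-robin policy are illustrative assumptions, not the MTRDS algorithm itself): if every replica receives the same totally ordered request stream, assigns requests to threads deterministically, and steps threads in the same fixed order, every replica reaches the same state even when the operations do not commute.

```python
def run_replica(requests, n_threads):
    """Deterministic-scheduler sketch: requests arrive in the same total
    order at every replica (reliable totally ordered multicast assumed)
    and are assigned to worker queues round-robin; queues are then
    stepped in a fixed order, so all replicas interleave identically."""
    queues = [[] for _ in range(n_threads)]
    for i, req in enumerate(requests):
        queues[i % n_threads].append(req)   # deterministic assignment
    state = 0
    # Fixed round-robin schedule over threads: identical at every replica.
    while any(queues):
        for q in queues:
            if q:
                op, arg = q.pop(0)
                # add and mul do not commute, so only a deterministic
                # interleaving keeps replicas consistent.
                state = state + arg if op == "add" else state * arg
    return state
```

Because the interleaving is a pure function of the request order, two replicas executing the same stream always agree, which is exactly the consistency property the scheduler must guarantee.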

2025

Drawing on transactive memory theory, we propose that transactive memory systems (TMS) operate as a meta-resource that enhances team performance directly by generating resource surpluses and indirectly by diminishing the unnecessary expenditure of resources on inter-member conflict. We also propose that this mediated relationship is moderated by team size. In conceptualizing TMS as a meta-resource, we advance a theoretical model which posits a critical role for TMS in preventing resource losses stemming from dysfunctional member interactions. This model also helps to explain the performance benefits observed in previous TMS research and the boundary conditions exerted by group size.

2025, Journal of Computational Science

As a recently consolidated paradigm for optimistic concurrency in modern multicore architectures, Transactional Memory (TM) can help exploit parallelism in irregular applications when data dependence information is not available until runtime. This paper presents and discusses how to leverage TM to exploit parallelism in an important class of irregular applications: those that exhibit irregular reduction patterns. In order to test and compare our techniques with other solutions, they were implemented in a software TM system called ReduxSTM, which acts as a proof of concept. Basically, ReduxSTM combines two major ideas: a sequential-equivalent ordering of transaction commits that assures a correct result, and an extension of the underlying TM privatization mechanism that reduces unnecessary overhead due to reduction memory updates, as well as unnecessary aborts and rollbacks. A comparative study of STM solutions, including ReduxSTM, and other more classical approaches to the parallelization of reduction operations is presented in terms of time, memory and overhead.

2025, IEEE Micro

Desktop processor architectures have crossed a critical threshold. Manufacturers have given up attempting to extract ever more performance from a single core and have instead turned to multi-core designs. While straightforward approaches to the architecture of multi-core processors are sufficient for small designs (2-4 cores), little is really known about how to build, program, or manage systems of 64 to 1024 processors. Unfortunately, the computer architecture community lacks the basic infrastructure tools required to carry out this research. While simulation has been adequate for single-processor research, significant use of simplified modeling and statistical sampling is required to work in the 2-16 processing core space. Invention is required for architecture research at the level of 64-1024 cores. Fortunately, Moore's law has not only enabled these dense multi-core chips, it has also enabled extremely dense FPGAs. Today, for a few hundred dollars, undergraduates can work with an FPGA prototype board with almost as many gates as a Pentium. Given the right support, the research community can capitalize on this opportunity too. Today, one to two dozen cores can be programmed into a single FPGA. With multiple FPGAs on a board and multiple boards in a system, large complex architectures can be explored. To make this happen, however, requires a significant amount of infrastructure in hardware, software, and what we call "gateware", the register-transfer-level models that fill the FPGAs. While it is possible for each research group to create these boards, design the gateware, and create this infrastructure in isolation, significant benefits can be had by pooling our collective resources. Such a system would not just invigorate multiprocessor research in the architecture community. Since processor cores can run at 100 to 200 MHz, a large-scale multiprocessor would be fast enough to run operating systems and large programs at speeds sufficient to support software research.
Moreover, there is a new generation of FPGAs every 18 months that is roughly twice as fast and has capacity for twice as many cores, so future multi-board FPGA systems are even more attractive. Hence, we believe such a system would accelerate research across all the fields that touch multiple processors: operating systems, compilers, debuggers, programming languages, scientific libraries, and so on; thus the acronym RAMP, for Research Accelerator for Multiple Processors. This project intends to foster just such a community endeavor. By leveraging the work that each of us was going to do anyway in isolation, we can create a shared infrastructure for architecture research. Furthermore, by pooling our resources in hardware design, we can reduce the risk each of us undertakes in designing hardware prototypes ourselves. Finally, by creating shared and supported baseline platforms for multi-core architectures, we can jump-start the required architecture and critical software research that the field needs. The intellectual merit of this project is embodied in the following contributions. First, we intend to create a set of RTL and software design standards that facilitate easy adaptability and integration of hardware and software components in the multi-core architecture. Second, we intend to use these design standards to create a baseline architecture of 1024+ nodes. Third, we will investigate architectures for fast emulation of large-scale multiprocessors; for example, what type of memory controller and caches will speed up emulation of 1024 processors? Fourth, we will create systems to observe MPP behavior without disturbing the computation. This design will be created, distributed for free on the Internet, and supported through full-time staff. The broader-impact goal of this project is nothing less than the transformation of the parallel computing community in computer science from a simulation-driven to a prototype-driven discipline.
RAMP will enable rapid iteration across the interfaces of the many fields of multiple processors, thereby more quickly ramping up a parallel foundation for large-scale computer systems research in the 21st century.

2025, Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture

Transactional Memory is a promising parallel programming model that addresses the programmability issues of lock-based applications using mechanisms that are transparent to developers. Hardware Transactional Memory (HTM) implements these mechanisms in silicon to obtain better results than fine-grain locking solutions. One of these mechanisms is data version management, which decides how and where the modifications introduced by transactions are stored to guarantee their atomicity and durability. In this paper, we show that aborts are frequent, especially for applications with coarse-grain transactions and many threads, and that this severely restricts the scalability of log-based HTMs. To address this issue, we propose the use of a gated store buffer to accelerate eager version management for log-based HTM. Moreover, we propose a novel design where the store buffer is used to perform lazy version management (similar to Rock [12]), but overflowed transactions execute with a fallback log-based HTM that uses eager version management. Assuming an infinite store buffer, we show that lazy version management is better suited to applications with fine-grain transactions, while eager version management is better suited to applications with coarse-grain transactions. Limiting the buffer size to 32 entries, we obtain a 20.1% average improvement over log-based HTM for applications with fine-grain transactions (using lazy version management) and 54.7% for applications with coarse-grain transactions (using eager version management).
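The eager/lazy trade-off the paper studies can be made concrete with a minimal sketch (class names and the dictionary "memory" are illustrative assumptions): an eager transaction writes in place and keeps an undo log, so commits are free but aborts must replay the log; a lazy transaction buffers writes, so aborts are free but commits must publish the buffer.

```python
class EagerTx:
    """Eager version management sketch: new values go in place,
    old values into an undo log; abort replays the log backwards."""

    def __init__(self, memory):
        self.mem, self.undo = memory, []

    def write(self, addr, value):
        self.undo.append((addr, self.mem.get(addr)))
        self.mem[addr] = value          # update in place

    def commit(self):
        self.undo.clear()               # values are already visible

    def abort(self):
        for addr, old in reversed(self.undo):
            if old is None:
                self.mem.pop(addr, None)
            else:
                self.mem[addr] = old    # restore pre-transactional state
        self.undo.clear()


class LazyTx:
    """Lazy version management sketch: new values stay in a store
    buffer; commit publishes them, abort just discards the buffer."""

    def __init__(self, memory):
        self.mem, self.buf = memory, {}

    def write(self, addr, value):
        self.buf[addr] = value          # buffered, not yet visible

    def read(self, addr):
        return self.buf.get(addr, self.mem.get(addr))

    def commit(self):
        self.mem.update(self.buf)       # publish at commit
        self.buf.clear()

    def abort(self):
        self.buf.clear()                # cheap abort
```

This mirrors the abstract's observation: frequent aborts (coarse-grain transactions) favour the lazy path's cheap abort, while a bounded store buffer forces overflowed transactions onto the eager fallback.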

2025, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture

We present Task Superscalar, an abstraction of the instruction-level out-of-order pipeline that operates at the task level. Like ILP pipelines, which uncover parallelism in a sequential instruction stream, task superscalar uncovers task-level parallelism among tasks generated by a sequential thread. Utilizing intuitive programmer annotations of task inputs and outputs, the task superscalar pipeline dynamically detects inter-task data dependencies, identifies task-level parallelism, and executes tasks out of order. Furthermore, we propose a design for a distributed task superscalar pipeline front-end that can be embedded into any manycore fabric and manages cores as functional units. We show that our proposed mechanism is capable of driving hundreds of cores simultaneously with non-speculative tasks, which allows our pipeline to sustain work windows consisting of tens of thousands of tasks. We further show that our pipeline can maintain a decode rate faster than 60 ns per task and dynamically uncover data dependencies among as many as ∼50,000 in-flight tasks, using 7 MB of on-chip eDRAM storage. This configuration achieves speedups of 95-255x (average 183x) over sequential execution for nine scientific benchmarks, running on a simulated CMP with 256 cores. Task superscalar thus enables programmers to exploit manycore systems effectively, while simultaneously simplifying their programming model.
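The dependency-detection step can be sketched in software (the function and the three-tuple task format are assumptions for illustration; the paper implements this in a hardware front-end): each task declares its inputs and outputs, a task depends on the most recent earlier writer of any datum it touches (flow and output dependencies; anti-dependencies are omitted for brevity), and tasks whose dependencies are satisfied form "waves" that may run in parallel.

```python
def schedule_tasks(tasks):
    """Each task is (name, inputs, outputs). Returns waves of task
    names that may execute in parallel, respecting data dependencies
    uncovered from the input/output annotations."""
    last_writer = {}
    deps = {name: set() for name, _, _ in tasks}
    for name, ins, outs in tasks:
        # Depend on the latest earlier task that wrote any touched datum.
        for d in ins + outs:
            if d in last_writer:
                deps[name].add(last_writer[d])
        for d in outs:
            last_writer[d] = name
    # Topological "waves": tasks whose deps are all done run together.
    done, waves = set(), []
    pending = [t[0] for t in tasks]
    while pending:
        wave = [n for n in pending if deps[n] <= done]
        waves.append(wave)
        done |= set(wave)
        pending = [n for n in pending if n not in done]
    return waves
```

For a stream A(writes x), B(reads x, writes y), C(writes z), D(reads y and z), the sketch finds that A and C can run in the first wave, exposing the task-level parallelism hidden in the sequential task order.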

2025, 2009 18th International Conference on Parallel Architectures and Compilation Techniques

Version management, one of the key design dimensions of Hardware Transactional Memory (HTM) systems, defines where and how transactional modifications are stored. Current HTM systems use either eager or lazy version management. Eager systems, which keep new values in place while holding old values in a software log, suffer long delays when aborts are frequent because the pre-transactional state is recovered by software. Lazy systems, which buffer new values in specialized hardware, offer complex and inefficient solutions to handle hardware overflows, which are common in applications with coarse-grain transactions. In this paper, we present FASTM, an eager log-based HTM that takes advantage of the processor's cache hierarchy to provide fast abort recovery. FASTM uses a novel coherence protocol to buffer the transactional modifications in the first-level cache and to keep the non-speculative values in the higher levels of the memory hierarchy. This mechanism allows fast abort recovery of transactions that do not overflow the first-level cache resources. Contrary to lazy HTM systems, committing transactions do not have to perform any actions in order to make their results visible to the rest of the system. FASTM keeps the pre-transactional state in a software-managed log as well, which permits the eviction of speculative values and enables transparent execution even in the case of cache overflow. This approach simplifies eviction policies without degrading performance, because it only falls back to software abort recovery for transactions whose modified state has overflowed the cache. Simulation results show that FASTM achieves a speedup of 43% compared to LogTM-SE, improving the scalability of applications with coarse-grain transactions and obtaining similar performance to an ideal eager HTM with zero-cost abort recovery.

2025

In this paper we extend transactive memory systems (TMS) theory to develop an understanding of the distributed coordination of expertise in high-reliability organizations. We illustrate our conceptual developments in a study of emergency management and response in Greece. We focus on the interaction between operators/dispatchers, ambulance crew, and specialist doctors, including the information and communication technologies (ICT) they use to respond to emergency incidents. Our case contributes to an in-depth understanding of the ways in which high-reliability organizations can sustain a distributed coordination of expertise over the duration of emergency incidents. This is achieved through the cultivation of TMS during a socialization and training period, the dynamic development of trust in emergent actions, and a commitment to shared protocols, which allow for improvisation and bricolage during unexpected incidents. Our findings also explore the role of ICTs in inscribing TMS in computerized protocols, while mediating the development of trust across the team, as well as mediating the construction of running narratives, which enable leaders to coordinate expertise in unexpected incidents.

2025, IEEE Transactions on Computers

Transactional Memory (TM) is a synchronization model for parallel programming which provides optimistic concurrency control. Transactions can run in parallel and are only serialized in case of conflict. In this work we use hardware TM (HTM) to implement an optimistic speculative barrier (SB) to replace the lock-based solution. SBs leverage HTM support to elide barriers speculatively. When a thread reaches an SB, a new SB transaction is started, keeping the updates private to the thread, and letting the HTM system detect potential conflicts. Once the last thread reaches the corresponding SB, the speculative threads can commit their changes. The main contributions of this work are: an API for SBs implemented with HTM extensions; a procedure to check the speculation state in between barriers to enable SBs with non-transactional codes; a HTM SB-aware conflict resolution enhancement where SB transactions stall on a conflict with a standard transaction; and a set of SB use guidelines derived from our experience on using SBs in a variety of applications. We evaluated our proposals in two different architectures with a full-system simulator and an IBM Power8 server. Results show an overall performance improvement of SBs over traditional barriers.
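The barrier-elision idea can be sketched without HTM hardware (the class, its methods, and the explicit buffering are assumptions for illustration; a real SB keeps updates private in the HTM's speculative state and lets the HTM detect conflicts): a thread reaching the barrier proceeds speculatively with its updates buffered, and when the last thread arrives the buffered updates commit.

```python
class SpeculativeBarrier:
    """Single-threaded sketch of an optimistic speculative barrier.
    arrive() lets a thread pass the barrier speculatively; writes made
    after passing stay in a private buffer until the last thread
    arrives, at which point all speculative updates are committed."""

    def __init__(self, n_threads, memory):
        self.n, self.mem = n_threads, memory
        self.arrived = 0
        self.buffers = {}

    def arrive(self, tid):
        self.arrived += 1
        self.buffers.setdefault(tid, {})   # start speculative region
        if self.arrived == self.n:         # last thread: commit everyone
            for buf in self.buffers.values():
                self.mem.update(buf)
            self.arrived, self.buffers = 0, {}
            return True                    # speculation committed
        return False                       # still speculative

    def spec_write(self, tid, addr, value):
        self.buffers[tid][addr] = value    # private until commit
```

In the hardware scheme, the buffering and the conflict detection between a speculative thread and the threads still before the barrier are provided by the HTM system rather than by explicit buffers.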

2025, Journal of Parallel and Distributed Computing

• A new optimized hardware transactional memory (HTM) implementation of Lee's and Ruppert's algorithms using privatizing transactions.
• A reduction of the privatizing transaction validation set to a small subset of variables representative of the whole private-section read set.
• A reduction of the transactional section of privatizing transactions as much as possible to meet the hardware constraints of the HTM system.
• A discussion of the programming complexity of the privatizing transaction solutions.

2025, The Journal of Supercomputing

Lee's algorithm solves the path-connection problems that arise in logical drawing, wiring diagramming or optimal route finding. Its parallel version has been widely used as a benchmark to test transactional memory systems. It exhibits transactions of large size and duration that stress these systems, exposing their limitations. In fact, Lee's algorithm has been shown to perform similarly to its sequential version on commercial hardware transactional memory systems due to persistent capacity overflows. In this paper, we propose a novel approach to Lee's algorithm in the context of commercial hardware transactional memory systems. We show how the majority of the computation of the largest transaction, i.e., grid privatization and path calculation, can be executed outside the boundaries of the transaction, thus reducing the size requirements. We leverage the correctness criteria of lazy-subscription fallback locks to ensure a correct execution. This novel approach uses the transactional memory extensions of commercial processors from a different point of view, needing neither the early-release nor the open-nested-transaction features that are not yet implemented in these systems. We propose an application programming interface to facilitate the task of the programmer. Experiments are carried out on the Intel Core and IBM Power8 architectures, showing speedups of around 3.5× over the standard transactional version on both.

2025, Journal of Parallel and Distributed Computing

• Hardware irrevocability with overflow anticipation and transaction stalling.
• Two-phase abort to allow certain privileged-mode code inside transactions.
• Allowing a transaction to ask for irrevocability if privileged code evicts a transactional block.
• A privileged-aware cache replacement policy to favour transactions over privileged code.
• Evaluation of requester-wins and requester-stalls conflict resolution policies.

2025, IEEE Transactions on Parallel and Distributed Systems

IBM and Intel now offer commercial systems with Transactional Memory (TM), a programming paradigm whose aim is to facilitate concurrent programming while maximizing parallelism. These TM systems are implemented in hardware and provide a software fallback path to overcome the hardware implementation limitations. They are known as best-effort hardware TM (BE-HTM) systems. The software fallback path must be provided by the user to ensure forward progress, which adds programming complexity to the TM paradigm. We propose a new type of irrevocability (a transactional mode that marks transactions as non-abortable) to deal with BE-HTM limitations in a more efficient manner, and to liberate the user from having to program a fallback path. It is based on the concept of lazy subscription used in the context of software fallback paths, where the fallback lock is checked at the end of the transaction instead of at the beginning. We propose a hardware lazy irrevocability mechanism that does not involve changes in the coherence protocol. It solves the unsafe execution problem of premature commits associated with lazy subscription fallbacks, and can be triggered by the user via an ISA extension, for the sake of versatility. It is compared with its software counterpart, which we propose as an enhanced lazy single global lock with escaped spinning at the end of the transaction. We also propose the lazy irrevocability with anticipation, a mechanism that cannot be implemented in software, which significantly improves the performance of codes with multiple cache evictions of transactional data. The evaluation of the proposals is carried out with the Simics/GEMS simulator along with the STAMP benchmark suite, and we obtain speedups from 14% to 28% over the fallback path approaches.

2025, The Journal of Supercomputing

In hardware transactional memory, signatures have been proposed to keep track of the memory locations accessed in a transaction to help conflict detection. Generally, signatures are implemented as Bloom filters that suffer from aliasing, that is, they can give rise to false conflicts. Such conflicts become more likely as the signature fills (saturation), and they can lead a parallel application to perform worse than its serial version. Irrevocability is analyzed as a way to address the signature saturation problem. When a transaction reaches a saturation threshold, it enters an irrevocable state that prevents it from being aborted. Hence, such a transaction keeps running while the others are either stalled or allowed to run concurrently. We propose an analytical model that shows this is a good solution to overcome a high-contention scenario. In addition, an experimental evaluation shows the benefits in performance and power consumption of the proposed irrevocability mechanisms. Different saturation metrics are considered, and a fixed threshold is found that yields maximum performance for the benchmarks evaluated.
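The signature and saturation mechanics can be illustrated with a small Bloom-filter sketch (the class, the hash construction, and the fill-ratio saturation metric are illustrative assumptions; the paper evaluates hardware signatures and several saturation metrics): addresses are hashed into a fixed-size bit vector, membership tests can report false positives (false conflicts), and the fill ratio rises as more addresses are inserted, which is what a saturation threshold monitors.

```python
import hashlib

class Signature:
    """Bloom-filter read/write signature sketch with a saturation check."""

    def __init__(self, bits=64, k=2, threshold=0.5):
        self.bits, self.k, self.threshold = bits, k, threshold
        self.vector = 0

    def _idx(self, addr):
        # k independent hash functions derived from sha256 (illustrative).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.bits

    def insert(self, addr):
        for b in self._idx(addr):
            self.vector |= 1 << b

    def maybe_contains(self, addr):
        # May return True for never-inserted addresses: a false conflict.
        return all((self.vector >> b) & 1 for b in self._idx(addr))

    def fill_ratio(self):
        return bin(self.vector).count("1") / self.bits

    def saturated(self):
        # Past this threshold, a transaction could turn irrevocable
        # instead of risking frequent false-conflict aborts.
        return self.fill_ratio() >= self.threshold
```

Once `saturated()` trips, aborting and retrying gains little because the filter keeps aliasing, which motivates letting the saturated transaction run irrevocably to completion.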

2025, 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing

Signatures have been proposed in transactional memory systems to represent read and write sets and to decouple transaction conflict detection from private caches, or to accelerate it. Generally, signatures are implemented as Bloom filters, which allow unbounded read/write sets to be summarized in bounded space at the cost of false conflict detection. This behavior is known to have a great impact on parallel performance. In this work, a scalability study of state-of-the-art signature designs is presented for different orthogonal transactional characteristics, including contention, transaction length, concurrency and spatial locality. The study was carried out using the Stanford EigenBench benchmark, which was modified to support spatial locality analysis via a Zipf address distribution. Experimental evaluation on a hardware transactional memory simulator shows the impact of these parameters on the behavior of state-of-the-art signatures.
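The interaction between spatial locality and false positives can be demonstrated with a small, hypothetical experiment: draw addresses from a truncated Zipf distribution (the same family the modified EigenBench uses), insert them into a toy Bloom filter, and measure the false-positive rate. All sizes and parameters here are assumptions for illustration, not the paper's configuration.

```python
import hashlib
import random

def bloom_positions(addr, bits=128, k=2):
    """k deterministic bit positions for an address (toy hash scheme)."""
    for i in range(k):
        h = hashlib.sha256(f"{i}:{addr}".encode()).digest()
        yield int.from_bytes(h[:4], "big") % bits

def zipf_addresses(n, universe=1024, s=1.0, seed=0):
    """Draw n addresses from a truncated Zipf(s) distribution.
    Larger s means higher locality; s = 0 degenerates to uniform."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, universe + 1)]
    return rng.choices(range(universe), weights=weights, k=n)

def false_positive_rate(inserted, probes, bits=128, k=2):
    """Insert one address stream, then probe with another and count
    hits on addresses that were never actually inserted."""
    filt = [0] * bits
    member = set(inserted)
    for a in inserted:
        for p in bloom_positions(a, bits, k):
            filt[p] = 1
    fp = trials = 0
    for a in probes:
        if a in member:
            continue  # a true member can never be a false positive
        trials += 1
        if all(filt[p] for p in bloom_positions(a, bits, k)):
            fp += 1
    return fp / trials if trials else 0.0
```

With a skewed (high-locality) stream, far fewer distinct addresses are inserted, so the filter stays emptier and aliases less than with a near-uniform stream of the same length: this is the kind of locality effect the study quantifies on real signature designs.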

2025, IEEE Transactions on Parallel and Distributed Systems

Transactional Memory (TM) systems must track the memory accesses made by concurrent transactions in order to detect conflicts. Many TM implementations use signatures for this purpose, which summarize reads and writes in fixed-size bit registers at the cost of false positives (detection of non-existing conflicts). Signatures are commonly implemented as two separate same-sized Bloom filters, one for reads and another for writes. In contrast, transactions frequently exhibit read and write sets of uneven cardinality. This mismatch between data sets and filter storage introduces inefficiencies in the use of signatures that have some impact on performance. This paper presents different signature designs as alternatives to the common scheme, to deal with the asymmetry in transactional data sets in an effective way. Basically, we analyze two classes of new signatures, called multiset and reconfigurable asymmetric signatures. The first class uses only one Bloom filter to track both read and write sets, while the second class uses Bloom filters of configurable size for reads and writes. The main focus of this paper is a thorough study of these alternative signature designs, including a statistical analysis of false positives and an experimental evaluation, providing performance results and hardware area, time and energy requirements.
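One way to picture the multiset class is a single filter whose cells record the strongest access type seen, so that one structure answers both conflict queries instead of two fixed-size halves. The sketch below is a simplified model under that assumption; the actual designs in the paper differ in their encoding and hardware details.

```python
import hashlib

class MultisetSignature:
    """Toy multiset signature: ONE shared filter for reads and writes.
    Each cell holds the strongest access seen: 0 = none, 1 = read,
    2 = write. False positives are possible, false negatives are not."""
    NONE, READ, WRITE = 0, 1, 2

    def __init__(self, bits=256, k=4):
        self.bits, self.k = bits, k
        self.cells = [self.NONE] * bits

    def _positions(self, addr):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.bits

    def add_read(self, addr):
        for p in self._positions(addr):
            # max() keeps WRITE cells as WRITE, so writes are never downgraded
            self.cells[p] = max(self.cells[p], self.READ)

    def add_write(self, addr):
        for p in self._positions(addr):
            self.cells[p] = self.WRITE

    def conflicts_with_remote_write(self, addr):
        # A remote write conflicts with a local read OR a local write.
        return all(self.cells[p] != self.NONE for p in self._positions(addr))

    def conflicts_with_remote_read(self, addr):
        # A remote read conflicts only with a local write.
        return all(self.cells[p] == self.WRITE for p in self._positions(addr))
```

Because reads and writes share one pool of cells, a read-dominated transaction can use nearly the whole filter for its read set, which is exactly the asymmetry that two fixed same-sized filters waste.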

2024

Abstract—Already announced in 2007 for Sun’s Rock processor but later canceled, hardware transactional memory (HTM) finally found its way into general-purpose desktop and server systems and is soon to be expected for embedded and real-time systems. However, although current hardware implementations have their pitfalls, hindering an immediate adoption of HTM as a synchronization primitive for real-time operating systems, we illustrate on the example of a transactional implementation of the L4/Fiasco.OC inter-process communication (IPC) how extended versions of HTM may revolutionize kernel design and, in particular, how they may reduce the verification costs of a multi-core kernel to little more than verifying a selectively preemptible uni-processor kernel. Removing L4/Fiasco.OC’s roughly five-hundred-line cross-processor IPC path and making the local path transactional, we benefit from a principal performance boost for sending cross-core messages. However for the average case, we e...

2024, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum

When supported in silicon, transactional memory (TM) promises to become a fast, simple and scalable parallel programming paradigm for future shared-memory multiprocessor systems. Among the multitude of hardware TM design points and policies that have been studied so far, lazy conflict resolution designs often extract the most concurrency, but their inherent need for lazy versioning requires careful management of speculative updates. In this paper we study how coherent buffering, in private caches for example, as has been proposed in several hardware TM proposals, can lead to inefficiencies. We then show how such inefficiencies can be substantially mitigated by using complete or partial non-coherent buffering of speculative writes in dedicated structures or suitably adapted standard per-core write buffers. These benefits are particularly noticeable in scenarios involving large coarse-grained transactions that may write a lot of non-contended data in addition to actively shared data. We believe our analysis provides important insights into some overlooked aspects of TM behaviour and should prove useful to designers wishing to implement lazy TM schemes in hardware.
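The core idea of non-coherent buffering can be captured in a short model: speculative writes land in a private structure that generates no coherence traffic, reads service the transaction's own writes first, and the buffer is drained on commit or discarded on abort. This is a behavioural sketch under simplifying assumptions (memory as a dictionary, no capacity limits), not a description of the paper's hardware structures.

```python
class SpeculativeWriteBuffer:
    """Toy model of non-coherent buffering for lazy versioning.
    Speculative writes stay in a private buffer, invisible to the
    coherence protocol, until the transaction commits."""

    def __init__(self, memory):
        self.memory = memory   # shared 'memory': dict of addr -> value
        self.buffer = {}       # private speculative redo log

    def write(self, addr, value):
        # No coherence traffic: the write never touches shared state.
        self.buffer[addr] = value

    def read(self, addr):
        # Read-own-writes: the buffer takes priority over shared memory.
        return self.buffer.get(addr, self.memory.get(addr))

    def commit(self):
        # Only now are the buffered writes made globally visible.
        self.memory.update(self.buffer)
        self.buffer.clear()

    def abort(self):
        # Cheap abort: shared memory was never modified, just discard.
        self.buffer.clear()
```

Because non-contended data written by a large coarse-grained transaction never enters the coherent cache hierarchy speculatively, it causes none of the invalidation and refetch traffic that coherent buffering would, which is the inefficiency the paper targets.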

2024, Journal of Systems Architecture

Multiprocessor embedded systems integrate diverse dedicated processing units to handle high-performance applications such as multimedia and network processing. However, lock-based synchronization limits the efficiency of such heterogeneous concurrent systems. Hardware Transactional Memory (HTM) is a promising approach to creating an abstraction layer for multi-threaded programming. However, HTM performance is application-specific and determined by the version and conflict management configurations. Most previous HTM implementations for embedded systems in the literature were built on fixed version management, which results in significant performance loss when transaction behaviour changes. In this paper, we propose an HTM targeted at embedded applications that is able to adapt its version management to application behaviour at runtime. It is prototyped and analysed on an Altera Cyclone IV platform. Random requests at different contention levels and with different transaction sizes are used to verify the performance of the proposed HTM. Based on our experiments, lazy version management obtains up to a 12.82% speed-up compared to eager version management at high contention levels, while eager version management obtains up to a 37.84% speed-up compared to lazy version management at low contention. The adaptive mechanism is able to switch configuration at runtime based on application behaviour for maximum performance.
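The trade-off the adaptive mechanism exploits is that eager versioning (in-place update plus undo log) makes commits cheap but aborts expensive, while lazy versioning (redo log drained at commit) makes aborts cheap but commits expensive. The sketch below models a runtime switch driven by a recent abort rate; the class name, window size, and threshold are invented for illustration and are not the paper's policy.

```python
class AdaptiveVersionManager:
    """Toy runtime switch between eager (undo-log) and lazy (redo-log)
    version management, driven by the observed abort rate."""

    def __init__(self, memory, high_contention_threshold=0.5, window=10):
        self.memory = memory           # shared 'memory': dict addr -> value
        self.mode = "eager"
        self.threshold = high_contention_threshold  # assumed, not tuned
        self.window = window
        self.outcomes = []             # recent history: 0 = commit, 1 = abort

    def run_tx(self, writes, aborted=False):
        """Apply a dict of writes transactionally; 'aborted' simulates
        a conflict detected before commit."""
        if self.mode == "eager":
            undo = {a: self.memory.get(a) for a in writes}
            self.memory.update(writes)       # in-place update, old values logged
            if aborted:                      # expensive abort: roll back
                for a, v in undo.items():
                    if v is None:
                        self.memory.pop(a, None)
                    else:
                        self.memory[a] = v
        else:  # lazy: writes are a redo log, drained only on commit
            if not aborted:
                self.memory.update(writes)   # cheap abort: nothing to undo
        self._record(aborted)

    def _record(self, aborted):
        self.outcomes.append(1 if aborted else 0)
        self.outcomes = self.outcomes[-self.window:]
        abort_rate = sum(self.outcomes) / len(self.outcomes)
        # High contention favours lazy versioning (cheap aborts);
        # low contention favours eager versioning (cheap commits).
        self.mode = "lazy" if abort_rate >= self.threshold else "eager"
```

Under this assumed policy, a burst of aborts flips the system into lazy mode, matching the paper's observation that each fixed configuration wins only in its own contention regime.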