Virtual Memory Research Papers - Academia.edu

2025, Communications of The ACM

2025, World Academy of Science, Engineering and Technology, International Journal of Computer, Electrical, Automation, Control and Information Engineering

Most scientific programs have large input and output data sets that require out-of-core programming or use virtual memory management (VMM). Out-of-core programming is very error-prone and tedious; as a result, it is generally avoided. However, in many instances, VMM is not an effective approach because it often results in substantial performance reduction. In contrast, compiler-driven I/O management allows a program's data sets to be retrieved in parts, called blocks or tiles. Comanche (COmpiler MANaged caCHE) is a compiler combined with a user-level runtime system that can be used to replace standard VMM for out-of-core programs. We describe Comanche and demonstrate on a number of representative problems that it substantially outperforms VMM. Significantly, our system does not require any special services from the operating system and does not require modification of the operating system kernel.
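
A minimal sketch of the general idea behind such compiler-managed I/O: instead of relying on VMM, data is fetched in fixed-size tiles and a small user-level cache holds the hot ones. The file name, tile size, slot count, and LRU policy here are illustrative assumptions, not details taken from the Comanche paper.

```c
/* Sketch of block (tile) based out-of-core access with a user-level cache.
 * All names and sizes are illustrative. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TILE_BYTES 4096   /* size of one tile on disk     */
#define CACHE_SLOTS 8     /* in-memory tiles held at once */

typedef struct {
    long tile_no;                 /* which tile this slot holds, -1 if empty */
    unsigned stamp;               /* last-use time for LRU eviction          */
    unsigned char data[TILE_BYTES];
} Slot;

static Slot cache[CACHE_SLOTS];
static unsigned clock_now = 0;

/* Return a pointer to the cached copy of tile `t` of file `f`,
 * reading it from disk (and evicting the LRU slot) on a miss. */
static unsigned char *get_tile(FILE *f, long t)
{
    int victim = 0;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].tile_no == t) {          /* hit */
            cache[i].stamp = ++clock_now;
            return cache[i].data;
        }
        if (cache[i].stamp < cache[victim].stamp)
            victim = i;                        /* track LRU slot */
    }
    /* miss: replace the least recently used tile */
    fseek(f, t * TILE_BYTES, SEEK_SET);
    size_t got = fread(cache[victim].data, 1, TILE_BYTES, f);
    memset(cache[victim].data + got, 0, TILE_BYTES - got);
    cache[victim].tile_no = t;
    cache[victim].stamp = ++clock_now;
    return cache[victim].data;
}

int main(void)
{
    for (int i = 0; i < CACHE_SLOTS; i++) cache[i].tile_no = -1;
    FILE *f = fopen("bigdata.bin", "rb");     /* hypothetical data set */
    if (!f) { perror("bigdata.bin"); return 1; }
    long sum = 0;
    for (long t = 0; t < 64; t++)             /* touch 64 tiles */
        sum += get_tile(f, t)[0];
    printf("checksum %ld\n", sum);
    fclose(f);
    return 0;
}
```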

2025, Proceedings of the eighth symposium on Operating systems principles - SOSP '81

A new virtual memory management algorithm, WSCLOCK, has been synthesized from the local working set (WS) algorithm, the global CLOCK algorithm, and a new load control mechanism for auxiliary memory access. The new algorithm combines the most useful feature of WS, a natural and effective load control that prevents thrashing, with the simplicity and efficiency of CLOCK. Studies are presented to show that the performance of WS and WSCLOCK are equivalent, even if the savings in overhead are ignored.
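
A sketch of the core eviction scan may help: WSClock performs a CLOCK-style circular sweep that also applies the working-set test (evict only pages whose last use is older than the window tau). This simplification omits dirty-page writeback and the load control mechanism the paper covers; the frame count and tie-breaking fallback are assumptions.

```c
/* Simplified WSClock scan: a CLOCK sweep combined with the working-set
 * test "now - last_use > tau". Writeback and load control are omitted. */
#include <stdio.h>

#define NFRAMES 8

typedef struct {
    int referenced;       /* hardware reference bit   */
    long last_use;        /* virtual time of last use */
    int page;             /* resident page number     */
} Frame;

static Frame frames[NFRAMES];
static int hand = 0;

/* Pick a frame to evict at virtual time `now`, with working-set window `tau`. */
static int wsclock_victim(long now, long tau)
{
    for (int scanned = 0; scanned < 2 * NFRAMES; scanned++) {
        Frame *f = &frames[hand];
        int slot = hand;
        hand = (hand + 1) % NFRAMES;

        if (f->referenced) {
            /* recently used: clear the bit and record the use time */
            f->referenced = 0;
            f->last_use = now;
        } else if (now - f->last_use > tau) {
            return slot;   /* not referenced and outside the working set */
        }
    }
    return hand;           /* fallback: every page is in some working set */
}

int main(void)
{
    for (int i = 0; i < NFRAMES; i++)
        frames[i] = (Frame){ .referenced = i % 2, .last_use = i, .page = 100 + i };
    int v = wsclock_victim(/*now=*/20, /*tau=*/10);
    printf("evict frame %d (page %d)\n", v, frames[v].page);
    return 0;
}
```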

2025

Emerging multicore platforms are increasingly deploying distributed scratchpad memories to achieve lower energy and area together with higher predictability; but this requires transparent and efficient software management of these critical resources. In this paper, we introduce SPMVisor, a hardware/software layer that virtualizes the scratchpad memory space in order to facilitate the use of distributed SPMs in an efficient, transparent and secure manner. We introduce the notion of virtual scratchpad memories (vSPMs), which can be dynamically created and managed as regular SPMs. To protect the on-chip memory space, the SPMVisor supports vSPM-level and block-level access control lists. In order to efficiently manage the on-chip real-estate, our SPMVisor supports policy-driven allocation strategies based on privilege levels. Our experimental results on Mediabench/CHStone benchmarks running on various Chip-Multiprocessor configurations and software stacks (RTOS, virtualization, secure execution) show that SPMVisor enhances performance by 71% on average and reduces power consumption by 79% on average.

2025

Nowadays, most computer manufacturers offer chip multiprocessors (CMPs) due to the ever-increasing chip density. These CMPs have a broad range of characteristics, but all of them support the shared memory programming model. As a result, every CMP implements a coherence protocol to keep local caches coherent. Coherence protocols consume an important fraction of power to determine which coherence action to perform. Specifically, on CMPs with write-through local caches, a shared cache, and a directory-based coherence protocol implemented as a duplicate of local cache tags, we have observed that energy is wasted in the directory for two main reasons. Firstly, an important fraction of directory lookups are useless, because the target block is not located in any local cache. The power consumed by the directory could be reduced by filtering out useless directory lookups. Secondly, useful directory lookups (where there are local copies of the target block) are performed over target blocks that are shared by a small number of processors. Directory power consumption could be reduced by limiting lookups to only the directory entries that have a copy of the block. In this thesis we propose two filtering mechanisms, each focused on one of the problems described above: our first proposal reduces the number of directory lookups performed, while our second reduces the associativity of directory lookups. Several implementations of both filtering approaches have been proposed and evaluated, all of them with very limited hardware complexity. Our results show that the power consumed by the directory can be reduced by as much as 30%.
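
A generic sketch of the first idea, filtering out useless directory lookups: a small counting filter tracks which hashed block addresses might be cached anywhere, so a zero count lets the controller skip the directory lookup entirely. The hash, table size, and update points are illustrative assumptions, not the thesis's actual filter designs.

```c
/* Illustrative lookup filter: per hashed block address, count how many
 * local cache lines might hold a copy. Zero means the directory lookup
 * is provably useless and can be skipped. Aliasing only makes the filter
 * conservative (extra lookups), never incorrect (missed lookups). */
#include <stdio.h>
#include <stdint.h>

#define FILTER_SLOTS 1024

static uint16_t filt[FILTER_SLOTS];

static unsigned slot_of(uint64_t block_addr)
{
    return (unsigned)((block_addr * 0x9E3779B97F4A7C15ULL) >> 54) % FILTER_SLOTS;
}

void on_fill(uint64_t block)  { filt[slot_of(block)]++; }  /* line fills a cache  */
void on_evict(uint64_t block) { filt[slot_of(block)]--; }  /* line leaves a cache */

/* Returns 1 if the (power-hungry) directory lookup can be skipped. */
int can_skip_lookup(uint64_t block)
{
    return filt[slot_of(block)] == 0;
}

int main(void)
{
    on_fill(0x1000);                    /* some core caches block 0x1000 */
    printf("0x1000 skip? %d\n", can_skip_lookup(0x1000));  /* 0: must look up */
    printf("0x2000 skip? %d\n", can_skip_lookup(0x2000));  /* 1: safe to skip */
    on_evict(0x1000);
    printf("0x1000 skip? %d\n", can_skip_lookup(0x1000));  /* 1 again */
    return 0;
}
```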

2025, Proceedings of the 14th Western Canadian Conference on Computing Education

Computer system courses have long benefited from simulators in conveying important concepts to students. We have modified the Java source code of the MOSS virtual memory simulator to allow users to easily switch between different page replacement algorithms, including FIFO, LRU, and Optimal replacement. The simulator clearly demonstrates the behavior of the page replacement algorithms in a virtual memory system and provides a convenient way to illustrate page faults and their corresponding costs. Equipped with a GUI for control and page table visualization, it lets students see how page tables operate and which pages the replacement algorithms evict on a page fault. Moreover, class projects may require operating systems students to code new page replacement algorithms and integrate them into the MOSS VM simulator, enhancing their Java coding skills. By running various simulations, students can collect page replacement statistics and compare the performance of the various replacement algorithms.
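
A compressed version of the bookkeeping such a simulator performs, with the GUI stripped away: page fault counts for FIFO, LRU, and Belady's OPT over one reference string. The simulator itself is Java; this standalone C sketch, with an assumed frame count and reference string, only mirrors the statistics-gathering.

```c
/* Count page faults for FIFO, LRU, and Belady's OPT on one reference string. */
#include <stdio.h>
#include <string.h>

#define FRAMES 3
#define NREFS  12

static const int refs[NREFS] = {1,2,3,4,1,2,5,1,2,3,4,5};

static int faults(const char *policy)
{
    int frame[FRAMES], meta[FRAMES];  /* meta: load time (FIFO) or last use (LRU) */
    int n = 0, f = 0;
    for (int t = 0; t < NREFS; t++) {
        int p = refs[t], hit = -1;
        for (int i = 0; i < n; i++)
            if (frame[i] == p) hit = i;
        if (hit >= 0) {
            if (!strcmp(policy, "LRU")) meta[hit] = t;
            continue;
        }
        f++;
        if (n < FRAMES) {                 /* free frame available */
            frame[n] = p; meta[n] = t; n++;
            continue;
        }
        int v = 0;
        if (!strcmp(policy, "OPT")) {     /* evict page used farthest in future */
            int best = -1;
            for (int i = 0; i < FRAMES; i++) {
                int next = NREFS + 1;     /* "never used again" beats everything */
                for (int u = t + 1; u < NREFS; u++)
                    if (refs[u] == frame[i]) { next = u; break; }
                if (next > best) { best = next; v = i; }
            }
        } else {                          /* FIFO and LRU: smallest meta wins */
            for (int i = 1; i < FRAMES; i++)
                if (meta[i] < meta[v]) v = i;
        }
        frame[v] = p; meta[v] = t;
    }
    return f;
}

int main(void)
{
    /* with this string and 3 frames: FIFO 9, LRU 10, OPT 7 */
    printf("FIFO: %d faults\n", faults("FIFO"));
    printf("LRU:  %d faults\n", faults("LRU"));
    printf("OPT:  %d faults\n", faults("OPT"));
    return 0;
}
```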

2025, Lecture Notes in Computer Science

Most previous parallel logic programming systems have been built on top of classical operating systems. The advances in the area of parallel operating systems have made it possible to explore new execution models that take advantage of their features. In this paper we propose a family of execution models (VM, VMBA, VMHW) that make use of new virtual memory technologies such as copy-on-write, memory inheritance, and distributed shared memory. Preliminary results from our implementations and simulations are reported.
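
Of the virtual memory technologies listed, copy-on-write is the easiest to observe directly. A POSIX-only sketch, unrelated to the paper's own implementation: after fork() the child inherits pages that stay physically shared until the first write, when the kernel copies only the touched page.

```c
/* Copy-on-write in action: parent and child share pages after fork()
 * until one of them writes. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define N (1 << 20)

int main(void)
{
    int *arr = malloc(N * sizeof *arr);
    if (!arr) return 1;
    memset(arr, 0, N * sizeof *arr);

    pid_t pid = fork();          /* child inherits arr via copy-on-write */
    if (pid == 0) {
        arr[0] = 42;             /* first write: kernel copies this page only */
        printf("child sees  arr[0] = %d\n", arr[0]);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees arr[0] = %d\n", arr[0]);   /* still 0: pages diverged */
    free(arr);
    return 0;
}
```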

2025

Modern operating systems provide a rich set of interfaces for mapping, sharing, and protecting memory. Different memory management unit (MMU) architectures provide different mechanisms for managing memory translations. Since the same OS usually runs on different MMU architectures, a software "hardware address translation" (hat) layer that abstracts the MMU architecture is normally implemented between the MMU hardware and the virtual memory system of the OS. In this paper, we study the impact of the OS and the MMU on the structure and performance of the hat layer. In particular, we concentrate on the role of the hat layer in the scalability of system performance on symmetric multiprocessors with 2-12 CPUs. The results show that, unlike single-user applications, multi-user applications require very careful multi-threading of the hat layer to achieve system performance that scales with the number of CPUs. In addition, multi-threading the hat can result in better performance with smaller amounts of physical memory.

2025, Journal of Applied Crystallography

Structure determination of macromolecules often depends on phase improvement and phase extension by use of real-space averaging of electron density related by noncrystallographic symmetry. Although techniques for such procedures have been described previously [Bricogne (1976). Acta Cryst. A32, 832–847; Johnson (1978). Acta Cryst. B34, 576–577], modern computer architecture and experience with these methods have suggested changes and improvements. Two unit cells are considered: (1) the p-cell corresponding to the actual crystal structure(s) being determined (there would be more than one of these if the molecule crystallizes in more than one crystal form) and (2) the h-cell corresponding to the molecule in a standard orientation with respect to which the molecular symmetry axes are defined. Averaging can proceed entirely within the p-cell, referring to the h-cell only in as far as knowledge of the molecular symmetry is required. It is also possible to place the averaged molecule back ...

2025, arXiv (Cornell University)

Computers continue to diversify with respect to system designs, emerging memory technologies, and application memory demands. Unfortunately, continually adapting the conventional virtual memory framework to each possible system configuration is challenging, and often results in performance loss or requires non-trivial workarounds. To address these challenges, we propose a new virtual memory framework, the Virtual Block Interface (VBI). We design VBI based on the key idea that delegating memory management duties to hardware can reduce the overheads and software complexity associated with virtual memory. VBI introduces a set of variable-sized virtual blocks (VBs) to applications. Each VB is a contiguous region of the globally-visible VBI address space, and an application can allocate each semantically meaningful unit of information (e.g., a data structure) in a separate VB. VBI decouples access protection from memory allocation and address translation. While the OS controls which programs have access to which VBs, dedicated hardware in the memory controller manages the physical memory allocation and address translation of the VBs. This approach enables several architectural optimizations to (1) efficiently and flexibly cater to different and increasingly diverse system configurations, and (2) eliminate key inefficiencies of conventional virtual memory. We demonstrate the benefits of VBI with two important use cases: (1) reducing the overheads of address translation (for both native execution and virtual machine environments), as VBI reduces the number of translation requests and associated memory accesses; and (2) two heterogeneous main memory architectures, where VBI increases the effectiveness of managing fast memory regions. For both cases, VBI significantly improves performance over conventional virtual memory.

2025

It is known that elementary cellular automaton rule 110 is capable of supporting universal computation by emulating a cyclic tag system. Since the whole information necessary to perform computation is stored in the configuration, it is reasonable to investigate the complexity of the configuration when analyzing the computing process. In this research we employed Lempel-Ziv complexity as a measure of complexity and calculated it during the evolution of rule 110 emulating a cyclic tag system. As a result, we observed a stepwise decline of complexity during the evolution, caused by the transformation from table data to moving data and the elimination of table data by a rejector.
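
The complexity measure itself is easy to pin down. Below is one standard formulation of Lempel-Ziv (LZ76) complexity, following the Kaspar-Schuster scan, which counts the phrases in the exhaustive parsing of a binary string; the test strings are illustrative, not actual rule-110 configurations.

```c
/* Lempel-Ziv (LZ76) complexity of a binary string: the number of phrases
 * in its exhaustive history, computed with the Kaspar-Schuster scan. */
#include <stdio.h>
#include <string.h>

static int lz76(const char *s)
{
    int n = (int)strlen(s);
    if (n < 2) return n;
    int c = 1, l = 1, i = 0, k = 1, kmax = 1;
    for (;;) {
        if (s[i + k - 1] == s[l + k - 1]) {
            k++;                           /* extend the current match */
            if (l + k > n) { c++; break; }
        } else {
            if (k > kmax) kmax = k;
            i++;
            if (i == l) {                  /* no earlier match extends: new phrase */
                c++;
                l += kmax;
                if (l + 1 > n) break;
                i = 0; k = 1; kmax = 1;
            } else {
                k = 1;                     /* try the next earlier start */
            }
        }
    }
    return c;
}

int main(void)
{
    printf("%d\n", lz76("0000000000"));        /* low: constant          */
    printf("%d\n", lz76("0101010101"));        /* low: periodic          */
    printf("%d\n", lz76("0110100110010110"));  /* higher: aperiodic      */
    return 0;
}
```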

2025, International Conference on Recent Trends in Information Technology

Caching is a fundamental technique commonly employed to hide the latency gap between memory and the CPU by exploiting locality in memory accesses. On today's architectures a cache miss may cost several hundred CPU cycles. In a two-level memory hierarchy, a cache performs faster than auxiliary storage, but is more expensive; cost concerns thus usually limit cache size to a fraction of the auxiliary memory's size. This paper presents a comparative study of the predictability of several traditional and newer replacement techniques against the OPTIMAL replacement technique.

2025, Ijca Proceedings on International Conference on Recent Trends in Information Technology and Computer Science

2025

High-end embedded systems featuring millions of lines of code, with varying degrees of assurance, are becoming commonplace. These devices are typically expected to meet diverse application requirements within tight resource budgets. Their growing complexity makes it increasingly difficult to ensure that they are secure and robust. One approach is to provide strong guarantees of isolation between components, thereby ensuring that the effects of any misbehaviour are confined to the misbehaving component. This paper focuses on an aspect of the system's behaviour that is critical to any such guarantee: management of physical memory resources. In this paper, we present a secure physical memory management model that gives hard guarantees on physical memory consumption. The model dictates the in-kernel mechanisms for allocation; however, the allocation policy is implemented outside the kernel. We also argue that exporting allocation to user level provides the flexibility necessary to implement the diverse resource management policies needed in embedded systems, while retaining the high-assurance properties of a formally verified kernel.

2025, Communications of the ACM

In many recent computer system designs, hardware facilities have been provided for easing the problems of storage allocation. A method of characterizing dynamic storage allocation systems, according to the functional capabilities provided and the underlying techniques used, is presented. The basic purpose of the paper is to provide a useful perspective from which the utility of various hardware facilities may be assessed. A brief survey of storage allocation facilities in several representative computer systems is included as an appendix.
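
As a reminder of the kind of technique such facilities support, here is the simplest of the classical software schemes, first-fit allocation from a free list. This toy keeps run metadata in a side table and omits coalescing; it is a generic illustration under those assumptions, not something taken from the paper.

```c
/* First-fit allocation over a word arena, with free-run metadata kept in
 * a side table for simplicity (real allocators embed headers in the arena). */
#include <stdio.h>
#include <stddef.h>

#define ARENA 1024
#define MAXRUNS 64

typedef struct { size_t off, len; int used; } Run;

static long arena[ARENA];
static Run runs[MAXRUNS] = { { 0, ARENA, 0 } };  /* one big free run */
static int nruns = 1;

static void *ff_alloc(size_t len)
{
    for (int i = 0; i < nruns; i++) {
        if (!runs[i].used && runs[i].len >= len) {
            if (runs[i].len > len && nruns < MAXRUNS) {   /* split the run */
                for (int j = nruns; j > i + 1; j--) runs[j] = runs[j - 1];
                runs[i + 1] = (Run){ runs[i].off + len, runs[i].len - len, 0 };
                nruns++;
                runs[i].len = len;
            }
            runs[i].used = 1;
            return &arena[runs[i].off];
        }
    }
    return NULL;            /* no fit: a real allocator would compact or grow */
}

static void ff_free(void *p)
{
    size_t off = (size_t)((long *)p - arena);
    for (int i = 0; i < nruns; i++)
        if (runs[i].off == off) runs[i].used = 0;   /* coalescing omitted */
}

int main(void)
{
    void *a = ff_alloc(100), *b = ff_alloc(200);
    ff_free(a);
    void *c = ff_alloc(50);          /* reuses the first hole: first fit */
    printf("a=%p b=%p c=%p\n", a, b, c);
    return 0;
}
```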

2025

Hardware-design languages typically impose a rigid communication hierarchy that follows module instantiation. This leads to an undesirable side effect where changes to a child's interface result in changes to its parents. Soft connections address this problem by allowing the user to specify connection endpoints that are then connected automatically at compilation time rather than manually.

2025, ACM SIGCSE Bulletin

This paper describes a student project which is a major part of a senior-level Operating Systems course at the Federal Institute of Technology in Lausanne. The project consists of conceiving and implementing an entire operating system, where user jobs benefit from a simulated paged virtual memory on a DEC LSI-11 based microprocessor. Students program in Portal, a modular high-level language similar to Modula. The positive reactions we have obtained from our students center on satisfaction in having participated in defining specifications and having implemented an entire system themselves.

2025, Proceedings of the 18th …

In our prior work we explored the use of a separate cache for I-structure memories within the context of dataflow-based multithreaded systems. I-structure memories in dataflow systems are used to store arrays and other indexed or stream data items. That work showed that using separate (data) caches for indexed or stream data and scalar data items could lead to substantial improvements in terms of cache misses. In addition, such a separation allowed for the design of caches that could be tailored to the properties exhibited by different data items. In this paper we explore a similar cache organization, providing architectural support for distinguishing between memory references that exhibit spatial and temporal locality and mapping them to separate caches. Since significant amounts of compulsory and conflict misses are avoided, the size of each cache (i.e., array and scalar), as well as the combined cache capacity, can be reduced. According to the results of our simulations, a partitioned 4k scalar cache with the streams (or arrays) mapped to a 2k array cache can be more efficient than a 16k unified data cache.

2025, North-Holland Publishing Co. eBooks

These proceedings are a collection of contributions to computer system performance, selected by the usual refereeing process from papers submitted to the symposium, as well as a few invited papers representing significant novel contributions made during the last year. They represent the thrust and vitality of the subject as well as its capacity to identify important basic problems and major application areas. The main methodological problems appear in the underlying queueing-theoretic aspects, in the deterministic analysis of waiting time phenomena, in workload characterization and representation, in the algorithmic aspects of model processing, and in the analysis of measurement data. Major areas for applications are computer architectures, data bases, computer networks, and capacity planning. The international importance of the area of computer system performance was well reflected at the symposium by the presence of participants from nineteen countries. The mixture of participants was also evident in the institutions which they represented: 35% from universities, 25% from governmental research organizations, but also 30% from industry and 10% from non-research government bodies. This proves that the area is reaching a stage of maturity where it can contribute directly to progress in practical problems. The Editors.

Contents:

METHODOLOGY
- A systematical approach to the performance modelling of computer systems (M.G. Kienzle and K.C. Sevcik)
- Synchronization problems in hierarchically organized multiprocessor computer systems (U. Herzog and W. Hoffmann)

COMPUTATIONAL METHODS FOR QUEUEING NETWORKS
- Some extensions to multiclass queueing network analysis (Y. Bard)
- Mean value analysis of queueing networks: a new look at an old problem (M. Reiser)
- A computational algorithm for queue distributions via the Pólya theory of enumeration (H. Kobayashi)
- A direct numerical method for queueing networks (W.J. Stewart)

APPLIED PERFORMANCE ANALYSIS
- Performance evaluation of the BASIS system (R.P. van de Riet)
- A model of a heterogeneous multiple-minicomputer, System M2: performance evaluation during developments (J.-J. Guillemaud)
- Regime process analysis of a virtual machine operating system (W.-T.K. Lin)
- A hybrid simulation/analytical model of a batch computer system (D. Asztalos)
- An approach to the construction of workload models (W. Materna)

SCHEDULING TECHNIQUES
- Scheduling under resource constraints: achievements and prospects (J. Błażewicz and J. Węglarz)
- Analysis of a class of schedules for computer systems with real time applications (A.A. Fredericks)
- A load-sensitive scheduler for interactive systems (M. Ruschitzka)
- Priority batch processing for upper bounded response times (U. de Carlini, A. Mazzeo, and C. Savy)
- Waiting-time distributions for deadline-oriented serving (B. Walke and W. Rosenbohm)

INFORMATION SYSTEMS
- An example for an adaptive control method providing data base integrity

PERFORMANCE CONTROL
- Performance evaluation of a cache memory for a mini-computer (M. Badel and J. Leroudier)
- Performance improvement by feedback control of the operating system (A. Geck)
- A queueing model of a timesliced priority driven task dispatching algorithm (P.S. Kritzinger, A.E. Krzesinski, and P. Teunissen)
- A study of a mechanism for controlling multiprogrammed memory in an interactive system (A. Brandwajn and J.A. Hernandez)

COMMUNICATION NETWORK MODELLING I
- Modelling of local computer networks (O. Spaniol)
- A communication protocol and a problem of coupled queues (L. Boguslavskii and E. Gelenbe)
- Virtual circuits versus Datagram: usage of communication resources (A. Butrimenko and U. Sichra)

COMMUNICATION NETWORK MODELLING II
- Failsafe distributed loop-free routing in communication networks (A. Segall)
- Sizing a message store subject to blocking criteria (E. Arthurs and J.S. Kaufman)

Performance of Computer Systems, M. Arato, A. Butrimenko, E. Gelenbe (eds.), North-Holland Publishing Company, 1979.

2025, The Indian Journal of Technical Education (IJTE)

Operating systems are designed to manage a computer's hardware and software resources. As technology develops day by day, resource management and memory optimization have become essential for computing that solves real-world problems, and researchers continue to study large problems in order to find solutions within minimal time. To address memory management problems, operating systems provide various concepts such as memory isolation, segmentation, paging, virtual memory, and fragmentation handling. This paper proposes a custom toy operating system in the Rust programming language which aims to solve common memory management problems, and finds an effective way of optimizing memory using the concept of paging. Rust is chosen over other languages because it offers a higher level of memory safety than C and C++ without relying on a garbage collector, which helps with memory optimization.

2025, 2024 IEEE Cloud Summit

Real-time data stream processing at the edge is crucial for time-sensitive tasks within large-scale IoT systems. Task scheduling plays a key role in managing Quality of Service (QoS), necessitating a prioritization system to distinguish between high- and low-priority tasks and thus ensure efficient data processing on edge nodes. Existing scheduling algorithms rigidly prioritize tasks deemed high-priority, often at the expense of fairness and overall system efficiency. In this paper, we propose a Priority-aware Fair Task Scheduling (FTS-Hybrid) algorithm that addresses these challenges by managing priority-based task execution in a controlled manner. Our task scheduling algorithm streamlines resource utilization and enhances system responsiveness, contributing to low latency and high throughput, outperforming competing techniques including First-Come-First-Serve (FCFS), Round Robin (RR), and Priority Scheduling (PS). We implemented FTS-Hybrid on Apache Storm and evaluated its performance using an open-source real-time IoT benchmark (RIoTBench). Experimental results show that the FTS-Hybrid algorithm reduces task execution latency by 24%, 31%, and 26% compared with FCFS, RR, and PS, respectively, by strategically mitigating queuing delays under dynamic workload conditions.

2025

Improper access of data buffers is one of the most common errors in programs written in assembler, C, C++, and several other languages. Existing programs and OSs frequently access data beyond the allocated buffers or access buffers that were already freed. Such programs and OSs may run for years before their problems are detected, because improper memory accesses frequently result in silent data corruption. Not surprisingly, most computer worms exploit buffer overflow errors to gain complete control over computer systems. Only after recent worm epidemics did code developers begin to realize the scale of the problem and the number of potential memory-access violations in existing code. Due to the syntax and flexibility of many programming languages, memory access violations cannot be detected at compile time. Tools that verify correctness before every memory access impose unacceptably high overheads. As a result, most of the developed techniques focus on preventing the hijacking of control by hackers and worms due to stack overflows; consequently, hidden data corruption is given less attention. Memory access violations can be efficiently detected using the hardware support of paging and virtual memory. Kefence is a general run-time solution we developed to detect and avoid in-kernel overflow, underflow, and stale-access problems for internal kernel buffers. Kefence is especially applicable to file system code because file systems operate at a high level of abstraction and require no direct access to physical memory. At the same time, file systems use a large number of kernel buffers, and file system errors are most harmful for users because users' persistent data can be corrupted.
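
The paging-hardware technique this builds on can be shown at user level: end the buffer exactly at a page boundary and follow it with an inaccessible guard page, so an overflow faults at the offending instruction instead of silently corrupting data. This POSIX sketch illustrates the mechanism only; Kefence itself applies it to in-kernel buffers, and the sizes here are arbitrary.

```c
/* Guard-page overflow detection: the byte after the buffer lives on a
 * PROT_NONE page, so writing it raises SIGSEGV immediately. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long pg = sysconf(_SC_PAGESIZE);
    size_t want = 100;                       /* requested buffer size */
    size_t span = ((want + pg - 1) / pg) * pg;

    /* one mapping: data page(s) plus one guard page at the end */
    char *base = mmap(NULL, span + pg, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }
    if (mprotect(base + span, pg, PROT_NONE)) { perror("mprotect"); return 1; }

    char *buf = base + span - want;          /* right-align against the guard */
    buf[want - 1] = 'x';                     /* last valid byte: fine */
    printf("in-bounds write ok\n");
    /* buf[want] = 'x';  <- one byte too far: would hit the guard page and
     *                      fault at the exact offending instruction */
    munmap(base, span + pg);
    return 0;
}
```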

2025, Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture

Transactional Memory is a promising parallel programming model that addresses the programmability issues of lock-based applications using mechanisms that are transparent to developers. Hardware Transactional Memory (HTM) implements these mechanisms in silicon to obtain better results than fine-grain locking solutions. One of these mechanisms is data version management, which decides how and where the modifications introduced by transactions are stored to guarantee their atomicity and durability. In this paper, we show that aborts are frequent, especially for applications with coarse-grain transactions and many threads, and that this severely restricts the scalability of log-based HTMs. To address this issue, we propose the use of a gated store buffer to accelerate eager version management for log-based HTM. Moreover, we propose a novel design, where the store buffer is used to perform lazy version management (similar to Rock [12]) but overflowed transactions execute with a fallback log-based HTM that uses eager version management. Assuming an infinite store buffer, we show that lazy version management is better suited to applications with fine-grain transactions while eager version management is better suited to applications with coarse-grain transactions. Limiting the buffer size to 32 entries, we obtain a 20.1% average improvement over log-based HTM for applications with fine-grain transactions (using lazy version management) and 54.7% for applications with coarse-grain transactions (using eager version management).

2025, Proceedings of the 6th ACM & IEEE International conference on Embedded software - EMSOFT '06

This talk reports on findings gained during roadmapping activities for automotive embedded systems development, conducted in the broader environment of the Artemis Technology Platform. It discusses key trends in the distributed development of embedded automotive applications, and the challenges arising towards implementing such new development processes. Topics covered range from the impact on real-time analysis, safety analysis, and distributed control, to human-in-the-loop analysis. The talk will point out resulting research directions and research priorities.

2025, Invited talk at OOPSLA

2025, ACM Transactions on Database Systems

The emergence of monitoring applications has precipitated the need for Data Stream Management Systems (DSMSs), which constantly monitor incoming data feeds (through registered continuous queries), in order to detect events of interest. In this article, we examine the problem of how to schedule multiple Continuous Queries (CQs) in a DSMS to optimize different Quality of Service (QoS) metrics. We show that, unlike traditional online systems, scheduling policies in DSMSs that optimize for average response time will be different from policies that optimize for average slowdown, which is a more appropriate metric to use in the presence of a heterogeneous workload. Towards this, we propose policies to optimize for the average-case performance for both metrics. Additionally, we propose a hybrid scheduling policy that strikes a fine balance between performance and fairness, by looking at both the average- and worst-case performance, for both metrics. We also show how our policies can be ada...

2025, international journal of engineering trends and technology

Changing trends in technologies, notably cheaper and faster memory hierarchies, have made it worthwhile to revisit many hardware-oriented design decisions made in previous decades. Hardware-oriented designs, in which one uses special-purpose hardware to perform some dedicated function, are a response to the high cost of executing instructions out of memory; when caches are expensive, slow, or in scarce supply, it is perfectly reasonable to use hardware that does not compete with user applications for cache space and does not rely on the performance of the caches. In contrast, when the caches are large enough to withstand competition between the application and the operating system, the cost of executing operating system functions out of the memory subsystem decreases significantly and software-oriented designs become manageable.

2025

Final assignment from group 10 of the Information Systems course A2 from the Informatics Engineering study program, Muhammadiyah University of Jakarta

2025, Ahmed

About Virtual Memory

2025, ACM SIGOPS Operating Systems Review

The Intel iAPX 432 is an object-based microcomputer which, together with its operating system iMAX, provides a multiprocessor computer system designed around the ideas of data abstraction. iMAX is implemented in Ada and provides, through its interface and facilities, an Ada view of the 432 system. Of paramount concern in this system is the uniformity of approach among the architecture, the operating system, and the language. Some interesting aspects of both the external and internal views of iMAX are discussed to illustrate this uniform approach.

2024, Computer architecture news

We describe a scheme for supporting huge address spaces without the need for long addresses implemented in hardware. Pointers are translated ("swizzled") from a long format to a shorter format (directly supported by normal hardware) at page fault time. No extra hardware is required beyond that normally used by virtual memory systems, and no continual software cost is incurred by presence checks or indirection of pointers. This scheme could be used to fault pages into a normal memory from a persistent store, or simply to avoid extra hardware requirements when supporting large address spaces. It exploits temporal and spatial locality in much the same way as a normal virtual memory, so its performance should be quite good.
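
A toy version of the translation step may make the scheme concrete: long persistent identifiers stored on a page are rewritten ("swizzled") into native pointers through a resident-object table. The eager, table-driven form below is an assumption made for brevity; the paper performs this lazily, at page fault time.

```c
/* Pointer swizzling sketch: rewrite long persistent ids into short native
 * pointers when a page becomes resident. All names are illustrative. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_OBJS 4

typedef struct Obj {
    uint64_t next_pid;       /* on-disk form: long persistent id  */
    struct Obj *next;        /* in-memory form after swizzling    */
    int value;
} Obj;

/* toy translation table: persistent id -> resident object */
static struct { uint64_t pid; Obj *mem; } table[16];
static int nmap = 0;

static Obj *resolve(uint64_t pid)
{
    for (int i = 0; i < nmap; i++)
        if (table[i].pid == pid) return table[i].mem;
    return NULL;             /* unknown id: a real system would fetch + map */
}

/* Swizzle every pointer field on a newly resident page. */
static void swizzle_page(Obj *page, int n)
{
    for (int i = 0; i < n; i++)
        page[i].next = resolve(page[i].next_pid);
}

int main(void)
{
    static Obj page[PAGE_OBJS];
    for (int i = 0; i < PAGE_OBJS; i++) {   /* build a cycle of objects */
        page[i].next_pid = 0xA000u + (unsigned)((i + 1) % PAGE_OBJS);
        page[i].next = NULL;
        page[i].value = i;
        table[nmap].pid = 0xA000u + (unsigned)i;
        table[nmap].mem = &page[i];
        nmap++;
    }
    swizzle_page(page, PAGE_OBJS);   /* the paper does this at page fault time */
    Obj *p = &page[0];
    for (int i = 0; i < PAGE_OBJS; i++, p = p->next)
        printf("obj %d -> obj %d\n", p->value, p->next->value);
    return 0;
}
```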

2024, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture

Recent technology advancements allow for the integration of large memory structures on-die or as a die-stacked DRAM. Such structures provide higher bandwidth and faster access time than off-chip memory. Prior work has investigated using the large integrated memory as a cache, or using it as part of a heterogeneous memory system under management of the OS. Using this memory as a cache would waste a large fraction of total memory space, especially for systems where stacked memory could be as large as off-chip memory. An OS-managed heterogeneous memory system, on the other hand, requires costly usage-monitoring hardware to migrate frequently-used pages, and is often unable to capture pages that are highly utilized for short periods of time. This paper proposes a practical, low-cost architectural solution to efficiently enable using large fast memory as Part-of-Memory (PoM) seamlessly, without the involvement of the OS. Our PoM architecture effectively manages two different types of memory (slow and fast) combined to create a single physical address space. To achieve this, PoM implements the ability to dynamically remap regions of memory based on their access patterns and expected performance benefit. Our proposed PoM architecture improves performance by 18.4% and 10.5% over static mapping and an ideal OS-based migration, respectively.

2024

research relevant to the design and application of high performance scientific computers. We test our ideas by designing, building, and using real systems. The systems we build are research prototypes; they are not intended to become products. There are two other research laboratories located in Palo Alto, the Network Systems

2024, WRL Research Report 95/1

This paper presents a design capture system in which schematics are translated into a procedural netlist specification language. The circuit designer draws schematics with a standard structured graphics editor that knows nothing about netlists or schematics. The translator program analyzes the structured graphics output file and translates it into a procedural netlist specification.

2024

To enable virtual machines (VMs) with a large amount of memory to be flexibly migrated, split migration has been proposed. It divides a large-memory VM into small pieces and transfers them to multiple hosts. After the migration, the VM runs across those hosts and exchanges memory data between hosts using remote paging. For such a split-memory VM, however, it becomes difficult to securely run intrusion detection systems (IDS) outside the VM using a technique called IDS offloading. This paper proposes VMemTrans to support transparent IDS offloading for split-memory VMs. In VMemTrans, offloaded IDS can monitor a split-memory VM as if that memory were not distributed. To achieve this, VMemTrans enables IDS running in one host to transparently access the VM's remote memory. To consider a trade-off, it provides two methods for obtaining memory data from remote hosts: self paging and proxy paging. We have implemented VMemTrans in KVM and compared the execution performance of the two methods.

2024, 2005 International Conference on Dependable Systems and Networks (DSN'05)

The ability to check memory references against their associated array/buffer bounds helps programmers to detect programming errors involving address overruns early on and thus avoid many difficult bugs down the line. This paper proposes a novel approach called Cash to the array bound checking problem that exploits the segmentation feature in the virtual memory hardware of the X86 architecture. The Cash approach allocates a separate segment to each static array or dynamically allocated buffer, and generates the instructions for array references in such a way that the segment limit check in X86's virtual memory protection mechanism performs the necessary array bound checking for free. In those cases that hardware bound checking is not possible, it falls back to software bound checking. As a result, Cash does not need to pay per-reference software checking overhead in most cases. However, the Cash approach incurs a fixed setup overhead for each use of an array, which may involve multiple array references. The existence of this overhead requires compiler writers to judiciously apply the proposed technique to minimize the performance cost of array bound checking. This paper presents the detailed design and implementation of the Cash compiler, and a comprehensive evaluation of various performance tradeoffs associated with the proposed array bound checking technique. For the set of complicated network applications we tested, including Apache, Sendmail, Bind, etc., the latency penalty of Cash's bound checking mechanism is between 2.5% to 9.8% when compared with the baseline case that does not perform any bound checking.

2024, Springer eBooks

Considerable research and development has been invested in software Distributed Shared Memory (DSM). The primary focus of this work has traditionally been on high performance and consistency protocols. Unfortunately, clusters present a number of challenges for any DSM system that are not solvable through consistency protocols alone. These challenges relate to the ability of DSM systems to adjust to load fluctuations and to computers being added or removed from the cluster, to deal with faults, and to use DSM objects larger than the available physical memory. This paper introduces the Synergy DSM System and its integration with the virtual memory, group communication, and process migration services of the Genesis Cluster Operating System.

2024

Shared last-level caches, widely used in chip-multiprocessors (CMPs), face two fundamental limitations. First, the latency and energy of shared caches degrade as the system scales up. Second, when multiple workloads share the CMP, they suffer from interference in shared cache accesses. Unfortunately, prior research addressing one issue either ignores or worsens the other: NUCA techniques reduce access latency but are prone to hotspots and interference, and cache partitioning techniques only provide isolation but do not reduce access latency. We present Jigsaw, a technique that jointly addresses the scalability and interference problems of shared caches. Hardware lets software define shares, collections of cache bank partitions that act as virtual caches, and map data to shares. Shares give software full control over both data placement and capacity allocation. Jigsaw implements efficient hardware support for share management, monitoring, and adaptation. We propose novel resource-management algorithms and use them to develop a system-level runtime that leverages Jigsaw to both maximize cache utilization and place data close to where it is used. We evaluate Jigsaw using extensive simulations of 16- and 64-core tiled CMPs. Jigsaw improves performance by up to 2.2× (18% avg) over a conventional shared cache, and significantly outperforms state-of-the-art NUCA and partitioning techniques.

2024

Fault-tolerance has become an essential concern for processor designers due to increasing soft-error rates. In this study, we are motivated by the fact that Transactional Memory (TM) hardware provides an ideal base upon which to build a fault-tolerant system. We show how it is possible to provide low-cost fault-tolerance for serial programs by using a minimally-modified Hardware Transactional Memory (HTM) that features lazy conflict detection and lazy data versioning. This scheme, called FaulTM, employs a hybrid hardware-software fault-tolerance technique. On the software side, the FaulTM programming model provides the flexibility for programmers to decide between performance and reliability. Our experimental results indicate that FaulTM incurs relatively low performance overhead by reducing the number of comparisons and by leveraging already-proposed TM hardware. We also conduct experiments indicating that the baseline FaulTM design has good error coverage. To the best of our knowledge, this is the first architectural fault-tolerance proposal using Hardware Transactional Memory.

2024, International Symposium on Parallel and Distributed Processing and Applications

2024

A program run, in the setting of computer architecture and compilers, can be characterized in part by its memory access patterns. We approach the problem of analyzing these patterns using machine learning. We characterize memory accesses using a sequence of cache miss rates, and present a new data set for this task. The data set draws from programs run on various Java virtual machines, and C and Fortran compilers. We work towards answering the scientific question: How predictable is a program’s cache miss rate from interval to interval as it executes? We report the results of three distinct ANN models, which have been shown to be effective in sequence modeling. We show that programs can be differentiated in terms of the predictability of their cache miss rates.