Random access memory Research Papers
2025, Iraqi Journal of Information Technology
This paper describes a genetic algorithm (GA) designed to find compiler optimization sequences that enhance efficiency metrics such as energy, execution time, code size, peak power, and power profile. We use the GA to find the "best" options for compiling programs with the GNU Compiler Collection (GCC). The algorithm was tested on a variety of benchmarks, and the solutions it generated were compared to "-O0" (no optimization) and the fixed optimization sequence "-O2"; from these results a new fixed sequence is developed that gains more for each metric, and thus enhances efficiency. The GA found optimization sequences that achieved the best results for all efficiency metrics, which shows that the "best" sequence differs from program to program. In particular, if a program is too large for the available search space and the space cannot be enlarged, it may be worth running the GA for several hours to obtain a custom-tailored optimization sequence. Using a GA to find optimization sequences could be a valuable technique for embedded-systems compilers. When compile time is plentiful, an embedded-software designer may want to consider a GA approach for the final optimization.
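The search this abstract describes can be sketched as a bit-string GA over compiler flags. Everything below is illustrative: the flag pool is an arbitrary subset of real GCC options, the population sizes are guesses, and above all fitness() is a stand-in, since a real run would compile the benchmark with the selected flags and measure runtime, energy, or code size.

```python
import random

# Hypothetical flag pool (real GCC flags, illustrative subset).
FLAGS = ["-funroll-loops", "-finline-functions", "-ftree-vectorize",
         "-fomit-frame-pointer", "-fgcse", "-floop-interchange"]

def fitness(genome):
    # Stand-in objective: pretend only the first three flags help.
    # Real use: compile with the enabled flags, run, measure the metric.
    return sum(genome[:3])

def crossover(a, b):
    cut = random.randrange(1, len(a))          # single-point crossover
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.1):
    return [bit ^ (random.random() < rate) for bit in genome]

def evolve(pop_size=20, generations=30):
    pop = [[random.randint(0, 1) for _ in FLAGS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)    # rank by (simulated) gain
        elite = pop[: pop_size // 2]           # keep the better half
        pop = elite + [mutate(crossover(random.choice(elite), random.choice(elite)))
                       for _ in range(pop_size - len(elite))]
    best = max(pop, key=fitness)
    return [flag for flag, bit in zip(FLAGS, best) if bit]

random.seed(0)
print(evolve())  # a flag subset scoring highest under the toy objective
```

The expensive part in practice is the fitness evaluation, which is why the abstract notes that runs of several hours can be worthwhile when compile time is plentiful.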
2025
Two interesting variations of large-scale shared-memory machines that have recently emerged are cache-coherent nonuniform-memory-access machines (CC-NUMA) and cache-only memory architectures (COMA). They both have distributed main memory and use directory-based cache coherence. Unlike CC-NUMA, however, COMA machines automatically migrate and replicate data at the main-memory level in cache-line-sized chunks. This paper compares the performance of these two classes of machines. We first present a qualitative model showing that the relative performance is primarily determined by two factors: the relative magnitude of capacity misses versus coherence misses, and the granularity of data partitions in the application. We then present quantitative results from simulation studies of eight parallel applications (including all six applications from the SPLASH benchmark suite). We show that COMA's potential for performance improvement is limited to applications where data accesses by different processors are finely interleaved in memory space and, in addition, where capacity misses dominate over coherence misses. In other situations, for example where coherence misses dominate, COMA can actually perform worse than CC-NUMA due to increased miss latencies caused by its hierarchical directories. Finally, we propose a new architectural alternative, called COMA-F, that combines the advantages of both CC-NUMA and COMA.
2025, Proceedings of the 24th annual international symposium on Microarchitecture - MICRO 24
Cache memory has proven to be the most important technique for bridging the gap between processor speed and memory access time. The advent of high-speed RISC and superscalar processors, however, calls for small on-chip data caches. Due to physical limitations, these should be simply designed and yet yield good performance. In this paper, we present new cache architectures that address the problems of conflict misses and non-optimal line sizes in the context of direct-mapped caches. Our cache architectures can be reconfigured by software in a way that matches the reference pattern for array data structures. We show that the implementation cost of the reconfiguration capability is negligible. We also show simulation results that demonstrate significant performance improvements for both methods.
2025, Future Generation Computer Systems
Invalidation-based cache coherence protocols have been extensively studied in the context of large-scale shared-memory multiprocessors. Under a relaxed memory consistency model, most of the write latency can be hidden, whereas cache misses still incur a severe performance problem. By contrast, update-based protocols have the potential to reduce both write and read penalties under relaxed memory consistency models because coherence misses can be completely eliminated. The purpose of this paper is to compare update- and invalidation-based protocols for their ability to reduce or hide memory access latencies and for their ease of implementation under relaxed memory consistency models. Based on a detailed simulation study, we find that write-update protocols augmented with simple competitive mechanisms (we call such protocols competitive-update protocols) can hide all the write latency and cut the read penalty by as much as 46%, at the cost of some increase in memory traffic. However, compared to write-invalidate, update-based protocols require more aggressive memory consistency models and more local buffering in the second-level cache to be effective. In addition, their increased number of global writes may cause increased synchronization overhead in applications with high contention for critical sections.
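The competitive mechanism the abstract refers to can be sketched with a per-line counter; the class, method names, and threshold below are mine for illustration, not the paper's design. Remote updates decrement the counter, a local access resets it, and at zero the copy self-invalidates, so a line that is written remotely but never read locally stops consuming update traffic.

```python
# Sketch of the competitive-update idea: switch from update to
# invalidate behaviour per cached copy, based on observed use.

COMPETITIVE_THRESHOLD = 4  # assumed threshold; a real design tunes this

class CachedLine:
    def __init__(self):
        self.valid = True
        self.counter = COMPETITIVE_THRESHOLD

    def on_local_access(self):
        if self.valid:
            self.counter = COMPETITIVE_THRESHOLD  # line is live locally
        return self.valid   # False means the access misses

    def on_remote_update(self):
        if not self.valid:
            return          # already invalidated: no update delivered
        self.counter -= 1
        if self.counter == 0:
            self.valid = False  # unanswered updates: act like invalidate

line = CachedLine()
for _ in range(4):
    line.on_remote_update()
print(line.valid)  # → False: four unanswered remote updates drop the copy
```

This captures why such protocols trade a bounded amount of extra update traffic for the elimination of coherence misses on actively shared lines.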
2025
Processor speeds continue to outpace the memory subsystem, making it necessary to proactively acquire and retain important data. Current applications have an ever-increasing number of dynamically allocated data structures, and these data structures occupy large footprints. A large portion of dynamically allocated data is accessed through pointers in the form of recursive data structures. Loads accessing
2025, IEEE Embedded Systems Letters
Multicore processors (CMPs) represent a good solution to provide the performance required by current and future hard real-time systems. However, it is difficult to compute a tight worst-case execution time (WCET) estimation for CMPs due to the interferences that tasks suffer when accessing shared hardware resources. We propose an analyzable JEDEC-compliant DDRx SDRAM memory controller (AMC) for hard real-time CMPs that reduces the impact of memory interferences caused by other tasks on WCET estimation, providing a predictable memory access time and allowing the computation of tight WCET estimations.
2025, Proceedings of the eighth symposium on Operating systems principles - SOSP '81
A new virtual memory management algorithm, WSCLOCK, has been synthesized from the local working set (WS) algorithm, the global CLOCK algorithm, and a new load control mechanism for auxiliary memory access. The new algorithm combines the most useful feature of WS, a natural and effective load control that prevents thrashing, with the simplicity and efficiency of CLOCK. Studies are presented to show that the performance of WS and WSCLOCK are equivalent, even if the savings in overhead are ignored.
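The combination described above can be sketched roughly as follows. The window value, the data layout, the two-sweep bound, and the way the last-use time is refreshed are simplifications of mine; the real WSCLOCK also includes the load control the abstract mentions, which this sketch omits.

```python
# Minimal sketch of the WSCLOCK reclaim step: a CLOCK hand sweeps frames,
# but a frame is only reclaimed when its reference bit is clear AND its
# last use falls outside the working-set window tau.

TAU = 10  # assumed working-set window, in virtual-time ticks

class Frame:
    def __init__(self, page, last_use):
        self.page = page
        self.ref_bit = False
        self.last_use = last_use

def wsclock_reclaim(frames, hand, now, tau=TAU):
    """Advance the clock hand until a reclaimable frame is found."""
    n = len(frames)
    for _ in range(2 * n):           # at most two sweeps in this sketch
        f = frames[hand]
        if f.ref_bit:
            f.ref_bit = False        # give the page a second chance
            f.last_use = now         # approximate the WS last-use update
        elif now - f.last_use > tau:
            return hand              # outside the working set: reclaim it
        hand = (hand + 1) % n
    return None                      # every frame is in some working set

frames = [Frame("A", last_use=0), Frame("B", last_use=95), Frame("C", last_use=40)]
frames[0].ref_bit = True  # page A was touched recently
victim = wsclock_reclaim(frames, hand=0, now=100)
print(frames[victim].page)  # → C: unreferenced and outside the window
```

The `None` case is exactly where WS-style load control kicks in: if no frame is reclaimable, the memory is overcommitted and a process should be suspended rather than thrashed.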
2025, Journal of Systems Architecture
In multitasking real-time systems it is required to compute the WCET of each task and also the effects of interferences between tasks in the worst case. This is very complex with variable-latency hardware, such as instruction cache memories, or, to a lesser extent, the line buffers usually found in the fetch path of commercial processors. Some methods disable cache replacement so that it is easier to model the cache behavior. The difficulty in these cache-locking methods lies in obtaining a good selection of the memory lines to be locked into cache. In this paper, we propose an ILP-based method to select the best lines to be loaded and locked into the instruction cache at each context switch (dynamic locking), taking into account both intra-task and inter-task interferences, and we compare it with static locking. Our results show that, without cache, the spatial locality captured by a line buffer doubles the performance of the processor. When adding a lockable instruction cache, dynamic locking systems are schedulable with a cache size between 12.5% and 50% of the cache size required by static locking. Additionally, the computation time of our analysis method is not dependent on the number of possible paths in the task. This allows us to analyze large codes in a relatively short time (100 KB with 10^65 paths in less than 3 min).
2025
We have analyzed the register requirements of dynamically scheduled processors using conventional register renaming and running the SPEC2000 benchmarks. As is well known, the late release policy of conventional renaming increases the required number of registers by a significant amount: many registers in the register file contain values that will never be read in the future. This paper presents limits on the performance gains that could be reached by assuming perfect knowledge of the instructions using registers for the last time in program order. Efficient techniques for releasing registers precisely could be those that move useless values out of the register file as soon as possible, into an auxiliary structure placed off the critical processor paths. Values are held in this structure just in case they are needed by an unexpected use. Releasing registers using such perfect knowledge gives either a speedup of 40% for a 64int+64fp register file when executing fp code, or a 55% reduction in the register file size for a given performance level when executing int code. We also show that performance is not sensitive to the potential latency penalty when accessing values stored in the auxiliary register file.
2025
Operating systems enable collecting and extracting rich information on application execution characteristics, including program counter traces, memory access patterns, and operating-system-generated signals. This information can be exploited to design highly efficient, application-aware reliability mechanisms that are transparent to applications. This paper describes the Reliability MicroKernel framework (RMK), a loadable kernel module for providing application-aware reliability and dynamically configuring reliability mechanisms installed in RMK. The RMK prototype is implemented in Linux and supports detection of application/OS failures and transparent application checkpointing. Experimental results show that the OS hang detection and application hang detection, which exploit characteristics of application and system behavior, can achieve 100% coverage and low false positive rates. Moreover, the performance overhead of RMK and the detection/checkpointing mechanisms is small (0.6% for ...
2025, Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
Ease of programming is one of the main impediments to the broad acceptance of multi-core systems with no hardware support for transparent data transfer between local and global memories. A software cache is a robust approach to provide the user with a transparent view of the memory architecture, but this software approach can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture that classifies memory accesses at compile time into two classes, high-locality and irregular. Our approach then steers the memory references toward one of two specific cache structures optimized for their respective access patterns. The specific cache structures are optimized to enable high-level compiler optimizations to aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software-cache overhead in the innermost loop. Performance evaluation indicates that the improvements due to the optimized software-cache structures, combined with the proposed code optimizations, translate into speedup factors of 3.5 to 8.4 compared to a traditional software cache approach. As a result, we demonstrate that the Cell BE processor can be a competitive alternative to a modern server-class multi-core such as the IBM Power5 processor for a set of parallel NAS applications.
2025, Nanoscale
We report the development of physics-based models for resistive random-access memory (RRAM) devices. The models are based on a generalized memristive system framework and can explain the dynamic resistive switching phenomena observed in a broad range of devices. Furthermore, by constructing a simple subcircuit, we can incorporate the device models into standard circuit simulators such as SPICE. The SPICE models can accurately capture the dynamic effects of RRAM devices such as the apparent threshold effect, the voltage dependence of the switching time, and multi-level effects under complex circuit conditions. The device and SPICE models can also be readily expanded to include additional effects related to internal state changes, and will be valuable in the design and simulation of memory and logic circuits based on resistive switching devices.
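A toy model in the spirit of the memristive-system framework can be written from two equations: a conductance that depends on an internal state w, and a strongly nonlinear state equation. The parameters below are illustrative, not fitted to any real RRAM device, but the sinh() nonlinearity reproduces the apparent threshold effect and the voltage dependence of switching time that the abstract mentions.

```python
import math

G_ON, G_OFF = 1e-3, 1e-6   # assumed on/off conductances (siemens)
K, V0 = 1e-2, 0.1          # assumed switching-rate / nonlinearity parameters

def current(v, w):
    # Conductance interpolates between off and on with the state w in [0, 1].
    return (w * G_ON + (1.0 - w) * G_OFF) * v

def step_state(v, w, dt):
    # sinh() keeps dw/dt negligible at low bias: an apparent threshold.
    dw = K * math.sinh(v / V0) * dt
    return min(1.0, max(0.0, w + dw))    # state is clamped to [0, 1]

def simulate(v, t_end=1.0, dt=1e-3):
    """Forward-Euler integration of the state under a constant bias v."""
    w = 0.0
    for _ in range(int(t_end / dt)):
        w = step_state(v, w, dt)
    return w

print(simulate(0.2) < 0.1, simulate(1.5) > 0.9)  # → True True
```

Wrapping `current` and `step_state` in a behavioral subcircuit is what lets such a model run inside SPICE-class simulators alongside ordinary circuit elements.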
2025, Published on Zenodo
This study examines the integration of fundamental physical principles into three-dimensional computational algorithms and analyzes how digital systems simulate physical phenomena from the real world. Through a systematic analysis of next-generation physics engines and their mathematical foundations, this research investigates how Newtonian mechanics, collision dynamics, and force interactions are translated into real-time digital simulations. The review covers a comparative analysis of leading physics engines, an evaluation of numerical integration methods, and an assessment of performance optimization strategies within modern 3D programming environments. The overall findings highlight that effective physics engine simulations require a critical balance between computational accuracy and real-time performance constraints, with significant implications for applications in gaming, engineering simulations, and educational visualization systems.
2025, Proceedings of Eighth International Application Specific Integrated Circuits Conference
This paper describes the design process used in developing a Stream Memory Controller (SMC). The SMC can reorder processor-memory accesses dynamically to increase the effective memory bandwidth for vector operations. A 132-pin ASIC was implemented in static CMOS using a 0.75µm process and has been tested at 36 MHz.
2025, Proceedings Fifth International Symposium on High-Performance Computer Architecture
Processor speeds are increasing rapidly, and memory speeds are not keeping up. Streaming computations (such as multi-media or scientific applications) are among those whose performance is most limited by the memory bottleneck. Rambus hopes to bridge the processor/memory performance gap with a recently introduced DRAM that can deliver up to 1.6Gbytes/sec. We analyze the performance of these interesting new memory devices on the inner loops of streaming computations, both for traditional memory controllers that treat all DRAM transactions as random cacheline accesses, and for controllers augmented with streaming hardware. For our benchmarks, we find that accessing unit-stride streams in cacheline bursts in the natural order of the computation exploits from 44-76% of the peak bandwidth of a memory system composed of a single Direct RDRAM device, and that accessing streams via a streaming mechanism with a simple access ordering scheme can improve performance by factors of 1.18 to 2.25.
2025, IEEE Transactions on Computers
Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes traditional caching schemes effective, they have predictable access patterns. Since most modern DRAM components support modes that make it possible to perform some access sequences faster than others, the predictability of the stream accesses makes it possible to reorder them to get better memory performance. We describe a Stream Memory Controller (SMC) system that combines compile-time detection of streams with execution-time selection of the access order and issue. The SMC effectively prefetches read-streams, buffers write-streams, and reorders the accesses to exploit the existing memory bandwidth as much as possible. Unlike most other hardware prefetching or stream buffer designs, this system does not increase bandwidth requirements. The SMC is practical to implement, using existing compiler technology and requiring only a modest amount of special-purpose hardware. We present simulation results for fast-page mode and Rambus DRAM memory systems, and we describe a prototype system with which we have observed performance improvements for inner loops by factors of 13 over traditional access methods.
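The payoff from access ordering can be illustrated with a toy row-activation count; the row size, the cost model (one activation per row change), and the drain-one-stream-at-a-time policy below are simplifications of mine, not the SMC design itself.

```python
from itertools import groupby

ROW_SIZE = 1024  # assumed DRAM row (page) size in memory words

def row_of(addr):
    return addr // ROW_SIZE

def row_activations(access_order):
    """Count row activations: each change of row costs a slow row open."""
    return sum(1 for _ in groupby(access_order, key=row_of))

def smc_reorder(streams):
    """Rough sketch of the idea: drain each stream's pending accesses a
    row at a time instead of interleaving streams element by element."""
    order = []
    for stream in streams:
        for _, burst in groupby(stream, key=row_of):
            order.extend(burst)
    return order

# Two unit-stride streams living in different DRAM rows.
a = list(range(0, 64))        # all in row 0
b = list(range(4096, 4160))   # all in row 4

naive = [addr for pair in zip(a, b) for addr in pair]  # element-interleaved
print(row_activations(naive), row_activations(smc_reorder([a, b])))  # → 128 2
```

The same set of accesses is issued either way; only the order changes, which is why such a controller can raise effective bandwidth without raising bandwidth requirements.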
2025
Real-time stereo image matching is an important computer vision task, with applications in robotics, driver assistance, surveillance and other domains. The paper describes the architecture and implementation of an FPGA-based stereo image processor that can produce 25 dense depth maps per second from pairs of 8-bit grayscale images. The system uses a modification of a previously-reported variable-window-size method to determine the best match for each image pixel. The adaptation is empirically shown to have negligible impact on the quality of the resulting depth map. The degree of parallelism of the implementation can be adapted to the available resources: increased parallelism enables the processing of larger images at the same frame rate (40 ms per image). The architecture exploits the memory resources available in modern platform FPGAs. Two prototype implementations have been produced and validated. The smaller one can handle pairs of images of size 208 × 480 (on a Virtex-4 LX60 at 100 MHz); the larger one works for images of size 640 × 480 (on a Virtex-5 LX330 at 100 MHz). These results improve on previously-reported ASIC and FPGA-based designs.
2025, International Journal of Communication Networks and Information Security (IJCNIS)
Public-key cryptography algorithms, especially elliptic curve cryptography (ECC) and the elliptic curve digital signature algorithm (ECDSA), have been attracting attention from many researchers in different institutions because these algorithms provide security and high performance when used in many areas such as electronic healthcare, electronic banking, electronic commerce, electronic vehicular systems, and electronic governance. These algorithms heighten security against various attacks and at the same time improve performance, obtaining efficiencies (time, memory, reduced computational complexity, and energy saving) in resource-constrained environments and large systems. This paper presents a detailed and comprehensive survey of updates to the ECDSA algorithm in terms of performance, security, and applications.
2025, 2011 12th European Conference on Radiation and Its Effects on Components and Systems
2025
Array variables are extensively used in many behavioral descriptions, especially for digital and image processing applications. During synthesis, these array variables are implemented with memory modules. In this report, we show that a simple one-to-one mapping between array variables and memory modules leads to inefficient designs. We propose a new algorithm (MeSA) for efficient allocation and mapping of array variables onto memory modules. MeSA computes (a) the number of memory modules required, (b) the size of each module, (c) the number of ports on each module, and (d) the grouping of array variables that map onto each memory module. It also considers the effects of the address translations that are required when two or more array variables are stored in one memory module. While most previous research efforts have concentrated on optimizing scalar variables, the primary focus of this report is deriving efficient storage mechanisms for array variables. We show the efficiency of our technique on some standard benchmarks.
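The grouping subproblem (d) can be sketched with a small greedy heuristic. The conflict model and the heuristic below are mine for illustration, not MeSA itself, and they ignore module sizing, port counts, and the address-translation cost the paper accounts for: arrays accessed in the same control step conflict and should not share a single-port module, while non-conflicting arrays can be packed together.

```python
# Greedy array-to-memory-module grouping: place each array in the first
# existing module where it conflicts with no resident array, else open a
# new module. Conflicts = pairs accessed in the same control step.

def group_arrays(arrays, conflicts):
    modules = []  # each module is a list of array names sharing one memory
    for name in arrays:
        for module in modules:
            if all((name, other) not in conflicts and (other, name) not in conflicts
                   for other in module):
                module.append(name)   # no conflict: share this module
                break
        else:
            modules.append([name])    # conflicts everywhere: new module
    return modules

arrays = ["A", "B", "C", "D"]
# Assumed access schedule: A and B are read in the same step, as are C and D.
conflicts = {("A", "B"), ("C", "D")}
print(group_arrays(arrays, conflicts))  # → [['A', 'C'], ['B', 'D']]
```

Two modules instead of four is exactly the kind of saving over one-to-one mapping that the report argues for.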
2025
This paper presents a hardware solution to the design of general low-density parity-check (LDPC) decoders, which simplifies the delivery network required by the message-passing algorithm. While many designs of LDPC decoders for specific classes of codes exist in the literature, the design of a general LDPC decoder capable of supporting random LDPC codes is still challenging. The method proposed in this paper packs different check node (CN) and variable node (VN) messages in the Tanner graph representation of the LDPC code, and is therefore called message packing. This method takes advantage of the fact that for high-rate LDPC codes the CN degree is much larger than the VN degree, and two distinct methods for delivering the messages to the CNs and VNs are proposed. Using the proposed interconnection network (IN) results in lower-complexity decoding of LDPC codes when compared to other designs.
2025, Lecture Notes in Computer Science
The load-store queue (LQ-SQ) of modern superscalar processors is responsible for keeping the order of memory operations. As the performance gap between processing speed and memory access becomes worse, the capacity requirements for the LQ-SQ increase, and its design becomes a challenge due to its CAM structure. In this paper we propose an efficient load-store queue state filtering mechanism that provides a significant energy reduction (on average 35% in the LSQ and 3.5% in the whole processor), and only incurs a negligible performance loss of less than 0.6%.
2025, Proceedings of the 2006 international symposium on Low power electronics and design - ISLPED '06
2025, Proceedings IEEE International Conference on Application- Specific Systems, Architectures, and Processors
As a 3D scene becomes increasingly complex and the screen resolution increases, the design of an effective memory architecture is one of the most important issues for 3D rendering processors. We propose a pixel rasterization architecture which performs the depth test twice, before and after texture mapping. By performing the depth test before texture mapping, the proposed architecture eliminates the memory bandwidth wasted on fetching unnecessary, obscured texture data. It also reduces the miss penalties of the pixel cache by using a pre-fetch scheme: a frame-memory access due to a cache miss at the first depth test is performed simultaneously with texture mapping. The proposed pixel rasterization architecture uses memory bandwidth effectively and reduces power consumption, producing high performance gains.
2025, Neural Networks
In this paper an effective memory-processor integrated architecture, called the memory-based processor array for artificial neural networks (MPAA), is proposed. The MPAA can be easily integrated into any host system via a memory interface. Specifically, the MPAA system provides an efficient mechanism for local memory accesses on both a row basis and a column basis, using hybrid row and column decoding, which suits the computation model of ANNs, such as the access and alignment patterns of matrix-by-vector operations. Mapping algorithms to implement the multilayer perceptron with backpropagation learning on the MPAA system are also provided. The proposed algorithms support both neuron- and layer-level parallelism, which allows the MPAA system to operate the learning phase as well as the recall phase in a pipelined fashion. A performance evaluation is provided by detailed comparison in terms of two metrics, cost and number of computation steps. The results show that the performance of the proposed architecture and algorithms is superior to those of previous approaches, such as one-dimensional single instruction multiple data (SIMD) arrays, two-dimensional SIMD arrays, systolic ring structures, and hypercube machines.
2025, IEEE Transactions on Circuits and Systems for Video Technology
2025, J. Inf. Hiding Multim. Signal Process.
The vector quantization (VQ) concept is widely used in many applications. Side-match vector quantization (SMVQ) is a VQ-based image compression method that offers a significantly improved compression rate while maintaining the image quality of decompressed images. To eliminate distortion propagation, SMVQ requires one extra bit to serve as an indicator identifying whether a block is encoded by SMVQ or VQ, so that all image blocks can be successfully reconstructed. To eliminate the indicators generated by SMVQ, a reversible data hiding method is adopted to conceal the indicator in the compression code. Experimental results show that the proposed method successfully conceals the indicators in the compression code with visual quality similar to SMVQ. In addition, the experimental results confirm that the proposed method significantly improves the compression rate.
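The side-match selection step underlying SMVQ can be sketched on 2×2 blocks with a toy codebook; real SMVQ uses larger blocks and builds a smaller state codebook from the best side-matching candidates, and the pixel layout here is an assumption of mine. The codeword whose border pixels best agree with the already-decoded upper and left neighbours is chosen, so the decoder can make the same choice without extra bits.

```python
# Toy side-match step: blocks are (top-left, top-right, bottom-left,
# bottom-right) pixel tuples; the codebook is illustrative only.

CODEBOOK = [
    (10, 10, 10, 10),       # flat dark block
    (100, 100, 100, 100),   # flat bright block
    (10, 100, 10, 100),     # vertical dark|bright edge
]

def side_distortion(cw, upper, left):
    """Squared error between the codeword's border pixels and the
    upper block's bottom row / left block's right column."""
    tl, tr, bl, _ = cw
    up_bl, up_br = upper[2], upper[3]   # upper block's bottom row
    lf_tr, lf_br = left[1], left[3]     # left block's right column
    return ((tl - up_bl) ** 2 + (tr - up_br) ** 2
            + (tl - lf_tr) ** 2 + (bl - lf_br) ** 2)

def side_match(upper, left):
    return min(range(len(CODEBOOK)),
               key=lambda i: side_distortion(CODEBOOK[i], upper, left))

upper = (10, 10, 10, 100)   # bottom row of upper block: 10, 100
left = (10, 10, 10, 10)     # right column of left block: 10, 10
print(side_match(upper, left))  # → 2: best matches the vertical edge
```

Because the seams constrain the choice, far fewer bits per block suffice, which is the source of the compression-rate gain the abstract describes.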
2025
Reconfigurable optical interconnect technologies will allow the fabrication of run-time adaptable networks for connecting processors and memory modules in shared-memory multiprocessor machines. Since switching is typically slow compared to the memory access time, reconfiguration exploits low-frequency dynamics in the network traffic patterns. These are, however, not easily captured by tools employing statistical traffic generation, which is commonly used for fast design space exploration. Here, we present a technique that can predict network performance based on actual traffic patterns, but without the need to perform slow full-system simulations for every parameter set of interest. This allows for a quick comparison of different network implementations with good relative accuracy, narrowing down the design space for more detailed examination.
2025
I declare that this written submission represents my ideas in my own words and where others' ideas or words have been included, I have adequately cited and referenced the original sources. I also declare that I have adhered to all principles of academic honesty and integrity and have not misrepresented or fabricated or falsified any idea/data/fact/source in my submission. I understand that any violation of the above will be cause for disciplinary action by the Institute and can also evoke penal action from the sources which have thus not been properly cited or from whom proper permission has not been taken when needed.
2025, Integrated Computer-Aided Engineering
The memory required for the implementation of the 2D wavelet transform typically incurs relatively high power consumption and limits speed performance. In this paper we propose an optimized architecture for the 1D/2D wavelet transform that reduces the memory size cost by one order of magnitude compared to classical implementation styles. This so-called Local Wavelet Transform also minimizes the memory access cost, thanks to its spatially localized processing. Furthermore, the proposed architecture introduces concurrency in the data transfer mechanism, resulting in speed performance that is not limited by data transfer delays to/from main (off-chip) memory. Finally, the production of parent-children trees in indivisible clusters makes easy interfacing to Zero-Tree encoder modules possible, while keeping Region-of-Interest functionality. Practical implementations of the 1D and 2D Local Wavelet Transform with up to 9/7-tap wavelet filters and a large number of levels (e.g. 4 or 5) can process 10 Msamples/s, with an internal processing clock of 40 MHz, in a very modest 0.7 µm CMOS process.
2025, IEEE Transactions on Computers
Memory bandwidth is rapidly becoming the limiting performance factor for many applications, particularly for streaming computations such as scientific vector processing or multimedia (de)compression. Although these computations lack the temporal locality of reference that makes traditional caching schemes effective, they have predictable access patterns. Since most modern DRAM components support modes that make it possible to perform some access sequences faster than others, the predictability of the stream accesses makes it possible to reorder them to get better memory performance. We describe a Stream Memory Controller (SMC) system that combines compile-time detection of streams with execution-time selection of the access order and issue. The SMC effectively prefetches read-streams, buffers write-streams, and reorders the accesses to exploit the existing memory bandwidth as much as possible. Unlike most other hardware prefetching or stream buffer designs, this system does not increase bandwidth requirements. The SMC is practical to implement, using existing compiler technology and requiring only a modest amount of special-purpose hardware. We present simulation results for fast-page mode and Rambus DRAM memory systems, and we describe a prototype system with which we have observed performance improvements for inner loops by factors of 13 over traditional access methods.
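The access-reordering idea behind the SMC can be shown with a toy model: group stream accesses so that requests hitting the same DRAM row are issued together, exploiting fast-page mode. The row size, address layout, and the "count row activations" cost model below are illustrative assumptions, not the SMC's actual parameters.

```python
ROW_SIZE = 2048  # bytes per DRAM row (assumed for illustration)

def reorder_by_row(addresses):
    """Group accesses by DRAM row, preserving order within each row."""
    rows = {}
    for addr in addresses:
        rows.setdefault(addr // ROW_SIZE, []).append(addr)
    ordered = []
    for row in sorted(rows):       # issue all accesses to one row at a time
        ordered.extend(rows[row])
    return ordered

def row_activations(addresses):
    """Count row-buffer misses: each change of row forces a new activation."""
    count, last = 0, None
    for addr in addresses:
        row = addr // ROW_SIZE
        if row != last:
            count += 1
            last = row
    return count

# Two interleaved read-streams, as issued naively by vector code a[i] + b[i]:
a = [0x0000 + 8 * i for i in range(4)]   # stream a lives in row 0
b = [0x8000 + 8 * i for i in range(4)]   # stream b lives in row 16
interleaved = [x for pair in zip(a, b) for x in pair]
print(row_activations(interleaved))                   # naive order: 8
print(row_activations(reorder_by_row(interleaved)))   # reordered: 2
```

The naive interleaving thrashes the row buffer on every access; reordering serves each stream in a burst, which is the kind of gain the SMC extracts at run time without raising bandwidth requirements.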
2025, Journal IJETRM
This project focuses on the design and simulation of Random Access Memory (RAM) and Read-Only Memory (ROM) modules using Verilog HDL. The RAM module supports read and write operations, while the ROM module allows only read operations with pre-initialized data. Both memory modules are designed to accommodate an 8-bit data width and up to 8 memory locations. A comprehensive testbench was developed to verify the functionality of the RAM and ROM designs. The testbench simulates memory operations, including address selection, data read/write, and output validation. Simulation waveforms were analyzed to confirm the correct behavior of the memory modules. This project was implemented on EDA Playground, utilizing its simulation environment to debug and verify the design. The work provides a foundational understanding of digital memory design, Verilog HDL programming and simulation methods, with the potential for future expansion and integration into more advanced systems, including FPGA-based applications. The project successfully shows how RAM and ROM components can be efficiently modeled, tested, and validated using hardware description languages, contributing to the development of robust and scalable digital systems.
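Since the Verilog source is not reproduced here, the following Python behavioral model sketches the same 8-location, 8-bit RAM and ROM with a minimal "testbench". The widths follow the abstract; everything else (contents, method names) is illustrative.

```python
class RAM:
    """8x8-bit RAM: supports both read and write operations."""
    def __init__(self, depth=8, width=8):
        self.mem = [0] * depth
        self.mask = (1 << width) - 1   # truncate data to the bus width

    def write(self, addr, data):
        self.mem[addr] = data & self.mask

    def read(self, addr):
        return self.mem[addr]

class ROM:
    """8x8-bit ROM: pre-initialized contents, read-only access."""
    def __init__(self, contents):
        self.mem = list(contents)

    def read(self, addr):
        return self.mem[addr]

# Testbench: write then read back every RAM location, spot-check the ROM.
ram = RAM()
for addr in range(8):
    ram.write(addr, addr * 3)
assert all(ram.read(a) == a * 3 for a in range(8))

rom = ROM([0x10, 0x20, 0x30, 0x40, 0x50, 0x60, 0x70, 0x80])
assert rom.read(2) == 0x30
print("all checks passed")
```

The Verilog testbench in the project performs the same write/read-back and ROM-lookup checks, observed on simulation waveforms rather than with assertions.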
2025
Efficient management of concurrent access to shared resources is crucial in modern multi-threaded systems to avoid race conditions and performance bottlenecks. Traditional locking mechanisms, such as standard read-write locks, often introduce substantial overhead in read-heavy workloads due to their blocking nature. To address these challenges, we introduce the LRW lock: a lightweight read-write lock. It allows concurrent read access and ensures exclusive write access, leveraging atomic operations to track active readers and writers efficiently. This paper first presents algorithms to acquire read and write locks using a locking object of the LRW lock. It then provides the design of non-blocking methods tryReadLock() and tryWriteLock() for read and write operations, which offer flexibility for time-sensitive applications. To evaluate the efficiency of the LRW lock, we consider different concurrent data structures and a state-of-the-art locking object. Experimental results show that the implementation using the LRW lock outperforms the state-of-the-art locking object and has a smaller memory footprint.
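A minimal sketch of a lightweight read-write lock in the spirit of the LRW lock: one shared state word holds either the active-reader count or a writer flag. Python exposes no user-level compare-and-swap, so a short internal mutex stands in for the paper's atomic operations; the method names mirror tryReadLock()/tryWriteLock() from the abstract, but the implementation details are assumptions, not the paper's algorithm.

```python
import threading

class LRWLock:
    def __init__(self):
        self._guard = threading.Lock()   # stand-in for an atomic CAS
        self._state = 0  # >0: active readers, -1: active writer, 0: free

    def try_read_lock(self):
        """Non-blocking read acquire: succeeds unless a writer is active."""
        with self._guard:
            if self._state >= 0:
                self._state += 1
                return True
            return False

    def try_write_lock(self):
        """Non-blocking write acquire: succeeds only when the lock is free."""
        with self._guard:
            if self._state == 0:
                self._state = -1
                return True
            return False

    def read_unlock(self):
        with self._guard:
            self._state -= 1

    def write_unlock(self):
        with self._guard:
            self._state = 0

lock = LRWLock()
assert lock.try_read_lock()        # first reader enters
assert lock.try_read_lock()        # readers proceed concurrently
assert not lock.try_write_lock()   # writer excluded while readers hold it
lock.read_unlock(); lock.read_unlock()
assert lock.try_write_lock()       # exclusive write access once free
print("ok")
```

The non-blocking try-methods return immediately instead of waiting, which is what makes them suitable for the time-sensitive applications the paper targets.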
2025
We extend the offline memory correctness checking scheme presented by Blum et al. [BEG+91] to develop an offline checker that can detect attacks by active adversaries. We introduce the concept of incremental multiset hashes and detail one example, MSet-XOR MAC, which uses a secret key and is efficient, as updating the hash costs only a few hash and XOR operations. Using multiset hashes as our underlying cryptographic tool, we introduce a primitive, bag integrity checking, to explain offline integrity checking; we demonstrate how this primitive can be used to build cryptographically secure integrity checking schemes for random access memories and disks. Recent papers describe processors, file systems, and databases in which hash trees are used to verify the integrity of data in untrusted storage. Checkers using hash trees are referred to as online checkers, as the trees are used to check, after each operation, whether the storage behaved correctly. The offline checker we describe is designed for checking sequences of operations on an untrusted storage and, for some applications, performs better and uses less space than a checker using a hash tree. In this paper, we also introduce a hybrid checker, which can capture the best of both the online and offline schemes. The hybrid checker can operate mainly as an online checker when integrity checks need to be performed frequently, and as an offline checker when checks can be performed less frequently. The performance of the checker is expected to be close to the better scheme for every checking period.
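A simplified illustration of the incremental XOR multiset hash idea: each logged operation (address, value, timestamp) is mapped through a keyed PRF (HMAC here) and folded in with XOR, so the hash is updatable in O(1) per operation and insensitive to order. Timestamps make entries distinct, sidestepping XOR's duplicate-cancellation. This is a sketch of the underlying idea, not the paper's exact MSet-XOR MAC construction; the key and the trace are made up.

```python
import hmac, hashlib

KEY = b"secret-key"  # assumed shared secret between checker and processor

def prf(entry):
    # Keyed pseudorandom function over a (addr, value, time) tuple.
    return int.from_bytes(
        hmac.new(KEY, repr(entry).encode(), hashlib.sha256).digest(), "big")

class MultisetHash:
    def __init__(self):
        self.acc = 0
        self.count = 0

    def add(self, addr, value, time):
        self.acc ^= prf((addr, value, time))  # O(1) incremental update
        self.count += 1

    def __eq__(self, other):
        return self.acc == other.acc and self.count == other.count

# Offline check: the multiset of (addr, value, time) written must equal the
# multiset later read back. Here the trace is trivially correct, so the two
# hashes match regardless of the order in which entries were folded in.
writes, reads = MultisetHash(), MultisetHash()
writes.add(addr=0, value=42, time=1)
writes.add(addr=1, value=7, time=2)
reads.add(addr=1, value=7, time=2)
reads.add(addr=0, value=42, time=1)
print(writes == reads)  # True
```

Because only two small hash accumulators are kept, the checker's space cost is constant, which is the advantage over a hash tree when checks are infrequent.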
2025, IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Power is becoming a critical constraint for designing embedded applications. Current power analysis techniques based on circuit-level or architectural-level simulation are either impractical or inaccurate for estimating the power cost of a given piece of application software. In this paper, an instruction-level power analysis model is developed for an embedded DSP processor based on physical current measurements. Significant points of difference have been observed between the software power model for this custom DSP processor and the power models that have been developed earlier for some general-purpose commercial microprocessors [1, 2]. In particular, the effect of circuit state on the power cost of an instruction stream is more marked in the case of this DSP processor. In addition, the processor has special architectural features that allow dual-memory accesses and packing of instructions into pairs. The energy reduction possible through the use of these features is studied. The on-chip Booth multiplier on the processor is a major source of energy consumption for DSP programs. A microarchitectural power model for the multiplier is developed and analyzed for further power minimization. In order to exploit all of the above effects, a scheduling technique based on the new instruction-level power model is proposed. Several example programs are provided to illustrate the effectiveness of this approach. Energy reductions varying from 26% to 73% have been observed. These energy savings are real and have been verified through physical measurement. It should be noted that the energy reduction essentially comes for free: it is obtained through software modification, and thus entails no hardware overhead. In addition, there is no loss of performance, since the running times of the modified programs either improve or remain unchanged.
2025
The memory consistency model of a shared-memory system determines the order in which memory accesses can be executed by the system, and greatly affects the implementation and performance of the system. To aid system designers, memory models either directly specify, or are accompanied by, a set of low-level system conditions that can be easily translated into a correct implementation. These sufficient conditions play a key role in helping the designer determine the architecture and compiler optimizations that may be safely exploited under a specific model. Therefore, these conditions should obey three important properties. First, they should be unambiguous. Second, they should be feasibly aggressive; i.e., they should not prohibit practical optimizations that do not violate the semantics of the model. Third, it should be relatively straightforward to convert the conditions into efficient implementations and, conversely, to verify whether an implementation obeys the conditions. Most previous approaches to specifying system requirements for a model are lacking in at least one of the above aspects. This paper presents a methodology for specifying the system conditions for a memory model that satisfies the above goals. A key attribute of our methodology is the exclusion of ordering constraints among memory operations to different locations, based on the observation that such constraints are unnecessary for maintaining the semantics of a model. To demonstrate the flexibility of our approach, we specify the conditions for several proposed memory models within this framework. Compared to the original specification for each model, the new specification allows more optimizations without violating the original semantics and, in many cases, is more precise.
2025, Proceedings of the 1991 …
The memory consistency model supported by a multiprocessor directly affects its performance. Thus, several attempts have been made to relax the consistency models to allow for more buffering and pipelining of memory accesses. Unfortunately, the potential increase in performance afforded by relaxing the consistency model is accompanied by a more complex programming model. This paper introduces two general implementation techniques that provide higher performance for all the models. The first technique involves prefetching values for accesses that are delayed due to consistency model constraints. The second technique employs speculative execution to allow the processor to proceed even though the consistency model requires the memory accesses to be delayed. When combined, the above techniques alleviate the limitations imposed by a consistency model on buffering and pipelining of memory accesses, thus significantly reducing the impact of the memory consistency model on performance.
2025, International Journal of Computer Science and Information Security (IJCSIS), Vol. 23, No. 2, March-April
This paper presents an overview of upcoming non-volatile memories (NVMs). Non-volatile memory devices are electrically programmable and erasable to store charge in a location within the device, and retain that charge when the voltage supply to the device is disconnected. A non-volatile memory is typically a semiconductor memory comprising thousands of individual transistors configured on a substrate to form a matrix of rows and columns of memory cells. Non-volatile memories are used in digital computing devices for the storage of data. In this paper we give an introduction and a brief survey of upcoming NVMs such as FeRAM, MRAM, CBRAM, PRAM, SONOS, RRAM, racetrack memory, and NRAM. In the future, non-volatile memory may eliminate the need for comparatively slow forms of secondary storage, such as hard disks.
2025, IEEE Transactions on Nuclear Science
Multiple Cell Upsets (MCUs) are becoming a growing concern with the advent of the newest FPGA devices. In this paper we present a methodology suitable for analyzing the sensitivity of circuits implemented in SRAM-based FPGAs that adopt the TMR mitigation scheme. Data about the layout of the adopted FPGA are obtained by means of laser testing. A static analysis algorithm then uses the collected data to predict the impact of MCUs on designs implemented on SRAM-based FPGAs. Thanks to this approach, only MCUs affecting physically adjacent cells are considered. We report data focusing on a Virtex-II device, showing the capabilities of the proposed method.
2025
Providing the possibility of installing extensions has become a must-have feature for all major browsers. Extensions allow users to enhance and customise the browser functionalities by, for example, modifying the appearance of the web pages, providing security suites or blocking ads. In this work, we make a first step towards monitoring web content alterations coming from extensions. In particular, we focus on the identification of relations between the mutations performed by different extensions. The study is motivated by the sequential and event-driven execution model running on web pages. That model entails that browser extensions can react to web content alterations performed by other extensions; hence, extensions have access to the data introduced by other extensions. We implement our prototype as a couple of logging extensions running on a modified version of Chromium. The approach relies on dynamic analysis of extensions and a simulation of a user surfing the web. Our system ...
2025, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium
Currently, several of the high performance processors used in a PC cluster have a DVS (Dynamic Voltage Scaling) architecture that can dynamically scale processor voltage and frequency. Adaptive scheduling of the voltage and frequency enables us to reduce power dissipation without a performance slowdown during communication and memory access. In this paper, we propose a method of profile-based power-performance optimization by DVS scheduling in a high-performance PC cluster. We divide the program execution into several regions and select the best gear for power efficiency. Selecting the best gear is not straightforward, since the overhead of a DVS transition is not free. We propose an optimization algorithm that selects a gear using the execution and power profile, taking the transition overhead into account. We have designed and built a power-profiling system, PowerWatch. With this system we examined the effectiveness of our optimization algorithm on two types of power-scalable clusters (Crusoe and Turion). According to the results of benchmark tests, we achieved almost 40% reduction in terms of EDP (energy-delay product) without performance impact (less than 5%) compared to results using the standard clock frequency.
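The gear-selection step can be sketched as a small per-region dynamic program that charges a switching cost whenever consecutive regions run at different gears. The gear table, overhead constants, and the greedy handling of the nonlinear EDP objective below are illustrative assumptions; this is a heuristic sketch, not the paper's optimization algorithm.

```python
GEARS = {               # gear -> (relative speed, relative power), assumed
    "low":  (0.5, 0.25),
    "mid":  (0.8, 0.60),
    "high": (1.0, 1.00),
}
TRANS_TIME, TRANS_ENERGY = 0.05, 0.02   # cost of one gear switch (assumed)

def region_cost(work, gear):
    speed, power = GEARS[gear]
    t = work / speed
    return t, power * t                 # (time, energy) for this region

def best_schedule(regions):
    """regions: list of work amounts. Returns (EDP, gear per region)."""
    # For each gear, keep the best-so-far (time, energy, gear path) ending
    # in that gear; EDP is nonlinear, so this pruning is heuristic.
    states = {g: (0.0, 0.0, []) for g in GEARS}
    for work in regions:
        new = {}
        for g in GEARS:
            best = None
            for prev, (t, e, path) in states.items():
                dt, de = region_cost(work, g)
                if path and prev != g:          # pay the switch overhead
                    dt += TRANS_TIME
                    de += TRANS_ENERGY
                cand = (t + dt, e + de, path + [g])
                if best is None or cand[0] * cand[1] < best[0] * best[1]:
                    best = cand
            new[g] = best
        states = new
    return min((t * e, path) for t, e, path in states.values())

edp, gears = best_schedule([1.0, 0.2, 1.0])  # three profiled regions
print(gears)
```

Because the switch overhead is charged explicitly, a short low-activity region between two compute regions may stay at the surrounding gear rather than trigger two transitions, which is exactly the trade-off the abstract describes.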
2025, Proceedings. 15th IEEE International Workshop on Rapid System Prototyping, 2004.
Growing demand for mobility and multimedia capacity in handheld devices translates into very complex systems where time-to-market development is critical. High data rates and improved system capacity for IP-based services increase the control and dynamic-scheduling demands on already complex modem technologies. This paper describes the prototyping of a UMTS terminal subsystem, which deals with several physical layer tasks such as channel multiplexing and interleaving. Our modeling strategy relies on a transaction-level platform methodology built on top of the SystemC language. The system-on-chip dynamics are modeled with behavioral modules, generic bus transactions, and memory accesses. Several examples demonstrate that such prototyping makes it possible to validate highly dynamic systems; this reduces system-on-chip development costs, as it is recognized that about 70% of them today lie in verification. This prototyping also offers a large potential for fast architecture exploration, with the capacity to dimension the memory components or evaluate bus contention.
2025, IEEE Transactions on Magnetics
Sb80Te20 has the merit of high crystallization speed, yet its low crystallization temperature (~132 °C) makes it unsuitable for use as a medium for phase change random access memory (PCRAM). We proposed adding refractory metals to solve this problem. It was found that W-added Sb80Te20 shows a crystallization temperature increased up to 233 °C with increasing W content. More importantly, the melting temperature of W-Sb-Te materials, 536-539 °C irrespective of W content, is more than 80 °C lower than that of Ge2Sb2Te5. These materials show a two-to-four-order-of-magnitude drop in resistance during the phase change from an amorphous to a crystalline state. With these promising properties, the composition Sb80Te17W3 is recommended as a potential candidate for PCRAM.
2025, ACM Transactions on Storage
This article presents a new Fast Hash-based File Existence Checking (FHFEC) method for archiving systems. During the archiving process, there are many submissions which are actually unchanged files that do not need to be re-archived. In this system, instead of comparing the entire files, only digests of the files are compared. Strong cryptographic hash functions with a low probability of collision can be used as digests. We propose a fast algorithm to check if a certain hash, that is, a corresponding file, is already stored in the system. The algorithm is based on dividing the whole domain of hashes into equally sized regions, and on the existence of a pointer array, which has exactly one pointer for each region. Each pointer points to the location of the first stored hash from the corresponding region and has a null value if no hash from that region exists. The entire structure can be stored in random access memory or, alternatively, on a dedicated hard disk. A statistical performa...
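The FHFEC lookup structure can be sketched directly: stored digests are kept sorted, and a pointer array with one entry per fixed-size region of the hash domain points at the first stored digest of that region (null if the region is empty), so a lookup scans only its own region. The region count and the 32-bit truncated digests below are illustrative choices, not the paper's parameters.

```python
import bisect
import hashlib

REGIONS = 16
DIGEST_BITS = 32                       # truncated for the example
WIDTH = (1 << DIGEST_BITS) // REGIONS  # equally sized regions

def digest(data):
    # Truncated SHA-256 standing in for a full cryptographic digest.
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

class FHFEC:
    def __init__(self, hashes):
        self.hashes = sorted(hashes)
        self.ptr = [None] * REGIONS    # one pointer per region
        for i in range(REGIONS):
            j = bisect.bisect_left(self.hashes, i * WIDTH)
            if j < len(self.hashes) and self.hashes[j] < (i + 1) * WIDTH:
                self.ptr[i] = j        # first stored hash of region i

    def contains(self, h):
        start = self.ptr[h // WIDTH]
        if start is None:
            return False               # empty region: file not archived
        for k in range(start, len(self.hashes)):
            if self.hashes[k] == h:
                return True
            if self.hashes[k] > h:     # sorted: passed where h would sit
                return False
        return False

stored = [digest(f"file-{i}".encode()) for i in range(100)]
checker = FHFEC(stored)
print(checker.contains(digest(b"file-7")))  # True: already archived
```

A submission whose digest is found can be skipped instead of re-archived; the pointer array confines each probe to one small region of the sorted digest store, whether that store lives in RAM or on a dedicated disk.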
2025
Today's largest supercomputers consist of tens of thousands of nodes equipped with one or more multi-core microprocessors. A challenge for performance tools is that bottlenecks in programs executing on these systems may arise from a myriad of causes. To address this problem, Rice University is developing HPCToolkit, an integrated suite of tools that supports sampling-based measurement, analysis, attribution, and presentation of application performance for fully optimized parallel programs. This paper provides a brief overview of performance analysis challenges on supercomputers with node-level parallelism, describes how HPCToolkit supports a variety of performance analysis strategies that can pinpoint and quantify impediments to scalable high performance in parallel applications both within and across nodes, and outlines some remaining challenges ahead.
2025, Journal of Digital Imaging
To develop a personal computer (PC)-based software package that allows portability of the electronic imaging record. To create custom software that enhances the transfer of images in two fashions: first, to an end user, whether physician or patient, by providing a browser capable of viewing digital images on a conventional personal computer; second, by providing the ability to transfer archived Digital Imaging and Communications in Medicine (DICOM) images to other institutional picture archiving and communications systems (PACS) through a transfer engine. Method/materials: Radiologic studies are provided on a CD-ROM. This CD-ROM contains a copy of the browser to view images, a DICOM-based engine to transfer images to the receiving institutional PACS, and copies of all pertinent imaging studies for the particular patient. The host computer system is an Intel-based Pentium 90 MHz PC with Microsoft Windows 95 software (Microsoft Inc, Seattle, WA). The system has 48 MB of random access memory, a 3.0 GB hard disk, and a Smart and Friendly CD-R 2006 CD-ROM recorder (Smart and Friendly Inc, Chatsworth, CA). Results: Each CD-ROM disc can hold 640 MB of data. In our experience, based on Table , this holds anywhere from 12 to 30 computed tomography (CT) examinations, 24 to 80 magnetic resonance (MR) examinations, 60 to 128 ultrasound examinations, 32 to 64 computed radiographic examinations, 80 digitized x-rays, or five digitized mammography examinations. We have been able to successfully transfer DICOM images from one DICOM-based PACS to another by inserting the created CD-ROM into a CD drive attached to the receiving PACS and running the transfer engine application. Conclusions: Providing patients with copies of their radiologic studies is a necessity in every radiology department. Conventionally, film libraries have provided copies to the patient, generating costs from loss of film as well as mailing costs.
This software package reduces costs and loss of studies, and improves patient care by enabling the patient to maintain an archive of the electronic imaging record.
2025, Sharif Journal of Civil Engineering (SJCE)
Three-stage automatic operational modal analysis using mathematical mode elimination by density-based clustering method Document Type : Article Authors A. Salar Mehrabad 1 A. Shooshtari 2 1 PhD Student, Engineering Faculty, Ferdowsi...