High Performance Computing Research Papers
2025
Distributed-memory systems are key to achieving high-performance computing and are among the most favorable architectures used in advanced research problems. Mesh-connected multicomputers are one of the most popular architectures and have been implemented in many distributed-memory systems. These systems must support communication operations efficiently to achieve good performance. The wormhole switching technique, in which a packet is divided into small flits, has been widely used in the design of distributed-memory systems. Multicast communication, in which one source node sends the same message to several destination nodes, is also widely used in such systems. Fault tolerance refers to the ability of a system to operate correctly in the presence of faults. The development of fault-tolerant multicast routing algorithms in 2D mesh networks is therefore an important issue. This dissertation presents new fault-tolerant multicast routing algorithms for distributed-memory systems perform...
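As a concrete reference point, the sketch below shows plain XY (dimension-ordered) routing on a 2D mesh, the standard deadlock-free baseline on top of which multicast trees and fault-tolerant variants like those in the dissertation are built. It is an illustrative sketch, not the dissertation's algorithms, and it omits fault handling entirely.

```python
# Illustrative XY (dimension-ordered) routing on a 2D mesh: the usual
# deadlock-free baseline on top of which multicast trees are built.
# Generic sketch, not the dissertation's fault-tolerant algorithms.

def xy_route(src, dst):
    """Return the hop-by-hop path from src to dst, X dimension first."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                     # correct the X coordinate first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                     # then correct the Y coordinate
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

def unicast_multicast(src, destinations):
    """Naive multicast: one XY unicast per destination node."""
    return {dst: xy_route(src, dst) for dst in destinations}

print(unicast_multicast((0, 0), [(2, 1), (1, 3)]))
```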
2025, Lecture Notes in Computer Science
Cloud Computing is emerging today as a commercial infrastructure that eliminates the need for maintaining expensive computing hardware. Through the use of virtualization, clouds promise to address, with the same shared set of physical resources, a large user base with different needs. Thus, clouds promise to be an alternative for scientists to clusters, grids, and supercomputers. However, virtualization may induce significant performance penalties for demanding scientific computing workloads. In this work we present an evaluation of the usefulness of current cloud computing services for scientific computing. We analyze the performance of the Amazon EC2 platform using micro-benchmarks and kernels. While clouds are still changing, our results indicate that current cloud services need an order of magnitude in performance improvement to be useful to the scientific community.
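For intuition, here is a minimal micro-benchmark harness in the spirit of the evaluation: it times a dense matrix multiply and reports GFLOP/s. The kernel is an assumed illustrative example, not the paper's actual benchmark suite.

```python
# Minimal micro-benchmark harness: time a dense GEMM, report GFLOP/s.
# Illustrative only; the paper's suite is not reproduced here.
import time
import numpy as np

def gemm_gflops(n=1024, repeats=5):
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b                           # the timed kernel
        best = min(best, time.perf_counter() - t0)
    flops = 2.0 * n**3                  # multiply-adds in an n x n GEMM
    return flops / best / 1e9

print(f"{gemm_gflops():.1f} GFLOP/s")
```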
2025, Companion of the 2023 ACM/SPEC International Conference on Performance Engineering
We explore the potential of the Graph-Massivizer project funded by the Horizon Europe research and innovation program of the European Union to boost the impact of extreme and sustainable graph processing for mitigating existing urgent societal challenges. Current graph processing platforms do not support diverse workloads, models, languages, and algebraic frameworks. Existing specialized platforms are difficult to use by non-experts and suffer from limited portability and interoperability, leading to redundant efforts and inefficient resource and energy consumption due to vendor and even platform lock-in. While synthetic data emerged as an invaluable resource overshadowing actual data for developing robust artificial intelligence analytics, graph generation remains a challenge due to extreme dimensionality and complexity. On the European scale, this practice is unsustainable and, thus, threatens the possibility of creating a climate-neutral and sustainable economy based on graph data. Making graph processing sustainable is essential but needs credible evidence. The grand vision of the Graph-Massivizer project is a technological solution, coupled with field experiments and experience-sharing, for a high-performance and sustainable graph processing of extreme data with a proper response for any need and organizational size by 2030.
2025, Ultrascale Computing Systems
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
2025, 2009 First Asian Himalayas International Conference on Internet
We have developed a library for the simulation of DC traction systems. The modeling, installation, and reliability aspects of a traction system, in view of current developments in HPC and virtual reality (VR), can be implemented on the proposed protocol. We have proposed a model to predict and analyze traction systems more comprehensively and have put forward a modeling structure for it. The distribution of the various computations across an HPC setup is shown and has been implemented. The advantages of virtual reality for installation in city and intercity settings, and for driving, have been implemented in C#. A VR framework for the installation of DC traction has thus been implemented with the current technologies being developed in the areas of HPC and VR.
2025, 2009 6th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology
The essence of high-performance computing (HPC) in the field of nanotechnology, and the problems encountered in applying HPC to nano-enabled calculations, are presented in this paper. A proposal to optimize computations in an HPC setup and to distribute work across clusters has been formulated to make nanotechnology computations more effective and realistic on a Windows Cluster Server based framework. Results and findings in the expected setup, and the computational complexity its implementation will require, are presented, together with an algorithm that takes advantage of the powerful built-in parallelization and distribution capabilities of Windows Server 2003 Compute Cluster Edition, making large-scale simulation possible. Four nodes were connected with the help of Microsoft Compute Cluster Server 2003 (MCCS 2003), and algorithms were constructed in C# using the Visual Studio IDE. In addition to the .NET Framework, the Extreme Optimization Numerical Library for .NET was used for performing high-speed mathematical calculations. The MPI .NET library was employed to build parallel algorithms and to break computations into small tasks. Microsoft's implementation of the Message Passing Interface (MPI) included in MCCS was used for running computation application tests. The use of HPC in measuring the reliability of nanotechnology-based devices, and in computing certain complex nanotechnology techniques, is presented, with a significant improvement in performance compared to the previous work, which was implemented using the Distributed Computing Toolbox in MATLAB. Besides its use in large-scale computations, C# also offers more control over the programming, runtime, and execution of the application. A description of progress in this area of research, future work, and an extended approach in the same field is given.
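The task-splitting pattern described (scatter a large computation across nodes with MPI, combine partial results) looks roughly like the sketch below, written with Python's mpi4py as a stand-in for the paper's MPI .NET/C# stack; the array and the reduction are illustrative placeholders, not the authors' code.

```python
# Scatter/reduce task-splitting sketch with mpi4py (a substitution for
# the paper's MPI .NET/C# setup). Run with: mpiexec -n 4 python nano_tasks.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 1_000_000                              # assumes size divides N evenly
data = np.arange(N, dtype=np.float64) if rank == 0 else None
chunk = np.empty(N // size, dtype=np.float64)
comm.Scatter(data, chunk, root=0)          # root splits work across nodes

local = np.sum(chunk * chunk)              # each node's partial result
total = comm.reduce(local, op=MPI.SUM, root=0)
if rank == 0:
    print("sum of squares:", total)
```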
2025, Proceedings of The International Symposium on Grids and Clouds and the Open Grid Forum — PoS(ISGC 2011 & OGF 31)
2025, Computing in Science & Engineering
Although the scientific computing community has questioned graphics processing unit (GPU) efficiency when it comes to energy per operation, the latest Green500 list (www.green500.org) should put these concerns to rest. The Green500 sorts systems on the Top500 list (www.top500.org) according to their power efficiency; on the latest list, released in November 2010, the number two and three systems were GPU-based. Here, we introduce the system that ranked at number three: EcoG, a 128-node GPU-accelerated cluster computer deployed at the US National Center for Supercomputing Applications (NCSA). EcoG was built through a joint effort by Nvidia, the
2025, Mathematics
Remora Optimization Algorithm (ROA) is a recent population-based algorithm that mimics the intelligent traveling behavior of the remora. However, the performance of ROA is barely satisfactory: it may become stuck in locally optimal regions or converge slowly, especially on high-dimensional, complicated problems. To overcome these limitations, this paper develops an improved version of ROA, called Enhanced ROA (EROA), using three different techniques: adaptive dynamic probability, SFO with Levy flight, and a restart strategy. The performance of EROA is tested using two different benchmarks and seven real-world engineering problems. The statistical analysis and experimental results show the efficiency of EROA.
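Of the three techniques, the Levy flight is the most self-contained. Below is a generic sketch using Mantegna's algorithm, the usual way such heavy-tailed steps are generated; it illustrates the technique in general, not the paper's exact operator.

```python
# Mantegna's algorithm for Levy-flight steps: heavy-tailed random jumps
# that occasionally take large strides, helping escape local optima.
# Generic sketch of the technique, not EROA's exact operator.
import math
import numpy as np

def levy_step(dim, beta=1.5, rng=np.random.default_rng()):
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2) /
             (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
            ) ** (1 / beta)
    u = rng.normal(0.0, sigma, dim)      # numerator: N(0, sigma^2)
    v = rng.normal(0.0, 1.0, dim)        # denominator: N(0, 1)
    return u / np.abs(v) ** (1 / beta)   # mostly small steps, rare long jumps

print(levy_step(5))
```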
2025, Computer Modeling in Engineering & Sciences
Many complex optimization problems in the real world can easily fall into local optima and fail to find the optimal solution, so new techniques and methods are needed to solve such challenges. Metaheuristic algorithms have received a lot of attention in recent years because of their efficient performance and simple structure. The Sine Cosine Algorithm (SCA) is a recent metaheuristic based on the two trigonometric functions sine and cosine. However, like other metaheuristic algorithms, SCA converges slowly and may get trapped in sub-optimal regions. In this study, an enhanced version of SCA named RDSCA is suggested that depends on two techniques: random spare/replacement and a double adaptive weight. The first technique is employed to speed up convergence, whereas the second enhances the exploratory searching capability. To evaluate RDSCA, 30 functions from CEC 2017 and 4 real-world engineering problems are used. Moreover, a nonparametric test, the Wilcoxon signed-rank test, is carried out at the 5% level to evaluate the significance of the results obtained by RDSCA against 5 other SCA variants. The results show that RDSCA is competitive with other metaheuristic algorithms.
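For context, the baseline SCA position update that RDSCA builds on can be sketched as follows; the paper's random-spare and double-adaptive-weight additions are omitted, so this is the standard algorithm, not RDSCA itself.

```python
# Core Sine Cosine Algorithm position update: each coordinate moves
# toward/around the best solution P along a sine or cosine trajectory,
# with exploration shrinking linearly over time. Baseline sketch only.
import numpy as np

def sca_step(X, P, t, T, a=2.0, rng=np.random.default_rng()):
    n, d = X.shape
    r1 = a - t * a / T                      # shrinks exploration over time
    r2 = rng.uniform(0, 2 * np.pi, (n, d))  # trajectory phase
    r3 = rng.uniform(0, 2, (n, d))          # weight on the best solution
    r4 = rng.uniform(0, 1, (n, d))          # sine-or-cosine switch
    dist = np.abs(r3 * P - X)
    move = np.where(r4 < 0.5, np.sin(r2), np.cos(r2))
    return X + r1 * move * dist

X = np.random.rand(10, 5)                   # population of 10, 5 dimensions
P = X[0]                                    # pretend best-so-far solution
print(sca_step(X, P, t=1, T=100).shape)
```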
2025, The Journal of Supercomputing
Over the past years, researchers have turned their attention to proposing optoelectronic architectures, including optical transpose interconnection system (OTIS) networks. On the other hand, limited attempts have been devoted to designing parallel algorithms for applications that could be mapped onto such optoelectronic architectures. Thus, exploiting the attractive features of OTIS networks and investigating their performance in solving combinatorial optimization problems becomes a great necessity. In this paper, a parallel repetitive nearest neighbor algorithm for solving the symmetric traveling salesman problem on OTIS-Hypercube and OTIS-Mesh optoelectronic architectures is presented. This algorithm has been evaluated analytically and by simulation on both optoelectronic architectures in terms of number of communication steps, parallel run time, speedup, efficiency, cost, and communication cost. The simulation results attained almost near-linear speedup and high efficiency on the two selected optoelectronic architectures, with OTIS-Hypercube gaining better results in comparison with OTIS-Mesh.
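The sequential core being parallelized, repetitive nearest neighbor, runs the nearest-neighbor heuristic from every start city and keeps the best tour; each start city is an independent unit of work, which is what maps naturally onto OTIS nodes. A minimal sketch:

```python
# Repetitive nearest neighbor for the symmetric TSP: run the greedy
# nearest-neighbor heuristic from every start city, keep the best tour.
# Sequential core only; the paper distributes the starts across OTIS nodes.
import numpy as np

def nearest_neighbor(dist, start):
    n = len(dist)
    tour, unvisited = [start], set(range(n)) - {start}
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda c: dist[last][c])
        tour.append(nxt)
        unvisited.remove(nxt)
    length = sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))
    return tour, length

def repetitive_nn(dist):
    # Each start city is independent -- the natural unit of parallel work.
    return min((nearest_neighbor(dist, s) for s in range(len(dist))),
               key=lambda t: t[1])

d = np.random.rand(8, 8); d = (d + d.T) / 2; np.fill_diagonal(d, 0)
print(repetitive_nn(d))
```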
2025, Proceedings Fourth International Conference on High-Performance Computing
The number of physical registers is one of the critical issues of current superscalar out-of-order processors. Conventional architectures allocate in the decode stage a new storage location (e.g. physical register) for each operation that has a destination register. When an instruction is committed, it frees the physical register allocated to the previous instruction that had the same destination logical register. Thus, an additional register (i.e. in addition to the number of logical registers) is used for each instruction with a destination register from the time it is decoded until it is committed. In this paper we propose a novel register organization that allocates physical registers when instructions complete execution. In this way, the register pressure is significantly reduced since the additional register is only spent from the time execution completes until the instruction is committed. For some long latency instructions (e.g. load with a cache miss) and for parts of the code with a small amount of parallelism, the savings could be very high. We have evaluated the new scheme for a superscalar processor and obtained a significant speedup.
2025, Lecture Notes in Computer Science
Indirect branch prediction is a performance limiting factor for current computer systems, preventing superscalar processors from exploiting the available ILP. Indirect branches are responsible for 55.7% of mispredictions in our benchmark set, although they only stand for 15.5% of dynamic branches. Moreover, a 10.8% average IPC speedup is achievable by perfectly predicting all indirect branches. The Multi-Stage Cascaded Predictor (MSCP) is a mechanism proposed for improving indirect branch prediction. In this paper, we show that a MSCP can replace a BTB and accurately predict the target address of both indirect and non-indirect branches. We do a detailed analysis of MSCP behavior and evaluate it in a realistic setup, showing that a 5.7% average IPC speedup is achievable.
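A heavily simplified two-stage rendition of the cascaded idea: a cheap first-stage table indexed by PC, overridden by a history-hashed second stage whenever the latter has seen the branch. This is a toy model for intuition, not the MSCP's actual indexing, tagging, or replacement policy.

```python
# Toy two-stage cascaded indirect-branch predictor: stage 1 is indexed
# by PC alone, stage 2 by PC xor branch history; the history-based
# stage overrides when it hits. Simplified model, not the paper's design.
class CascadedPredictor:
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.stage1 = {}                 # PC -> last seen target
        self.stage2 = {}                 # hash(PC, history) -> last target

    def predict(self, pc, history):
        key2 = (pc ^ history) & self.mask
        if key2 in self.stage2:          # prefer the history-based stage
            return self.stage2[key2]
        return self.stage1.get(pc & self.mask)

    def update(self, pc, history, target):
        key2 = (pc ^ history) & self.mask
        self.stage1[pc & self.mask] = target
        self.stage2[key2] = target

p = CascadedPredictor()
p.update(0x400, history=0b1011, target=0x7f0)
print(hex(p.predict(0x400, history=0b1011) or 0))
```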
2025
The similarities between the human nervous system and digital information-processing systems (CPU architectures) offer meaningful correspondences from an interdisciplinary perspective. This text evaluates the architectural features of the human nervous system in comparison with two fundamental digital system models, the Von Neumann and Harvard architectures. 1. Basic features of the Von Neumann and Harvard architectures: • Von Neumann architecture (1945): instructions and data share a single bus; memory accesses are sequential (the von Neumann bottleneck); widely used in modern PCs and microprocessors. Reference: Von Neumann, J. (1945). First Draft of a Report on the EDVAC. • Harvard architecture: separate instruction and data paths; memory accesses can proceed in parallel; common in microcontrollers (e.g., ARM Cortex-M, AVR) and digital signal processors. Reference: Hennessy, J. L., & Patterson, D. A. (2017). Computer Architecture: A Quantitative Approach. 2. The human nervous system: an architectural perspective. The nervous system uses separate pathways for inputs (afferent) and outputs (efferent); the sensory organs → spinal cord → brain → motor system flow is directional and concurrent; code (motor commands) and data (sensory signals) are kept separate. This structure maps onto the Harvard architecture. Reference:
2025, Journal of Signal Processing Systems
Low-Density Parity-Check (LDPC) codes are very powerful channel coding schemes with a broad range of applications. The existence of low-complexity (i.e., linear-time) iterative message passing decoders with close-to-optimum error correction performance is one of the main strengths of LDPC codes. It has been shown that the performance of these decoders can be further enhanced if the LDPC codes are extended to higher-order Galois fields, yielding so-called non-binary LDPC codes. However, this performance gain comes at the cost of rapidly increasing decoding complexity. To deal with this increased complexity, we present an efficient implementation of a signed-log domain FFT decoder for non-binary irregular LDPC codes which exploits the inherent massive parallelization capabilities of message passing decoders. We employ Nvidia's Compute Unified Device Architecture (CUDA) to incorporate the available processing power of GPUs. Parts of this paper have been presented at the 2013 IEEE Workshop on Signal Processing Systems under the title "High Speed Decoding of Non-Binary Irregular LDPC Codes Using GPUs".
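The signed-log domain mentioned above stores a value x as the pair (sign(x), ln|x|), so that products become additions and sums need only a log1p correction. Generic helpers are sketched below (illustrative Python, not the paper's CUDA kernels).

```python
# Signed-log domain arithmetic: x is represented as (sign(x), ln|x|).
# Products become sign flips plus log additions; sums use a log1p
# correction. Generic helpers for intuition, not the paper's kernels.
import math

def sl_mul(a, b):
    (sa, la), (sb, lb) = a, b
    return (sa * sb, la + lb)            # signs multiply, logs add

def sl_add(a, b):
    (sa, la), (sb, lb) = a, b
    if la < lb:                          # keep the larger magnitude first
        (sa, la), (sb, lb) = (sb, lb), (sa, la)
    d = lb - la                          # <= 0 by construction
    if sa == sb:
        return (sa, la + math.log1p(math.exp(d)))
    return (sa, la + math.log1p(-math.exp(d)))  # -inf log if magnitudes equal

x = (1, math.log(0.75))                  # +0.75
y = (-1, math.log(0.25))                 # -0.25
s, l = sl_add(x, y)
print(s * math.exp(l))                   # ~ 0.5
s, l = sl_mul(x, y)
print(s * math.exp(l))                   # ~ -0.1875
```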
2025, Deep Science Publishing
Cloud computing has transformed many industries, including bioinformatics, changing how scientific research and analytics are done today in disciplines such as biology and genetics, including genomic analytics. The exponential growth of sequencing technologies has reduced genome sequencing costs by two orders of magnitude compared to 10 years ago, making it increasingly affordable for even small research groups to collect very large datasets. The transformative effect of genomic cloud-computing solutions is that non-expert researchers can safely and efficiently connect their biological "wet" lab data to massive, distributed, and complex "dry" lab big data through scalable cloud-based analysis platforms, services, and tools, simply by "clicking" in the web browser, even on smart devices. The traditional way of analyzing data seriously constrains what can realistically be pursued within the time constraints of a single research project, or even several educational and grant cycles. The cost of purchasing the necessary software and hardware infrastructure is well out of reach for small research groups or under-funded academic institutions. Even impressive HPC resources may not be enough to tackle complex questions and perform sophisticated analyses that rival state-of-the-art research, and may be under-utilized for the routine daily needs of biologists, in most cases the end-users of genomics data. In some cases, because of bioethical and administrative reasons, it is impossible to send data over the Internet, or the requested services cannot be found on the market. The prohibitive cost and difficult recruitment of capable bioinformaticians in these regions present a bottleneck in exploiting these datasets and create a widening gap in biomedical research.
2025, 2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Peachy Parallel Assignments are a resource for instructors teaching parallel and distributed programming. These are high-quality assignments, previously tested in class, that are readily adoptable. This collection of assignments includes face recognition, finding the electrical potential of a square wire, and heat diffusion. All of these come with sample assignment sheets and the necessary starter code.
2025, Applied Medical Informatics
For a long time, High-Performance Computing (HPC) has been critical for running large-scale modeling and simulation using numerical models. The big data analytics (BDA) domain has developed rapidly over the last years to process the torrents of data now being generated in various domains. But, in general, the data analytics software was not developed inside the scientific computing community, and new approaches were adopted by BDA specialists. Data-intensive applications are needed in fields ranging from advanced research (genomics, proteomics, epidemiology, and systems biology) to commercial initiatives to develop new drugs and medical treatments, agricultural pesticides, and other bio-products. Big data processing is still needed in the more traditional HPC domains such as physics, climate, and astronomy, but even there adopting data-driven paradigms could bring important advantages. On the other side, BDA needs the infrastructure and the fundamentals of HPC in order to face the computational challenges involved. There are important differences in the approaches of these two domains: those working in BDA focus on the 4Vs of big data (volume, velocity, variety, and veracity), while HPC scientists tend to focus on performance, scaling, and the power efficiency of a computation. As we are heading towards extreme-scale HPC coupled with data-intensive analytics, the integration of BDA and HPC is a necessity and a current hot topic of research.
2025
We present preliminary results with the TyTra design flow. Our aim is to create a parallelising compiler for high-performance scientific code on heterogeneous platforms, with a focus on Field-Programmable Gate Arrays (FPGAs). Using the functional language Idris, we show how this programming paradigm facilitates generation of different correct-by-construction program variants through type transformations. We have developed a custom Intermediate Representation (IR) language, the TyTra-IR, which is similar to the LLVM IR, with extensions to express parallelism, allowing us to design variants associated with each program variant. The key innovation of the TyTra-IR is the ability to construct and cost design variants for FPGAs. Our prototype compiler generates Verilog code for FPGA synthesis from a given IR description. Using a real-world Successive Over-Relaxation (SOR) kernel, we illustrate generation of program variants in Idris, their representation in TyTra-IR, and evaluation of variants using our cost model. We compare the estimates from the cost model with results from synthesis and simulation of equivalent HDL.
2025
High-Performance Computing (HPC) platforms present a significant programming challenge, especially because the key users of HPC resources are scientists, not parallel programmers. We contend that compiler technology has to evolve to automatically create the best program variant by transforming a given original program. We have developed a novel methodology based on type transformations for generating correct-by-construction design variants, and an associated light-weight cost model for evaluating these variants for implementation on FPGAs. In this paper we present a key enabler of our approach, the cost model. We discuss how we are able to quickly derive accurate estimates of performance and resource-utilization from the design's representation in our intermediate language. We show results confirming the accuracy of our cost model by testing it on three different scientific kernels. We conclude with a case-study that compares a solution generated by our framework with one from a conventional high-level synthesis tool, showing better performance and power-efficiency using our cost model based approach.
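A toy flavor of such a light-weight cost model: estimate resource use and throughput of a streaming-kernel variant from a few static parameters and check the estimate against device limits. All fields and numbers below are hypothetical placeholders, not the TyTra model's actual coefficients.

```python
# Toy FPGA cost model: estimate DSP usage and throughput of a streaming
# kernel variant from static parameters. Placeholder coefficients only;
# not the TyTra cost model itself.
from dataclasses import dataclass

@dataclass
class Variant:
    ops_per_stage: int     # arithmetic ops in one pipeline stage
    unroll: int            # how many lanes run in parallel
    dsp_per_op: int = 1    # DSP blocks consumed per op (assumed)

def estimate(v: Variant, fmax_mhz=200, dsp_limit=1518):
    dsps = v.ops_per_stage * v.unroll * v.dsp_per_op
    feasible = dsps <= dsp_limit                  # fits the device?
    throughput = fmax_mhz * 1e6 * v.unroll        # items/s: 1 item/cycle/lane
    return {"dsps": dsps, "feasible": feasible, "items_per_s": throughput}

for unroll in (1, 4, 16, 64):
    print(unroll, estimate(Variant(ops_per_stage=24, unroll=unroll)))
```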
2025
Many numerical simulation applications from the scientific, financial and machine-learning domains require large amounts of compute capacity. They can often be implemented with a streaming data-flow architecture. Field Programmable Gate Arrays (FPGA) are particularly power-efficient hardware architectures suitable for streaming data-flow applications. Although numerous programming languages and frameworks target FPGAs, expert knowledge is still required to optimise the throughput of such applications for each target FPGA device. The process of selecting which optimising transformations to apply, and where to apply them is dubbed Design Space Exploration (DSE). We contribute an elegant and efficient compiler based DSE strategy for FPGAs by merging information sourced from the compiled application's semantic structure, an accurate cost-performance model and a description of hardware resource limits for particular FPGAs. Our work leverages developments in functional programming and dependent type theory to bring performance portability to the realm of High-Level Synthesis (HLS) tools targeting FPGAs. We showcase our approach by presenting achievable speedups for three example applications. Results indicate considerable improvements in throughput of up to 58× in one example. These results are obtained by traversing a minute fraction of the total Design Space.
2025, 2019 International Conference on ReConFigurable Computing and FPGAs (ReConFig)
Many numerical simulation applications from the scientific, financial and machine-learning domains require large amounts of compute capacity. They can often be implemented with a streaming data-flow architecture. Field Programmable Gate Arrays (FPGA) are particularly power-efficient hardware architectures suitable for streaming data-flow applications. Although numerous programming languages and frameworks target FPGAs, expert knowledge is still required to optimise the throughput of such applications for each target FPGA device. The process of selecting which optimising transformations to apply, and where to apply them is dubbed Design Space Exploration (DSE). We contribute an elegant and efficient compiler based DSE strategy for FPGAs by merging information sourced from the compiled application's semantic structure, an accurate cost-performance model and a description of hardware resource limits for particular FPGAs. Our work leverages developments in functional programming and dependent type theory to bring performance portability to the realm of High-Level Synthesis (HLS) tools targeting FPGAs. We showcase our approach by presenting achievable speedups for three example applications. Results indicate considerable improvements in throughput of up to 58× in one example. These results are obtained by traversing a minute fraction of the total Design Space.
2025, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
High-Performance Computing (HPC) platforms present a significant programming challenge, especially because the key users of HPC resources are scientists, not parallel programmers. We contend that compiler technology has to evolve to automatically create the best program variant by transforming a given original program. We have developed a novel methodology based on type transformations for generating correct-by-construction design variants, and an associated light-weight cost model for evaluating these variants for implementation on FPGAs. In this paper we present a key enabler of our approach, the cost model. We discuss how we are able to quickly derive accurate estimates of performance and resource-utilization from the design's representation in our intermediate language. We show results confirming the accuracy of our cost model by testing it on three different scientific kernels. We conclude with a case-study that compares a solution generated by our framework with one from a conventional high-level synthesis tool, showing better performance and power-efficiency using our cost model based approach.
2025
We present the TyTra-IR, a new intermediate language intended as a compilation target for high-level language compilers and a front-end for HDL code generators. We develop the requirements of this new language based on the design-space of FPGAs that it should be able to express and the estimation-space in which each configuration from the design-space should be mappable in an automated design flow. We use a simple kernel to illustrate multiple configurations using the semantics of TyTra-IR. The key novelty of this work is the cost model for resource-costs and throughput for different configurations of interest for a particular kernel. Through the realistic example of a Successive Over-Relaxation kernel implemented both in TyTra-IR and HDL, we demonstrate both the expressiveness of the IR and the accuracy of our cost model.
2025
While relational databases have become critically important in business applications and web services, they have played a relatively minor role in scientific computing, which has generally been concerned with modeling and simulation activities. However, massively parallel database architectures are beginning to offer the ability to quickly search through terabytes of data with hundred-fold or even thousand-fold speedup over server-based architectures. These new machines may enable an entirely new class of algorithms for scientific applications, especially when the fundamental computation involves searching through abstract graphs. Three examples are examined and results are reported for implementations on a novel, massively parallel database computer, which enabled very high performance. Promising results from (1) computation of bibliographic couplings, (2) graph searches for sub-circuit motifs within integrated circuit netlists, and (3) a new approach to word sense disambiguation in natural language processing, all suggest that the computational science community might be able to make good use of these new database machines.
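The first workload, bibliographic coupling, simply counts shared references between document pairs. A naive sketch follows; the point of the database machine is running this join at terabyte scale, which the sketch does not attempt.

```python
# Bibliographic coupling: two documents are coupled by the number of
# references they share. Naive pairwise sketch of the computation.
from itertools import combinations

refs = {                                  # doc -> set of cited works
    "A": {"r1", "r2", "r3"},
    "B": {"r2", "r3", "r4"},
    "C": {"r5"},
}

coupling = {
    (d1, d2): len(refs[d1] & refs[d2])    # count of shared citations
    for d1, d2 in combinations(sorted(refs), 2)
}
print(coupling)   # {('A', 'B'): 2, ('A', 'C'): 0, ('B', 'C'): 0}
```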
2025, Journal of Parallel and Distributed Computing
• A new content selection mechanism for Shared Last-Level Caches (SLLC) in chip multiprocessor systems is proposed.
• The mechanism leverages the reuse locality embedded in the SLLC request stream.
• With the addition of a Reuse Detector (ReD), located between each L2 cache and the SLLC, the mechanism discovers useless L2-evicted blocks and bypasses them.
• The ReD mechanism is designed to overcome, as far as possible, problems affecting previous state-of-the-art proposals, such as low accuracy, a reduced visibility window, and detector thrashing.
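A toy model of the bypass decision: treat a block's first L2 eviction as "no reuse seen" and bypass it, inserting into the SLLC only once the same address shows up again. This is a simplification for intuition, not the paper's hardware structure or exact policy.

```python
# Toy reuse detector: bypass first-touch evictions, insert into the
# SLLC only for addresses whose eviction has been seen before.
# Simplified model of the idea, not the ReD hardware design.
class ReuseDetector:
    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.seen = set()                 # addresses evicted once already

    def should_insert(self, addr):
        if addr in self.seen:             # evicted before -> shows reuse
            return True
        if len(self.seen) >= self.capacity:
            self.seen.pop()               # crude eviction of tracker state
        self.seen.add(addr)
        return False                      # bypass on first sight

red = ReuseDetector()
print(red.should_insert(0x1000))          # False: first eviction, bypass
print(red.should_insert(0x1000))          # True: reuse detected, insert
```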
2025, Proceedings of the 2006 conference on Specification and verification of component-based systems
Experiments in the use of tau-simulations for the component verification of real-time systems
2025, ArXiv
We present a set of fault injection experiments performed on the ACES (LANL/SNL) Cray XE supercomputer Cielo. We use this experimental campaign to improve the understanding of the failure causes and propagation that we observed in the field failure data analysis of NCSA's Blue Waters. We use the data collected from the logs and from network performance counter data (1) to characterize the fault-error-failure sequence and recovery mechanisms in the Gemini network and in the Cray compute nodes, (2) to understand the impact of failures on the system and the user applications at different scales, and (3) to identify and recreate fault scenarios that induce unrecoverable failures, in order to create new tests for system and application design. The faults were injected through special input commands to bring down network links, directional connections, nodes, and blades. We present extensions that will be needed to apply our methodologies of injection and analysis to the Cray XC (Aries) systems.
2025
While it is widely acknowledged that network congestion in High Performance Computing (HPC) systems can significantly degrade application performance, there has been little to no quantification of congestion on credit-based interconnect networks. We present a methodology for detecting, extracting, and characterizing regions of congestion in networks. We have implemented the methodology in a deployable tool, Monet, which can provide such analysis and feedback at runtime. Using Monet, we characterize and diagnose congestion in the world's largest 3D torus network of Blue Waters, a 13.3-petaflop supercomputer at the National Center for Supercomputing Applications. Our study deepens the understanding of production congestion at a scale that has never been evaluated before.
2025
This handbook was produced as part of the project "Vernetztes Forschungsdatenmanagement an Hochschulen für angewandte Wissenschaften am Beispiel der HTW Dresden – FoDaMa-HTWD" (networked research data management at universities of applied sciences, using HTW Dresden as an example). It offers a short, clear summary of the most important insights into research data management (RDM) gained at the Hochschule für Technik und Wirtschaft Dresden (HTWD) over the course of the project. With this handbook, the authors want to support other universities of applied sciences (HAW) in developing a strategy and building the necessary RDM structures. It is therefore aimed primarily at people at universities who work on the strategic development of research and who may be asking which supporting RDM services and measures should be put in place so that the researchers of their own institution can meet the increasing demand for an open and sustainable way of working in...
2025, 2007 IEEE International Symposium on Industrial Electronics
Decimal arithmetic supported by digital computers has been gaining renewed importance over the last few years. However, the development of high performance radix 10-based systems is still incipient. In this paper, a modification of the CORDIC method for decimal arithmetic is proposed. The resulting algorithm works with radix 10 operands and combines decimal arithmetic with elementary angles so as to reduce the number of iterations required to achieve certain precision. Different experiments showing the advantages of the new method compared with the original decimal CORDIC method are also described. Finally, an architecture for the method implemented on FPGA is proposed.
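For intuition, a radix-10 CORDIC rotation can be sketched as follows: at iteration j the elementary angle is atan(10^-j), applied between 0 and 9 times (one decimal digit) rather than exactly once as in radix-2 CORDIC. This float-based sketch only illustrates the idea; the paper targets true decimal arithmetic in hardware and a different selection of elementary angles.

```python
# Radix-10 CORDIC rotation sketch: each iteration applies the
# elementary angle atan(10**-j) up to 9 times (one decimal digit).
# Float-based illustration; assumes |theta| is at most about pi/4.
import math

def decimal_cordic_rotate(x, y, theta, iters=8):
    gain = 1.0
    for j in range(iters):
        alpha, step = math.atan(10.0 ** -j), 10.0 ** -j
        for _ in range(9):                      # digit can be 0..9
            if abs(theta) < alpha:
                break
            s = 1.0 if theta > 0 else -1.0
            x, y = x - s * y * step, y + s * x * step
            theta -= s * alpha
            gain *= math.sqrt(1.0 + step * step)
    return x / gain, y / gain                   # undo accumulated gain

print(decimal_cordic_rotate(1.0, 0.0, math.pi / 6))  # ~ (cos 30°, sin 30°)
```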
2025
ABSTRACT: Today, studies in basic and applied science use High-Performance Computing (HPC) to tackle problems of ever greater size and complexity. HPC improves the capacity, speed, and precision of data processing. The project from which this work originates proposes to address six studies from the HPC perspective, in order to explore the central aspects of parallelism as applied from computer science to other disciplines. Some of the studies addressed are already under way at the Universidad Nacional de Chilecito; others begin from cooperation with other institutions or formalize final postgraduate works. In all cases, HPC will be approached through an organized methodological process to: ● consolidate an infrastructure for the experimentation, development, and production of solutions to HPC problems; ● develop the scientific and technological capacities of the...
2025, Lecture Notes in Computer Science
Multicast topology inference from end-to-end measurements has been widely used recently. Algorithms that infer topology from loss distributions show good performance in inference accuracy and time complexity. However, to our knowledge, the existing results produce logical topology structures only in complete binary tree form, which in most cases differ significantly from the actual network topology. To solve this problem, we propose an algorithm that makes use of an additional measure: hop count. Incorporating hop count into binary tree topology inference helps reduce time complexity and improve inference accuracy. Through comparison and analysis, we find that the worst-case time complexity of our algorithm is O(l²), much better than the O(l³) required by the previous algorithm. The expected time complexity of our algorithm is estimated at O(l·log₂ l), while that of the previous algorithm is O(l³).
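The loss-based grouping step that hop counts accelerate rests on a standard estimator: for receivers i and j, E[Xi]E[Xj]/E[XiXj] estimates the success probability of their shared path, and the pair whose shared path is least reliable shares the deepest ancestor. The sketch below shows only this estimator under the usual independence assumptions; the paper's hop-count-assisted algorithm itself is more involved.

```python
# Standard shared-path estimator behind loss-based multicast topology
# inference. X[k, t] = 1 if receiver k got probe t. Under the usual
# multicast loss model, E[Xi]E[Xj]/E[XiXj] estimates the success rate
# of the path shared by i and j. Toy sketch, not the paper's algorithm.
import numpy as np

def shared_path_success(X, i, j):
    pi, pj = X[i].mean(), X[j].mean()
    pij = (X[i] * X[j]).mean()
    return pi * pj / pij if pij > 0 else 1.0

rng = np.random.default_rng(0)
probes = rng.random((3, 10000))
shared = rng.random(10000) < 0.9          # shared link, 90% success
X = np.vstack([shared & (probes[k] < 0.95) for k in range(3)]).astype(float)
print(shared_path_success(X, 0, 1))       # ~0.9 for siblings 0 and 1
```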
2025, Built Environment Project and Asset Management
Purpose - Despite ongoing safety efforts, construction sites remain some of the most hazardous workplaces. This study introduces an innovative occupational safety and health administration (OSHA) training approach by creating a realistic virtual construction environment using unmanned aerial vehicle (UAV) imagery and a game engine. Integrating OSHA regulations makes safety instructions more effective than traditional training.
Design/methodology/approach - The research employs UAV-derived photogrammetry to generate a 3D textured mesh model of an active construction site. This model is integrated into a game engine to develop an interactive, first-person simulation where users explore the site and receive safety instructions at hazard points. Validation was conducted through questionnaire surveys of 13 construction professionals and 25 undergraduate students.
Findings - The study shows that interactive game-based learning significantly improves trainees' ability to identify and understand site-specific hazards. Survey responses from students and construction professionals indicated that the game is more effective in teaching safety protocols than traditional OSHA 30 training.
Practical implications - The study demonstrates that integrating UAV photogrammetry with game engines enhances construction safety training by improving hazard recognition and knowledge retention. Survey results show higher effectiveness than traditional training. This approach enables realistic, site-specific safety instruction, supporting OSHA compliance and reducing accidents through interactive, immersive learning.
Originality/value - This research enhances safety training by integrating high-fidelity 3D models from UAV photogrammetry with a game engine to develop an interactive learning platform. Unlike traditional methods with generic simulations, this approach reflects the specific conditions and hazards of active construction sites, offering tailored safety instructions.
2025
TRIPOLI® and TRIPOLI-4® are registered trademarks of CEA. Although Monte Carlo (MC) codes are natural users of the fast-growing capacities in High Performance Computing (HPC), adapting production-level codes such as TRIPOLI-4 to the exascale is very challenging. We present here the dual strategy we follow: new thoughts and developments for the next versions of TRIPOLI-4, as well as insights into a prototype of a next-generation Monte Carlo (NMC) code designed from the beginning with the exascale in mind. The random generators of the code are also presented, as well as the strategy for verification of the parallelism.
2025
Periodicity can change material properties in a very unintuitive way. Many wave propagation phenomena, such as waveguides, light-bending structures, or frequency filters, can be modeled through finite periodic structures designed using optimization techniques. Two different kinds of problems can be found: those involving linear waves and those involving nonlinear waves. The former have been widely studied and analyzed within the last few years, and many interesting results have been found: cloaking devices, superlensing, fiber optics. The latter is a topic of high interest nowadays, and a lot of work still needs to be done, since it is far more complicated and very little is known. Nonlinear wave phenomena include acoustic amplitude filters, sound bullets, and elastic shock mitigation structures, among others. The wave equation can be solved accurately using the Hybridizable Discontinuous Galerkin (HDG) method, both in the time and in the frequency domain. Furthermore, convex optimization techniques can be used to obtain the desired material properties. Thus, the path to follow is to implement a wave phenomena simulator in one and two dimensions and then formulate specific optimization problems that will lead to materials with particular and special properties. Among the optimization problems that can be found, there are eigenvalue optimization problems as well as more general optimal-control topology optimization problems. This thesis is focused on linear phenomena. An HDG simulation code has been developed, and optimization problems for the design of some model devices have also been formulated. A series of numerical results is included, showing how effective and unintuitive such designs are.
2025, Rodolfo Pitti
The Quantum Tessellation Algorithm (QTA) represents a paradigm shift in computational physics and financial modeling, introducing a unified framework that bridges quantum field theory simulations with econophysics applications. This revolutionary approach leverages adaptive space-time tessellation, advanced spectral methods, and quantum-inspired operators to achieve unprecedented computational efficiency and accuracy. Through rigorous validation against six fundamental quantum field theory problems-including free scalar fields, φ⁴ theory, Yukawa interactions, quantum electrodynamics (QED), quantum chromodynamics (QCD), and scalar QED-QTA demonstrates exceptional performance with errors consistently below 10⁻⁴ and computational speedups ranging from 8 to 12 times faster than traditional Monte Carlo methods. The algorithm's financial applications prove equally impressive, with the hybrid QTA-financial model extending to sophisticated multi-asset portfolio optimization incorporating transaction costs, real-time data feeds, and advanced risk management protocols. Empirical validation on historical market data spanning 2020-2022 for S&P 500, NASDAQ, and gold futures reveals prediction error reductions of 18% and Sharpe ratio improvements of 13% compared to conventional models. A pioneering Qiskit-based quantum computing implementation demonstrates QTA's readiness for next-generation quantum hardware, while interactive Plotly visualizations enhance interpretability and practical application across diverse scientific and financial domains.
2025
conducted a tour of the facilities and answered questions for an elementary school class researching advanced life support on December 1st. Yang Yang, Gioia Massa, and Cary Mitchell met with Jerry Shephard in the Central Machine shop to discuss plans for the development of the crop-canopy gas-exchange cuvette system, Minitron III. In addition, discussions were pursued with Al Heber and Connie Li to determine what, if any, gas contaminants may be produced by the LED lighting system. Plans are underway to capture and analyze the cooling air passing through the lightsicles. Following an analysis of light output and current levels for both LED lighting systems, a side-by-side cowpea experiment was planted December 20th in the growth chamber comparing intracanopy and overhead LED lighting. In addition, this experiment will be part of a collaboration with systems analyst Jim Russell to examine transpirational burdens under different lighting conditions. The conditions set for this experiment are 30 plants per growth area (0.23 m²) with light levels set to approximately 300 µmol/m²/s at 2.5 cm from the light engines. Plants are being grown for 32 days, with pH and conductivity adjusted every other day. The treatments will be harvested January 20th. Lettuce plants were planted Dec. 21st in a side-by-side comparison of manual versus automated hydroponic pH adjustment. This will be the first test with plants of the automated pH control system developed by Moeed Muhktar in George Chiu's lab. Harvests of experiments occurred November 29th and December 15th for carrot and December 7th for sweetpotato. Harvested biomass was sent to Lisa Mauer in Food Science for use in antioxidant studies. A second batch of carrots and sweetpotatoes has been replanted for replication. Final dry weight of paired basil/wheat straw residues composted with P. ostreatus 'Grey Dove' was taken and data analyzed. Samples were prepared for analyses of residual lignin content, cellulose and hemicellulose following 80-90 days of fungal colonization. In continuation of our collaborative work with the Food Safety team at AAMU, growth and maintenance of radish and lettuce using a nutrient film technique is on-going. Water samples from the nutrient film including leaf samples are being examined for waterborne food pathogens.
2025
Progress of each ALS-NSCORT project given by each project lead. 9 pages.
2025, arXiv (Cornell University)
Morphological features of small vessels provide invaluable information regarding the underlying tissue, especially in cancerous tumors. This paper introduces methods for obtaining quantitative morphological features from microvasculature images obtained by non-contrast ultrasound imaging. Such images suffer from artifacts that limit quantitative analysis of vessel morphological features. In this paper we introduce processing steps that increase the accuracy of morphological assessment for quantitative vessel analysis in the presence of these artifacts. Specifically, artifacts are reduced by additional filtering, and vessel segments obtained by skeletonization of the regularized microvasculature images are further analyzed to satisfy additional constraints, such as the diameter and length of vessel segments. Measurement of some morphological metrics, such as tortuosity, depends on preserving large vessel trunks that may be broken down into multiple branches. We propose two methods to address this problem. In the first method, small vessel segments are suppressed in the vessel filtering process by adjusting the size scale of the regularization; hence, the tortuosity of large trunks can be estimated more accurately by preserving longer vessel segments. In the second approach, small connected vessel segments are removed by a combination of morphological erosion and dilation operations on the segmented vasculature images. These methods are tested on representative in vivo images of breast lesion microvasculature, and the outcomes are discussed. This paper provides a tool for quantification of microvasculature images from non-contrast ultrasound imaging, which may yield potential biomarkers for the diagnosis of some diseases.
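One of the standard metrics extracted from the skeletonized segments, distance-metric tortuosity, is simply path length divided by the straight-line distance between a segment's endpoints (1.0 means perfectly straight). A minimal sketch:

```python
# Distance-metric tortuosity of a skeletonized vessel segment:
# path length / endpoint-to-endpoint chord length (1.0 = straight).
import numpy as np

def tortuosity(points):
    """points: (N, 2) ordered pixel coordinates along one segment."""
    pts = np.asarray(points, dtype=float)
    steps = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    chord = np.linalg.norm(pts[-1] - pts[0])
    return steps.sum() / chord if chord > 0 else np.inf

straight = [(0, 0), (1, 1), (2, 2)]
curved = [(0, 0), (1, 2), (2, 0), (3, 2)]
print(tortuosity(straight), tortuosity(curved))  # 1.0, > 1.0
```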
2025
LIDE (the DSPCAD Lightweight Dataflow Environment) is a flexible, lightweight design environment that allows designers to experiment with dataflow-based approaches for design and implementation of digital signal processing (DSP) systems. LIDE contains libraries of dataflow graph elements (primitive actors, hierarchical actors, and edges) and utilities that assist designers in modeling, simulating, and implementing DSP systems using formal dataflow techniques. The libraries of dataflow graph elements (mainly actors) contained in LIDE provide useful building blocks that can be used to construct signal processing applications, and that can be used as examples that designers can adapt to create their own, customized LIDE actors. Furthermore, by using LIDE along with the DSPCAD Integrative Command Line Environment (DICE), designers can efficiently create and execute unit tests for user-designed actors. This report provides an introduction to LIDE. The report includes details on the process for setting up the LIDE environment, and covers methods for using pre-designed libraries of graph elements, as well as creating user-designed libraries and associated utilities using the C language. The report also gives an introduction to the C language plug-in for dicelang. This plug-in, called dicelang-C, provides features for efficient C-based project development and maintenance that are useful to apply when working with LIDE.
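LIDE actors follow an enable/invoke design: enable checks the firing rule (enough tokens on the inputs), invoke fires the actor. The Python rendering below illustrates that pattern only; LIDE itself exposes a C API, and the names here are illustrative, not LIDE's.

```python
# Enable/invoke dataflow actor pattern, sketched in Python. LIDE's real
# actors are C structures with enable/invoke functions; names here are
# illustrative only.
from collections import deque

class Fifo(deque):
    pass

class AddActor:
    def __init__(self, in_a: Fifo, in_b: Fifo, out: Fifo):
        self.in_a, self.in_b, self.out = in_a, in_b, out

    def enable(self):                      # firing rule: one token per input
        return len(self.in_a) >= 1 and len(self.in_b) >= 1

    def invoke(self):                      # consume inputs, produce output
        self.out.append(self.in_a.popleft() + self.in_b.popleft())

a, b, c = Fifo([1, 2]), Fifo([10, 20]), Fifo()
adder = AddActor(a, b, c)
while adder.enable():
    adder.invoke()
print(list(c))                             # [11, 22]
```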
2025, Lecture Notes in Computer Science
Existing approaches to train neural networks that use large images require to either crop or down-sample data during pre-processing, use small batch sizes, or split the model across devices, mainly due to the prohibitively limited memory capacity available on GPUs and emerging accelerators. These techniques often lead to longer time to convergence or time to train (TTT), and in some cases, lower model accuracy. CPUs, on the other hand, can leverage significant amounts of memory. While much work has been done on parallelizing neural network training on multiple CPUs, little attention has been given to tuning neural network training with large images on CPUs. In this work, we train a multiscale convolutional neural network (M-CNN) to classify large biomedical images for high content screening in one hour. The ability to leverage large memory capacity on CPUs enables us to scale to larger batch sizes without having to crop or down-sample the input images. In conjunction with large batch sizes, we find a generalized methodology of linearly scaling the learning rate and train M-CNN to state-of-the-art (SOTA) accuracy of 99% within one hour. We achieve fast time to convergence using 128 two-socket Intel Xeon 6148 processor nodes with 192GB DDR4 memory connected with 100Gbps Intel Omnipath architecture.
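The linear scaling rule referred to above grows the learning rate proportionally with the global batch size, usually with a warmup ramp. A generic sketch follows; the base values are made-up examples, not the paper's hyperparameters.

```python
# Linear learning-rate scaling for large-batch training: lr grows in
# proportion to the batch size, with a warmup ramp to avoid divergence.
# Example values only, not the paper's hyperparameters.
def scaled_lr(base_lr, base_batch, batch, step, warmup_steps=500):
    lr = base_lr * (batch / base_batch)        # linear scaling rule
    if step < warmup_steps:                    # warmup ramp
        lr *= (step + 1) / warmup_steps
    return lr

for step in (0, 250, 499, 1000):
    print(step, round(scaled_lr(0.1, 256, 4096, step), 4))
```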
2025
This article considers the use of turbo codes in special-purpose telecommunication systems and outlines the basic principles of turbo code construction together with their characteristics.
2025, Journal of Big Data
Significant investments to upgrade and construct large-scale scientific facilities demand commensurate investments in R&D to design algorithms and computing approaches that enable scientific and engineering breakthroughs in the big data era. Innovative artificial intelligence (AI) applications have powered transformational solutions for big data challenges in industry and technology that now drive a multi-billion-dollar industry and play an ever-increasing role in shaping human social patterns. As AI continues to evolve into a computing paradigm endowed with statistical and mathematical rigor, it has become apparent that single-GPU solutions for training, validation, and testing are no longer sufficient for the computational grand challenges brought about by scientific facilities that produce data at a rate and volume that outstrip the computing capabilities of available cyberinfrastructure platforms. This realization has been driving the confluence of AI and high performance computing (HPC) to reduce time-to-insight and to enable a systematic study of domain-inspired AI architectures and optimization schemes for data-driven discovery. In this article we present a summary of recent developments in this field and describe specific advances that the authors are spearheading to accelerate and streamline the use of HPC platforms to design and apply accelerated AI algorithms in academia and industry.
2025
We present the development of a Linux PC cluster for high performance computing within the framework of the AMS project.
2025, Proceedings 16th International Parallel and Distributed Processing Symposium
During the last few years, the concepts of cluster computing and heterogeneous networked systems have received increasing interest. The popularity of using Java for developing parallel and distributed applications that run on heterogeneous distributed systems has also grown rapidly. This paper is a survey of current projects in parallel and distributed Java. These projects' main common objective is to utilize available heterogeneous systems to provide high performance computing using Java. The projects were studied, compared, and classified based on the approaches used. The study identifies three major approaches. One is to develop a system that replaces the Java virtual machine (JVM) or builds on available parallel infrastructure such as MPI or PVM. Another is to provide seamless parallelization of multi-threaded applications. The third is to provide a pure Java implementation by adding classes and features that support parallel Java programming. In addition, a number of open issues are discussed.
2025, Empirical evaluation of parallel implementations of MergeSort
Sorting algorithms are a fundamental building block of computer systems. MergeSort is a well-known sorting algorithm, widely appreciated for its efficiency, relative simplicity, and other features. This article presents an empirical evaluation of parallel versions of MergeSort, using shared- and distributed-memory approaches, on a high-performance computing infrastructure. The main result indicates that parallelizing the recursive invocations combined with a parallel merge operation offers better speedup than parallelizing the recursive invocations alone. Moreover, better speedup was achieved in the shared-memory environment.
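As a rough illustration of the shared-memory variant being compared, the sketch below parallelizes the recursive sorting work across processes while keeping the merge sequential; the paper's best-performing version additionally parallelizes the merge step. This is a minimal sketch, not the authors' implementation.

```python
# Minimal shared-memory sketch: chunks are sorted in parallel worker processes
# (standing in for parallelized recursive invocations), then merged
# sequentially. The paper's best variant also parallelizes the merge.
from concurrent.futures import ProcessPoolExecutor

def merge(left, right):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out

def mergesort(xs):
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    return merge(mergesort(xs[:mid]), mergesort(xs[mid:]))

def parallel_mergesort(xs, workers=4):
    """Sort chunks in parallel processes, then merge the sorted chunks."""
    if len(xs) <= 1:
        return list(xs)
    step = -(-len(xs) // workers)  # ceiling division
    chunks = [xs[i:i + step] for i in range(0, len(xs), step)]
    with ProcessPoolExecutor(max_workers=workers) as ex:
        sorted_chunks = list(ex.map(mergesort, chunks))
    result = sorted_chunks[0]
    for chunk in sorted_chunks[1:]:
        result = merge(result, chunk)
    return result

if __name__ == "__main__":
    import random
    data = [random.randint(0, 10**6) for _ in range(100_000)]
    assert parallel_mergesort(data) == sorted(data)
```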
2025, IEEE Access
The main goals of fifth generation (5G) systems are to significantly increase network capacity and to support new 5G service requirements. Ultra network densification with small cells is among the key pillars of 5G evolution. The inter-small-cell 5G backhaul network carries massive data traffic. Hence, it is important to have a centralized, efficient multi-hop routing protocol for backhaul networks that manages and speeds up routing decisions among small cells while meeting 5G service requirements. This paper proposes a parallel multi-hop routing protocol to speed up routing decisions in 5G backhaul networks. To this end, we study the efficiency of utilizing the parallel platforms of cloud computing and high-performance computing (HPC) to manage and speed up the parallel routing protocol for different communication network sizes, and we provide recommendations for utilizing cloud resources to adopt the parallel protocol. Our numerical results indicate that the HPC parallel implementation outperforms the cloud computing implementation in terms of routing decision speed-up and scalability to large network sizes. In particular, for a large network of 2048 nodes, our HPC implementation achieves a routing speed-up of 37x, whereas the best routing speed-up achieved by our cloud computing implementation is 15.5x, recorded using one virtual machine (VM) for a network of 1024 nodes. In summary, there is a trade-off between the better performance of HPC and the flexible resources of cloud computing; thus, choosing the best-fit platform for 5G routing protocols depends on the deployment scenario at the 5G core or edge network. INDEX TERMS 5G routing protocol, Cloud Radio Access Networks, Cloud computing, HPC, Ultra-Dense Network.
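To illustrate the kind of parallelism such a protocol can exploit, the sketch below runs independent per-source shortest-path computations (a stand-in for routing decisions) across a process pool. The toy graph and the use of Dijkstra's algorithm are illustrative assumptions, not the paper's protocol.

```python
# Hedged sketch of the parallelism a centralized routing protocol can exploit:
# per-source shortest-path computations are independent, so routing tables for
# all sources can be computed concurrently. Graph and sizes are toy examples.
import heapq
from concurrent.futures import ProcessPoolExecutor

GRAPH = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2), (3, 5)], 3: []}

def dijkstra(source):
    """Shortest-path distances from `source` over the global GRAPH."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in GRAPH.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return source, dist

if __name__ == "__main__":
    with ProcessPoolExecutor() as ex:
        tables = dict(ex.map(dijkstra, GRAPH))  # one routing table per source
    print(tables[0])  # {0: 0, 1: 3, 2: 1, 3: 4}
```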