Parallel Algorithms Research Papers - Academia.edu

2025, Revista …

Revista Internacional de Métodos Numéricos para Cálculo y Diseño en Ingeniería ... Maria Dolors Martínez, Departamento de Física Aplicada (ETSAB), UPC, Avenida Diagonal, 649, 08028 Barcelona, Spain. Tel.: 34-93-401 63 78, ...

2025

ERS. The raw data form a matrix of 26800 lines by 5616 columns (pixels). The pixels are complex numbers encoded in 1 + 1 bytes, and the total size is approximately 300 MB. The computations are carried out in floating-point format, which yields a working matrix of 1.2 GB. The resulting SAR image has 25000 lines by 4912 columns. Its pixels are complex and encoded in 2 + 2 bytes, for a total of about 500 MB. A parallel SAR processor is presented in this paper. The target configuration is a cluster of UNIX workstations, which is available at most user sites. This makes it possible to increase computing performance without investing in dedicated hardware.
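
The sizes quoted in this abstract can be recomputed from the stated matrix dimensions. The sketch below is only a sanity check: it assumes decimal units (1 MB = 10^6 bytes) and assumes that the floating-point working format stores each complex pixel as 4 + 4 bytes, which the abstract does not state explicitly. The function name `size_bytes` is ours.

```python
# Recompute the storage figures from the dimensions given in the abstract.
# Assumptions (not stated in the source): decimal MB/GB, and 4 + 4 bytes per
# complex pixel for the floating-point working matrix.

def size_bytes(lines: int, columns: int, bytes_per_pixel: int) -> int:
    """Storage needed for a dense lines x columns matrix."""
    return lines * columns * bytes_per_pixel

raw = size_bytes(26800, 5616, 1 + 1)      # raw data, complex pixels in 1 + 1 bytes
working = size_bytes(26800, 5616, 4 + 4)  # assumed single-precision complex
image = size_bytes(25000, 4912, 2 + 2)    # output image, complex pixels in 2 + 2 bytes

print(f"raw data:       {raw / 1e6:.0f} MB")
print(f"working matrix: {working / 1e9:.2f} GB")
print(f"SAR image:      {image / 1e6:.0f} MB")
```

Under these assumptions the three figures come out close to the 300 MB, 1.2 GB, and 500 MB quoted above.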

2025

We study the approximability of the edge dominating set problem. Besides its NP-hardness, it is known that a solution at most twice as large as the smallest one can be computed efficiently, owing to the problem's close relationship to minimum maximal matching. When graphs are edge weighted, however, this relationship breaks down in general, and no edge dominating set of small weight is obtainable from any maximal matching. In this paper, after showing that weighted edge domination is as hard to approximate as weighted vertex cover, we consider two natural strategies, one reducing edge dominating set to vertex cover and the other to edge cover, ...
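
The unweighted 2-approximation mentioned in this abstract rests on the fact that the edges of any maximal matching dominate every edge of the graph. A minimal sketch of that idea follows; it covers only the unweighted case, not the weighted strategies the paper studies, and the function names and example graph are illustrative.

```python
def maximal_matching(edges):
    """Greedily build a maximal matching: keep an edge if both ends are still free."""
    matched = set()
    matching = []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching

def is_edge_dominating(edges, chosen):
    """Every edge must be chosen or share an endpoint with a chosen edge."""
    covered = {x for e in chosen for x in e}
    return all(u in covered or v in covered for u, v in edges)

# Hypothetical example: a 5-cycle 1-2-3-4-5 with one chord (2, 5).
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (2, 5)]
eds = maximal_matching(edges)
print(eds, is_edge_dominating(edges, eds))
```

Because the matching is maximal, every remaining edge touches a matched vertex, which is exactly the domination condition checked by `is_edge_dominating`.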

2025, 2010 IEEE International Conference on Cluster Computing

Computational aeroacoustics (CAA) has emerged as a tool to complement theoretical and experimental approaches for robust and accurate prediction of sound levels from aircraft airframes and engines. CAA, unlike computational fluid dynamics (CFD), involves the accurate prediction of small-amplitude acoustic fluctuations and their correct propagation to the far field. In that respect, CAA poses significant challenges for researchers because the computational scheme should have high accuracy, good spectral resolution, and low dispersion and diffusion errors. A high-order compact finite difference scheme, which is implicit in space, can be used for such simulations because it fulfills the requirements for CAA. Usually, this method is parallelized using a transposition scheme; however, that approach has a high communication overhead. In this paper, we discuss the use of a parallel tridiagonal linear system solver based on the truncated SPIKE algorithm for reducing the communication overhead in our large eddy simulations. We report experimental results collected on two parallel computing platforms.
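
For orientation, the kind of system being solved here is tridiagonal. The sketch below is the standard sequential Thomas algorithm, not the truncated SPIKE solver the paper uses; it is included only to show the serial forward-elimination and back-substitution dependency that a parallel solver has to break up. The function name and argument layout are our own choices, and no pivoting is performed.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system sequentially (Thomas algorithm, no pivoting).

    lower[i] multiplies x[i-1], diag[i] multiplies x[i], upper[i] multiplies
    x[i+1]; lower[0] and upper[-1] are ignored.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination: each step depends on the previous one.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution, again strictly sequential.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# Example: the matrix with 2 on the diagonal and -1 off the diagonal.
x = thomas_solve([0, -1, -1, -1], [2, 2, 2, 2], [-1, -1, -1, 0], [0, 0, 0, 5])
print(x)
```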

2025, Periodicals of Engineering and Natural Sciences (PEN)

The trend toward miniaturizing semiconductor products down to nano-size transistors in integrated chips has motivated this work on the semiconductor package. Four fuzzy PID controller architectures based on type-2 FLC are developed: the Interval Type-2 Fuzzy Logic PID, the MOALO-based IT2FLC PID, the IT2FLC PI-PD, and the IT2FLC PI-PD MOALO controllers. These architectures are designed to overcome the inherent nonlinearity in X-Y table models and to accommodate parameter uncertainties and disturbances. The controllers aim to meet the desired position specification with minimum settling time (Ts), rise time (Tr), and overshoot, by minimizing oscillation and rejecting friction while tracking the desired position trajectory. The ant lion optimization (ALO) algorithm has efficiently solved optimization problems with few parameters and short execution time. Hence, the Multi-Objective Ant Lion Optimizer (MOALO) has been used to tune the gains of the proposed controllers so that the position trajectory meets the required specification. A comparison with related existing work shows lower values of the transient response measures Tr, Mp%, and Ts for the MOALO-based IT2FLC PID and IT2FLC PI-PD architectures. The highest Maximum Percentage of Enhancement in settling time is observed on both axes with the IT2FLC PI-PD architecture. Accordingly, the transient performance of the four architectures has been significantly improved, most noticeably in the response of the IT2FLC PI-PD architecture. The Maximum Percentage of Enhancement in the X-axis and Y-axis has improved more than eight-fold and six-fold, respectively, using the IT2FLC PI-PD architecture.

2025

Maria A. Murazzo, Maria Fabiana Piccoli, Nelson R. Rodriguez, Diego Medel, Jorge N. Mercado, Federico Sanchez, Ana Laura Molina, Martin Tello. Departamento de Informática – FCEFyN, UNSJ. Departamento de Informática – FCFMyN, UNSL. Departamento de Matemática – FI, UNSJ. Advanced student in the Licenciatura en Ciencias de la Computación degree program. Advanced student in the Licenciatura en Sistemas de Información degree program.

2025

Abstract. This work proposes a 2D-compatible implementation of the 2.5D version of Cannon's parallel matrix multiplication algorithm. The implementation uses a 2D distribution of the matrices on a 2.5D process grid. The goal is to evaluate the performance of this implementation against a previously designed 1D parallel algorithm. To that end, both were run on a homogeneous cluster of 8 nodes for a range of problem sizes. The results confirm that the new 2D-compatible alternative outperforms the 1D solution: it reduces execution time by at least 6% in every scenario studied.
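
As background, the classic 2D form of Cannon's algorithm can be simulated sequentially in a few lines. The sketch below is that textbook 2D variant, not the 2.5D implementation evaluated in the paper, and for brevity it assumes one matrix element per process on a q x q grid (a real implementation would hold a block per process and exchange blocks with message passing).

```python
def cannon_multiply(A, B):
    """Simulate classic 2D Cannon on a q x q grid, one element per process."""
    q = len(A)
    # Initial alignment: row i of A is shifted left by i positions,
    # column j of B is shifted up by j positions.
    a = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    b = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[0] * q for _ in range(q)]
    for _ in range(q):
        # Local multiply-accumulate on every (simulated) process.
        for i in range(q):
            for j in range(q):
                C[i][j] += a[i][j] * b[i][j]
        # Circular shift: A one step left along rows, B one step up along columns.
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C

print(cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

After the alignment step, process (i, j) holds A[i][k] and B[k][j] for the same k at every step, so q multiply-and-shift rounds accumulate the full inner product.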

2025

Abstract. Cloud computing is a computing model in which computational resources such as infrastructure, applications, software, or processing can be offered and consumed on demand as one more service on the Internet. This capability is achieved by abstracting the physical resources into a pool of virtualized resources that can be provisioned dynamically. For this reason, data and applications are being migrated to the cloud. However, one aspect to consider when migrating is the cost in terms of performance degradation. It is therefore necessary to analyze how resource virtualization affects the cloud; accordingly, the goal of this work is to analyze how much the performance of algorithms degrades when they run in the cloud.

2025

The computation of the transitive closure of a graph is a problem that was first considered in 1959. Many sequential algorithms have been proposed for it, and parallel algorithms have been considered since 1990. We present a parallel algorithm for computing the transitive closure of general graphs (directed or undirected). The proposed algorithm uses the BSP/CGM model and is based on breadth-first search (BFS). We present three parallel implementations based on the BFS algorithm: one in MPI, one in OpenMP, and one hybrid (MPI/OpenMP). To evaluate the efficiency of the proposed algorithm and of the implementations, experiments were carried out on graphs of varied sizes and characteristics. The computational complexity of the algorithm is O( n 2 p (n + m)).
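
A sequential sketch of BFS-based transitive closure is shown below. It is not the BSP/CGM algorithm from the paper; it only illustrates why the approach parallelizes naturally: each source vertex runs an independent BFS costing O(n + m), so the sources can be divided among processors. The adjacency-list representation and the function name are assumptions made for the example.

```python
from collections import deque

def transitive_closure(n, adj):
    """Return reach, where reach[s] is the set of vertices reachable from s
    by a path of at least one edge. adj[u] lists the successors of u."""
    reach = []
    for s in range(n):  # each iteration is independent of the others
        seen = set()
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        reach.append(seen)
    return reach

# Hypothetical directed graph: 3 -> 0 -> 1 -> 2
adj = [[1], [2], [], [0]]
print(transitive_closure(4, adj))
```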

2025

Existing concurrent priority queues do not allow the priority of an element to be updated after its insertion. As a result, algorithms that need this functionality, such as Dijkstra's single-source shortest path algorithm, resort to cumbersome and inefficient workarounds. We report on a heap-based concurrent priority queue that allows the priority of an element to be changed after its insertion. We show that the enriched interface allows Dijkstra's algorithm to be expressed in a more natural way, and that its implementation, using our concurrent priority queue, outperforms existing algorithms. A priority queue data structure maintains a collection (multiset) of items which are ordered according to a priority associated with each item. Priority queues are amongst the most useful data structures in practice, and can be found in a variety of applications ranging from graph algorithms [21, 5] to discrete event simulation [8] and modern SAT solvers [4]. The importance of priority queues has m...
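
One common workaround of the kind this abstract alludes to is "lazy deletion": instead of changing a priority, the algorithm pushes a duplicate entry and discards stale ones when they are popped. The sequential sketch below uses Python's `heapq`, which has no decrease-key operation; it illustrates the workaround only and is not the concurrent queue the paper proposes. The graph encoding is an assumption of the example.

```python
import heapq

def dijkstra_lazy(adj, source):
    """Single-source shortest paths without decrease-key.

    adj maps a node to a list of (neighbor, weight) pairs with weight >= 0.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale duplicate left behind by an earlier improvement
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                # No way to update v's existing entry, so push a new one.
                heapq.heappush(heap, (nd, v))
    return dist

graph = {"a": [("b", 4), ("c", 1)], "c": [("b", 2)], "b": []}
print(dijkstra_lazy(graph, "a"))
```

The cost of the workaround is that the heap can hold several entries per node, which is the inefficiency a queue with a priority-update operation avoids.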

2025, Asian Journal of Computer Science and Technology

In this paper, we review and discuss the challenges, advantages, and disadvantages that arise when a sequential program is converted into a parallel program. Converting existing sequential programs and algorithms into parallel ones faces limitations such as partitioning the task, sharing input and output data, dependencies of one subtask on the output of another, synchronizing the outputs of subprograms, moving a subprogram from a failed processor to an active one, communication among processors, and load balancing. Parallel programming routines, libraries, and management software have been implemented; these also have several limitations, which are discussed as well.

2025

Spatio-temporal databases have attracted great interest. However, there is little work on these databases in distributed environments. This paper describes two parallel strategies for the MVR-tree spatio-temporal access method on the CREW PRAM parallel model. Experiments were carried out to compare our strategies with a sequential approach. Preliminary results show significant savings both in the number of accessed nodes and in execution time in a CREW PRAM parallel environment.

2025, Proceedings of International Conference on Image Processing

Thesis: "On the modeling of the complete heat transfer problem (conductive, convective and radiative) that takes place in a copper refining furnace". (Awarded a 100/100 grade score).

2025

This note offers a critique of support for parallelism in Fortran 2008 based on co-arrays. We believe that there are some significant shortcomings in the current design of co-array features that affect their suitability for mapping onto a range of parallel systems, expressing a wide range of parallel applications, supporting the development of parallel libraries, and providing an extensible framework for developing sophisticated parallel applications. Based on these shortcomings, we believe that it is premature to recommend to the WG5 committee that the collection of co-array features described in Working Draft J3/07-007r3 be incorporated into the language standard without significant refinements.

2025, Journal of Offshore Mechanics and Arctic Engineering

Free-surface flows occur in several problems in hydrodynamics, such as fuel or water sloshing in tanks, waves breaking in ships, offshore platforms, harbors, and coastal areas. The computation of such highly nonlinear flows is challenging, since free-surfaces commonly present merging, fragmentation, and breaking parts, leading to the use of interface-capturing Eulerian approaches. In such methods the surface between two fluids is captured by the use of a marking function, which is transported in a flow field. In this work we discuss computational techniques for efficient implementation of 3D incompressible streamline-upwind/Petrov–Galerkin (SUPG)/pressure-stabilizing/Petrov–Galerkin finite element methods to cope with free-surface problems with the volume-of-fluid method (Elias, and Coutinho, 2007, “Stabilized Edge-Based Finite Element Simulation of Free-Surface Flows,” Int. J. Numer. Methods Fluids, 54, pp. 965–993). The pure advection equation for the scalar marking function was s...

2025, International Conference on Parallel Processing

Demands in computational power, particularly in the area of computational fluid dynamics (CFD), have led NASA Ames Research Center to study advanced computer architectures. ... based on research done by Jack B. Dennis at Massachusetts Institute of Technology. To improve understanding of this architecture, a static data flow simulator, written in Pascal, has been implemented for use on a Cray X-MP/48. ... two-dimensional fast Fourier transform (FFT), two algorithms used in CFD work at Ames Research Center, have been run on the simulator. ... factor of more than 2 depending on the partitioning method used to assign instructions to processing elements. Service time for matching tokens has proved to be a major bottleneck. ... the execution time. The best sustained MFLOPS rates were less than 50% of the maximum capability of the machine.

2025

Sparse LU factorization offers some potential for parallelism, but at a level of very fine granularity. However, most current distributed memory MIMD architectures have communication latencies that are too high to exploit all the available parallelism. To cope with this, latencies must be avoided by coarsening the granularity and by message fusion. However, both techniques limit the concurrency, thereby reducing the scalability. In this paper, an implementation of a parallel LU decomposition algorithm for linear programming bases is presented for distributed memory parallel computers with noticeable communication latencies. Several design decisions due to latencies, including data distribution and load balancing techniques, are discussed. An approximate performance model is set up for the algorithm, which makes it possible to quantify the impact of latencies on its performance. Finally, experimental results for an Intel iPSC/860 parallel computer are reported and discussed.

2025, arXiv (Cornell University)

2025, IEEE Transactions on Parallel and Distributed Systems

The Optical Transpose Interconnection System (OTIS) is a recently proposed model of computing that exploits the special features of both electronic and optical technologies. In this paper we present efficient algorithms for packet routing, sorting, and selection on the OTIS-Mesh.

2025

Two plausible ways to implement Floyd's all pairs shortest paths algorithm on a hypercube multiprocessor are considered. These are evaluated experimentally. A comparison with using Dijkstra's single source all destination algorithm on each processor is also done. Keywords: parallel algorithms, all pairs shortest paths problem, hypercube multiprocessors
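
For reference, a sequential sketch of Floyd's algorithm follows. It is not either of the hypercube mappings the paper evaluates. The point it illustrates is that, for a fixed intermediate vertex k, all (i, j) updates are independent once row k and column k are known, which is the property a parallel version relies on when it partitions the distance matrix.

```python
INF = float("inf")

def floyd_warshall(dist):
    """All pairs shortest paths on an n x n matrix; INF marks a missing edge."""
    n = len(dist)
    d = [row[:] for row in dist]  # work on a copy
    for k in range(n):            # the k loop is inherently sequential
        for i in range(n):        # for fixed k, the (i, j) updates are independent
            dik = d[i][k]
            for j in range(n):
                if dik + d[k][j] < d[i][j]:
                    d[i][j] = dik + d[k][j]
    return d

# Hypothetical 3-vertex directed graph.
w = [[0, 5, INF],
     [INF, 0, 2],
     [1, INF, 0]]
print(floyd_warshall(w))
```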

2025, IEEE Transactions on Parallel and Distributed Systems

We develop two algorithms to perform the q-step shrinking and expanding of an N×N binary image on a pyramid computer with an N×N base. The time complexity of both algorithms is O(√q). However, one uses O(√q) space per processor, while the per-processor space requirement of the other is O(1).

2025, Proceedings of the International Conference on Parallel Processing Workshops

This paper addresses the Microarray Gene Ordering problem. It consists in ordering a set of genes, grouping together the ones with similar behavior. This behavior can be measured as the gene's activity level across a number of measurements. The Gene Ordering problem belongs to the NP-hard class and has strong implications in genetic and medical areas. The method employed is a Memetic Algorithm, which is a variant of the well known Genetic Algorithms. The algorithm employs several features like population structure, problem-specific crossover and mutation operators, local search, and parallel processing. The instances utilized are extracted from the literature and represent real systems with 106 up to 979 genes. The algorithm has a superior performance, successfully grouping the genes. Moreover, in this paper we evaluate the impact of parallel processing in the performance of the algorithm, especially for the larger instances, which required more computational effort.

2025

We present different classes of solutions to the Firing Squad Synchronization Problem on networks of different shapes. The nodes are finite state processors that work in unison at discrete steps. The networks considered are the line, the ring and the square. For all of these models we have considered one-way and two-way communication modes and have also constrained the quantity of information that adjacent processors can exchange at each step. We are given a particular time expressed as a function f(n) of the number of nodes n of the network, and we present synchronization algorithms in time n^2, n log n, n√n, and 2^n. The solutions are presented as signals that are used as building blocks to compose new solutions for all times expressed by polynomials with nonnegative coefficients.

2025, Future Generation Computer Systems

2025, Information Processing Letters

The traditional zero-one principle for sorting networks states that "if a network with n input lines sorts all 2^n binary sequences into nondecreasing order, then it will sort any arbitrary sequence of n numbers into nondecreasing order". We generalize this to the situation when a network sorts almost all binary sequences and relate it to the behavior of the sorting network on arbitrary inputs. We also present an application to mesh sorting.
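
The classical form of the principle gives a finite test: to certify a comparator network on n lines, it is enough to try the 2^n binary inputs. The sketch below applies that exhaustive check to a well-known 5-comparator network on 4 lines. It demonstrates the traditional principle only, not the "almost all binary sequences" generalization of the paper; the list-of-pairs network encoding is our own.

```python
from itertools import product

def apply_network(comparators, values):
    """Run a comparator network; each (i, j) with i < j puts the smaller value on line i."""
    out = list(values)
    for i, j in comparators:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out

def sorts_all_binary(comparators, n):
    """Zero-one test: does the network sort every binary sequence of length n?"""
    return all(
        apply_network(comparators, bits) == sorted(bits)
        for bits in product((0, 1), repeat=n)
    )

network4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
print(sorts_all_binary(network4, 4))            # passes the zero-one test
print(apply_network(network4, [3, 1, 4, 2]))    # so it sorts arbitrary inputs too
print(sorts_all_binary(network4[:-1], 4))       # dropping a comparator breaks it
```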

2025, IEEE Transactions on Computers

Shear-sort opened new avenues in the research of sorting techniques for mesh-connected processor arrays. The algorithm is extremely simple and converges to a snake-like sorted sequence with a time complexity which is suboptimal by a logarithmic factor. The techniques used for analyzing shear-sort have been used to derive more efficient algorithms, which have important ramifications from both practical and theoretical viewpoints. Although the algorithms described apply to any general two-dimensional computational model, the focus of most discussions is on mesh-connected computers, which are now commercially available. In spite of a rich history of O(n) sorting algorithms on an n x n SIMD mesh, the constants associated with the leading term (i.e., n) are fairly large. This has led researchers to speculate about the tightness of the lower bound. The work in this paper sheds some more light on this problem, as a 4n-step algorithm is shown to exist for a model slightly more powerful than the conventional SIMD model. Moreover, this algorithm has a running time of 3n steps on the more powerful MIMD model, which is "truly" optimal for such a model. Index Terms: Distance bound, lower bound, mesh-connected network, parallel algorithm, sorting, time complexity, upper bound. Two-dimensional sorting is defined as the ordering of a rectangular array of numbers such that every element is routed to a distinct position of the array predetermined by some indexing scheme. Some of the standard indexing schemes are illustrated in Fig. . The simplest computational model onto which this problem can be mapped is the mesh-connected processor array (mesh for short). The simplicity of the interconnection pattern and the locality of communication make the mesh easy to build and program, and it was the basis of one of the earliest parallel computers (ILLIAC IV). Since then, more machines have been built on a much larger scale, including the MPP and the DAPP, using similar interconnection patterns. This simple architecture further motivates the idea of dealing with a given set of numbers as a rectangular array rather than as a linear sequence. More recently, Scherson [15] and Tseng et al. [22] have independently proposed a network which they call the orthogonal access architecture and the reduced-mesh network, respectively. It consists of p processors which are connected by a shared memory of p -q x p -q locations, where each ...
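
The basic shear-sort that this abstract builds on alternates row and column sorts. The sketch below is a sequential simulation of that textbook procedure, assuming the usual phase count of ceil(log2(rows)) row/column rounds followed by one final row phase; it is not the 4n-step or 3n-step algorithm derived in the paper. On a real mesh, every row (or column) would be sorted in parallel within a phase.

```python
import math

def shear_sort(grid):
    """Sort a rows x cols grid into snake-like (boustrophedon) order."""
    rows, cols = len(grid), len(grid[0])
    g = [row[:] for row in grid]

    def row_phase():
        # Even rows ascend left to right, odd rows descend (snake order).
        for r in range(rows):
            g[r].sort(reverse=(r % 2 == 1))

    def column_phase():
        # Every column ascends from top to bottom.
        for c in range(cols):
            col = sorted(g[r][c] for r in range(rows))
            for r in range(rows):
                g[r][c] = col[r]

    for _ in range(math.ceil(math.log2(rows))):
        row_phase()
        column_phase()
    row_phase()  # final row phase cleans up the last unsorted row
    return g

grid = [[16, 15, 14, 13], [12, 11, 10, 9], [8, 7, 6, 5], [4, 3, 2, 1]]
for row in shear_sort(grid):
    print(row)
```

Reading the result along the snake (left to right on even rows, right to left on odd rows) gives the values in nondecreasing order.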

2025, IEEE Transactions on Computers

2025, 2006 International Conference on Parallel Processing (ICPP'06)

2025

The paper describes the commercial aspects of developments in nanotechnology. We elucidate the implications of reliability modeling, multiscale modeling, and high performance computing (HPC) modeling for the commercialization of nano devices. Various modeling options are discussed that can aid the modeling of commercially viable nano products. A model is proposed that takes into account the major aspects of nanotech commercialization and is linked with other technical aspects as well as social factors. This work gives deeper insight into the issues involved and the business prospects of nanotechnology. The motivation is taken from our earlier work on reliability engineering, and the result shows that the theory of exponential models for the Survival Rate function, after proper ramification, can be applied to TTC (time to commercialize) modeling. In considering the commercial aspects of nanotechnology, the paper sheds light on nanodevice reliability and on multiscale and computer m...

2025, ACM Transactions on Mathematical Software

Special challenges exist in writing reliable numerical library software for heterogeneous computing environments. Although a lot of software for distributed-memory parallel computers has been written, porting this software to a network of workstations requires careful consideration. The symptoms of heterogeneous computing failures can range from erroneous results without warning to deadlock. Some of the problems are straightforward to solve, but for others the solutions are not so obvious, or incur an unacceptable overhead. Making software robust on heterogeneous systems often requires additional communication. We describe and illustrate the problems encountered during the development of ScaLAPACK and the NAG Numerical PVM Library. Where possible, we suggest ways to avoid potential pitfalls, or if that is not possible, we recommend that the software not be used on heterogeneous networks.

2025, journal of the college of basic education

Locating transmitters optimally in a radio network so as to guarantee a stipulated quality of service (QoS) is an NP-hard combinatorial problem. For a known area, choosing transmitter locations among the alternatives is tackled by a coarse-grained parallel genetic algorithm, which maximizes the coverage while reducing the number of transmitters used. In this paper, an effective local search operator is proposed, and the effect of the neighborhood topology is compared. Simulations on a dedicated cluster demonstrate that, compared with existing algorithms, the parallel GA greatly improves both solution quality and speed.

2025, International Conference on Telecommunications

Three parallel sorting algorithms have been implemented and compared in terms of their overall execution time. The algorithms implemented are the odd-even transposition sort, parallel merge sort and parallel rank sort. A homogeneous cluster of workstations has been used to compare the algorithms implemented. The MPI library has been selected to establish the communication and synchronization between the processors. The time complexity for each parallel sorting algorithm will also be mentioned and analyzed.
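
Of the three algorithms named here, odd-even transposition sort is the simplest to sketch. The version below is a sequential simulation with one value per simulated processor, not the MPI implementation measured in the paper; within each phase, all compare-exchange steps touch disjoint pairs, so on a cluster they could run concurrently with neighbor-to-neighbor messages.

```python
def odd_even_transposition_sort(values):
    """Sort by n alternating phases of neighbor compare-exchange."""
    a = list(values)
    n = len(a)
    for phase in range(n):
        # Even phases compare pairs (0,1), (2,3), ...; odd phases (1,2), (3,4), ...
        start = phase % 2
        for i in range(start, n - 1, 2):  # disjoint pairs: parallelizable
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([5, 1, 4, 2, 3]))
```

With n values and n processors, the n phases give O(n) parallel time, at the price of O(n^2) total comparisons.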

2025, IEE Proceedings - Computers and Digital Techniques

The paper describes a new interconnection network for massively parallel systems, referred to as star-connected cycles (SCC). The SCC graph presents an I/O-bounded structure that results in several advantages over variable-degree graphs like the star and the hypercube. The description of the SCC graph includes issues such as labelling of nodes, degree, diameter and symmetry. The paper also presents an optimal routeing algorithm for the SCC and efficient broadcasting algorithms with O(n) running time, with n being the dimensionality of the graph. A comparison with the cube-connected cycles (CCC) and other interconnection networks is included, indicating that, for even n, an n-SCC and a CCC of similar sizes have about the same diameter. In addition, it is shown that one-port broadcasting in an n-SCC graph can be accomplished with a running time better than or equal to that required by an n-star containing (n - 1) times fewer nodes.

2025, Proceedings of SPDP '96: 8th IEEE Symposium on Parallel and Distributed Processing

The star-connected cycles (SCC) graph was recently proposed as an attractive interconnection network for parallel processing, using a star graph to connect cycles of nodes. This paper presents an analytical solution for the problem of the average distance of the SCC graph. We divide the cost of a route in the SCC graph into three components, and show that one of these components is affected by the routing algorithm being used. Three routing algorithms for the SCC graph are presented, which respectively employ random, greedy and optimal routing rules. The computational complexities of the algorithms, and the average costs of the paths they produce, are compared. Finally, we discuss how source-based and distributed versions of the algorithms presented in this paper can be used in association with wormhole routing.

2025

This paper proposes the use of Stochastic Automata Networks (SAN) to develop models that can be efficiently applied to a large class of parallel implementations: master/slave (m/s) programs. We focus our technique on describing the communication between master and slave nodes, considering two standard behaviors: synchronous and asynchronous interactions. Although the SAN models may help the pre-analysis of implementations, the main contribution of this paper is to point out advantages and problems of the proposed modeling technique.

2025, Proceedings. 15th Symposium on Computer Architecture and High Performance Computing

This paper presents a theoretical study to evaluate the performance of a family of parallel implementations of the propagation algorithm. The propagation algorithm is applied to an image interpolation application. The theoretical performance analysis is based on the construction of generic models using the Stochastic Automata Networks (SAN) formalism to describe each implementation scheme. The prediction results can be compared to the achieved performance in some real test cases to verify the accuracy of our modeling technique. The main contribution of this paper is to point out the advantages and problems of our approach to the development of generic models of parallel implementations.

2025

One of the areas of greatest interest and development within computer science in recent years is parallel and distributed processing. Many real-world problems contain implicit parallelism, which can be exploited by distributing tasks that work cooperatively on different processors [Andr91] [Coff92]. The research and development topics are varied, and include multiprocessing-oriented architectures [Hwan93], problems of algorithm specification and verification [Fort85] [Hoar85], algorithmic optimization, etc. [Heer91]. The application domains of parallel processing are numerous. They include scientific and mathematical applications, system simulation, digital signal processing, image processing, computer vision, distributed databases, etc. [Laws92] [Huss91].

2025

The reason for recent focus on communication avoidance is that high rates of data movement become infeasible due to excessive power dissipation. However, shifting the responsibility of minimizing data movement to the parallel algorithm designer comes at significant costs to programmer's productivity, as well as: (i) reduced speedups and (ii) the risk of repelling application developers from adopting parallelism. The UMD Explicit Multi-Threading (XMT) framework has demonstrated advantages on ease of parallel programming through its support of PRAM-like programming, combined with strong, often unprecedented speedups. Such programming and speedups involve considerable data movement between processors and shared memory. Another reason that XMT is a good test case for a study of data movement is that XMT permits isolation and direct study of most of its data movement (and its power dissipation). Our new results demonstrate that an XMT single-chip many-core processor with tens of thousands of cores and a high throughput network on chip is thermally feasible, though at some cost. This leads to a perhaps game-changing outcome: instead of imposing upfront strict restrictions on data movement, as advocated in a recent report from the National Academies, opt for due diligence that accounts for the full impact on cost. For example, does the increased cost due to communication avoidance (including programmer's productivity, reduced speedups and desertion risk) indeed offset the cost of the solution we present? More specifically, we investigate in this paper the design of an XMT many-core for 3D VLSI with microfluidic cooling. We used state-of-the-art simulation tools to model the power and thermal properties of such an architecture with 8k to 64k lightweight cores, requiring between 2 and 8 silicon layers. Inter-chip communication using silicon compatible photonics is also considered. We found that, with the use of microfluidic cooling, power dissipation becomes a cost issue rather than a feasibility constraint. Robustness of the results is also discussed.

2025, Journal of Algorithms

We give an efficient deterministic parallel approximation algorithm for the minimum-weight vertex- and set-cover problems and their duals (edge/element packing). The algorithm is simple and suitable for distributed implementation. It fits no existing paradigm for fast, efficient parallel algorithms: it uses only "local" information at each step, yet is deterministic. (Generally, such algorithms have required randomization.) The result demonstrates that linear-programming primal-dual approximation techniques can lead to fast, efficient parallel algorithms. The presentation does not assume knowledge of such techniques.
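
For readers unfamiliar with primal-dual approximation, the sketch below shows the classical sequential primal-dual 2-approximation for weighted vertex cover. It is not the parallel algorithm of this paper; it only illustrates the general idea of raising edge (packing) variables until a vertex constraint becomes tight. The function name `primal_dual_vertex_cover` and the input layout are assumptions made for the example.

```python
def primal_dual_vertex_cover(weights, edges):
    """Sequential primal-dual 2-approximation for weighted vertex cover.

    weights: dict mapping vertex -> non-negative weight
    edges:   iterable of (u, v) pairs
    """
    # Residual weight = vertex weight minus the dual (edge packing) values
    # already charged to that vertex.
    residual = dict(weights)
    for u, v in edges:
        # Raise this edge's dual variable until one endpoint becomes tight.
        delta = min(residual[u], residual[v])
        residual[u] -= delta
        residual[v] -= delta
    # Tight vertices (residual 0) form the cover; every edge made at least
    # one of its endpoints tight when it was processed.
    return {v for v, r in residual.items() if r == 0}
```

Each unit of dual value is charged to at most two vertices, which is where the factor of 2 comes from.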

2025

Settling on a simple abstraction that programmers aim at, and hardware and software systems people enable and support, is an important step towards convergence to a robust many-core platform. The current paper: (i) advocates incorporating a quest for the simplest possible abstraction in the debate on the future of many-core computers, (ii) suggests "immediate concurrent execution (ICE)" as a new abstraction, and (iii) argues that an XMT architecture is one possible demonstration of ICE providing an easy-to-program general-purpose many-core platform.

2025

The sudden shift from single-processor computer systems to many-processor parallel computing systems requires reinventing much of Computer Science (CS): how to actually build and program the new parallel systems. CS urgently requires convergence to a robust parallel general-purpose platform that provides good performance and is easy to program. Unfortunately, this same objective has eluded decades of parallel computing research. Now, continued delays and uncertainty could start affecting important sectors of the economy. This paper advocates a minimalist stepping-stone: settle first on a simple abstraction that encapsulates the new interface between programmers, on one hand, and system builders, on the other hand. This paper also makes several concrete suggestions: (i) the Immediate Concurrent Execution (ICE) abstraction as a candidate for the new abstraction, and (ii) the Explicit Multi-Threaded (XMT) general-purpose parallel platform, under development at the University of Maryland, as a possible embodiment of ICE. ICE and XMT build on a formidable body of knowledge, known as PRAM (for parallel random-access machine, or model) algorithmics, and a latent, though not widespread, familiarity with it. Ease-of-programming, strong speedups and other attractive properties of the approach suggest that we may be much better prepared for the challenges ahead than many realize.

2025

Our earlier parallel algorithmics work on the parallel random-access-machine/model (PRAM) computation model led us to a PRAM-On-Chip vision: a comprehensive many-core system that can look to the programmer like the abstract PRAM model. We introduced the eXplicit Multi-Threaded (XMT) design and prototyped it in hardware and software. XMT comprises a programmer's workflow that advances from work-depth, a standard PRAM theory abstraction, to an XMT program, and, if desired, to its performance tuning. XMT provides strong performance for programs developed this way due to its hardware support of very fine-grained threads and the overhead of handling them. XMT has also shown unique promise when it comes to ease-of-programming, the biggest problem that has limited the impact of all parallel systems to date. For example, teachability of XMT programming has been demonstrated at various levels from rising 6th graders to graduate students, and students in a freshman class were able to program 3 parallel sorting algorithms. The main purpose of the current paper is to stimulate discussion on the following somewhat open-ended question. Now that we made significant progress on a system devoted to supporting PRAM-like programming, is it possible to incorporate our hardware support as an add-on into other current and future many-core systems? The paper considers a concrete proposal for doing that: recasting our work as a hardware-enhanced programmer's workflow "module" that can then be essentially imported into the other systems.

2025, International Journal of Mathematics and Computer Research

The paper uses the structure and math of Prime Generator Theory to show there are an infinity of twin primes, proving the Twin Prime Conjecture, as well as establishing the infinity of other k-tuples of primes.

2025, Proceedings International Symposium on Parallel Architectures, Algorithms and Networks. I-SPAN'02

We implemented four basic cellular automata (CA) algorithms on a Beowulf cluster with 8 processors. The CA algorithms are, namely, (1) Game of Life, (2) Greenberg-Hastings, (3) Cyclic Space, and (4) Hodgepodge Machine.
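
To make the first of these automata concrete, here is a minimal single-process sketch of one synchronous Game of Life update. It is not the cluster implementation described in the paper; the wrap-around (toroidal) boundary and the function name `life_step` are assumptions chosen for the example.

```python
def life_step(grid):
    """Return the next generation of a 0/1 grid (toroidal boundary assumed)."""
    rows, cols = len(grid), len(grid[0])
    nxt = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count the eight neighbours, wrapping around the edges.
            n = sum(
                grid[(r + dr) % rows][(c + dc) % cols]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            # Birth on exactly 3 neighbours; survival on 2 or 3.
            nxt[r][c] = 1 if n == 3 or (grid[r][c] == 1 and n == 2) else 0
    return nxt
```

Because each cell depends only on its immediate neighbours in the previous generation, a cluster version can give each processor a band of rows and exchange only the boundary rows between generations.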

2025, Turkish Journal of Electrical Engineering and Computer Sciences

The advent of high-performance computing via many-core processors and distributed processing emphasizes the possibility for exhaustive search by multiple search agents. Despite the existence of elegant algorithms for solving complex problems, exhaustive search has retained its significance since many real-life problems exhibit no regular structure and exhaustive search is the only possible solution. Here we analyze the performance of exhaustive search when it is conducted by multiple search agents. Several strategies for joint search with parallel agents are evaluated. We discover that the performance of the search improves with the increase in the level of mutual help between agents. The same search performance can be achieved with homogeneous and heterogeneous search agents provided that the lengths of subregions allocated to individual search agents follow the differences in the speeds of heterogeneous search agents. We also demonstrate how to achieve the optimum search performance by means of increasing the dimensions of the search region.
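
The allocation rule for heterogeneous agents can be illustrated with a small sketch. It assumes a one-dimensional search region and constant agent speeds, and it is only one reading of the abstract, not the authors' code; the function name `split_region` is hypothetical.

```python
def split_region(length, speeds):
    """Split [0, length) into contiguous subregions proportional to agent speed."""
    total = sum(speeds)
    bounds = []
    start = 0.0
    for s in speeds:
        # A faster agent receives a proportionally longer subregion, so every
        # agent needs the same time (length / total) to sweep its share.
        end = start + length * s / total
        bounds.append((start, end))
        start = end
    return bounds
```

For example, `split_region(100, [1, 1, 2])` gives the fastest agent half of the region, and each agent then finishes its subregion after 25 time units.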