Parallel Algorithms Research Papers - Academia.edu
It is explicated that the human brain functions as a ‘radio’ capable of projecting thought and emotional frequencies that may be categorized as operating on two distinct bands: the low frequency band (similar to the AM band) and the high frequency band (analogous to the FM band). The low frequency band concentrates on the affairs indigenous to the physical time-space continuum, which are spearheaded by the active (physical) conscious mind and piloted by the ‘rational’ brain. This is referred to as the ‘ego mind’, focused on a single reality, the one we identify our SELF with, the life sojourn we are currently consciously experiencing and experimenting with. The high frequency band, on the other hand, operates in a much broader sense through the sub-conscious mind, which is of ethereal nature, traversing other dimensions of reality that may be classified as ‘parallel’ or ‘alternate,’ comprised of a multitude of consciousness states ordinarily not perceived by the physical brain. It is clarified that such parallel or alternate continuums of time and space are just as ‘real’ as the one we actively perceive with our ‘logical’ brain. In fact, we traverse and experience these dimensions of consciousness every night in the dream state, when we are able to ‘turn off’ our physical consciousness and switch on our sub-conscious mind. These dimensional realities or alternate continuums of time and space are part of the holographic software programs of consciousness that run simultaneously with the one in our physical reality in what is referred to as the NOW moment. It is elucidated that we are each part of a family of ‘soul aspects’ governed by a central higher-dimensional consciousness named our HIGHERSELF or soul. And although the central soul or HIGHERSELF is simultaneously conscious of all our life sojourns in all the parallel and alternate continuums of time and space, we purposefully remain unaware of our other concurrent ‘soul aspects’ in the interest of obtaining objective and unbiased life experiences. This condition also suits our limited capacity as biological entities to process all our life-experience data, owing to our low light quotient (natural frequency). Once our natural frequency is uplifted so as to meet the threshold required for obtaining ‘full consciousness,’ we come to the realization that we have the power to enter any hologram of time and space and willfully improvise and live our lives based upon any historical character reflected in the ‘Akashic Records’ that we wish (e.g. Mozart, Da Vinci, Tesla, etc.). This way, we can go into the ‘past’ and create new parallel timelines from then on, adding to the repertory of already existing Akashic Records. In this regard, the distinction among past, present, and future becomes ambiguous at best, with everything transpiring within the same moment of ‘NOW.’
Frequent itemset mining is a classic problem in data mining. It is an unsupervised process concerned with finding frequent patterns (or itemsets) hidden in large volumes of data in order to produce compact summaries or models of the database. These models are typically used to generate association rules, but recently they have also been used in far-reaching domains like e-commerce and bio-informatics. Because databases are increasing in terms of both dimension (number of attributes) and size (number of records), one of the main issues for a frequent itemset mining algorithm is the ability to analyze very large databases. Sequential algorithms do not have this ability, especially in terms of run-time performance, for such very large databases. Therefore, we must rely on high performance parallel and distributed computing. We present new parallel algorithms for frequent itemset mining. Their efficiency is proven through a series of experiments on different parallel environments that range from shared-memory multiprocessor machines to SMP clusters connected through a high-speed network.
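The kernel that such algorithms parallelize is support counting over a partitioned transaction database (count distribution). Below is a minimal sketch of that kernel in C with OpenMP; it is not taken from the paper, and the transactions, candidates, and 32-item bitmask encoding are hypothetical simplifications.

```c
/* Sketch: parallel support counting for candidate itemsets (count
 * distribution).  Not the paper's algorithm; items are limited to 32
 * so a transaction fits in one unsigned bitmask.  Data is made up. */
#include <stdio.h>
#include <omp.h>

#define NTRANS 8
#define NCAND  3

int main(void) {
    /* bit i set in a mask means "this transaction contains item i" */
    unsigned trans[NTRANS] = {0x3, 0x7, 0x6, 0x5, 0x3, 0x7, 0x1, 0x6};
    unsigned cand[NCAND]   = {0x3, 0x6, 0x5};   /* candidate itemsets */
    long     supp[NCAND]   = {0};

    /* Partition the database across threads; each thread accumulates
     * local counts that the reduction clause merges at the end. */
    #pragma omp parallel for reduction(+:supp[:NCAND])
    for (int t = 0; t < NTRANS; t++)
        for (int c = 0; c < NCAND; c++)
            if ((trans[t] & cand[c]) == cand[c])  /* itemset in transaction */
                supp[c]++;

    for (int c = 0; c < NCAND; c++)
        printf("candidate 0x%x: support %ld\n", cand[c], supp[c]);
    return 0;
}
```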
A Parallel Genetic Algorithm for Rule Discovery in Large Databases. Dieferson Luis Alves de Araujo, Heitor S. Lopes, Alex A. Freitas. CEFET-PR, Centro Federal de Educação Tecnológica do Paraná, CPGEI, Curso ...
A coloring of a graph G is an assignment of colors to its vertices so that no two adjacent vertices have the same color. We study the problem of coloring permutation graphs using certain properties of the lattice representation of a permutation and relationships between permutations, directed acyclic graphs and rooted trees having specific key properties. We propose an efficient parallel algorithm which colors an n-node permutation graph in O(log² n) time using O(n²/log n) processors on the CREW PRAM model. Specifically, given a permutation π we construct a tree T*[π], which we call a coloring-permutation tree, using certain combinatorial properties of π. We show that the problem of coloring a permutation graph is equivalent to finding vertex levels in the coloring-permutation tree.
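For contrast with the parallel method, a permutation graph can be colored optimally in a purely sequential way by greedily partitioning π into a minimum number of increasing subsequences: permutation graphs are perfect, so the pile count, which equals the length of the longest decreasing subsequence, is the chromatic number. A minimal sketch with a made-up permutation (not the paper's algorithm):

```c
/* Sketch: optimal sequential coloring of a permutation graph.
 * Vertices i<j are adjacent iff pi[i] > pi[j]; each color class must
 * therefore be an increasing subsequence.  First-fit patience-style
 * greedy uses exactly LDS(pi) colors, the chromatic number. */
#include <stdio.h>

int main(void) {
    int pi[] = {4, 1, 5, 3, 2, 6};            /* example permutation */
    int n = sizeof pi / sizeof *pi;
    int tail[6];                               /* last value per color */
    int color[6], ncolors = 0;

    for (int i = 0; i < n; i++) {
        int c = 0;
        while (c < ncolors && tail[c] > pi[i]) /* first extendable pile */
            c++;
        if (c == ncolors) ncolors++;           /* open a new color class */
        tail[c] = pi[i];
        color[i] = c;
    }
    for (int i = 0; i < n; i++)
        printf("vertex %d (pi=%d): color %d\n", i, pi[i], color[i]);
    printf("colors used: %d\n", ncolors);
    return 0;
}
```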
Many natural structures can be represented by complex networks. Discovering network motifs, which are overrepresented patterns of inter-connections, is a computationally hard task related to graph isomorphism. Sequential methods are hindered by an exponential growth in execution time when we increase the size of motifs and networks. In this article we study the opportunities for parallelism in existing methods and propose new parallel strategies that adapt and extend one of the most efficient serial methods known ...
V. SUMMARY We have given parallel algorithms for recognizing and parsing context-free languages on a hypercube of p PEs, 1 ≤ p ≤ n. The algorithms are both time-wise and space-wise optimal with respect to the most efficient sequential algorithm. The recognition algorithms were ...
Shared-address-space multiprocessors are effective vehicles for speeding up visualization and image synthesis algorithms. This article demonstrates excellent parallel speedups on some well-known sequential algorithms. Several recent algorithms have substantially sped up complex and time-consuming visualization tasks. In particular, novel algorithms for radiosity computation and volume rendering have demonstrated performance far superior to earlier methods. Despite these advances, visualization of complex scenes or data sets remains computationally expensive. Rendering a 256 x 256 x 256-voxel volume data set takes about 5 seconds per frame on a 100-MHz Silicon Graphics Indigo workstation using Levoy's ray-casting algorithm and about a second per frame using a new shear-warp algorithm. These times are much larger than the 0.03 second per frame required for real-time rendering or the 0.1 second per frame required for interactive rendering. Realistic radiosity and ray-tracing computations are much more time-consuming.
A Prüfer code of a labeled free tree with n nodes is a sequence of length n − 2 constructed by the following sequential process: for i ranging from 1 to n − 2, insert the label of the neighbor of the smallest remaining leaf into the ith position of the sequence, and then delete the leaf. Prüfer codes provide an alternative to the usual representation of trees. We present an optimal O(log n) time, n/log n processor EREW-PRAM algorithm for determining the Prüfer code of an n-node labeled chain and an O(log n) time, n processor EREW-PRAM algorithm for constructing the Prüfer code of an n-node labeled free tree. This resolves an open question posed by Wang et al.
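The sequential process is easy to state in code. The sketch below is the classic linear-time serial construction (not the paper's PRAM algorithm), run on a small hypothetical tree given as parent pointers rooted at the largest label.

```c
/* Sketch: classic O(n) sequential Pruefer code construction, for
 * reference only.  The tree below is a made-up example. */
#include <stdio.h>

#define N 6   /* number of nodes, labels 1..N; node N is the root */

int parent[N + 1] = {0, 4, 4, 4, 6, 4, 0};  /* edges: 1-4,2-4,3-4,4-6,5-4 */

int main(void) {
    int degree[N + 1] = {0};
    int code[N - 2];

    for (int v = 1; v < N; v++) {   /* each non-root has one parent edge */
        degree[v]++;
        degree[parent[v]]++;
    }

    /* ptr sweeps the labels once; leaf follows chains of fresh leaves,
     * which keeps the whole construction linear. */
    int ptr = 1;
    while (degree[ptr] != 1) ptr++;
    int leaf = ptr;

    for (int i = 0; i < N - 2; i++) {
        int next = parent[leaf];
        code[i] = next;                   /* record the removed leaf's neighbor */
        if (--degree[next] == 1 && next < ptr) {
            leaf = next;                  /* next just became the smallest leaf */
        } else {
            ptr++;
            while (degree[ptr] != 1) ptr++;
            leaf = ptr;
        }
    }

    for (int i = 0; i < N - 2; i++) printf("%d ", code[i]);
    printf("\n");
    return 0;
}
```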
Most scientific data analyses involve analyzing voluminous data collected from various instruments. Efficient parallel/concurrent algorithms and frameworks are the key to meeting the scalability and performance requirements entailed in such scientific data analyses. The recently introduced MapReduce technique has gained a lot of attention from the scientific community for its applicability in large parallel data analyses. Although there are many evaluations of the MapReduce technique using large textual data collections, there have been only a few evaluations for scientific data analyses. The goals of this paper are twofold. First, we present our experience in applying the MapReduce technique to two scientific data analyses: (i) High Energy Physics data analyses; (ii) K-means clustering. Second, we present CGL-MapReduce, a streaming-based MapReduce implementation, and compare its performance with Hadoop.
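To make the map/reduce structure of the second benchmark concrete, here is one K-means iteration written in that style with OpenMP; this is a shared-memory sketch with made-up 1-D points, not CGL-MapReduce or Hadoop.

```c
/* Sketch: one K-means iteration, map/reduce-style.  "Map" assigns each
 * point to its nearest centroid; "reduce" merges per-thread partial
 * sums into new centroids.  Hypothetical 1-D data. */
#include <stdio.h>
#include <math.h>
#include <omp.h>

#define NPTS 8
#define K    2

int main(void) {
    double x[NPTS] = {1.0, 1.2, 0.8, 1.1, 8.0, 8.3, 7.9, 8.1};
    double c[K] = {0.0, 5.0};            /* initial centroids */
    double sum[K] = {0}; long cnt[K] = {0};

    #pragma omp parallel for reduction(+:sum[:K]) reduction(+:cnt[:K])
    for (int i = 0; i < NPTS; i++) {     /* map: nearest-centroid search */
        int best = 0;
        for (int k = 1; k < K; k++)
            if (fabs(x[i] - c[k]) < fabs(x[i] - c[best])) best = k;
        sum[best] += x[i];               /* local partial sums, merged  */
        cnt[best]++;                     /* by the reduction clauses    */
    }
    for (int k = 0; k < K; k++) {        /* reduce: recompute centroids */
        if (cnt[k]) c[k] = sum[k] / cnt[k];
        printf("centroid %d: %.3f (%ld points)\n", k, c[k], cnt[k]);
    }
    return 0;
}
```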
The irregular shape packing problem is a combinatorial optimization problem that consists of arranging items in a container in such a way that no item overlaps another. In this paper we adopt a solution that places the items sequentially, touching the already placed items or the container. To place a new item without overlaps, the collision-free region for the new item is robustly computed using non-manifold Boolean operations. A simulated annealing algorithm controls the sequence of placement and each item's position and orientation. In this work, placement occurs at the vertices of the collision-free region. Several results with benchmark datasets obtained from the literature are reported; some of them are the best yet reported in the literature. To improve the computational cost of the algorithm, a parallelization method to determine the collision-free region is proposed. We demonstrate two possible algorithms to compute the collision-free region, only one of which can be parallelized. The results show that the parallelized version is better than the sequential approach only for datasets with a very large number of items. The computational cost of the non-manifold Boolean operation algorithm is strongly dependent on the number of vertices of the original polygons.
In this paper, a two-phase filter for removing "salt and pepper" noise is proposed. In the first phase, an adaptive median filter is used to identify the set of noisy pixels; in the second phase, these pixels are restored according to a regularization method which contains a data-fidelity term reflecting the impulse noise characteristics. The algorithm, which exhibits good performance both in denoising and in restoration, can be easily and effectively parallelized to exploit the full power of multi-core CPUs and GPGPUs; the proposed implementation, based on the FastFlow library, achieves both close-to-ideal speedup and very good wall-clock execution figures.
This paper presents the first experimental results on the use of our new adaptive tool for synchronization, based on ordered read-write locks (ORWL). These locks provide a new synchronization method for data-oriented parallel algorithms and are particularly suited for iterative pipelined algorithms with out-of-core data. We conducted experiments with the classic Livermore Kernel 23 benchmark to validate the theoretical model and measure the efficiency of the first available implementation of ORWL in the parXXL library. They show that this tool is able to efficiently control an I/O-bound application running on 64 parallel POSIX threads with tight data dependencies between them.
One of the most significant challenges in computing the determinant of rectangular matrices is the high time complexity of the algorithm. Among all definitions of the determinant of rectangular matrices, the definition used here has special features which make it more notable. But under this definition, C(n,m) submatrices of order m×m need to be generated, which puts this problem in the NP-hard class. On the other hand, row or column reduction operations can hardly reduce the volume of calculation. Therefore, in this paper we present a parallel algorithm which can decrease the time complexity of computing the determinant of non-square matrices to O(n²).
An approach to testing the sensor information fusion Kalman filter is proposed. It is based on the introduced statistic of the mathematical expectation of the spectral norm of a normalized innovation matrix. The approach allows for a simultaneous test of the mathematical expectation and the variance of the innovation sequence in real time and does not require a priori information on the values of the change in its statistical characteristics under faults. Using this approach, a fault detection algorithm for the sensor information fusion Kalman filter is developed.
This paper presents a parallel implementation of the geodesic distance transform using OpenMP. We show how a sequential chamfer distance algorithm can be executed on parallel processing units with shared memory, such as multiple cores on a modern CPU. Experimental results show that a speedup of 2.6 times on a quad-core machine can be achieved without loss in accuracy. This work forms part of a C implementation for geodesic superpixel segmentation of natural images.
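The serial starting point is the classic two-pass chamfer distance transform, sketched below on a tiny hypothetical binary image with 3/4 chamfer weights. This is the baseline the paper parallelizes, not its OpenMP version, and it is the plain (non-geodesic) transform.

```c
/* Sketch: two-pass chamfer distance transform (3/4 weights) on a
 * made-up seed image.  Forward scan propagates from top-left
 * neighbours, backward scan from bottom-right neighbours. */
#include <stdio.h>

#define W 8
#define H 6
#define INF 1000000

int main(void) {
    int d[H][W];
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) d[y][x] = INF;
    d[2][3] = 0;                          /* single foreground seed */

    for (int y = 0; y < H; y++)           /* forward raster scan */
        for (int x = 0; x < W; x++) {
            if (x > 0            && d[y][x-1]   + 3 < d[y][x]) d[y][x] = d[y][x-1]   + 3;
            if (y > 0            && d[y-1][x]   + 3 < d[y][x]) d[y][x] = d[y-1][x]   + 3;
            if (x > 0   && y > 0 && d[y-1][x-1] + 4 < d[y][x]) d[y][x] = d[y-1][x-1] + 4;
            if (x < W-1 && y > 0 && d[y-1][x+1] + 4 < d[y][x]) d[y][x] = d[y-1][x+1] + 4;
        }
    for (int y = H - 1; y >= 0; y--)      /* backward raster scan */
        for (int x = W - 1; x >= 0; x--) {
            if (x < W-1            && d[y][x+1]   + 3 < d[y][x]) d[y][x] = d[y][x+1]   + 3;
            if (y < H-1            && d[y+1][x]   + 3 < d[y][x]) d[y][x] = d[y+1][x]   + 3;
            if (x < W-1 && y < H-1 && d[y+1][x+1] + 4 < d[y][x]) d[y][x] = d[y+1][x+1] + 4;
            if (x > 0   && y < H-1 && d[y+1][x-1] + 4 < d[y][x]) d[y][x] = d[y+1][x-1] + 4;
        }

    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) printf("%4d", d[y][x]);
        printf("\n");
    }
    return 0;
}
```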
This note presents a new algorithm for computing the product of two elements in a finite field F by means of sums and products in a fixed subfield K of F (e.g. F = GF(2^m) and K = GF(2)). The algorithm is based on a normal basis representation of fields and assumes that the dimension m of F over K is a highly composite number. A very fast parallel implementation and a considerable reduction in the number of computations are possible, in comparison with some methods discussed in the literature.
We present parallel shared-memory algorithms for counting the number of partitions of a given integer N, where the partitions may be subject to restrictions, such as being composed of distinct parts, of a given number of parts, and/or of parts belonging to a specified set. We show that this can be done in polylogarithmic parallel time, although the algorithm requires an excessive number of processors. We also present more practical algorithms that run in time O(√N (log N)²) but use much fewer processors. The technique used in these algorithms can be used to obtain adaptive, optimal algorithms for the case when a limited number of processors is available. Parallel logarithmic-time algorithms that generate partitions uniformly at random, using the quantities computed by the counting algorithms, are also presented.
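The quantities being parallelized are classical partition counts, which a short sequential dynamic program already computes. The sketch below counts both unrestricted partitions and partitions into distinct parts for a small N; it is for orientation only, not the paper's parallel algorithm.

```c
/* Sketch: sequential DP for partition counts.  p[n] counts all
 * partitions of n; q[n] counts partitions into distinct parts.  The
 * loops are the standard unbounded and 0/1 counting recurrences. */
#include <stdio.h>

#define N 20

int main(void) {
    long long p[N + 1] = {1};   /* p[0] = 1, rest 0 */
    long long q[N + 1] = {1};

    for (int part = 1; part <= N; part++) {
        for (int n = part; n <= N; n++)      /* unbounded: any multiplicity */
            p[n] += p[n - part];
        for (int n = N; n >= part; n--)      /* 0/1: each part at most once */
            q[n] += q[n - part];
    }
    printf("p(%d) = %lld, distinct-part count q(%d) = %lld\n", N, p[N], N, q[N]);
    return 0;
}
```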
A number of parallel formulations of the dense matrix multiplication algorithm have been developed. For an arbitrarily large number of processors, any of these algorithms or their variants can provide near-linear speedup for sufficiently large matrix sizes, and none of the algorithms can be clearly claimed to be superior to the others. In this paper we analyze the performance and scalability of a number of parallel formulations of the matrix multiplication algorithm and predict the conditions under which each formulation is better than the others.
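The formulations analyzed in the paper target distributed-memory machines. As a minimal shared-memory reference point, the sketch below parallelizes the rows of C with a single OpenMP directive; the example data is hypothetical and this is not one of the paper's formulations.

```c
/* Sketch: simplest shared-memory parallel dense matrix multiply.
 * Rows of C are independent, so one parallel-for suffices; the ikj
 * loop order streams over rows of B and C. */
#include <stdio.h>
#include <omp.h>

#define N 4

int main(void) {
    double A[N][N], B[N][N], C[N][N] = {{0}};
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = i + j;
            B[i][j] = (i == j);   /* identity, so C should equal A */
        }

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%6.1f", C[i][j]);
        printf("\n");
    }
    return 0;
}
```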
Recommender systems aim to personalize the shopping experience of a user by suggesting related products, or products that are found to be in the general interests of the user. The information available for users and products is heterogeneous, and many systems use one or several of these sources. The available information includes the user's interaction history with products and categories, textual information about the products, a hierarchical classification of the products into a taxonomy, user interests based on a questionnaire, the demographics of a user, inferred interests based on product reviews given by a user, interests based on the physical location of a user, and so on. "Taxonomy discovery for personalized recommendation", published in 2014, uses the first three information sources: the user's interaction history, textual information about the products and, optionally, an existing taxonomy of the products. In this paper, we describe a parallel implementation o...
Structured parallel programming is recognised as a viable and effective means of tackling parallel programming problems. Recently, a set of simple and powerful parallel building blocks (RISC-pb²l) has been proposed to support modelling and implementation of parallel frameworks. In this work we demonstrate how that same parallel building block set may be used to implement both general purpose parallel programming abstractions, not usually listed in classical skeleton sets, and more specialized domain-specific parallel patterns. We discuss the associated implementation techniques and present experimental evidence of the feasibility and efficiency of the approach. Keywords: algorithmic skeleton, parallel design patterns, programming frameworks, RISC-pb²l, parallel building blocks.
This paper discusses opportunities to parallelize graph-based path planning algorithms in a time-varying environment. Parallel architectures have become commonplace, requiring algorithms to be parallelized for efficient execution. An additional focal point of this paper is the inclusion of inaccuracies in path planning as a result of forecast error variance, accuracy of calculation in the cost functions, and a different observed vehicle speed in the real mission than planned. In this context, robust path planning algorithms are described. These algorithms are equally applicable to land-based, aerial, or underwater mobile autonomous systems. The results presented here provide the basis for a future research project in which the parallelized algorithms will be evaluated on multi- and many-core systems such as the dual-core ARM Panda board and the 48-core Single-chip Cloud Computer (SCC). Modern multi- and many-core processors support a wide range of performance vs. energy tradeoffs th...
Parallel algorithms for solving geometric problems on two array processor models, the mesh-connected computer (MCC) and a two-dimensional systolic array, are presented. We illustrate a recursive divide-and-conquer paradigm for MCC algorithms by presenting a time-optimal solution for the problem of finding the nearest neighbors of a set of planar points represented by their Cartesian coordinates. The algorithm executes on a √n × √n MCC and requires an optimal O(√n) time. An algorithm for constructing the convex hull of a set of planar points and an update algorithm for the disk placement problem on an n^(2/3) × n^(2/3) two-dimensional systolic array are presented. Both these algorithms require O(n^(2/3)) time steps. The advantage of the systolic solutions lies in their suitability for direct hardware implementation.
This paper presents a new approach for automatically pipelining sequential circuits. The approach repeatedly extracts a computation from the critical path, moves it into a new stage, then uses speculation to generate a stream of values that keep the pipeline full. The newly generated circuit retains enough state to recover from incorrect speculations by flushing the incorrect values from the pipeline, restoring the correct state, then restarting the computation.
The k-way merging problem is to produce a single sorted array as output from k sorted arrays given as input. In this paper, we consider the elements of the k sorted arrays to be data records, where the value of the key for each record is a serial number. The problem is used to design efficient external sorting algorithms. We propose two optimal parallel algorithms for k-way merging. The first merges k sorted arrays of n records into a new sorted array of length n. The second merges k sorted arrays of n records into a new sorted array of length n+o(n), which is called padded merging. The running time of each algorithm is O(log n) and O(1) under the EREW and CRCW PRAM models, respectively.
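Sequentially, k-way merging is usually done with a min-heap of the k current heads, as in the sketch below; this is a serial reference with made-up arrays, not the paper's EREW/CRCW PRAM algorithms.

```c
/* Sketch: sequential heap-based k-way merge of k sorted arrays. */
#include <stdio.h>

#define K 3

typedef struct { int key, src; } Node;

static Node heap[K];
static int hn = 0;

static void push(Node v) {                       /* sift-up insert */
    int i = hn++;
    while (i && heap[(i-1)/2].key > v.key) { heap[i] = heap[(i-1)/2]; i = (i-1)/2; }
    heap[i] = v;
}
static Node pop(void) {                          /* sift-down remove-min */
    Node top = heap[0], last = heap[--hn];
    int i = 0;
    for (;;) {
        int c = 2*i + 1;
        if (c >= hn) break;
        if (c + 1 < hn && heap[c+1].key < heap[c].key) c++;
        if (heap[c].key >= last.key) break;
        heap[i] = heap[c]; i = c;
    }
    heap[i] = last;
    return top;
}

int main(void) {
    int a[K][4] = {{1,5,9,13}, {2,6,10,14}, {3,7,11,15}};
    int pos[K] = {0};

    for (int s = 0; s < K; s++) push((Node){a[s][0], s});
    while (hn) {
        Node t = pop();                          /* emit global minimum */
        printf("%d ", t.key);
        if (++pos[t.src] < 4)                    /* refill from its source */
            push((Node){a[t.src][pos[t.src]], t.src});
    }
    printf("\n");
    return 0;
}
```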
An efficient method for computing the discrete cosine transform (DCT) is proposed. Based on a direct decomposition of the DCT, the recursive properties of the DCT for even-length input sequences are derived, which is a generalization of the radix-2 DCT algorithm. Based on this recursive property, a new DCT algorithm for even-length sequences is obtained. The proposed algorithm is very structured and requires fewer computations when compared with others. The regular structure of the proposed algorithm is suitable for fast parallel algorithms and VLSI implementation.
In this paper, we address parallel machine scheduling problems with the objective of minimizing the maximum weighted absolute lateness. Memetic algorithms are applied to solve this problem. The proposed method is compared with genetic algorithms and heuristics on randomly generated test problems. The results show that the memetic algorithm outperforms the others.
On multicore architectures, the ratio of peak memory bandwidth to peak floating-point performance (byte:flop ratio) is decreasing as core counts increase, further limiting the performance of bandwidth limited applications. Multiplying a sparse matrix (as well as its transpose in the unsymmetric case) with a dense vector is the core of sparse iterative methods. In this paper, we present a new multithreaded algorithm for the symmetric case which potentially cuts the bandwidth requirements in half while exposing lots of parallelism in practice. We also give a new data structure transformation, called bitmasked register blocks, which promises significant reductions on bandwidth requirements by reducing the number of indexing elements without introducing additional fill-in zeros. Our work shows how to incorporate this transformation into existing parallel algorithms (both symmetric and unsymmetric) without limiting their parallel scalability. Experimental results indicate that the combined benefits of bitmasked register blocks and the new symmetric algorithm can be as high as a factor of 3.5x in multicore performance over an already scalable parallel approach. We also provide a model that accurately predicts the performance of the new methods, showing that even larger performance gains are expected in future multicore systems as current trends (decreasing byte:flop ratio and larger sparse matrices) continue.
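The bandwidth saving comes from storing each symmetric entry once and using it twice. The sequential sketch below shows that idea on a hypothetical 4x4 CSR matrix; the cross-row update it contains is exactly what a naive parallel loop would race on, and what the paper's multithreaded algorithm is designed to handle. The code is not the paper's algorithm.

```c
/* Sketch: symmetric SpMV from the upper triangle only.  Each stored
 * a_ij updates both y_i and y_j, roughly halving the data read; the
 * y[j] update is the race a parallel version must resolve. */
#include <stdio.h>

#define N 4

int main(void) {
    /* Upper triangle (incl. diagonal) of a symmetric matrix in CSR:
     *  [ 4 1 0 2 ]
     *  [ 1 3 0 0 ]
     *  [ 0 0 5 1 ]
     *  [ 2 0 1 6 ]                                                */
    int    rowptr[N + 1] = {0, 3, 4, 6, 7};
    int    col[]         = {0, 1, 3, 1, 2, 3, 3};
    double val[]         = {4, 1, 2, 3, 5, 1, 6};
    double x[N] = {1, 2, 3, 4}, y[N] = {0};

    for (int i = 0; i < N; i++)
        for (int p = rowptr[i]; p < rowptr[i + 1]; p++) {
            int j = col[p];
            y[i] += val[p] * x[j];
            if (j != i) y[j] += val[p] * x[i];   /* mirrored lower entry */
        }

    for (int i = 0; i < N; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```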
An algorithm based on Wang's recurrent neural network and the WTA ("Winner takes all") principle is applied to the construction of Hamiltonian cycles in graphs of distributed computer systems (CSs). The algorithm is used for: 1) regular graphs (2D- and 3D-tori, and hypercubes) of distributed CSs and 2) 2D-tori disturbed by removing an arbitrary edge. The neural network parameters for the construction of Hamiltonian cycles and suboptimal cycles with a length close to that of Hamiltonian ones are determined. Our experiments show that the iterative method (Jacobi, Gauss-Seidel, or SOR) used for solving the system of differential equations describing the neural network strongly affects the process of cycle construction and depends on the number of torus nodes.
This paper focuses on the use of Network-on-Chip (NoC) accelerators for Barnes-Hut N-body simulations. A NoC-based architecture is proposed to solve the communication bottleneck of processors with hundreds or even thousands of cores. An N-body simulation approximates the evolution of a system of bodies, e.g. an astrophysical system where each body represents a star or a galaxy. Although the behaviour of the Barnes-Hut algorithm has been studied on conventional multicore systems, graphics processing units and other accelerators, we explore key performance issues in the context of a NoC platform. We investigate serial and parallel implementations, where the parallel version is analyzed in terms of network traffic. The results revealed that hot-spot and bursty traffic can congest the network, while long-distance communication deteriorates system performance further. We propose algorithmic and interconnection optimizations. These include improved data locality, proper mapping and a partially diagonal network. Evaluation results show that, compared with the original implementation, the average execution time and energy-delay product are reduced by 25.3% and 31.6% respectively. The proposed design achieved a 55.4x speed-up over 64 threads.
Cloud computing is used to utilize shared resources to the maximum possible extent. These resources are shared by more than one user at a time as per their need. This improves computing power for using various applications at various places, with the additional advantage of cost savings on storage space and electricity consumption. Virtualization, one of the important technologies of cloud computing, means acting virtually instead of actually doing something. Through virtualization software, system resources are shared across multiple environments. This paper mainly focuses on an introduction to virtualization, the advantages and disadvantages of virtualization, and types of virtualization such as server, desktop, application, programming language, storage, and network virtualization. Server virtualization involves partitioning a main server into different virtual servers. Desktop virtualization, jointly with application virtualization, is used to provide a desktop environment management system. Applications under application virtualization are executed on a different operating system as if executing on their original operating system. Storage virtualization makes data from multiple network storage devices available as one single storage unit. Network virtualization is the process of combining hardware and software network resources into a single virtual network through software. In addition, the paper provides future research directions in virtualization.
In this work, we have implemented the classic algorithms to compute the eigenvalues of an n x n symmetric matrix A that is sparse and whose order surpasses the thousands, which makes it necessary to use an efficient structure for the handling of sparse matrices and to adapt classic numerical methods to this structure; the algorithms obtain all the eigenvalues and, where possible, the eigenvectors of the matrix. Parallelism is used in the implementation of the algorithms in order to reduce execution times. Francis's QR and QL algorithms use similarity transformations to convert the matrix into diagonal form, so the eigenvalues are preserved in each iteration; they are also robust methods for computing eigenvalues and their associated eigenvectors. In this context, the study is carried out on modern supercomputers that can execute more than one instruction and process multiple data simultaneously. Together with the UCSparseLib library of the Universidad de Carabobo, which already provides the necessary and efficient structures for the handling of sparse matrices, we seek to improve the execution time of the serial algorithm by applying multithreading with the OpenMP library.
Debugging large-scale parallel applications is challenging. In most HPC applications, parallel tasks progress in a coordinated fashion, and thus a fault in one task can quickly propagate to other tasks, making the application difficult to debug. Finding the least-progressed tasks can significantly reduce the effort to identify the task where the fault originated. However, existing approaches for detecting them suffer from low accuracy and large overheads; either they use imprecise static analysis or they are unable to infer progress dependence inside loops. We present a loop-aware progress-dependence analysis tool, Prodometer, which determines relative progress among parallel tasks via dynamic analysis. Our fault-injection experiments suggest that its accuracy and precision are over 90% for most cases and that it scales well up to 16,384 MPI tasks. Further, our case study shows that it significantly helped in diagnosing a perplexing error in MPI, which only manifested at large scale.
In today’s network-based cloud computing era, software applications play a big role. The security of these software applications is paramount to their successful use. These applications utilize cryptographic algorithms to secure the data over the network through encryption and decryption processes. The use of parallel processors is now common in both mobile and cloud computing scenarios. Cryptographic algorithms are compute intensive and can significantly benefit from parallelism. This paper introduces a parallel approach to the symmetric stream cipher security algorithm known as RC4A, which is one of the strong variants of RC4. We present an efficient parallel implementation of the compute-intensive pseudo-random generation algorithm (PRGA) portion of the RC4A algorithm; the resulting algorithm is named PARC4-I. We have added some functionality in terms of lookup tables: the modified algorithm has four lookup tables instead of two and is capable of returning four distinct output bytes at each iteration. Further, with the help of a parallel additive stream cipher structure and the loop unrolling method, encryption/decryption is done on a multi-core machine. Finally, the results show that PARC4-I is a time-efficient algorithm.
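For reference, the sketch below is the classic single-table RC4 (the KSA plus the PRGA that PARC4-I parallelizes); RC4A and PARC4-I add further state tables and output lanes that are not reproduced here.

```c
/* Sketch: baseline RC4 key-scheduling (KSA) and keystream generation
 * (PRGA), for orientation only.  Key and output length are made up. */
#include <stdio.h>

static unsigned char S[256];

static void swap_bytes(int a, int b) { unsigned char t = S[a]; S[a] = S[b]; S[b] = t; }

int main(void) {
    const unsigned char key[] = "Key";
    int klen = 3;

    for (int i = 0; i < 256; i++) S[i] = (unsigned char)i;
    for (int i = 0, j = 0; i < 256; i++) {        /* KSA */
        j = (j + S[i] + key[i % klen]) & 0xff;
        swap_bytes(i, j);
    }

    int i = 0, j = 0;                             /* PRGA */
    for (int n = 0; n < 8; n++) {                 /* first 8 keystream bytes */
        i = (i + 1) & 0xff;
        j = (j + S[i]) & 0xff;
        swap_bytes(i, j);
        printf("%02x ", S[(S[i] + S[j]) & 0xff]);
    }
    printf("\n");
    return 0;
}
```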
primes-utils is a Rubygem which provides a suite of utility methods to list and count primes over ranges, factor numbers, find the nth prime, and test primality. This handbook explains the use of Prime Generators, which serve as the mathematical foundation for most of the methods, and provides the Ruby source code for the gem.
A grid computing environment provides a type of distributed computation that is unique because it is not centrally managed and it has the capability to connect heterogeneous resources. A grid system provides location-independent access to the resources and services of geographically distributed machines. An essential ingredient for supporting location-independent computations is the ability to discover resources that have been requested by the users. Because the number of grid users can increase and the grid environment is continuously changing, a scheduler that can discover decentralized resources is needed. Grid resource scheduling is considered to be a complicated, NP-hard problem because of the distribution of resources, the changing conditions of resources, and the unreliability of infrastructure communication. Various artificial intelligence algorithms have been proposed for scheduling tasks in a computational grid. This paper uses the imperialist competition algorithm (ICA) to address the problem of independent task scheduling in a grid environment, with the aim of reducing the makespan. Experimental results compare ICA with other algorithms and illustrate that ICA finds a shorter makespan relative to the others. Moreover, it converges quickly, finding its optimum solution in less time than the other algorithms.
In the Information Technology era, information plays a vital role in every sphere of human life. It is very important to gather data from different data sources, store and maintain the data, generate information and knowledge, and disseminate data, information, and knowledge to every stakeholder. Due to the vast use of computers and electronic devices and the tremendous growth in computing power and storage capacity, there has been explosive growth in data collection. Storing the data in a data warehouse enables an entire enterprise to access a reliable, current database. Analyzing this vast amount of data and drawing fruitful conclusions and inferences requires special tools, called data mining tools. This paper gives an overview of data mining systems and some of their applications.
The widespread usage of the discrete wavelet transform (DWT) has motivated the development of fast DWT algorithms and their tuning on all sorts of computer systems. Several studies have compared the performance of the most popular schemes, known as Filter Bank Scheme (FBS) and Lifting Scheme (LS), and have always concluded that LS is the most efficient option. However, there is no such study on streaming processors such as modern Graphics Processing Units (GPUs). Current trends have transformed these devices into powerful stream processors with enough flexibility to perform intensive and complex floating-point calculations. The opportunities opened up by these platforms, as well as the growing popularity of the DWT within the computer graphics field, make a new performance comparison of great practical interest. Our study indicates that FBS outperforms LS in current-generation GPUs. In our experiments, the actual FBS gains range between 10 percent and 140 percent, depending on the problem size and the type and length of the wavelet filter. Moreover, design trends suggest higher gains in future-generation GPUs.
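As a concrete reference point for the two schemes being compared, here is one level of a 1-D Haar transform in its lifting form (predict, then update) as a plain CPU sketch with made-up samples; the paper's GPU kernels and longer wavelet filters are not reproduced.

```c
/* Sketch: one level of the 1-D Haar DWT in lifting form.  The detail
 * d[i] predicts the odd sample from the even one; the update makes
 * s[i] the pairwise average (the coarse approximation). */
#include <stdio.h>

#define N 8   /* even length */

int main(void) {
    double x[N] = {2, 4, 6, 8, 10, 12, 14, 16};
    double s[N / 2], d[N / 2];

    for (int i = 0; i < N / 2; i++) {
        d[i] = x[2*i + 1] - x[2*i];   /* predict step: detail */
        s[i] = x[2*i] + d[i] / 2;     /* update step: average */
    }

    for (int i = 0; i < N / 2; i++)
        printf("s[%d]=%5.2f  d[%d]=%5.2f\n", i, s[i], i, d[i]);
    return 0;
}
```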
The matrix transpose operation (MT) is used frequently in many multimedia and high performance applications. Therefore, using a faster MT operation results in a shorter execution time for these applications. In this paper, we propose two new MT algorithms. The algorithms exploit diagonal register properties to achieve a linear-time execution of the MT operation using a vector processor that supports diagonal registers. We present the algorithms along with proofs, examples, and various enhancements. A performance evaluation shows that the proposed algorithms are at least twice as fast as one of the leading MT algorithms, such as an algorithm implemented using Motorola's AltiVec architecture. We believe that our work opens new doors to improving the execution time of many two-dimensional operations such as DCT, DFT, and Shearsort.
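The diagonal-register algorithms require special vector hardware, so they are not reproduced here. For contrast, this is the standard cache-blocked transpose used on conventional processors, with a hypothetical 8x8 matrix and block size.

```c
/* Sketch: cache-blocked matrix transpose, a conventional baseline
 * rather than the paper's diagonal-register method. */
#include <stdio.h>

#define N 8
#define B 4   /* block edge: tune to cache line / register tile size */

int main(void) {
    int a[N][N], t[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) a[i][j] = i * N + j;

    for (int ii = 0; ii < N; ii += B)        /* walk tile by tile so    */
        for (int jj = 0; jj < N; jj += B)    /* both arrays stay cached */
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    t[j][i] = a[i][j];

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%4d", t[i][j]);
        printf("\n");
    }
    return 0;
}
```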
Computer graphics and animation has become a key technology in determining future research and development activities in many academic and industrial branches. The aim of this journal is to be an international peer-reviewed open access forum for scientific and technical presentations and discussions of the latest advances in computer graphics and animation.
A new parallel distributed algorithm for Golomb ruler derivation is presented. This algorithm was used to prove computationally the optimality of three rulers. Two of these were previously proven but as yet unpublished, and the authors' independent derivation confirmed those results. The last ruler, of 19 marks and size 246, was known to be near-optimal and was computationally proven optimal in this work.
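The underlying computation is an exhaustive search with pruning on already-measured distances; the paper's contribution is distributing that search across machines. Below is a tiny sequential backtracking sketch for a small mark count, showing only the core pruning, not the distributed algorithm.

```c
/* Sketch: sequential backtracking search for a shortest Golomb ruler
 * with M marks.  dist[] marks distances already measured, which gives
 * the pruning; M = 5 is small enough to run instantly. */
#include <stdio.h>

#define M 5
#define LMAX 64

static int mark[M], best = LMAX;
static char dist[LMAX + 1];          /* distances already measured */

static void search(int k) {
    if (k == M) {                    /* complete ruler: record improvement */
        best = mark[M - 1];
        printf("ruler of length %d:", best);
        for (int i = 0; i < M; i++) printf(" %d", mark[i]);
        printf("\n");
        return;
    }
    for (int pos = mark[k - 1] + 1; pos < best; pos++) {
        int ok = 1;
        for (int i = 0; i < k; i++)          /* all new distances fresh? */
            if (dist[pos - mark[i]]) { ok = 0; break; }
        if (!ok) continue;
        for (int i = 0; i < k; i++) dist[pos - mark[i]] = 1;
        mark[k] = pos;
        search(k + 1);
        for (int i = 0; i < k; i++) dist[pos - mark[i]] = 0;  /* undo */
    }
}

int main(void) {
    mark[0] = 0;
    search(1);
    printf("optimal length: %d\n", best);
    return 0;
}
```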