An optimal and processor efficient parallel sorting algorithm on a linear array with a reconfigurable pipelined bus system
Related papers
Parallel merge sort for distributed memory architectures
1994
Cole presented a parallel merge sort for the PRAM model that runs in O(log n) parallel steps using n processors. He gave an algorithm for the CREW PRAM model for which the constant in the running time is small. He also gave a more complex version of the algorithm for the ...
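This O(log n)-time, n-processor bound is what makes the result work-optimal; as a quick check (my arithmetic, not text from the abstract), the total work is

\[
W = p \cdot T = n \cdot O(\log n) = O(n \log n),
\]

which matches the Ω(n log n) comparison-sort lower bound, so no comparison-based algorithm can do asymptotically better in total work.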
Sorting with linear speedup on a pipelined hypercube
IEEE Transactions on Computers, 1992
A distributed-memory parallel architecture known as a pipelined hypercube is introduced. A parallel sorting algorithm on this model is presented. Using p processors, n data items can be sorted with time complexity O((n log n)/p + log² p), including both communication and computation costs. The algorithm achieves linear speedup for 1 ≤ p ≤ n/log n.
Index Terms: conflict-free routing, hypercube, linear speedup, merge sort, parallel algorithm, pipelining, sorting.
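The linear-speedup range follows directly from the reconstructed bound (a quick derivation of my own, not quoted from the abstract):

\[
S(p) = \frac{\Theta(n \log n)}{\Theta\!\left(\dfrac{n \log n}{p} + \log^2 p\right)} = \Theta(p)
\quad\text{whenever}\quad
\log^2 p = O\!\left(\frac{n \log n}{p}\right),
\]

and at p = n/log n both terms are Θ(log² n), so the condition holds throughout 1 ≤ p ≤ n/log n.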
Overlapping Computations, Communications and I/O in Parallel Sorting
Journal of Parallel and Distributed Computing, 1995
In this paper we present a new parallel sorting algorithm which maximizes the overlap between the disk, network, and CPU subsystems of a processing node. This algorithm is shown to be of similar complexity to known efficient sorting algorithms. The pipelining effect exploited by our algorithm should lead to higher levels of performance on distributed memory parallel processors. In order to achieve the best results using this strategy, the CPU, network and disk operations must take comparable time. We suggest acceptable levels of system balance for sorting machines and analyze the performance of the sorting algorithm as system parameters vary.
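The overlap idea can be pictured as a three-stage pipeline; below is a minimal single-node sketch using Python threads and queues (the stage names, block sizes, and trivial per-stage work functions are invented for illustration, not taken from the paper):

import threading, queue

def stage(inbox, outbox, work):
    # One subsystem (disk, CPU, or network) per stage: while the CPU
    # sorts block i, the disk can read block i+1 and the network can
    # ship block i-1, which is the overlap the paper exploits.
    while True:
        block = inbox.get()
        if block is None:                   # poison pill: drain and stop
            if outbox is not None:
                outbox.put(None)
            return
        result = work(block)
        if outbox is not None:
            outbox.put(result)

read_q, sort_q = queue.Queue(2), queue.Queue(2)   # bounded queues cap memory
cpu = threading.Thread(target=stage, args=(read_q, sort_q, sorted))
net = threading.Thread(target=stage, args=(sort_q, None, lambda b: b))
cpu.start(); net.start()
for block in ([3, 1, 2], [9, 7, 8], [5, 4, 6]):   # the 'disk' stage, inlined
    read_q.put(block)
read_q.put(None)
cpu.join(); net.join()

The pipeline runs at the speed of its slowest stage, which is exactly why the paper argues that CPU, network, and disk times should be comparable.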
High-performance sorting algorithms for the CRAY T3D parallel computer
The Journal of Supercomputing, 1997
In this paper we study the sorting performance of a 128-processor CRAY T3D and discuss the efficient use of the toroidal network connecting the processors. The problems we consider range from that of sorting one word per processor to sorting the entire memory of the machine, and we give efficient algorithms for each case. In addition, we give both algorithms that make assumptions about the distribution of the data and those that make no assumptions. The clear winner, if data can be assumed to be uniformly distributed, is a method that we call a hash-and-chain sort. The time for this algorithm to sort one million words per processor over 64 processors is less than two seconds, which compares favorably to about four seconds using a 4-processor CRAY C90 and about 17 seconds using a 64-processor Thinking Machines CM-5.
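The abstract does not detail the hash-and-chain sort, but under the stated uniformity assumption it is presumably a bucket-style method: hash each key to a bucket by its value, chain collisions, then sort the short chains. A minimal single-node sketch (the bucket count and key range are my assumptions):

def hash_and_chain_sort(keys, lo, hi, buckets=1024):
    # Bucket keys by value (the 'hash'), chain collisions in lists,
    # then sort each small chain; with keys roughly uniform on
    # [lo, hi), chains stay short and total work is near-linear.
    chains = [[] for _ in range(buckets)]
    width = (hi - lo) / buckets
    for k in keys:
        chains[min(int((k - lo) / width), buckets - 1)].append(k)
    out = []
    for chain in chains:
        chain.sort()
        out.extend(chain)
    return out

# e.g. hash_and_chain_sort([0.42, 0.07, 0.93, 0.55], 0.0, 1.0)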
Modular Design of High-Throughput, Low-Latency Sorting Units
High-throughput and low-latency sorting is a key requirement in many applications that deal with large amounts of data. This paper presents efficient techniques for designing high-throughput, low-latency sorting units. Our sorting architectures utilize modular design techniques that hierarchically construct large sorting units from smaller building blocks. The sorting units are optimized for situations in which only the M largest numbers from N inputs are needed, because this situation commonly occurs in many applications for scientific computing, data mining, network processing, digital signal processing, and high-energy physics. We utilize our proposed techniques to design parameterized, pipelined, and modular sorting units. A detailed analysis of these sorting units indicates that, as the number of inputs increases, their resource requirements scale linearly, their latencies scale logarithmically, and their frequencies remain almost constant. When synthesized to a 65-nm TSMC technology, a pipelined 256-to-4 sorting unit with 19 stages can perform more than 2.7 billion sorts per second with a latency of about 7 ns per sort. We also propose iterative sorting techniques, in which a small sorting unit is used several times to find the largest values.
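As a software analogue of the iterative technique (my sketch; the paper's designs are hardware sorting networks, not Python), a small fixed-size sorter can be reused across the stream to keep only the M largest values:

def largest_m(stream, m, block=8):
    # Reuse a small sorter of size m + block across the whole input,
    # mirroring the paper's iterative M-from-N selection.
    best, buf = [], []
    for x in stream:
        buf.append(x)
        if len(buf) == block:
            best = sorted(best + buf, reverse=True)[:m]  # the small 'unit'
            buf = []
    if buf:
        best = sorted(best + buf, reverse=True)[:m]
    return best

# e.g. largest_m(range(100), m=4) -> [99, 98, 97, 96]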
The effect of local sort on parallel sorting algorithms
Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing
We show the importance of sequential sorting in the context of in-memory parallel sorting of large data sets of 64-bit keys. First, we analyze several sequential strategies like Straight Insertion, Quick sort, Radix sort, and CC-Radix sort. As a consequence of the analysis, we propose a new algorithm that we call Sequential Counting Split Radix sort, or SCS-Radix sort. SCS-Radix sort combines some of the algorithms analyzed with other new ideas. It makes three important contributions: first, it saves work by detecting data skew dynamically; second, it exploits the memory hierarchy; third, its execution time is stable across data sets with different characteristics. We evaluate the use of SCS-Radix sort in the context of a parallel sorting algorithm on an SGI Origin 2000. The parallel algorithm is from 1.2 to 45 times faster using SCS-Radix sort than using Radix sort or Quick sort.
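The counting-split idea rests on the per-digit histogram that an LSD radix sort already computes; here is a sketch of that baseline (the digit width is my assumption, and the real SCS-Radix adds skew detection and cache-conscious splitting on top of the counting pass):

def counting_radix_sort(keys, key_bits=64, digit_bits=8):
    # LSD radix sort with an explicit counting pass per digit; the
    # 'counts' histogram is the information a counting-split scheme
    # can inspect to detect skew and size its partitions.
    radix, mask = 1 << digit_bits, (1 << digit_bits) - 1
    for shift in range(0, key_bits, digit_bits):
        counts = [0] * radix
        for k in keys:                       # counting pass
            counts[(k >> shift) & mask] += 1
        pos, total = [0] * radix, 0
        for d in range(radix):               # prefix sums = split points
            pos[d], total = total, total + counts[d]
        out = [0] * len(keys)
        for k in keys:                       # stable scatter by digit
            d = (k >> shift) & mask
            out[pos[d]] = k
            pos[d] += 1
        keys = out
    return keys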
Parallel sorting with cooperating heaps in a linear array of processors
Parallel Computing, 1990
A parallel sorting algorithm using cooperating heaps in a linear array of processors is presented. It can sort a sequence whose length is much larger than the number of processors. Because the output begins one step after all the items have been input, sorting n items requires 2n + 1 steps. Two independent modifications of the algorithm are possible; one tries to reduce the number of processors used, and the other can sort more items on the same array.
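A sequential toy model of the cooperating-heaps idea (a sketch of the ordering invariant only, with invented cell capacities; the paper's systolic protocol overlaps input, eviction, and output to reach the 2n + 1 step count):

import heapq

def linear_heap_sort(items, cells, cap):
    # Items enter at cell 0; an overfull cell evicts its largest key
    # to its right neighbour, so every key in cell i stays <= every
    # key in cell i+1 and the cells can be drained left to right.
    heaps = [[] for _ in range(cells)]
    for x in items:                          # input phase
        heapq.heappush(heaps[0], x)
        for i in range(cells - 1):           # overflow ripples right
            if len(heaps[i]) > cap:
                big = max(heaps[i])
                heaps[i].remove(big)
                heapq.heapify(heaps[i])
                heapq.heappush(heaps[i + 1], big)
    out = []
    for h in heaps:                          # output phase
        out.extend(sorted(h))
    return out

# e.g. linear_heap_sort([5, 1, 4, 2, 3], cells=3, cap=2) -> [1, 2, 3, 4, 5]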
A new deterministic parallel sorting algorithm with an experimental evaluation
Journal of Experimental Algorithmics, 1998
We introduce a new deterministic parallel sorting algorithm for distributed memory machines based on the regular sampling approach. The algorithm uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, our algorithm efficiently handles the presence of duplicate values without the overhead of tagging each element with a unique identifier. This algorithm was implemented in SPLIT-C and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2-WN, and the Cray Research T3D. We ran our code using widely different benchmarks to examine the dependence of our algorithm on the input distribution. Our experimental results illustrate the efficiency and scalability of our algorithm across different platforms. In fact, the performance compares closely to that of our random sample sort algorithm, which seems to outperform all similar algorithms known to the authors on these platforms. Together, their performance is nearly invariant over the set of input distributions, unlike previous efficient algorithms. However, unlike our randomized sorting algorithm, the performance and memory requirements of our regular sorting algorithm can be deterministically guaranteed.

Our algorithm is a novel variation on the approach of sorting by regular sampling that achieves optimal computational speedup with very little communication. It exchanges the single step of irregular communication used by previous implementations for two steps of regular communication. In return, it mitigates the problem of poor load balancing because it is able to sustain a high sampling rate at substantially less cost, and it achieves predictable, regular communication requirements which are essentially invariant with respect to the input distribution. Utilizing regular communication has become more important with the advent of message passing standards, such as MPI [16], which seek to guarantee the availability of very efficient (often machine-specific) implementations of certain basic collective communication routines. Our experimental results are consistent with the theoretical analysis. The high-level language used in our studies is SPLIT-C [10], an extension of C for distributed memory machines. The algorithm makes use of MPI-like communication primitives but does not make any assumptions as to how these primitives are actually implemented.
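For orientation, here is a minimal single-process simulation of the classic sorting-by-regular-sampling scheme that the paper builds on (this is the generic PSRS pattern, not the authors' improved two-round variant):

import bisect

def regular_sample_sort(data, p):
    # 1) each 'processor' sorts its block and takes p regular samples;
    # 2) the gathered samples yield p-1 splitters;
    # 3) blocks are partitioned by splitter and exchanged (the
    #    all-to-all personalized step); 4) each bucket is sorted locally.
    blocks = [sorted(data[i::p]) for i in range(p)]
    samples = sorted(s for b in blocks for s in b[::max(1, len(b) // p)][:p])
    splitters = samples[p - 1::p][:p - 1]
    buckets = [[] for _ in range(p)]
    for b in blocks:
        cuts = [0] + [bisect.bisect_right(b, s) for s in splitters] + [len(b)]
        for j in range(p):
            buckets[j].extend(b[cuts[j]:cuts[j + 1]])
    return [x for bucket in buckets for x in sorted(bucket)]

# e.g. regular_sample_sort([5, 3, 8, 1, 9, 2, 7, 4, 6, 0], p=2)

The regular samples keep the buckets balanced for any input distribution, which is the property the authors strengthen with their second round of regular communication.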
The basic data transport is a read or write operation. The remote read and write typically have both blocking and non-blocking versions. Also, when reading or writing more than a single element, bulk data transports are provided with corresponding bulk read and bulk write primitives.

Our collective communication primitives, described in detail in [4], are similar to those of MPI [16], the IBM POWERparallel [6], and the Cray MPP systems [9] and, for example, include the following: transpose, bcast, gather, and scatter. Brief descriptions of these are as follows. The transpose primitive is an all-to-all personalized communication in which each processor has to send a unique block of data to every processor, and all the blocks are of the same size. The bcast primitive is used to copy a block of data from a single source to all the other processors. The primitives gather and scatter are companion primitives: scatter divides a single array residing on a processor into equal-sized blocks, each of which is distributed to a unique processor, and gather coalesces these blocks back into a single array at a particular processor. See [3, 4, 5] for algorithmic details, performance analyses, and empirical results for these communication primitives.

The organization of this paper is as follows. Section 2 presents our computation model for analyzing parallel algorithms. Section 3 describes in detail our improved sample sort algorithm. Finally, Section 4 describes our data sets and the experimental performance of our sorting algorithm.
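To make the primitive semantics concrete, here is a plain-Python model of the four collectives over an explicit list of per-processor blocks (a didactic model only; as the authors note, the algorithm makes no assumptions about how the primitives are actually implemented):

def transpose(blocks):
    # All-to-all personalized: processor i sends blocks[i][j] to j.
    p = len(blocks)
    return [[blocks[i][j] for i in range(p)] for j in range(p)]

def bcast(root_block, p):
    # Copy one processor's block to all p processors.
    return [list(root_block) for _ in range(p)]

def scatter(array, p):
    # Split one processor's array into p equal-sized blocks.
    n = len(array) // p
    return [array[i * n:(i + 1) * n] for i in range(p)]

def gather(blocks):
    # Companion of scatter: coalesce the blocks back into one array.
    return [x for b in blocks for x in b]

# e.g. transpose([['a0', 'a1'], ['b0', 'b1']]) == [['a0', 'b0'], ['a1', 'b1']]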
Sorting on a mesh-connected parallel computer
Communications of the ACM, 1977
Two algorithms are presented for sorting n² elements on an n × n mesh-connected processor array that require O(n) routing and comparison steps. The best previous algorithm takes time O(n log n). The algorithms of this paper are shown to be optimal in time within small constant factors. Extensions to higher-dimensional arrays are also given.
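The optimality claim follows from a standard diameter argument (my one-line check, not quoted from the abstract): a value may have to travel between opposite corners of the array, so

\[
T \;\ge\; \operatorname{diam}(n \times n \text{ mesh}) \;=\; 2(n - 1) \;=\; \Omega(n),
\]

and the O(n)-step algorithms are therefore optimal up to a constant factor.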