Implementations of a model of physical sorting

Modular Design of High-Throughput, Low-Latency Sorting Units

High-throughput and low-latency sorting is a key requirement in many applications that deal with large amounts of data. This paper presents efficient techniques for designing high-throughput, low-latency sorting units. Our sorting architectures utilize modular design techniques that hierarchically construct large sorting units from smaller building blocks. The sorting units are optimized for situations in which only the M largest numbers from N inputs are needed, because this situation commonly occurs in many applications for scientific computing, data mining, network processing, digital signal processing, and high-energy physics. We utilize our proposed techniques to design parameterized, pipelined, and modular sorting units. A detailed analysis of these sorting units indicates that as the number of inputs increases their resource requirements scale linearly, their latencies scale logarithmically, and their frequencies remain almost constant. When synthesized to a 65-nm TSMC technology, a pipelined 256-to-4 sorting unit with 19 stages can perform more than 2.7 billion sorts per second with a latency of about 7 ns per sort. We also propose iterative sorting techniques, in which a small sorting unit is used several times to find the largest values.
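
The M-from-N structure described above can be illustrated in software: small leaf sorters feed merge blocks that keep only the M largest values, mirroring the hierarchical construction. The sketch below is a hypothetical behavioral model in Python (the names leaf_sort, merge_top_m and top_m_of_n are illustrative), not the paper's hardware design.

```python
# Hypothetical software model of an M-from-N selection tree: small leaf
# sorters feed merge blocks that keep only the M largest values, mirroring
# the hierarchical construction described in the abstract (not the paper's RTL).

def leaf_sort(block, m):
    """Sort a small block descending and keep at most the M largest values."""
    return sorted(block, reverse=True)[:m]

def merge_top_m(a, b, m):
    """Merge two descending lists, keeping only the M largest values."""
    out, i, j = [], 0, 0
    while len(out) < m and (i < len(a) or j < len(b)):
        if j >= len(b) or (i < len(a) and a[i] >= b[j]):
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out

def top_m_of_n(values, m, leaf_size=4):
    """Hierarchically select the M largest of N inputs."""
    level = [leaf_sort(values[k:k + leaf_size], m)
             for k in range(0, len(values), leaf_size)]
    while len(level) > 1:
        level = [merge_top_m(level[k], level[k + 1], m) if k + 1 < len(level)
                 else level[k]
                 for k in range(0, len(level), 2)]
    return level[0]

if __name__ == "__main__":
    import random
    data = [random.randrange(1000) for _ in range(256)]
    assert top_m_of_n(data, 4) == sorted(data, reverse=True)[:4]
```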

Parallel sorting in two-dimensional VLSI models of computation

IEEE Transactions on Computers, 1989

Shear-sort opened new avenues in the research of sorting techniques for mesh-connected processor arrays. The algorithm is extremely simple and converges to a snake-like sorted sequence with a time complexity which is suboptimal by a logarithmic factor. The techniques used for analyzing shear-sort have been used to derive more efficient algorithms, which have important ramifications from both practical and theoretical viewpoints. Although the algorithms described apply to any general two-dimensional computational model, the focus of most discussions is on mesh-connected computers, which are now commercially available. In spite of a rich history of O(n) sorting algorithms on an n x n SIMD mesh, the constants associated with the leading term (i.e., n) are fairly large. This has led researchers to speculate about the tightness of the lower bound. The work in this paper sheds some more light on this problem, as a 4n-step algorithm is shown to exist for a model slightly more powerful than the conventional SIMD model. Moreover, this algorithm has a running time of 3n steps on the more powerful MIMD model, which is "truly" optimal for such a model. Index Terms: distance bound, lower bound, mesh-connected network, parallel algorithm, sorting, time complexity, upper bound. Two-dimensional sorting is defined as the ordering of a rectangular array of numbers such that every element is routed to a distinct position of the array predetermined by some indexing scheme. Some of the standard indexing schemes are illustrated in Fig. . The simplest computational model onto which this problem can be mapped is the mesh-connected processor array (mesh for short). The simplicity of the interconnection pattern and the locality of communication make the mesh easy to build and program, and it was the basis of one of the earliest parallel computers (ILLIAC IV). Since then, more machines have been built on a much larger scale, including the MPP and the DAPP, using similar interconnection patterns. This simple architecture further motivates the idea of dealing with a given set of numbers as a rectangular array rather than as a linear sequence. More recently, Scherson [15] and Tseng et al. [22] have independently proposed a network which they call the orthogonal access architecture and the reduced-mesh network, respectively. It consists of p processors which are connected by a shared memory of p - q x p - q locations.
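
As a companion to the abstract above, here is a minimal sequential model of shear-sort on an n x n array: rows are sorted in alternating (snake) directions, then columns are sorted, for roughly log n + 1 phases. It models only the algorithm's data movement on a square array; it is not a mesh implementation, and the function names are illustrative.

```python
# Illustrative sequential model of shear-sort on an n x n array: rows are
# sorted in alternating directions, then columns are sorted, for about
# ceil(log2 n) + 1 phases; a final row phase leaves the array snake-sorted.
import math

def shear_sort(grid):
    n = len(grid)
    phases = math.ceil(math.log2(n)) + 1 if n > 1 else 1
    for _ in range(phases):
        # Row phase: even rows ascend, odd rows descend (snake order).
        for r in range(n):
            grid[r].sort(reverse=(r % 2 == 1))
        # Column phase: every column is sorted top-to-bottom.
        for c in range(n):
            col = sorted(grid[r][c] for r in range(n))
            for r in range(n):
                grid[r][c] = col[r]
    # One final row phase finishes the snake-like sorted order.
    for r in range(n):
        grid[r].sort(reverse=(r % 2 == 1))
    return grid

if __name__ == "__main__":
    import random
    n = 8
    g = [[random.randrange(100) for _ in range(n)] for _ in range(n)]
    g = shear_sort(g)
    snake = [x for r in range(n)
             for x in (g[r] if r % 2 == 0 else reversed(g[r]))]
    assert snake == sorted(snake)
```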

On the design of a high-performance, expandable, sorting engine

Integration, the VLSI Journal, 1994

This paper presents the design and implementation of a modular, expandable and high-performance sorter based on the rebound sorting algorithm of Chen et al. (1978). This single-chip rebound sorter can sort 24 records of 32 or 64 bits, in 2's complement or unsigned format, in either ascending or descending order. The modular design of the sorter allows direct cascading of chips for sorting more than 24 records. The monolithic sorter is implemented in 2.0 μm CMOS technology, in a frame of 7.9 mm x 9.2 mm, which supports its 84 I/Os. A pipelining scheme was used to achieve a sustained throughput (of cascaded sorting chips) of 10 MHz, while a scan-path was used to allow external control of memory elements for testing purposes. The emphasis of this paper is on the architecture and circuit design of the sorter, which results in a significant improvement in functionality, versatility and performance over previously reported monolithic sorter circuits. A comparative study of other hardware sorter implementations, and of sorting with a general-purpose processor, illustrates the performance advantages and functional versatility of the sorter chip reported in this paper.

Simulating the Bitonic Sort Using P Systems

Proceedings of the 8th …, 2007

This paper gives a version of Batcher's parallel bitonic sorting algorithm, which can sort N elements in time O(log² N). When applying it to the 2D mesh architecture, two indexing functions are considered: row-major and shuffled row-major. Some properties are proved for the latter, together with a correctness proof of the proposed algorithm. Two simulations with P systems are proposed and discussed. The first one uses dynamic communication graphs and follows the guidelines of the mesh version of the algorithm. The second simulation requires only symbol-rewriting rules in one membrane.
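
For reference, a compact recursive version of Batcher's bitonic sort (for n a power of two) is sketched below; its data-independent compare-exchange pattern is what both the mesh mapping and the P-system simulations exploit. This is a generic textbook formulation, not the paper's row-major / shuffled row-major mesh variant.

```python
# Minimal recursive bitonic sort (Batcher), for n a power of two. Each of the
# O(log^2 n) compare-exchange stages is data-independent, which is what makes
# the network attractive for meshes and P-system simulations alike.

def bitonic_sort(a, up=True):
    if len(a) <= 1:
        return a
    half = len(a) // 2
    first = bitonic_sort(a[:half], True)    # ascending half
    second = bitonic_sort(a[half:], False)  # descending half -> bitonic sequence
    return _bitonic_merge(first + second, up)

def _bitonic_merge(a, up):
    if len(a) <= 1:
        return a
    half = len(a) // 2
    a = a[:]  # work on a copy; compare-exchange across the two halves
    for i in range(half):
        if (a[i] > a[i + half]) == up:
            a[i], a[i + half] = a[i + half], a[i]
    return _bitonic_merge(a[:half], up) + _bitonic_merge(a[half:], up)

if __name__ == "__main__":
    import random
    xs = [random.randrange(100) for _ in range(16)]
    assert bitonic_sort(xs) == sorted(xs)
```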

A COMPARISON-FREE SORTING ALGORITHM ON CPUs

2016

The paper presents a new sorting algorithm that takes integer input elements and sorts them without any comparison operations between the data (a comparison-free sort). The algorithm uses a one-hot representation for each input element, stored in a two-dimensional matrix called a one-hot matrix. Concurrently, each input element is also stored in a one-dimensional matrix in its integer representation. Subsequently, the transposed one-hot matrix is mapped to a binary matrix, producing a sorted matrix with all elements in their sorted order. The algorithm exploits parallelism suitable for single-instruction multiple-thread (SIMT) computing and can harness the resources of such machines, including CPUs with multiple cores and GPUs with large thread blocks. We analyze the algorithm's sorting time on varying CPU architectures, including single- and multi-threaded implementations on a single CPU. Our results show a fast sorting time for the singl...
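
A much-simplified, sequential illustration of the comparison-free idea follows: each element becomes a one-hot row over the key range, the rows are reduced column-wise (conceptually a transpose), and the sorted output is read off the column totals. It captures the spirit of the one-hot-matrix scheme and is close to counting sort; the paper's SIMT mapping and matrix layout are not reproduced, and the names are illustrative.

```python
# Simplified, sequential illustration of the comparison-free idea: each input
# element is written as a one-hot row over the key range, the rows are summed
# column-wise (a transpose-and-reduce), and the sorted output is read off the
# column totals. The paper's SIMT mapping is not reproduced here.

def one_hot_sort(values, key_range):
    # Build the one-hot matrix: one row per element, one column per key value.
    one_hot = [[1 if v == k else 0 for k in range(key_range)] for v in values]
    # Column-wise reduction of the (conceptually transposed) matrix.
    counts = [sum(row[k] for row in one_hot) for k in range(key_range)]
    # Emit each key as many times as its column total indicates.
    out = []
    for k, c in enumerate(counts):
        out.extend([k] * c)
    return out

if __name__ == "__main__":
    import random
    data = [random.randrange(32) for _ in range(50)]
    assert one_hot_sort(data, 32) == sorted(data)
```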

SORTCHIP: A VLSI implementation of a hardware algorithm for continuous data sorting

IEEE Journal of Solid-State Circuits, 2003

We present a VLSI implementation of a hardware sorting algorithm for continuous data sorting. The device is able to continuously process an input data stream while producing a sorted output data stream. At each clock cycle, the device reads and processes a 48-bit word, 24 bits for the datum and 24 bits for the associated tag. The data stream is sorted according to the tags preserving the order of words with identical tags. Sequences up to 256 words are completely sorted and longer sequences are partially sorted. The maximum operation frequency is 50 Mwords/s. The architecture is based on a chain of identical elementary sorting units. A full custom design exploits the highly regular architecture to achieve high area and time performance. We describe the algorithm and give architectural details.
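
A behavioral sketch of the underlying principle, a chain of identical compare cells, is given below: at each step a cell keeps the record with the smaller tag and forwards the other, so the chain ends up holding the stream in sorted, stable order. This is only a software model with an illustrative name (chain_sort), not the SORTCHIP circuit or its streaming I/O.

```python
# Behavioral model (not the SORTCHIP circuit) of a chain of identical sorting
# cells: at each step a cell compares the incoming (tag, datum) record with the
# one it holds, keeps the record with the smaller tag, and forwards the other
# to the next cell. Ties keep their arrival order, matching the stability
# described in the abstract.
INF = (float("inf"), None, -1)          # sentinel: (tag, datum, arrival index)

def chain_sort(stream, n_cells):
    cells = [INF] * n_cells
    for arrival, (tag, datum) in enumerate(stream):
        incoming = (tag, datum, arrival)
        for c in range(n_cells):
            held = cells[c]
            # Keep the smaller tag (earlier arrival wins ties), pass the other on.
            if (incoming[0], incoming[2]) < (held[0], held[2]):
                cells[c], incoming = incoming, held
        # A record pushed past the last cell is dropped here, mirroring the
        # fact that sequences longer than the chain are only partially sorted.
    return [(tag, datum) for tag, datum, _ in cells if tag != float("inf")]

if __name__ == "__main__":
    records = [(5, "a"), (3, "b"), (5, "c"), (1, "d"), (4, "e")]
    assert chain_sort(records, 8) == [(1, "d"), (3, "b"), (4, "e"), (5, "a"), (5, "c")]
```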

The technique of in-place associative sorting

In the first place, a novel yet straightforward in-place integer value-sorting algorithm is presented. It sorts in linear time using a constant amount of additional memory, for storing counters and indices, beside the input array. The technique is inspired by the principal idea behind one of the ordinal theories of "serial order in behavior" and is explained by analogy with the three main stages in the formation and retrieval of memory in cognitive neuroscience: (i) practicing, (ii) storage and (iii) retrieval. It is further improved in terms of time complexity as well as specialized for distinct integers, though it is still improper for rank-sorting. Afterwards, another novel yet straightforward technique is introduced which makes this efficient value-sorting technique proper for rank-sorting. Hence, given an array of n elements, each with an integer key, the technique sorts the elements according to their integer keys in linear time using only a constant amount of additional mem...

A new deterministic parallel sorting algorithm with an experimental evaluation

Journal of Experimental Algorithmics, 1998

We introduce a new deterministic parallel sorting algorithm for distributed memory machines based on the regular sampling approach. The algorithm uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, our algorithm efficiently handles the presence of duplicate values without the overhead of tagging each element with a unique identifier. This algorithm was implemented in SPLIT-C and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2-WN, and the Cray Research T3D. We ran our code using widely different benchmarks to examine the dependence of our algorithm on the input distribution. Our experimental results illustrate the efficiency and scalability of our algorithm across different platforms. In fact, the performance compares closely to that of our random sample sort algorithm, which seems to outperform all similar algorithms known to the authors on these platforms. Together, their performance is nearly invariant over the set of input distributions, unlike previous efficient algorithms. However, unlike our randomized sorting algorithm, the performance and memory requirements of our regular sorting algorithm can be deterministically guaranteed.

We present a novel variation on the approach of sorting by regular sampling which leads to a new deterministic sorting algorithm that achieves optimal computational speedup with very little communication. Our algorithm exchanges the single step of irregular communication used by previous implementations for two steps of regular communication. In return, our algorithm mitigates the problem of poor load balancing because it is able to sustain a high sampling rate at substantially less cost. In addition, our algorithm efficiently accommodates the presence of duplicates without the overhead of tagging each element. And our algorithm achieves predictable, regular communication requirements which are essentially invariant with respect to the input distribution. Utilizing regular communication has become more important with the advent of message-passing standards, such as MPI [16], which seek to guarantee the availability of very efficient (often machine-specific) implementations of certain basic collective communication routines. Our algorithm was implemented in a high-level language and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2, and the Cray Research T3D. We ran our code using a variety of benchmarks that we identified to examine the dependence of our algorithm on the input distribution. Our experimental results are consistent with the theoretical analysis and illustrate the efficiency and scalability of our algorithm across different platforms. In fact, the performance compares closely to that of our random sample sort algorithm, which seems to outperform all similar algorithms known to the authors on these platforms. Together, their performance is nearly indifferent to the set of input distributions, unlike previous efficient algorithms. However, unlike our randomized sorting algorithm, the performance and memory requirements of our regular sorting algorithm can be deterministically guaranteed. The high-level language used in our studies is SPLIT-C [10], an extension of C for distributed memory machines. The algorithm makes use of MPI-like communication primitives but does not make any assumptions as to how these primitives are actually implemented.
The basic data transport is a read or write operation. The remote read and write typically have both blocking and non-blocking versions. Also, when reading or writing more than a single element, bulk data transports are provided with corresponding bulk read and bulk write primitives. Our collective communication primitives, described in detail in [4], are similar to those of the MPI [16], the IBM POWERparallel [6], and the Cray MPP systems [9] and, for example, include the following: transpose, bcast, gather, and scatter. Brief descriptions of these are as follows. The transpose primitive is an all-to-all personalized communication in which each processor has to send a unique block of data to every processor, and all the blocks are of the same size. The bcast primitive is used to copy a block of data from a single source to all the other processors. The primitives gather and scatter are companion primitives. Scatter divides a single array residing on a processor into equal-sized blocks, each of which is distributed to a unique processor, and gather coalesces these blocks back into a single array at a particular processor. See [3, 4, 5] for algorithmic details, performance analyses, and empirical results for these communication primitives. The organization of this paper is as follows. Section 2 presents our computation model for analyzing parallel algorithms. Section 3 describes in detail our improved sample sort algorithm. Finally, Section 4 describes our data sets and the experimental performance of our sorting algorithm.
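
The two-round regular-sampling scheme can be simulated sequentially: each "processor" sorts its block, contributes regularly spaced samples, p − 1 splitters are chosen from the gathered samples, and one regular all-to-all exchange routes each partition to its destination for a final local merge. The Python sketch below (the function name psrs is illustrative) models only this communication pattern; the SPLIT-C primitives and load-balance analysis are omitted.

```python
# Sequential simulation of sorting by regular sampling: each "processor" sorts
# its block, contributes p regularly spaced samples, p - 1 splitters are chosen
# from the gathered samples, and a single regular all-to-all exchange routes
# every partition to its destination processor, which merges what it receives.
from bisect import bisect_right
import heapq
import random

def psrs(data, p):
    n = len(data)                      # assumes n >= p so every block is non-empty
    blocks = [sorted(data[i * n // p:(i + 1) * n // p]) for i in range(p)]
    # Regular sampling: p samples per processor, then a global sort of samples.
    samples = sorted(b[j * len(b) // p] for b in blocks for j in range(p))
    # p - 1 splitters taken at regular positions in the gathered sample.
    splitters = [samples[(j + 1) * len(samples) // p - 1] for j in range(p - 1)]
    # "All-to-all": partition every block by the splitters.
    parts = [[] for _ in range(p)]
    for b in blocks:
        cuts = [0] + [bisect_right(b, s) for s in splitters] + [len(b)]
        for dest in range(p):
            parts[dest].append(b[cuts[dest]:cuts[dest + 1]])
    # Each destination merges its already-sorted pieces; concatenation is sorted.
    return [x for dest in range(p) for x in heapq.merge(*parts[dest])]

if __name__ == "__main__":
    xs = [random.randrange(10_000) for _ in range(1_000)]
    assert psrs(xs, 8) == sorted(xs)
```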

VHDL Design of a Scalable VLSI Sorting Device Based on Pipelined Computation

Journal of Computing and Information Technology, 2004

This paper describes the VHDL design of a sorting algorithm, aiming at defining an elementary sorting unit as a building block of VLSI devices which require a huge number of sorting units. As such, an attempt was made to reach a reasonably low value of the area-time parameter. A sorting VLSI device, in fact, can be built as a cascade of elementary sorting units which process the input stream in a pipeline fashion: as the processing goes on, a wave of sorted numbers propagates towards the output ports. In describing the design, the paper discusses the initial theoretical analysis of the algorithm's complexity, the VHDL behavioural analysis of the proposed architecture, a structural synthesis of a sorting block based on the Alliance tools and, finally, a silicon synthesis which was also worked out using Alliance. Two points in the proposed design are particularly noteworthy. First, the sorting architecture is suitable for treating a continuous stream of input data, rather than a block of data as in many other designs. Secondly, the proposed design reaches a reasonable compromise between area and time, as it yields an AT product which compares favourably with the theoretical lower bound.

Accelerating sorting with reconfigurable hardware

Abstract: This paper is dedicated to exploring the acceleration of sorting algorithms with reconfigurable hardware. We present the rationale for solving the sorting problem in hardware and suggest ways to ease the use of sorting hardware in the real world of applications programming. One of the main goals of the ongoing work is the migration of the quicksort algorithm to hardware. The algorithm and its mapping to hardware are discussed. Keywords: sorting, VHDL, FPGA, digital systems, fast prototyping.

Accelerating Sorting Through the Use of Reconfigurable Hardware

Abstract: In this paper we present the first steps of a work dedicated to exploring the acceleration of sorting algorithms with reconfigurable hardware. The rationale for solving the sorting problem in hardware is presented. We suggest ways to facilitate the use of sorting hardware in the real world of applications programming. One of the main goals of the ongoing work is the migration of the well-known quicksort algorithm to hardware. Accordingly, we discuss the algorithm and provide its mapping to hardware.
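
A typical first step when migrating quicksort to hardware is to replace recursion with an explicit stack and in-place partitioning, since a small stack memory and a partitioning datapath map naturally onto an FPGA controller. The sketch below shows that generic reformulation in software; it is not the mapping proposed in these papers.

```python
# Generic, iterative quicksort with an explicit stack and in-place (Hoare-style)
# partitioning. Removing recursion is a common first step when mapping the
# algorithm to an FPGA controller with a small stack memory; this is a plain
# software sketch, not the hardware mapping discussed in the papers above.

def quicksort_iterative(a):
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:                      # partition around the pivot value
            while a[i] < pivot: i += 1
            while a[j] > pivot: j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1; j -= 1
        stack.append((lo, j))              # left subrange
        stack.append((i, hi))              # right subrange
    return a

if __name__ == "__main__":
    import random
    xs = [random.randrange(100) for _ in range(200)]
    assert quicksort_iterative(xs[:]) == sorted(xs)
```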

Design of various operating devices for sorting binary data

Eastern-European Journal of Enterprise Technologies

The object of research is the process of designing hardware devices for sorting arrays of binary data using the methodology of space-time graphs. The main task solved in this work is the development and study of multi-cycle operating devices for sorting binary data, in order to choose the optimal structure with predetermined technical characteristics for solving the sorting problem. As an example, the development of different types of structures of multi-cycle operating sorting devices based on the "even-odd" permutation method is shown, and their system characteristics are determined. New structures of multi-cycle operating devices have been designed for the given sorting algorithm, and analytical expressions for calculating equipment costs and performance have been given. A comparative analysis of the hardware and time complexity of the developed structures of devices for sorting binary numbers of various types with known implementations of algorithmic and pipeline operat...
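
The "even-odd" permutation method referred to above is odd-even transposition sort: n rounds that alternate compare-and-swap on even-indexed and odd-indexed adjacent pairs, with every comparison in a round independent of the others. A minimal software reference model follows (illustrative only, not one of the paper's operating-device structures).

```python
# Reference model of odd-even transposition ("even-odd" permutation) sorting:
# n rounds alternate compare-and-swap on even-indexed and odd-indexed adjacent
# pairs. All comparisons in a round are independent, so a device can perform a
# whole round in one cycle; this is a plain software model, not an RTL design.

def odd_even_transposition_sort(a):
    n = len(a)
    for round_no in range(n):
        start = round_no % 2            # 0: even pairs, 1: odd pairs
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

if __name__ == "__main__":
    import random
    xs = [random.randrange(100) for _ in range(33)]
    assert odd_even_transposition_sort(xs[:]) == sorted(xs)
```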

Overlapping Computations, Communications and I/O in Parallel Sorting

Journal of Parallel and Distributed Computing, 1995

In this paper we present a new parallel sorting algorithm which maximizes the overlap between the disk, network, and CPU subsystems of a processing node. This algorithm is shown to be of similar complexity to known efficient sorting algorithms. The pipelining effect exploited by our algorithm should lead to higher levels of performance on distributed memory parallel processors. In order to achieve the best results using this strategy, the CPU, network and disk operations must take comparable time. We suggest acceptable levels of system balance for sorting machines and analyze the performance of the sorting algorithm as system parameters vary.

Bitonic Sort on Ultracomputers

Batcher's bitonic sort (cf. Knuth, v. III, pp. 232 ff) is a sorting network capable of sorting n inputs in Θ((log n)²) stages. When adapted to conventional computers, it gives rise to an algorithm that runs in time Θ(n(log n)²). The method can also be adapted to ultracomputers (Schwartz [1979]) to exploit their high degree of parallelism. The resulting algorithm will take time Θ((log N)²) for ultracomputers of "size" N. The implicit constant factor is low, so that even for moderate values of N the ultracomputer architecture performs faster than the Θ(N log N) time a conventional architecture can achieve. The purpose of this note is to describe the adapted algorithm. After some preliminaries, a first version of the algorithm is given whose correctness is easily shown. Next, this algorithm is transformed to make it suitable for an ultracomputer. 1. Introduction Batcher's bitonic sort (cf. Knuth, v. III, pp. 232 ff) is a sorting network, capable of sorting n input...
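
On shuffle/hypercube-style machines the bitonic network is usually expressed with XOR partners: in stage (k, j) element i is paired with element i XOR j, and the compare direction is taken from bit k of i. The sketch below gives that generic data-parallel formulation run sequentially, assuming N is a power of two; it is not Schwartz's ultracomputer code.

```python
# Sequential model of the data-parallel bitonic sort typically used on
# hypercube/shuffle-style machines: in stage (k, j) element i is paired with
# partner i XOR j, and the compare direction comes from bit k of i. Each inner
# stage is one parallel compare-exchange step, Theta((log N)^2) steps in all.
# This is a generic formulation, not Schwartz's ultracomputer code.

def bitonic_sort_xor(a):
    n = len(a)
    assert n & (n - 1) == 0 and n > 0   # n must be a power of two
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

if __name__ == "__main__":
    import random
    xs = [random.randrange(1000) for _ in range(64)]
    assert bitonic_sort_xor(xs[:]) == sorted(xs)
```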

An optimal and processor efficient parallel sorting algorithm on a linear array with a reconfigurable pipelined bus system

Computers & Electrical Engineering, 2009

Optical interconnections attract the attention of many engineers and scientists due to their potential for gigahertz transfer rates and concurrent access to the bus in a pipelined fashion. These unique characteristics of optical interconnections give us the opportunity to reconsider traditional algorithms designed for ideal parallel computing models, such as PRAMs. Since the PRAM model is far from practical, not all algorithms designed on this model can be implemented on a realistic parallel computing system. From this point of view, we study Cole's pipelined merge sort [Cole R. Parallel merge sort. SIAM J Comput 1988;14:770-85] on the CREW PRAM and extend it in an innovative way to an optical interconnection model, the LARPBS (Linear Array with Reconfigurable Pipelined Bus System) model [Pan Y, Li K. Linear array with a reconfigurable pipelined bus system: concepts and applications. J Inform Sci 1998;106:237-58]. Although Cole's algorithm is optimal, communication details have not been provided, due to the fact that it is designed for a PRAM. We close this gap in our sorting algorithm on the LARPBS model and obtain an O(log N)-time optimal sorting algorithm using O(N) processors. This is a substantial improvement over the previous best sorting algorithm on the LARPBS model, which runs in O(log N log log N) worst-case time using N processors [Datta A, Soundaralakshmi S, Owens R. Fast sorting algorithms on a linear array with a reconfigurable pipelined bus system. IEEE Trans Parallel Distribut Syst 2002;13(3):212-22]. Our solution allows processors to be assigned and reused efficiently. We also discover two new properties of Cole's sorting algorithm, which are presented as lemmas in this paper.

Case Study: Memory Conscious Parallel Sorting

Lecture Notes in Computer Science, 2003

The efficient parallelization of an algorithm is a hard task that requires a good command of software techniques and considerable knowledge of the target computer. In such a task, either the programmer or the compiler has to tune different parameters to adapt the algorithm to the computer's network topology and communication libraries, or to expose the data locality of the algorithm on the memory hierarchy at hand.

Sorting in Linear Time?

Journal of Computer and System Sciences, 1998

We show that a unit-cost RAM with a word length of w bits can sort n integers in the range 0..2^w − 1 in O(n log log n) time, for arbitrary w ≥ log n, a significant improvement over the bound of O(n √(log n)) achieved by the fusion trees of Fredman and Willard. Provided that w ≥ (log n)^(2+ε), for some fixed ε > 0, the sorting can even be accomplished in linear expected time with a randomized algorithm. Both of our algorithms parallelize without loss on a unit-cost PRAM with a word length of w bits. The first one yields an algorithm that uses O(log n) time and O(n log log n) operations on a deterministic CRCW PRAM. The second one yields an algorithm that uses O(log n) expected time and O(n) expected operations on a randomized EREW PRAM, provided that w ≥ (log n)^(2+ε) for some fixed ε > 0. Our deterministic and randomized sequential and parallel algorithms generalize to the lexicographic sorting problem of sorting multiple-precision integers represented in several words.
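
The word-RAM algorithms above rely on intricate techniques (packed sorting, signature sort) that are not easily condensed; as a simpler point of reference for sorting w-bit integers without comparisons, the sketch below gives an LSD radix sort, which runs in O(n · w/b) time with b-bit digits. It is a baseline for comparison only, not the paper's O(n log log n) algorithm.

```python
# Baseline, not the paper's algorithm: LSD radix sort of w-bit integers using
# b-bit digits runs in O(n * w / b) time, illustrating how word-level structure
# lets integer sorting beat the comparison lower bound. The O(n log log n) and
# expected-linear-time results above rely on far more intricate word-RAM tricks
# that are not reproduced here.

def radix_sort(keys, w=32, b=8):
    mask = (1 << b) - 1
    for shift in range(0, w, b):
        buckets = [[] for _ in range(1 << b)]
        for x in keys:                        # stable distribution pass
            buckets[(x >> shift) & mask].append(x)
        keys = [x for bucket in buckets for x in bucket]
    return keys

if __name__ == "__main__":
    import random
    xs = [random.getrandbits(32) for _ in range(1_000)]
    assert radix_sort(xs) == sorted(xs)
```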

High-performance sorting algorithms for the CRAY T3D parallel computer

The Journal of Supercomputing, 1997

In this paper we study the sorting performance of a 128-processor CRAY T3D and discuss the efficient use of the toroidal network connecting the processors. The problems we consider range from that of sorting one word per processor to sorting the entire memory of the machine, and we give efficient algorithms for each case. In addition, we give both algorithms that make assumptions about the distribution of the data and those that make no assumptions. The clear winner, if data can be assumed to be uniformly distributed, is a method that we call a hash-and-chain sort. The time for this algorithm to sort one million words per processor over 64 processors is less than two seconds, which compares favorably to about four seconds using a 4-processor CRAY C90 and about 17 seconds using a 64-processor Thinking Machines CM-5.
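
The abstract does not spell out the hash-and-chain sort; one plausible reading, under the stated uniformity assumption, is a distribution sort in which a monotone function of the high-order key bits assigns each key to a small bucket ("chain") that is then finished locally. The single-node sketch below follows that reading with illustrative names and parameters; the actual T3D implementation may differ.

```python
# Single-node sketch of a distribution ("hash-and-chain" style) sort under the
# uniformity assumption mentioned in the abstract: a monotone function of the
# high-order bits assigns each key to a bucket (chain), buckets are expected to
# be small and are finished with a local sort. Names and parameters here are
# illustrative; the T3D implementation details are not reproduced.

def bucket_distribution_sort(keys, key_bits=32, buckets_per_key=4):
    n = max(len(keys), 1)
    n_buckets = buckets_per_key * n
    chains = [[] for _ in range(n_buckets)]
    for x in keys:
        # Monotone "hash": high-order bits scaled to a bucket index, so that
        # concatenating buckets in index order yields globally sorted output.
        chains[(x * n_buckets) >> key_bits].append(x)
    out = []
    for chain in chains:
        out.extend(sorted(chain))   # expected O(1) keys per chain if uniform
    return out

if __name__ == "__main__":
    import random
    xs = [random.getrandbits(32) for _ in range(2_000)]
    assert bucket_distribution_sort(xs) == sorted(xs)
```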