Implementations of a model of physical sorting (original) (raw)

Modular Design of High-Throughput, Low-Latency Sorting Units

High-throughput and low-latency sorting is a key requirement in many applications that deal with large amounts of data. This paper presents efficient techniques for designing high-throughput, low-latency sorting units. Our sorting architectures utilize modular design techniques that hierarchically construct large sorting units from smaller building blocks. The sorting units are optimized for situations in which only the M largest numbers from N inputs are needed, because this situation commonly occurs in many applications for scientific computing, data mining, network processing, digital signal processing, and high-energy physics. We utilize our proposed techniques to design parameterized, pipelined, and modular sorting units. A detailed analysis of these sorting units indicates that as the number of inputs increases their resource requirements scale linearly, their latencies scale logarithmically, and their frequencies remain almost constant. When synthesized to a 65-nm TSMC technology, a pipelined 256-to-4 sorting unit with 19 stages can perform more than 2.7 billion sorts per second with a latency of about 7 ns per sort. We also propose iterative sorting techniques, in which a small sorting unit is used several times to find the largest values.

On the design of a high-performance, expandable, sorting engine

Integration, the VLSI Journal, 1994

This paper presents the design and implementation of a modular, expandable and high-performance sorter based on the rebound sorting algorithm of Chen et al. (1978). This single chip rebound sorter can sort 24, 32-bit or 64-bit, records of 2's complement or unsigned data in either ascending or descending order. The modular design of the sorter allows direct cascading of chips for sorting more than 24 records. The monolithic sorter is implemented in 2.0/zm CMOS technology, in a frame of 7.9mm x 9.2mm, which supports its 84 I/O. A pipelining scheme was used to achieve a sustained throughput (of cascaded sorting chips) of 10 MHz, while a scan-path was used to allow external control of memory elements for testing purposes. The emphasis of this paper is on the architecture and circuit design of the sorter which results in a significant improvement in terms of functionality, versatility and performance, over previously reported monolithic sorter circuits. A comparative study of other hardware sorter implementations, and sorting with a general purpose processor, illustrates the performance advantages and functional versatility of the sorter chip reported in this paper.

Simulating the Bitonic Sort Using P Systems

Proceedings of the 8th …, 2007

This paper gives a version of the parallel bitonic sorting algorithm of Batcher, which can sort N elements in time O(log 2 N). When applying it to the 2D mesh architecture, two indexing functions are considered, row-major and shuffled rowmajor. Some properties are proved for the later, together with a correctness proof of the proposed algorithm. Two simulations with P systems are proposed and discussed. The first one uses dynamic communication graphs and follows the guidelines of the mesh version of the algorithm. The second simulation requires only symbol rewriting rules in one membrane.

A COMPARISON-FREE SORTING ALGORITHM ON CPUs

2016

The paper presents a new sorting algorithm that takes input data integer elements and sorts them without any comparison operations between the data—a comparison-free sorting. The algorithm uses a one-hot representation for each input element that is stored in a two-dimensional matrix called a one-hot matrix. Concurrently, each input element is also stored in a one-dimensional matrix in the input element’s integer representation. Subsequently, the transposed one-hot matrix is mapped to a binary matrix producing a sorted matrix with all elements in their sorted order. The algorithm exploits parallelism that is suitable for single instruction multiple thread (SIMT) computing that can harness the resources of these computing machines, such as CPUs with multiple cores and GPUs with large thread blocks. We analyze our algorithm’s sorting time on varying CPU architectures, including singleand multi-threaded implementations on a single CPU. Our results show a fast sorting time for the singl...

SORTCHIP: A VLSI implementation of a hardware algorithm for continuous data sorting

IEEE Journal of Solid-State Circuits, 2003

We present a VLSI implementation of a hardware sorting algorithm for continuous data sorting. The device is able to continuously process an input data stream while producing a sorted output data stream. At each clock cycle, the device reads and processes a 48-bit word, 24 bits for the datum and 24 bits for the associated tag. The data stream is sorted according to the tags preserving the order of words with identical tags. Sequences up to 256 words are completely sorted and longer sequences are partially sorted. The maximum operation frequency is 50 Mwords/s. The architecture is based on a chain of identical elementary sorting units. A full custom design exploits the highly regular architecture to achieve high area and time performance. We describe the algorithm and give architectural details.

The technique of in-place associative sorting

In the first place, a novel, yet straightforward in-place integer value-sorting algorithm is presented. It sorts in linear time using constant amount of additional memory for storing counters and indices beside the input array. The technique is inspired from the principal idea behind one of the ordinal theories of "serial order in behavior" and explained by the analogy with the three main stages in the formation and retrieval of memory in cognitive neuroscience: (i) practicing, (ii) storage and (iii) retrieval. It is further improved in terms of time complexity as well as specialized for distinct integers, though still improper for rank-sorting. Afterwards, another novel, yet straightforward technique is introduced which makes this efficient value-sorting technique proper for rank-sorting. Hence, given an array of n elements each have an integer key, the technique sorts the elements according to their integer keys in linear time using only constant amount of additional mem...

VHDL Design of a Scalable VLSI Sorting Device Based on Pipelined Computation

Journal of Computing and Information Technology, 2004

This paper describes the VHDL design of a sorting algorithm, aiming at defining an elementary sorting unit as a building block of VLSI devices which require a huge number of sorting units. As such, an attempt was made to reach a reasonable low value of the area-time parameter. A sorting VLSI device, in fact, can be built as a cascade of elementary sorting units which process the input stream in a pipeline fashion: as the processing goes on, a wave of sorted numbers propagates towards the output ports. In the description of the design, the paper discusses the initial theoretical analysis of the algorithm's complexity VHDL behavioural analysis of the proposed architecture, a structural synthesis of a sorting block based on the Alliance tools and, finally, a silicon synthesis which was also worked out using Alliance. Two points in the proposed design are particularly noteworthy. First, the sorting architecture is suitable for treating a continuous stream of input data, rather than a block of data as in many other designs. Secondly, the proposed design reaches a reasonable compromise between area and time, as it yields an A T product which compares favourably with the theoretical lower bound.

A new deterministic parallel sorting algorithm with an experimental evaluation

Journal of Experimental Algorithmics, 1998

We introduce a new deterministic parallel sorting algorithm for distributed memory machines based on the regular sampling approach. The algorithm uses only two rounds of regular all-to-all personalized communication in a scheme that yields very good load balancing with virtually no overhead. Moreover, unlike previous variations, our algorithm efficiently handles the presence of duplicate values without the overhead of tagging each element with a unique identifier. This algorithm was implemented in SPLIT-C and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2-WN, and the Cray Research T3D. We ran our code using widely different benchmarks to examine the dependence of our algorithm on the input distribution. Our experimental results illustrate the efficiency and scalability of our algorithm across different platforms. In fact, the performance compares closely to that of our random sample sort algorithm, which seems to outperform all similar algorithms known to the authors on these platforms. Together, their performance is nearly invariant over the set of input distributions, unlike previous efficient algorithms. However, unlike our randomized sorting algorithm, the performance and memory requirements of our regular sorting algorithm can be deterministically guaranteed. We present a novel variation on the approach of sorting by regular sampling which leads to a new deterministic sorting algorithm that achieves optimal computational speedup with very little communication. Our algorithm exchanges the single step of irregular communication used by previous implementations for two steps of regular communication. In return, our algorithm mitigates the problem of poor load balancing because it is able to sustain a high sampling rate at substantially less cost. In addition, our algorithm efficiently accommodates the presence of duplicates without the overhead of tagging each element. And our algorithm achieves predictable, regular communication requirements which are essentially invariant with respect to the input distribution. Utilizing regular communication has become more important with the advent of message passing standards, such as MPI [16], which seek to guarantee the availability of very efficient (often machine specific) implementations of certain basic collective communication routines. Our algorithm was implemented in a high-level language and run on a variety of platforms, including the Thinking Machines CM-5, the IBM SP-2, and the Cray Research T3D. We ran our code using a variety of benchmarks that we identified to examine the dependence of our algorithm on the input distribution. Our experimental results are consistent with the theoretical analysis and illustrate the efficiency and scalability of our algorithm across different platforms. In fact, the performance compares closely to that of our random sample sort algorithm, which seems to outperform all similar algorithms known to the authors on these platforms. Together, their performance is nearly indifferent to the set of input distributions, unlike previous efficient algorithms. However, unlike our randomized sorting algorithm, the performance and memory requirements of our regular sorting algorithm can be guaranteed with deterministically. The high-level language used in our studies is SPLIT-C [10], an extension of C for distributed memory machines. The algorithm makes use of MPI-like communication primitives but does not make any assumptions as to how these primitives are actually implemented. The basic data transport is a read or write operation. The remote read and write typically have both blocking and non-blocking versions. Also, when reading or writing more than a single element, bulk data transports are provided with corresponding bulk read and bulk write primitives. Our collective communication primitives, described in detail in [4], are similar to those of the MPI [16], the IBM POWERparallel [6], and the Cray MPP systems [9] and, for example, include the following: transpose, bcast, gather, and scatter. Brief descriptions of these are as follows. The transpose primitive is an all-to-all personalized communication in which each processor has to send a unique block of data to every processor, and all the blocks are of the same size. The bcast primitive is used to copy a block of data from a single source to all the other processors. The primitives gather and scatter are companion primitives. Scatter divides a single array residing on a processor into equal-sized blocks, each of which is distributed to a unique processor, and gather coalesces these blocks back into a single array at a particular processor. See [3, 4, 5] for algorithmic details, performance analyses, and empirical results for these communication primitives. The organization of this paper is as follows. Section 2 presents our computation model for analyzing parallel algorithms. Section 3 describes in detail our improved sample sort algorithm. Finally, Section 4 describes our data sets and the experimental performance of our sorting algorithm.

Accelerating sorting with reconfigurable hardware

Abstract: This paper is dedicated to explore the acceleration of sorting algorithms with reconfigurable hardware. We present the rationale for solving the sorting problem in hardware, and suggest ways to ease the use of sorting hardware in the real world of applications programming. One of the ongoing work main goals is the migration of the quicksort algorithm to hardware. The algorithm and its mapping to hardware are discussed. Keywords: sorting, VHDL, FPGA, digital systems, fast prototyping.