On the design of a high-performance, expandable, sorting engine

SORTCHIP: A VLSI implementation of a hardware algorithm for continuous data sorting

IEEE Journal of Solid-State Circuits, 2003

We present a VLSI implementation of a hardware sorting algorithm for continuous data sorting. The device continuously processes an input data stream while producing a sorted output data stream. At each clock cycle, the device reads and processes a 48-bit word: 24 bits for the datum and 24 bits for the associated tag. The data stream is sorted according to the tags, preserving the order of words with identical tags. Sequences of up to 256 words are completely sorted, and longer sequences are partially sorted. The maximum operating rate is 50 Mwords/s. The architecture is based on a chain of identical elementary sorting units. A full custom design exploits the highly regular architecture to achieve high area and time performance. We describe the algorithm and give architectural details.
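The behaviour described above (stable sorting by tag, complete sorting up to a fixed length, partial sorting beyond it) can be illustrated with a small software model. This is only a sketch under an assumption that is not taken from the paper: the chain of sorting units is modelled as a bounded buffer of `capacity` words that releases its smallest tag once full. The names `stream_sort` and `capacity` are hypothetical.

```python
import heapq
from itertools import count


def stream_sort(words, capacity=256):
    """Yield (tag, datum) words in tag order using a bounded window.

    Assumption: the chain of sorting units is modelled as a buffer holding
    at most `capacity` words. This is not the chip's actual datapath.
    """
    heap = []
    arrival = count()  # arrival index breaks ties, so equal tags keep input order
    for tag, datum in words:
        heapq.heappush(heap, (tag, next(arrival), datum))
        if len(heap) > capacity:
            # Window is full: release the word with the smallest tag.
            t, _, d = heapq.heappop(heap)
            yield t, d
    # End of stream: drain the window in tag order.
    while heap:
        t, _, d = heapq.heappop(heap)
        yield t, d
```

In this model, an input no longer than `capacity` comes out fully sorted, while a longer input is only partially sorted, because a small tag that arrives late may already have been overtaken by released words.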

VLSI circuit for programmable sorting

1997

A circuit for sorting analog quantities is described, which yields analog representations of the sorted values and digital codings of the related ranks. The length of the sorted list can be digitally programmed at run time to support partial sorting. The modular structure facilitates layout design. Suitably coupling current-mode and voltage-mode signals minimizes the number of transistors.

VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors

2015

Sorting is a widely studied problem in computer science and an elementary building block in many of its subfields. There are several known techniques to vectorise and accelerate a handful of sorting algorithms by using single instruction-multiple data (SIMD) instructions. It is expected that the widths and capabilities of SIMD support will improve dramatically in future microprocessor generations and it is not yet clear whether or not these sorting algorithms will be suitable or optimal when executed on them. This work extrapolates the level of SIMD support in future microprocessors and evaluates these algorithms using a simulation framework. The scalability, strengths and weaknesses of each algorithm are experimentally derived. We then propose VSR sort, our own novel vectorised non-comparative sorting algorithm based on radix sort. To facilitate the execution of this algorithm we define two new SIMD instructions and propose a complementary hardware structure for their execution. Our results show that VSR sort has maximum speedups between 14.9x and 20.6x over a scalar baseline and an average speedup of 3.4x over the next-best vectorised sorting algorithm.
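For readers unfamiliar with the non-comparative approach that VSR sort builds on, a scalar least-significant-digit radix sort is sketched below. It is not VSR sort itself and does not model the proposed SIMD instructions; the function name and the parameters `bits_per_pass` and `key_bits` are illustrative assumptions.

```python
def radix_sort(keys, bits_per_pass=8, key_bits=32):
    """LSD radix sort of non-negative integers below 2**key_bits.

    Scalar illustration only; assumes every key fits in `key_bits` bits.
    """
    radix = 1 << bits_per_pass
    mask = radix - 1
    for shift in range(0, key_bits, bits_per_pass):
        # Histogram of the current digit.
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # Exclusive prefix sum gives each bucket's start offset.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]
        # Stable scatter into a fresh output buffer.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```

No pass compares two keys with each other: ordering comes entirely from the histogram, prefix sum, and scatter steps.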

Accelerating sorting with reconfigurable hardware

This paper explores the acceleration of sorting algorithms with reconfigurable hardware. We present the rationale for solving the sorting problem in hardware, and suggest ways to ease the use of sorting hardware in real-world application programming. One of the main goals of the ongoing work is the migration of the quicksort algorithm to hardware. The algorithm and its mapping to hardware are discussed. Keywords: sorting, VHDL, FPGA, digital systems, fast prototyping.

Accelerating Sorting Through the Use of Reconfigurable Hardware

In this paper we present the first steps of a work that explores the acceleration of sorting algorithms with reconfigurable hardware. The rationale for solving the sorting problem in hardware is presented. We suggest ways to facilitate the use of sorting hardware in real-world application programming. One of the main goals of the ongoing work is the migration of the well-known quicksort algorithm to hardware. Accordingly, we discuss the algorithm and provide its mapping to hardware.

Tradeoff analysis and architecture design of a hybrid hardware/software sorter

Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors

Sorting long sequences of keys is a problem that occurs in many different applications. For embedded systems, a uniprocessor software solution is often not applicable because of its low performance, while realizing multiprocessor sorting methods on parallel computers is much too expensive with respect to power consumption, physical weight, and cost. We investigate cost/performance tradeoffs for hybrid sorting algorithms that use a mixture of sequential merge sort and systolic insertion sort techniques. We propose a scalable architecture for integer sorting that consists of a uniprocessor and an FPGA-based parallel systolic co-processor. Speedups, obtained both analytically and experimentally under hardware (cost) constraints, are determined as a function of the time constants of the uniprocessor and the co-processor.
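The division of labour between insertion sort and merge sort can be sketched in software. The model below rests on assumptions not stated in the abstract: the co-processor is represented by an ordinary insertion sort over fixed-size blocks, `block_size` stands in for the hardware capacity, and a k-way merge from the standard library replaces the sequential merge sort on the uniprocessor. All names are hypothetical.

```python
import heapq


def insertion_sort_block(block):
    """Stand-in for the systolic co-processor: insertion sort of one block."""
    out = []
    for key in block:
        i = len(out)
        out.append(key)
        # Shift larger keys one slot to the right, then drop `key` into place.
        while i > 0 and out[i - 1] > key:
            out[i] = out[i - 1]
            i -= 1
        out[i] = key
    return out


def hybrid_sort(keys, block_size=4):
    """Sort blocks with the 'co-processor', then merge the runs in software."""
    runs = [
        insertion_sort_block(keys[i:i + block_size])
        for i in range(0, len(keys), block_size)
    ]
    return list(heapq.merge(*runs))
```

A larger `block_size` leaves fewer runs for the software merge, at the price of a larger (more costly) insertion-sort unit, which is the kind of cost/performance tradeoff the abstract refers to.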

Panning sorter: an approach to the design of minimal-hardware parallel-input data sorters

Electronics Letters, 2010

The panning sorter is introduced, offering a new approach to the design of highly compact digital parallel-input sorters for low-power 2D applications such as image processing and data switching. The result is believed to be the smallest sorter circuit for this type of implementation, for the given time complexity.

Modular Design of High-Throughput, Low-Latency Sorting Units

High-throughput and low-latency sorting is a key requirement in many applications that deal with large amounts of data. This paper presents efficient techniques for designing high-throughput, low-latency sorting units. Our sorting architectures utilize modular design techniques that hierarchically construct large sorting units from smaller building blocks. The sorting units are optimized for situations in which only the M largest numbers from N inputs are needed, because this situation commonly occurs in many applications for scientific computing, data mining, network processing, digital signal processing, and high-energy physics. We utilize our proposed techniques to design parameterized, pipelined, and modular sorting units. A detailed analysis of these sorting units indicates that as the number of inputs increases their resource requirements scale linearly, their latencies scale logarithmically, and their frequencies remain almost constant. When synthesized to a 65-nm TSMC technology, a pipelined 256-to-4 sorting unit with 19 stages can perform more than 2.7 billion sorts per second with a latency of about 7 ns per sort. We also propose iterative sorting techniques, in which a small sorting unit is used several times to find the largest values.
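The idea of building a large M-of-N unit hierarchically from smaller blocks can be illustrated as a tree of merges that each keep only the M largest values. This is a software sketch under assumptions, not the paper's circuit: `leaf` (the size of the smallest block, assumed to be at least 1), `merge_top_m`, and `top_m` are hypothetical names.

```python
def merge_top_m(a, b, m):
    """Merge two descending lists and keep only the m largest values."""
    out = []
    i = j = 0
    while len(out) < m and (i < len(a) or j < len(b)):
        if j >= len(b) or (i < len(a) and a[i] >= b[j]):
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out


def top_m(values, m, leaf=4):
    """Return the m largest values in descending order, built hierarchically."""
    if len(values) <= leaf:
        # Smallest building block: sort a handful of inputs directly.
        return sorted(values, reverse=True)[:m]
    mid = len(values) // 2
    left = top_m(values[:mid], m, leaf)
    right = top_m(values[mid:], m, leaf)
    return merge_top_m(left, right, m)
```

In this model the depth of the merge tree grows with the logarithm of the number of inputs, which mirrors the logarithmic latency scaling reported in the abstract.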

VHDL Design of a Scalable VLSI Sorting Device Based on Pipelined Computation

Journal of Computing and Information Technology, 2004

This paper describes the VHDL design of a sorting algorithm, aiming at defining an elementary sorting unit as a building block of VLSI devices which require a huge number of sorting units. As such, an attempt was made to reach a reasonably low value of the area-time parameter. A sorting VLSI device, in fact, can be built as a cascade of elementary sorting units which process the input stream in a pipeline fashion: as the processing goes on, a wave of sorted numbers propagates towards the output ports. In the description of the design, the paper discusses the initial theoretical analysis of the algorithm's complexity, a VHDL behavioural analysis of the proposed architecture, a structural synthesis of a sorting block based on the Alliance tools and, finally, a silicon synthesis which was also worked out using Alliance. Two points in the proposed design are particularly noteworthy. First, the sorting architecture is suitable for treating a continuous stream of input data, rather than a block of data as in many other designs. Secondly, the proposed design reaches a reasonable compromise between area and time, as it yields an AT product which compares favourably with the theoretical lower bound.
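A cascade of elementary sorting units can be mimicked with a chain of one-value cells. The sketch below is an assumption about how such a unit might behave (each cell keeps the larger of its held and incoming values and forwards the smaller); it is not taken from the paper's VHDL, it lets each value ripple through the whole chain in one step rather than modelling pipeline timing, and it assumes finite inputs because infinity is used as a flush sentinel. `SortingUnit` and `cascade_sort` are hypothetical names.

```python
INF = float("inf")  # flush sentinel; real inputs are assumed to be finite


class SortingUnit:
    """One elementary cell: holds one value and forwards the smaller of two."""

    def __init__(self):
        self.held = None

    def step(self, incoming):
        if incoming is None:
            return None
        if self.held is None:
            # An empty cell simply absorbs the incoming value.
            self.held = incoming
            return None
        smaller, larger = sorted((self.held, incoming))
        self.held = larger
        return smaller


def cascade_sort(stream, n_units):
    """Push a stream through a chain of n_units cells, yielding its output."""
    units = [SortingUnit() for _ in range(n_units)]

    def push(value):
        for unit in units:
            value = unit.step(value)
        return value

    for x in stream:
        out = push(x)
        if out is not None:
            yield out
    # Flush: sentinels push the remaining values out in ascending order.
    for _ in range(n_units):
        out = push(INF)
        if out is not None and out != INF:
            yield out
```

With at least as many cells as inputs the output is fully sorted; with fewer cells the chain still accepts a continuous stream, but the output is only partially sorted.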