High performance parallel linear sorter core design

On the design of a high-performance, expandable, sorting engine

Integration, the VLSI Journal, 1994

This paper presents the design and implementation of a modular, expandable and high-performance sorter based on the rebound sorting algorithm of Chen et al. (1978). This single-chip rebound sorter can sort 24 records of 32-bit or 64-bit 2's complement or unsigned data in either ascending or descending order. The modular design of the sorter allows direct cascading of chips for sorting more than 24 records. The monolithic sorter is implemented in 2.0 µm CMOS technology, in a frame of 7.9 mm x 9.2 mm, which supports its 84 I/O. A pipelining scheme was used to achieve a sustained throughput (of cascaded sorting chips) of 10 MHz, while a scan path was used to allow external control of memory elements for testing purposes. The emphasis of this paper is on the architecture and circuit design of the sorter, which results in a significant improvement in functionality, versatility and performance over previously reported monolithic sorter circuits. A comparative study of other hardware sorter implementations, and of sorting with a general-purpose processor, illustrates the performance advantages and functional versatility of the sorter chip reported in this paper.
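The data-format options named in this abstract (2's complement or unsigned keys, ascending or descending order) can be illustrated in software. The sketch below is not the rebound sorting algorithm and says nothing about the chip's circuits; it only shows one common way to order raw fixed-width words under both interpretations. The names `sort_key` and `sort_words` are hypothetical.

```python
def sort_key(word, width=32, signed=True):
    """Map a raw `width`-bit word to an integer that orders correctly.

    Assumption: words arrive as non-negative Python ints holding the raw bits.
    """
    mask = (1 << width) - 1
    word &= mask
    # Flipping the sign bit makes 2's complement order match unsigned order.
    return word ^ (1 << (width - 1)) if signed else word


def sort_words(words, width=32, signed=True, descending=False):
    """Sort raw words as signed or unsigned data, ascending or descending."""
    return sorted(words,
                  key=lambda w: sort_key(w, width, signed),
                  reverse=descending)


words = [0x00000001, 0xFFFFFFFF, 0x80000000, 0x7FFFFFFF]
signed_ascending = sort_words(words, signed=True)
unsigned_ascending = sort_words(words, signed=False)
```

Read as signed data, `0x80000000` is the most negative value and sorts first; read as unsigned data, `0xFFFFFFFF` is the largest value and sorts last.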

Accelerating Sorting Through the Use of Reconfigurable Hardware

In this paper we present the first steps of a work dedicated to exploring the acceleration of sorting algorithms with reconfigurable hardware. The rationale for solving the sorting problem in hardware is presented. We suggest ways to facilitate the use of sorting hardware in real-world applications programming. One of the main goals of the ongoing work is the migration of the well-known quicksort algorithm to hardware. Accordingly, we discuss the algorithm and provide its mapping to hardware.

BulkSort: System Design and Parallel Hardware Implementation Considerations

IJCSIS Vol 17 No 12 December Issue, 2019

Algorithms are commonly perceived as a difficult subject. Many applications today require more complex algorithms than traditional approaches offer, yet researchers look for ways to keep them as simple as possible. In time-critical fields, sorting is one of the foremost problems underlying searching and optimization algorithms. In parallel processing, program instructions are divided among multiple processors by breaking problems into modules that can be executed in parallel, with the goal of reducing the execution time. In this paper, we propose a novel parallel, reconfigurable and adaptive sorting network for the BulkSort algorithm. Our architecture is based on simple, elementary operations such as comparison and binary shifting. The main strength of the proposed solution is its ability to sort in parallel without memory usage. Experimental results show that the proposed model is promising in terms of required resources and its ability to perform a high-speed sorting process. In this study, we use the analysis results of the Simulink design to establish the hardware resources required by the proposed system.

A Versatile Linear Insertion Sorter Based on a FIFO Scheme

2008

A linear sorter based on a First-In First-Out (FIFO) scheme is presented. In a single clock cycle, it is capable of discarding the oldest stored datum and inserting the incoming datum while keeping the rest of the stored data sorted. This type of sorter can be used as a coprocessor or as a module in specialized architectures that must continuously process data for non-linear filters based on order statistics. This FIFO sorting process is described by four different parallel functions that exploit the natural hardware parallelism. The architecture is composed of identical processing elements, so it can be easily adapted to any data length, according to the specific application needs. The use of compact identical processing elements results in a high-performance yet small architecture. Some examples are presented to illustrate the functionality and initialization of the proposed sorter. Results of synthesizing the proposed architecture targeting a Field Programmable Gate Array (FPGA) are presented and compared against other reported hardware-based sorters. Scalability results for several numbers of sorted elements with different bit widths are also presented.

A versatile linear insertion sorter based on an FIFO scheme

Microelectronics Journal, 2009

A linear sorter based on a first-in first-out (FIFO) scheme is presented. In a single clock cycle, it is capable of discarding the oldest stored datum and inserting the incoming datum while keeping the rest of the stored data sorted. This type of sorter can be used as a co-processor or as a module in specialized architectures that must continuously process data for non-linear filters based on order statistics. This FIFO sorting process is described by four different parallel functions that exploit the natural hardware parallelism. The architecture is composed of identical processing elements; thus it can be easily adapted to any data length, according to the specific application needs. The use of compact identical processing elements results in a high-performance yet small architecture. Some examples are presented to illustrate the functionality and initialization of the proposed sorter. The results of synthesizing the proposed architecture targeting a field programmable gate array (FPGA) are presented and compared against other reported hardware-based sorters. The scalability results for several numbers of sorted elements with different bit widths are also presented.
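The input/output behavior described in this abstract (evict the oldest datum, insert the new one, keep the stored data sorted) can be modelled in a few lines of software. This is only a behavioral sketch: the hardware performs the update in one clock cycle with parallel processing elements, whereas this model works sequentially, and it does not reproduce the paper's four parallel functions. The class name `FifoSorter` and its methods are hypothetical.

```python
from bisect import bisect_left, insort
from collections import deque


class FifoSorter:
    """Software model of a fixed-depth sorter that evicts the oldest datum."""

    def __init__(self, depth):
        self.depth = depth
        self.arrival = deque()  # stored data in arrival order, oldest first
        self.ordered = []       # the same data, kept in ascending order

    def push(self, value):
        """Insert `value`, discarding the oldest datum once the sorter is full."""
        if len(self.arrival) == self.depth:
            oldest = self.arrival.popleft()
            # Remove one copy of the oldest value from the sorted view.
            del self.ordered[bisect_left(self.ordered, oldest)]
        self.arrival.append(value)
        insort(self.ordered, value)
        return list(self.ordered)

    def median(self):
        """Order statistic used by median-type non-linear filters."""
        return self.ordered[len(self.ordered) // 2]


sorter = FifoSorter(depth=3)
for sample in [5, 1, 4, 2]:
    window = sorter.push(sample)
```

After the fourth sample the oldest value (5) has been discarded, so the sorted window holds 1, 2 and 4, and `sorter.median()` returns the middle element.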

An Efficient Implementation of Batcher's Odd-Even Merge Algorithm and Its Application in Parallel Sorting Schemes

IEEE Transactions on Computers, 1983

An algorithm is presented to merge two subfiles of size n/2 each, stored in the left and the right halves of a linearly connected processor array, in 3n/2 route steps and log n compare-exchange steps. This algorithm is extended to merge two horizontally adjacent subfiles of size m × n/2 each, stored in an m × n mesh-connected processor array in row-major order, in m + 2n route steps and log mn compare-exchange steps. These algorithms are faster than their counterparts proposed so far.
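For orientation, the classic recursive form of Batcher's odd-even merge sort can be sketched as follows. This is the textbook network, not the paper's implementation: it ignores the route steps needed to move data across a linear or mesh-connected array, and it assumes the input length is a power of two. The function names are hypothetical.

```python
def _compare_exchange(a, i, j):
    """Put the smaller of a[i], a[j] at index i."""
    if a[i] > a[j]:
        a[i], a[j] = a[j], a[i]


def _odd_even_merge(a, lo, n, r):
    """Merge the sorted halves of a[lo:lo+n], visiting every r-th element."""
    step = r * 2
    if step < n:
        _odd_even_merge(a, lo, n, step)      # even-indexed subsequence
        _odd_even_merge(a, lo + r, n, step)  # odd-indexed subsequence
        for i in range(lo + r, lo + n - r, step):
            _compare_exchange(a, i, i + r)
    else:
        _compare_exchange(a, lo, lo + r)


def _sort(a, lo, n):
    if n > 1:
        half = n // 2
        _sort(a, lo, half)
        _sort(a, lo + half, half)
        _odd_even_merge(a, lo, n, 1)


def odd_even_merge_sort(values):
    """Return a sorted copy of `values`; the length must be a power of two."""
    a = list(values)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    _sort(a, 0, n)
    return a


result = odd_even_merge_sort([7, 3, 6, 1, 8, 2, 5, 4])
```

Merging two sorted halves of total size n in this network uses log2 n layers of compare-exchange operations, and the comparisons within a layer touch disjoint pairs, which is why they can run in parallel on a processor array.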

A Comparative Study of Sorting Algorithms with FPGA Acceleration by High Level Synthesis

Nowadays, sorting is an important operation for several real-time embedded applications. It is one of the most commonly studied problems in computer science. Applications such as avionic systems and decision support systems depend on a sorting algorithm for their implementation. However, sorting a large number of elements and/or making real-time decisions requires high processing speed. Therefore, accelerating sorting algorithms using an FPGA can be an attractive solution. In this paper, we propose an efficient hardware implementation of different sorting algorithms (BubbleSort, InsertionSort, SelectionSort, QuickSort, HeapSort, ShellSort, MergeSort and TimSort) from high-level descriptions on the Zynq-7000 platform. In addition, we compare the performance of the different algorithms in terms of execution time, standard deviation and resource utilization. The experimental results show that SelectionSort is 1.01-1.23 times faster than the other algorithms when N < 64; otherwise, TimSort is the best algorithm.

An optimal and processor efficient parallel sorting algorithm on a linear array with a reconfigurable pipelined bus system

Computers & Electrical Engineering, 2009

Optical interconnections attract the attention of many engineers and scientists due to their potential for gigahertz transfer rates and concurrent access to the bus in a pipelined fashion. These unique characteristics of optical interconnections give us the opportunity to reconsider traditional algorithms designed for ideal parallel computing models, such as PRAMs. Since the PRAM model is far from practice, not all algorithms designed on this model can be implemented on a realistic parallel computing system. From this point of view, we study Cole's pipelined merge sort [Cole R. Parallel merge sort. SIAM J Comput 1988;14:770-85] on the CREW PRAM and extend it in an innovative way to an optical interconnection model, the LARPBS (Linear Array with Reconfigurable Pipelined Bus System) model [Pan Y, Li K. Linear array with a reconfigurable pipelined bus system-concepts and applications. J Inform Sci 1998;106:237-58]. Although Cole's algorithm is optimal, communication details have not been provided because it is designed for a PRAM. We close this gap in our sorting algorithm on the LARPBS model and obtain an O(log N)-time optimal sorting algorithm using O(N) processors. This is a substantial improvement over the previous best sorting algorithm on the LARPBS model, which runs in O(log N log log N) worst-case time using N processors [Datta A, Soundaralakshmi S, Owens R. Fast sorting algorithms on a linear array with a reconfigurable pipelined bus system. IEEE Trans Parallel Distribut Syst 2002;13(3):212-22]. Our solution allows processors to be assigned and reused efficiently. We also discover two new properties of Cole's sorting algorithm, which are presented as lemmas in this paper.

Zero-delay FPGA-based odd-even sorting network

2013 IEEE Symposium on Computers & Informatics (ISCI), 2013

Sorting is one of the most well-known problems in computer science and is frequently used for benchmarking computer systems. It can contribute significantly to the overall execution time of a process in a computer system. Dedicated sorting architectures can be used to accelerate applications and/or to reduce energy consumption. In this paper, we propose an efficient sorting network aimed at accelerating the sorting operation in FPGA-based embedded systems. The proposed sorting network is based on an optimized odd-even sorting method and uses a fully pipelined combinational logic architecture with ring-shaped processing. Consequently, the proposed network generates the sorted array of numbers in parallel when the input array of numbers is given, without any delay or lag. Unlike conventional sorting networks, the proposed network does not need memory to hold data and information about sorting, nor does it need an input clock to perform the sorting operations sequentially. We conclude that by using the proposed network in FPGA-based image processing, we can optimize the performance of filters such as the median filter, which demands high-performance sorting operations for real-time applications.
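The abstract does not spell out the optimized method, so the sketch below shows only the textbook odd-even transposition network, which may differ from the paper's design. It is a sequential software model; in hardware each phase is a layer of independent comparators. The function names are hypothetical, and `median_of` illustrates the median-filter use mentioned above.

```python
def odd_even_transposition_sort(values):
    """Return a sorted copy using alternating even and odd comparator phases."""
    a = list(values)
    n = len(a)
    for phase in range(n):  # n phases are enough for n inputs
        start = phase % 2   # even phases pair (0,1),(2,3)...; odd phases (1,2),(3,4)...
        for i in range(start, n - 1, 2):
            # Comparators within one phase touch disjoint pairs, so a
            # hardware network can evaluate them all at the same time.
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


def median_of(window):
    """Median of an odd-length window, e.g. the 9 pixels of a 3x3 neighborhood."""
    ordered = odd_even_transposition_sort(window)
    return ordered[len(ordered) // 2]


pixels = [9, 1, 8, 2, 7, 3, 6, 4, 5]
middle = median_of(pixels)
```

For the nine sample values, the sorted window is 1 through 9 and the median is 5.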

Implementation of a General Purpose Sorter on FPGA

The objective of the paper is to implement a general-purpose sorting algorithm. The paper aims to offer a sorting network that can be deployed in various applications, including impulsive noise reduction filters for image processing and other signal processing applications. The algorithm and sorting network should offer lower hardware complexity and better memory usage options. The work involves the design, simulation and FPGA implementation of a general-purpose sorter processor. The paper presents a detailed survey of sorting algorithms that are amenable to FPGA implementation, homing in on the most suitable one for deployment in digital signal and image processing applications. The work is extended by demonstrating the potential of the implemented sorter in noise reduction filters. Keywords: Digital Algorithm Model & FPGA Implementation, using Xilinx, Finite State Machine for Controller and Design of the Data Path Structure of the Filter