Mohammad M. Mansour | American University of Beirut

Papers by Mohammad M. Mansour

Low-power VLSI decoder architectures for LDPC codes

Proceedings of the 2002 international symposium on Low power electronics and design - ISLPED '02, 2002

Iterative decoding of low-density parity-check (LDPC) codes using the message-passing algorithm has proved to be extraordinarily effective compared to conventional maximum-likelihood decoding. However, the lack of any structural regularity in these essentially random codes is a major challenge for building a practical low-power LDPC decoder. In this paper, we jointly design the code and the decoder to induce the structural regularity needed for a reduced-complexity parallel decoder architecture. This interconnect-driven code design approach eliminates the need for a complex interconnection network while still retaining the algorithmic performance promised by random codes. Moreover, we propose a new approach for computing reliability metrics based on the BCJR algorithm that reduces the message switching activity in the decoder compared to existing approaches. Simulations show that the proposed approach results in power savings of up to 85.64% over conventional implementations.
Categories and Subject Descriptors: B.7.1 [Types and Design Styles]: VLSI; E.4 [Coding and Information Theory]: Error control codes.
However, in order to achieve desired power and throughputs for current applications (e.g., > 1 Mbps in 3G wireless systems, > 1 Gbps in magnetic recording systems), fully parallel and pipelined iterative decoder architectures are needed. Compared to turbo codes, LDPC codes enjoy a significant advantage in terms of computational complexity and are known to have a large amount of inherent parallelism [3]. However, the randomness of LDPC codes results in stringent memory requirements that amount to an order of magnitude increase in complexity compared to those for turbo codes.
A direct approach to implementing a parallel decoder architecture would be to allocate, for each node or cluster of nodes in the graph defining the LDPC code, a function unit for computing the reliability messages, and employ an interconnection network to route messages between function nodes (see Fig. 1). A major problem with this approach is that the interconnection networks require complex wiring to perform global routing of messages and hence must be deeply pipelined (e.g., bidirectional multilayered networks in [4] and 4096-input multiplexers per function unit in [5]). Moreover, the randomness in the pattern of communicating messages leads to routing and congestion problems on the networks, which require extensive buffering to resolve.

Non-Binary Low-Density Parity-Check coded Cyclic Code-Shift Keying

2013 IEEE Wireless Communications and Networking Conference (WCNC), 2013

Classically, the association of high-order modulation techniques with binary channel coding suffers from significant information loss due to the bit-level computation of channel probabilities. In this paper, we investigate the association of Non-Binary Low-Density Parity-Check (NB-LDPC) codes and Cyclic Code-Shift Keying (CCSK), which aims at preventing this information loss by computing the probabilities at the symbol level. Simulation results over Gaussian and Rayleigh channels demonstrate that this association leads to significant performance gains (≈ 2.6 dB over the Gaussian channel and ≈ 3.5 dB over the Rayleigh channel).

Non-binary coded CCSK and Frequency-Domain Equalization with simplified LLR generation

2013 IEEE 24th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC), 2013

In this paper, we investigate the performance of Single-Carrier (SC) transmission with Non-Binary Low-Density Parity-Check (NB-LDPC) coded Cyclic Code-Shift Keying (CCSK) signaling in a multipath environment. We show that the combination of CCSK signaling and non-binary codes results in two key advantages, namely, improved Log-Likelihood Ratio (LLR) generation via correlations and reduced implementation complexity. We demonstrate that Maximum Likelihood (ML) demodulation can be expressed by two circular convolution operations and can thus be processed in the frequency domain. Then, we propose a joint Frequency-Domain Equalization (FDE) and LLR generation scheme that aims at reducing the complexity of the receiver. Finally, we demonstrate through Monte-Carlo simulations and histogram analysis that the proposed CCSK signaling scheme makes SC-FDE systems more robust than commonly employed Hadamard signaling schemes (a gap of ≈ 1.5 dB in favor of CCSK signaling is observed at BER = 10^-5, assuming perfect Channel State Information).
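The frequency-domain view of CCSK demodulation can be illustrated with a minimal sketch. It is not the paper's joint FDE and LLR generation scheme: it assumes a noiseless, real-valued channel with no multipath, a hypothetical ±1 pseudo-random base sequence of length 64, and a naive DFT standing in for an FFT. It shows only that the bank of circular correlations used to score each candidate shift can be obtained from a single product in the frequency domain. All function names are illustrative.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def ccsk_modulate(symbol, base):
    """Map a symbol to the base sequence circularly shifted by `symbol` chips."""
    n = len(base)
    return [base[(i - symbol) % n] for i in range(n)]


def correlate_direct(rx, base):
    """Time-domain circular correlation of rx with every shift of base."""
    n = len(base)
    return [sum(rx[i] * base[(i - s) % n] for i in range(n)) for s in range(n)]


def correlate_freq(rx, base):
    """Same correlation via the DFT: IDFT(DFT(rx) * conj(DFT(base)))."""
    rx_f = dft(rx)
    base_f = dft(base)
    prod = [r * b.conjugate() for r, b in zip(rx_f, base_f)]
    return [v.real for v in dft(prod, inverse=True)]


def ccsk_demodulate(rx, base):
    """Pick the shift with the largest correlation score."""
    scores = correlate_freq(rx, base)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical base sequence and symbol, chosen only for illustration.
random.seed(0)
BASE = [random.choice((-1.0, 1.0)) for _ in range(64)]
SYMBOL = 37
RX = ccsk_modulate(SYMBOL, BASE)
DECODED = ccsk_demodulate(RX, BASE)
```

In this noiseless setting the correlation peak equals the sequence length at the transmitted shift, so `DECODED` recovers `SYMBOL`. With noise, the same correlation scores would feed symbol-level likelihoods rather than a hard decision.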

A Novel Design Methodology for High-Performance Programmable Decoder Cores for AA-LDPC Codes

The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology, 2005

A new parameterized-core-based design methodology targeted for programmable decoders for low-density parity-check (LDPC) codes is proposed. The methodology solves the two major drawbacks of excessive memory overhead and complex on-chip interconnect typical of existing decoder implementations, which limit the scalability, degrade the error-correction capability, and restrict the domain of application of LDPC codes. Diverse memory and interconnect optimizations are performed at the code-design, decoding algorithm, decoder architecture, and physical layout levels, with the following features: 1) architecture-aware (AA)-LDPC code design with embedded structural features that significantly reduce interconnect complexity, 2) a faster and memory-efficient turbo-decoding algorithm for LDPC codes, 3) a programmable architecture having distributed memory, parallel message processing units, and dynamic/scalable transport networks for routing messages, and 4) a parameterized macro-cell layout library implementing the main components of the architecture with scaling parameters that enable low-level transistor sizing and power-rail scaling for power-delay-area optimization. A 14 mm² programmable decoder core for a rate-1/2, length 2048 AA-LDPC code generated using the proposed methodology is presented, which delivers a throughput of 1.6 Gb/s at 125 MHz and consumes 760 mW of power.

VLSI architectures for SISO-APP decoders

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2003

Very large scale integration (VLSI) design methodology and implementation complexities of high-speed, low-power soft-input soft-output (SISO) a posteriori probability (APP) decoders are considered. These decoders are used in iterative algorithms based on turbo codes and related concatenated codes and have shown significant advantage in error correction capability compared to conventional maximum likelihood decoders. This advantage, however, comes at the expense of increased computational complexity, decoding delay, and substantial memory overhead, all of which hinge primarily on the well-known recursion bottleneck of the SISO-APP algorithm. This paper provides a rigorous analysis of the requirements for computational hardware and memory at the architectural level based on a tile-graph approach that models the resource-time scheduling of the recursions of the algorithm. The problem of constructing the decoder architecture and optimizing it for high speed and low power is formulated in terms of the individual recursion patterns which together form a tile graph according to a tiling scheme. Using the tile-graph approach, optimized architectures are derived for the various forms of the sliding-window and parallel-window algorithms known in the literature. A proposed tiling scheme of the recursion patterns, called hybrid tiling, is shown to be particularly effective in reducing memory overhead of high-speed SISO-APP architectures. Simulations demonstrate that the proposed approach achieves savings in area and power in the range of 4.2%-53.1% over the state of the art.

High-throughput LDPC decoders

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2003

A high-throughput memory-efficient decoder architecture for low-density parity-check (LDPC) codes is proposed based on a novel turbo decoding algorithm. The architecture benefits from various optimizations performed at three levels of abstraction in system design, namely LDPC code design, decoding algorithm, and decoder architecture. First, the interconnect complexity problem of current decoder implementations is mitigated by designing architecture-aware LDPC codes having embedded structural regularity features that result in a regular and scalable message-transport network with reduced control overhead. Second, the memory overhead problem in current-day decoders is reduced by more than 75% by employing a new turbo decoding algorithm for LDPC codes that removes the multiple check-to-bit message update bottleneck of the current algorithm. A new merged-schedule message-passing algorithm is also proposed that reduces the memory overhead of the current algorithm for low- to moderate-throughput decoders. Moreover, a parallel soft-input soft-output (SISO) message update mechanism is proposed that implements the recursions of the Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm in terms of simple "max-quartet" operations that do not require lookup tables and incur negligible loss in performance compared to the ideal case. Finally, an efficient programmable architecture coupled with a scalable and dynamic transport network for storing and routing messages is proposed, and a full-decoder architecture is presented. Simulations demonstrate that the proposed architecture attains a throughput of 1.92 Gb/s for a frame length of 2304 bits, and achieves savings of 89.13% and 69.83% in power consumption and silicon area over the state of the art, with a reduction of 60.5% in interconnect length.
Index Terms: Low-density parity-check (LDPC) codes, Ramanujan graphs, soft-input soft-output (SISO) decoder, turbo decoding algorithm, VLSI decoder architectures.
I. INTRODUCTION
The phenomenal success of turbo codes [1], powered by the concept of iterative decoding via message-passing, has rekindled the interest in low-density parity-check (LDPC) codes, which were first discovered by Gallager in 1961 [2]. Recent breakthroughs to within 0.0045 dB of AWGN-channel capacity were achieved with the introduction of irregular LDPC codes in [3], [4], putting LDPC codes on par with turbo codes. However, efficient hardware implementation techniques of turbo decoders have given turbo codes a clear advantage …

A 640-Mb/s 2048-Bit Programmable LDPC Decoder Chip

IEEE Journal of Solid-State Circuits, 2006

A 14.3-mm² code-programmable and code-rate tunable decoder chip for 2048-bit low-density parity-check (LDPC) codes is presented. The chip implements the turbo-decoding message-passing (TDMP) algorithm for architecture-aware (AA-)LDPC codes, which has a faster convergence rate and hence a throughput advantage over the standard decoding algorithm. It employs a reduced-complexity message computation mechanism free of lookup tables, and features a programmable network for message interleaving based on the code structure. The chip decodes any mix of 2048-bit rate-1/2 (3,6)-regular AA-LDPC codes in standard mode by programming the network, and attains a throughput of 640 Mb/s at 125 MHz for 10 TDMP-decoding iterations. In augmented mode, the code rate can be tuned up to 14/16 in steps of 1/16 by augmenting the code. The chip is fabricated in 0.18-µm six-metal-layer CMOS technology, operates at a peak clock frequency of 125 MHz at 1.8 V (nominal), and dissipates an average power of 787 mW.
Index Terms: Architecture-aware low-density parity-check (AA-LDPC) codes, iterative decoders, LDPC codes, turbo-decoding message-passing (TDMP) algorithm, VLSI decoder architectures.
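The quoted operating point implies a few derived figures that the abstract does not state. The back-of-the-envelope sketch below works them out under two assumptions that are not confirmed by the source: that the 640 Mb/s throughput counts all 2048 code bits of a frame, and that the 787 mW average power corresponds to that same operating point. Variable names are illustrative.

```python
# Figures quoted in the abstract.
FRAME_BITS = 2048        # code length in bits
THROUGHPUT_BPS = 640e6   # 640 Mb/s
CLOCK_HZ = 125e6         # 125 MHz
ITERATIONS = 10          # TDMP iterations per frame
POWER_W = 0.787          # 787 mW average power

# Derived figures (assumptions: throughput counts code bits; power is
# measured at the 640 Mb/s operating point).
bits_per_cycle = THROUGHPUT_BPS / CLOCK_HZ           # 5.12 bits per clock
cycles_per_frame = FRAME_BITS / bits_per_cycle       # 400 clocks per frame
cycles_per_iteration = cycles_per_frame / ITERATIONS  # 40 clocks per iteration
energy_per_bit_nj = POWER_W / THROUGHPUT_BPS * 1e9   # about 1.23 nJ per bit

print(bits_per_cycle, cycles_per_frame, cycles_per_iteration, energy_per_bit_nj)
```

Under these assumptions a frame takes about 400 clock cycles, or roughly 40 cycles per iteration, at an energy cost of about 1.23 nJ per decoded bit.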

A low-complexity MIMO subspace detection algorithm

EURASIP Journal on Wireless Communications and Networking, 2015

A low-complexity multiple-input multiple-output (MIMO) subspace detection algorithm is proposed. It is based on decomposing a MIMO channel into multiple subsets of decoupled streams that can be detected separately. The new scheme employs triangular decomposition followed by elementary matrix operations to transform the channel into a generalized elementary matrix whose structure matches the subsets of streams to be detected. The proposed approach avoids matrix inversion and allows subsets to overlap, thus achieving better diversity gain. An optimized detector architecture based on a 2-by-2 ML detector core is also presented. Simulations demonstrate that the proposed algorithm performs to within a few tenths of a dB of the optimum detection algorithm.

Soft-Output MIMO Detectors with Channel Estimation Error

IEEE Signal Processing Letters, 2015

New expressions are presented for the soft-decision bit log-likelihood ratio (LLR) of a MIMO system using quadrature amplitude modulation (QAM), taking into account channel estimation error (CEE). The bit LLRs for the maximum likelihood (ML) and the linear minimum mean-squared error (MMSE) receivers are derived, showing in both receivers an explicit scaling of the LLR that is a function of the QAM symbol and the CEE variance. These new expressions for the LLRs are used to show that only modest improvements in link performance are achieved relative to LLRs that do not take the CEE into account. This indicates that separating the detector design from channel estimation does not significantly impact system performance, which simplifies the overall receiver implementation.
