Christophe Monat - Academia.edu

Papers by Christophe Monat

This paper presents a C library for the software support of single-precision floating-point (FP) arithmetic on processors without FP hardware units, such as VLIW or DSP processor cores for embedded applications. The library provides several levels of compliance with the IEEE 754 FP standard: the complete specification of the standard can be used, or just some relaxed characteristics such as restricted rounding modes or computations without denormal numbers. The library is evaluated on the ST200 VLIW processors from STMicroelectronics.
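
As an illustration of what "computations without denormal numbers" can mean in practice, here is a minimal sketch in C that flushes a subnormal binary32 encoding to a zero of the same sign using integer operations only. The function name `b32_flush_subnormal` is hypothetical and is not taken from the library described above; the library's actual relaxed modes may behave differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the library described above):
 * given the 32-bit encoding of a binary32 value, return the same
 * encoding unless the value is subnormal, in which case return a
 * zero with the same sign. Uses integer operations only. */
static uint32_t b32_flush_subnormal(uint32_t u)
{
    uint32_t biased_exp = (u >> 23) & 0xFFu;   /* 8-bit exponent field  */
    uint32_t fraction   = u & 0x007FFFFFu;     /* 23-bit fraction field */

    /* Subnormal: exponent field is zero and fraction is nonzero. */
    if (biased_exp == 0 && fraction != 0)
        return u & 0x80000000u;                /* keep only the sign bit */
    return u;
}
```

Applied to the operands and the result of each operation, a filter of this kind removes the subnormal cases from the arithmetic code, at the cost of strict IEEE 754 compliance.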

2001 IEEE Workshop on Signal Processing Systems. SiPS 2001. Design and Implementation (Cat. No.01TH8578), 2001

This paper addresses the problem of improving the execution performance of saturated reduction loops on fixed-point instruction-level parallel Digital Signal Processors (DSPs). We first introduce "bit-exact" transformations that are suitable for use in the ETSI and ITU speech coding applications. We then present "approximate" transformations, whose relative precision we are able to compare. Our main results rely on the properties of saturated arithmetic.
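
The distinction between bit-exact and approximate transformations comes from the fact that saturated addition is not associative, so reordering a reduction can change its result. The following C sketch illustrates this with a 16-bit saturating add and two ways of summing the same array. All names (`sat_add16`, `sat_sum_seq`, `sat_sum_split`) are hypothetical, and the two-accumulator split is only a naive reordering chosen for illustration, not one of the transformations proposed in the paper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 16-bit saturating addition. The sum is formed
 * in 32 bits and clamped to the int16_t range. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Reference reduction: one accumulator, elements added in order. */
static int16_t sat_sum_seq(const int16_t *x, size_t n)
{
    int16_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = sat_add16(acc, x[i]);
    return acc;
}

/* Naive reordering: even-indexed and odd-indexed elements go to two
 * independent accumulators, which exposes more parallelism but is
 * not bit-exact with respect to sat_sum_seq in general. */
static int16_t sat_sum_split(const int16_t *x, size_t n)
{
    int16_t even = 0, odd = 0;
    for (size_t i = 0; i < n; i++) {
        if (i % 2 == 0)
            even = sat_add16(even, x[i]);
        else
            odd = sat_add16(odd, x[i]);
    }
    return sat_add16(even, odd);
}
```

For the input {30000, 10000, -20000}, the sequential loop saturates at 32767 after the second element and returns 12767, whereas the split version never saturates and returns 20000.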

7th IEEE International Symposium on Industrial Embedded Systems (SIES'12), 2012

This paper presents some work in progress on the design and implementation of efficient floating-point software support for embedded integer processors. We provide quantitative evidence of the benefits of supporting various non-generic (that is, specialized, fused, or simultaneous) operations in addition to the five basic arithmetic operations: for individual calls, speedups range from 1.12 to 4.86, while on DSP kernels and benchmarks, our approach allows us to be up to 1.34x faster.

Advanced Signal Processing Algorithms, Architectures, and Implementations XIV, 2004

2011 IEEE 20th Symposium on Computer Arithmetic, 2011

We consider the problem of computing IEEE floating-point squares by means of integer arithmetic. We show how the specific properties of squaring can be exploited in order to design and implement algorithms that have much lower latency than those for general multiplication, while still guaranteeing correct rounding. Our algorithm descriptions are parameterized by the floating-point format, aim at high instruction-level parallelism (ILP) exposure, and cover all rounding modes. We show further that their C implementation for the binary32 format yields efficient code for targets like the ST231 VLIW integer processor from STMicroelectronics, with a latency at least 1.75x smaller than that of general multiplication in the same context.

2009 19th IEEE Symposium on Computer Arithmetic, 2009

This paper deals with the design and implementation of low-latency software for binary floating-point division with correct rounding to nearest. The approach we present here targets a VLIW integer processor of the ST200 family, and is based on fast and accurate programs for evaluating some particular bivariate polynomials. We start by giving approximation and evaluation error conditions that are sufficient to ensure correct rounding. Then we describe the heuristics used to generate such evaluation programs, as well as those used to automatically validate their accuracy. Finally, we propose, for the binary32 format, a complete C implementation of the resulting division algorithm. With the ST200 compiler, our approach yields a speed-up by a factor of almost 1.8 compared to previous implementations.

17th IEEE Symposium on Computer Arithmetic (ARITH'05), 2005

... All data registers are 40 bits wide and can be used with signed/unsigned 16-, 32-, or 40-bit integers or fractional values (with several formats). For multiplication (and MAC), as the 2 operands of the multiplier are only 16 bits wide, both the lower or the upper 16-bit half words (of the 32 ...

2007 International Symposium on Industrial Embedded Systems, 2007

This paper presents some work in progress on fast and accurate floating-point arithmetic software for ST200-based embedded systems. We show how to use some key architectural features to design codes that achieve correct rounding-to-nearest without sacrificing efficiency. This is illustrated with the square root function, whose implementation given here is over 35% faster than the previously best one for such systems.

Microelectronic Engineering, 2000

IEEE Transactions on Computers, 2000

In this paper we show how to reduce the computation of correctly rounded square roots of binary floating-point data to the fixed-point evaluation of some particular integer polynomials in two variables. By designing parallel and accurate evaluation schemes for such bivariate polynomials, we show further that this approach allows for high instruction-level parallelism (ILP) exposure, and thus potentially low-latency implementations. Then, as an illustration, we detail a C implementation of our method in the case of IEEE 754-2008 binary32 floating-point data (formerly called single precision in the 1985 version of the IEEE 754 standard). This software implementation, which assumes 32-bit integer arithmetic only, is almost complete in the sense that it supports special operands, subnormal numbers, and all rounding modes, but not exception handling (that is, status flags are not set). Finally, we have carried out experiments with this implementation using the ST200 VLIW compiler from STMicroelectronics. The results obtained demonstrate the practical interest of our approach in that context: for all rounding modes, the generated assembly code is optimally scheduled and indeed has low latency (23 cycles).

Proceedings of the 4th International Workshop on Parallel and Symbolic Computation - PASCO '10, 2010

Recently, some high-performance IEEE 754 single-precision floating-point software has been designed, which aims at best exploiting some features (integer arithmetic, parallelism) of the STMicroelectronics ST200 Very Long Instruction Word (VLIW) processor. We review here the techniques and software tools used or developed for this design and its implementation, and how they allowed very high instruction-level parallelism (ILP) exposure. Those key points include a hierarchical description of function evaluation algorithms, the exploitation of the standard encoding of floating-point data, the automatic generation of fast and accurate polynomial evaluation schemes, and some compiler optimizations.
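
To make "the exploitation of the standard encoding of floating-point data" concrete, here is a minimal C sketch that extracts the sign, biased exponent, and fraction fields of a binary32 value with integer shifts and masks. It assumes that the C type `float` is IEEE 754 binary32 on the target. The names `b32_fields`, `b32_bits`, and `b32_unpack` are hypothetical and do not come from the software reviewed above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three fields of a binary32 encoding. */
typedef struct {
    uint32_t sign;        /* 1 bit                     */
    uint32_t biased_exp;  /* 8 bits, bias is 127       */
    uint32_t fraction;    /* 23 bits, without the implicit leading bit */
} b32_fields;

/* Reinterpret a float as its 32-bit encoding.
 * Assumption: float is IEEE 754 binary32 on this target. */
static uint32_t b32_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);   /* well-defined way to copy the bits */
    return u;
}

/* Split an encoding into its fields using integer operations only. */
static b32_fields b32_unpack(uint32_t u)
{
    b32_fields f;
    f.sign       = u >> 31;
    f.biased_exp = (u >> 23) & 0xFFu;
    f.fraction   = u & 0x007FFFFFu;
    return f;
}
```

For example, `b32_unpack(b32_bits(-3.0f))` gives sign 1, biased exponent 128, and fraction 0x400000, since -3.0 = -1.5 x 2^1. The three extractions are independent of one another, which is the kind of property that helps expose instruction-level parallelism on a VLIW target.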