Complex instruction and software library mapping for embedded software using symbolic algebra

Complex library mapping for embedded software using symbolic algebra

Proceedings of the 39th conference on Design automation - DAC '02, 2002

Embedded software designers often use libraries that have been pre-optimized for a given processor to achieve higher code quality. However, using such libraries in legacy code optimization is nontrivial and typically requires manual intervention. This paper presents a methodology that maps algorithmic constructs of the software specification to a library of complex software elements. This library-mapping step is automated by using symbolic algebra techniques. We illustrate the advantages of our methodology by optimizing an algorithmic-level description of an MPEG Layer III (MP3) audio decoder for the Badge4 portable embedded system. During the optimization process we use commercially available libraries with complex elements ranging from simple mathematical functions such as exp to the IDCT routine. We implemented the MP3 decoder software on Badge4 running the embedded Linux operating system and measured its performance and energy consumption. The optimized MP3 audio decoder runs 300 times faster than the original code obtained from the standards body while consuming 400 times less energy. Since our optimized MP3 decoder runs 3.5 times faster than real-time, additional energy can be saved by using processor frequency and voltage scaling.

Low power embedded software optimization using symbolic algebra

Proceedings 2002 Design, Automation and Test in Europe Conference and Exhibition, 2002

The market demand for portable multimedia applications has exploded in recent years. Unfortunately, for such applications current compilers and software optimization methods often require designers to do part of the optimization manually. Specifically, the high-level arithmetic optimizations and the use of complex instructions are left to the designers' ingenuity. In this paper, we present a tool flow, SymSoft, that automates the optimization of power-intensive algorithmic constructs using symbolic

Automatic instruction set extension and utilization for embedded processors

Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003, 2003

There is a growing demand for application-specific embedded processors in system-on-a-chip designs. Current tools and design methodologies often require designers to manually specialize the processor based on an application. Moreover, the use of the new complex instructions added to the processor is often left to designers' ingenuity. In this paper, we present a solution that automatically groups dataflow operations in the application software as potential new complex instructions. The set of possible instructions is then automatically used for code generation combined with high-level arithmetic optimizations using symbolic algebra. Symbolic arithmetic manipulations provide a novel and effective method for instruction selection that is necessary due to the complexity of the automatically identified instructions. We have used our methodology to automatically add new instructions to Tensilica processors for a set of examples. Our results show that our tools improve designers' productivity and efficiently specialize an embedded processor for the given application such that the execution time is greatly improved.

Application of symbolic computer algebra in high-level data-flow synthesis

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2003

The growing market of multimedia applications has required the development of complex application-specified integrated circuits with significant data-path portions. Unfortunately, most high-level synthesis tools and methods cannot automatically synthesize data paths such that complex arithmetic library blocks are intelligently used. Namely, most arithmetic-level optimizations are not supported and they are left to the designer's ingenuity. In this paper, we show how symbolic algebra can be used to construct arithmetic-level decomposition algorithms. We introduce our tool, SymSyn, that optimizes and maps data flow descriptions into data paths using complex arithmetic components. SymSyn uses two new algorithms to find either minimal component mapping or minimal critical path delay (CPD) mapping of the data flow. In this paper, we give an overview of the proposed algorithms. We also show how symbolic manipulations such as tree-height-reduction, factorization, expansion, and Horner transformation are incorporated in the preprocessing step. Such manipulations are used as guidelines in initial library element selection to accelerate the proposed algorithms. Furthermore, we demonstrate how substitution can be used for multiexpression component sharing and CPD optimization.
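The Horner transformation mentioned in the abstract above can be illustrated with a small sketch. It is not taken from SymSyn; the function names `eval_naive` and `eval_horner` are hypothetical, and the multiplication counter exists only to show why the rewritten form is attractive when each step can map to a multiply-add component.

```python
def eval_naive(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) term by term, counting multiplications."""
    total, mults = 0, 0
    for i, c in enumerate(coeffs):
        term = c
        for _ in range(i):  # build c * x**i with i multiplications
            term *= x
            mults += 1
        total += term
    return total, mults


def eval_horner(coeffs, x):
    """Evaluate the same polynomial in Horner form: one multiply-add per degree."""
    acc, mults = coeffs[-1], 0
    for c in reversed(coeffs[:-1]):
        acc = acc * x + c  # a single multiply-accumulate step
        mults += 1
    return acc, mults


# 1 + 2x + 3x^2 + 4x^3 at x = 2
print(eval_naive([1, 2, 3, 4], 2))   # (49, 6)
print(eval_horner([1, 2, 3, 4], 2))  # (49, 3)
```

Both functions return the same value, but the Horner form needs a number of multiplications equal to the polynomial degree rather than one that grows quadratically with it.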

Fast and accurate multiprocessor architecture exploration with symbolic programs

… Automation and Test …, 2003

In system-level platform-based embedded systems design, the mapping model is a crucial link between the application model and the architecture model. All three models must match when design-space exploration has to be fast and accurate, and when exploration methods and design methods have to be closely related. For the media processing application domain we present an architecture model and corresponding mapping model that meet these requirements better than previously proposed models. A case study illustrates this improvement.

The use of compiler optimizations for embedded systems software

Crossroads, 2008

Optimizing embedded applications using a compiler can generally be broken down into two major categories: hand-optimizing code to take advantage of a particular processor's compiler and applying built-in optimization options to proven and well-polished code. The former is well documented for different processors, but little has been done to find generalized methods for optimal sets of compiler options based on common goal criteria such as application code size, execution speed, power consumption, and build time. This article discusses the fundamental differences between these two general categories of optimizations using the compiler. Examples of common, built-in compiler options are presented using a simulated ARM processor and C compiler, along with a simple methodology that can be applied to any embedded compiler for finding an optimal set of compiler options.

Address register-oriented optimizations for embedded processors

2001

Embedded systems consisting of the application program ROM, RAM, the embedded processor core, and any custom hardware on a single wafer are becoming increasingly common in application domains such as signal processing. Given the rapid deployment of these systems, programming on such systems has shifted from assembly language to high-level languages such as C, C++, and Java. The processors used in such systems are usually targeted toward specific application domains, e.g., digital signal processing (DSP). As a result, these embedded processors include application-specific instruction sets, complex and irregular data paths, etc., thereby rendering code generation for these processors difficult. In this paper, we present new code optimization techniques for embedded fixed-point DSP processors which have limited on-chip program ROM and include indirect addressing modes using post-increment and post-decrement operations. We present a heuristic to reduce code size by taking advantage of these addressing modes. Our solution aims at improving the offset assignment produced by Liao et al.'s solution. It finds a layout of variables in RAM, so that it is possible to subsume explicit address register manipulation instructions into other instructions as a post-increment or post-decrement operation. Experimental results show the effectiveness of our solution. Next, we propose an algorithm that uses commutative transformations to change the access sequence, thereby reducing the code size. Some DSP cores allow for the post-increment or post-decrement value to be larger than one. For such processors, we also present an approach that is incremental and has some advantages over another proposed solution that requires the expensive generation of cliques.
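The offset assignment problem described above can be made concrete with a minimal sketch. It assumes a single address register and a post-increment/decrement range of one, and it ignores the initial load of the register. The names `soa_cost` and `best_layout` are hypothetical, and the exhaustive search stands in for the paper's heuristic, which is not reproduced here.

```python
from itertools import permutations


def soa_cost(layout, accesses):
    """Count explicit address-register updates needed for a memory layout.

    A transition between consecutive accesses is free when the two variables
    sit at the same or adjacent addresses (post-increment/decrement by one).
    """
    addr = {var: i for i, var in enumerate(layout)}
    cost = 0
    for prev, cur in zip(accesses, accesses[1:]):
        if abs(addr[cur] - addr[prev]) > 1:
            cost += 1  # needs a separate address-register instruction
    return cost


def best_layout(accesses):
    """Brute-force search over layouts; only feasible for a handful of variables."""
    variables = sorted(set(accesses))
    return min((soa_cost(p, accesses), p) for p in permutations(variables))


accesses = ["a", "b", "c", "a", "c", "d"]
print(soa_cost(["a", "b", "c", "d"], accesses))  # 2
print(soa_cost(["b", "a", "c", "d"], accesses))  # 1
print(best_layout(accesses)[0])                  # 1
```

Reordering the variables in memory removes one explicit address update for this access sequence, which is the kind of code-size saving the abstract refers to.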

Exact and approximate algorithms for the extension of embedded processor instruction sets

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2000

In embedded computing, cost, power, and performance constraints call for the design of specialized processors, rather than for the use of the existing off-the-shelf solutions. While the design of these application-specific CPUs could be tackled from scratch, a cheaper and more effective option is that of extending the existing processors and toolchains. Extensibility is indeed a feature now offered in real designs, e.g., by processors such as Tensilica Xtensa [T. R. Halfhill, Microprocess Rep., 2003], ARC ARCtangent [T. R. Halfhill, Microprocess Rep., 2000], STMicroelectronics ST200 [P. Faraboschi, G. Brown, J. A. Fisher, G. Desoli, and F. Homewood, Proc. 27th Annu. Int. Symp. Computer Architecture, 2000, p. 203], and MIPS CorExtend [T. R. Halfhill, Microprocess Rep., 2003

Instruction selection for embedded DSPs with complex instructions

1996

We address the problem of instruction selection in code generation for embedded digital signal processors. Recent work has shown that this task can be efficiently solved by tree covering with dynamic programming, even in combination with the task of register allocation. However, performing instruction selection by tree covering only does not exploit available instruction-level parallelism, for instance in form of multiply-accumulate instructions or parallel data moves. In this paper we investigate how such complex instructions may affect detection of optimal tree covers, and we present a two-phase scheme for instruction selection which exploits available instruction-level parallelism. At the expense of higher compilation time, this technique may significantly increase the code quality compared to previous work, which is demonstrated for a widespread DSP.
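As a rough illustration of tree covering with dynamic programming, the following sketch computes the minimum number of instructions needed to cover a small expression tree. It is not the paper's two-phase scheme. The tuple-based tree encoding, the unit instruction costs, and the function name `cover_cost` are all assumptions made for this example, and the multiply-accumulate (MAC) instruction is modelled simply as one extra tree pattern.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def cover_cost(node, use_mac=True):
    """Minimum instruction count to cover an expression tree.

    A node is either a string (an operand already in a register, cost 0)
    or a tuple (op, left, right) with op in {"add", "mul"}.
    """
    if isinstance(node, str):
        return 0
    op, left, right = node
    # Default pattern: one instruction for this operator plus both subtrees.
    best = 1 + cover_cost(left, use_mac) + cover_cost(right, use_mac)
    if use_mac and op == "add":
        # Hypothetical MAC pattern: add(x, mul(y, z)) covered by one instruction.
        for addend, other in ((left, right), (right, left)):
            if not isinstance(other, str) and other[0] == "mul":
                _, m1, m2 = other
                candidate = (1 + cover_cost(addend, use_mac)
                             + cover_cost(m1, use_mac)
                             + cover_cost(m2, use_mac))
                best = min(best, candidate)
    return best


# a + b*c + d*e
tree = ("add", ("add", "a", ("mul", "b", "c")), ("mul", "d", "e"))
print(cover_cost(tree, use_mac=False))  # 4
print(cover_cost(tree, use_mac=True))   # 2
```

Memoizing on the subtree means each node is evaluated once per setting, which is what keeps the dynamic-programming formulation efficient on larger trees.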