Guido Araujo - Profile on Academia.edu

Papers by Guido Araujo

This paper presents an environment based on SystemC for architecture specification of programmable systems. Making use of the new architecture description language ArchC, which captures the processor description as well as the memory subsystem configuration, this environment offers support for system-level specification, intended for platform-based design. As a case study, we present the memory architecture exploration for a simple image processing application; a more robust evaluation of the environment is performed through the execution of some real-world benchmarks.

In this paper we investigate the problem of code generation for address computation for DSP processors. This work is divided into four parts. First, we propose a branch instruction design which can guarantee minimum overhead for programs that make use of implicit indirect addressing.

IEEE Transactions on Computer-aided Design of Integrated Circuits and Systems, 2005

Reconfigurable systems have been shown to achieve significant performance speedup through architectures that map the most time-consuming application kernel modules or inner loops to a reconfigurable datapath. As each portion of the application starts to execute, the system partially reconfigures the datapath so as to perform the corresponding computation. The reconfigurable datapath should have as few and as simple hardware blocks and interconnections as possible, in order to reduce its cost, area, and reconfiguration overhead. To achieve that, hardware blocks and interconnections should be reused as much as possible across the application. We represent each piece of the application as a data-flow graph (DFG). The DFG merging process identifies similarities among the DFGs and produces a single datapath that can be dynamically reconfigured and has a minimum area cost, when considering both hardware blocks and interconnections. In this paper we present a novel technique for the DFG merge problem, and we evaluate it using programs from the MediaBench benchmark. Our algorithm's execution time approaches that of the fastest previous solution to this problem, and it produces datapaths with an average area reduction of 20%. When compared to the best known area solution, our approach produces datapaths with area costs equivalent to (and in many cases better than) it, while achieving impressive speedups. Index Terms: high-level synthesis, reconfigurable computing, resource sharing.

Efficient address code optimization is a central problem in code generation for processors with restricted addressing modes, like Digital Signal Processors (DSPs). This paper proposes a new heuristic to solve the Simple Offset Assignment (SOA) problem, the problem of allocating scalar variables to memory so as to minimize addressing code. This new approach, called Coalescing SOA (CSOA), performs variable memory slot coalescing simultaneously with offset assignment computation. Experimental results, based on compiling MediaBench benchmark programs with the LANCE compiler, reveal a very significant improvement over the previous solutions to SOA. In fact, CSOA produces, on average, 37.3% fewer update instructions than the prior solution that performs memory slot coalescing before applying SOA, and 66.2% fewer update instructions than the best traditional SOA solution.
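
As a rough illustration of the cost that SOA minimizes, the sketch below counts explicit address-register update instructions for a given variable-to-offset assignment. It is not the CSOA heuristic from the paper. It assumes a single address register with free auto-increment/decrement by one, and the function name `count_address_updates`, the access sequence, and the three layouts are all hypothetical.

```python
def count_address_updates(offsets, access_sequence):
    """Count explicit address-register updates for one address register.

    offsets: dict mapping each variable name to its memory offset.
             Two variables may share an offset (slot coalescing), which
             is only valid if their live ranges do not overlap.
    access_sequence: variable names in the order the code accesses them.

    Assumption: moving the register by at most 1 between consecutive
    accesses is free (auto-increment/decrement); any larger jump costs
    one explicit update instruction.
    """
    updates = 0
    for prev, curr in zip(access_sequence, access_sequence[1:]):
        if abs(offsets[curr] - offsets[prev]) > 1:
            updates += 1
    return updates


# Hypothetical access sequence of four scalar variables.
accesses = ["a", "b", "c", "a", "d", "b"]

# Declaration-order layout: a, b, c, d at offsets 0..3.
naive = count_address_updates({"a": 0, "b": 1, "c": 2, "d": 3}, accesses)

# A reordered layout with one slot per variable.
reordered = count_address_updates({"c": 0, "b": 1, "a": 2, "d": 3}, accesses)

# c and d share a slot; in this sequence c is last used before d is
# first used, so their live ranges are assumed not to overlap.
coalesced = count_address_updates({"a": 0, "c": 1, "d": 1, "b": 2}, accesses)

print(naive, reordered, coalesced)
```

In this toy sequence the reordered layout needs fewer updates than the declaration-order one, and sharing a slot between `c` and `d` reduces the count further, which is the kind of interaction between coalescing and offset assignment the abstract refers to.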

Instruction set design and optimization for address computation in DSP architectures

The increasing demand for wireless devices running mobile applications has renewed interest in research on high-performance, low-power processors that can be programmed using very compact code. One way to achieve this goal is to design specialized processors with short instruction formats and shallow pipelines. Given that it enables such architectural features, indirect addressing is the most used addressing mode in embedded programs. This paper analyzes the problem of allocating address registers to array references in loops using auto-increment addressing mode. It builds on previous work, which is based on a heuristic that merges address register live ranges. We prove, for the first time, that the merge operation is NP-hard in general, and show the existence of an optimal linear-time algorithm, based on dynamic programming, for a special case of the problem.

In this paper we address the problem of code generation for basic blocks in heterogeneous memory-register DSP processors. We propose a new technique, based on register-transfer paths, that can be used for efficiently dismantling basic block DAGs (Directed Acyclic Graphs) into expression trees. This approach builds on recent results which report an optimal code generation algorithm for expression trees for these architectures. This technique has been implemented and experimentally validated for the TMS320C25, a popular fixed-point DSP processor. The results show that good code quality can be obtained using the proposed technique. An analysis of the type of DAGs found in the DSPstone benchmark programs reveals that the majority of basic blocks in this benchmark set are expression trees and leaf DAGs. This leads to our claim that tree-based algorithms, like the one described in this paper, should be the technique of choice for basic block code generation with heterogeneous memory-register architectures.

This paper examines the problem of code generation for expression trees on non-homogeneous register set architectures. It proposes and proves the optimality of an O(n) algorithm for the tasks of instruction selection, register allocation and scheduling on a class of architectures defined as the [1, ∞] Model. Optimality is guaranteed by sufficient conditions derived from the Register Transfer Graph (RTG), a structural representation of the architecture which depends exclusively on the processor Instruction Set Architecture (ISA). Experimental results using the TMS320C25 as the target processor show the efficacy of the approach.

Recent work in reconfigurable computing research has shown that a substantial performance speedup can be achieved through architectures that map the most relevant application inner loops to a reconfigurable datapath. Any solution to this problem must be able to synthesize a datapath for each loop and to merge them together into a single reconfigurable datapath. The main contribution of this paper is a novel graph-based technique for the datapath merge problem. This approach is based on the solution of a maximum clique problem that merges datapaths one at a time. A set of experiments, using the MediaBench benchmark, shows that the proposed technique produces 24% fewer datapath interconnections than a previous solution to this problem.

ACM Transactions on Embedded Computing Systems, 2004

Increasing nonrecurring engineering and mask costs are making it harder to turn to hardwired application-specific integrated circuit (ASIC) solutions for high-performance applications. The volume required to amortize these high costs has been increasing, making it increasingly expensive to afford ASIC solutions for medium-volume products. This has led designers to seek programmable solutions of varying sorts using these so-called programmable platforms. These programmable platforms span a large range, from bit-level programmable field programmable gate arrays to word-level programmable application-specific, and in some cases even general-purpose, processors. The programmability comes with a power and performance overhead. Attempts to reduce this overhead typically involve making some core hardwired, ASIC-like logic blocks accessible to the programmable elements. This paper presents one such hybrid solution in this space: a relatively simple processor with a dynamically reconfigurable datapath acting as an accelerating coprocessor. This datapath consists of hardwired function units and reconfigurable interconnect. We present a methodology for the design of these solutions and illustrate it with two complete case studies, an MPEG2 coder and a GSM coder, to show how significant speedups can be obtained using relatively little hardware. This work is part of the MESCAL project, which is geared towards developing design environments for the development of application-specific platforms.

IEEE Transactions on Very Large Scale Integration Systems, 2000

Reducing program size has become an important goal in the design of modern embedded systems targeted to mass production. This problem has driven efforts aimed at designing processors with shorter instruction formats (e.g., ARM Thumb and MIPS16) or able to execute compressed code (e.g., IBM PowerPC 405). This paper proposes three code compression algorithms for embedded RISC architectures. In all algorithms, the encoded symbols are extracted from program expression trees. The algorithms differ in the granularity of the encoded symbols, which are selected from whole trees, parts of trees, or single instructions. Dictionary-based decompression engines are proposed for each compression algorithm. Experimental results, based on SPEC CINT95 programs running on the MIPS R4000 processor, reveal an average compression ratio of 53.6% (31.5%) if the area of the decompression engine is (not) considered.
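
To make the idea of dictionary-based compression and the two reported ratios concrete, here is a minimal sketch at the coarsest level of simplification: each distinct instruction is one dictionary entry, and the program is replaced by fixed-width dictionary indices. It does not reproduce the paper's tree-based symbols or its decompression engine. The toy program, the 32-bit instruction width, the function names, and the convention that compression ratio means compressed size divided by original size are all assumptions made for illustration.

```python
def build_dictionary(instructions):
    """Return the distinct instructions in order of first appearance."""
    return list(dict.fromkeys(instructions))


def compress(instructions, dictionary):
    """Replace each instruction by its index in the dictionary."""
    index = {ins: i for i, ins in enumerate(dictionary)}
    return [index[ins] for ins in instructions]


def decompress(codewords, dictionary):
    """Dictionary lookup: map each codeword back to its instruction."""
    return [dictionary[c] for c in codewords]


def compression_ratios(instructions, dictionary, word_bits=32):
    """Return (ratio without dictionary, ratio with dictionary).

    Assumption: ratio = compressed size / original size, with
    fixed-width codewords just wide enough to index the dictionary.
    """
    index_bits = max(1, (len(dictionary) - 1).bit_length())
    original_bits = len(instructions) * word_bits
    code_bits = len(instructions) * index_bits
    dictionary_bits = len(dictionary) * word_bits
    return (code_bits / original_bits,
            (code_bits + dictionary_bits) / original_bits)


# Hypothetical 8-instruction program with 3 distinct instructions.
program = [
    "lw r1,0(r2)", "add r3,r1,r4", "sw r3,0(r2)",
    "lw r1,0(r2)", "add r3,r1,r4",
    "lw r1,0(r2)", "add r3,r1,r4", "sw r3,0(r2)",
]

dictionary = build_dictionary(program)
codewords = compress(program, dictionary)
without_dict, with_dict = compression_ratios(program, dictionary)
print(without_dict, with_dict)
```

The gap between the two numbers shows why the abstract reports the ratio both with and without the decompression engine: the storage needed to decode is part of the real cost, and here the dictionary stands in for that overhead.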

Although SystemC is considered the most promising language for system-on-chip functional modeling, it doesn't come with power modeling capabilities. This work presents PowerSC, a novel power estimation framework which instruments SystemC for power characterization, modeling and estimation. Since it is entirely based on SystemC, PowerSC allows consistent power modeling from the highest to the lowest abstraction level. In addition, the framework's API provides facilities to integrate alternative modeling techniques, either at the same or at different abstraction levels. As a result, the required power evaluation infrastructure is reduced to a minimum: the standard SystemC library, the PowerSC library itself and a C++ compiler. Experimental results show both the effectiveness and the efficiency of our framework. On the one hand, two well-known macromodeling techniques were easily integrated into the framework, leading to acceptable average errors at the RT level. On the other hand, library characterization was more than 13× faster than a typical industrial flow.

Decreasing the program size has become an important goal in the design of embedded systems targeted at mass production. This problem has led to a number of efforts aimed at designing processors with shorter instruction formats (e.g., ARM Thumb and MIPS16), or that can execute compressed code (e.g., IBM CodePack PowerPC). Much of this work has been directed towards RISC architectures, though. This paper proposes a solution to the problem of executing compressed code on embedded DSPs. The experimental results reveal an average compression ratio of 75% for typical DSP programs running on the TMS320C25 processor. This number includes the size of the decompression engine. Decompression is performed by a state machine that translates codewords into instruction sequences during program execution. The decompression engine is synthesized using the AMS standard cell library and a 0.6 µm, 5 V technology. Gate-level simulation of the decompression engine reveals minimum operation frequencies of 150 MHz.

Using this technique, we show that tree and operand patterns have exponential frequency distributions. A set of experiments was designed to explore this feature. They reveal an average compression ratio of 43% for SPECInt95 programs. A decompression engine is proposed, which assembles tree and operand patterns into uncompressed instruction sequences. An encoding that improves the design of the decompression engine results in a 48% compression ratio. Compression ratio numbers take into consideration an estimate of the decompression engine size.

This paper presents the use of the ArchC Architecture Description Language (ADL) as a support tool for computer architecture courses. ArchC enables students to perform several experiments using its automatically generated SystemC simulators, covering topics from simple single-cycle (functional) models to pipeline and memory hierarchy simulation. We show how instructive the process of modeling a processor using an ADL can be and suggest several possible exercises, following the course structure presented in Hennessy and Patterson's classic computer architecture textbook. Moreover, we report how the experience of assigning students to study and model modern embedded architectures has provided good results in an undergraduate computer architecture course at IC-UNICAMP. The simplicity and flexibility of the ADL, along with its simulation features, proved it to be a useful tool not only for research, but also for computer architecture education.

In this paper we describe a design exploration methodology for clustered VLIW architectures. The central idea of this work is a set of three techniques aimed at reducing the cost of expensive inter-cluster copy operations. Instruction scheduling is performed using a list-scheduling algorithm that stores operand chains into the same register file. Functional units are assigned to clusters based on the application inter-cluster communication pattern. Finally, a careful insertion of pipeline bypasses is used to increase the number of data dependencies that can be satisfied by pipeline register operands. Experimental results, using the SPEC95 benchmark and the IMPACT compiler, reveal a substantial reduction in the number of copies between clusters.

This paper presents an architecture description language (ADL) called ArchC, which is an open-source SystemC-based language that is specialized for processor architecture description. Its main goal is to provide enough information, at the right level of abstraction, to allow users to explore and verify new architectures by automatically generating software tools like simulators and co-verification interfaces. ArchC's key features are a storage-based co-verification mechanism that automatically checks the consistency of a refined ArchC model against a reference (functional) description, memory hierarchy modeling capability, the possibility of integration with other SystemC IPs, and the automatic generation of high-level SystemC simulators. We have used ArchC to synthesize both functional and cycle-based simulators for the MIPS, Intel 8051 and SPARC V8 processors, as well as functional models of modern architectures like TMS320C62x, XScale and PowerPC.
