P. Marwedel - Academia.edu
Papers by P. Marwedel
Interest in synthesis of Application Specific Instruction Set Processors (ASIPs) has increased considerably, and a number of methodologies have been proposed for ASIP design. A key step in ASIP synthesis involves deciding architectural features based on application requirements and constraints. In this report we observe the effect of changing register file size on performance as well as on power and energy consumption. Detailed data is generated and analyzed for a number of application programs. Results indicate that the choice of an appropriate number of registers has a significant impact on performance.
Proceedings Design, Automation and Test in Europe Conference and Exhibition
In the context of portable embedded systems, reducing energy is one of the prime objectives. Most high-end embedded microprocessors include on-chip instruction and data caches, along with a small energy-efficient scratchpad. Previous approaches for utilizing the scratchpad did not consider caches and hence fail for the au courant architecture. In the presented work, we use the scratchpad for storing instructions and propose a generic Cache Aware Scratchpad Allocation (CASA) algorithm. We report an average reduction of 8-29% in instruction memory energy consumption compared to a previously published technique for benchmarks from the Mediabench suite. The scratchpad in the presented architecture is similar to a preloaded loop cache. Comparing the energy consumption of our approach against preloaded loop caches, we report average energy savings of 20-44%.
Proceedings of 1994 IEEE Workshop on VLSI Signal Processing
Efficient embedded DSP system design requires methods of hardware/software codesign. In this contribution we focus on software synthesis for partitioned system behavioral descriptions. In previous approaches, this task is performed by compiling the behavioral descriptions onto standard processors using target-specific compilers. It is argued that abandoning this restriction allows for higher degrees of freedom in design space exploration. In turn, this demands retargetable code generation tools. We present different schemes for DSP code generation using the MSSQ microcode generator. Experiments with industrial applications revealed that retargetable DSP code generation based on structural hardware descriptions is feasible, but there exists a strong dependency between the behavioral description style and the resulting code quality. As a result, necessary features of high-quality retargetable DSP code generators are identified.
Interest in low power embedded systems has increased considerably in the past few years. To produce low power code and to allow an estimation of power consumption of software running on embedded systems, a power model was developed based on physical measurement using an evaluation board and integrated into a compiler and profiler. The compiler uses the power information to choose instruction sequences consuming less power, whereas the profiler gives information about the total power consumed during execution of the generated program.
This paper binds together a collection of short presentations on Hardware Description Languages (HDLs) developed in Europe and provides a view of the history of HDLs during the last three decades. This historical review aims to present the ideas, conceived in these previous languages, which are now implemented in the standard languages. Furthermore, this paper will highlight those early concepts which have yet to be implemented in the evolving standards or could provide a way to unify them (like VHDL or Verilog or SDL) within a formally defined multi-language environment. Among a large number of European works over three decades, we have selected a sample from different countries (France, Germany, the U.K., Italy) which have been implemented and used reliably in various segments of the industry. The selected HDLs, with the date of origination, are: CASSANDRE (1967), MIMOLA (1977), DACAPO (1977), ELLA (1979), ART (1980), and CASCADE (1981). We do not claim to give an exhaustive review, which is not the goal of this presentation, and have consciously left aside several works as valuable as those selected. We have not addressed, for example, the "synchronous languages" very well developed in France, such as ESTEREL, LUSTRE or SIGNAL. Several other works existed in Germany, such as KARL, which was popular in the eighties and benefits from a large bibliography, or REGLAN. Among the HDLs not presented here, we should also mention CONLAN (a major international standardization effort involving a notable European contribution). We have tried to compare the main features of the chosen languages according to a list of criteria and briefly identify those which are still missing in the recognized worldwide standards.
15th International Symposium on System Synthesis, 2002.
Springer Netherlands eBooks, 2011
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2006
In the context of mobile embedded devices, reducing energy is one of the prime objectives. Memories are responsible for a significant percentage of a system's aggregate energy consumption. Consequently, novel memories as well as novel memory architectures are being designed to reduce the energy consumption. Caches and scratchpads are two contrasting memory architectures. The former relies on hardware logic while the latter relies on software for its utilization. To meet different requirements, most contemporary high-end embedded microprocessors include on-chip instruction and data caches along with a scratchpad. Previous approaches for utilizing the scratchpad did not consider caches and hence fail for contemporary high-end systems. Instructions are allocated onto the scratchpad, while taking into account the behavior of the instruction cache present in the system. The problem of scratchpad allocation is solved using a heuristic and also optimally using an integer linear programming formulation. An average reduction of 7% and 23% in processor cycles and instruction-memory energy, respectively, is reported when compared against a previously published technique. The average deviation between optimal and nonoptimal solutions was found to be less than 6% both in terms of processor cycles and energy. The scratchpad in the presented architecture is similar to a preloaded loop cache. Comparing the energy consumption of the presented approach against that of a preloaded loop cache, an average reduction of 9% and 29% in processor cycles and instruction-memory energy, respectively, is reported.
2010 18th IEEE/IFIP International Conference on VLSI and System-on-Chip, 2010
Modern embedded devices require highly optimized code in order to efficiently run the wide range of applications they are designed for. However, most modern applications are getting more and more dynamic, which, at the software level, translates into the use of dynamic data structures like dynamic arrays and lists. State of the art solutions for the optimization of these dynamic
Design, Automation and Test in Europe
Safety-critical embedded systems having to meet real-time constraints are to be highly predictable in order to guarantee at design time that certain timing deadlines will always be met. This requirement usually prevents designers from utilizing caches due to their highly dynamic, thus hardly predictable behavior. The integration of scratchpad memories represents an alternative approach which allows the system to benefit from a performance gain comparable to that of caches while at the same time maintaining predictability. In this work, we compare the impact of scratchpad memories and caches on worst case execution time (WCET) analysis results. We show that caches, despite requiring complex techniques, can have a negative impact on the predicted WCET, while the WCET for scratchpad memories scales with the achieved performance gain at no extra analysis cost.
Proceedings of the 2001 conference on Asia South Pacific design automation - ASP-DAC '01, 2001
This paper deals with address assignment in code generation for digital signal processors (DSPs) with SIMD (single instruction multiple data) memory accesses. In these processors data are organized in groups (or partitions), whose elements share one common memory address. In order to optimize program performance for processors with such memory architectures it is important to have a suitable memory layout of the variables. We propose a two-step address assignment technique for scalar variables using a genetic algorithm based partitioning method and a graph based heuristic which makes use of available DSP address generation hardware. We show that our address assignment techniques lead to a significant code quality improvement compared to heuristics.
Proceedings of the 2003 conference on Asia South Pacific design automation - ASPDAC, 2003
The energy consumption of mobile embedded systems is a limiting factor because of today's battery capacities. The memory subsystem consumes a large share of the energy, necessitating its efficient utilization. Energy-efficient scratchpads are thus becoming common, though unlike caches they must be utilized explicitly. In this paper, an algorithm integrated into a compiler is presented which analyzes the application, partitions an array variable whenever it is beneficial, appropriately modifies the application, and selects the best set of variables and program parts to be placed onto the scratchpad. Results show an energy improvement between 5.7% and 17.6% for a variety of applications against a previously known algorithm.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2001
IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 2000
The role of software is becoming increasingly important in the implementation of digital signal processing (DSP) applications. As this trend intensifies, and the complexity of applications escalates, we are seeing an increased need for automated tools to aid in the development of DSP software. This paper reviews the state-of-the-art in programming language and compiler technology for DSP software implementation. In particular, we review techniques for high-level block-diagram-based modeling of DSP applications; the translation of block-diagram specifications into efficient C programs using global target-independent optimization techniques; and the compilation of C programs into streamlined machine code for programmable DSP processors using architecture-specific and retargetable back-end optimizations. We also point out important directions for further investigation.
Dependable Embedded Systems
Advancing semiconductor technologies increasingly fail to provide expected gains in cost and energy reductions due to reaching the physical limits of Moore’s Law and Dennard scaling. Instead, shrinking semiconductor feature sizes increase a circuit’s susceptibility to soft errors. In order to ensure reliable operation, a significant hardware overhead would be required. The FEHLER project (Flexible Error Handling for Embedded Real-Time Systems) introduces error semantics into the software development process which provide a system with information about the criticality of a given data object to soft errors. Using this information, the overhead required for error correction can be reduced significantly for many applications, since only errors affecting critical data have to be corrected. In this chapter, the fundamental components of FEHLER that cooperate at design and runtime of an embedded system are presented. These include static compiler analyses and transformations as well as a fa...
This paper describes the hierarchical test-generation method STAR-DUST, using self-test program generator RESTART, test pattern generator DUST, fault simulator FAUST and SYNOPSYS logic synthesis tools. RESTART aims at supporting self-test of embedded processors. Its integration into the STAR-DUST environment allows test program generation for realistic fault assumptions and provides, for the first time, experimental data on the fault coverage that can be obtained for full processor models. Experimental data shows that fault masking is not a problem even though the considered processor has to perform result comparison and arithmetic operations in the same ALU.
We propose coordinated use of compiler techniques to improve predictability of timing behavior of hard real-time systems, and thus, to tighten their worst-case execution times. We aim at a generic methodology of compiler optimizations that replace the use of unpredictable hardware and operating system features by the use of more predictable features. We call the approach compiler controlled operation, because it is based on using compilers to control operations that are traditionally controlled by hardware or operating systems. As an example of the approach, we overview our work in progress on a small experimental system.
Proceedings of the 31st Annual ACM Symposium on Applied Computing - SAC '16, 2016
Memory requirements can be a limiting factor for programs dealing with large data structures. In particular, interpreted programming languages used to handle large vectors, such as R, suffer from memory overhead when copying such data structures. Avoiding data duplication directly in the application can reduce the memory requirements. Alternatively, generic kernel-level memory reduction functionality like deduplication and compression can lower the amount of memory required, but it needs to compensate for missing application knowledge by utilizing more CPU time, leading to excessive overhead. To allow new optimizations based on the application's knowledge about its own memory utilization, we propose to introduce a new system call. This system call uses the existing copy-on-write functionality of the Linux kernel to avoid duplicating memory when data is copied. Our experiments using real-world benchmarks written in the R language show that our approach can yield significant improvement in CPU time compared to Kernel Samepage Merging without compromising the amount of memory saved.