Bor-Yeh Shen | National Chiao Tung University
Papers by Bor-Yeh Shen
Asymmetric multicore systems have been studied as a new hardware platform for improving performance-power efficiency in the execution of application programs. Each core in the system has distinct performance and power characteristics. When exploiting asymmetric multicore systems, a major issue is how to distribute threads among the cores. In this work, we build a pseudo-asymmetric system using the dynamic voltage and frequency scaling (DVFS) mechanism on an Intel Core i7 920 for physical power measurement, and we implement a tool agent for a regular JVM to form an asymmetric-aware JVM that supervises the execution of Java threads and migrates threads with a fuzzy-control scheduler. To inspect the results, we use the energy-delay product (EDP) as a metric that reveals the trade-off between performance and energy use. Our fuzzy-control scheduler yields an EDP benefit for some benchmarks and lower overall energy consumption. Keywords: asymmetric multicore; JVM; power efficiency; schedule
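The EDP metric mentioned in this abstract is conventionally defined as energy multiplied by execution time, with lower values being better. The sketch below shows that arithmetic only; the function name and the measurements are hypothetical and are not taken from the paper.

```python
def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """Return EDP = energy * delay. Lower is better."""
    if energy_joules < 0 or delay_seconds < 0:
        raise ValueError("energy and delay must be non-negative")
    return energy_joules * delay_seconds


# Hypothetical measurements (illustrative only, not results from the paper).
baseline_edp = energy_delay_product(energy_joules=120.0, delay_seconds=10.0)
scheduled_edp = energy_delay_product(energy_joules=90.0, delay_seconds=12.0)

# Relative EDP improvement of the scheduled run over the baseline.
improvement = (baseline_edp - scheduled_edp) / baseline_edp
print(f"baseline={baseline_edp}, scheduled={scheduled_edp}, improvement={improvement:.1%}")
```

In this made-up case the scheduled run is slower but uses less energy, and the product still improves, which is the kind of compromise the metric is meant to expose.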
Code size is an important issue in many embedded systems. To reduce code size, newer embedded RISC processors employ a mixed-width instruction set, in which the processor architecture supports interleaved execution of normal (usually 32-bit) and narrow (usually 16-bit) instructions without an explicit mode switch. However, because of the restricted encoding length, narrow instructions can access only a limited set of registers. Therefore, for a mixed-width instruction set, proper register allocation can reduce code size. One approach is to reassign the registers after traditional register allocation. In this paper, we prove that this register reassignment problem is NP-complete by showing that the 0-1 knapsack problem is a special case of it. We also propose a method for register reassignment for a mixed-width instruction set with the main goal of code size reduction. Keywords: Mixed-width ISA, Code Size Reduction, Register Reassignment, Thumb-2, Knapsack Pr...
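For readers unfamiliar with the 0-1 knapsack problem that the hardness proof refers to, the following is a minimal textbook dynamic-programming sketch of knapsack itself. It is not the paper's register reassignment method or its reduction; the function name and the sample numbers are assumptions chosen for illustration.

```python
def knapsack_01(values, weights, capacity):
    """Maximum total value of items, each used at most once, within capacity.

    Standard pseudo-polynomial DP: best[c] is the best value achievable
    with total weight at most c using the items processed so far.
    """
    best = [0] * (capacity + 1)
    for value, weight in zip(values, weights):
        # Iterate capacities downwards so each item is counted at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]


# Illustrative instance: three items, capacity 50.
print(knapsack_01(values=[60, 100, 120], weights=[10, 20, 30], capacity=50))
```

The running time is proportional to the number of items times the capacity, which is not polynomial in the size of the input encoding; this is consistent with the problem being NP-complete.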
Binary translation is an important technique for porting programs, as it allows applications built for one platform to execute on another. The technique is widely used in virtual machines and emulators. However, binary translation is challenging because many delicate details, such as calling conventions and system calls, must be handled carefully to generate correct translated code. Identifying a mistranslated instruction in a program is difficult, especially when the application program is large. Therefore, an automatic tool is needed to uncover problems introduced during translation. We have developed a new validation mechanism for static binary translation, which checks the correctness of the emulated architecture state during program execution. We have also proposed additional optimizations to speed up the automatic validation process. General Terms: Reliability, Validation.
More and more modern processors support SIMD instructions for improving performance in media applications. Programmers usually need detailed target-specific knowledge to use SIMD instructions directly. Thus, an auto-vectorizing compiler that automatically generates efficient SIMD instructions is urgently needed. We implement automatic superword vectorization based on the LLVM compiler infrastructure, to which an auto-vectorization pass and an alignment analysis pass have been added. The superword auto-vectorization pass exploits data parallelism and converts IR instructions from primitive types to vector types. Then, in the code generator, the alignment analysis pass analyzes every memory access made by those vector instructions and generates the alignment information needed to generate target-specific alignment instructions. In this paper, we use UltraSPARC as our experimental platform and two realignment instructions to perform misaligned accesses. We also present preliminary experimental ...
ACM Transactions on Embedded Computing Systems
Code discovery has been a main challenge for static binary translation, especially when the source instruction set architecture has variable-length instructions, as the x86 architectures do. Because of embedded data such as PC (program counter)-relative data, jump tables, or padding in the code section, a binary translator may be misled into translating data as instructions. For variable-length instructions, once a piece of data is mistranslated as instructions, decoding of subsequent bytes could also go wrong. We are concerned with static binary translation for the very popular Advanced RISC Machine (ARM) architectures. Although ARM is considered a reduced instruction set computer architecture, it does allow 32-bit (ARM) instructions and 16-bit (Thumb) instructions to be mixed in the same executable. In addition to having different instruction lengths, ARM and Thumb instructions are located at 4-byte and 2-byte aligned addresses, respectively. Furthermore, because ARM and Thumb instructions s...
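The alignment rule stated in this abstract can be sketched as a simple address check. The helper below is hypothetical and is not part of the described translator; it only illustrates that alignment narrows the candidate instruction modes for an address without settling them.

```python
def candidate_modes(address: int) -> set:
    """Return the instruction modes whose alignment rule the address satisfies.

    Assumes only the rule from the abstract: ARM instructions sit at
    4-byte aligned addresses, Thumb instructions at 2-byte aligned ones.
    """
    modes = set()
    if address % 2 == 0:
        modes.add("thumb")
    if address % 4 == 0:
        modes.add("arm")
    return modes


# Example addresses are arbitrary and chosen only for illustration.
for addr in (0x8000, 0x8002, 0x8001):
    print(hex(addr), sorted(candidate_modes(addr)))
```

Every 4-byte aligned address is also 2-byte aligned, so an address such as 0x8000 remains ambiguous between the two modes, and a translator needs further evidence to decide how to decode it.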
ACM Transactions on Architecture and Code Optimization, 2014
Machines designed with a new but incompatible instruction set architecture (ISA) may lack proper applications. Binary translation can address this incompatibility by migrating applications from a legacy ISA to a new one, although binary translation has problems such as code discovery for variable-length ISAs and code location issues in handling indirect branches. Dynamic binary translation (DBT) has been widely adopted for migrating applications because it avoids those problems. Static binary translation (SBT) is a less general solution and has not been actively researched. However, SBT can perform more aggressive optimizations, which could yield more compact code and better code quality. Applications translated by SBT can consume less memory, fewer processor cycles, and less power than those translated by DBT, and can start more quickly. These advantages are even more critical for embedded systems than for general systems. In this article, we designed and implemented a new SBT tool, called LLBT, which translates ...
Computer Languages, Systems & Structures, 2015
2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2013
Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems - CASES '12, 2012
7th IEEE International Symposium on Industrial Embedded Systems (SIES'12), 2012
IP multicasting has naturally been considered the ideal technique for multimedia communications. Unfortunately, current multicast protocols do not take dynamic membership into account. Therefore, we propose a time and distance-based multicast algorithm (TDBMA) for IPv6 mobile networks. The TDBMA is governed by time and distance: a foreign network qualifies to join the existing multicast tree only if its time and distance value is larger than a computed threshold. Experimental analyses are presented to characterize the performance of the proposed algorithm.
To allow embedded operating systems to update their components on the fly, a dynamic update mechanism is required so that operating systems can be patched or extended with extra functionality without rebooting the machine. However, embedded environments are usually resource-limited in terms of memory size, processing power, power consumption, and network bandwidth. Thus, dynamic update for embedded operating systems should be designed to make the best use of limited resources. In this paper, we propose a server-side pre-linking mechanism to make dynamic updates of embedded operating systems efficient. Applying this mechanism can reduce not only the memory usage and CPU processing time of a dynamic update, but also the data transmission size of the update components. Power consumption can be reduced as well. Performance evaluation shows that, compared with the Linux loadable kernel module approach, the size of update components can be reduced by about 14-35%, and the overheads in embedded clients are minimal.
Journal of Information Science and Engineering, 2010