Hirofumi Sakane - Academia.edu
Papers by Hirofumi Sakane
Proceedings of the 1996 Conference on Parallel Architectures and Compilation Technique
Overlapping computation with communication is central to obtaining high performance on distributed-memory multiprocessors. This report explicates the overlapping capability of two distributed-memory multiprocessors: the EM-X and IBM SP-2. The well-known bitonic sorting algorithm is selected for experiments. Various message sizes are used to determine when, where, how much and why overlapping takes place. Experimental results indicate that both multiprocessors would
international conference on parallel architectures and compilation techniques, Nov 11, 1997
This report presents empirical results of fine-grain communication on the 80-processor EM-X distributed-memory multiprocessor. EM-X has hardware support for low-latency, high-throughput fine-grain communication: packet generation integrated into the instruction execution pipeline for single-cycle communication overhead, direct memory access for remote references, and rapid context switching for latency tolerance. We study the fine-grain communication performance of integer radix sort, a code with irregular communication, on EM-X, and compare it to the Fujitsu AP1000+ and the Cray Server CS6400. Our experimental results indicate that EM-X achieves high throughput and low overhead for fine-grain communication. Whereas EM-X's communication performance scales perfectly as we increase the number of processors, other coarse-grain message-passing machines exhibit fluctuation and performance degradation for larger configurations due to network contention.
Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques
Proceedings of the International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN)
Proceedings 11th International Parallel Processing Symposium
Sparse matrix problems require a communication paradigm different from those used in conventional distributed-memory multiprocessors. In this paper, we show how fine-grain communication can help obtain high performance on the experimental distributed-memory multiprocessor EM-X, developed at ETL, which handles fine-grain communication very efficiently. The sparse matrix kernel Conjugate Gradient (CG) is selected for the experiments. Among the steps in CG, we focus on the sparse matrix-vector multiplications. Several communication methods are developed for performance comparison, including coarse-grain and fine-grain implementations. Fine-grain communication allows exact data access in an unstructured problem, which reduces the amount of communication. While CG presents bottlenecks in the form of a large number of fine-grain remote reads, the multithreaded execution model is designed to tolerate such latency. Experimental results indicate that the performance of the fine-grain read implementation is comparable to that of the coarse-grain implementation on 64 processors. The results demonstrate that fine-grain communication can be a viable and efficient approach to unstructured sparse matrix problems on large-scale distributed-memory multiprocessors.
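The "exact data access" point in the abstract above can be illustrated with a minimal sketch. It assumes a CSR storage layout and a two-processor block distribution; the function names (`csr_matvec`, `remote_reads`) and the example matrix are hypothetical and are not taken from the paper.

```python
# Sketch only: the CSR layout, the block distribution, and all names here
# are illustrative assumptions, not the implementation used on the EM-X.


def csr_matvec(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        total = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x[col_idx[k]] is the data-dependent access: on a distributed
            # machine this element may live on another processor.
            total += values[k] * x[col_idx[k]]
        y[row] = total
    return y


def remote_reads(row_ptr, col_idx, owner_of, rows, me):
    """Count the distinct x entries that processor `me` must fetch remotely."""
    needed = set()
    for row in rows:
        for k in range(row_ptr[row], row_ptr[row + 1]):
            if owner_of(col_idx[k]) != me:
                needed.add(col_idx[k])
    return len(needed)


# 4x4 example matrix:
# [[4, 0, 0, 1],
#  [0, 3, 0, 0],
#  [0, 0, 2, 0],
#  [1, 0, 0, 5]]
row_ptr = [0, 2, 3, 4, 6]
col_idx = [0, 3, 1, 2, 0, 3]
values = [4.0, 1.0, 3.0, 2.0, 1.0, 5.0]
x = [1.0, 2.0, 3.0, 4.0]

print(csr_matvec(row_ptr, col_idx, values, x))

# Processor 0 owns rows 0-1 and x[0..1]; processor 1 owns the rest.
print(remote_reads(row_ptr, col_idx, lambda j: j // 2, rows=[0, 1], me=0))
```

In this toy distribution, processor 0 needs only one remote element of `x`, whereas a coarse-grain scheme that ships whole blocks would transfer both elements owned by processor 1.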
In this paper, we evaluate two techniques for executing shared memory programs on the EM-X distributed memory multiprocessor: access with no local copy (NL) and access with coherent local copy (CL). For the NL approach, multithreading efficiently hides the latency caused by fine-grain communication, whereas the thread switching overhead still remains. To eliminate the thread switching overhead and exploit locality, we have also implemented the CL mechanism derived from the notion of conventional software distributed shared memory. Performance analyses show that a highly optimized implementation for a frequent shared access program greatly improves the performance, in spite of additional software overhead. Tradeoffs of NL versus CL, thus obtained, provide a basis for the selection of a technique that is more appropriate for applications on the EM-X.
Lecture Notes in Computer Science, 2008
A secure and dependable dynamic partial reconfiguration (DPR) system based on the AES-GCM cipher is developed, where the reconfigurable IP cores are protected by encrypting and authenticating their bitstreams with AES-GCM. In DPR systems, bitstream authentication is essential for avoiding fatal damage caused by inadvertent bitstreams. Although encryption-only systems can prevent bitstream cloning and reverse engineering, they cannot prevent erroneous or malicious bitstreams from being accepted as valid. If a bitstream error is detected after the system has already been partly configured, the system must be reconfigured with an error-free bitstream or, at worst, rebooted, since DPR changes the hardware architecture itself and the system cannot return to its initial state by asserting a reset signal. In this regard, our system can recover from configuration errors without rebooting. To the authors' best knowledge, this is the first DPR system featuring both bitstream protection and error recovery mechanisms. Additionally, we clarify the relationship between the computation time and the bitstream block size, and derive the optimal internal memory size necessary to achieve the highest throughput. Furthermore, we implemented an AES-GCM-based DPR system targeting the Virtex-5 device on an off-the-shelf board, and demonstrated that all functions of bitstream decryption, verification, configuration, and error recovery work correctly. This paper clarifies the throughput, the hardware utilization, and the optimal memory configuration of said DPR system.
Lecture Notes in Computer Science, 1995
The EM-X, a new generation of EM-4, is a distributed memory multiprocessor which has a dataflow mechanism. The dataflow mechanism enables a fine-grain communication packet through the network to invoke and synchronize the thread of control dynamically with very small overhead. In this paper, we present programming with a distributed data structure shared by threads, and its implementation for the
Proceedings of the 9th international conference on Supercomputing - ICS '95, 1995
The purpose of this paper is to propose a new fast execution scheme for FORTRAN programs. The proposed scheme enables the fast initiation of a macrotask when its data dependences are satisfied, even if the control flow has not yet reached it. Previous schemes for parallelizing a program that includes conditional branches have a number of problems: 1) although the theoretical speedup ratio is up to N when N conditional branches are jumped on either a VLIW or a superscalar machine, N is restricted to the number of ALUs on a chip; 2) since conventional control schemes use a few processors to control macrotasks, the control overhead is large. The proposed scheme solves these problems: 1) it enables speculative execution between coarse-grain tasks, i.e., macrotasks, on multiprocessors by jumping many conditional branches; 2) a distributed control scheme is proposed and implemented on the EM-4 multiprocessor to decrease the control overhead of macrotasks. Preliminary evaluations show that the control overhead of the proposed scheme is smaller than that of the other control schemes. Moreover, it is confirmed that the distributed control can be implemented in software when the average macrotask execution time is larger than 14.4 µs on the EM-4 multiprocessor.
Synthesiology English edition, 2010
hardly trace the results produced at the different experimental environments unique to each of them. Therefore, we have developed a standard experimental environment and published information about side-channel experiments in order to contribute to the standardization activities from the neutral standpoint of the National Institute of Advanced Industrial Science and Technology (AIST) as a public research institution. In addition, we are pursuing collaborations with domestic and overseas research institutions, private companies, and universities toward operations of security evaluation systems for cryptographic modules. In this paper, we first present a comprehensive vision of these standardization activities and our role in them. Secondly, we explain our effort in developing a standard evaluation environment for side-channel attacks and demonstrate the current status of side-channel attacks through experiments with the environment. Thirdly, we introduce our vision for future research on fault-injection attacks and invasive attacks, which require more advanced techniques, and on system dependability and security assurance against accidental errors and faults in addition to attack-based security issues. 2 Expanding application and security evaluation of cryptographic technology 2.1 Standardization of cryptographic algorithms The invention of writing made non-oral information propagation and knowledge accumulation possible. Since then, humankind has devised various measures for preventing a third person from discovering the information or knowledge. Cryptographic technology is one of them. (Subtitle: Development of a standard evaluation environment for side channel attacks)
Multithreading is known to be effective for tolerating communication latency in distributed-memory multiprocessors. Two types of support for multithreading have been used to date: software and hardware. This paper presents the impact of multithreading on performance through empirical studies. In particular, we explicate the performance difference between software support and hardware support for the 80-processor EM-X distributed-memory multiprocessor which we have designed and implemented. The EM-X provides three types of hardware support for fine-grain multithreading: direct remote memory access, fast thread invocation, and dedicated instructions for generating fixed-sized communication packets. To demonstrate the effect of multithreading, we have performed various experiments using micro benchmark programs and MP3D, one of the SPLASH benchmarks. Three types of performance parameters have been measured including processor efficiency, remote memory latency, and networ...
In this paper, we evaluate two techniques for executing shared memory programs on the EM-X distributed memory multiprocessor: access with no local copy (NL) and access with coherent local copy (CL). For the NL approach, multithreading efficiently hides the latency caused by fine-grain communication, whereas the thread switching overhead still remains. To eliminate the thread switching overhead and exploit locality, we have also implemented the CL mechanism derived from the notion of conventional software distributed shared memory. Performance analyses show that a highly optimized implementation for a frequent shared access program greatly improves the performance, in spite of additional software overhead. Tradeoffs of NL versus CL, thus obtained, provide a basis for the selection of a technique that is more appropriate for applications on the EM-X. Keywords: Parallel memory systems, Performance evaluation and measurements, Fine-grain communication, Distributed shared memory
2007 International Conference on Field-Programmable Technology, 2007
We developed an FPGA-based content delivery system to securely distribute digital content on the Internet. With partial reconfigurability of a Xilinx Virtex-II Pro FPGA, the system provides a flexible single-chip solution for protecting digital content. In the system, a partial circuit must be downloaded from a server to the client terminal to play content. Content will be played if and only if the downloaded circuit is correctly combined (= interlocked) with the circuit built in the terminal. Since each circuit has a unique I/O configuration, the downloaded circuit interlocks with the corresponding built-in circuit designed for a particular terminal. Thus, the interface of the circuit itself provides a novel authentication mechanism. In the present paper, we describe the detailed architecture of the proposed system and clarify the feasibility and effectiveness of this system experimentally using a single-chip partial reconfiguration. In addition, we discuss the fail-safe mechanisms, partially reconfigurable FPGA architecture, and future research necessary for the practical application of the system.
IEICE Transactions on Information and Systems, 2013
Protecting the confidentiality and integrity of a configuration bitstream is essential for the dynamic partial reconfiguration (DPR) of field-programmable gate arrays (FPGAs). This is because erroneous or falsified bitstreams can cause fatal damage to FPGAs. In this paper, we present a high-speed and area-efficient bitstream protection scheme for DPR systems using the Advanced Encryption Standard with Galois/Counter Mode (AES-GCM), which is an authenticated encryption algorithm. Unlike many previous studies, our bitstream protection scheme also provides a mechanism for error recovery and tamper resistance against configuration block deletion, insertion, and disorder. The implementation and evaluation results show that our DPR scheme achieves a higher performance, in terms of speed and area, than previous methods.
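One common way to detect block deletion, insertion, and reordering is to bind each block's position and the total block count into its authentication tag. The sketch below illustrates that idea only. It uses HMAC-SHA256 from the Python standard library as a stand-in for the AES-GCM tag, performs no encryption, and is not the scheme described in the paper; the names `tag_blocks` and `verify_blocks` are hypothetical.

```python
import hashlib
import hmac

# Sketch only: HMAC-SHA256 stands in for an AES-GCM authentication tag, and
# the header format (index + total) is an assumption for illustration.


def _header(index, total):
    """Encode the block position and the total block count as 8 bytes."""
    return index.to_bytes(4, "big") + total.to_bytes(4, "big")


def tag_blocks(key, blocks):
    """Return (block, tag) pairs whose tags cover position and block count."""
    total = len(blocks)
    return [
        (block, hmac.new(key, _header(i, total) + block, hashlib.sha256).digest())
        for i, block in enumerate(blocks)
    ]


def verify_blocks(key, tagged):
    """Accept only if every tag matches its received position and the count."""
    total = len(tagged)
    for i, (block, tag) in enumerate(tagged):
        expected = hmac.new(key, _header(i, total) + block, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            return False
    return True


key = b"demo-key-not-for-real-use"
tagged = tag_blocks(key, [b"frame-0", b"frame-1", b"frame-2"])

print(verify_blocks(key, tagged))                            # intact stream
print(verify_blocks(key, tagged[:2]))                        # last block deleted
print(verify_blocks(key, [tagged[1], tagged[0], tagged[2]])) # blocks reordered
print(verify_blocks(key, tagged + [tagged[0]]))              # block inserted
```

Because the count is part of every tag, removing or adding any block invalidates all tags, and swapping two blocks invalidates the tags at the swapped positions.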
IEICE Transactions on Information and Systems, 2008
We developed a content delivery system using a partially reconfigurable FPGA to securely distribute digital content on the Internet. With partial reconfigurability of a Xilinx Virtex-II Pro FPGA, the system provides an innovative single-chip solution for protecting digital content. In the system, a partial circuit must be downloaded from a server to the client terminal to play content. Content will be played only when the downloaded circuit is correctly combined (= interlocked) with the circuit built in the terminal. Since each circuit has a unique I/O configuration, the downloaded circuit interlocks with the corresponding built-in circuit designed for a particular terminal. Thus, the interface of the circuit itself provides a novel authentication mechanism. This paper describes the detailed architecture of the system and clarifies the feasibility and effectiveness of the system. In addition, we discuss a fail-safe mechanism and future work necessary for the practical application of the system. Key words: field-programmable gate array (FPGA), partial run-time reconfiguration (RTR), content protection, digital rights management (DRM)
2008 International Conference on Field Programmable Logic and Applications, 2008
A high-speed and secure dynamic partial reconfiguration (DPR) system is realized with AES-GCM that guarantees both confidentiality and authenticity of FPGA bitstreams. In DPR systems, bitstream authentication is essential for avoiding fatal damage caused by unintended bitstreams. An encryption-only system can prevent bitstream cloning and reverse engineering, but cannot prevent erroneous or malicious bitstreams from being configured. Authenticated encryption is a relatively new concept that provides both message encryption and authentication, and AES-GCM is one of the latest authenticated encryption algorithms suitable for hardware implementation. We implemented the AES-GCM-based DPR system targeting the Virtex-5 device on an off-the-shelf board, and evaluated its throughput and hardware resource utilization. For comparison, we also implemented AES-CBC and SHA-256 modules on the same device. The experimental results showed that the AES-GCM-based system achieved higher throughput with less resource utilization than the AES/SHA-based system. The AES-GCM module achieved more than 1 Gbps throughput and the entire system achieved about 800 Mbps throughput with reasonable resource utilization. This paper clarifies the advantage of using AES-GCM for protecting DPR systems.
Proceedings. 2003 IEEE International Conference on Field-Programmable Technology (FPT) (IEEE Cat. No.03EX798)
DIMES (Delaware Iterative Multiprocessor Emulation System) is a new FPGA-based hardware emulator for large logic systems incorporating a number of identical functional modules, such as a Multiprocessor-System-On-Chip or a cellular architecture. It aims to provide both logic verification and early software development environments with dramatically improved cost performance. Under our iterative emulation technology, a part of the FPGA resources is time-shared among several identical modules of the target design and iteratively used to emulate them in multiple steps. The representation of the identical modules in the FPGA consists of (1) a single module copy and (2) a storage block holding all the states of the modules during iterative emulation. On a first implementation of DIMES, called DIMES/P, we have implemented a multiprocessor-system-on-chip design of the IBM Cyclops architecture as a case study. We report our preliminary results and experience of exploiting the iterative emulation technology in Cyclops emulation.
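The "single module copy plus state storage" idea can be sketched in software as follows. This is a toy model under stated assumptions: the module logic (an 8-bit accumulator), the ring connection between modules, and the names `step_module` and `emulate_cycle` are all hypothetical and do not describe the DIMES hardware.

```python
# Sketch only: a software analogy of iterative emulation. The module logic
# and the ring topology are invented for illustration.


def step_module(state, neighbor_state):
    """The single shared module copy: next-state logic of one module."""
    return (state + neighbor_state) % 256


def emulate_cycle(states):
    """Emulate one target clock cycle for all identical modules.

    The shared logic is reused once per module (one iterative step each).
    Results go to a second buffer so that every module sees its neighbor's
    state from the previous target cycle, as in real parallel hardware.
    """
    n = len(states)
    next_states = [0] * n
    for i in range(n):
        left = states[(i - 1) % n]      # load neighbor state from storage
        next_states[i] = step_module(states[i], left)
    return next_states


storage = [1, 2, 3, 4]                  # state storage block for 4 modules
storage = emulate_cycle(storage)
print(storage)
```

One emulated target cycle therefore costs as many host steps as there are modules, which is the time-for-area trade the abstract describes.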
ACM Symposium on Parallel Algorithms and Architectures, 1997
Multithreading aims to tolerate latency by overlapping communication with computation. This report explicates the multithreading capabilities of the EM-X distributed-memory multiprocessor through empirical studies. The EM-X provides hardware support for fine-grain multithreading, including a by-passing mechanism for direct remote reads and writes, hardware FIFO thread scheduling, and dedicated instructions for generating fixed-sized communication packets. Bitonic sorting and
Proceedings of the 1996 Conference on Parallel Architectures and Compilation Technique
Overlapping computation with communication is central to obtaining high performance on distribute... more Overlapping computation with communication is central to obtaining high performance on distributed-memory multiprocessors. This report explicates the overlapping capability of two distributed-memory multiprocessors: the EM-X and IBM SP-2. The well-known bitonic sorting algorithm is selected for experiments. Various message sizes are used to determine when, where, how much and why overlapping takes place. Experimental results indicate that both multiprocessors would
international conference on parallel architectures and compilation techniques, Nov 11, 1997
ABSTRACT This report presents empirical results of fine-grain communication on the 80-processor E... more ABSTRACT This report presents empirical results of fine-grain communication on the 80-processor EM-X distributed-memory multiprocessor. EM-X has hardware support for low latency, high throughput fine-grain communication -- this hardware support includes packet generation integrated into the instruction execution pipeline for single-cycle communication overhead, direct memory access for remote references, and rapid context switching for latency tolerance. We study the fine-grain communication performance of integer radix sort, a code with irregular communication, on EM-X, and compare it to the Fujitsu AP1000+ and the Cray Server CS6400. Our experimental results indicate that EM-X achieves high throughput and low overhead for fine-grain communication. Whereas EM-X's communication performance scales perfectly as we increase the number of processors, other coarse-grain message-passing machines exhibit fluctuation and performance degradation for larger configurations due to network contention.
Proceedings 1997 International Conference on Parallel Architectures and Compilation Techniques
ABSTRACT This report presents empirical results of fine-grain communication on the 80-processor E... more ABSTRACT This report presents empirical results of fine-grain communication on the 80-processor EM-X distributed-memory multiprocessor. EM-X has hardware support for low latency, high throughput fine-grain communication -- this hardware support includes packet generation integrated into the instruction execution pipeline for single-cycle communication overhead, direct memory access for remote references, and rapid context switching for latency tolerance. We study the fine-grain communication performance of integer radix sort, a code with irregular communication, on EM-X, and compare it to the Fujitsu AP1000+ and the Cray Server CS6400. Our experimental results indicate that EM-X achieves high throughput and low overhead for fine-grain communication. Whereas EM-X's communication performance scales perfectly as we increase the number of processors, other coarse-grain message-passing machines exhibit fluctuation and performance degradation for larger configurations due to network contention.
Proceedings of the International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN)
Proceedings 11th International Parallel Processing Symposium
Sparse matrix problems require a communication paradigm different from those used in conventional... more Sparse matrix problems require a communication paradigm different from those used in conventional distributed-memory multiprocessors. We present in this paper how fine-grain communication can help obtain high performance in the experimental distributed-memory multiprocessor, EM-X, developed at ETL, which can handle fine-grain communication very efficiently. The sparse matrix kernel, Conjugate Gradient, is selected for the experiments. Among the steps in CG is the sparse matrix vector multiplications we focus on in the study. Some communication methods are developed for performance comparison, including coarse-grain and fine-grain implementations. Fine-grain communication allows exact data access in an unstructured problem to reduce the amount of communication. While CG presents bottlenecks in terms of a large number of fine-grain remote reads, the multithraded principles of execution is so designed to tolerate such latency. Experimental results indicate that the performance of fine-grain read implementation is comparable to that of coarse-grain implementation on 64 processors. The results demonstrate that fine-grain communication can be a viable and efficient approach to unstructured sparse matrix problems on large-scale distributed-memory multiprocessors.
| In this paper, we evaluate two techniques for executing shared memory programs on the EM-X dist... more | In this paper, we evaluate two techniques for executing shared memory programs on the EM-X distributed memory multiprocessor: access with no local copy(NL) and access with coherent local copy(CL). For the NL approach, multithreading e ciently hides the latency caused by ne-grain communication, whereas the thread switching overhead still remains. To eliminate the thread switching overhead and exploit locality, we have also implemented the CL mechanism derived from the notion of conventional software distributed shared memory. Performance analyses show that a highly optimized implementation for a frequent shared access program greatly improves the performance, in spite of additional software overhead. Tradeo s of NL versus CL, thus obtained, provide a basis for the selection of a technique that is more appropriate for applications on the EM-X.
Lecture Notes in Computer Science, 2008
A secure and dependable dynamic partial reconfiguration (DPR) system based on the AES-GCM cipher ... more A secure and dependable dynamic partial reconfiguration (DPR) system based on the AES-GCM cipher is developed, where the reconfigurable IP cores are protected by encrypting and authenticating their bitstreams with AES-GCM. In DPR systems, bitstream authentication is essential for avoiding fatal damage caused by inadvertent bitstreams. Although encryption-only systems can prevent bitstream cloning and reverse engineering, they cannot prevent erroneous or malicious bitstreams from being accepted as valid. If a bitstream error is detected after the system has already been partly configured, the system must be reconfigured with an errorless bitstream or at worst rebooted since the DPR changes the hardware architecture itself and the system cannot recover itself to the initial state by asserting a reset signal. In this regard, our system can recover from configuration errors without rebooting. To the authors' best knowledge, this is the first DPR system featuring both bitstream protection and error recovery mechanisms. Additionally, we clarify the relationship between the computation time and the bitstream block size, and derive the optimal internal memory size necessary to achieve the highest throughput. Furthermore, we implemented an AES-GCMbased DPR system targeting the Virtex-5 device on an off-the-shelf board, and demonstrated that all functions of bitstream decryption, verification, configuration, and error recovery work correctly. This paper clarifies the throughput, the hardware utilization, and the optimal memory configuration of said DPR system.
Lecture Notes in Computer Science, 1995
The EM-X, a new generation of EM-4, is a distributed memory multiprocessor which has a dataflow m... more The EM-X, a new generation of EM-4, is a distributed memory multiprocessor which has a dataflow mechanism. The dataflow mechanism enables a fine-grain communication packet through the network to invoke and synchronize the thread of control dynamically with very small overhead. In this paper, we present programming with a distributed data structure shared by threads, and its implementation for the
Proceedings of the 9th international conference on Supercomputing - ICS '95, 1995
The purpose of this paper is to propose a new fast execution scheme of FORTRAN programs. The prop... more The purpose of this paper is to propose a new fast execution scheme of FORTRAN programs. The proposed scheme enables the fast initiation of macrotask when ita data dependence are satisfied even if the control flow has not been reached. The previous schemes to parallelize a program including conditional branches have a number of problems-1) Though the theoretical speedup ratio is up to N when N conditional branches are jumped on either a VLIW or a superscalstr machine, the number of N is restricted up to the number of ALUs on a chip, 2) Since conventional control schemes use a few processors to control macrotasks, the overhead to control them is large. The proposed scheme solves these problems-1) The proposed scheme enables speculative execution between coarse grain tasks, i.e., macro tasks, on multiprocessors by jumping many conditional branches, 2) A distributed control scheme is proposed and implemented on the EM-4 multiprocessor to decrease the control overhead of macrotasks. Preliminary evaluations show that the control overhead of the proposed scheme is smaller than that of the other control schemes. Moreover, it is confirmed that the distributed control can be implemented by using software when the average macrotssk execution time is larger than 14.4ps on the EM-4 multiprocessor.
Synthesiology English edition, 2010
hardly trace the results produced at the different experimental environments unique to each of th... more hardly trace the results produced at the different experimental environments unique to each of them. Therefore, we have developed a standard exper imental environ ment and published information about side-channel experiments in order to contribute to the standardization activities from the neutral standpoint of the National Institute of Advanced Industrial Science and Technology (AIST) as a pubic research institution. In addition, we are pursuing collaborations with domestic and overseas research institutions, private companies, and universities toward operations of security evaluation systems for cryptographic modules. In this paper, we first present a comprehensive vision of these standardization activities and our role in them. Secondly, we explain our effort in developing a standard evaluation environment for side-channel attacks and demonstrate the current status of side-channel attacks through experiments with the environment. Thirdly, we introduce our vision for future research on fault-injection attacks and invasive attacks, which require higher techniques, and on system dependability and security assurance against accidental errors and faults in addition to attack-basis security issues. 2 E x p a n d i n g a p p li c a t i o n a n d se c u r i t y evaluation of cryptographic technology 2.1 Standardization of cryptographic algorithms The invention of writing made non-oral infor mation propagat ion a nd k nowledge accu mu lat ion possible. Since then, humankind has devised various measures for preventing a third person from discovering the information or knowledge. Cryptographic technology is one of them.-Development of a standard evaluation environment for side channel attacks
Multithreading is known be effective for tolerating communication latency in distributed-memory m... more Multithreading is known be effective for tolerating communication latency in distributed-memory multiprocessors. Two types of support for multithreading have been used to date including software and hardware. This paper presents the impact of multithreading on performance through empirical studies. In particular, we explicate the performance difference between software support and hardware support for the 80-processor EM-X distributed-memory multiprocessor which we have designed and implemented. The EMX provides three types of hardware supports for fine-grain multithreading including direct remote memory access, fast thread invocation, and dedicated instructions for generating fixed-sized communication packets. To demonstrate the effect of multithreading, we have performed various experiments using micro benchmark programs and MP3D, one of the SPLASH benchmarks. Three types of performance parameters have been measured including processor efficiency, remote memory latency, and networ...
In this paper, we evaluate two techniques for executing shared memory programs on the EM-X distributed memory multiprocessor: access with no local copy (NL) and access with coherent local copy (CL). For the NL approach, multithreading efficiently hides the latency caused by fine-grain communication, whereas the thread switching overhead still remains. To eliminate the thread switching overhead and exploit locality, we have also implemented the CL mechanism derived from the notion of conventional software distributed shared memory. Performance analyses show that a highly optimized implementation for a frequent shared access program greatly improves the performance, in spite of additional software overhead. Tradeoffs of NL versus CL, thus obtained, provide a basis for the selection of a technique that is more appropriate for applications on the EM-X. Keywords: Parallel memory systems, Performance evaluation and measurements, Fine-grain communication, Distributed shared memory
2007 International Conference on Field-Programmable Technology, 2007
We developed an FPGA-based content delivery system to securely distribute digital content on the Internet. With partial reconfigurability of a Xilinx Virtex-II Pro FPGA, the system provides a flexible single-chip solution for protecting digital content. In the system, a partial circuit must be downloaded from a server to the client terminal to play content. Content will be played if and only if the downloaded circuit is correctly combined (= interlocked) with the circuit built in the terminal. Since each circuit has a unique I/O configuration, the downloaded circuit interlocks with the corresponding built-in circuit designed for a particular terminal. Thus, the interface of the circuit itself provides a novel authentication mechanism. In the present paper, we describe the detailed architecture of the proposed system and clarify the feasibility and effectiveness of this system experimentally using a single-chip partial reconfiguration. In addition, we discuss the fail-safe mechanisms, partially reconfigurable FPGA architecture, and future research necessary for the practical application of the system.
IEICE Transactions on Information and Systems, 2013
Protecting the confidentiality and integrity of a configuration bitstream is essential for the dynamic partial reconfiguration (DPR) of field-programmable gate arrays (FPGAs). This is because erroneous or falsified bitstreams can cause fatal damage to FPGAs. In this paper, we present a high-speed and area-efficient bitstream protection scheme for DPR systems using the Advanced Encryption Standard with Galois/Counter Mode (AES-GCM), which is an authenticated encryption algorithm. Unlike many previous studies, our bitstream protection scheme also provides a mechanism for error recovery and tamper resistance against configuration block deletion, insertion, and disorder. The implementation and evaluation results show that our DPR scheme achieves a higher performance, in terms of speed and area, than previous methods.
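The abstract above does not say how block deletion, insertion, and disorder are detected. As a rough illustration only, one common approach is to bind each block's authentication tag to its position and to the total block count. The sketch below uses HMAC-SHA256 from the Python standard library as a stand-in for AES-GCM, so it provides authenticity but no confidentiality; the function names `tag_blocks` and `verify_blocks` and the header layout are assumptions for illustration, not the scheme from the paper.

```python
import hashlib
import hmac
import struct
from typing import List, Tuple

TaggedBlock = Tuple[bytes, bytes]  # (block data, authentication tag)


def _tag(key: bytes, index: int, total: int, block: bytes) -> bytes:
    # Hypothetical header: block index and total count as big-endian 32-bit ints.
    header = struct.pack(">II", index, total)
    return hmac.new(key, header + block, hashlib.sha256).digest()


def tag_blocks(key: bytes, blocks: List[bytes]) -> List[TaggedBlock]:
    """Attach a position-bound tag to every configuration block."""
    total = len(blocks)
    return [(b, _tag(key, i, total, b)) for i, b in enumerate(blocks)]


def verify_blocks(key: bytes, tagged: List[TaggedBlock]) -> bool:
    """Accept the stream only if every block sits at its original position."""
    if not tagged:
        return False  # an empty stream is rejected in this sketch
    total = len(tagged)
    for i, (block, tag) in enumerate(tagged):
        # Constant-time comparison avoids leaking how many tag bytes matched.
        if not hmac.compare_digest(tag, _tag(key, i, total, block)):
            return False
    return True
```

Because the total count is part of every tag, removing or adding a block invalidates the tags of the blocks that remain, and because the index is part of every tag, swapping two blocks fails at the first swapped position.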
IEICE Transactions on Information and Systems, 2008
We developed a content delivery system using a partially reconfigurable FPGA to securely distribute digital content on the Internet. With partial reconfigurability of a Xilinx Virtex-II Pro FPGA, the system provides an innovative single-chip solution for protecting digital content. In the system, a partial circuit must be downloaded from a server to the client terminal to play content. Content will be played only when the downloaded circuit is correctly combined (= interlocked) with the circuit built in the terminal. Since each circuit has a unique I/O configuration, the downloaded circuit interlocks with the corresponding built-in circuit designed for a particular terminal. Thus, the interface of the circuit itself provides a novel authentication mechanism. This paper describes the detailed architecture of the system and clarifies the feasibility and effectiveness of the system. In addition, we discuss a fail-safe mechanism and future work necessary for the practical application of the system. key words: field-programmable gate array (FPGA), partial run-time reconfiguration (RTR), content protection, digital rights management (DRM)
2008 International Conference on Field Programmable Logic and Applications, 2008
A high-speed and secure dynamic partial reconfiguration (DPR) system is realized with AES-GCM that guarantees both confidentiality and authenticity of FPGA bitstreams. In DPR systems, bitstream authentication is essential for avoiding fatal damage caused by unintended bitstreams. An encryption-only system can prevent bitstream cloning and reverse engineering, but cannot prevent erroneous or malicious bitstreams from being configured. Authenticated encryption is a relatively new concept that provides both message encryption and authentication, and AES-GCM is one of the latest authenticated encryption algorithms suitable for hardware implementation. We implemented the AES-GCM-based DPR system targeting the Virtex-5 device on an off-the-shelf board, and evaluated its throughput and hardware resource utilization. For comparison, we also implemented AES-CBC and SHA-256 modules on the same device. The experimental results showed that the AES-GCM-based system achieved higher throughput with less resource utilization than the AES/SHA-based system. The AES-GCM module achieved more than 1 Gbps throughput and the entire system achieved about 800 Mbps throughput with reasonable resource utilization. This paper clarifies the advantage of using AES-GCM for protecting DPR systems.
Proceedings. 2003 IEEE International Conference on Field-Programmable Technology (FPT) (IEEE Cat. No.03EX798)
DIMES (Delaware Iterative Multiprocessor Emulation System) is a new FPGA-based hardware emulator for large logic systems incorporating a number of identical functional modules, such as a Multiprocessor-System-On-Chip or a cellular architecture. It aims to provide both logic verification and early software development environments with dramatically improved cost performance. Under our iterative emulation technology, a part of the FPGA resources is time-shared among several identical modules of the target design and iteratively used to emulate them in multiple steps. The representation of the identical modules in the FPGA consists of (1) a single module copy and (2) a storage block holding all the states of the modules during iterative emulation. On a first implementation of DIMES, called DIMES/P, we have implemented a multiprocessor-system-on-chip design of the IBM Cyclops architecture as a case study. We report our preliminary results and experience of exploiting the iterative emulation technology in Cyclops emulation.
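The iterative emulation idea described above (one shared module copy plus a storage block holding every module's state) can be illustrated in software. The following is a minimal sketch under stated assumptions, not the DIMES design: the module logic `module_step` (an 8-bit accumulator fed by its left neighbour in a ring) and the function names are hypothetical, chosen only to show why the previous-cycle states of all modules must be kept while the single copy is reused.

```python
from typing import List


def module_step(own: int, left: int) -> int:
    """Hypothetical logic of one module: add the left neighbour's state, 8-bit wrap."""
    return (own + left) & 0xFF


def emulate_cycle(states: List[int]) -> List[int]:
    """Advance all modules by one target clock cycle using one shared logic copy.

    `states` plays the role of the storage block. Each loop iteration is one
    emulation step: load a module's state, run the shared logic, store the result.
    New states go to a separate list so every module reads previous-cycle values.
    """
    n = len(states)
    next_states = []
    for i in range(n):
        left = states[(i - 1) % n]  # previous-cycle state of the ring neighbour
        next_states.append(module_step(states[i], left))
    return next_states


def emulate(states: List[int], cycles: int) -> List[int]:
    """Run the target design for the given number of clock cycles."""
    for _ in range(cycles):
        states = emulate_cycle(states)
    return states
```

In this sketch, emulating N modules costs N sequential steps per target clock cycle, which is the time-for-area trade that time-sharing a single module copy implies.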
ACM Symposium on Parallel Algorithms and Architectures, 1997
Multithreading aims to tolerate latency by overlapping communication with computation. This report explicates the multithreading capabilities of the EM-X distributed-memory multiprocessor through empirical studies. The EM-X provides hardware support for fine-grain multithreading, including a by-passing mechanism for direct remote reads and writes, hardware FIFO thread scheduling, and dedicated instructions for generating fixed-sized communication packets. Bitonic sorting and