Ryan Kastner | University of California, San Diego
Papers by Ryan Kastner
2017 International Conference on Computing, Networking and Communications (ICNC), 2017
Face recognition systems play a vital role in many applications, including surveillance, biometrics, and security. In this work, we present a complete real-time face recognition system, implemented on an FPGA, consisting of face detection, recognition, and downsampling modules. Our system provides an end-to-end solution for face recognition: it receives video input from a camera, detects the locations of the face(s) using the Viola-Jones algorithm, subsequently recognizes each face using the Eigenface algorithm, and outputs the results to a display. Experimental results show that our complete face recognition system operates at 45 frames per second on a Virtex-5 FPGA.
2014 IEEE 11th International Conference on Mobile Ad Hoc and Sensor Systems, 2014
The increasing demand for complex and specialized embedded hardware must be met by processors which are optimized for performance, yet are also extremely flexible. In our work, we explore the tradeoff between flexibility and performance in the domain of reconfigurable processor design. Specifically, we seek to identify regularly occurring, computation-heavy patterns in an application or set of applications. These patterns become candidates for hard-logic implementation, potentially embedded in the flexible reconfigurable fabric as special optimized instructions. In this work we present an extension to previous work in instruction generation: an algorithm that identifies parallel templates. We discuss the advantages of parallel templates, and prove the correctness of our algorithm. We introduce an All-Pairs Common Slack Graph (APCSG) as an effective tool for parallel template generation. Finally, we demonstrate the effectiveness of our algorithm on several applications' dataflow graphs, reducing latency on average by 51.98%, without unreasonably increasing chip area.
Proceedings of the 47th Design Automation Conference, 2010
Understanding the flow of information is an important aspect of computer security. There has been a recent move towards tracking information in hardware and understanding the flow of individual bits through Boolean functions. Such gate level information flow tracking (GLIFT) provides a precise understanding of all flows of information and is an invaluable tool for proving many system security properties. This paper presents a theoretical analysis of GLIFT.
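As an illustration of tracking individual bits through a Boolean function, the following is a minimal Python sketch of the commonly cited GLIFT tracking logic for a two-input AND gate. It is not taken from the paper; the function name `glift_and` and the convention that a taint bit of 1 means "tainted" are assumptions made for this example.

```python
def glift_and(a: int, a_t: int, b: int, b_t: int) -> tuple[int, int]:
    """Two-input AND gate with gate-level taint tracking (illustrative sketch).

    a, b     : data bits (0 or 1)
    a_t, b_t : taint bits (1 means the corresponding data bit is tainted)
    Returns (out, out_t): the AND result and its taint bit.
    """
    out = a & b
    # The output is tainted only if a tainted input can actually change it:
    #  - b is tainted and a is 1 (a does not force the output to 0)
    #  - a is tainted and b is 1
    #  - both inputs are tainted
    out_t = (a & b_t) | (b & a_t) | (a_t & b_t)
    return out, out_t


if __name__ == "__main__":
    # An untainted 0 on one input forces the output to 0,
    # so a tainted value on the other input does not flow to the output.
    print(glift_and(0, 0, 1, 1))
    # With an untainted 1 on one input, the tainted input determines the output.
    print(glift_and(1, 0, 1, 1))
```

The point of the sketch is the precision: a conservative tracker would mark the output tainted whenever any input is tainted, whereas the gate-level logic also consults the data values.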
This paper addresses the issue of timing-driven gate duplication for delay optimization. Gate duplication has been used extensively for cutset minimization, but its usefulness in minimizing circuit delay has not been addressed. This paper studies the complexity issues in timing-driven gate duplication and proposes an algorithm for solving the so-called global gate duplication problem. Delay improvements over highly optimized results from SIS are reported.
High-assurance systems, such as flight control and banking systems, require strict guarantees on information flows or else face catastrophic consequences. Information flow tracking (IFT) is a frequently used security measure for preventing unintended information flows in such systems. Recently, Gate Level Information Flow Tracking (GLIFT) has been proposed to track information flows at the hardware level. GLIFT enables a concrete understanding of all information flows from Boolean gates. It unifies the notions of explicit flows, covert channels, and even timing channels at the gate level and provides a general approach for enhancing important security properties such as integrity and confidentiality. This article presents a new encoding scheme for GLIFT with fewer encoding states by combining two states into one. Unlike the previous method, this reduction in encoding states allows the GLIFT tracking logic to operate independently from the original circuit. This independence allows the GLIFT logic to be configured as both redundancy and tracking logic for the original circuit. Further, experimental results show that this new state assignment provides, on average, a 25.7% reduction in area, a 31.4% reduction in delay, and a 48.6% decrease in simulation time for several IWLS benchmarks.
American Academy of Underwater Sciences, 2009
The quantification of abundance, size, and distribution of fish is critical to properly manage and protect marine ecosystems and regulate marine fisheries. Currently, fish surveys are conducted using fish tagging, scientific diving, and/or capture and release methods (i.e., net trawls), methods that are both costly and time-consuming. Therefore, an automated way to conduct fish surveys could provide a real benefit to marine managers. To provide automated fish counts and classification, we propose an automated fish species classification system using computer vision. This computer vision system can count and classify fish found in underwater video images using a classification method known as Haar classification. We have partnered with the Birch Aquarium to obtain underwater images of a variety of fish species, and present in this paper the implementation of our vision system and its detection results for our first test species, the Scythe Butterfly fish, subject of the Birch Aquarium logo.
IEEE Design and Test of Computers, Feb 9, 2008
2009 IEEE Wireless Communications and Networking Conference, Apr 5, 2009
Matrix decomposition is required in various algorithms used in wireless communication applications. FPGAs strike a balance between ASICs and DSPs, as they have the programmability of software with performance capacity approaching that of a custom hardware implementation. However, FPGA architectures require designers to make countless system, architectural, and logic design decisions. By performing design space exploration, a designer can find the optimal device for a specific application; however, very few tools exist that can accomplish this task. This paper presents automatic generation and optimization of decomposition methods using a core generator tool, GUSTO, that we developed to enable easy design space exploration with different parameterization options such as resource allocation, bit widths of the data, number of functional units, and organization of controllers and interconnects. We present a detailed study of area and throughput tradeoffs of matrix decomposition architectures using different parameterizations.
ACM Transactions on Design Automation of Electronic Systems, Jul 1, 2008
The extremely high cost of custom ASIC fabrication makes FPGAs an attractive alternative for deployment of custom hardware. Embedded systems based on reconfigurable hardware integrate many functions onto a single device. Since embedded designers often have no choice but to use soft IP cores obtained from third parties, the cores operate at different trust levels, resulting in mixed-trust designs. The goal of this project is to evaluate recently proposed security primitives for reconfigurable hardware by building a real embedded system with several cores on a single FPGA and implementing these primitives on the system. Overcoming the practical problems of integrating multiple cores together with security mechanisms will help us to develop realistic security-policy specifications that drive enforcement mechanisms on embedded systems.
Range 1 → [0x28000010,0x28000777]; (AES1)
Range 2 → [0x28000800,0x28000fff]; (AES2)
Range 3 → [0x24000000,0x24777777]; (DRAM1)
Range 4 → [0x24800000,0x24ffffff]; (DRAM2)
Range 5 → [0x40600000,0x4060ffff]; (RS-232)
Range 6 → [0x40c00000,0x40c0ffff]; (Ethernet)
Range 7 → [0x28000004,0x28000007]; (Ctrl Word 1)
Range 8 → [0x28000008,0x2800000f]; (Ctrl Word 2)
Range 9 → [0x28000000,0x28000003]; (Ctrl Word AES)
The second part of the policy specifies the different access modes, one for each state.
Access 0 → {Module 1, rw, Range 5} | {Module 2, rw, Range 6} | {Module 1, rw, Range 3} | {Module 2, rw, Range 4}
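One way to read the policy fragment above is as a lookup that a reference monitor could perform on each memory access. The following is a minimal Python sketch under stated assumptions: the address ranges are treated as inclusive, `rw` is taken to grant both reads and writes, and only the ranges named in the Access 0 rule are modeled. The names `RANGES`, `ACCESS_0`, and `is_allowed` are hypothetical and do not come from the paper, which targets a hardware enforcement mechanism rather than software.

```python
# Address ranges copied from the policy fragment (inclusive bounds assumed).
RANGES = {
    3: (0x24000000, 0x24777777),  # DRAM1
    4: (0x24800000, 0x24FFFFFF),  # DRAM2
    5: (0x40600000, 0x4060FFFF),  # RS-232
    6: (0x40C00000, 0x40C0FFFF),  # Ethernet
}

# Access 0: each rule is (module, permitted operations, range number).
ACCESS_0 = [
    (1, "rw", 5),
    (2, "rw", 6),
    (1, "rw", 3),
    (2, "rw", 4),
]


def is_allowed(module: int, op: str, addr: int, rules=ACCESS_0) -> bool:
    """Return True if `module` may perform `op` ('r' or 'w') at `addr`."""
    for rule_module, mode, range_id in rules:
        low, high = RANGES[range_id]
        if rule_module == module and op in mode and low <= addr <= high:
            return True
    # Default deny: anything not explicitly permitted is rejected.
    return False


if __name__ == "__main__":
    print(is_allowed(1, "r", 0x40600000))  # Module 1 reading RS-232
    print(is_allowed(2, "w", 0x40600000))  # Module 2 writing RS-232
```

The default-deny return is a design choice of this sketch; the fragment itself lists only permitted accesses.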
Proceedings of the ASP-DAC 2005 Asia and South Pacific Design Automation Conference 2005, Jan 18, 2005
This paper presents a novel technique to reduce the number of operations in multiplierless implementations of linear DSP transforms by iteratively eliminating two-term common subexpressions. Our method uses a polynomial transformation of linear systems that enables us to eliminate common subexpressions consisting of multiple variables. Our algorithm is fast and produces the fewest additions/subtractions compared to all known techniques. The synthesized examples show significant reductions in area and power consumption.
ERSA, 2008
Computing systems designed using reconfigurable hardware are now used in many sensitive applications, where security is of utmost importance. Unfortunately, a strong notion of security is not currently present in FPGA hardware and software design flows. In the following, we discuss the security implications of using reconfigurable hardware in sensitive applications, and outline problems, attacks, solutions and topics for future research.
Deep sub-micron effects, along with increasing interconnect densities, have increased the complexity of the routing problem. Whereas previously we could focus on minimizing wirelength, we must now consider a variety of objectives during routing. For example, an increased number of timing restrictions means that we must minimize interconnect delay. But interconnect delay is no longer simply related to wirelength. Coupling capacitance has become a dominant component of delay due to the shrinking of device sizes. Regardless, the most important objective is producing a routable circuit. Unfortunately, this often conflicts with minimizing interconnect delay, as minimum delay routes create congested areas for which an exact routing cannot be realized without violating design rules. In this work, we use the concept of pattern routing to develop algorithms that guide the router to a solution that minimizes interconnect delay, by considering both coupling and wirelength, without damaging the routability of the circuit.
This paper describes methods for synthesizing the internal representation of a compiler into a hardware description language in order to program reconfigurable hardware devices. We demonstrate the usefulness of static single assignment (SSA) in reducing the amount of data communication in the hardware. However, the placement of Φ-nodes by current SSA algorithms is not optimal in terms of minimizing data communication. We propose an improved SSA algorithm which optimally places Φ-nodes, further decreasing area and communication latency. Our algorithm reduces the data communication (measured as total edge weight in a control data flow graph) by as much as 20% for some applications as compared to the best-known SSA algorithm, the pruned algorithm. We also show that our algorithm frequently leads to increased overall area, and describe future modifications to our model that should correct this shortcoming.
Proceedings of the 43rd Annual Design Automation Conference, 2006
Transistor leakage is poised to become the dominant source of power dissipation in digital systems, and reconfigurable devices are not immune to this problem. Modern FPGAs already have a significant amount of memory on the die, and with each generation the proportion of embedded memory to logic cells is growing. While assigning high Vth can limit the leakage power, embedded memory timing is critical to performance and will draw an increasingly significant amount of leakage current. However, unlike in many processor-based systems, on-chip memory accesses are often fully deterministic and completely under the control of the scheduler. In this paper we explore a variety of techniques, ranging in complexity and effectiveness, to battle the problem of leakage in FPGA embedded memories. Through the addition of sleep and drowsy modes, controlled by the scheduler, the amount of leakage power can be reduced by several orders of magnitude. We show how even very simple schemes offer large benefits, and that further reductions are possible through careful leakage-aware data placement.