Gustavo Sutter | Universidad Autónoma de Madrid
Papers by Gustavo Sutter
This work presents an FPGA-based architecture designed to aggregate and subsequently export TCP session records on links of up to 40 Gbit/s without packet sampling, even at the maximum packet rate. In this way, flow exporters based on special-purpose hardware are relieved of tasks for which FPGAs offer adequate flexibility and performance, reducing the requirements of the complete system. A working prototype of the system was implemented on the NetFPGA-SUME platform, where it was subjected to real traffic. The prototype also estimates per-flow retransmissions, in addition to other standard statistics such as the number of bytes and packets of each network connection.
Lecture notes in electrical engineering, 2012
Finite fields are used in different types of computers and digital communication systems. Two well-known examples are error-correction codes and cryptography. The traditional way of implementing the corresponding algorithms is software, running on general-purpose processors or on digital-signal processors. Nevertheless, in some cases the time constraints cannot be met with instruction-set processors, and specific hardware must be considered.
Lecture notes in electrical engineering, 2012
IFAC Proceedings Volumes, Apr 1, 1997
The generator includes several tools that translate the initial problem specification into a specific circuit implementation. From the rule-based specification, the generator produces a first computing scheme. By applying various transformations (lattice and arithmetic operation minimization, optimal register assignment, ...) the system produces the microprogram that drives the controller. For this purpose, the VHDL language is used to simulate the hardware controller, and the ES2 design kit is used to design the circuits to be integrated.
IEEE Access, 2021
Near-lossless compression is a generalization of lossless compression, where the codec user is able to set the maximum absolute difference (the error tolerance) between the values of an original pixel and the decoded one. This enables higher compression ratios, while still allowing control of the bounds of the quantization errors in the space domain. This feature makes near-lossless codecs attractive for applications where a high degree of certainty is required. The JPEG-LS lossless and near-lossless image compression standard combines a good compression ratio with a low computational complexity, which makes it very suitable for scenarios with strong restrictions, common in embedded systems. However, our analysis shows great potential for improving coding efficiency, especially for lower entropy distributions, which are more common in near-lossless compression. In this work, we propose enhancements to the JPEG-LS standard, aimed at improving its coding efficiency at a low computational overhead, particularly for hardware implementations. The main contribution is a low-complexity and efficient coder, based on Tabled Asymmetric Numeral Systems (tANS), well suited for a wide range of entropy sources and with a simple hardware implementation. This coder enables further optimizations, resulting in great compression ratio improvements. When targeting photographic images, the proposed system achieves, on average, 1.6%, 6%, and 37.6% better compression for error tolerances of 0, 1, and 10, respectively. Additional improvements are achieved through a larger context size and image tiling, obtaining 2.3% lower bpp for lossless compression. Our results also show that our proposal compares favorably against state-of-the-art codecs like JPEG-XL and WebP, particularly in near-lossless compression, where it achieves higher compression ratios with a faster coding speed.
INDEX TERMS Image codec, near-lossless compression, JPEG-LS, asymmetric numeral systems, low complexity, two-sided geometric distribution.
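The error-tolerance guarantee described in this abstract can be illustrated with the uniform quantization of prediction residuals used in JPEG-LS near-lossless mode. The following is a minimal sketch, not the coder proposed in the paper: the function names are hypothetical, and the prediction, context modeling, and tANS stages are left out entirely.

```python
# Minimal sketch of near-lossless residual quantization in the style of JPEG-LS.
# Function names are illustrative; only the quantization step is modeled.

def quantize_error(err: int, near: int) -> int:
    """Map a prediction residual to a quantization index for tolerance `near`."""
    step = 2 * near + 1
    if err > 0:
        return (err + near) // step
    return -((near - err) // step)


def dequantize_error(index: int, near: int) -> int:
    """Reconstruct the residual value the decoder would use."""
    return index * (2 * near + 1)


if __name__ == "__main__":
    for near in (0, 1, 10):
        worst = max(
            abs(e - dequantize_error(quantize_error(e, near), near))
            for e in range(-255, 256)
        )
        # The worst-case reconstruction error never exceeds the tolerance.
        print(f"near={near}: worst-case error = {worst}")
```

With a tolerance of 0 the mapping is the identity, which corresponds to the lossless case; larger tolerances shrink the residual alphabet and lower its entropy, which is the regime the abstract identifies as having the most room for improvement.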
D3.2 reports the first validation of the METRO-HAUL node architecture and the interconnection, transmission and switching optical solutions developed in T3.1-T3.3 in different disaggregated scenarios. The deliverable also includes an early description of the METRO-HAUL node control and management environment and the developed software for controlling all the OIE developed in T3.4.
2017 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2017
In network traffic monitoring, a very important analysis is finding heavy hitters, that is, the flows that use the most resources on a given network link. This information can be very useful for security or traffic management purposes. Though this analysis might seem easy to implement, since it is essentially based on counting, doing it at 100 Gbit/s rates is far from trivial. In 100 Gbit/s Ethernet (100 GbE), up to 148 million packets per second can be received, which makes it very difficult to parse packets and maintain counters at such a rate. In this paper, we leverage the integrated 100G Ethernet Subsystem available in Xilinx UltraScale devices to implement a heavy hitter detector for 100 GbE in a VCU108 evaluation kit. Thanks to the integration of the Count Sketch algorithm with a priority list and a network packet parser, the proposed architecture is able to work at line rate for average packet sizes larger than 215 bytes. The work presents a theoretical analysis of the error, as well as the technical details of the proposed solution. The implementation has been validated using real-world traces, obtaining an average error of 1.29%.
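As a rough software illustration of the Count Sketch algorithm mentioned in this abstract, the sketch below keeps several rows of signed counters and estimates a flow's count as the median across rows. It is an assumption-laden toy, not the FPGA design: the class name, the table dimensions, and the use of BLAKE2b as the hash are all choices made here for illustration, and the priority list and packet parser are not modeled.

```python
# Toy Count Sketch: each row hashes a key to one counter and a +/-1 sign.
# Dimensions and hash function are illustrative assumptions.
import hashlib
from statistics import median


class CountSketch:
    def __init__(self, rows: int = 5, cols: int = 1024) -> None:
        self.rows = rows
        self.cols = cols
        self.table = [[0] * cols for _ in range(rows)]

    def _slot(self, row: int, key: bytes) -> tuple[int, int]:
        """Return (column, sign) for `key` in the given row."""
        digest = hashlib.blake2b(
            key, digest_size=8, salt=row.to_bytes(2, "little")
        ).digest()
        value = int.from_bytes(digest, "little")
        sign = 1 if value & 1 else -1
        return (value >> 1) % self.cols, sign

    def update(self, key: bytes, count: int = 1) -> None:
        """Add `count` packets (or bytes) to the flow identified by `key`."""
        for row in range(self.rows):
            col, sign = self._slot(row, key)
            self.table[row][col] += sign * count

    def estimate(self, key: bytes):
        """Median of the per-row signed counters is the frequency estimate."""
        per_row = []
        for row in range(self.rows):
            col, sign = self._slot(row, key)
            per_row.append(sign * self.table[row][col])
        return median(per_row)


if __name__ == "__main__":
    sketch = CountSketch()
    for _ in range(1000):
        sketch.update(b"10.0.0.1:443->10.0.0.2:51000")
    for i in range(30):
        sketch.update(f"background-flow-{i}".encode())
    print(sketch.estimate(b"10.0.0.1:443->10.0.0.2:51000"))
```

The random signs make collisions from other flows tend to cancel rather than accumulate, which is why the median over rows gives a usable estimate with a fixed amount of memory.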
2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2016
Network traffic monitoring is becoming increasingly hard to manage due to the ever-growing speed of network links. At 100 Gbit/s, the huge volume of data makes it very difficult to perform online analyses or to store traffic for subsequent forensic investigations. It is therefore mandatory to carry out some kind of filtering and/or capping of the network traffic to be analyzed. Additionally, the fraction of encrypted traffic is relentlessly increasing. For such encrypted traffic, storing the payload is usually useless. In this paper we present an FPGA implementation of a method to identify plain text (that is, human-readable content) in the network packet payload. The method is based on both detecting bursts of printable ASCII characters and calculating the fraction of these printable characters in the packet payload. This method has proven to be very effective in reducing the amount of information used in traffic analysis, by saving only the headers of packets with encrypted payloads. We leveraged the advantages of high-level languages to reduce development time, though traditional HDLs were also used to optimize critical areas of the design. The design targets the 100 Gbit/s Ethernet interfaces of Xilinx Virtex UltraScale devices and is able to detect human-readable packet payloads at line rate with high accuracy.
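The two measurements named in this abstract, the longest burst of printable ASCII characters and the fraction of printable characters, can be sketched in software as follows. The thresholds, the function names, the rule that both criteria must hold, and the choice of 0x20 to 0x7E as the printable range are assumptions made for illustration; the abstract does not state them.

```python
# Illustrative plain-text detector: longest printable burst plus printable fraction.
# Thresholds and the combination rule are hypothetical, not taken from the paper.

def printable_stats(payload: bytes) -> tuple[int, float]:
    """Return (longest run of printable ASCII bytes, fraction of printable bytes)."""
    longest = current = printable = 0
    for byte in payload:
        if 0x20 <= byte <= 0x7E:  # assumed printable range (space to tilde)
            printable += 1
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    fraction = printable / len(payload) if payload else 0.0
    return longest, fraction


def looks_like_text(
    payload: bytes, min_burst: int = 16, min_fraction: float = 0.75
) -> bool:
    """Classify a payload as human readable when both criteria are met."""
    longest, fraction = printable_stats(payload)
    return longest >= min_burst and fraction >= min_fraction


if __name__ == "__main__":
    http = b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n"
    print(looks_like_text(http))                    # plain-text request
    print(looks_like_text(bytes(range(128, 256))))  # no printable bytes at all
```

A monitoring probe built on this idea would store the full packet only when the payload is classified as text and keep just the headers otherwise.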
Lecture Notes in Electrical Engineering, 2012
2017 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2017
Network traffic monitoring usually faces the problem of packet duplication, which arises when port mirroring is used, that is, when traffic is copied from the monitored ports of a switch or router to a mirror port where a monitoring probe is attached. A packet can thus be copied twice, at both the ingress and egress ports, generating duplicates. Information redundancy caused by packet duplication not only leads to increased workloads at the monitoring probes, but also calls for more disk space to store the network traces. In fact, packet duplication may increase the monitoring load by 100%. There are different sorts of packet duplication; in this paper we focus on switching duplication, because it is the most common in a network monitoring scenario, where the network probe is attached to a core switch. We present a high-performance FPGA architecture that is able to detect and remove duplicated packets in 100 Gbit/s networks. It is based on a 64-bit key and a BRAM-based shift register that allows us to build an element-based sliding window of up to 79,872 elements. The design targets the Xilinx Virtex UltraScale family, using the integrated 100G Ethernet Subsystem available in such devices, and it has been tested on a VCU108 evaluation kit.
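A software analogue of the element-based sliding window described in this abstract might look like the sketch below. It is only a behavioral model under several assumptions: the 64-bit key is derived here by hashing the whole packet with BLAKE2b, duplicates are dropped without being inserted into the window, and a set is used for lookups where the hardware uses a BRAM-based shift register. The class and method names are hypothetical.

```python
# Behavioral model of duplicate removal with a 64-bit key and a sliding window
# holding the most recent keys. Key derivation and window policy are assumptions.
import hashlib
from collections import deque


class DuplicateFilter:
    def __init__(self, window_size: int = 79_872) -> None:
        self.window_size = window_size
        self.window: deque[int] = deque()  # keys in arrival order
        self.seen: set[int] = set()        # same keys, for fast membership tests

    @staticmethod
    def key64(packet: bytes) -> int:
        """Reduce a packet to a 64-bit key (assumed: hash of all packet bytes)."""
        return int.from_bytes(hashlib.blake2b(packet, digest_size=8).digest(), "big")

    def is_duplicate(self, packet: bytes) -> bool:
        """Return True if the packet's key is still inside the sliding window."""
        key = self.key64(packet)
        if key in self.seen:
            return True
        if len(self.window) == self.window_size:
            self.seen.discard(self.window.popleft())  # evict the oldest key
        self.window.append(key)
        self.seen.add(key)
        return False


if __name__ == "__main__":
    dedup = DuplicateFilter(window_size=4)
    packets = [b"pkt-1", b"pkt-2", b"pkt-1", b"pkt-3"]
    print([dedup.is_duplicate(p) for p in packets])
```

The window size bounds how far apart two copies of a packet may arrive and still be recognized as duplicates, which is why the abstract reports it as a key parameter of the design.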
IEEE Transactions on Circuits and Systems II: Express Briefs
Ambiguous read-after-write (RAW) dependencies are omnipresent in multiple streaming applications, establishing hard-to-optimize bottlenecks. Considering actual input data, these may rarely be true dependencies. However, the increasingly used High-Level Synthesis (HLS) compilers must assume the worst-case scenario, as they rely on static optimizations. Conditional stalling is a simple yet impactful technique, useful even when conflicts are common. At the cost of a small area penalty, it allows improving (in some cases, by several times) the mean throughput of these systems. In this brief, we describe a high-frequency HLS implementation of the technique and examine its behavior as a function of input and architecture characteristics, with the goal of understanding when to use it and how to optimize throughput.
Lecture Notes in Electrical Engineering, 2012
Lecture Notes in Electrical Engineering, 2012
Synthesis of Arithmetic Circuits, 2006
FPGA, ASIC, and Embedded Systems, 2006
6 ARITHMETIC OPERATIONS: DIVISION. Integer or finite-length fractional numbers can be multiplied exactly, whenever sufficient length is allowed for the result. Division doesn't share this feature. As a matter of fact, division ...
2016 26th International Conference on Field Programmable Logic and Applications (FPL), 2016
Networks are currently essential for computing, so guaranteeing the quality of network links is necessary to ensure the proper operation of computing systems. However, a widespread deployment of network monitoring devices might not be economically feasible. In this paper, we propose the use of Programmable System-on-Chip FPGAs (PSoCs) to enable comprehensive testing of networks. Software-only solutions are no longer valid, because the timescales of current networks call for custom-hardware solutions. Thus, we show that PSoCs are a perfect fit for network quality monitoring devices, by mapping the required measurements to the capabilities of such devices. To demonstrate the benefits of PSoCs for monitoring the quality of network links, we have developed a prototype based on Xilinx Zynq that is capable of measuring the key performance indicators of Gigabit Ethernet networks, namely available bandwidth, packet loss, delay and jitter. The monitoring probe features GPS-based timestamping, thus enabling the construction of network delay maps. We present the benefits of the proposed approach in terms of cost and simplicity, and we also show how it could be expanded to multi-Gb/s networks.
This paper describes the design and implementation of a hardware module to calculate decimal floating-point (DFP) multiplication compliant with the current IEEE 754-2008 standard. The proposed design is made up of independent stages: IEEE 754 coder/decoder, decimal multiplier and rounding. The decimal multiplication is based on a previously designed BCD multiplier. The novelty is the design of a combinational and a sequential architecture for the rounding stage. Timing performance and hardware requirements are reported and evaluated. A decimal64 multiplication can be performed in 66 ns on a Virtex-4. The DFP multiplier presented supports operations on the decimal64 format and is easily extendable to the decimal128 format. To the best of the authors' knowledge, this is the first publication to present an IEEE 754-2008 multiplier on FPGA.
Microprocessors and Microsystems, 2018
This paper proposes efficient fixed-point and floating-point implementations for radix-10 decimal logarithm on Xilinx FPGA devices. The technique is based on the digit-recurrence method, which supports the three decimal floating-point (DFP) types specified in the IEEE 754-2008 standard. The novelty of this proposal is that it avoids the implementation of redundant carry-save logic by direct selection (i.e. via scaling). The designs involve novel techniques based on efficient use of dedicated resources in the programmable devices. Implementations were made on Xilinx 7-series devices. For fixed-point logarithm, they are capable of operating at up to 145 MHz for p = 7, 124 MHz for p = 16 and 108 MHz for p = 34, and for DFP logarithm the operating frequency obtained was 123 MHz for p = 7, 104 MHz for p = 16 and 93 MHz for p = 34. In contrast to other related works, the proposed architecture achieves better computation times and lower area occupation in terms of LUTs.