A high-performance framework for a network programmable packet processor using P4 and FPGA

An Energy-Efficient FPGA-Based Packet Processing Framework

Lecture Notes in Computer Science, 2010

Modern packet processing hardware (e.g., IPv6-capable routers) demands high processing power while also remaining power-efficient. In this paper we present an architecture for high-speed packet processing with hierarchical chip-level power management that minimizes the energy consumption of the system. In particular, we present a modeling framework that provides an easy way to create new networking applications on an FPGA-based board. The development environment is built around a modeling environment in which the new application is modeled in SystemC. Furthermore, our power management scheme is modeled and tested against different traffic loads through extensive simulation analysis. Our results show that the proposed solution can significantly reduce energy consumption across a wide range of traffic scenarios.
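The core idea, evaluating energy under different traffic loads when an idle block can be power-gated, can be illustrated with a toy model. This is a hypothetical sketch, not the paper's SystemC framework; the power figures and single-queue service model are assumptions for illustration only.

```python
def simulate_energy(arrival_cycles, service_cycles, total_cycles,
                    active_mw=500.0, idle_mw=50.0, cycle_ns=5.0):
    """Toy cycle model: the engine draws active_mw (assumed) while serving a
    packet and idle_mw (assumed, clock-gated) otherwise; returns energy in mJ."""
    arrivals = sorted(arrival_cycles)
    queue = remaining = busy = i = 0
    for cycle in range(total_cycles):
        # enqueue packets arriving this cycle
        while i < len(arrivals) and arrivals[i] == cycle:
            queue += 1
            i += 1
        # start serving the next packet if the engine is free
        if remaining == 0 and queue > 0:
            queue -= 1
            remaining = service_cycles
        if remaining > 0:
            busy += 1
            remaining -= 1
    idle = total_cycles - busy
    # mW * ns = picojoules; scale to millijoules
    return (busy * active_mw + idle * idle_mw) * cycle_ns * 1e-9
```

Under this model, a lightly loaded engine spends most cycles in the low-power state, which is where the hierarchical gating pays off.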

A multiprocessor architecture for fast packet processing

2005 12th IEEE International Conference on Electronics, Circuits and Systems, 2005

The design of high-performance application-specific processors is the bottleneck in many applications. In critical high-data-rate network devices such as modems and other end-user interfaces, the packet forwarding engine requires both a high degree of flexibility (to support a large number of protocols) and extremely high performance (to sustain gigabit processing). While hardware application-specific designs can cope with the performance requirement, they lack flexibility. Software-based solutions would be ideal for meeting the flexibility and low-cost constraints, but unfortunately they miss the speed requirement. The main goal of this paper is to present an architectural model for a new processor architecture based on multiple cores and specific dedicated hardware units. In particular, we propose a new model that takes into account network protocol characteristics and the nature of the input traffic to meet performance and cost requirements.

Packet Processing Acceleration With a 3-Stage Programmable Pipeline Engine

IEEE Communications Letters, 2004

In this letter, we present the architecture and implementation of a novel 3-stage processing engine suitable for deep packet processing in high-speed networks. The engine, which has been fabricated as part of a network processor, comprises a typical RISC core and programmable hardware. To assess the performance of the engine, experiments with packets of various lengths have been performed and compared against the IXP1200 network processor. The comparison reveals that, for the case study shown in this letter, the proposed packet-processing engine is up to three times faster. Moreover, the engine is simple to fabricate, less expensive than the corresponding hardware cores of the IXP1200, and can easily be programmed for different networking applications.

An FPGA-based soft multiprocessor system for IPv4 packet forwarding

2005

To realize high performance, embedded applications are deployed on multiprocessor platforms tailored for an application domain. However, when a suitable platform is not available, only a few application niches can justify the increasing costs of an IC product design. An alternative is to design the multiprocessor on an FPGA. This retains the programmability advantage while obviating the risks of producing silicon, and it also opens FPGAs to the world of software designers. In this paper, we demonstrate the feasibility of FPGA-based multiprocessors for high-performance applications. We deploy IPv4 packet forwarding on a multiprocessor on the Xilinx Virtex-II Pro FPGA. The design achieves a 1.8 Gbps throughput and loses only 2.6X in performance (normalized to area) compared to an implementation on the Intel IXP-2800 network processor. We also develop a design space exploration framework using Integer Linear Programming to explore multiprocessor configurations for an application. Using this framework, we achieve a more efficient multiprocessor design that surpasses the performance of our hand-tuned solution for packet forwarding.

System-on-chip packet processor for an experimental network services platform

2003

As the focus of networking research shifts from raw performance to the delivery of advanced network services, there is a growing need for open-platform systems for extensible networking research. The Applied Research Laboratory at Washington University in Saint Louis has developed a flexible Network Services Platform (NSP) to meet this need. The NSP provides an extensible platform for prototyping next-generation network services and applications. This paper describes the design of a system-on-chip Packet Processor for the NSP which performs all core packet processing functions including segmentation and reassembly, packet classification, route lookup, and queue management. Targeted to a commercial configurable logic device, the system is designed to support gigabit links and switch fabrics with a 2:1 speed advantage. We provide resource consumption results for each component of the Packet Processor design.

P4-Compatible High-Level Synthesis of Low Latency 100 Gb/s Streaming Packet Parsers in FPGAs

Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2018

Packet parsing is a key step in SDN-aware devices. Packet parsers in SDN networks need to be both reconfigurable and fast, to support evolving network protocols and increasing multi-gigabit data rates. The combination of packet processing languages with FPGAs seems to be the perfect match for these requirements. In this work, we develop an open-source FPGA-based configurable architecture for arbitrary packet parsing for use in SDN networks. We generate low-latency, high-speed streaming packet parsers directly from a packet processing program. Our architecture is pipelined and entirely modeled using templated C++ classes. The pipeline layout is derived from a parser graph that corresponds to a P4 program after a series of graph transformation rounds. The RTL code is generated from the C++ description using Xilinx Vivado HLS and synthesized with Xilinx Vivado. Our architecture achieves a 100 Gb/s data rate on a Xilinx Virtex-7 FPGA while reducing latency by 45% and LUT usage by 40% compared to the state of the art.
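One way to picture mapping a parser graph onto a pipeline is to assign each header node a stage equal to its longest distance from the root, so every parse edge crosses a stage boundary. The sketch below is illustrative only; it is not the paper's transformation algorithm, and the toy header graph is an assumption.

```python
def pipeline_stages(graph, root):
    """Longest-path levelization of a parse DAG: stage[n] is the length of
    the longest root-to-n path, a natural pipeline depth for node n."""
    stage = {root: 0}

    def visit(node):
        for nxt in graph.get(node, []):
            # relax: push successors to at least one stage past this node
            if stage.get(nxt, -1) < stage[node] + 1:
                stage[nxt] = stage[node] + 1
                visit(nxt)

    visit(root)
    return stage

# Toy parse graph: Ethernet -> {IPv4, IPv6} -> TCP
parse_graph = {
    "ethernet": ["ipv4", "ipv6"],
    "ipv4": ["tcp"],
    "ipv6": ["tcp"],
}
```

Here IPv4 and IPv6 land in the same stage because they are parsed alternatively, while TCP is placed one stage later since both paths feed into it.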

A Packet Generator on the NetFPGA Platform

2009 17th IEEE Symposium on Field Programmable Custom Computing Machines, 2009

A packet generator and network traffic capture system has been implemented on the NetFPGA. The NetFPGA is an open networking platform accelerator that enables rapid development of hardware-accelerated packet processing applications. The packet generator application allows Internet packets to be transmitted at line rate on up to four Gigabit Ethernet ports simultaneously. The data to transmit is specified in a standard PCAP file, transferred to local memory on the NetFPGA card, and then sent on the Gigabit links with a precise data rate, inter-packet delay, and number of iterations specified by the user. The hardware circuit also simultaneously operates as a packet capture system, allowing traffic to be captured from up to all four of the Gigabit Ethernet ports. Timestamps are recorded, and traffic can be transferred back to the host and stored in the same PCAP format. The project has been implemented as a fully open-source project and serves as an exemplar of how to build and distribute NetFPGA applications. All of the code (Verilog hardware, system software, verification scripts, makefiles, and support tools) can be freely downloaded from the NetFPGA.org website. Benchmarks comparing this hardware-accelerated application to the fastest available PC with a PCIe NIC show that the FPGA-based hardware accelerator far exceeds the performance possible with tcpreplay software.
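The rate-control part of such a generator reduces to computing per-packet transmit times from the trace and a target rate. The sketch below is a simplified illustration, not the NetFPGA tool's code; the 20-byte per-frame overhead (preamble plus inter-frame gap) is an assumption.

```python
def schedule(packet_lens, rate_bps, overhead_bytes=20):
    """Return the transmit timestamp (seconds) of each packet so that the
    trace replays at exactly rate_bps, including assumed per-frame overhead."""
    t = 0.0
    times = []
    for length in packet_lens:
        times.append(t)
        # serialization time of this frame at the target rate
        t += (length + overhead_bytes) * 8 / rate_bps
    return times
```

For back-to-back 1500-byte frames at 1 Gb/s, this yields an inter-packet spacing of 1520 * 8 / 1e9 ≈ 12.16 µs, the kind of precise gap a hardware generator can hold but replay software on a commodity PC typically cannot.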

An efficient pipeline processing scheme for programming Protocol-independent Packet Processors

Journal of Network and Computer Applications, 2020

OpenFlow is unable to provide customized flow tables, resulting in memory explosions and high switch retirement rates; this has become a bottleneck for the development of SDN. Recently, P4 (Programming Protocol-independent Packet Processors) has attracted much attention from both academia and industry. It provides customized networking services by offering flow-level control, and it can "produce" various forwarding tables according to packets while retaining the speed of custom ASICs. However, with the prevalence of P4, these multiple forwarding tables can explode when used in large-scale networks. The explosion problem slows down lookups, causing congestion and packet loss. In addition, the pipelined structure of the forwarding tables introduces additional processing delay. In this study, we improve lookup performance by optimizing the forwarding tables of P4. Intuitively, we install rules according to their popularity, i.e., popular rules appear earlier than others, so packets hit the matching rule sooner. In this paper, we formalize the optimization problem and prove that it is NP-hard. To solve it, we propose a heuristic algorithm called EPSP (Efficient Pipeline Processing Scheme for P4), which largely reduces lookup time while keeping the forwarding actions unchanged. Because running the optimization algorithm frequently imposes additional processing overhead, we design an incremental update algorithm to alleviate this problem. To evaluate the proposed algorithms, we set up a simulation environment based on ns-3. The simulation results show that the algorithm greatly reduces both the lookup time and the number of memory accesses, and that the incremental algorithm largely reduces the processing overhead while the lookup time remains almost the same as with the non-incremental algorithm. We also implemented a prototype using Floodlight and Mininet. The results show that our algorithm incurs acceptable overhead and performs better than the traditional algorithm. This paper is an extended version of the IEEE LCN 2016 paper (Wu et al., 2016); the additional contributions include the incremental update algorithm, additional experiments, and the use of a P4 switch.
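The intuition behind popularity-based rule placement can be sketched with a simplified stand-in (EPSP itself is a heuristic for an NP-hard problem; this is not the paper's algorithm): move popular rules earlier, but never reorder two rules whose match sets overlap, since that would change forwarding behavior. Rules here are modeled as (match set, hit count) pairs under a linear-scan lookup cost, both assumptions for illustration.

```python
def reorder(rules):
    """Bubble each rule forward past less-popular, non-overlapping rules.
    rules: list of (match_set, hit_count) in original priority order."""
    result = []
    for match, hits in rules:
        pos = len(result)
        while pos > 0:
            prev_match, prev_hits = result[pos - 1]
            # stop at a more popular rule, or at any overlapping rule
            # (overlap means relative order is semantically significant)
            if prev_hits >= hits or prev_match & match:
                break
            pos -= 1
        result.insert(pos, (match, hits))
    return result

def avg_lookups(rules):
    """Mean number of rules scanned per packet under the toy cost model."""
    total = sum(h for _, h in rules)
    return sum((i + 1) * h for i, (_, h) in enumerate(rules)) / total
```

With a popular rule initially placed last, reordering moves it to the front and cuts the average lookup depth, while overlapping rules keep their original relative order so every packet still hits the same action.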