Packet Classification Based On Merge Fsm Computer Science Essay

Published: November 9, 2015 Words: 5576

The development of the Internet demands next-generation routers to support a variety o f network functionalities, such as firewall processing, quality of service (QoS) differentiation, virtual private networks, policy routing, traffic billing, and other value added services. In order to provide these ser-vices, the router needs to classify the packets into different categories based on a set of predefined rules, which specify the value ranges of the multiple fields in the packet header. Such a function is called multi-field packet classification. In traditional net-work applications, packet classification problems usually consider the fixed 5-tuple fields: 32-bit source/destination IP ad-dresses, 16-bit source/destination port numbers, and 8-bit trans-port layer protocol. Recently network virtualization emerges as an essential feature for next-generation enterprise, data center and cloud computing networks. This requires the underlying data plane be fl exible and provide clean interface for control plane. One such effort is the Open Flow switch which man-ages explicitly the network flows using a rule set with rich definition as the software-hardware interface. In Open Flow, up to 12-tuple header fields are considered. We call such Open-Flow-like packet classification the next-generation packet classification problems.

Due to the rapid growth of the Internet traffic, as well as the rule set size, multi-field packet classification has become one of the fundamental challenges to designing high speed routers. For example, the current link rate has been pushed beyond the OC-768 rate, i.e., 40 Gbps, which requires processing a packet every 8 ns in the worst case (where the packets are of minimum size, i.e., 40 bytes).Such throughput is impossible using existing software-based solutions[13]. Next-generation packet classification on more header fields poses an even bigger challenge. Most of the existing work in high-throughput packet classification is based on ternary content addressable memory (TCAM) [14]-[16]or a variety of hashing schemes such as Bloom Filters[17]-[19]. However, TCAMs are not scalable with respect to clock rate, power consumption, or circuit area, compared to SRAMs[20]. Most of TCAM-based solutions also suffer from range expansion when converting ranges into prefixes[15],[16]. Hashing-based solutions such as Bloom Filters have become popular due to their time performance and high memory efficiency[21]. However, hashing cannot provide deterministic performance due to potential collision and is inefficient in handling wildcard or prefix matching. A secondary module is always needed to resolve false positives inherent in Bloom Filters, which may be slow and can limit the overall performance.

As an alternative, our work focuses on optimizing and mapping state-of-the-art packet classification algorithms onto SRAM-based parallel architectures such as field-programmable gate array (FPGA). FPGA technology has become an attractive option for implementing real-time network processing engines due to its ability to recon figure and to offer abundant parallelism. State-of-the-art SRAM-based FPGA devices such as Xilinx Virtex-6 [25]and Altera Stratix-IV [26] provide high clock rate, low power dissipation and large amounts of on-chip dual-port memory with configurable word width. In this pa per we exploit these desirable features in current FPGAs for designing high-performance next-generation packet classification engines.

The contributions of this paper are as follows. To the best of our knowledge, this work is among the first discussions of accelerating next-generation packet classification using FPGA. After revisiting the traditional fixed 5-tuple packet classification solutions, we adopt decision-tree-based schemes, which are considered among the most scalable packet classification algorithms[22]. We identify memory explosion to be the major challenge for handling 12-tuple packet classification. To address this challenge, we propose a decision-tree-based multi-pipeline architecture. We develop a framework, called fuzzy logic decision forest, to partition a given set of 12-tuple rules into multiple subsets so that each subset uses a small number of header fields to build a decision tree of bounded depth. We tackle the problem of rule duplication when building the decision tree. Two optimization techniques, called rule overlap reduction and precise range cutting, are proposed to minimize the rule duplication. As a result, the memory requirement is almost linear with the number of rules. To map the tree onto the pipeline architecture, we intro-duce a fi ne-grained node-to-stage mapping scheme which allows imposing the bounds on the memory size as well as the number of nodes in each stage. As a result, the memory utilization of the architecture is maximized. The memory allocation scheme also enables using external SRAMs to handle even larger rule sets. We exploit the dual-port high-speed Block RAMs provided in state-of-the-art FPGAs to achieve a high throughput of two packets per clock cycle (PPC). On-the- fl y rule update without service interruption becomes feasible due to the memory-based linear architecture. All the routing paths are localized to avoid large routing delay so that a high clock frequency is achieved. When matching multiple fields simultaneously, it is difficult to achieve both high classification rate and modest storage in the worst case. Our soft decision tree based scheme, which can be considered among the most algorithms which has high throughput and efficiency.

2. BACKGROUND

2.1 Fuzzy logic decision tree:

A fuzzy decision tree is a tree structure where every edge is annotated with a condition, and every leaf is annotated with a fuzzy set over C. For conciseness, we consider only binary trees in this paper, where one of the conditions at the outgoing edges of a node is chosen from T and anthe condition at the other outgoing edge is the negation of the test condition. The restriction to binary trees is lifted. A further restriction is that each condition is used at most once in each path from root to leaf.

Consider a path from the root to a leaf of the tree. Intuitively, such a path can be interpreted as follows: whenever an example fulfills the conditions the edges of a path are annotated with, we expect the example to belong to the classes according to the fuzzy set the leaf is annotated along the edges labeled image and technical gives us a clue, that the interest of the user in non-technical images is classified to positive with a degree of truth 0.57 and negative with a degree of truth 0.43. One should observe that unlike in decision trees based on classical logic, the conditions on the outgoing edges of a node are not necessarily mutually exclusive: e.g. an example can be both technical and non-technical to a non-zero degree of truth. So we have to combine the clues given by all paths into one final result. There is a variety of methods to accomplish. In this paper, we compute the weighted average of the fuzzy sets the leafs are annotated with, where each fuzzy set is weighted by the degree the example belongs to the conjunction of the conditions on the edges of the path. After presenting this formally, we will discuss an example. One should observe that CLASSIFY yields a fuzzy degree of truth as value; if a crisp output value is desired one can apply any of the well-known defuzzification methods, e.g. take the class with the highest truth value.

Using decision tree method, we could deduce the high-level context. However, still one main problem exists. For example, if the temperature is 140C or 200C, using the method, both of them belong to "Low" category. However, the "Low degree" of 140C and 200C are same. Also 19 and 20obelong to different category based on the deduction result might be totally different. The difference between 190C and 200C is only 1.

The main reason is crisp cut points are used in classical decision trees. In fact, crisp cut model does not match human thing and is not reasonable. This makes decision trees sensitive to noise. To overcome this problem, we incorporate fuzzy theory in decision trees. Instead of crisp boundaries between categories, fuzzy logic introduces a membership function, which reflects how well a given value falls into a category.

We assume that training data are stored in an FDT S = (U,A), where A is partitioned into the set of condition attributes A0 = {f1, · · · , fm−1} and the decision attribute fm. In the terminology of the classification problem, any fuzzy subset of Vm is called a class label.

2.2 Field-programmable gate array:

In the context of 2-dimensional image processing, convolution is a standard operation often used for filtering (blurring, sharpening, compression, noise cleaning) or resampling (compositing, scaling, rotation, warping, texture mapping), etc. The function at each pixel is the weighted sum of neighboring pixels given by

V'(x,y) = i,j C(i,j) * V(x+i,y+j),

Where C represents the convolution filter kernel, usually a small 2-dimensional array of weights. If the filter size is m x m, then each output pixel depends on m2 adjacent pixels, and thus m2 multiplies and adds are required at each site. (Although many 2-D functions can be computed separable as the combination of two 1-D functions, we will concentrate here on the general 2-dimensional case).

An example pipeline implementing a 3x3 convolution filter for video applications. Pixels arrive in raster-scan order, and each is simultaneously multiplied by all 9 filter coefficients. Partial sums are then accumulated by the adder chain, with a delay of one pixel between adjacent adds and a delay of the image line length minus 3 pixels between rows to allow proper alignment. The partial sums may be accumulated at double precision to minimize loss. Because the sum is initialized to 1/2, the result is rounded automatically when it is truncated at the output.

2.4.Multiplier generation:

Using a macro building utility, we can assemble copies of various macros and primitive elements into a multiplier, and then compose several multipliers and adders into the complete filter pipeline. Conventions as to expansion and mapping of signal names (e.g. X.i becomes X0, X1, ..., etc.) allow appropriate connections to be made automatically by the router. One stage of the serial-parallel carry-save multiplier with one bit of coefficient register implemented in 9 cells of an FPGA is shown. In this particular multiplier design, one operand (X) is supplied in parallel to several stages, while the other operand (Y) arrives in bit-serial fashion to all stages simultaneously. Stages are stacked vertically to the arithmetic length desired, with the high-order bit at the top and the low-order bit at the bottom, and intermediate results propagating downward via path P.

Result bits emerge serially from the bottom of the multiplier, lsb first. Not shown are common clock and reset lines. In order to assemble a complete multiplier from this macro cell, we need only stack the appropriate number of these stages, and perform a simple mapping of signal names to make the proper connections. Each P.out signal connects to P.in of the stage below. Each stage receives one bit of the parallel operand, so input X.i of the i'th stage is connected to the coefficient bus signal Xi. The serial operand Y is bused vertically to all stages in the multiplier, as is the common signal LdC.j, which enables loading of the entire coefficient register.

3.RELATED WORK

3.1.Soft decision tree:

The various heuristic methods for construction of SDT can roughly be divided into four categories: Bottom-up approaches, Top-Down approaches the hybrid approach and tree Growing - pruning approaches. A decision tree using bottom-up approach was constructed and studied. Using max-min soft decision measures, pair wise distances between a priori defined classes are computed and at each step the two classes with the node decision are merged to form a new group, and this process is repeated until one is left with one group at the root which will be the optimized epilepsy risk level patterns. From a processing point of view, these types of trees are highly recommended. The generic representation of SDT optimization is explained, let W= [Pij] be the co -occurrence matrix with (i,j) elements which represents fuzzy based epilepsy risk level patterns of single epoch. There are 48 (16x3) epochs are available. Three models of SDT such as (16-8-4-2-1), (16-4-2-1), and (16-2-1) were selected for optimization of fuzzy patterns. A decision strategy of Method -I (Max-min) or Method -II (Min-max) were applied at each nodes in the above three SDT models. Therefore six types of SDT models were obtained. In the case of (16-8-4-2-1) model an epoch of (16x1) elements were considered as the leaf nodes of the tree. The next level of tree was named as B with eight decision nodes, which was followed by C level with four soft decision nodes. Further level was designated as D level with two nodes and the final level was the E level with single node which was the root of the tree.

Each SDT model is trained and tested by means of MSE Function. Since our model is patient specific in nature, we are applying 48 (3x16) patterns for each SDT model. As the number of patterns in each database for training is limited, each model is trained with one set of patterns (16) for Minimum Mean Square Error (MSE) condition and tested with other two sets of patterns (2x16). After network is trained using these, the classification performance of test set is recorded.

3.2.Soft cuts:

We have presented so far decision tree methods working with cuts treated as sharp classifiers such, that real values are partitioned by them into disjoint intervals. One can observe that in some situations objects which are close one to other can be treated as very different. In this section we introduce some notions of soft cuts which discern two given values if those values are far enough from the cut.

The test functions defined by cuts can be here we propose two strategies being modifications of that method by using described above soft cuts (fuzzy separated cuts). They are called fuzzy decision tree and rough decision tree. In fuzzy decision tree method instead of checking the condition a (u)> c we have to check how strong is hypothesis that u is on the left or right side of the cut (a; c). This condition can be expressed by µL(u) and µR(u) where µL and µR are membership functions of left and right intervals (respectively). The values of those membership functions can be treated as a probability distribution of u in the node labeled by soft cut(a,c-∑,c+∑). Then one can compute the probability of the event that object u is reaching a leaf. The decision for u is equal to decision labeling the leaf with largest probability. In the case of rough decision tree, when we are not able to decide to turn left or right (the value a (u) is too close to c) we do not distribute the probability to children of considered node. We have to compare their answers taking into account the numbers of supported by them objects.

3.3.Packet classification:

Next-generation packet classification can be viewed as a natural extension from traditional 5-tuple packet classification whose solutions have been extensively studied in the past decade. Comprehensive surveys can be found. Most of those algorithms fall into two categories: decomposition-based and decision-tree-based approaches.

Decomposition-based algorithms (e.g., parallel bit vector), perform independent search on each field and finally combine the search results from all fields. Such algorithms are desirable for hardware implementation due to their parallel search on multiple fields. However, substantial storage is usually needed to merge the independent search results to obtain the final result. Decomposition-based algorithms have poor scalability, and work well only for small-scale rule sets.

Decision-tree-based algorithms, take the geometric view of the packet classification problem. Each rule defines a hypercube in a d-dimensional space where d is the number of header fields considered for packet classification. Each packet defines a point in this d-dimensional space. The decision tree construction algorithm employs several heuristics to cut the space recursively into smaller subspaces. Each sub-space ends up with fewer rules, which finally allows a low-cost linear search to find the best matching rule. After the decision tree is built, the algorithm to classify a packet is simple. Based on the value of the packet header, the algorithm follows the cutting sequence to locate the target subspace (i.e., a leaf node in the decision tree), and then performs a linear search on the rules in this subspace. Decision-tree-based algorithms allow incremental rule updates and scale better than decomposition-based algorithms. The outstanding representatives of decision-tree-based packet classification algorithms are Hi Cuts and its enhanced version Hyper Cuts. At each node of the decision tree, the search space is cut based on the information from one or more fields in the rule. Hi Cuts builds a decision tree using local optimization decisions at each node to choose the next dimension to cut, and how many cuts to make in the chosen dimension. The Hyper Cuts algorithm, on the other hand, allows cutting on multiple fields per step, resulting in a fatter and shorter decision tree.

4.PROBLEM OF PACKET CLASSIFICATION

Packet classification is performed using a packet classifier, also called a policy database, flow classifier, or simply a classifier. A classifier is a collection of rules or policies. Each rule specifies a class that a packet may belong to based on some criterion on F fields of the packet header, and associates with each class an identifier, class ID. This identifier uniquely specifies the action associated with the rule. Each rule has F components. The ith component of rule R, referred to as R[i], is a regular expression on the ith field of the packet header (in practice, the regular expression is limited by syntax to a simple address/mask or operator/number(s) specification). A packet P is said to match a particular rule R, if , the ith field of the header of P satisfies the regular expression R[i ]. It is convenient to think of a rule R as the set of all packet headers which could match R . When viewed in this way, two distinct rules are said to be either partially overlapping or non-overlapping, or that one is a subset of the other, with corresponding set-related definitions. We will assume throughout this paper that when two rules are not mutually exclusive, the order in which they appear in the classifier will determine their relative priority. For example, in a packet classifier that performs longest-prefix address lookups, each destination prefix is a rule, the corresponding next hop is its action, the pointer to the next-hop is the associated class ID, and the classifier is the whole forwarding table. If we assume that the forwarding table has longer prefixes appearing before shorter ones, the lookup is an example of the packet classification problem.

5.ARCHITECTURE

5.1. Pipeline Architecture:

To achieve line-rate throughput, we map the decision forest including P trees onto a parallel multi-pipeline architecture with P linear pipelines. Each pipeline is used for traversing a decision tree as well as matching the rule lists attached to the leaf nodes of that tree. The pipeline stages for tree traversal are called the tree stages while those for rule list matching are called the rule stages. Each tree stage includes a memory block storing the tree nodes and the cutting logic which generates the memory access address based on the input packet header values. At the end of tree traversal, the index of the corresponding leaf node is retrieved to access the rule stages. Since a leaf node contains a list of list Size rules, we need

List Size rule stages for matching these rules. All the leaf nodes of a tree have their rule lists mapped onto these list Size rule stages. Each rule stage includes a memory block storing the full content of rules and the matching logic which performs parallel matching on all header fields.

Each incoming packet goes through all the P pipelines in parallel. A different subset of header fields of the packet may be used to traverse the trees in different pipelines. Each pipeline outputs the rule ID or its corresponding action. The priority re-solver picks the result with the highest priority among the outputs from the P pipelines. It takes H+listSize clock cycles for each packet to go through the architecture, where H denotes the number of tree stages.

Like the Hyper Cuts with the push common rule upwards heuristic enabled, our algorithm may reduce the memory consumption at the cost of increased search time, if the process to match the rules in the internal rule list of each tree node is placed in the same critical path of decision tree traversal. Any packet traversing the decision tree must perform: 1) matching the rules in the internal rule list of the current node and 2) branching to the child nodes, in sequence. The number of memory accesses along the critical path can be very large in the worst cases. Although the throughput can be boosted by using a deep pipeline, the large delay passing the packet classification engine requires the router to use a large buffer to store the payload of all packets being classified. Moreover, since the search in the rule list and the traversal in the decision tree have different structures; a heterogeneous pipeline is needed, which complicates the hardware design.

FPGAs provide massive parallelism and high-speed dual-port. Block RAMs distributed across the device. We exploit these features and propose a highly parallel architecture with localized routing paths. The design is based on the following considerations.

1) The traversal of the decision tree can be pipelined. Thus we have a pipeline for traversing the decision tree, shown as light-color blocks. We call this pipeline Tree Pipeline.

2) Whether it is an internal or a leaf node, each tree node is attached to a list of rules. Analogous to internal rule list, the rule list attached to a leaf node is called a leaf-level rule list. Search in the rule lists is pipelined as well. We call such a pipeline Rule Pipeline.

3) When a packet reaches an internal tree node, the search in the internal rule list can be initiated at the same time the branching decision is made, by placing the rule list in a separate pipeline.

4) For the tree nodes mapped onto the same stage of the Tree Pipeline, their rule lists are mapped onto the same Rule Pipeline. Thus, there will be H+1 Rule Pipelines if the Tree Pipeline has H stages.

5) All Rule Pipelines have the same number of stages. The total number of clock cycles for a packet to pass the architecture is H+listSize, where list Size is the number of stage in a Rule Pipeline.

6) Consider two neighboring stages of Tree Pipeline, denoted A and B, where Stage B follows Stage A. The Rule Pipeline attached to Stage A outputs the matching results one clock cycle earlier than the Rule Pipeline attached to Stage B.

Instead of waiting for all matching results from all Rule Pipelines and directing them to a single priority resolver, we exploit the one clock cycle gap between two neigh-boring Tree Pipeline stages, to perform the partial priority resolving for the two previous matching results.

7) The Block RAMs in FPGAs are dual-port in nature. Both Tree Pipeline and Rule Pipelines can exploit this feature to process two packets per clock cycle. In other words, by duplicating the pipeline structure (i.e., logic), the throughput is doubled, while the memories are shared by the dual pipelines.

5.2.Tree-to-Pipeline Mapping:

The size of the memory in each pipeline stage must be determined before FPGA implementation[29]. However, when simply mapping each level of the decision tree onto a separate stage, the memory distribution across stages can vary widely. Allocating memory with the maximum size for each stage results in large memory wast age. Baboescu et al. propose a Ring pipeline architecture which employs TCAMs to achieve balanced memory distribution at the cost of halving the throughput to one packet per two clock cycles, i.e., 0.5 PPC, due to its non-linear structure.

Our task is to map the decision tree onto a linear pipeline (i.e., Tree Pipeline in our architecture) to achieve balanced memory distribution over stages, while sustaining a throughput of one packet per clock cycle (which can be further improved to 2 PPC by employing dual-port RAMs). The memory distribution across stages should be balanced not only for the Tree Pipeline, but also for all the Rule Pipelines. Note that the number of words in each stage of a Rule Pipelines depends on the number of tree nodes rather than the number of words in the corresponding stage of Tree Pipeline. The challenge comes from the various number of words needed for tree nodes. As a result, the tree- to-pipeline mapping scheme requires not only balanced memory distribution, but also balanced node distribution across stages. Moreover, to maximize the memory utilization in each stage, the sum of the number of words of all nodes in a stage should approach some power of 2.

Otherwise, for example, we need to allocate 2048 words for a stage consuming only 1025 words.

The above problem is a variant of bin packing problems, and can be proved to be NP-complete. We use a heuristic similar to our previous study of trie-based IP lookup, which al-lows the nodes on the same level of the tree to be mapped onto different stages. This provides more flexibility to map the tree nodes, and helps achieve a balanced memory and node distribution across the stages in a pipeline. Only one constraint must be followed.

Constraint 1: If node A is an ancestor of node B in the tree, then A must be mapped to a stage preceding the stage to which B is mapped.

We impose two bounds, namely Bm and Bn for the memory and node distribution, respectively. The values of the bounds are some power of 2. The criterion to set the bounds is to minimize the number of pipeline stages while achieving balanced distribution over stages. The complete tree-to-pipeline mapping algorithm, where n denotes a tree node, H the number of stages, the set of remaining nodes to be mapped onto stages, the number of words of the i th stage, and Ni the number of nodes mapped onto the i th stage. We manage two lists, Ready List and Next Ready List. The former stores the nodes that is available for filling the current stage, while the latter stores the nodes for filling the next stage. We start with mapping the nodes that are children of the root onto Stage 1. When filling a stage, the nodes in Ready List are popped out and mapped onto the stage, in the decreasing order of their heights. After a node is assigned to a stage, its children are pushed into Next Ready List. When a stage is full or Ready List becomes empty, we move on to the next stage. At that time, NextReadyList is merged into ReadyList. By these means, Constraint 1ismet. The complexity of this mapping algorithm is O(N), where N denotes the total number of tree nodes.

Our tree-to-pipeline mapping algorithm allows two nodes on the same tree level to be mapped to different stages. We implement this feature by using a simple method. Each node stored in the local memory of a pipeline stage has one extra field: the distance to the pipeline stage where the child node is stored. When a packet is passed through the pipeline, the distance value is decremented by 1 when it goes through a stage. When the distance value becomes 0, the child node's address is used to access the memory in that stage.

External SRAMs are usually needed to handle very large rule sets, while the number of external SRAMs is constrained by the number of I/O pins in our architecture. By assigning large values of Bm and Bn for one or two specific stages, our mapping algorithm can be extended to allocate a large number of tree nodes onto few external SRAMs which consume controllable number of I/O pins.

6.PERFORMANCE ANALYSIS

6.1.Experimental Setup:

For traditional 5-tuple packet classification, we used the rule sets. These synthetic rule sets are generated using Class Bench, with parameter fi les extracted from real-life rules. The size of the rule sets varies from hundreds of to tens of thousands of rules. Due to the lack of large-scale real-life OpenFlow rule sets, we generated synthetic 12-tuple OpenFlow-like rules to examine the effectiveness of our decision forest -based schemes for next generation packet classification. Each rule was composed of 12 header fields that follow the current OpenFlow specification.

We used 6-bit field for the ingress port and randomly set each field value. Concretely, we generated each rule as follows.

1) Each field is randomly set as a wildcard. When the field is not set as a wildcard, the following steps are executed.

2) For source/destination IP address fields, the prefix length is set randomly from between 1 and 32, and then the value is set randomly from its possible values.

3) For other fields, the value is set randomly from its possible values.

In this way, we generated four OpenFlow-like 12-tuple rule sets with 100, 200, 500, and 1K rules, each of which is independent of the others. Note that our generated rule sets include many impractical rules because each field value is set at random.

6.2.Algorithm Evaluation:

1) 12-Tuple Rule Sets: To evaluate the performance of the algorithms, we use following performance metrics:

• Average memory requirement (bytes) per rule. It is computed as the total memory requirement of a decision forest divided by the total number of rules for building the forest. It represents the scalability of our algorithms.

• Tree depth .It is defined as the maximum directed distance from the tree root to a leaf node. For a decision forest including multiple trees, we consider the maximum tree depth among these trees. A smaller tree depth leads to shorter pipelines and thus lower latency.

• Number of cutting fields (denoted Ncf) for building a decision tree. The Ncf of a decision forest is defined as the maximum Ncf among the trees in the forest. Using a smaller number of cutting fields results in less hardware for implementing cutting logic and smaller memory for storing cutting formation of each node.We

listSize = 64 , depthBound = 16 , and varied the number of trees P=1,2,3,4.

In the case of P=1, we observed memory explosion when the number of rules was increased from 100 to 1K. On the other hand, increasing P dramatically reduced the memory consumption, especially for the larger rule set. Almost 100-fold reduction in memory consumption was achieved for the 1K rules,

when P was increased just from 1 to 2. With P=3 or 4, the average memory requirement per rule remained on the same order of magnitude for different size of rule sets.

The tree depth and the number of cutting fields were also reduced by increasing P. With P=3 or

4, 6-fold and 3-fold reductions were achieved, respectively, in the tree depth and the number of cutting fields, compared with using a single decision tree.

2)5-Tuple Rule Sets: We set P=1and evaluated the effectiveness of our optimized decision-tree-based packet classification algorithm, by conducting experiments on 4 real-life 5-tuple rule sets of different sizes. In these experiments, we set listSize =8,which was optimal according to a series of tests where we found that a larger listSize resulted in lower memory requirement but deeper Rule Pipelines. The memory reduction became unremarkable when listSize >8 . According our algorithm kept the memory requirement to be linear with the number of rules, and thus achieved much better scalability than the original Hyper Cuts algorithm. Also, the depth of the decision tree generated using our algorithm was much smaller than that of Hyper Cuts, indicating a smaller delay for a packet to pass through the engine. Note that the tree depth is determined by not only the size of the rule set but also the characteristics of the rule set. All the decision tree-based algorithms share this problem that the tree depth is in deterministic.

The Tree Pipeline needed at least 9 stages to map the decision tree for ACL_10K. We conducted a series of experiments to find the optimal values for the memory and the node distribution bounds, i.e., Bm and Bn.When Bm ≤ 512 or Bn ≤ 128 , the Tree Pipeline needed more than 20 stages. When Bm ≥ 2048 or Bn ≥ 512, the memory and the node distribution over the stages was same as that using static mapping scheme which mapped each tree level onto a stage. Only when Bm = 1024,Bn = 256 , both memory and node distribution were balanced, while the number of stages needed was increased slightly to 11. Our mapping scheme outperformed the static mapping scheme with respect to both memory and node distribution.

7. CONCLUSION

This paper presented a novel fuzzy decision-tree-based linear multi-pipeline architecture on FPGAs for wire-speedmulti-field packet classification.We considered the next-generation packet classification problems where more than 5-tuple packet header fields would be classified. Several optimization techniques were proposed to reduce the memory requirement of the state-of-the-art decision-tree-based packet classification algorithm,so that 10K 5-tuple rules or 1K 12-tuple rules could fit in the on-chip memory of a single FPGA. Our architecture provided on-the-fly reconfiguration due to the linear memory-based architecture. Extensive simulation and FPGA implementation results demonstrate the effectiveness of our solution. The FPGA design supports 10K 5-tuple rules or 1K OpenFlow-like complex rules and sustains over 40 Gbps throughput for minimum size (40 bytes) packets. Our future work includes porting our design into real systems and evaluating its performance under real-life scenarios such as dynamic

rule updates.