Deep packet inspection is a detection technology regarding packet content and plays an increasingly important role in modern network intrusion detection systems (NIDSes) [1]. Due to its better expressiveness and flexibility, regular expression matching has been widely used in NIDSes. However, regular expression matching engenders much convenience to NIDSes, but also simultaneously brings significantly high computation and storage complexity, thereby prohibiting its wide usage in NIDSes applications. Thereby, how to design a regular expression matching algorithm that can achieve both time and space efficiency is of great challenge.
Deterministic finite automaton (DFA) and nondeterministic nondeterministic finite automaton (NFA), as two main finite automata, are typically used to implement regular expression matching. Although DFA has fast matching speed and deterministic matching performance, it causes a memory explosion problem. While NFA requires less memory, it suffers from slow matching speed and nondeterministic matching performance [2]. In a high-speed or resource-restricted environment, because both DFA and NFA automata require excessive memory, time or per-flow state, neither of them is suitable to implement a regular expression matching operation in a high-speed or resource-restricted setting. Recently, many research works have focused on either improving the matching speed or reducing the memory access spent in a regular expression matching.
In accelerating the matching speed of a regular expres-sion aspect, Kumar et al. [3] proposed a D2FA regular expression matching algorithm in order to accelerate the matching speed. Becchi and Cadambi [4] suggested a DFA regular expression matching algorithm based on state fusion, making use of a migration edge marking method in order to integrate a multiple non-equivalent state for guaranteeing DFA performance. In Brodie method, Brodie et al. [5] presented a new way that can increase the throughput of regular expression matching by expanding the alphabet set; however, this method could result in an exponential growth of memory requirement in the worst case. Moreover, although sampling techniques are introduced in order to accelerate regular expression matching [6], it does not suit all kinds of regular expression. Kumar et al. [7] applied default transitions in order to improve average performance. In eliminating the storage space explosion of the DFA aspect, Yu et al. [8] proposed state compression techniques based on mDFA in order to deflate state explosion. Kumar et al. [9] used heuristics in order to eliminate the insomnia problem existing in the regular expression matching algorithm. In the method of Wang et al. [10], a chain-based DFA deflation method for fast and scalable regular expression matching by using ternary content addressable memory (TCAM) is proposed to effectively deflate the DFA size. In order to prevent state explosion, Smith et al. [11] introduced a partial NFA to DFA conversion. In [12], a new art work extended finite automaton (XFA) based on auxiliary memory was introduced in order to reduce the DFA state explosion as well as to achieve a great reduction rate. However, XFA is not suitable for real-time applications in networks due to its significant startup overhead, high memory access and big storage space in regular expression matching.
This paper aims to solve the problem of high memory access and big storage space and long matching time during the regular expression matching process of XFA by proposing a new regular expression matching algorithm based on high-efficient finite automaton (HFA). We use some extra judging instruments at the starting state in order to reduce any unnecessary transition paths and eliminate any unnecessary state transitions. Consequently, the high memory access consumption and big storage space consumption and long matching time in the regular matching process of XFA can be efficiently improved. The simulation results convey that our proposed scheme can lower 40% memory access, save 45% storage space consumption, and reduce 12% matching time in the same regular expression matching process compared with XFA, but without degrading the matching quality.
The rest of this paper is organized as follows. In Section Ⅱ, the state space explosion of DFA is introduced. Then the proposed HFA method is presented in Section Ⅲ. The simulation result and analysis of the proposed scheme compared with XFA methods is presented in Section Ⅳ. In Section Ⅴ, we provide some conclusions.
State-space explosion is a main problem existing in combined DFA, which seriously influences its wide application of DFA. In this section, we mainly illustrate some related knowledge with regard to the DFA, including DFA definition, the reasons for state-space explosion of DFA, a general method to solve state-space explosion problem—XFA, and some problems existing in XFA.
DFA can be described as
In a DFA, a path is ambiguous when there is an ambiguous state in this path in which a state is ambiguous, given any finite sequence and state. Although the set of paths to this state exist in the same path suffix, not all can be expressed solely by the finite sequence. That is to say, an ambiguous state may usually be obtained by many distinct sequences. During the combining process of automata, the interaction between states in both ambiguous and unambiguous paths above will lead to a statespace explosion of DFA. The specific generating process is that the unambiguous states existing in the prefix of a path in one automaton can be replicated when it is combined with ambiguous states in a path in another automaton during the process of combination. The combined automata need to track the progress in matching both the unambiguous path and the ambiguous path independently [13], which will result in a state-space explosion of DFA. The amount of replication observed depends on the following two aspects: how extreme and pervasive the ambiguity is in the two important automata and how much interaction occurs between them. That is to say, automaton with limited levels of ambiguity can introduce comparatively small amounts of replication, whereas a path of infinite length can cause an entire automaton to be copied and therefore directly lead to exponential replication. Fig. 3 is the combined DFA for the regular expression of (.*
In order to solve the problem of state-space explosion above, an XFA is proposed for this problem [12]. The XFA uses some auxiliary variables to record some matched paths, which can save lots of state-space as well as improve the matching speed. Fig. 4 is the XFA for the combined regular expression of (.*
From Fig. 4, we can clearly see that XFA uses an auxiliary variable (Bit 1) in order to record the matched path (ab) in the regular expression of (.*
Although XFA can eliminate a DFA state explosion by adding auxiliary variables to record some completed matching results, at the same time, it also adds a great amount of redundancy migration side, which not only increases the number of storage space, but also leads to many unnecessary state transitions. Moreover, it adds memory accesses and matching time in the regular expression matching of XFA, therefore limiting its practical application. Fig. 5 is the XFA for the regular expression of (.*
From Figs. 5 and Figs. 6, we can see clearly that XFA has lots of redundancy migration sides, such as
In order to solve the problems existing in the regular expression matching of XFA above, we propose a new regular expression matching algorithm in this paper, which is based on HFA. The basic idea of HFA is that, we add some extra judging instruments at the starting state in order to reduce the storage space of the redundancy migration side as well as to eliminate many unnecessary state transitions. The specific realizing process for HFA is that it makes full use of the current state, judging instruments, auxiliary variable value and the input character for determining which next state to shift and which auxiliary variable to update during the regular expression matching process of HFA. As a result, the problems of high memory access and big storage space and long matching time existing in the regular expression matching process of XFA can be efficiently improved, which greatly improves the overall matching performance of XFA.
The working process of our proposed HFA method is comprised of some parts. Fig. 7 is the working flow composition of the HFA algorithm.
From Fig. 7, we can see that our proposed HFA algorithm is primarily made up of four parts: initialization, HFA_building, HFA_matching, and read result. The function of initialization mainly makes some necessary preparations before the program begins to run. HFA_building primarily builds up a HFA from an XFA. It mainly includes the addition of some extra judging instruments at the starting state and also reduces redundancy migration side operations. Fig. 8 is the part pseudocode of procedure HFA_building( ).
Read result operation mainly reads the matching result from the HFA. The function of HFA_matching mainly takes the matching operations according to the regular expression and input data stream. The main realization process of HFA_matching includes the following two operations.
Operation 1: Select which branching path to execute according to the current staring state, judging instruments with auxiliary variable value and input character. If some paths are met, jump Operation 1 to go on; otherwise, jump Operation 2 to go on.
Operation 2: Determine the matching path according to the current state and the input character. When the matching process reaches our predetermined path, the algorithm will set an auxiliary variable value as 1 to record it and take the corresponding process operate; then, jump Operation 2 to go on; otherwise, jump Operation 1 to go on.
Fig. 9 is the part pseudocode of procedure HFA_matching( ).
Take the regular expression (.*
Step 1: Build HFA from XFA according to the regular expression of (.*
Step 2: Set the initial value of auxiliary variable (Bit1) as 0 at the starting state;
Step 3: Judge whether to execute the P Q path according to the current staring state (P), judging instruments for auxiliary variable value (if(!Bit 1)) and input character (a);
Step 4: Decide which next state to transit through the current state (𝐐) and the input character (a or b). If the input character is b, transit to R state; if a, transit to 𝐐 state; if neither a or b, transit to P state. When the matching process reaches the R N path, the algorithm will set the auxiliary variable (Bit = 1) in order to record it. When the matching process finishes given the regular expression, our algorithm executes the corresponding matching operation and updates the corresponding auxiliary variables.
Step 5: Go on to the matching operation above until the end of the matching work;
Step 6: Read the matching results from the matching process of HFA.
Fig. 10 is the HFA matching process for the regular expression of (.*_{abc}.*_{def}). Fig. 11 is the HFA matching process for the regular expression of (.*
From Figs. 10 and Figs. 11, we can clearly observe that our proposed HFA method includes 7 states and only 19 migration edges in the regular expression of (.*
In order to verify the effectiveness of our proposed method above, we conduct some experiments. Our designed experiments mainly include four parts: simulation environment, lower memory access, save storage space, and reduce matching time.
The simulation environment is conducted on Microsoft Windows 7 operating systems, AMD-A63420M 4-core CPU processor, 2 G memory, 500 G hard disk, and VC++ 6.0 software tool. The testing regular rule set used in our experiment comes from Snort, whose snapshot was released on January 1, 2009. We extract all FTP rule sets and part HTTP rule sets from Snort as our testing object and grab the FTP data packet and HTTP data packet to match the data. Snort2Bro tool is used to change the Snort rule format into a Bro rule format as well as produce 82 FTP rules and 443 HTTP rules.
In our experiment, we compare our proposed HFA algorithm with the DFA algorithm and XFA algorithm, and evaluate three general important indicators: memory access, storage space, and matching time. As to the DFA algorithm, we produce a single DFA by making use of each regular expression. The combined DFA is obtained by merging the DFA algorithm. For the XFA algorithm, we generate a single XFA by adding some configuration information in the rules, such as notes, variable mapping, etc., and then use the merging XFA algorithm to produce a combined XFA. Our proposed HFA algorithm is produced based on XFA, but with the addition of some extra judging instruments at the starting state in order to reduce any unnecessary transition paths and eliminate any unnecessary state transitions. The detailed realization on the proposed scheme is shown in the proposed scheme part of this paper above. Table 1 is the main tool and parameters of our experiment, including the rule set name, the testing number of the rule set, compared indicator, compared methods and so on.
[Table 1.] Experimental tools and parameters
Experimental tools and parameters
In order to reduce data randomness, we obtain our experimental data based on the average value of 5 testings.
In our HFA method, reducing the number of unnecessary state transitions by adding some extra judging instruments at the starting state can save a great amount of memory access in the same matching process compared with XFA. Fig. 12 is the memory access consumption comparison result for three different matching methods.
From Fig. 12, we can observe that our proposed algorithm has a good memory access saving performance in three different matching algorithms above, reaching approximately 65% in memory access saving compared with that of DFA and reducing about 40% in memory access saving compared with that of XFA. The main reason for it is that our proposed method uses some judging instruments at the starting state in order to narrow and determine the next execution area. Hence, many unnecessary state transitions can be eliminated, which can save a great amount of memory access.
In this subsection, we evaluate the storage space performance of our proposed HFA scheme compared with the DFA and XFA methods from four indicators: state number, migration edge number, instruction number, and the total storage space. Table 2 is the storage space consumptions conditions for three methods in two different rule sets.
From Table 2, we can observe that our proposed HFA algorithm has superior results in three matching algorithms above in two different rule sets. DFA takes up the biggest storage space consumption, reaching about 4052 kB and 526324 kB in two different rule sets, respectively. XFA then followed, getting to approximately 402 kB and 6794 kB. However, our proposed scheme only takes up 223 kB and 3685 kB, which saves 93% in storage space compared with that of DFA and 45% in storage space compared with that of XFA. The main reason for this effect lies in the use of judging instruments at the starting state in our proposed method, which can reduce a great amount of storage spaces of unnecessary state transitions and migration edges. Note that the Null in Table 2 implies no instruction number.
[Table 2.] Storage space comparison in matching for three methods
Storage space comparison in matching for three methods
In this subsection, we evaluate the average matching time for three matching algorithms above in the FTP and HTTP rule sets. The specific experimental results are shown in Fig. 13.
From Fig. 13, we can see clearly that, compared to the XFA method, our proposed HFA algorithm shows a big improvement in matching time and portrays about 12% matching time saving. The main reason for this effect lies in the fact that our proposed method reduces redundancy migration edges and eliminates unnecessary state transitions by adding the judging instruments at the starting state, which can reduce many unnecessary matching processes. From Fig. 13, we can also observe that our proposedmatching state in each regular expression matching process of DFA. HFA is still lower in matching time compared to that of DFA. This result is due to the deterministic matching state in each regular expression matching process of DFA.
In this paper, a high-efficient regular expression matching algorithm based on HFA is proposed to improve the performance in the regular expression matching of XFA. In our scheme, we exactly narrow down the execute area size for the next step as well as eliminate any unnecessary state transitions by adding some judging instruments at the starting state, which can reduce a great amount of unnecessary memory access, storage space, and matching time in the regular expression matching of XFA. The simulation results convey that our proposed HFA scheme can slower 40% memory access, save about 45% storage space, and reduce about 12% matching time in the regular expression matching process compared with XFA, but without degrading the matching quality.