New Theoretical and Computational Results For Regular Languages

Small NFAs from Regular Expressions: Some Experimental Results

2010

Abstract: Regular expressions (REs), because of their succinctness and clear syntax, are the common choice to represent regular languages. However, efficient pattern matching or word recognition depends on the size of the equivalent nondeterministic finite automaton (NFA).

From regular expressions to smaller NFAs

Theoretical Computer Science, 2011

Several methods have been developed to construct λ-free automata that represent a regular expression. Among the most widely known are the position automaton (Glushkov), the partial derivatives automaton (Antimirov) and the follow automaton (Ilie and Yu). All these automata can be obtained in quadratic time, so the usual comparison criterion is the size of the resulting automaton. The methods that obtain the smallest automata are the follow and the partial derivatives methods, although for general expressions the two are not comparable. In this paper, we propose another method to obtain a λ-free automaton from a regular expression. The number of states of the automata we obtain is bounded above by the size of both the partial derivatives automaton and the follow automaton. Our algorithm also runs with the same time complexity as these methods.

The complexity of restricted regular expressions and the synthesis problem for finite automata

Journal of Computer and System Sciences, 1981

It is known that for every restricted regular expression of length n there exists a nondeterministic finite automaton with n + 1 states, giving rise to the upper bound of 2^n + 1 on the number of states of the corresponding reduced automaton. In this note we show that this bound can be attained for all n ≥ 2, i.e., the upper bound 2^n + 1 is optimal. An observation is then made about the synthesis problem for nondeterministic finite automata.
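The exponential gap behind this kind of bound can be observed with the textbook subset construction. The sketch below is not the witness family from the note: it assumes the classic language "the n-th symbol from the end is a", whose NFA has n + 1 states, and counts the deterministic states reachable from the start. The function names are illustrative.

```python
def subset_construction(nfa_delta, start, alphabet):
    """Return every DFA state (a frozenset of NFA states) reachable from {start}."""
    initial = frozenset([start])
    seen = {initial}
    worklist = [initial]
    while worklist:
        current = worklist.pop()
        for a in alphabet:
            # Union of the NFA moves of all member states on symbol a.
            nxt = frozenset(t for q in current for t in nfa_delta.get((q, a), ()))
            if nxt not in seen:
                seen.add(nxt)
                worklist.append(nxt)
    return seen


def nth_from_end_nfa(n):
    """Transitions of an (n + 1)-state NFA for (a+b)* a (a+b)^(n-1).

    States are 0..n; state 0 is the start and state n is accepting.
    """
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, n):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta


if __name__ == "__main__":
    for n in range(1, 9):
        dfa_states = subset_construction(nth_from_end_nfa(n), 0, "ab")
        print(n + 1, "NFA states ->", len(dfa_states), "reachable DFA states")
```

For this family the reachable subsets are exactly those containing state 0, so the count doubles with each extra NFA state.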

Efficiency of Reducing the Size of States of Finite State Aleshin Type Automata

In this paper we implement regular expression matching using finite state Aleshin type automata, including non-deterministic finite state Aleshin type automata (NAAs) and deterministic finite state Aleshin type automata (DAAs). The storage space of an automaton is jointly determined by the number of states and the transitions between states. A key issue is that the automaton obtained from a regular expression is large, where size is defined as the number of states and transition arcs between states. The size of an automaton is crucial for the efficiency of the algorithms that use it: in pattern matching based on regular expressions, size directly affects both time and space efficiency. NAAs and DAAs have their own advantages and disadvantages in regular expression matching.
Keywords: deterministic finite state Aleshin type automata, non-deterministic finite state Aleshin type automata (NAAs), partial derivative automata
I. INTRODUCTION
NFAs can provide an exponentially more succinct description than DFAs, but equivalence, inclusion, and universality are computationally hard for NFAs, while many of these problems can be solved in polynomial time for DFAs. The processing complexity for each character in the input is O(1) in a DFA, but O(n^2) for an NFA if all n states are active at the same time. The key feature of a DFA is that there is only one active state at any time, but converting an NFA into a DFA may generate O(Σ^n) states. The size of a DFA obtained from a regular expression can increase exponentially; the DFA of a regular expression with thousands of patterns yields tens of thousands of states, which means memory consumption of thousands of megabytes. Another problem is that a minimal NFA is hard to compute [6]. Combining the matching efficiency of a DFA with the storage efficiency of an NFA is a long-pursued goal in the field of regular expression matching. The regular expression is an important notation for specifying patterns.
Owing to its expressive power and flexibility in describing useful patterns [1], regular expression matching technology based on finite automata is widely used in networks and information processing, including applications for network real-time processing, protocol analysis, intrusion detection systems, intrusion prevention systems, deep packet inspection systems, and virus detection systems like Snort [2], Linux L7-filter [3], and Bro [4]. Regular expressions are replacing explicit string patterns as the method of choice for describing patterns. However, with the increasing scale and number of regular expressions in a practical system, it is challenging to achieve good performance for pattern matching based on regular expressions. For example, the number of signatures in Snort has grown from 3166 in 2003 to 15,047 in 2009, and the pattern matching routines in Snort account for up to 70% of the total execution time with 80% of the instructions executed on real traces [5].
II. REGULAR EXPRESSION
The term alphabet denotes any finite set of symbols. A string over an alphabet is a finite sequence of symbols drawn from that alphabet, with the term word often used as a synonym for the term string. Let Σ be an alphabet and Σ* be the set of all words over Σ, i.e., Σ* denotes the set of all finite strings of symbols in Σ. If Σ is an alphabet, then any subset of Σ* is a language over Σ. The length of a word w ∈ Σ*, usually written as |w|, is the number of occurrences of symbols in w, with ε denoting the empty word whose length is 0. ∅ is the empty set, a ∈ Σ is an input symbol, and r and s are regular expressions. A regular expression describes a set of strings without enumerating them explicitly. A regular expression over Σ is defined recursively as follows:
(1) ∅ and ε are regular expressions, denoting ∅ and {ε}, respectively.
(2) If a is a symbol in Σ, then a is a regular expression that denotes {a}.
(3) Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then (r) + (s), (r) · (s), r*, and (r) are also regular expressions, denoting L(r) ∪ L(s), L(r)L(s), (L(r))*, and L(r), respectively.
(4) All regular expressions can be obtained by applying rules (1), (2), and (3) a finite number of times.
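The recursive definition in rules (1) to (3) maps directly onto a small syntax tree. The sketch below is an illustration only, not the Aleshin type automata construction of the paper: the class names are hypothetical, and membership is tested with Brzozowski derivatives, a technique swapped in here because it needs no automaton at all.

```python
from dataclasses import dataclass


class Regex:
    """Base class for the illustrative syntax-tree nodes below."""


@dataclass(frozen=True)
class Empty(Regex):      # rule (1): denotes the empty set
    pass


@dataclass(frozen=True)
class Epsilon(Regex):    # rule (1): denotes {ε}
    pass


@dataclass(frozen=True)
class Symbol(Regex):     # rule (2): denotes {a}
    char: str


@dataclass(frozen=True)
class Union(Regex):      # rule (3): (r) + (s)
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Concat(Regex):     # rule (3): (r) · (s)
    left: Regex
    right: Regex


@dataclass(frozen=True)
class Star(Regex):       # rule (3): r*
    inner: Regex


def nullable(r):
    """Return True when ε belongs to L(r)."""
    if isinstance(r, (Epsilon, Star)):
        return True
    if isinstance(r, Union):
        return nullable(r.left) or nullable(r.right)
    if isinstance(r, Concat):
        return nullable(r.left) and nullable(r.right)
    return False  # Empty and Symbol


def derivative(r, a):
    """Return a regex for {w : a·w is in L(r)}."""
    if isinstance(r, Symbol):
        return Epsilon() if r.char == a else Empty()
    if isinstance(r, Union):
        return Union(derivative(r.left, a), derivative(r.right, a))
    if isinstance(r, Concat):
        first = Concat(derivative(r.left, a), r.right)
        if nullable(r.left):
            return Union(first, derivative(r.right, a))
        return first
    if isinstance(r, Star):
        return Concat(derivative(r.inner, a), r)
    return Empty()  # Empty and Epsilon contain no word starting with a


def matches(r, word):
    """Decide whether word is in L(r) by deriving symbol by symbol."""
    for a in word:
        r = derivative(r, a)
    return nullable(r)
```

The parenthesised form (r) from rule (3) needs no node of its own, since grouping is already encoded by the tree shape. No simplification is applied to the derivatives, so the expressions grow with the input; this is acceptable for a demonstration but not for production matching.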

A Memory Efficient Regular Expression Matching by Compressing Deterministic Finite Automata

2015

Regular expressions are expressive and are nowadays broadly used to represent signatures of various attacks. Signature-based regular expression matching is the central component of today's security systems, such as intrusion detection and prevention systems. Deterministic finite automata are often used to represent regular expressions. In regular expression matching, the storage space of a deterministic finite automaton is a major concern: a massive amount of memory is needed to store its transition function. The method described in this paper reduces the size of the deterministic finite automaton built from regular expressions. The performance of regular expression matching with the compressed deterministic finite automaton is evaluated on a regular expression set.

An optimal construction of finite automata from regular expressions

2008

We consider the construction of finite automata from their corresponding regular expressions by a series of digraph-transformations along the expression's structure. Each intermediate graph represents an extended finite automaton accepting the same language. The character of our construction allows a fine-grained analysis of the emerging automaton's size, eventually leading to an optimality result.

Finite Automata, Digraph Connectivity, and Regular Expression Size

Lecture Notes in Computer Science, 2008

Recently, lower bounds on the minimum required size for the conversion of deterministic finite automata into regular expressions, and on the required size of regular expressions resulting from applying some basic language operations to them, were given by Gelade and Neven [8]. We strengthen and extend these results, obtaining lower bounds that are in part optimal, and, notably, the presented examples are over a binary alphabet, which is best possible. To this end, we develop a different, more versatile lower bound technique that is based on the star height of regular languages. It is known that for a restricted class of regular languages, the star height can be determined from the digraph underlying the transition structure of the minimal finite automaton accepting that language. In this way, star height is tied to cycle rank, a structural complexity measure for digraphs proposed by Eggan and Büchi, which measures the degree of connectivity of directed graphs.

A Polynomial-time Regular Expressions Implementation

Cadernos do IME - Série Informática, 2017

Regular expressions are a notation to define regular languages in terms of simple composable operations. They are equivalent to finite automata in expressive power. In practice, however, modern regular expression implementations diverge from the original theory. Most changes are made to allow greater expressive power. This convenience comes at the cost of making language membership a harder problem than it could be. In many modern languages, regex language membership is an NP-complete problem. Besides, the way they are implemented sometimes causes expressions that could be processed in linear time to take exponential time. This fact may be seen as a security risk for many applications that use regular expressions. In this work, we suggest a simple implementation (based on Thompson's Construction Algorithm) that has better worst-case performance than many popular implementations. We also introduce a notation for the automata created by this algorithm that makes the adopted implementation easier to understand.
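A minimal sketch of the idea, assuming a generic Thompson-style construction rather than the implementation or notation of the paper: fragments are built with hypothetical combinator functions, and matching tracks the set of active states instead of backtracking, so each input symbol costs time proportional to the automaton size.

```python
class State:
    """NFA state with ε-moves and at most one labelled move."""

    def __init__(self):
        self.eps = []     # targets of ε-transitions
        self.edges = {}   # symbol -> target state


def epsilon():
    s, t = State(), State()
    s.eps.append(t)
    return s, t


def symbol(c):
    s, t = State(), State()
    s.edges[c] = t
    return s, t


def concat(f1, f2):
    f1[1].eps.append(f2[0])          # old accept state feeds the next start
    return f1[0], f2[1]


def union(f1, f2):
    s, t = State(), State()
    s.eps += [f1[0], f2[0]]
    f1[1].eps.append(t)
    f2[1].eps.append(t)
    return s, t


def star(f):
    s, t = State(), State()
    s.eps += [f[0], t]               # enter the loop or skip it
    f[1].eps += [f[0], t]            # repeat the loop or leave it
    return s, t


def closure(states):
    """Return all states reachable from the given ones through ε-moves."""
    seen = set(states)
    stack = list(seen)
    while stack:
        q = stack.pop()
        for nxt in q.eps:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def matches(fragment, word):
    """Simulate the NFA on word, keeping the set of active states."""
    start, accept = fragment
    current = closure({start})
    for c in word:
        current = closure({q.edges[c] for q in current if c in q.edges})
    return accept in current


def pathological(n):
    """Build (a?)^n a^n, a pattern that is slow for many backtracking matchers."""
    frag = epsilon()
    for _ in range(n):
        frag = concat(frag, union(symbol("a"), epsilon()))
    for _ in range(n):
        frag = concat(frag, symbol("a"))
    return frag
```

Each combinator mutates the fragments it receives, so a fragment should be used only once. With this simulation, `matches(pathological(25), "a" * 25)` finishes quickly because the active set never exceeds the number of NFA states.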

An improved algorithm to accelerate regular expression evaluation

Proceedings of the 3rd ACM/IEEE Symposium on Architecture for networking and communications systems - ANCS '07, 2007

Modern network intrusion detection systems need to perform regular expression matching at line rate in order to detect the occurrence of critical patterns in packet payloads. While deterministic finite automata (DFAs) allow this operation to be performed in linear time, they may exhibit prohibitive memory requirements. In [9], Kumar et al. propose Delayed Input DFAs (D²FAs), which provide a trade-off between the memory requirements of the compressed DFA and the number of states visited for each character processed, which corresponds directly to the memory bandwidth required to evaluate regular expressions.
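The trade-off can be sketched with default transitions: a state stores only the transitions that differ from those of its default target, and a lookup follows default pointers, without consuming input, until a stored transition is found. The example below is a simplified illustration under assumptions: the three-state DFA (searching for "ab" over the alphabet {a, b, c}) and the choice of default pointers are hypothetical, and no attempt is made to pick defaults optimally as the cited work does.

```python
# Full transition table of a small DFA (hypothetical example data).
FULL = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 2, "c": 0},
    2: {"a": 1, "b": 0, "c": 0},
}

# Default pointers (assumed, not optimised); state 0 is the root and has none.
DEFAULT = {1: 0, 2: 0}


def compress(full, default):
    """Drop every transition that a state shares with its default target."""
    labeled = {}
    for q, row in full.items():
        d = default.get(q)
        if d is None:
            labeled[q] = dict(row)   # root keeps its complete row
        else:
            labeled[q] = {c: t for c, t in row.items() if full[d][c] != t}
    return labeled


def step(labeled, default, q, c):
    """Consume symbol c from state q; return (next_state, states_visited)."""
    visited = 1
    while c not in labeled[q]:
        q = default[q]               # default transition, no input consumed
        visited += 1
    return labeled[q][c], visited


if __name__ == "__main__":
    labeled = compress(FULL, DEFAULT)
    stored = sum(len(row) for row in labeled.values())
    print("stored transitions:", stored, "of", 3 * len(FULL))
    print(step(labeled, DEFAULT, 2, "a"))
```

In this example 4 of the 9 transitions remain, at the price of visiting up to two states for a single input character. The lookup terminates only if every chain of default pointers ends at a state with a complete row.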

The state complexity of random DFAs

Theoretical Computer Science, 2016

The state complexity of a deterministic finite-state automaton (DFA) is the number of states in its minimal equivalent DFA. We study the state complexity of random n-state DFAs over a k-symbol alphabet, drawn uniformly from the set [n]^([n]×[k]) × 2^[n] of all such automata. (By symmetry, we may always take the state q = 1 to be the starting state.) We show that, with high probability, the latter is α_k n + O(√n log n) for a certain explicit constant α_k.
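The quantity studied here can be estimated empirically. The sketch below is an illustration under assumptions, not the analysis of the paper: it samples a transition function uniformly, picks the accepting set by independent fair coin flips (which is uniform over subsets), uses state 0 as the start state, and counts the classes of reachable states after Moore-style partition refinement. Function names are hypothetical.

```python
import random


def random_dfa(n, k, rng):
    """Sample a DFA with states 0..n-1 and symbols 0..k-1."""
    delta = [[rng.randrange(n) for _ in range(k)] for _ in range(n)]
    accepting = [rng.random() < 0.5 for _ in range(n)]
    return delta, accepting


def state_complexity(delta, accepting, start=0):
    """Return the number of states of the minimal equivalent DFA."""
    # Step 1: keep only the states reachable from the start state.
    reachable = {start}
    stack = [start]
    while stack:
        q = stack.pop()
        for t in delta[q]:
            if t not in reachable:
                reachable.add(t)
                stack.append(t)

    # Step 2: refine the accepting / non-accepting split until it is stable.
    block = {q: int(accepting[q]) for q in reachable}
    num_blocks = len(set(block.values()))
    while True:
        ids = {}
        new_block = {}
        for q in reachable:
            # A state's signature is its own block plus its successors' blocks.
            signature = (block[q], tuple(block[t] for t in delta[q]))
            new_block[q] = ids.setdefault(signature, len(ids))
        if len(ids) == num_blocks:
            return num_blocks
        block, num_blocks = new_block, len(ids)


if __name__ == "__main__":
    rng = random.Random(0)
    n, k = 2000, 2
    delta, accepting = random_dfa(n, k, rng)
    print(state_complexity(delta, accepting) / n)
```

Running the script prints the ratio of minimal states to n for one sample; averaging over several samples and values of n gives a rough empirical counterpart of the constant in the theorem.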