Algebraic properties of structured context-free languages: old approaches and novel developments

Precedence Automata and Languages

Lecture Notes in Computer Science, 2011

Operator precedence grammars define a classical Boolean and deterministic context-free family (called Floyd languages or FLs). FLs have been shown to strictly include the well-known visibly pushdown languages, and enjoy the same nice closure properties. We introduce here Floyd automata, an equivalent operational formalism for defining FLs. This also makes it possible to extend the class to infinite strings, for instance to perform model checking.

Operator Precedence Languages: Their Automata-Theoretic and Logic Characterization

SIAM Journal on Computing, 2015

Operator precedence languages were introduced half a century ago by Robert Floyd to support deterministic and efficient parsing of context-free languages. Recently, we renewed our interest in this class of languages thanks to a few distinguishing properties that make them attractive for exploiting various modern technologies. Precisely, their local parsability enables parallel and incremental parsing, whereas their closure properties make them amenable to automatic verification techniques, including model checking. In this paper we provide a fairly complete theory of this class of languages: we introduce a class of automata whose recognizing power matches the generative power of the grammars; we provide a characterization of their sentences in terms of monadic second-order logic, as has been done in previous literature for more restricted language classes such as regular, parenthesis, and input-driven ones; we investigate which properties are preserved and which are lost when extending the language sentences from finite length to infinite length (ω-languages). As a result, we obtain a class of languages that enjoys many nice properties of regular languages (closure and decidability properties, logic characterization) but is considerably larger than other families with the same properties (typically parenthesis and input-driven ones), covering "almost" all deterministic languages.

Logic Characterization of Invisibly Structured Languages: the Case of Floyd Languages

2013

Operator precedence grammars define a classical Boolean and deterministic context-free language family (called Floyd languages or FLs). FLs have been shown to strictly include the well-known Visibly Pushdown Languages, and enjoy the same nice closure properties. In this paper we provide a complete characterization of FLs in terms of a suitable Monadic Second-Order Logic. Traditional approaches to logic characterization of formal languages refer explicitly to the structures over which they are interpreted (e.g., trees or graphs) or to strings that are isomorphic to the structure, as in parenthesis languages. In the case of FLs, instead, the syntactic structure of input strings is "invisible" and must be reconstructed through parsing. This requires that logic formulae encode some typical context-free parsing actions, such as shift-reduce ones.

Aperiodicity, Star-freeness, and First-order Logic Definability of Operator Precedence Languages

Logical Methods in Computer Science, 2023

A classic result in formal language theory is the equivalence among noncounting, or aperiodic, regular languages, and languages defined through star-free regular expressions, or first-order logic. Past attempts to extend this result beyond the realm of regular languages have met with difficulties: for instance, it is known that star-free tree languages may violate the non-counting property and there are aperiodic tree languages that cannot be defined through first-order logic. We extend such classic equivalence results to a significant family of deterministic context-free languages, the operator-precedence languages (OPL), which strictly includes the widely investigated visibly pushdown, alias input-driven, family and other structured context-free languages. The OP model originated in the '60s for defining programming languages and is still used by high performance compilers; its rich algebraic properties were investigated initially in connection with grammar learning and have recently been completed with further closure properties and with a monadic second-order logic definition. We introduce an extension of regular expressions, the OP-expressions (OPEs), which define the OPLs and, under the star-free hypothesis, define first-order definable and non-counting OPLs. Then, we prove, through a fairly articulated grammar transformation, that aperiodic OPLs are first-order definable. Thus, the classic equivalence of star-freeness, aperiodicity, and first-order definability is established for the large and powerful class of OPLs. We argue that the same approach can be exploited to obtain analogous results for visibly pushdown languages too.

Operator precedence and the visibly pushdown property

Journal of Computer and System Sciences, 2012

Operator precedence languages, designated as Floyd's Languages (FL) to honor their inventor, are a classical deterministic context-free family. FLs are known to be a Boolean family, and have been recently shown to strictly include the Visibly Pushdown Languages (VPDL); the latter are FLs characterized by operator precedence relations determined by the alphabet partition. In this paper we give the non-obvious proofs that FLs have the same closure properties that motivated the introduction of VPDLs, namely under reversal, concatenation, and Kleene's star. Thus, rather surprisingly, the historical FL family turns out to be the largest known deterministic context-free family that includes the VPDL and has the same closure properties needed for applications to model checking and for defining mark-up languages such as HTML. As a corollary, an extended regular expression of precedence-compatible FLs is an FL and a deterministic parser for it can be algorithmically obtained.

Logic Characterization of Floyd Languages

ArXiv, 2012

Floyd languages (FL), alias Operator Precedence Languages, have recently received renewed attention thanks to their closure properties and local parsability, which allow one to apply automatic verification techniques (e.g., model checking) and parallel and incremental parsing. They properly include various other classes, notably the Visibly Pushdown languages. In this paper we provide a characterization of FL in terms of a monadic second-order logic (MSO), in the same style as Büchi's for regular languages. We prove the equivalence between automata recognizing FL and the MSO formalization.

Higher-Order Operator Precedence Languages

Electronic Proceedings in Theoretical Computer Science

Floyd's Operator Precedence (OP) languages are a deterministic context-free family having many desirable properties. They are parsable locally and in parallel, and languages having a compatible structure are closed under Boolean operations, concatenation, and star; they properly include the family of Visibly Pushdown (or Input-Driven) languages. OP languages are based on three relations between any two consecutive terminal symbols, which assign syntax structure to words. We extend such relations to k-tuples of consecutive terminal symbols, by using the model of strictly locally testable regular languages of order k ≥ 3. The new corresponding class of Higher-order Operator Precedence languages (HOP) properly includes the OP languages, and it is still included in the deterministic (also in reverse) context-free family. We prove Boolean closure for each subfamily of structurally compatible HOP languages. In each subfamily, the top language is called the max-language. We show that such languages are defined by a simple cancellation rule and we prove several properties, in particular that max-languages form an infinite hierarchy ordered by the parameter k. HOP languages are a candidate for replacing OP languages in the various applications where they have been successful though sometimes too restrictive.
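The three precedence relations mentioned in the abstract above (⋖ "yields precedence", ≐ "equal in precedence", ⋗ "takes precedence") can drive a shift-reduce parser directly. Below is a hedged Python sketch, not taken from the paper: the grammar, the table, and all names are illustrative assumptions. It uses the classic precedence table for E → E+T, T → T*F, F → n | (E) and returns the handles in the order they are reduced, which is exactly the syntax structure the relations assign to the word:

```python
# Hedged sketch of Floyd-style operator-precedence parsing; the grammar,
# the table, and all names are illustrative assumptions, not from the paper.
# Relations: "<" (yields precedence), "=" (equal), ">" (takes precedence).
LT, EQ, GT = "<", "=", ">"

# Classic precedence table for E -> E+T, T -> T*F, F -> n | (E),
# with '#' as the end-of-input delimiter.
PREC = {
    ("+", "+"): GT, ("+", "*"): LT, ("+", "n"): LT, ("+", "("): LT,
    ("+", ")"): GT, ("+", "#"): GT,
    ("*", "+"): GT, ("*", "*"): GT, ("*", "n"): LT, ("*", "("): LT,
    ("*", ")"): GT, ("*", "#"): GT,
    ("n", "+"): GT, ("n", "*"): GT, ("n", ")"): GT, ("n", "#"): GT,
    ("(", "+"): LT, ("(", "*"): LT, ("(", "n"): LT, ("(", "("): LT,
    ("(", ")"): EQ,
    (")", "+"): GT, (")", "*"): GT, (")", ")"): GT, (")", "#"): GT,
    ("#", "+"): LT, ("#", "*"): LT, ("#", "n"): LT, ("#", "("): LT,
}

def op_parse(tokens):
    """Return the handles in reduction order; 'N' marks an already
    reduced (non-terminal) subtree."""
    stack = [["#", None]]   # entries: [symbol, relation to previous terminal]
    inp = list(tokens) + ["#"]
    handles, i = [], 0
    while True:
        k = len(stack) - 1  # topmost *terminal* (skip 'N' placeholders)
        while stack[k][0] == "N":
            k -= 1
        top, nxt = stack[k][0], inp[i]
        if top == "#" and nxt == "#":
            return handles  # accept
        rel = PREC.get((top, nxt))
        if rel is None:
            raise ValueError(f"no relation between {top!r} and {nxt!r}")
        if rel in (LT, EQ):  # shift
            stack.append([nxt, rel])
            i += 1
        else:                # top > nxt: pop and record the handle
            handle = []
            while True:
                sym, r = stack.pop()
                handle.append(sym)
                if r == LT:
                    # a non-terminal just below the handle belongs to it
                    if stack[-1][0] == "N":
                        handle.append(stack.pop()[0])
                    break
            handles.append("".join(reversed(handle)))
            stack.append(["N", None])

# '*' binds tighter than '+': the multiplication is reduced first.
print(op_parse(list("n+n*n")))   # ['n', 'n', 'n', 'N*N', 'N+N']
```

Note that the parser never consults the grammar's rules during a reduction: the relations alone delimit each handle, which is the property behind the local parsability claimed in the abstract.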

Aperiodicity, Star-freeness, and First-order Definability of Structured Context-Free Languages

ArXiv, 2020

A classic result in formal language theory is the equivalence among noncounting, or aperiodic, regular languages, and languages defined through star-free regular expressions, or first-order logic. Together with the first-order completeness of linear temporal logic, these results constitute a theoretical foundation for model-checking algorithms. Extending these results to structured subclasses of context-free languages, such as tree languages, did not work as smoothly: for instance, W. Thomas showed that there are star-free tree languages that are counting. We show, instead, that investigating the same properties within the family of operator precedence languages leads to equivalences that perfectly match those on regular languages. The study of this old family of context-free languages has been recently resumed to enhance not only parsing (the original motivation of its inventor R. Floyd) but also to exploit their algebraic and logic properties. We have been able to reproduce the classic r...

Algebraic properties of operator precedence languages

Information and Control, 1978

This paper presents new results on the algebraic ordering properties of operator precedence grammars and languages. This work was motivated by, and applied to, the mechanical acquisition or inference of operator precedence grammars. A new normal form of operator precedence grammars called homogeneous is defined. An algorithm is given to construct a grammar, called maxgrammar, generating the largest language which is compatible with a given precedence matrix. Then the class of free grammars is introduced as a special subclass of operator precedence grammars. It is shown that operator precedence languages corresponding to a given precedence matrix form a Boolean algebra.

Context-free languages and pushdown automata

1997

Suppose we want to generate a set of strings (a language) L over an alphabet Σ. How shall we specify our language? One very useful way is to write a grammar for L.

A grammar is composed of a set of rules. Each rule may make use of the elements of Σ (which we'll call the terminal alphabet or terminal vocabulary), as well as an additional alphabet, the non-terminal alphabet or vocabulary. To distinguish between the two, we will use lower-case letters (a, b, c, etc.) for the terminal alphabet and upper-case letters (A, B, C, S, etc.) for the non-terminal alphabet. (But this is just a convention. Any character can be in either alphabet; the only requirement is that the two alphabets be disjoint.)

A grammar generates strings in a language using rules, which are instructions, or better, licenses, to replace some non-terminal symbol by some string. Typical rules look like this: S → ASa, B → aB, A → SaSSbB. In context-free grammars, rules have a single non-terminal symbol (upper-case letter) on the left, and any string of terminal and/or non-terminal symbols on the right. So even things like A → A and B → ε are perfectly good context-free grammar rules. What's not allowed is anything with more than one symbol to the left of the arrow (AB → a), a single terminal symbol on the left (a → Ba), or no symbols at all on the left (ε → Aab). The idea is that each rule allows the replacement of the symbol on its left by the string on its right. We call these grammars context-free because every rule has just a single non-terminal on its left: we can't add any contextual restrictions (such as allowing A to be rewritten only in the context aAa), so each replacement is done independently of all the others.

To generate strings we start with a designated start symbol, often S (for "sentence"), and apply the rules as many times as we please, whenever any one is applicable. To get this process going, there will clearly have to be at least one rule in the grammar with the start symbol on the left-hand side.
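The rewriting process just described can be made concrete with a small Python sketch. The grammar and all names below are our own choices for illustration, not from the text: each step replaces the leftmost non-terminal of the working string using a chosen rule alternative.

```python
# Illustrative sketch of grammar rewriting (grammar and names are our
# own choices, not from the text): apply one rule per step to the
# leftmost non-terminal (upper-case letter) of the working string.
RULES = {"S": ["aSb", ""]}   # S -> aSb | epsilon; "" stands for epsilon

def derive(start, choices):
    """choices is a list of (non-terminal, alternative-index) picks;
    returns every working string along the derivation."""
    working, steps = start, [start]
    for lhs, alt in choices:
        i = next(k for k, c in enumerate(working) if c.isupper())
        assert working[i] == lhs, "pick must name the leftmost non-terminal"
        working = working[:i] + RULES[lhs][alt] + working[i + 1:]
        steps.append(working)
    return steps

# S => aSb => aaSbb => aabb, a string of the language {a^n b^n}
print(derive("S", [("S", 0), ("S", 0), ("S", 1)]))
# ['S', 'aSb', 'aaSbb', 'aabb']
```

The derivation stops exactly under the two conditions discussed below: either the working string is all terminals, or no rule applies.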
(If there isn't, then the grammar won't generate any strings and will therefore generate ∅, the empty language.) Suppose, however, that the start symbol is S and the grammar contains both the rules S → AB and S → aBaa. We may apply either one, producing AB as the "working string" in the first case and aBaa in the second.

Next we need to look for rules that allow further rewriting of our working string. In the first case (where the working string is AB), we want rules with either A or B on the left (any non-terminal symbol of the working string may be rewritten by rule at any time); in the latter case, we will need a rule rewriting B. If, for example, there is a rule B → aBb, then our first working string could be rewritten as AaBb (the A stays, of course, awaiting its chance to be replaced), and the second would become aaBbaa.

How long does this process continue? It will necessarily stop when the working string has no symbols that can be replaced. This happens if either: (1) the working string consists entirely of terminal symbols (including, as a special case, when the working string is ε, the empty string), or (2) there are non-terminal symbols in the working string but none appears on the left-hand side of any rule in the grammar (e.g., if the working string were AaBb, but no rule had A or B on the left).

Example 4: L = {a^n b^2n}. You should recognize that b^2n = (bb)^n, and so this is just like the first example except that instead of matching a and b, we will match a and bb. So we want G = ({S, a, b}, {a, b}, R, S) where R = {S → aSbb, S → ε}. If you wanted, you could use an auxiliary non-terminal, e.g., G = ({S, B, a, b}, {a, b}, R, S) where R = {S → aSB, S → ε, B → bb}, but that is just cluttering things up.

Example 5: L = {a^n b^n c^m}. Here, the c^m portion of any string in L is completely independent of the a^n b^n portion, so we should generate the two portions separately and concatenate them together.
A solution is G = ({S, N, C, a, b, c}, {a, b, c}, R, S) where R = {S → NC, N → aNb, N → ε, C → cC, C → ε}. This independence buys us freedom: producing the c's to the right is completely independent of making the matching a^n b^n, and so could be done in any manner; e.g., alternate rules like C → CC, C → c, C → ε would also work fine. Thinking modularly and breaking the problem into more manageable subproblems is very helpful for designing CFGs.

Example 6: L = {a^n b^m c^n}. Here, the b^m is independent of the matching a^n…c^n, but it cannot be generated "off to the side": it must be done in the middle, when we are done producing a and c pairs. Once we start producing the b's, there should be no more a, c pairs made, so a second non-terminal is needed. Thus we have G = ({S, B, a, b, c}, {a, b, c}, R, S) where R = {S → ε, S → aSc, S → B, B → bB, B → ε}. Do we need the rule S → ε? We end the recursion on S with S → B, and B → ε then ends the recursion on B, so even when n = 0 and m = 0 the derivation S ⇒ B ⇒ ε yields the empty string; the rule S → ε is harmless but redundant.
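The grammars of Examples 4 and 6 translate directly into recursive membership tests, because each rule either peels matched symbols off both ends of the string or hands over to the next non-terminal. Here is a hedged Python sketch; the function names are our own, not from the text:

```python
# Hedged sketches of membership tests mirroring the grammars in
# Examples 4 and 6; function names are our own, not from the text.

def in_anb2n(w):
    """Example 4: L = {a^n b^2n}, grammar S -> aSbb | epsilon."""
    if w == "":
        return True                  # S -> epsilon
    if w.startswith("a") and w.endswith("bb"):
        return in_anb2n(w[1:-2])     # S -> a S bb
    return False

def in_anbmcn(w):
    """Example 6: L = {a^n b^m c^n}, grammar S -> aSc | B, B -> bB | eps."""
    if w.startswith("a") and w.endswith("c"):
        return in_anbmcn(w[1:-1])    # S -> a S c
    return set(w) <= {"b"}           # S -> B, and B generates b*

print(in_anb2n("aabbbb"), in_anb2n("aabb"))    # True False
print(in_anbmcn("aabcc"), in_anbmcn("abcc"))   # True False
```

Each recursive call mirrors exactly one grammar rule, which is why the structure of the grammar and the structure of the test coincide.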