Context-free languages and pushdown automata
Suppose we want to generate a set of strings (a language) L over an alphabet Σ. How shall we specify our language? One very useful way is to write a grammar for L. A grammar is composed of a set of rules. Each rule may make use of the elements of Σ (which we'll call the terminal alphabet or terminal vocabulary), as well as an additional alphabet, the nonterminal alphabet or vocabulary. To distinguish between the two, we will use lower-case letters (a, b, c, etc.) for the terminal alphabet and upper-case letters (A, B, C, S, etc.) for the nonterminal alphabet. (But this is just a convention. Any character can be in either alphabet. The only requirement is that the two alphabets be disjoint.)

A grammar generates strings in a language using rules, which are instructions, or better, licenses, to replace some nonterminal symbol by some string. Typical rules look like this: S → ASa, B → aB, A → SaSSbB. In context-free grammars, rules have a single nonterminal symbol (upper-case letter) on the left, and any string of terminal and/or nonterminal symbols on the right. So even things like A → A and B → ε are perfectly good context-free grammar rules. What's not allowed is a rule with more than one symbol to the left of the arrow (AB → a), with a single terminal symbol on the left (a → Ba), or with no symbols at all on the left (ε → Aab). The idea is that each rule allows the replacement of the symbol on its left by the string on its right. We call these grammars context free because every rule has just a single nonterminal on its left. We can't add any contextual restrictions (such as allowing A to be rewritten only when it appears in the context aAa). So each replacement is done independently of all the others.

To generate strings we start with a designated start symbol, often S (for "sentence"), and apply the rules as many times as we please, whenever any one is applicable. To get this process going, there will clearly have to be at least one rule in the grammar with the start symbol on the left-hand side.
(If there isn't, then the grammar won't generate any strings and will therefore generate ∅, the empty language.) Suppose, however, that the start symbol is S and the grammar contains both the rules S → AB and S → aBaa. We may apply either one, producing AB as the "working string" in the first case and aBaa in the second. Next we need to look for rules that allow further rewriting of our working string. In the first case (where the working string is AB), we want rules with either A or B on the left (any nonterminal symbol of the working string may be rewritten by a rule at any time); in the second case, we will need a rule rewriting B. If, for example, there is a rule B → aBb, then our first working string could be rewritten as AaBb (the A stays, of course, awaiting its chance to be replaced), and the second would become aaBbaa.

How long does this process continue? It will necessarily stop when the working string has no symbols that can be replaced. This would happen if either: (1) the working string consists entirely of terminal symbols (including, as a special case, when the working string is ε, the empty string), or (2) there are nonterminal symbols in the working string but none appears on the left-hand side of any rule in the grammar (e.g., if the working string were AaBb, but no rule had A or B on the left).

Example 4: L = {a^n b^{2n} : n ≥ 0}. You should recognize that b^{2n} = (bb)^n, and so this is just like the first example except that instead of matching a and b, we will match a and bb. So we want G = ({S, a, b}, {a, b}, R, S) where R = {S → aSbb, S → ε}. If you wanted, you could use an auxiliary nonterminal, e.g., G = ({S, B, a, b}, {a, b}, R, S) where R = {S → aSB, S → ε, B → bb}, but that is just cluttering things up.

Example 5: L = {a^n b^n c^m : n, m ≥ 0}. Here, the c^m portion of any string in L is completely independent of the a^n b^n portion, so we should generate the two portions separately and concatenate them together.
A solution is G = ({S, N, C, a, b, c}, {a, b, c}, R, S) where R = {S → NC, N → aNb, N → ε, C → cC, C → ε}. This independence buys us freedom: producing the c's to the right is completely independent of making the matching a^n b^n, and so could be done in any manner, e.g., alternate rules like C → CC, C → c, C → ε would also work fine. Thinking modularly and breaking the problem into more manageable subproblems is very helpful for designing CFGs.

Example 6: L = {a^n b^m c^n : n, m ≥ 0}. Here, the b^m is independent of the matching a^n…c^n. But it cannot be generated "off to the side." It must be done in the middle, when we are done producing a and c pairs. Once we start producing the b's, there should be no more a, c pairs made, so a second nonterminal is needed. Thus we have G = ({S, B, a, b, c}, {a, b, c}, R, S) where R = {S → ε, S → aSc, S → B, B → bB, B → ε}.

A note on the rule S → ε: we don't need it to end the recursion on S. We do that with S → B, and we have B → ε. Strictly speaking, it is not needed for the case n = 0 either: there we apply S → B at once, without generating any a…c pairs, and the empty string itself is derivable as S ⇒ B ⇒ ε. So S → ε is redundant in this grammar, though harmless; keeping it simply makes the case n = m = 0 explicit.
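The rewriting process described above can be tried out mechanically. What follows is only a minimal sketch in Python, not part of the original notes. It assumes the notes' convention that upper-case letters are nonterminals and everything else is a terminal. The function name generate, the parameter max_nonterminals, and the encoding of ε as the empty string "" are all choices made here for illustration. The sketch always rewrites the leftmost nonterminal, which is enough to reach every string a context-free grammar generates. It also discards working strings that hold too many terminals or nonterminals so that the search stops. That nonterminal cap is a heuristic: it is large enough for the grammars of Examples 4 to 6, but it could miss strings in grammars whose derivations need more nonterminals at once.

```python
from collections import deque


def generate(rules, start, max_len, max_nonterminals=3):
    """Return all terminal strings of length <= max_len found by the search.

    rules maps a nonterminal (one upper-case letter) to a list of
    right-hand sides; "" stands for the empty string ε.
    """
    seen = {start}
    queue = deque([start])
    language = set()
    while queue:
        form = queue.popleft()
        # Locate the leftmost nonterminal (upper-case letter), if any.
        pos = next((i for i, ch in enumerate(form) if ch.isupper()), None)
        if pos is None:
            language.add(form)  # only terminals remain: a generated string
            continue
        # A nonterminal with no rules leaves this working string stuck.
        for rhs in rules.get(form[pos], []):
            new_form = form[:pos] + rhs + form[pos + 1:]
            n_nonterminals = sum(ch.isupper() for ch in new_form)
            n_terminals = len(new_form) - n_nonterminals
            # Prune working strings that are too long to be useful (heuristic cap).
            if n_terminals > max_len or n_nonterminals > max_nonterminals:
                continue
            if new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return language


# The rule sets R of Examples 4, 5, and 6, written as Python dicts.
EX4 = {"S": ["aSbb", ""]}
EX5 = {"S": ["NC"], "N": ["aNb", ""], "C": ["cC", ""]}
EX6 = {"S": ["", "aSc", "B"], "B": ["bB", ""]}

print(sorted(generate(EX4, "S", 6), key=len))  # ['', 'abb', 'aabbbb']
```

Calling generate(EX5, "S", 4) or generate(EX6, "S", 5) lists the short strings of the other two languages in the same way. Dropping a rule from one of the dicts and comparing the two resulting sets gives a quick check, up to the chosen length bound, of whether that rule affects the generated language.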