a Racket AST Generator Generator (original) (raw)
8.15
Danny Yoo <dyoo@hashcollision.org>
1 Informal quickstartđź”—â„ą
Salutations! Let’s consider the following scenario: say that we’re given the following string:
(... and pretend that we don’t already know about the built-inread function.)
How do we go about turning this kind of string into a structured value? That is, how would we parse it?
We need to first consider the shape of the things we’d like to parse. The string above looks like a deeply nested list of words. How might we describe this formally? A convenient notation to describe the shape of these things isBackus-Naur Form (BNF). So let’s try to notate the structure of nested word lists in BNF.
nested-word-list: WORD | LEFT-PAREN nested-word-list* RIGHT-PAREN
What we intend by this notation is this: nested-word-list is either an atomic WORD, or a parenthesized list of any number ofnested-word-lists. We use the character * to represent zero or more repetitions of the previous thing, and we treat the uppercasedLEFT-PAREN, RIGHT-PAREN, and WORD as placeholders for atomic tokens.
See Installation for instructions on installingragg.
Here are a few examples of tokens:
> (require ragg/support) > (token 'LEFT-PAREN) (token-struct 'LEFT-PAREN #f #f #f #f #f #f) > (token 'WORD "crunchy" #:span 7) (token-struct 'WORD "crunchy" #f #f #f 7 #f) > (token 'RIGHT-PAREN) (token-struct 'RIGHT-PAREN #f #f #f #f #f #f)
Have we made progress? At this point, we only have a BNF description in hand, but we’re still missing a parser, something to take that description and use it to make structures out of a sequence of tokens.
It’s clear that we don’t yet have a program because there’s no #langline. We should add one. Put #lang ragg at the top of the BNF description, and save it as a file called "nested-word-list.rkt".
"nested-word-list.rkt"
#lang ragg nested-word-list: WORD | LEFT-PAREN nested-word-list* RIGHT-PAREN
Now it is a proper program. But what does it do?
> (require "nested-word-list.rkt") > parse #procedure:parse
It gives us a parse function. Let’s investigate what parsedoes for us. What happens if we pass it a sequence of tokens?
> (define a-parsed-value (parse (list (token 'LEFT-PAREN "(") (token 'WORD "some") (token 'LEFT-PAREN "[") (token 'WORD "pig") (token 'RIGHT-PAREN "]") (token 'RIGHT-PAREN ")")))) > a-parsed-value #<syntax (nested-word-list "(" (nested-word-list "some") (nested-word-list "[" (nested-word-list "pig") "]") ")")>
Wait... that looks suspiciously like a syntax object!
> (syntax->datum a-parsed-value) '(nested-word-list "(" (nested-word-list "some") (nested-word-list "[" (nested-word-list "pig") "]") ")")
That’s (some [pig]), essentially.
What happens if we pass it a more substantial source of tokens?
; tokenize: string -> (sequenceof token-struct?) ; Generate tokens from a string: ; For example: > (define token-source (tokenize "(welcome (to (((ragg)) ())))")) > (define v (parse token-source)) > (syntax->datum v) '(nested-word-list "(" (nested-word-list "welcome") (nested-word-list "(" (nested-word-list "to") (nested-word-list "(" (nested-word-list "(" (nested-word-list "(" (nested-word-list "ragg") ")") ")") (nested-word-list "(" ")") ")") ")") ")")
Welcome to ragg.
2 Introductionđź”—â„ą
ragg is a parsing framework for Racket with the design goal to be easy to use. It includes the following features:
- It provides a #lang for writing extended BNF grammars. A module written in #lang ragg automatically generates a parser. The output of this parser tries to followHTDPdoctrine; the structure of the grammar informs the structure of the Racket syntax objects it generates.
- The language uses a few conventions to simplify the expression of grammars. The first rule in the grammar is automatically assumed to be the starting production. Identifiers in uppercase are assumed to represent terminal tokens, and are otherwise the names of nonterminals.
- Tokenizers can be developed completely independently of parsers.ragg takes a liberal view on tokens: they can be strings, symbols, or instances constructed with token. Furthermore, tokens can optionally provide location: if tokens provide location, the generated syntax objects will as well.
- The underlying parser should be able to handle ambiguous grammars.
- It should integrate with the rest of the Racketlanguage toolchain.
2.1 Installationđź”—â„ą
At the time of this writing, Racket 5.3.2 is inpre-release.
If you are using a version of Racket > 5.3.1, then follow the instructions on thePLaneT2 page.- For those who are using Racket <= 5.3.1, you can download the following PLT package:
<ragg.plt> [md5sum: ab79038b40e510a5cf13363825c4aef4]
Last updated: Wednesday, January 16th, 2013
Once downloaded, either use DrRacket’s package installation features (Install PLT File... under DrRacket’s File menu), or use the command line:
2.2 Example: a small DSL for ASCII diagramsđź”—â„ą
To motivate ragg’s design, let’s look at the following toy problem: we’d like to define a language for drawing simple ASCII diagrams. We’d like to be able write something like this:
3 9 X; 6 3 b 3 X 3 b; 3 9 X;
whose interpretation should generate the following picture:
XXXXXXXXX XXXXXXXXX XXXXXXXXX XXX XXX XXX XXX XXX XXX XXXXXXXXX XXXXXXXXX XXXXXXXXX
2.3 Syntax and semanticsđź”—â„ą
We’re being very fast-and-loose with what we mean by the program above, so let’s try to nail down some meanings. Each line of the program has a semicolon at the end, and describes the output of several rows of the line drawing. Let’s look at two of the lines in the example:
- 3 9 X;: “Repeat the following 3 times: print "X" nine times, followed by a newline.”
- 6 3 b 3 X 3 b;: “Repeat the following 6 times: print " " three times, followed by "X" three times, followed by " " three times, followed by a newline.”
Then each line consists of a repeat number, followed by pairs of (number, character) chunks. We will assume here that the intent of the lowercased character b is to represent the printing of a 1-character whitespace " ", and for other uppercase letters to represent the printing of themselves.
Once we have a better idea of the pieces of each line, we have a better chance to capture that meaning in a formal notation. Once we have each instruction in a structured format, we should be able to interpret it with a straighforward case analysis.
Here is a first pass at expressing the structure of these line-drawing programs.
2.4 Parsing the concrete syntaxđź”—â„ą
"simple-line-drawing.rkt"
#lang ragg drawing: rows* rows: repeat chunk+ ";" repeat: INTEGER chunk: INTEGER STRING
Syntax and terminology describes ragg’s syntax in more detail.
We write a ragg program as an extended BNF grammar, where patterns can be:
- the names of other rules (e.g. chunk)
- literal and symbolic token names (e.g. ";", INTEGER)
- quantified patterns (e.g. + to represent one-or-more repetitions)
The result of a ragg program is a module with a parse function that can parse tokens and produce a syntax object as a result.
Let’s exercise this function:
> (require ragg/support) > (require "simple-line-drawing.rkt") > (define stx (parse (list (token 'INTEGER 6) (token 'INTEGER 2) (token 'STRING " ") (token 'INTEGER 3) (token 'STRING "X") ";"))) > (syntax->datum stx) '(drawing (rows (repeat 6) (chunk 2 " ") (chunk 3 "X") ";"))
Tokens can either be: plain strings, symbols, or instances produced by thetoken function. (Plus a few more special cases, one in which we’ll describe in a moment.)
Preferably, we want to attach each token with auxiliary source location information. The more source location we can provide, the better, as the syntax objects produced by parse will incorporate them.
Let’s write a helper function, a lexer, to help us construct tokens more easily. The Racket standard library comes with a module calledparser-tools/lex which can help us write a position-sensitive tokenizer:
There are a few things to note from this lexer example:
- The parse function can consume either sequences of tokens, or a function that produces tokens. Both of these are considered sources of tokens.
- As a special case for acceptable tokens, a token can also be an instance of the position-token structure of parser-tools/lex, in which case the token will try to derive its position from that of the position-token.
- The parse function will stop reading from a token source if any token is void.
- The parse function will skip over any token with the#:skip? attribute. Elements such as whitespace and comments will often have #:skip? set to #t.
2.5 From parsing to interpretationđź”—â„ą
We now have a parser for programs written in this simple-line-drawing language. Our parser will give us back syntax objects:
> (syntax->datum parsed-program) '(drawing (rows (repeat 3) (chunk 9 "X") ";") (rows (repeat 6) (chunk 3 " ") (chunk 3 "X") (chunk 3 " ") ";") (rows (repeat 3) (chunk 9 "X") ";"))
Moreover, we know that these syntax objects have a regular, predictable structure. Their structure follows the grammar, so we know we’ll be looking at values of the form:
(drawing (rows (repeat ) (chunk ) ... ";") ...)
where drawing, rows, repeat, and chunkshould be treated literally, and everything else will be numbers or strings.
Still, these syntax object values are just inert structures. How do we interpret them, and make them print? We did claim at the beginning of this section that these syntax objects should be fairly easy to case-analyze and interpret, so let’s do it.
This is a very quick-and-dirty treatment of syntax-parse. See the syntax/parse documentation for a gentler guide to its features.
Racket provides a special form called syntax-parse in thesyntax/parse library. syntax-parse lets us do a structural case-analysis on syntax objects: we provide it a set of patterns to parse and actions to perform when those patterns match.
As a simple example, we can write a function that looks at a syntax object and says #t if it’s the literal yes, and #f otherwise:
> (require syntax/parse) ; yes-syntax-object?: syntax-object -> boolean ; Returns true if the syntax-object is yes. > (yes-syntax-object? #'yes) #t > (yes-syntax-object? #'nooooooooooo) #f
Here, we use ~literal to let syntax-parse know thatyes should show up literally in the syntax object. The patterns can also have some structure to them, such as:
which matches on syntax objects that begin, literally, with drawing, followed by any number of rows (which are syntax objects too).
Now that we know a little bit more about syntax-parse, we can use it to do a case analysis on the syntax objects that our parse function gives us. We start by defining a function on syntax objects of the form (drawing rows-stx ...).
When we encounter a syntax object with (drawing rows-stx ...), then interpret-rows each rows-stx.
Let’s define interpret-rows now:
For a rows, we extract out the repeat-number out of the syntax object and use it as the range of the for loop. The inner loop walks across each chunk-stx and calls interpret-chunk on it.
Finally, we need to write a definition for interpret-chunk. We want it to extract out the chunk-size and chunk-string portions, and print to standard output:
With these definitions in hand, now we can pass it syntax objects that we construct directly by hand:
> (interpret-chunk #'(chunk 3 "X")) XXX > (interpret-drawing #'(drawing (rows (repeat 5) (chunk 3 "X") ";")))
or we can pass it the result generated by our parser:
> (interpret-drawing parsed-program) XXXXXXXXXXXXXXXXXXXXXXXXXXX XXX XXX XXX XXX XXX XXX XXXXXXXXXXXXXXXXXXXXXXXXXXX
And now we’ve got an interpreter!
2.6 From interpretation to compilationđź”—â„ą
(Just as a warning: the following material is slightly more advanced, but shows how writing a compiler for the line-drawing language reuses the ideas for the interpreter.)
Wouldn’t it be nice to be able to write something like:
3 9 X; 6 3 b 3 X 3 b; 3 9 X;
and have Racket automatically compile this down to something like this?
Well, of course it won’t work: we don’t have a #lang line.
Let’s add one.
"letter-i.rkt"
#lang ragg/examples/simple-line-drawing 3 9 X; 6 3 b 3 X 3 b; 3 9 X;
Now "letter-i.rkt" is a program.
How does this work? From the previous sections, we’ve seen how to take the contents of a file and interpret it. What we want to do now is teach Racket how to compile programs labeled with this #lang line. We’ll do two things:
- Tell Racket to use the ragg-generated parser and lexer we defined earlier whenever it sees a program written with#lang ragg/examples/simple-line-drawing.
- Define transformation rules for drawing, rows, andchunk to rewrite these into standard Racket forms.
The second part, the writing of the transformation rules, will look very similar to the definitions we wrote for the interpreter, but the transformation will happen at compile-time. (We could just resort to simply calling into the interpreter we just wrote up, but this section is meant to show that compilation is also viable.)
We do the first part by defining a module reader: amodule reader tells Racket how to parse and compile a file. Whenever Racket sees a#lang , it looks for a corresponding module reader in"/lang/reader".
Here’s the definition for"ragg/examples/simple-line-drawing/lang/reader.rkt":
"ragg/examples/simple-line-drawing/lang/reader.rkt"
#lang s-exp syntax/module-reader ragg/examples/simple-line-drawing/semantics #:read my-read #:read-syntax my-read-syntax #:whole-body-readers? #t (require ragg/examples/simple-line-drawing/lexer ragg/examples/simple-line-drawing/grammar) (define (my-read in) (syntax->datum (my-read-syntax #f in))) (define (my-read-syntax src ip) (list (parse src (tokenize ip))))
We use a helper module syntax/module-reader, which provides utilities for creating a module reader. It uses the lexer andragg-generated parser we defined earlier (saved intolexer.rktandgrammar.rktmodules), and also tells Racket that it should compile the forms in the syntax object using a module called "semantics.rkt".
For a systematic treatment on capturing the semantics of a language, see Programming Languages: Application and Interpretation.
Let’s look into "semantics.rkt" and see what’s involved in compilation:
The semantics hold definitions for compile-drawing,compile-rows, and compile-chunk, similar to what we had for interpretation with interpret-drawing, interpret-rows, andinterpret-chunk. However, compilation is not the same as interpretation: each definition does not immediately execute the act of drawing, but rather returns a syntax object whose evaluation will do the actual work.
There are a few things to note:
- ragg’s native data structure is the syntax object because the majority of Racket’s language-processing infrastructure knows how to read and write this structured value.
By the way, we can just as easily rewrite the semantics so thatcompile-rows does explicitly call compile-chunk. Often, though, it’s easier to write the transformation functions in this piecemeal way and depend on the Racket macro expansion system to do the rewriting as it encounters each of the forms.
Unlike in interpretation, compile-rows doesn’t compile each chunk by directly calling compile-chunk. Rather, it depends on the Racket macro expander to call each compile-XXX function as it encounters a drawing, rows, or chunk in the parsed value. The three statements at the bottom of "semantics.rkt" inform the macro expansion system to do this:
Altogether, ragg’s intent is to be a parser generator generator for Racket that’s easy and fun to use. It’s meant to fit naturally with the other tools in the Racket language toolchain. Hopefully, it will reduce the friction in making new languages with alternative concrete syntaxes.
The rest of this document describes the ragg language and the parsers it generates.
3 The languageđź”—â„ą
3.1 Syntax and terminologyđź”—â„ą
A program in the ragg language consists of the language line#lang ragg, followed by a collection of rules andline comments.
A rule is a sequence consisting of: a rule identifier, a colon":", and a pattern.
A rule identifier is an identifier that is not in upper case.
A token identifier is an identifier that is in upper case.
An identifier is a character sequence of letters, numbers, and characters in "-.!$%&/<=>?^_~@". It must not contain* or +, as those characters are used to denote quantification.
A pattern is one of the following:
- an implicit sequence of patterns separated by whitespace
- a terminal: either a literal string or a token identifier
- a rule identifier
- a choice pattern: a sequence of patterns delimited with | characters.
- a quantifed pattern: a pattern followed by either * (“zero or more”) or + (“one or more”)
- an optional pattern: a pattern surrounded by [ and ]
- an explicit sequence: a pattern surrounded by ( and )
A line comment begins with either # or ; and continues till the end of the line.
For example, in the following program:
#lang ragg ;; A parser for a silly language sentence: verb optional-adjective object verb: greeting optional-adjective: ["happy" | "frumpy"] greeting: "hello" | "hola" "aloha" object: "world" | WORLD
the elements sentence, verb, greeting, and object are rule identifiers. The first rule, sentence: verb optional-adjective object, is a rule whose right side is an implicit pattern sequence of three sub-patterns. The uppercased WORLD is a token identifier. The fourth rule in the program associates greeting with a choice pattern.
More examples:
- ABNF for binary strings that contain an equal number of zeros and ones.
| #lang ragg |
| ----------------------------------------------------------------- |
| equal: [zero one | one zero] ;; equal number of "0"s and "1"s. |
| zero: "0" equal | equal "0" ;; has an extra "0" in it. |
| one: "1" equal | equal "1" ;; has an extra "1" in it. |
| ----------------------------------------- | ------ |
| json: number | string | |
| | array | object |
| number: NUMBER | |
| string: STRING | |
| array: "[" [json ("," json)*] "]" | |
| object: "{" [kvpair ("," kvpair)*] "}" | |
| kvpair: ID ":" json | |
The ragg github source repositoryincludesseveral more examples.
3.2 Syntax errorsđź”—â„ą
Besides the basic syntax errors that can occur with a malformed grammar, there are a few other classes of situations that #lang ragg will consider as syntax errors.
ragg will raise a syntax error if the grammar:
- doesn’t have any rules.
- has a rule with the same left hand side as any other rule.
- refers to rules that have not been defined. e.g. the following program:
should raise an error because bar has not been defined, even thoughfoo refers to it in an optional pattern. - uses the token name EOF; the end-of-file token type is reserved for internal use by ragg.
- contains a rule that has no finite derivation. e.g. the following program:
#lang ragg infinite-a: "a" infinite-a should raise an error because no finite sequence of tokens will satisfyinfinite-a.
Otherwise, ragg should be fairly tolerant and permit even ambiguous grammars.
3.3 Semanticsđź”—â„ą
A program written in #lang ragg produces a module that provides a few bindings. The most important of these is parse:
Parses the sequence of tokens according to the rules in the grammar, using the first rule as the start production. The parse must completely consumetoken-source.
The token source can either be a sequence, or a 0-arity function that produces tokens.
A token in ragg can be any of the following values:
- a string
- a symbol
- an instance produced by token
- an instance produced by the token constructors of parser-tools/lex
- an instance of parser-tools/lex’s position-token whoseposition-token-token is a token.
A token whose type is either void or 'EOF terminates the source.
If parse succeeds, it will return a structured syntax object. The structure of the syntax object follows the overall structure of the rules in the BNF. For each rule r and its associated pattern p,parse generates a syntax object #'(r p-value) wherep-value’s structure follows a case analysis on p:
- For implicit and explicit sequences of patterns p1,p2, ..., the corresponding values, spliced into the structure.
- For terminals, the value associated to the token.
- For rule identifiers: the associated parse value for the rule.
- For choice patterns: the associated parse value for one of the matching subpatterns.
- For quantifed patterns and optional patterns: the corresponding values, spliced into the structure.
Consequently, it’s only the presence of rule identifiers in a rule’s pattern that informs the parser to introduces nested structure into the syntax object.
If the grammar has ambiguity, ragg will choose and return a parse, though it does not guarantee which one it chooses.
If the parse cannot be performed successfully, or if a token in thetoken-source uses a type that isn’t mentioned in the grammar, thenparse raises an instance of exn:fail:parsing.
It’s often convenient to extract a parser for other non-terminal rules in the grammar, and not just for the first rule. A ragg-generated module also provides a form called make-rule-parser to extract a parser for the other non-terminals:
Constructs a parser for the name of one of the non-terminals in the grammar.
For example, given the ragg program"simple-arithmetic-grammar.rkt":
"simple-arithmetic-grammar.rkt"
#lang ragg expr : term ('+' term)* term : factor ('*' factor)* factor : INT
the following interaction shows how to extract a parser for terms.
> (require "simple-arithmetic-grammar.rkt") > (define term-parse (make-rule-parser term)) > (syntax->datum (parse tokens)) '(expr (term (factor 3) "*" (factor 4))) > (syntax->datum (term-parse tokens)) '(term (factor 3) "*" (factor 4)) > (define another-token-sequence (list (token 'INT 1) "+" (token 'INT 2) "*" (token 'INT 3))) > (syntax->datum (parse another-token-sequence)) '(expr (term (factor 1)) "+" (term (factor 2) "*" (factor 3))) ; Note that term-parse will break on another-token-sequence ; as it does not know what to do with the "+" > (term-parse another-token-sequence) Encountered parsing error near token '+ ("+") while parsing #f [line=#f, column=#f, offset=#f]
Finally, the module provides a set of all the used token types in the grammar in all-token-types:
A set of all the token types used in a grammar.
For example:
> (require "simple-arithmetic-grammar.rkt") > all-token-types (set '* '+ 'INT)
4 Support APIđź”—â„ą
The ragg/support module provides functions to interact withragg programs. The most useful is the token function, which produces tokens to be parsed.
Creates instances of token-structs.
The syntax objects produced by a parse will inject the value val in place of the token name in the grammar.
If #:skip? is true, then the parser will skip over it during a parse.
The token structure type.
Rather than directly using the token-struct constructor, please use the helper function token to construct instances.
The exception raised when parsing fails.
exn:fail:parsing implements Racket’s prop:exn:srclocproperty, so if this exception reaches DrRacket’s default error handler, DrRacket should highlight the offending locations in the source.
5 Caveats and things to dođź”—â„ą
Here are a few caveats and future aims for ragg.
- ragg doesn’t currently have a good story about operator precedence. Future versions of ragg will support the specification of operator precedence to deal with grammar ambiguity, probably by extending the BNF grammar rules in #lang ragg with keyword arguments.
- I currently depend on the lexer framework provided byparser-tools/lex, which has a steeper learning curve than I’d like. A future version of ragg will probably try to provide a nicer set of tools for defining lexers.
- The underlying parsing engine (an Earley-style parser) has not been fully optimized, so it may exhibit degenerate parse times. A future version ofragg will guarantee O(n3) time bounds so that at the very least, parses will be polynomial-time.
- ragg doesn’t yet have a good story on dealing with parser error recovery. If a parse fails, it tries to provide the source location, but does little else.
- ragg is slightly misnamed: what it really builds is a concrete syntax tree rather than an abstract syntax tree. A future version of raggwill probably support annotations on patterns so that they can be omitted or transformed in the parser output.
6 Miscellaneous and thanksđź”—â„ą
Thanks to Matthew Flatt for pointing me to cfg-parser from thecfg-parser library. Joe Politz gave me good advice and feedback. Also, he suggested the name “ragg”. Other alternatives I’d been considering were “autogrammar” or “chompy”. Thankfully, he is a better Namer than me. Daniel Patterson provided feedback that led tomake-rule-parser. Robby Findler and Guillaume Marceau provided steadfast suggestions to look into other parsing frameworks likeSDF andSableCC. Special thanks to Shriram Krishnamurthi, who convinced me that other people might find this package useful.