a Racket AST Generator Generator (original) (raw)

8.15

Danny Yoo <dyoo@hashcollision.org>

1 Informal quickstartđź”—â„ą

Salutations! Let’s consider the following scenario: say that we’re given the following string:

(... and pretend that we don’t already know about the built-inread function.)

How do we go about turning this kind of string into a structured value? That is, how would we parse it?

We need to first consider the shape of the things we’d like to parse. The string above looks like a deeply nested list of words. How might we describe this formally? A convenient notation to describe the shape of these things isBackus-Naur Form (BNF). So let’s try to notate the structure of nested word lists in BNF.

nested-word-list: WORD
| LEFT-PAREN nested-word-list* RIGHT-PAREN

What we intend by this notation is this: nested-word-list is either an atomic WORD, or a parenthesized list of any number ofnested-word-lists. We use the character * to represent zero or more repetitions of the previous thing, and we treat the uppercasedLEFT-PAREN, RIGHT-PAREN, and WORD as placeholders for atomic tokens.

See Installation for instructions on installingragg.

Here are a few examples of tokens:

> (require ragg/support)
> (token 'LEFT-PAREN)
(token-struct 'LEFT-PAREN #f #f #f #f #f #f)
> (token 'WORD "crunchy" #:span 7)
(token-struct 'WORD "crunchy" #f #f #f 7 #f)
> (token 'RIGHT-PAREN)
(token-struct 'RIGHT-PAREN #f #f #f #f #f #f)

Have we made progress? At this point, we only have a BNF description in hand, but we’re still missing a parser, something to take that description and use it to make structures out of a sequence of tokens.

It’s clear that we don’t yet have a program because there’s no #langline. We should add one. Put #lang ragg at the top of the BNF description, and save it as a file called "nested-word-list.rkt".

"nested-word-list.rkt"

#lang ragg
nested-word-list: WORD
| LEFT-PAREN nested-word-list* RIGHT-PAREN

Now it is a proper program. But what does it do?

> (require "nested-word-list.rkt")
> parse
#procedure:parse

It gives us a parse function. Let’s investigate what parsedoes for us. What happens if we pass it a sequence of tokens?

> (define a-parsed-value (parse (list (token 'LEFT-PAREN "(") (token 'WORD "some") (token 'LEFT-PAREN "[") (token 'WORD "pig") (token 'RIGHT-PAREN "]") (token 'RIGHT-PAREN ")"))))
> a-parsed-value
#<syntax (nested-word-list "(" (nested-word-list "some") (nested-word-list "[" (nested-word-list "pig") "]") ")")>

Wait... that looks suspiciously like a syntax object!

> (syntax->datum a-parsed-value)
'(nested-word-list "(" (nested-word-list "some") (nested-word-list "[" (nested-word-list "pig") "]") ")")

That’s (some [pig]), essentially.

What happens if we pass it a more substantial source of tokens?

; tokenize: string -> (sequenceof token-struct?)
; Generate tokens from a string:
; For example:
> (define token-source (tokenize "(welcome (to (((ragg)) ())))"))
> (define v (parse token-source))
> (syntax->datum v)
'(nested-word-list "(" (nested-word-list "welcome") (nested-word-list "(" (nested-word-list "to") (nested-word-list "(" (nested-word-list "(" (nested-word-list "(" (nested-word-list "ragg") ")") ")") (nested-word-list "(" ")") ")") ")") ")")

Welcome to ragg.

2 Introductionđź”—â„ą

ragg is a parsing framework for Racket with the design goal to be easy to use. It includes the following features:

2.1 Installationđź”—â„ą

2.2 Example: a small DSL for ASCII diagramsđź”—â„ą

To motivate ragg’s design, let’s look at the following toy problem: we’d like to define a language for drawing simple ASCII diagrams. We’d like to be able write something like this:

3 9 X;
6 3 b 3 X 3 b;
3 9 X;

whose interpretation should generate the following picture:

XXXXXXXXX
XXXXXXXXX
XXXXXXXXX
XXX
XXX
XXX
XXX
XXX
XXX
XXXXXXXXX
XXXXXXXXX
XXXXXXXXX

2.3 Syntax and semanticsđź”—â„ą

We’re being very fast-and-loose with what we mean by the program above, so let’s try to nail down some meanings. Each line of the program has a semicolon at the end, and describes the output of several rows of the line drawing. Let’s look at two of the lines in the example:

Then each line consists of a repeat number, followed by pairs of (number, character) chunks. We will assume here that the intent of the lowercased character b is to represent the printing of a 1-character whitespace " ", and for other uppercase letters to represent the printing of themselves.

Once we have a better idea of the pieces of each line, we have a better chance to capture that meaning in a formal notation. Once we have each instruction in a structured format, we should be able to interpret it with a straighforward case analysis.

Here is a first pass at expressing the structure of these line-drawing programs.

2.4 Parsing the concrete syntaxđź”—â„ą

"simple-line-drawing.rkt"

#lang ragg
drawing: rows*
rows: repeat chunk+ ";"
repeat: INTEGER
chunk: INTEGER STRING

Syntax and terminology describes ragg’s syntax in more detail.

We write a ragg program as an extended BNF grammar, where patterns can be:

The result of a ragg program is a module with a parse function that can parse tokens and produce a syntax object as a result.

Let’s exercise this function:

> (require ragg/support)
> (require "simple-line-drawing.rkt")
> (define stx (parse (list (token 'INTEGER 6) (token 'INTEGER 2) (token 'STRING " ") (token 'INTEGER 3) (token 'STRING "X") ";")))
> (syntax->datum stx)
'(drawing (rows (repeat 6) (chunk 2 " ") (chunk 3 "X") ";"))

Tokens can either be: plain strings, symbols, or instances produced by thetoken function. (Plus a few more special cases, one in which we’ll describe in a moment.)

Preferably, we want to attach each token with auxiliary source location information. The more source location we can provide, the better, as the syntax objects produced by parse will incorporate them.

Let’s write a helper function, a lexer, to help us construct tokens more easily. The Racket standard library comes with a module calledparser-tools/lex which can help us write a position-sensitive tokenizer:

There are a few things to note from this lexer example:

2.5 From parsing to interpretationđź”—â„ą

We now have a parser for programs written in this simple-line-drawing language. Our parser will give us back syntax objects:

> (syntax->datum parsed-program)
'(drawing (rows (repeat 3) (chunk 9 "X") ";") (rows (repeat 6) (chunk 3 " ") (chunk 3 "X") (chunk 3 " ") ";") (rows (repeat 3) (chunk 9 "X") ";"))

Moreover, we know that these syntax objects have a regular, predictable structure. Their structure follows the grammar, so we know we’ll be looking at values of the form:

(drawing (rows (repeat )
(chunk ) ... ";")
...)

where drawing, rows, repeat, and chunkshould be treated literally, and everything else will be numbers or strings.

Still, these syntax object values are just inert structures. How do we interpret them, and make them print? We did claim at the beginning of this section that these syntax objects should be fairly easy to case-analyze and interpret, so let’s do it.

This is a very quick-and-dirty treatment of syntax-parse. See the syntax/parse documentation for a gentler guide to its features.

Racket provides a special form called syntax-parse in thesyntax/parse library. syntax-parse lets us do a structural case-analysis on syntax objects: we provide it a set of patterns to parse and actions to perform when those patterns match.

As a simple example, we can write a function that looks at a syntax object and says #t if it’s the literal yes, and #f otherwise:

> (require syntax/parse)
; yes-syntax-object?: syntax-object -> boolean
; Returns true if the syntax-object is yes.
> (yes-syntax-object? #'yes)
#t
> (yes-syntax-object? #'nooooooooooo)
#f

Here, we use ~literal to let syntax-parse know thatyes should show up literally in the syntax object. The patterns can also have some structure to them, such as:

which matches on syntax objects that begin, literally, with drawing, followed by any number of rows (which are syntax objects too).

Now that we know a little bit more about syntax-parse, we can use it to do a case analysis on the syntax objects that our parse function gives us. We start by defining a function on syntax objects of the form (drawing rows-stx ...).

When we encounter a syntax object with (drawing rows-stx ...), then interpret-rows each rows-stx.

Let’s define interpret-rows now:

For a rows, we extract out the repeat-number out of the syntax object and use it as the range of the for loop. The inner loop walks across each chunk-stx and calls interpret-chunk on it.

Finally, we need to write a definition for interpret-chunk. We want it to extract out the chunk-size and chunk-string portions, and print to standard output:

With these definitions in hand, now we can pass it syntax objects that we construct directly by hand:

> (interpret-chunk #'(chunk 3 "X"))
XXX
> (interpret-drawing #'(drawing (rows (repeat 5) (chunk 3 "X") ";")))

or we can pass it the result generated by our parser:

> (interpret-drawing parsed-program)
XXXXXXXXXXXXXXXXXXXXXXXXXXX XXX XXX XXX XXX XXX XXX XXXXXXXXXXXXXXXXXXXXXXXXXXX

And now we’ve got an interpreter!

2.6 From interpretation to compilationđź”—â„ą

(Just as a warning: the following material is slightly more advanced, but shows how writing a compiler for the line-drawing language reuses the ideas for the interpreter.)

Wouldn’t it be nice to be able to write something like:

3 9 X;
6 3 b 3 X 3 b;
3 9 X;

and have Racket automatically compile this down to something like this?

Well, of course it won’t work: we don’t have a #lang line.

Let’s add one.

"letter-i.rkt"

#lang ragg/examples/simple-line-drawing
3 9 X;
6 3 b 3 X 3 b;
3 9 X;

Now "letter-i.rkt" is a program.

How does this work? From the previous sections, we’ve seen how to take the contents of a file and interpret it. What we want to do now is teach Racket how to compile programs labeled with this #lang line. We’ll do two things:

The second part, the writing of the transformation rules, will look very similar to the definitions we wrote for the interpreter, but the transformation will happen at compile-time. (We could just resort to simply calling into the interpreter we just wrote up, but this section is meant to show that compilation is also viable.)

We do the first part by defining a module reader: amodule reader tells Racket how to parse and compile a file. Whenever Racket sees a#lang , it looks for a corresponding module reader in"/lang/reader".

Here’s the definition for"ragg/examples/simple-line-drawing/lang/reader.rkt":

"ragg/examples/simple-line-drawing/lang/reader.rkt"

#lang s-exp syntax/module-reader
ragg/examples/simple-line-drawing/semantics
#:read my-read
#:read-syntax my-read-syntax
#:whole-body-readers? #t
(require ragg/examples/simple-line-drawing/lexer
ragg/examples/simple-line-drawing/grammar)
(define (my-read in)
(syntax->datum (my-read-syntax #f in)))
(define (my-read-syntax src ip)
(list (parse src (tokenize ip))))

We use a helper module syntax/module-reader, which provides utilities for creating a module reader. It uses the lexer andragg-generated parser we defined earlier (saved intolexer.rktandgrammar.rktmodules), and also tells Racket that it should compile the forms in the syntax object using a module called "semantics.rkt".

For a systematic treatment on capturing the semantics of a language, see Programming Languages: Application and Interpretation.

Let’s look into "semantics.rkt" and see what’s involved in compilation:

The semantics hold definitions for compile-drawing,compile-rows, and compile-chunk, similar to what we had for interpretation with interpret-drawing, interpret-rows, andinterpret-chunk. However, compilation is not the same as interpretation: each definition does not immediately execute the act of drawing, but rather returns a syntax object whose evaluation will do the actual work.

There are a few things to note:

Altogether, ragg’s intent is to be a parser generator generator for Racket that’s easy and fun to use. It’s meant to fit naturally with the other tools in the Racket language toolchain. Hopefully, it will reduce the friction in making new languages with alternative concrete syntaxes.

The rest of this document describes the ragg language and the parsers it generates.

3 The languageđź”—â„ą

3.1 Syntax and terminologyđź”—â„ą

A program in the ragg language consists of the language line#lang ragg, followed by a collection of rules andline comments.

A rule is a sequence consisting of: a rule identifier, a colon":", and a pattern.

A rule identifier is an identifier that is not in upper case.

A token identifier is an identifier that is in upper case.

An identifier is a character sequence of letters, numbers, and characters in "-.!$%&/<=>?^_~@". It must not contain* or +, as those characters are used to denote quantification.

A pattern is one of the following:

A line comment begins with either # or ; and continues till the end of the line.

For example, in the following program:

#lang ragg
;; A parser for a silly language
sentence: verb optional-adjective object
verb: greeting
optional-adjective: ["happy" | "frumpy"]
greeting: "hello" | "hola" "aloha"
object: "world" | WORLD

the elements sentence, verb, greeting, and object are rule identifiers. The first rule, sentence: verb optional-adjective object, is a rule whose right side is an implicit pattern sequence of three sub-patterns. The uppercased WORLD is a token identifier. The fourth rule in the program associates greeting with a choice pattern.

More examples:

| ----------------------------------------------------------------- |
| equal: [zero one | one zero] ;; equal number of "0"s and "1"s. |
| zero: "0" equal | equal "0" ;; has an extra "0" in it. |
| one: "1" equal | equal "1" ;; has an extra "1" in it. |

| ----------------------------------------- | ------ |
| json: number | string | |
| | array | object |
| number: NUMBER | |
| string: STRING | |
| array: "[" [json ("," json)*] "]" | |
| object: "{" [kvpair ("," kvpair)*] "}" | |
| kvpair: ID ":" json | |

The ragg github source repositoryincludesseveral more examples.

3.2 Syntax errorsđź”—â„ą

Besides the basic syntax errors that can occur with a malformed grammar, there are a few other classes of situations that #lang ragg will consider as syntax errors.

ragg will raise a syntax error if the grammar:

Otherwise, ragg should be fairly tolerant and permit even ambiguous grammars.

3.3 Semanticsđź”—â„ą

A program written in #lang ragg produces a module that provides a few bindings. The most important of these is parse:

Parses the sequence of tokens according to the rules in the grammar, using the first rule as the start production. The parse must completely consumetoken-source.

The token source can either be a sequence, or a 0-arity function that produces tokens.

A token in ragg can be any of the following values:

A token whose type is either void or 'EOF terminates the source.

If parse succeeds, it will return a structured syntax object. The structure of the syntax object follows the overall structure of the rules in the BNF. For each rule r and its associated pattern p,parse generates a syntax object #'(r p-value) wherep-value’s structure follows a case analysis on p:

Consequently, it’s only the presence of rule identifiers in a rule’s pattern that informs the parser to introduces nested structure into the syntax object.

If the grammar has ambiguity, ragg will choose and return a parse, though it does not guarantee which one it chooses.

If the parse cannot be performed successfully, or if a token in thetoken-source uses a type that isn’t mentioned in the grammar, thenparse raises an instance of exn:fail:parsing.

It’s often convenient to extract a parser for other non-terminal rules in the grammar, and not just for the first rule. A ragg-generated module also provides a form called make-rule-parser to extract a parser for the other non-terminals:

Constructs a parser for the name of one of the non-terminals in the grammar.

For example, given the ragg program"simple-arithmetic-grammar.rkt":

"simple-arithmetic-grammar.rkt"

#lang ragg
expr : term ('+' term)*
term : factor ('*' factor)*
factor : INT

the following interaction shows how to extract a parser for terms.

> (require "simple-arithmetic-grammar.rkt")
> (define term-parse (make-rule-parser term))
> (syntax->datum (parse tokens))
'(expr (term (factor 3) "*" (factor 4)))
> (syntax->datum (term-parse tokens))
'(term (factor 3) "*" (factor 4))
> (define another-token-sequence (list (token 'INT 1) "+" (token 'INT 2) "*" (token 'INT 3)))
> (syntax->datum (parse another-token-sequence))
'(expr (term (factor 1)) "+" (term (factor 2) "*" (factor 3)))
; Note that term-parse will break on another-token-sequence
; as it does not know what to do with the "+"
> (term-parse another-token-sequence)
Encountered parsing error near token '+ ("+") while parsing
#f [line=#f, column=#f, offset=#f]

Finally, the module provides a set of all the used token types in the grammar in all-token-types:

A set of all the token types used in a grammar.

For example:

> (require "simple-arithmetic-grammar.rkt")
> all-token-types
(set '* '+ 'INT)

4 Support APIđź”—â„ą

The ragg/support module provides functions to interact withragg programs. The most useful is the token function, which produces tokens to be parsed.

Creates instances of token-structs.

The syntax objects produced by a parse will inject the value val in place of the token name in the grammar.

If #:skip? is true, then the parser will skip over it during a parse.

The token structure type.

Rather than directly using the token-struct constructor, please use the helper function token to construct instances.

The exception raised when parsing fails.

exn:fail:parsing implements Racket’s prop:exn:srclocproperty, so if this exception reaches DrRacket’s default error handler, DrRacket should highlight the offending locations in the source.

5 Caveats and things to dođź”—â„ą

Here are a few caveats and future aims for ragg.

6 Miscellaneous and thanksđź”—â„ą

Thanks to Matthew Flatt for pointing me to cfg-parser from thecfg-parser library. Joe Politz gave me good advice and feedback. Also, he suggested the name “ragg”. Other alternatives I’d been considering were “autogrammar” or “chompy”. Thankfully, he is a better Namer than me. Daniel Patterson provided feedback that led tomake-rule-parser. Robby Findler and Guillaume Marceau provided steadfast suggestions to look into other parsing frameworks likeSDF andSableCC. Special thanks to Shriram Krishnamurthi, who convinced me that other people might find this package useful.