Issues With The Lojban Formal Grammar
HISTORICAL INTEREST ONLY
This directory records the initial Lojban PEG parser work, but is only really of historical interest at this point. To that end, here's a tarball of all of this stuff in its complete gory detail for investigation.
As of Jan 2025, the best place for a maintained Lojban PEG grammar is https://github.com/lojban/ilmentufa.
It is my opinion, and that of others in the Lojban community, that the available grammars for Lojban are not sufficiently formalized. Please note that this page makes essentially no reference to the YACC grammar because of its unreadability; the YACC and BNF grammars are assumed to be equivalent.
Points we are concerned about:
- The BNF used is non-standard. In places it is very non-standard. This hinders attempts to formally analyze it.
- The morphology is not formalized at all. All of the terminal productions are hand-waved and, in practice, handled by code which thus far has been separately written by everyone to produce a program that parses Lojban.
- The elidable terminators are not formalized at all. The elidable terminators, again, have been handled by code which thus far has been separately written by everyone to produce a program that parses Lojban.
I, for one, feel that it is extremely misleading to say that Lojban is formally parseable while this state of affairs exists, so I'm trying to fix it.
Project Files
In order of relevance; older stuff is farther down.
- camxes, the Rats! based PEG parser itself, as a Java JAR file. Please do not ask me for help on running it; I'm very bad with Java.
- The by-hand modified PEG grammar for camxes, which is mostly my work. See the "Changes Made To The PEG Grammar" section for what I've done to it over the automatically generated version. The current version of the morphology to go with this, which is mostly xorxes' work, can be found at the BPFK morphology page. Note that the morphology does not have to be in a separate file and, in fact, the two files are merged before processing, but all of the things in the morphology file are done first to make the grammar itself easier to read. There's a bit of interface code between the main grammar and the morphology; some of it is in the main grammar itself, and some is in the morphology header file.
- An old version of the by-hand modified PEG morphology
- The folder for the Rats! parser generated from the PEG above. Please do not ask me how to make it run; I am very, very bad with Java. The Howto.cook file explains how to build it; anything not referenced there is probably irrelevant. The command line I use to run it from that directory is "/usr/local/java/bin/java -Xss64m -jar lojban_peg_parser.jar", with the sentence to parse passed on standard input. I suggest piping it to a pager such as "less" or "more" as it is currently set in debug mode and hence produces a lot of output. Help making the Java portions less stupid would be appreciated!
- I have recently updated the PEG to Rats! converter to have it produce a Rats! grammar that automatically generates a (very ugly) parse tree. The Perl code is even uglier than the parse tree. It doesn't deal well with anything that could be considered lexing; all such productions should either have "NORATS" at the start of the line or be after the line "; --- NORATS ---" in the grammar.
- A set of test sentences for testing changes to the PEG grammar (by comparison to the official parser and jbofihe). Contains (in order) a bunch of test sentences I used for testing pre-processor tokens, all the example sentences from the refgram, all of Alice (one paragraph per line) and all the IRC logs as of the end of March 2004. Total is 34 thousand lines. Lines marked with "-- GOOD" are those that I have carefully examined and determined to be valid Lojban; those marked with "-- BAD" are invalid Lojban. These two are generally only used when one or more parsers get it wrong.
- The automatically generated PEG version. This version is programmatically generated from the expanded ABNF version.
- The perl program that did the BNF to ABNF conversion.
- An expanded ABNF version. The only differences between this and the file above are the addition of productions for the various selma'o and a few example brivla and cmene.
- An ABNF version of the grammar file. All changes were made programmatically, and should make no actual difference.
- The perl program that did the ABNF to PEG conversion.
- The original bnf.300 grammar file.
- A discussion between me and John Cowan about the elidable terminators issue. As of 10 Feb 2004, no real conclusions were reached.
- A link to the ABNF standard, AKA RFC 2234. ABNF is very widely used in various RFC documents.
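For anyone who wants to pull the hand-checked lines out of the test-sentence file, here is a minimal Python sketch. It assumes the "-- GOOD" and "-- BAD" markers sit at the very end of a line; I have not confirmed that against the file, so treat the layout as an assumption. The function name classify_line is made up for this example, and the sample strings in the usage note are illustrative, not taken from the file.

```python
def classify_line(line):
    """Split one test line into (sentence, verdict).

    verdict is "GOOD", "BAD", or None when the line carries no marker.
    Assumption: the marker, if present, is the last thing on the line.
    """
    stripped = line.rstrip()
    for verdict in ("GOOD", "BAD"):
        marker = "-- " + verdict
        if stripped.endswith(marker):
            # Drop the marker and any spaces that preceded it.
            return stripped[: -len(marker)].rstrip(), verdict
    return stripped, None


if __name__ == "__main__":
    for sample in ["mi klama -- GOOD\n", "xxx -- BAD\n", "coi\n"]:
        print(classify_line(sample))
```

Unmarked lines come back with a verdict of None, which matches the description above: the markers are only present on lines that were examined by hand.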
My Approach
I've decided CFGs (and hence BNF in any form, let alone yacc) are simply not the right formalism for Lojban. See "Old Approach" at the bottom of this page for an explanation as to why. I thought we were stuck with them, though, and that the only other option for clean, elegant formalism was a full-on context-sensitive grammar (shudder).
I was wrong. I recently found Parsing Expression Grammars, which seem perfect for what Lojban needs. Note that that's "((Parsing Expression) Grammars)", not "(Parsing (Expression Grammars))". I am currently working on a PEG for Lojban. The initial version, which is just an automated conversion of the BNF with some morphological information added, already parses most of Lojban!
Please note that while I am not aiming for "bug-for-bug" compatibility with either the current official parser or the grammar definition in grammar.300, I am trying to make sure that differences only occur in areas covered by the preprocessor section of grammar.300, which was very much limited by what YACC was able to handle, rather than how the Reference Grammar said the language worked or what a listener or speaker would expect.
Methodology
I've tried to do as much as I can programmatically, because that makes it easier to convince people that what I produce is equivalent to the original grammar. The BNF is converted to ABNF, some very simple productions are added, and the result is converted to PEG.
Unfortunately, a fair amount of by-hand work is then required. For one thing, the PEG grammar requires writing productions to lexically break up the input. For another, PEG grammars are sensitive to the order of elements, preferring earlier options, and the BNF has several places where taking the earliest option that matches is guaranteed to fail later.
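To make the ordering problem concrete, here is a minimal Python sketch of PEG ordered choice over literal strings. It is only an illustration, not how Rats! or Pappy work internally; the helper name ordered_choice is hypothetical. The two FA lists are the original and re-ordered alternatives quoted in the change list further down this page.

```python
def ordered_choice(alternatives, text, pos=0):
    """Return the first alternative that matches text at pos, else None.

    This mimics the PEG "/" operator for plain string literals: once an
    earlier alternative succeeds, later ones are never tried.
    """
    for alt in alternatives:
        if text.startswith(alt, pos):
            return alt
    return None


# Alternatives in the order produced by the automatic conversion.
FA_ORIGINAL = ["fa", "fe", "fi", "fo", "fu", "fai", "fi'a"]
# Re-ordered so that longer words come before their own prefixes.
FA_REORDERED = ["fai", "fa", "fe", "fo", "fu", "fi'a", "fi"]

if __name__ == "__main__":
    for word in ["fai", "fi'a"]:
        print(word, ordered_choice(FA_ORIGINAL, word),
              ordered_choice(FA_REORDERED, word))
```

With the original order, "fai" is cut short at "fa" and "fi'a" at "fi", leaving the rest of the word to make the parse fail later; with the re-ordered list the whole word is matched.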
Actual testing is done by converting the PEG grammar to a form suitable for use by a PEG parser generator (I am aware of two: Pappy, which generates Haskell parsers, and Rats!, which generates Java parsers). I've been using Rats!, due to having problems with Pappy.
Improvements
- 'si' handling is quite different in the way it interacts with "lo'u" and "zo". Current rule: 'si' is ignored in "lo'u...le'u", but a string of 'si' *afterwards* is honored one word per 'si', including 'si' and 'zo', because neither has any special power in the lo'u clause. As an interesting side effect, "lo'u mi le'u si lo'u mi le'u" works. It means "lo'u mi lo'u mi le'u".
- BU handling actually works. It doesn't in the official parser as of 29 Mar 2004 ("bu bu broda" passes, and "ky bu bu broda" fails).
- '.y.' is completely ignored; the only way it can interact with the rest of the grammar is as "zo .y." or as ".y. bu". It can actually have more than one y in a row (i.e. ".yyyyy."), because I at least often use it that way on IRC.
- Multiple BAhE in a row are allowed. It is assumed this will be used for emphasis. Note that the current official parser also accepts this, even though it should not according to grammar.300.
- "su" can be backed out of with "si", allowing a speaker to save themselves from a potentially crushing mistake.
- "!" and "?" are treated as white space. Probably other things should be added to that list (q and w?).
- Groups of "si" and everything up to a "sa" are both erased at the beginning of a string. This may or may not be justifiable according to grammar.300; no-one's really sure. This means that sentences like "si si si" and "sa" are legal, as well as sentences like "le broda sa .i mi cusku".
- SA and SI now interact in a more obvious fashion. For example, "le broda brode brodi .y. sa le si la broda brode brodi" is equivalent to "la broda brode brodi". Just using "sa" would not work because "le" and "la" are in different selma'o.
- Interactions between ZOI, SI, and SA are much richer. The goal is to achieve something more like what a user would 'expect', given the basic definitions of those words. Details:
- The first SI after the close of a ZOI clause erases the closing delimiter, allowing one to add to the protected text. "zoi gy weeble gy si bob gy" is equivalent to "zoi gy weeble bob gy".
- Two consecutive SI after the close of a ZOI erase the non-Lojban text itself; while it would theoretically be possible to have consecutive SI after the close of a ZOI erase individual words inside the ZOI protected text, this is a bad idea because (for example) breaking up a bird call into words makes very little sense. So, for example, "zoi gy da da da gy si si de gy" is equivalent to "zoi gy de gy".
- The interaction of these two features leads to a somewhat strange, but very minor, side effect: It is impossible to add to the protected text inside a zoi clause (i.e. using a single SI after the closing delimiter) any text that starts with "si" (unless it then goes on to be something that looks like a Lojban brivla or cmene), because it will be interpreted as two SI, causing erasure of the entire protected text.
- Three consecutive SI after the close of a ZOI erase everything but the ZOI itself, so that, for example, "zoi gy da da da gy si si si dy weeble dy" is equivalent to "zoi dy weeble dy".
- Four consecutive SI after the close of a ZOI erase the entire ZOI clause, including the ZOI.
- Similarly, after ZO+word, a single SI deletes the word, causing the next word to be caught by the ZO, but two SI delete both the word and ZO.
- Because of the SA and SI interaction enhancements, the fast way to delete an accidental ZOI is to close the delimiter and say "sa zoi si", and then continue on. For example, "broda zoi gy da da da da gy sa zoi si da" is equivalent to "broda da".
- Multiple zei are handled in a different order; historically, "broda zei zei broda" was "(broda zei zei) broda" and "zei zei broda" was invalid. In my parser, it's "broda (zei zei broda)", and "zei zei broda" is "zei type-of broda". This was accidental at first, but it was pointed out that with the old way it was essentially impossible to say "zei type-of lujvo", which this fixes.
- None of si, sa, su, y, or zei are allowed as zoi delimiters, since delimiters are not scarce so it doesn't make sense to block the more useful interpretation.
- Multiple sa in a row delete back to further previous instances of that selma'o. For example, "le le broda cu brode sa sa le brodi" is the same as "le brodi".
- Y is ignored anywhere except in front of BU.
- As a very special case, ZOI SA clauses can accept arbitrary strings, to handle things like "zoi foo booz foo co si si WEEB! foo dysa zoi bar baz bar" (which, in case you're wondering, is equivalent to "zoi bar baz bar").
- Allows things like "byfy doi mark cu broda", whereas before a boi would have been required after "byfy".
- Allows "free" in many more places.
- Allows things like ".i fi'o broda bo mi klama".
- Allows "lo broda joi lo broda" without a ku before the joi.
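As a point of reference for the erasure rules in this list, here is a minimal Python sketch of the plainest case only: each "si" erases the one word before it, and a "si" with nothing left to erase is simply dropped (which is what makes "si si si" legal). This is a deliberately naive word-level model under my own simplifying assumptions; it does not handle zo, lo'u, zoi, sa, su or bu at all, and it is not how the PEG grammar implements erasure. The name apply_si and the sample sentences are illustrative.

```python
def apply_si(text):
    """Apply plain one-word 'si' erasure to a whitespace-separated string.

    Assumption: no quoting words (zo, lo'u, zoi) and no sa/su are present;
    those interact with 'si' in ways this sketch ignores.
    """
    kept = []
    for word in text.split():
        if word == "si":
            # Erase the previous word, if there is one to erase.
            if kept:
                kept.pop()
        else:
            kept.append(word)
    return " ".join(kept)


if __name__ == "__main__":
    print(apply_si("mi klama si cusku"))
    print(apply_si("le si la broda"))
    print(repr(apply_si("si si si")))
```

Everything interesting in the list above (ZOI delimiters, ZO, SA) is exactly the part this model leaves out, which is why those cases needed special treatment in the grammar.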
Limitations
- Error reporting is essentially nonexistent. This may not be fixable.
By-Hand Changes Made To The PEG Grammar
This section enumerates the changes that were made to the PEG grammar starting from the automatically generated version.
If an entry looks like "* [number][letter] -- [stuff]", the number and letter are a reference to a section of the pre-processing guide in grammar.300. It means that that section is intended to implement (or help implement) that rule.
- Fixing up of things like (NAI+)? into NAI*, and removing extraneous (...).
- Change of = to <-
- 3 -- Fixed the 'text' production to ignore everything after fa'o
- Left-factored rp-expression (this is a syntactic change only). NOTE: I'm not completely certain I did this correctly, so please take a look if you're into this sort of thing. The old form:
rp-expression <- rp-operand rp-operand operator
rp-operand <- operand / rp-expression
New form:
rp-expression <- (operand / operand rp-operand operator) rp-operand operator
rp-operand <- operand / rp-expression
- Re-ordered some selma'o productions due to how PEGs work. For example:
FA <- "fa" / "fe" / "fi" / "fo" / "fu" / "fai" / "fi'a"
won't work because the 'fa' will match first, even if the word is actually 'fai'. Same with 'fi' and "fi'a". So it was re-ordered to:
FA <- "fai" / "fa" / "fe" / "fo" / "fu" / "fi'a" / "fi"
- Added productions 'Spacing' and 'Spaces', for handling whitespace.
- 4c -- Added 'post-cmavo', which goes after every cmavo string in every selma'o. post-cmavo only accepts a string if it is not followed by a member of BU. It also requires the string to be followed by 'post-cmavo-spacing', which was also added. post-cmavo-spacing allows an optional trailing '.' and either some spaces or another cmavo. This is probably an approximation, and will likely need to be reviewed when stricter morphology is included.
- Added some basic morphology constructions, as follows. Please note that these are preliminary and known to be imperfect. For example, "la fo''''o" is perfectly acceptable to these preliminary rules.
- consonant and vowel
- other-letter, which is ['y,.]
- lojban-letter, which is any of the above
- cmene-letter, which is lojban-letter plus upper-case versions
- CMAVO (which contained just a list of all selma'o) was moved to 'known-cmavo'. CMAVO became either known-cmavo or a consonant or '.', followed by a vowel, followed by any number of single-quote vowel pairs or vowels or both, followed by cmavo-spacing.
- CMENE is now an optional ".", one or more cmene-letters, a consonant, and spacing.
- BRIVLA is now basically a tester for consonant in the first 5 characters and ending with a vowel.
- 4e -- Added (UI NAI?)+ to the end of 'spaces'. This allows any word, basically, to be followed by any number of UI or UI-NAI pairs.
- 2a, not working -- removed the ZOI production from sumti-6, as this can't be made to work without a pre-processor.
- 2b -- In the ABNF version I changed 'any-word' to be a brivla or a cmene or a cmavo. This, along with the already extant "ZO any-word" rule, handles zo.
- 2c -- added lohu-tail to handle lo'u...[first le'u]
- 2e -- Added 'si-clause' to 'spaces', which takes any nesting "word si" pairs. Also tweaked ZO and LOhU so that they refuse to process SI in this way. This required making a copy of 'any-word' that wouldn't handle SI clauses, and a fair bit of tweaking of LOhU. Also added si-clause as an option to the beginning of text. 'si' seems to be working with lo'u. -- incomplete; "zo si si mi".
- 4c -- Re-ordered sumti-6 to start with ZO, LOhU, and LU productions, in that order. This fixed BU interaction with LOhU.
- 4c -- Re-ordered fragment to put 'terms' out in front, to fix ".abu" (and probably other BU problems)
- 4c -- Re-ordered sumti-6 to put BU just before LU.
- 2b,2e -- Added a SI clause to zo, allowing things like "zo si .y. si fi" to do the thing a human would expect.
- 4e -- Added Y to all the spaces functions, so that the example above *actually* works. Created "absorb-indicators" to do this. Scattered "Y*" liberally throughout the grammar, so it's ignored basically everywhere.
- 4c -- Made "indicator" not work before BU. Allowed Y without BU at the beginning of text as a free token.
- Added '.' to the spaces functions, so it's treated just like ' '. Much easier. Removed the leading '.' from relevant cmavo. This necessitated changes to the CMAVO and CMENE productions.
- 4e -- Added a special case to allow "zo y" and "y bu" to work ("y" is ignored everywhere else).
- 4e -- added "NAI CAI?", DAhO, FUhO and FUhE to absorb-indicators.
- 4a,others? -- Added a second option to 'sentence', which contains only 'bridi-tail', so ZEI will work properly in cases where the first word could be a sumti.
- 4a -- moved ZEI productions to the front of tanru-unit-2.
- 4b -- Added 'pre-cmavo' to all selma'o except BAhE, SI and BU. Put BAhE in pre-cmavo. Added a special case for "BAhE BU".
- 2g -- Added "su-clause" to the beginning of text, to handle starting SU clauses.
- 2g -- Added "su-clause" after NIhO and TUhE; LU and TO contain 'text' already.
- 2f -- Added "[selma'o]-sa-clause" to *every* selma'o, along with "[selmaho]-no-SA-handling", and any-word-no-SA-handling and friends.
- Reordered indicators & free in text to have "indicators free+" be first.
- Added 'text-1' and 'paragraphs?' to text-1 to match the YACC grammar (bug in the BNF).
- Added a second clause to statement-2 and statement-3 so that sentences with statement clauses would be preferred.
- Fixed a parenthesis error in sumti-6.
- Made the morphology a bit more sane.
- Reordered space-interval to make the longest option come first.
- Added an option to text without the CMENE eater, so that "bab zei bab", for example, works at the start of text, instead of having the first 'bab' eaten by 'text' itself.
- More morphology fixes.
- Fixed up reverse polish notation again; seems to actually work now.
- Re-ordered vocative to handle "coi doi".
- Re-ordered 'text' massively to have joiks not necessarily get eaten. Also made the 'paragraphs' in text-1 not optional, but made text-1 optional in 'text' (which was the same thing).
- Added "!gek" to "term" to allow things like "mi pu gi [stuff] gi [stuff]" to not try to treat the second term as the start of a tensed sumti.
- Added "!(stag? BO) !(stag? KE)" to bridi-tail-1 to have it not eat giheks at every possible opportunity.
- Added "!MOI" to quantifier to allow "pamoi" and such to work.
- Fixed BRIVLA to not end in 'y' and CMENE to not stop just before a 'y'.
- Added a check to not match cmavo if they are *immediately* followed by a string of non-spaces ending in a consonant.
- Explained to the two occurrences of lerfu-string that aren't followed by MOI that if they are followed by MOI they shouldn't match.
- Fixed a translation bug; at some point the optionality got dropped from the first "NUhU free*" in termset.
- Fixed a bug in text-1: instead of allowing ijek [text-1] xor paragraphs, it required both.
- Allowed ! and ? as spaces.
- Prioritized sumti-tail to not have a quantifier, so "le pa roi broda" would work.
- Added su-clause to NIhO cases that didn't have it.
- Added a special case to allow 'su' to be the first word in a text block.
- Expanded text and text-1 to more accurately match grammar.300, and to have better preferential behaviour.
- Re-ordered simple-tense-modal to prefer ((time space? / space time?) CAhA).
- Tried to clean up the morphology, once again, by moving morphology checks to before words.
- Minor, cosmetic changes to make things work better with peg2rats.pl
- Implementation of 'si' and 'sa' at the beginning of strings.
- Fixed zo so that zo + any-word is itself a possible outcome of any-word, so that "zo irk zei broda" works, for example.
- ZOI handling added using semantic predicates and parser actions (and a few changes to peg2rats.pl).
- Added EOF and not cases to "text". There might be a better way of handling the 12 (!) productions that "text" now has, but I don't know what it is. Broke out parts of "text" as sub-productions as part of this.
- Implemented multiple "sa" handling, such that each extra sa takes things back to an early instance of the following selma'o. As a side effect, there can be any number of sa at the beginning of a string.
- Cleaned up space handling a bit.
- Re-ordered "fragment" to maximally prefer prenexes.
- Various tweaking of magic word handling. BU handling is now in line with BPFK decisions as of 16 Jun 2004.
- Various re-ordering, re-naming and addition of tags to help with automated implementation of various parser features.
- Fixed a bug where "lu na jo li'u" would be read as "lu fragment( na ) [elided li'u] joik-jek [error]". This only affected "NA JA" inside a quoting structure that referred back to "text".
- Fixed a bug where "la cmen zei [anything]" would fail due to "la cmen" being preferred.
- Moved indicators around so they would be visible without -b.
- Fixed a bug with ba'e and indicators.
- Change (BOI free*)? in sumti-6 to BOI? free*. This allows things like "byfy doi mark cu broda", whereas before a boi would have been required after "byfy". In the case of conflict, "MAI" wins.
- Added "!BU" after BRIVLA, so the bare sentence "slaka bu" would work.
- Stopped ZEI from absorbing indicators.
- Fixed a bug in lerfu-string + MAI handling where "le me bypyfyky moi" would break into "(le me bypyfy) (ky moi)", rather than lerfu-string being greedy (thus causing failure).
- Change (BOI free*)? to BOI? free* in the other few places it occurred.
- Fixed FUhE to not absorb indicators (doesn't work too well otherwise).
- Added indicator absorption to bu-clause.
- In tanru-unit-2, changed "ME free* sumti (MEhU free*)? (MOI free*)?" to "ME free* (sumti / lerfu-string) (MEhU free*)? (MOI free*)?" because that seems to have been the intention, and it was broken by me scattering !MOI in various lerfu-string situations.
- Changed all things like (TERM free*)? to TERM? free*. This should allow free in many more places. It seems to have made no changes WRT the test suite, but is still a somewhat experimental change. This change took place in RCS version 1.26. Note that it has now been thoroughly tested by comparing the tree output of the two versions for everything in test-sentences.txt; no changes were found except for in sentences exploiting the new stag functionality (see below).
- Also made stag equivalent to tag, which allows things like ".i fi'o broda bo mi klama". This is rather less experimental; we understand pretty well what it does. Same change number.
- Fixed up a very subtle interaction bug between normal ZOI and ZOI in a si clause.
- Separated morphology into a separate file, lojban_morphology.peg.
- Many changes to cause Magic Words handling to conform to the current BPFK proposal (as of 30 Nov 2004).
- Added !selbri-1 after tag in term to stop "mi bai klama" from being read as "mi bai ku klama". -- NOT SUFFICIENT; breaks "mi broda lo nu brodo da ca brode".
- Discovered unnecessary complexity in statement-2; streamlined to match the BNF after no reason for it could be found.
- Turned all (ek / joik) into joik-ek.
- Broke mex-forethought and fore-operands out of mex-2 for clearer parse trees.
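Several entries in the list above ("post-cmavo", "!MOI", "!BU", "!gek") rely on the PEG not-predicate: a check that looks at what comes next, consumes none of it, and makes the match fail if the forbidden thing is there. Here is a minimal Python sketch of that idea at the word level. It is a simplification of my own, not the camxes grammar: the real productions work on characters and spacing, the helper name match_unless_followed_by is hypothetical, and treating selma'o BU as just the single word "bu" is an assumption made for the example.

```python
def match_unless_followed_by(words, pos, expected, forbidden_next):
    """Match `expected` at words[pos] unless the next word is forbidden.

    Returns the position after the matched word on success, None on failure.
    The lookahead only inspects words[pos + 1]; it never consumes it, which
    is the defining property of the PEG "!" predicate.
    """
    if pos >= len(words) or words[pos] != expected:
        return None
    nxt = pos + 1
    if nxt < len(words) and words[nxt] in forbidden_next:
        return None
    return nxt


# Assumption for this sketch: selma'o BU modelled as the one word "bu".
BU = {"bu"}

if __name__ == "__main__":
    print(match_unless_followed_by(["ky", "broda"], 0, "ky", BU))
    print(match_unless_followed_by(["ky", "bu"], 0, "ky", BU))
```

In the first call "ky" is accepted as an ordinary cmavo; in the second it is refused, leaving "ky bu" free to be picked up by whatever production handles BU clauses.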
Old Approach
For a while I was trying to adjust the BNF grammar (after conversion to ABNF) to do The Right Thing with respect to elidable terminators, because that's the hard part. I have since come to the conclusion that elidable terminators can merely be made optional if longest-match disambiguation is used, but that puts us in the realm of specifying the parser again, which is what I was trying to avoid.
I still think that one could probably get BNF to do The Right Thing with respect to elidable terminators, but it would be Very, VERY hard. I would be surprised if it could be done without expanding the grammar by a factor of 20 or so. No, that's not an exaggeration or a joke.
Update 20 Sep 2005: I am no longer so sure that it's possible, but I don't have any good reason; just a change in my gut feeling. I also think a 20 times size increase is probably conservative.
To give you a sense of what I mean, consider fixing 'kei'. This requires having the grammar descending from a NU clause to eat all brivla it sees until the next kei. Because BNF is inherently ambiguous, forcing this requires that every place where two brivla could occur next to each other be re-written to only form two separate selbri when there is a kei between them, but only inside a NU clause. If this is possible in BNF/CFGs, and I'm not totally certain it is, it requires nearly doubling the size of the grammar because you have to have everything under 'subsentence' copied into a "[foo]_during_NU" form, or whatever.
When you're done with that, try another big elidable terminator, like 'ku'. This will require the same thing, but the ku additions to the grammar and the nu additions to the grammar must work nested, in either order. That's two more complete sets, not including the 'ku' or 'kei' sets. You now have a grammar on the order of four times the original size, and you've fixed only two elidable terminators.
Good luck; let me know when you're done.