18 May 2013 20:56
Re: [ANN] ParseKit
Juancarlo Añez <apalala <at> gmail.com>
2013-05-18 18:56:30 GMT
2013-05-18 18:56:30 GMT
Ter,
The above is not possible, so one winds up with many rules like this:
Cheers,
Just a note on why I reluctantly had to opt out of ANTLR and seek PEG for my ongoing projects:
- Software AG Natural allows relevant language words to be used as identifiers. In fact, words are keywords or not depending on context.
Hi! What specific weakness in v3 causes this to be invalid? Is it just a predicate?
In ANTLR, one you define a word in the tokenizer, it always behaves as a keyword.
Alternatives like using:
: WORD {WORD.text == "MY_KEYWORD"}?
Make the language too ambiguous in ANTLR terms.
In PEG, you can do exactly the above, with a semantic action, because there's no ambiguity by PEG's definition.
- Natural also makes many of the relevant language words optional. The predictive ANTLR v3 grammar was inundated with lookaheads to resolve the ambiguity caused by that.
I see. Wasn't the PEG during the same thing? Note that if you set k=1 option then ANTLR v3 more or less have to backtrack constantly, leading it to behave exactly like a PEG.
It's not just the backtracking which distinguishes PEG prom predicting parser. It is the ability to have the parser behave differently depending on context.
PEG easily allows for embedded languages within one grammar (the case for Natural FORMAT statement, and COBOL's PICTURE statement). In a predictive parser, one language would interfere with the parsing of the other, unless you do heavy magic with the lexer to separate the sub-languages, and build a different grammar for each sub-language.
- Natural has exactly one statement that is not led by a keyword: assignment. And most statements lack terminators, so ANTLR would consume the left hand side of an assignment unless a negative lookahead was added to most rules.
- COBOL programs embed subprograms in CICS and SQL, and those must be parsed to have a complete translation.
this sounds like you need a scannerless parser. v4 would handle this no problem, if you're willing to treat each character as a token.
I actually need a scanner that takes care of white-space, comments, case-insensitivity, and word boundaries. The grammars would be very convoluted without one, and we're talking about grammars which are several thousand lines, even when they don't contain semantic actions.
It was time-consuming and difficult to deal with the above with a mandated tokenizer, and with v3's allergy to apparent ambiguity (I couldn't try v4 because the Python runtime wasn't ready).I know. They are both stupid programming languages... but they are real, and there are millions of lines of source code written in them and still alive. PEG deals with both easily. You may be right in that cases like these are rare, though.I'm not clear on how a PEG has an advantage here; could you be more specific?
This is a simple rule from the Natural grammar in ANTLR:
lvalue:indexed((line_ref) => line_ref|(label_ref) => label_ref|() -> indexed);
I think you can imagine what the need to explicitely remove the ambiguity reported by ANTLR leads to in more complex statements. This is how the original rule looked like:
lvalue:indexed (line_ref|label_ref)?;
I don't know the ANTLR's internals, but I speculate that what happens is that all the optional stuff in Natural throws off the calculation of FOLLOW sets, and that forces you to provide the predictive sequences explicit.
This is what the rules in Grako's grammar syntax:
lvalue=indexed [line_ref|label_ref];
Another problem was that:
- Statements Natural have optional termination tokens,
- Many statements act on lists of expressions.
- There an assignment statement of the form: lvalue ':=' expression.
In PEG, you solve the problem caused by the above by introducing a negative lookeahead at the end of lists. In ANTLR that doesn't work because the parser is already committed to continue the list by the time it sees the negative lookahead and/or because negatives do consume input.
atom_list
:
(~(assignment) atom)+
;
The above is not possible, so one winds up with many rules like this:
end_transaction_params:(lvalue ':=')=> ()|(atom)=> atom end_transaction_params|();
In Grako, the above would be:
end_transaction_params
=
{atom !(lvalue ':=')}
;
Cheers,
--
Juancarlo Añez
Juancarlo Añez
_______________________________________________ PEG mailing list PEG <at> lists.csail.mit.edu https://lists.csail.mit.edu/mailman/listinfo/peg
RSS Feed