Juancarlo Añez | 18 May 2013 20:56
Picon
Gravatar

Re: [ANN] ParseKit

Ter,


Just a note on why I reluctantly had to opt out of ANTLR and seek PEG for my ongoing projects:
  • Software AG Natural allows relevant language words to be used as identifiers. In fact, words are keywords or not depending on context.
Hi! What specific weakness in v3 causes this to be invalid? Is it just a predicate?

In ANTLR, one you define a word in the tokenizer, it always behaves as a keyword.

Alternatives like using:

: WORD {WORD.text == "MY_KEYWORD"}? 

Make the language too ambiguous in ANTLR terms.

In PEG, you can do exactly the above, with a semantic action, because there's no ambiguity by PEG's definition.
 
  • Natural also makes many of the relevant language words optional. The predictive ANTLR v3 grammar was inundated with lookaheads to resolve the ambiguity caused by that.
I see. Wasn't the PEG during the same thing? Note that if you set k=1 option then ANTLR v3 more or less have to backtrack constantly, leading it to behave exactly like a PEG.

It's not just the backtracking which distinguishes PEG prom predicting parser. It is the ability to have the parser behave differently depending on context. 

PEG easily allows for embedded languages within one grammar (the case for Natural FORMAT statement, and COBOL's PICTURE statement). In a predictive parser, one language would interfere with the parsing of the other, unless you do heavy magic with the lexer to separate the sub-languages, and build a different grammar for each sub-language.
 
  • Natural has exactly one statement that is not led by a keyword: assignment. And most statements lack terminators, so ANTLR would consume the left hand side of an assignment unless a negative lookahead was added to most rules.
  • COBOL programs embed subprograms in CICS and SQL, and those must be parsed to have a complete translation.

 this sounds like you need a scannerless parser. v4 would handle this no problem, if you're willing to treat each character as a token.

I actually need a scanner that takes care of white-space, comments, case-insensitivity, and word boundaries. The grammars would be very convoluted without one, and we're talking about grammars which are several thousand lines, even when they don't contain semantic actions.
 
 
It was time-consuming and difficult to deal with the above with a mandated tokenizer, and with v3's allergy to apparent ambiguity (I couldn't try v4 because the Python runtime wasn't ready).

I know. They are both stupid programming languages... but they are real, and there are millions of lines of source code written in them and still alive. PEG deals with both easily. You may be right in that cases like these are rare, though.

 I'm not clear on how a PEG has an advantage here; could you be more specific?

This is a simple rule from the Natural grammar in ANTLR:

lvalue
    :
    indexed
    (
        (line_ref) => line_ref
        |
        (label_ref) => label_ref
        |
        () -> indexed
    )
    ;

I think you can imagine what the need to explicitely remove the ambiguity reported by ANTLR leads to in more complex statements. This is how the original rule looked like:

lvalue
    :
    indexed (line_ref|label_ref)?
    ;

I don't know the ANTLR's internals, but I speculate that what happens is that all the optional stuff in Natural throws off the calculation of FOLLOW sets, and that forces you to provide the predictive sequences explicit.

This is what the rules in Grako's grammar syntax:

lvalue
    =
    indexed [line_ref|label_ref]
    ;

Another problem was that:
  1. Statements Natural have optional termination tokens,
  2. Many statements act on lists of expressions.
  3. There an assignment statement of the form: lvalue ':=' expression.
In PEG, you solve the problem caused by the above by introducing a negative lookeahead at the end of lists. In ANTLR that doesn't work because the parser is already committed to continue the list by the time it sees the negative lookahead and/or because negatives do consume input.

atom_list
    :
    (~(assignment) atom)+
    ;

The above is not possible, so one winds up with many rules like this:

end_transaction_params
    :
    (lvalue ':=')=> ()
    |
    (atom)=> atom end_transaction_params
    |
    ()
    ;

In Grako, the above would be:

end_transaction_params
    =
    {atom !(lvalue ':=')}
    ;

Cheers,

--
Juancarlo Añez
_______________________________________________
PEG mailing list
PEG <at> lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg
Francisco Mota | 6 May 2013 15:30
Picon
Gravatar

Incremental Parsing for PEG

Hi everyone. Long time follower but first time poster, here.

Is there an off-the-shelf available incremental parser for PEGs?

Incremental parsing is where a small change in the input string necessitates only a small amount of additional processing. The idea is that you parse the string once, and then adding or removing characters from the string in the middle causes only a small amount of reprocessing.

That is, you have some signature like:

parse : String -> Syntax
insert : Int * String * Syntax -> Syntax
remove : Int * Int * Syntax -> Syntax

where parse(s) is used for the initial parse, insert(i,s,t) will insert the string s at position i of t, and reparse, and remove(i,k,t) will remove k characters from position i of t.

The trivial algorithm is to just reparse the whole string. The downside is that insert and remove operation has complexity linear in the size of the whole string, not just the inserted part.

The more efficient way is to only modify the parts of the parse tree that are affected. Packrat parsing can be used to this effect. You keep the cache around after parsing the string the first time. When the string is updated, you invalidate only some sections of the cache, and you update the indices of the rest of the cache, and then you reparse using the same cache. The vast majority of the string will reparse using the cache, so the cost of this operation is more dependant on the depth of the syntax tree, rather than its length. (Top-level nodes will have been invalidated.)

So my question is: has anyone implemented this algorithm?

Take care,
Francisco Mota
_______________________________________________
PEG mailing list
PEG <at> lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg
Juancarlo Añez | 6 May 2013 12:51
Picon
Gravatar

Touchy-feely


On Mon, May 6, 2013 at 5:16 AM, James Pike <james <at> chilon.net> wrote:
If you mean that respectful disagreement and discussion is not
tolerated, then no, this mailing list is not touchy-feely.

However if you mean that outbursts and impolite/blunt statements are
tolerated, then I think no, these are not welcome on any mailing list
in spite of the fact that many notable open source figures display such
characteristics.

James,

Thanks. 

Yet, so far, this list has felt like a colosseum.

Maybe they are thus competitive at MIT?

Cheers,

--
Juancarlo Añez
_______________________________________________
PEG mailing list
PEG <at> lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg
Juancarlo Añez | 5 May 2013 16:57
Picon
Gravatar

Re: [ANN] ParseKit

Hello Todd,

Interesting!

Why did you include a tokenizer in a PEG parser?

Cheers,


On Thu, May 2, 2013 at 1:44 PM, Todd Ditchendorf <todd.ditchendorf <at> gmail.com> wrote:
Hello PEG Mailing List.

I'd like to announce the release of an updated version of my Objective-C framework, ParseKit.

ParseKit is an Objective-C library that converts PEG-style grammars to Packrat parsers in the form of Objective-C source code suitable for use on the iOS or Mac OS X platforms. 

ParseKit is heavily inspired by ANTLR, and ParseKit grammars have a very similar syntax.

Parsekit's home page is http://parsekit.com

I've written a tutorial that describes the new PEG/Packrat features: http://itod.github.io/ParseKitMiniMathExample/

The parsers produced by ParseKit are recursive descent, deterministic, packrat (memoizing), and backtracking as I believe would be expected for a PEG/Packrat tool. ParseKit parsers also have the Semantic Predicate feature copied directly from ANTLR.

ParseKit also includes an *extremely weak* version of ANTLR's LL(*) concept: Static grammar analysis will produce simple single-token prediction branching for any decision which is LL(1), but use full backtracking speculation for any non-LL(1) decision. However, ParseKit does not implement anything nearly as sophisticated as ANTLR's full LL(*)/DFA feature.

Thanks to Terence Parr and Bryan Ford for your contributions to this field. I have really enjoyed reading your books/papers and have learned a great deal from you in this process.

Todd Ditchendorf

_______________________________________________
PEG mailing list
PEG <at> lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg




--
Juancarlo Añez
_______________________________________________
PEG mailing list
PEG <at> lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg
Todd Ditchendorf | 2 May 2013 20:14
Picon
Gravatar

[ANN] ParseKit

Hello PEG Mailing List.

I'd like to announce the release of an updated version of my Objective-C framework, ParseKit.

ParseKit is an Objective-C library that converts PEG-style grammars to Packrat parsers in the form of Objective-C source code suitable for use on the iOS or Mac OS X platforms. 

ParseKit is heavily inspired by ANTLR, and ParseKit grammars have a very similar syntax.

Parsekit's home page is http://parsekit.com

I've written a tutorial that describes the new PEG/Packrat features: http://itod.github.io/ParseKitMiniMathExample/

The parsers produced by ParseKit are recursive descent, deterministic, packrat (memoizing), and backtracking as I believe would be expected for a PEG/Packrat tool. ParseKit parsers also have the Semantic Predicate feature copied directly from ANTLR.

ParseKit also includes an *extremely weak* version of ANTLR's LL(*) concept: Static grammar analysis will produce simple single-token prediction branching for any decision which is LL(1), but use full backtracking speculation for any non-LL(1) decision. However, ParseKit does not implement anything nearly as sophisticated as ANTLR's full LL(*)/DFA feature.

Thanks to Terence Parr and Bryan Ford for your contributions to this field. I have really enjoyed reading your books/papers and have learned a great deal from you in this process.

Todd Ditchendorf
_______________________________________________
PEG mailing list
PEG <at> lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg
Roman Redz | 10 Apr 2013 21:45
Picon

From EBNF to PEG

An improved version of my paper "From EBNF toPEG", due to appear in Fundamenta Informaticae, is now available from http://www.romanredz.se/pubs.htm.
The paper tries to answer the question " when an EBNF grammar can be used as its own PEG parser?".
It is inspired by a result from Medeiros about PEGs for LL(1) grammars.
Some sufficient conditions are presented, leading eventually to the notion of "LL(1p)" grammars, where a top-down parser may choose the next action by examining the input within the reach of one parsing procedure ("1p") , rather than by looking at the next letter.
This may have some use for inserting the "cut" operator, but a mechanical procedure for doing this does not seem easy.
_______________________________________________
PEG mailing list
PEG <at> lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg
Juancarlo Añez | 29 Mar 2013 20:36
Picon
Gravatar

About cut making previous memos irrelevant in Grako

After, the recent experiments, I reviewed the implementation of cut in Grako.

Grako's implementation of cut is in the spirit of Prolog, which means that a cut applies only to the nearest choice (optional, closure) in the runtime stack. That means that memos for positions before that of the last choice have a non-zero theoretical probability of being reused.

The explanation for the low impact of my optimistic memo pruning on performance should be that the grammar is for a structured, largely unambiguous, and quite probably LL(k) programming language, which makes it unlikely that previous memos are reused after any structured construct ("IF" cut) has been entered.

Cheers,

--
Juancarlo Añez
_______________________________________________
PEG mailing list
PEG <at> lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg
Juancarlo Añez | 28 Mar 2013 21:25
Picon
Gravatar

Cutting memoization

Hello Kota,

I'm reading your paper more carefully now (Packrat Parsers can Handle Practical Grammars in Mostly Constant Space).

If I understand correctly, if a cut is seen at position P in the input, then all memoization for positions prior to P can be discarded?

Thanks in advance!

Cheers,

--
Juancarlo Añez
_______________________________________________
PEG mailing list
PEG <at> lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg
Juancarlo Añez | 27 Mar 2013 01:08
Picon
Gravatar

Introduction, and Grako, and cut over PEG

Hello,

I just found about this list!

I'm a software engineer with a good part of my career spent on language theory and programming language implementation.

I recently wrote the tool Grako[1], basically because I was hired to deal with Software AG Natural, and COBOL with embedded CICS and SQL, the LLxxx tools I knew made it too hard (I tried), and the PEG parsers around seemed not to be up to thousands of lines of unencumbered grammars.

One of the biggest questions I had while building Grako was about how to implement a "cut" feature, much like the one Prolog-style parsers have. It is something  I have not seen in other PEG parsers, but that I've tried to implement to my best because I think it is of value.  The original question is at SO [2]. I'd be very interested in your opinions.

Cheers!

[1] https://pypi.python.org/pypi/grako
[2] http://goo.gl/AnNbp

--
Juancarlo Añez

_______________________________________________
PEG mailing list
PEG <at> lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg
Mikhail Gusarov | 15 Dec 2012 23:14
Favicon
Gravatar

PEG C parser generator for non-char input symbols

Good day.

I am looking for a PEG parser generator for C language that uses input
characters different from 'char': the grammar I am implementing is
specified in UTF-16 16-bit words, so matching using regular 8-bit
chars would not be convenient.

Looks like both peg and pacc can be modified to allow it, so the
question is whether there is anything that works "out of the box".

Best regards,
Mikhail Gusarov.
Alexey Shamrin | 30 Oct 2012 16:12
Picon

PEG grammar help

Hello,

I'm trying to write a PEG grammar that would match this input:

  a.a(a).a

And would not match this input:

  a.a(a)

I want my input to always end with ".a".

I tried the following grammar in PEG.js [1] and language.js [2]:

  start = expr !.
  expr = a (dot a / left a right)* dot a
  a = 'a'
  dot = '.'
  left = '('
  right = ')'

But it doesn't work. In PEG.js my first input fails with "Expected "("
or "." but end of input found.".

It seems PEG doesn't try to backtrack at this point. Why doesn't it?
What grammar can work for me?

[1]: http://pegjs.majda.cz/online
[2]: http://languagejs.com/

Alexey

Gmane