Roman Redziejowski | 23 Jul 14:09 2014

Breaking cut operator

I believe Ger Hobbelt discovered a problem.

The "cut point", in the sense of a point after which there is no use of backtracking, is the property of a specific selection operator.

Once B in Ger's examples successfully consumed A, it indeed does not need to backtrack (as this must fail).
But G in the second example has to backtrack after failing B, even if B passed its cut point. So B has no right to erase the history because it may be used by the containing expression.
So the cut points should somehow be considered in the entire hierarchy of parsing expressions.

I recently wrote a note about cut points, to be presented at the CSP workshop this year. It was inspired by a section in the paper by Maidl et al. about producing error messages; apparently, failures in the 'no return' areas defined by cut points are the relevant ones.
What I missed there is the fact that cut points are local.

(My note can be found at http://www.romanredz.se/papers/CSP2014.pdf. Remember: it is preliminary and not peer-reviewed yet.)

Roman
_______________________________________________
PEG mailing list
PEG <at> lists.csail.mit.edu
https://lists.csail.mit.edu/mailman/listinfo/peg
Ger Hobbelt | 10 Jul 16:14 2014

Breaking the Mizushima(2010) cut operator

L.S.,

I'm looking for feedback, i.e. to be corrected about any mistakes, more or less obvious, because I think I ran into an issue and, after having double-checked what I consider to be the cause several times, I can't find any goof(s) in it anymore.

The observation below is written as a fact, but I know sometimes I overlook the obvious elephant in the room. :'-(  And I want to make sure I got this one nailed.

Thanks!

...

# Intro #

Take grammar1:

G = A + B / B               (1)
B = A + C / C               (2)
A = a
C = c 

Matching Inputs:

c                            (1)
a+c                          (2)
a+a+c                        (3)

Application of Mizushima's cut operator rules will *not* add a cut (marked as '#') in rule G(1), as the ExtFirst(e) sets for both the A+B and B choices will include the 'a' prefix (full common 'first' prefix: 'a+').
It does add an automatic cut operator to rule B(2), as there the intersection of the ExtFirst sets of both choices is empty:

B = A # + C / C               (2)

And this automatic cut insertion is correct as it does not alter the behaviour of the grammar in any way.

If we were to use BITES (Redziejowski, 2011) rather than the FIRST basic building block used by Mizushima et al. (2010), as suggested in the introduction of that paper, things don't change: the inferred ExtBites(e) would deliver 'a+' rather than just 'a', but that's about it for this example grammar.
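
To make that decision concrete, here is a minimal Python sketch of the disjointness test that drives the automatic cut insertion for grammar1. It is my own simplification for illustration only, not Mizushima et al.'s actual ExtFirst computation (no nullable expressions, no lookahead handling):

GRAMMAR1 = {
    'G': [['A', '+', 'B'], ['B']],
    'B': [['A', '+', 'C'], ['C']],
    'A': [['a']],
    'C': [['c']],
}

def first(symbol, grammar):
    # FIRST set of one symbol; terminals are the single characters 'a', '+', 'c'
    if symbol not in grammar:
        return {symbol}
    result = set()
    for alternative in grammar[symbol]:
        result |= first(alternative[0], grammar)
    return result

def cut_allowed(rule, grammar):
    # Allow a cut after the first symbol of choice 1 when the FIRST sets of the
    # two choices are disjoint (none of these rules can match empty, so FIRST of
    # a choice is just FIRST of its first symbol); a simplified stand-in for ExtFirst.
    alt1, alt2 = grammar[rule]
    return first(alt1[0], grammar).isdisjoint(first(alt2[0], grammar))

print(cut_allowed('G', GRAMMAR1))   # False: both choices of G can start with 'a'
print(cut_allowed('B', GRAMMAR1))   # True:  {'a'} vs {'c'}, giving  B = A # + C / C

Running it prints False for G and True for B, matching the single cut placed in B(2) above.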


# A Breaking Grammar #

Take grammar2, a minimal alteration of grammar1 that accepts the same input set ('language') as grammar1:

G = B / A + B               (1)
B = A + C / C               (2)
A = a
C = c 

Matching Inputs:

c                            (1)
a+c                          (2)
a+a+c                        (3)

Note that rule G has 'flipped' both choices but still parses the same input.
Mizushima's algorithm does not mind, so, once again, only rule B gets a cut:

G = B / A + B               (1)
B = A # + C / C             (2)
A = a
C = c 

as the ExtFirst() sets turn up the same results: B-choice1 = {'a'}, B-choice2 = {'c'}. Also upgrading to Redziejowski's BITES does not change things.


## The Problem ##

However, the cut in B will now kill the memoization cache for the initial 'a' parsed by B in the inputs 'a+c' and 'a+a+c'. When we now inspect the latter, parsing will fail in the first choice of B, backtrack to G, and be matched by the second choice of G, but this time around without any of the memoization help for the initial 'a' that it would have received in grammar1.

In other words: for the input 'a+a+c', the second evaluation of the leading matching 'A = a' at input column 0 (now in rule G(1); in grammar1 its second evaluation is in the first choice of rule B(2)) no longer receives memoization support.
This means that using the cut operator (via the Mizushima/Redziejowski ExtFirst/ExtBites approach) to clear out and reduce your memoization cache footprint unintentionally removes that memoization support, and hence the grammar no longer exhibits guaranteed linear-time behaviour.
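
To make the effect measurable, here is a toy packrat parser for grammar2, a Python sketch of my own for illustration (not Mizushima et al.'s implementation); the cut is modelled in the crudest possible way, as a complete flush of the memo table, and the top-level rule simply requires the whole input to be consumed:

class Parser:
    def __init__(self, text, use_cut=True):
        self.text = text
        self.use_cut = use_cut
        self.memo = {}                        # (rule, pos) -> end position or None
        self.evals_of_A_at_0 = 0

    def memoized(self, rule, pos, fn):
        key = (rule, pos)
        if key not in self.memo:
            self.memo[key] = fn(pos)
        return self.memo[key]

    def literal(self, ch, pos):
        return pos + 1 if self.text[pos:pos + 1] == ch else None

    def A(self, pos):
        def body(p):
            if p == 0:
                self.evals_of_A_at_0 += 1     # count real evaluations of A at column 0
            return self.literal('a', p)
        return self.memoized('A', pos, body)

    def C(self, pos):
        return self.memoized('C', pos, lambda p: self.literal('c', p))

    def B(self, pos):                         # B = A # + C / C
        def body(p):
            q = self.A(p)
            if q is not None:
                cut_taken = False
                if self.use_cut:
                    self.memo.clear()         # the cut: throw away the saved results
                    cut_taken = True
                r = self.literal('+', q)
                if r is not None:
                    r = self.C(r)
                    if r is not None:
                        return r
                if cut_taken:
                    return None               # committed: no fallback to the second choice
            return self.C(p)
        return self.memoized('B', pos, body)

    def G(self, pos):                         # G = B / A + B  (must consume all input)
        q = self.B(pos)
        if q == len(self.text):
            return q
        q = self.A(pos)
        if q is not None:
            r = self.literal('+', q)
            if r is not None:
                r = self.B(r)
                if r == len(self.text):
                    return r
        return None

p = Parser('a+a+c', use_cut=True)
print(p.G(0), p.evals_of_A_at_0)              # 5 2: A at column 0 is evaluated twice
p = Parser('a+a+c', use_cut=False)
print(p.G(0), p.evals_of_A_at_0)              # 5 1: memoization keeps it at one

With use_cut=True the run on 'a+a+c' reports two evaluations of A at column 0; with use_cut=False memoization keeps it at one, which is exactly the difference described above.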


## Side Notes ##

This is the minimal grammar example that I can devise to showcase the issue. 
Having rule G as

G = B / A + B               (1)

looks suspicious to the human eye when you're aware of 'prefix hiding', even though the notation as such does *not* imply 'prefix hiding' since we're looking at non-terminals here. That suspicion can take a long time to develop, though, when this construct occurs in a production grammar along the lines of grammar3 below:

G = B / D                   (1)
B = A # + C / C             (2)
D = A + B                   (3)
A = a
C = c 

where rule D is 'obfuscating' the 'A +' common prefix in both choices of rule G(1). Of course, in a production grammar, the B(2) and D(3) definitions will very probably be further apart, so the issue would not be caught indirectly by 'probable trouble' showing up in the visual patterns of the grammar definition itself.
('probable trouble' being a human brain thing, not something a machine would use. Nor is 'probable trouble' a sure-fire thing; it's just that when you look at grammars long and often enough, you may get the feeling 'something is odd with this one'. And you *may* be right. Then again, you may not be.)


# Conclusion #

As the Mizushima cut operator only looks into the future (ExtFirst) and does not consider the history/context of the rule currently under scrutiny (i.e. rule G's behaviour when we determine that rule B(2) can have a cut inserted), I expect a solution/fix for this issue to be derived from propagating the invoking rule's First/Bites info into the rule under scrutiny, in order to add the outer context to the condition set for the automatic insertion of the cut operator. (To be addressed further in another email.)


# References #

Mizushima et al, 2010: Packrat Parsers can handle practical grammars in mostly constant space (Mizushima, Maeda, Yamaguchi, 2010)

Redziejowski, 2011: BITES instead of FIRST for parsing expression grammar (Redziejowski, 2011) -- 'upgrading' Mizushima's algorithm (ExtFirst -> ExtBites) is already suggested in its ch.1:Introduction. 

...

I have searched the Net to the best of my ability but wasn't able to find an earlier mention of this issue; if I missed it, I'd love to hear about it and get a reference to it. :-)  (Because if I missed this one, I'm bound to have missed others in that vicinity too.)


Thank you for bearing with me.

 
Met vriendelijke groeten / Best regards,

Ger Hobbelt

--------------------------------------------------
web:    http://www.hobbelt.com/
        http://www.hebbut.net/
mail:   ger <at> hobbelt.com
mobile: +31-6-11 120 978
--------------------------------------------------
Mackenzie High | 10 Jul 10:14 2014

First Official Release of Snowflake: A Graphical PEG Parser

Greetings everyone,

I would like to officially announce the release of the Java-based Snowflake project, which allows users to test PEG grammars on sample input as they write them. In addition, the project is a parser generator.



I envision that this program will be especially useful for students who are learning about parsing, since it allows one to immediately see the parse tree that results from a given grammar and sample input.

I hope someone finds this program useful. I would be happy to hear any of your questions, comments, or bug reports regarding this project.


Sincerely, 
Mackenzie High


Juancarlo Añez | 5 Jul 16:19 2014

Grako 3.1.0

Grako 3.1.0 has been released.

Grako is a PEG/Packrat parser generator for Python. It takes grammars in a variation of EBNF as input, and parses in-memory, or generates a structured parser in Python.

New features in this release:
  • Support for direct and indirect left recursion in grammars.
  • Arguments in grammar rules.
  • Rule inclusion.
  • Rule inheritance.
  • Include files.
  • JSON-compatible abstract syntax trees (AST).
  • AST walkers.
Cheers,

--
Juancarlo Añez
Mathias Doenitz | 26 May 22:04 2014

Re: error handling in PEGs

> hi. That makes sense. It’s basically identifying a no viable alternative situation if the char occurs at the start of a rule.
> 
> I think one could get better error messages and recovery using tokens instead of scannerless but one gives up composability/modularity. For token errors, it’s easy to just skip over stuff in the lexer. The parser need not even see those.
> 
>  when using tokens, it seems one could get a better error message like “missing ID, found INT” instead of “invalid input ‘3’ expected [a-zA-Z]+” or whatever.

I’m not sure that tokenizing will always result in better error messages.
In fact, there are situations in which tokenizing appears to actually *decrease* error message quality
(for example when there are typos in language keywords).

The approach I outlined for PEGs gives you a list of rule “traces” for an error position whereby each
trace is the stack of rules that failed.
There are several conceivable ways to turn such a list of traces into a proper error message.

Going up the trace to the first parent rule which failed at its first sub-rule (rather than somewhere in the
middle) for reporting an expected alternative is only one option.
You could also allow the user to somehow explicitly mark the rules that are considered to be terminals which
would traditionally be matched by the lexer. (This marking could also be minimally invasive like
ALL-CAPS naming). And if you allow for special constructs in the grammar that give the user even more
control over what and how errors are reported (like in the paper that Sérgio pointed to) the
possibilities multiply.

However, the negative syntactic predicate available in PEGs can pose somewhat of a problem for fully
automatic error reporting. This starts with the determination of the error position (do you count
progress made in a negative syntactic predicate towards the overall progress the parser has made or not?)
and also affects the display of “expected” input (e.g.: “expected anything but FOO or BAR”).

> Single token insertion and deletion is great, although I wonder how well it works with characters compared to tokens.

Actually it works well and can be superior to recovery on a token stream.
Again, consider a typo in a keyword (like `publc` rather than `public`). Since recovery for PEGs runs on the same level that the user usually makes mistakes on (i.e. the character level rather than the token level), single insertion/deletion/replacement is quite effective in real-world applications, in our experience.

(Forgot to mention before that parboiled doesn’t only try single-char deletion and insertion but also
replacement, which is essentially a deletion followed by an insertion.)
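
For illustration, here is a rough Python sketch of that single-character repair loop; it is my own toy, not parboiled's actual code, and `parses` and `expected` are hypothetical stand-ins for whatever the real parser provides:

def try_single_char_repairs(parses, text, error_pos, expected):
    # parses(text) -> True/False; expected: characters the parser could accept at error_pos
    candidates = [('delete', text[:error_pos] + text[error_pos + 1:])]
    for ch in expected:                       # insertion of an expected character
        candidates.append(('insert ' + repr(ch), text[:error_pos] + ch + text[error_pos:]))
    for ch in expected:                       # replacement = deletion followed by insertion
        candidates.append(('replace with ' + repr(ch), text[:error_pos] + ch + text[error_pos + 1:]))
    for label, repaired in candidates:
        if parses(repaired):                  # the parser gets past the error
            return label, repaired
    return None                               # give up and resynchronise instead

def parses(text):
    # toy "grammar": one or more 'a' followed by a single ';'
    return len(text) >= 2 and set(text[:-1]) == {'a'} and text[-1] == ';'

print(try_single_char_repairs(parses, 'aab;', 2, {'a', ';'}))   # ('delete', 'aa;')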

> Yep, if ANTLR cannot repair inside a rule, it consumes tokens until something in  what could follow appears. Do you consider the actual call stack or the complete follow set? If you have the call stack, you can limit the set of things to consume to just what is possible given the input as opposed to giving all input.

parboiled 1.x determines the follower-set by working up through a rule trace, so the follower-set is not
determined through a static analysis but does take the actual input into account.
However, it’s not yet as smart as it could be. Ideally we’d attempt resynchronisation on *all* rule
traces for the current error and choose the best one (i.e. the one that skips the least number of input chars).
IIRC parboiled currently always resynchronises on the first trace only.
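
A hypothetical sketch (my own, not parboiled's code) of the "choose the best resynchronisation" idea, where each rule trace contributes a follower set and the trace that skips the fewest input characters wins:

def best_resync(text, error_pos, follower_sets):
    # follower_sets: per-trace sets of characters the parser could continue with
    best = None
    for trace_name, followers in follower_sets.items():
        skip = 0
        while error_pos + skip < len(text) and text[error_pos + skip] not in followers:
            skip += 1
        if error_pos + skip < len(text) and (best is None or skip < best[1]):
            best = (trace_name, skip)
    return best   # (trace to resync on, number of chars to skip), or None

followers = {'statement': {';', '}'}, 'expression': {')', ','}}
print(best_resync('foo(#!?,bar);', 4, followers))   # ('expression', 3): skip up to the ','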

Cheers,
Mathias

---
mathias <at> parboiled.org
http://www.parboiled.org

On 26 May 2014, at 18:30, Terence Parr <parrt <at> cs.usfca.edu> wrote:

> 
> On May 26, 2014, at 8:48 AM, Mathias Doenitz <mathias <at> parboiled.org> wrote:
> 
>> Terence,
>> 
>> for error reporting parboiled applies this logic:
>> 
>> 1. Determine error position (first input pos that cannot be parsed), if this is end-of-input we have no error and are done.
>> 
>> 2. Restart the parser from the very beginning and “watch” the parser try to match the error position. Record all the rules (rule stacks) that are applied against the error position. From this list of rules we can construct proper error messages like:
>> 
>>    Invalid input 'x', expected 'f', Digit, hex or UpperAlpha (line 1, column 4):
>>    abcx
>>       ^
>> 
>>    4 rules mismatched at error location:
>>      targetRule / | / "fgh" / 'f'
>>      targetRule / | / Digit
>>      targetRule / | / hex
>>      targetRule / | / UpperAlpha
> 
> hi. That makes sense. It’s basically identifying a no viable alternative situation if the char occurs at the start of a rule.
> 
> I think one could get better error messages and recovery using tokens instead of scannerless but one gives up composability/modularity. For token errors, it’s easy to just skip over stuff in the lexer. The parser need not even see those.
> 
>  when using tokens, it seems one could get a better error message like “missing ID, found INT” instead of “invalid input ‘3’ expected [a-zA-Z]+” or whatever.
> 
>> Restarting the parser for gathering error info is acceptable in almost all use cases since errors are infrequent and added overhead here is not a problem.
> 
> I have no problem restarting parsers upon syntax errors; it’s what I recommend people do in ANTLR 4 for maximum speed. ;)   That way, we can use a faster mechanism for the common case and fail over to a more expensive one to determine if the syntax error is correct or a result of a weakness in the first strategy. BTW, I will be presenting the ALL(*) at OOPSLA this year; here’s more on the strategy:
> 
> http://www.antlr.org/papers/allstar-techreport.pdf
> 
> It basically properly builds a parser for any CFG you give it that does not have indirect left recursion. (ANTLR rewrites direct left-recur under the hood.) No backtracking and it does all analysis on-the-fly, memoizing lookahead-to-predicted-alternative mappings with DFA.
> 
> I’ve actually been using the supplied parser interpreter these days instead of generating grammars for instant gratification :) Generating code and compiling is tedious during prototyping.
> 
>> Error recovery:
>> Here we have the following logic:
>> 
>> 1. Try to “remove” the error position from the input and see whether the parser can continue.
>> 2. If not, try to insert a single character that the parser “expects” (see reporting) at the error position. If this allows us to continue parsing, great.
>> 3. If no single-char fix “repairs” the input we resynchronise the parser in a similar way that ANTLR does it (IIRC), i.e. we skip ahead in the input until the parser can continue.
> 
> Single token insertion and deletion is great, although I wonder how well it works with characters compared to tokens.
> 
> Yep, if ANTLR cannot repair inside a rule, it consumes tokens until something in  what could follow appears. Do you consider the actual call stack or the complete follow set? If you have the call stack, you can limit the set of things to consume to just what is possible given the input as opposed to giving all input.
> 
> Thanks very much for the summary.
> 
> Ter
Mathias Doenitz | 26 May 22:04 2014

Re: error handling in PEGs

<forgot to CC the list before>
---

Terence,

for error reporting parboiled applies this logic:

1. Determine error position (first input pos that cannot be parsed), if this is end-of-input we have no
error and are done.

2. Restart the parser from the very beginning and “watch" the parser try to match the error position.
Record all the rules (rule stacks) that are applied against the error position. From this list of rules we
can construct proper error messages like:

   Invalid input 'x', expected 'f', Digit, hex or UpperAlpha (line 1, column 4):
   abcx
      ^

   4 rules mismatched at error location:
     targetRule / | / "fgh" / 'f'
     targetRule / | / Digit
     targetRule / | / hex
     targetRule / | / UpperAlpha

Restarting the parser for gathering error info is acceptable in almost all use cases since errors are
infrequent and added overhead here is not a problem.
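
For illustration, here is a bare-bones Python sketch of that two-pass scheme; it is my own toy, not parboiled's code, using a hand-written parser for a rule shaped like the example above (target = "abc" followed by one of 'f', Digit, hex, UpperAlpha):

def run(text, watch_pos=None):
    furthest = [0]          # furthest position any rule tried to match (pass 1 result)
    stack = []              # current rule stack
    traces = []             # rule stacks recorded at watch_pos (pass 2 result)

    def attempt(name, pos):
        furthest[0] = max(furthest[0], pos)
        if watch_pos is not None and pos == watch_pos:
            traces.append(' / '.join(stack + [name]))

    def lit(s, pos):
        attempt(repr(s), pos)
        return pos + len(s) if text.startswith(s, pos) else None

    def char_class(name, chars, pos):
        attempt(name, pos)
        return pos + 1 if pos < len(text) and text[pos] in chars else None

    def target(pos):
        depth = len(stack)
        stack.append('targetRule')
        try:
            pos = lit('abc', pos)
            if pos is None:
                return None
            stack.append('|')                 # the ordered choice
            for alt in (lambda p: lit('f', p),
                        lambda p: char_class('Digit', '0123456789', p),
                        lambda p: char_class('hex', '0123456789abcdefABCDEF', p),
                        lambda p: char_class('UpperAlpha', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', p)):
                end = alt(pos)
                if end is not None:
                    return end
            return None
        finally:
            del stack[depth:]

    return target(0), furthest[0], traces

ok, error_pos, _ = run('abcx')              # pass 1: locate the error position (3, i.e. column 4)
_, _, traces = run('abcx', error_pos)       # pass 2: record the rule stacks applied there
for t in traces:
    print(t)                                # targetRule / | / 'f'   ...and three more

The four printed traces correspond to the "4 rules mismatched at error location" listing above; turning them into the human-readable "expected ..." line is then just formatting.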

Error recovery:
Here we have the following logic:

1. Try to “remove” the error position from the input and see whether the parser can continue.
2. If not, try to insert a single character that the parser “expects” (see reporting) at the error
position. If this allows us to continue parsing, great.
3. If no single-char fix “repairs” the input we resynchronise the parser in a similar way that ANTLR
does it (IIRC), i.e. we skip ahead in the input until the parser can continue.

Cheers,
Mathias

---
mathias <at> parboiled.org
http://www.parboiled.org

On 26 May 2014, at 17:38, Terence Parr <parrt <at> cs.usfca.edu> wrote:

> Hi Mathias, Could you summarize quickly how it recovers and reports errors? I think the basic strategy initially was to simply report an error at the token following the deepest correctly matched token, but I don’t remember anything about recovery.
> 
> PEGs are so beautiful and simple it makes one proud to be a computer scientist! I’ve heard reports however of people backing off to a more traditional deterministic parser for reasons of side effecting action execution during the parse and error recovery.  For example, the xtext guys tried to get away from ANTLR by building  their own PEG based system but error recovery and reporting gave them lots of trouble and so they returned to ANTLR.
> 
> Ter
> On May 26, 2014, at 8:35 AM, Mathias Doenitz <mathias <at> parboiled.org> wrote:
> 
>> Terence,
>> 
>> parboiled (http://parboiled.org, http://parboiled2.org) does what I’d consider good error reporting.
>> The 1.x version also supports error recovery in a nice fashion.
>> 
>> Cheers,
>> Mathias
>> 
>> ---
>> mathias <at> parboiled.org
>> http://www.parboiled.org
>> 
>> On 26 May 2014, at 17:30, Terence Parr <parrt <at> cs.usfca.edu> wrote:
>> 
>>> Hi. Just curious. Has anyone figured out how to get good error reporting and recovery in PEGs? I.e., competitive with deterministic parsers?
>>> 
>>> Terence
>>> 
>>> 
>>> _______________________________________________
>>> PEG mailing list
>>> PEG <at> lists.csail.mit.edu
>>> https://lists.csail.mit.edu/mailman/listinfo/peg
>> 
> 
Aaron Moss | 26 May 14:50 2014

Re: New PEG Parsing Algorithm

On 2014-05-26 8:30 AM, 罗勇刚(Yonggang Luo)  wrote:
> Sounds like an interesting algorithm; anyway, the worst case time complexity is
> not appealing enough. If the nbb can be reduced to |rules| then it
> will be really good :)

I'm still working on experimental validation, but I believe that under some reasonably real-world constraints on the grammar and input the bounds drop to constant space/linear time. So I expect the real-world runtime to be linear in the input size, with polynomial worst-case behaviour on the sort of pathological cases where the naive recursive algorithm takes exponential time, while at the same time having space usage more like the recursive algorithm than the linear space packrat uses.

If the amount of backtracking and the nesting depth of the grammar are 
both bounded by a constant I get those bounds; any LR(k) grammar should 
have the former property, and Mizushima et al. and Might et al. (cited 
in the paper) both have some experimental results suggesting a constant 
amount of backtracking for the Java, XML, JSON & Python tests they ran. 
As for constant nesting depth, I don't have the citation for it (any 
help, anyone?) but I'm pretty sure someone in software engineering has 
demonstrated that the statement nesting depth of source code (and maybe 
even machine generated data formats) tends to be bounded by a small 
constant.

Aaron Moss | 26 May 00:07 2014

Re: New PEG Parsing Algorithm

On 2014-05-25 7:19 PM, 罗勇刚(Yonggang Luo)  wrote:
> Do you deal with left recursion properly?

Yes; left recursion causes the naive recursive algorithm to go into an infinite loop. My algorithm instead detects it and inserts an "infinite loop" failure expression which resolves to "not a match".

Given this ability to detect left recursion without crashing the 
program, I'd like to investigate ways to give it useful semantics at 
some point in the future, but I've got a lot of other things I'd like to 
do with this work first.

Francis Galiegue | 12 May 15:49 2014

Another PEG project: grappa

Hello,

Here is an old/new PEG parser generator; new because it isn't even a month old, but it is already fully functional because it is, in fact...

Old, since it is a fork and continuation of parboiled version 1, whose evolution has stopped (its author has moved on to parboiled2).

It is written entirely in Java; like parboiled1, and unlike pretty much all other parser generators out there (except parboiled2), grammars are written directly in Java, compiled with your code, and parsers are generated at runtime.

Project home: https://github.com/parboiled1/grappa
Original parboiled: https://github.com/sirthias/parboiled

Regards,
--

-- 
Francis Galiegue, fgaliegue <at> gmail.com, https://github.com/fge
JSON Schema in Java: http://json-schema-validator.herokuapp.com
Parsers in pure Java: https://github.com/parboiled1/grappa (redde
Caesaris: https://github.com/sirthias)
Rostislav Sacek | 30 Dec 23:59 2013

[ANN] LPegLJ v.20

I have released version 0.20 of LPegLJ
https://github.com/sacek/LPegLJ/
LPegLJ is a PEG parser based on LPeg v0.12, with all code implemented in LuaJIT 2.x.

New features:
- Faster than the previous LPegLJ 0.12: the VM engine is about 1.5 times faster, and creating patterns is about 4 times faster.
- Loading and saving patterns.
- Support for direct and indirect left recursion, based on the Sergio Medeiros algorithm (http://arxiv.org/abs/1207.0443).

Feedback welcome.

All code is MIT licensed.
Rostislav Sacek
Henri Tuhola | 10 Dec 02:24 2013

Inspiration for recognition based grammars

While I was studying left recursion, I realized that it becomes complex and unpredictable very fast. Allowing left recursion in a PEG parser is almost like a permission to make it extremely complicated. Suddenly right recursion starts affecting what actually ends up parsed, and it's not nice that the solution is some workaround papered over the otherwise simple way parsing expression grammars parse.

I noticed another property of PEGs that isn't emphasized a lot. If you scan from right to left instead of left to right, the same grammar may match a different language. Regular expressions have the same behavior, though.

It all gave me a reason to question the sanity of parsing expression grammars. The idea of constructing a parser from recognition rules sounds intuitive, though. Perhaps there's a way to construct recognition-based parsing that doesn't depend as much on the direction of evaluation.

A human parses text by looking at some small area at a time. Perhaps it would make sense to imitate that. Here are some observations:

Even if PEG parsing doesn't force that, it seems preferable to mark some words, such as 'if', 'for' and 'else', as keywords.  "term = identifier"  seems to be less specific than  "statement = if ... else ..."  or  "statement = for ... in" .

Infix expressions alone are unambiguous; they only become ambiguous if the human doesn't know their precedence. For example, "4 + 5*3" is easy to understand because precedence is very much standardized at that point. "4 + 3 << 2" is more ambiguous.

If some pattern is incomplete, the pieces that formed parts of that pattern are garbage, for example superfluous '(' and ')' parentheses. Some such garbage patterns cause ambiguous situations as well.

I don't know whether I'll explore this idea further. It feels more promising because there's a chance for better error reporting there. It could also work in noisy settings. I haven't studied whether something similar already exists.
