Steve Bennett | 17 Nov 2007 06:00

Welcome!

That was quite amusing, I read the "Welcome to your new list" message before the wikitech-l message. Anyway, a list just for parser discussion is good.

Here's a bit of ANTLR grammar I wrote to handle basic article structure: paragraph blocks and "special blocks", where two consecutive blocks of the same type need an extra linefeed. Since I haven't written any Lex or Yacc before, I'm still wrestling a bit with what are probably fairly basic problems. In this case, I found the requirement of an extra linefeed quite challenging to implement without ambiguity problems.

As it is, this does work, but it spews out a huge number of warnings and even an apparently non-fatal "fatal error". I presume some of these problems can be avoided through semantic and syntactic predicates, if not backtracking and memoization (no, that's not a typo). Any ANTLR experts here?

Steve

--
grammar paras;

article : pseries? (sseries (EOF| pseries))*;
pseries : para (N+ para)* N*;
sseries : specialblock (N+ specialblock)* N*;


specialblock
: (spaceblock|listblock)+;

spaceblock
: spaceline+;

spaceline
: SPECIALCHAR char* N;

listblock
: (listitem)+;
listitem: (bulletitem | numberitem | indentitem | defitem);

bulletitem
: BULLETCHAR (listitem | (nonlistchar char*)? N);

numberitem
: NUMBERCHAR (listitem | (nonlistchar char*)? N);

indentitem
: INDENTCHAR (listitem | (nonlistchar char*)? N);

defitem
: DEFCHAR (nonindentchar)* (definition | INDENTCHAR? N );
definition
: ':' char+ N;

BULLETCHAR: '*';
NUMBERCHAR: '#';
INDENTCHAR: ':';
DEFCHAR : ';';

para : (nonspecialchar char* N)+;

listchar: BULLETCHAR | NUMBERCHAR | INDENTCHAR | DEFCHAR;

SPECIALCHAR
: ' ';
nonlistchar
: SPECIALCHAR | nonspecialchar;
char : nonlistchar | listchar;
nonindentchar
: nonlistchar | BULLETCHAR | NUMBERCHAR | DEFCHAR;
N : '\r'? '\n' ;

nowiki : NOWIKI;
NOWIKI : '<nowiki>'( options {greedy=false;} : . )*'</nowiki>';

nonspecialchar
: NONSPECIALCHAR | nowiki;

NONSPECIALCHAR
: ('A'..'Z'| 'a'..'z' | '0'..'9' | '\'' | '"' | '(' | ')')+;
--

PS you might notice the above grammar implements two "improvements" to the ;term:definition notation:

1. The ; marker has to be the last list marker on the line.  Constructs like ##;## are worthless.
2. A trailing : is treated literally.
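
A rough Python sketch of the block rule the grammar is aiming for, assuming that a line starting with a space or one of * # : ; is "special" and anything else is paragraph text. The names line_kind and split_blocks are made up for illustration and are not part of the grammar. It shows only the "two consecutive blocks of the same type need an extra linefeed" behaviour, not list nesting or definitions:

```python
LIST_CHARS = "*#:;"

def line_kind(line):
    """Classify a line by its first character, as the grammar does."""
    if line == "":
        return "blank"
    if line[0] == " " or line[0] in LIST_CHARS:
        return "special"
    return "para"

def split_blocks(text):
    """Group lines into (kind, lines) blocks.

    A change of kind starts a new block by itself; two blocks of the
    same kind are only told apart by a blank line between them.
    """
    blocks = []
    current_kind, current = None, []
    for line in text.splitlines():
        kind = line_kind(line)
        if kind == "blank":
            # A blank line closes whatever block is open.
            if current:
                blocks.append((current_kind, current))
            current_kind, current = None, []
        elif kind == current_kind:
            current.append(line)
        else:
            if current:
                blocks.append((current_kind, current))
            current_kind, current = kind, [line]
    if current:
        blocks.append((current_kind, current))
    return blocks
```

Without the blank line, "one" followed by "two" is a single paragraph block; with it, they become two. A paragraph line followed directly by a "* item" line splits into two blocks with no blank line needed.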

Steve Bennett | 17 Nov 2007 06:29

Image grammar

Here's another one, at the bottom of

http://www.mediawiki.org/wiki/User:Stevage
(note, mw_img_thumbnail means "the magic word 'img_thumbnail', however that is defined".)

The problem I have here is the options for the image: you'd like the word "thumbnail" to be a token, but then if you get a case like:

 [[image:finger.jpg|Note the impressive thumbnails.]] 

you get one token for "thumbnail" rather than "t" and "h" etc.

Solutions I can think of so far:
1) Explicitly make the match for text to be 'a'..'z' | 'A'..'Z' | MW_img_thumbnail | ...
2) Make tokens for individual letters (Aa, Bb...) then make the parser recognise a pattern like Tt + Hh + Uu + Mm...
3) Make a token which is '|thumbnail', then use some trick to distinguish '|thumbnailblah' from '|thumbnail|'.
4) Like 1), but use a localised lexer so that those words are only tokens in this specific context.
5) Just match text, then use special markup at the parser level to look into the text that was matched.

I've tried 1) and 2) and they both work. I'll probably try 5) next because 3) is just ugly.
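
To make the problem concrete, here is a small Python sketch (not ANTLR) of a lexer in which "thumbnail" is a token of its own, followed by one way option 5 might look. The token names, MAGIC_WORDS and parse_options are assumptions for illustration only:

```python
import re

# Hypothetical token set: the literal magic word is tried before single
# characters, much as a keyword token would win in a generated lexer.
TOKEN_RE = re.compile(r"(?P<MAGIC>thumbnail)|(?P<PIPE>\|)|(?P<CHAR>[^|])")

def tokenize(text):
    """Return (token_name, lexeme) pairs for the text."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

MAGIC_WORDS = {"thumbnail"}  # assumed; the real list comes from MediaWiki

def parse_options(body):
    """Option 5: split on pipes first, then inspect each whole piece."""
    return [("magic" if part in MAGIC_WORDS else "caption", part)
            for part in body.split("|")]
```

Here tokenize("thumbnails") produces a MAGIC token followed by a lone "s", which is exactly the unwanted split. parse_options only treats a piece as a magic word when the whole piece between pipes matches, so a caption containing "thumbnails" stays a caption.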

Anyone have any comments or suggestions?

I really think writing the grammar in ANTLR is our best bet at this point. Advantages:
1) We're talking about actual, parseable grammar in an actual syntax, rather than the half-arsed EBNF/BNF we've done so far.
2) We can use ANTLRWorks to play with the grammar, visualise it etc.
3) One of the goals is to allow 3rd party parsers to generate code in a variety of languages. ANTLR already has 5 code targets and more (perhaps including PHP) are on the way.

Downsides:
1) ANTLR can't yet generate a parser in PHP. However, there may exist Java->PHP or C->PHP translators or something.

Steve

David Gerard | 17 Nov 2007 12:54

Re: Welcome!

On 17/11/2007, Steve Bennett <stevagewp <at> gmail.com> wrote:

> Here's a bit of ANTLR grammar I wrote to handle basic article structure:

I see from [[:en:ANTLR]] that ANTLR compiles to C++, Java, Python and
C# - not PHP. How feasible will it be to get PHP from this?

- d.

David Gerard | 17 Nov 2007 13:06

Fwd: New parser in the works - please help

Just sent this to wikipedia-l and foundation-l - I figured they would
be good places to ask.

- d.

---------- Forwarded message ----------
From: David Gerard <dgerard <at> gmail.com>
Date: 17 Nov 2007 12:05
Subject: New parser in the works - please help
To: wikipedia-l <at> lists.wikimedia.org

http://lists.wikimedia.org/mailman/listinfo/wikitext-l

Wikitext-l was formed from a recent discussion on wikitech-l about the
need to sanely reimplement the current parser, which is a Horrible
Mess and pretty much impossible to reimplement in another language.

The MediaWiki parser definition is literally "whatever the PHP parser
does." Some of what it does is arguably very wrong, pathological,
magical or just a Stupid Parser Trick. So the list has been formed to
come up with a grammar that defines all the useful parts of the
present parser, and so can be used by anyone to implement a MediaWiki
wikitext parser. This will be useful for other software, for WYSIWYG
editing extensions ... all manner of things.

Some of what some people would think of as a "stupid parser trick" is
in fact important - e.g. L'''uomo'' which renders as L'<i>uomo</i>
(necessary for French and Italian).

So: we need to know what MediaWiki quirks are supporting important
constructs in languages other than English (which is the language the
list is in, and is the native language of most of the participants),
and particularly in non-European languages.

This list is unlikely to implement new features, e.g. (an example
brought up by GerardM) the double-apostrophe in Neapolitan. But we
really need to know about present important features that wouldn't be
obvious to an English-speaker going through the present parser code.

- d.

Steve Bennett | 17 Nov 2007 14:25

Re: Welcome!

On 11/17/07, David Gerard <dgerard <at> gmail.com> wrote:
> On 17/11/2007, Steve Bennett <stevagewp <at> gmail.com> wrote:
>
> > Here's a bit of ANTLR grammar I wrote to handle basic article structure:
>
> I see from [[:en:ANTLR]] that ANTLR compiles to C++, Java, Python and
> C# - not PHP. How feasible will it be to get PHP from this?

Yeah, I addressed that in my second email. There are four roads I can think of:

1) Help implement the PHP target.
2) Compile to one of the other targets, then translate (possibly using an automated tool)
3) Translate the original grammar to Lex or whatever.
4) Compile to one of the other targets (eg, C) then link to that from the PHP code. Apparently that makes it harder for 3rd parties to run, but I can't really speak to why.

Option 3 is not so bad. We need a formal grammar. A formal grammar written in ANTLR is an incredibly useful thing, and if it's slightly inconvenient for our immediate parser-writing purposes, so be it. ANTLR is so expressive that whatever *other* mechanism we could be writing it in (eg, EBNF with English descriptions for semantically disambiguating ambiguous syntax), ANTLR syntax would *still* be a better way of expressing it, even if we don't use a parser directly generated by ANTLR.

Steve

Jay R. Ashworth | 17 Nov 2007 15:23

Re: Image grammar

On Sat, Nov 17, 2007 at 04:29:16PM +1100, Steve Bennett wrote:
>    I really think writing the grammar in ANTLR is our best bet at this point.
>    Advantages:
>    1) We're talking about actual, parseable grammar in an actual syntax, rather
>    than the half-arsed EBNF/BNF we've done so far.
>    2) We can use ANTLRWorks to play with the grammar, visualise it etc.
>    3) One of the goals is to allow 3rd party parsers to
>    generate code in a variety of languages. ANTLR already has 5 code targets
>    and more (perhaps including PHP) are on the way.
>    Downsides:
>    1) ANTLR can't yet generate a parser in PHP. However, there may exist
>    Java->PHP or C->PHP translators or something.

But: if it can produce a parser in *any* language, then we have
something to run the test suite against, with a little harness
rewiring, which makes it easier to sell both the retargeting work and
the switch-MW-to-this work.

PS: could you find your mailer's HTML knob and turn it off?

PPS: thanks for running with this; I think it's going to turn out well.

Cheers,
-- jra
--

-- 
Jay R. Ashworth                   Baylink                      jra <at> baylink.com
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates     http://baylink.pitas.com                     '87 e24
St Petersburg FL USA      http://photo.imageinc.us             +1 727 647 1274

Steve Bennett | 17 Nov 2007 17:00

Re: Image grammar

On 11/17/07, Steve Bennett <stevagewp <at> gmail.com> wrote:
> The problem I have here is the options for the image: you'd like the word "thumbnail" to be a token, but then if you get a case like:
>
>  [[image:finger.jpg|Note the impressive thumbnails.]]
>
> you get one token for "thumbnail" rather than "t" and "h" etc.
>
> Solutions I can think of so far:
> 1) Explicitly make the match for text to be 'a'..'z' | 'A'..'Z' | MW_img_thumbnail | ...
> 2) Make tokens for individual letters (Aa, Bb...) then make the parser recognise a pattern like Tt + Hh + Uu + Mm...
> 3) Make a token which is '|thumbnail', then use some trick to distinguish '|thumbnailblah' from '|thumbnail|'.
> 4) Like 1), but use a localised lexer so that those words are only tokens in this specific context.
> 5) Just match text, then use special markup at the parser level to look into the text that was matched.


Omg it's so much easier than that.
6) Use a syntactic predicate:

option : (magicword '|') => magicword
| caption;

magicword
: 'magicword';

Translation: If the next two tokens are some magicword and the pipe, then match the magic word. Otherwise, treat it as a caption.
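
The same lookahead idea, sketched in Python rather than ANTLR for anyone who wants to poke at it without ANTLRWorks. The token names and parse_options are hypothetical; the only thing taken from the rule above is the test "magic word followed by a pipe":

```python
import re

# Hypothetical lexer: "thumbnail" is always emitted as a MAGIC token,
# even when it appears inside ordinary caption text.
TOKEN_RE = re.compile(r"(?P<MAGIC>thumbnail)|(?P<PIPE>\|)|(?P<TEXT>[^|])")

def tokenize(text):
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

def parse_options(text):
    """Parse pipe-separated image options using two-token lookahead."""
    tokens = tokenize(text)
    options, i = [], 0
    while i < len(tokens):
        kind, lexeme = tokens[i]
        next_is_pipe = i + 1 < len(tokens) and tokens[i + 1][0] == "PIPE"
        if kind == "MAGIC" and next_is_pipe:
            # The predicate (magicword '|') succeeded.
            options.append(("magicword", lexeme))
            i += 1
        else:
            # Otherwise everything up to the next pipe is caption text,
            # including any MAGIC tokens embedded in it.
            start = i
            while i < len(tokens) and tokens[i][0] != "PIPE":
                i += 1
            caption = "".join(lex for _, lex in tokens[start:i])
            options.append(("caption", caption))
        if i < len(tokens) and tokens[i][0] == "PIPE":
            i += 1  # skip the separator
    return options
```

With the input "thumbnail|Note the impressive thumbnails." the first option is recognised as a magic word and the second stays a caption, even though the lexer emitted a MAGIC token in the middle of it. As with the rule as posted, a magic word that is the last option (no pipe after it) falls through to caption, so a real grammar would also need to accept the closing brackets in the predicate.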

That was easy. Woot. I thought things were a lot more complicated because ANTLRWorks sneakily doesn't support predicates in its Interpreter mode, only in its Debugger mode. I say "sneakily" because the error it reports looks like an error in your code...

>But: if it can produce a parser in *any* language, then we have
>something to run the test suite against, with a little harness
>rewiring, which makes it easier to sell both the retargeting work and
>the switch-MW-to-this work.

Oh, that's a good benefit too: we can regression test the new *grammar* against the old *parser*. Obviously it won't all work, and will require hacks to get all those magic words and stuff into the grammar. Perhaps someone could look into creating some tests that don't require the preprocessor (no templates, no magic variables) and that focus on specific language features...or maybe they already exist, I haven't looked.
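
For what it's worth, the comparison harness itself can be tiny. A Python sketch, assuming each test case is a (name, wikitext, expected output) triple where the expected output was recorded from the old parser; run_cases and toy_parse are made-up names, and toy_parse is just a stand-in that handles nothing but ''italics'':

```python
import re

def run_cases(parse, cases):
    """Return (name, expected, actual) for every case the parser gets wrong."""
    failures = []
    for name, wikitext, expected in cases:
        actual = parse(wikitext)
        if actual != expected:
            failures.append((name, expected, actual))
    return failures

def toy_parse(wikitext):
    """Stand-in for a generated parser: only understands ''italic''."""
    return re.sub(r"''(.+?)''", r"<i>\1</i>", wikitext)

CASES = [
    ("plain", "a", "a"),
    ("italic", "''a''", "<i>a</i>"),
    ("bold", "'''a'''", "<b>a</b>"),
]
```

Calling run_cases(toy_parse, CASES) reports only the "bold" case, which is the kind of per-feature result that tests without templates or magic variables would give.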

Steve

Steve Bennett | 17 Nov 2007 17:01

Re: Image grammar

> PS: could you find your mailer's HTML knob and turn it off?

Ok. Is this better? What was the problem before? On my end, I notice
some really strange word-wrapping issues with Gmail on Opera.

Steve

David Gerard | 17 Nov 2007 22:05

Re: Welcome!

On 17/11/2007, Steve Bennett <stevagewp <at> gmail.com> wrote:

> 1) Help implement the PHP target.
> 2) Compile to one of the other targets, then translate
> (possibly using an automated tool)
>  3) Translate the original grammar to Lex or whatever.

Mmm. Whichever of these is used, you'd need a note in parser.php that
"DO NOT PATCH DIRECTLY, THIS IS GENERATED CODE" and that parser
changes should be made to the ANTLR or lex grammar.

> 4) Compile to one of the other targets (eg, C) then link to that from the
> PHP code. Apparently that makes it harder for 3rd parties to run, but I
> can't really speak to why.

As I understand it, the issue is hosted copies of MediaWiki where the
user can only use PHP, not compile anything or run arbitrary binaries
or touch httpd.conf.

I expect where a user *does* have compiler access, a C implementation
would be the parser implementation of choice.

- d.

Jay R. Ashworth | 18 Nov 2007 01:08

Re: Image grammar

On Sun, Nov 18, 2007 at 03:01:08AM +1100, Steve Bennett wrote:
> > PS: could you find your mailer's HTML knob and turn it off?
> 
> Ok. Is this better? What was the problem before? On my end, I notice
> some really strange word-wrapping issues with Gmail on Opera.

Yeah; that was text.

Just that HTML is strongly deprecated on mailing lists, both because
it's less efficient than ASCII (sometimes strongly), and also hard to
quote.

Cheers,
-- jra
--

-- 
Jay R. Ashworth                   Baylink                      jra <at> baylink.com
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates     http://baylink.pitas.com                     '87 e24
St Petersburg FL USA      http://photo.imageinc.us             +1 727 647 1274

