Helder | 9 Jan 02:26 2011

Wikitext Madness of the Day: Internal Links

[Wikitext-l] Wikitext Madness of the Day: Internal Links
Andreas Jonsson andreas.jonsson at kreablo.se
Thu Aug 12 07:47:51 UTC 2010

2010-08-12 09:40, Daniel Kinzler wrote:
> Tomaž Šolc wrote:
>> Hi
>>
>>> (The actual regexp is using (.+?), but using an empty string as link
>>> text will still produce a link for some strange reason.)
>> Empty string is actually a special case: such links will use the name of
>> the target page minus the namespace prefix for the anchor.
> but that's done on save, i guess before the actual parsing.
>
> Pipe trick syntax: [[User:Foo|]] will turn into [[User:Foo|Foo]] upon save.

OK, that explains it.

> Yay arcane special cases.

Yay, automatically fixing the user input.  That'll make things a lot
easier.  :-)

/Andreas

But see also bug 5004 ("Keep the pipe trick ([[Piped link|]]) syntax in wikitext instead of converting it at pre-save transform"):
https://bugzilla.wikimedia.org/show_bug.cgi?id=5004
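
For reference, a minimal sketch of the pipe-trick pre-save transform
described above.  This is only an approximation for illustration, not
MediaWiki's actual implementation: the class name, regexp and helper are
invented, and the real transform handles more cases (trailing
parenthetical disambiguators, interwiki prefixes, etc.).  It only shows
the namespace-stripping behaviour mentioned in the quoted message.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class PipeTrick {
    // Matches links of the form [[target|]] with an empty link text.
    private static final Pattern EMPTY_PIPE =
        Pattern.compile("\\[\\[([^\\]|]+)\\|\\]\\]");

    // Expands [[Namespace:Page|]] to [[Namespace:Page|Page]], i.e. the
    // link text becomes the target minus the namespace prefix.
    public static String preSaveTransform(String wikitext) {
        Matcher m = EMPTY_PIPE.matcher(wikitext);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String target = m.group(1);
            int colon = target.indexOf(':');
            String text = colon >= 0 ? target.substring(colon + 1) : target;
            m.appendReplacement(out,
                Matcher.quoteReplacement("[[" + target + "|" + text + "]]"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(preSaveTransform("[[User:Foo|]]"));  // [[User:Foo|Foo]]
    }
}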

Helder

Andreas Jonsson | 26 Jan 10:28 2011

A specialized lexer generator for wikitext

(Starting a paragraph with the words "From my" for some reason
seems to be a problem for my email client.  Sorry about that.)

Hi,

My experiments with an Antlr-based parser have led me to believe that
it is possible to write a formal parser for MediaWiki syntax that is
sufficiently compatible with the original to be used as a replacement.
It is, however, questionable whether Antlr is the best option for
implementing such a parser.  It has several drawbacks: it is hard to
follow the code in the lexical analyser, and the algorithm used by
Antlr doesn't match the requirements of wikitext very well.  I would
guess that my implementation contains many errors where the lexer will
fail to accept the input, and since Antlr is not designed for languages
where everything is valid input, it is very hard to find these errors
by inspecting the code.  Also, an external tool for generating semantic
predicates is necessary, so the description of the lexical rules is
divided into two separate parts.  Furthermore, a recursive descent
algorithm is not the fastest option when it comes to parsing; if the
token stream is sufficiently refined, a LALR parser can be used
instead.

I believe that it is always possible to transform wikitext into a
token stream that can be easily parsed.  Thus, I propose writing a
specialized lexer generator that is suitable for generating a lexer
that does precisely this.

As a consequence, the wikitext specification would be more compact, it
would be easier to maintain for multiple target languages, and the
resulting parser would be smaller and faster than an Antlr-generated
one.  The grammar over the resulting token stream will be context free
and parsable with any standard parsing algorithm.

Here are my preliminary thoughts on the requirements for the lexer
generator:

1. The input to the lexer generator is a table of tokens in which the
   properties of each token are described.

2. The output of the generated lexer is a sequence of tokens.  Each
   token may be associated with a string and a vector of attributes.

3. Suitable pattern matching algorithm.

   There should be four basic classes of characters:

   1. New line characters

   2. Special characters (that need escaping)

   3. Ordinary characters

   4. Ignored characters

   If no token production rule matches, the characters should just be
   lumped together into tokens consisting of the longest run of
   characters of the same class.

   A bonus objective is to make these character classes runtime
   configurable, since the set of characters that need escaping differs for
   different applications.

   A token production rule consists of a simple regular pattern that
   is used for preliminary matching of candidates for the rule.  The
   matched text can then, in a second step, be matched against a PCRE
   or otherwise processed to decide whether to finally accept the
   production or not.

4. Toggling token productions.

   The context is effectively defined by the set of enabled token
   productions, which means there is a very large number of different
   contexts.  Therefore we need a more dynamic context mechanism than
   what is supported by tools like flex and javacc: a mechanism where
   individual token productions can be dynamically disabled and
   enabled.  The code for doing this will be inferred from the
   properties of the tokens and generated.  (A sketch of such a
   mechanism, together with the character-class fallback from point 3,
   is given after this list.)

5. Late token determination.

   The semantic meaning of a sequence of apostrophes cannot be
   resolved until the end of a text line.  The most efficient method
   to handle this special case is to produce a set of pseudo tokens
   that can be resolved into specific tokens when reaching the end of
   the line.  No lookahead would be required for this.  (A small
   sketch of this resolution step also follows the list.)

   In my previous implementation I solved apostrophe parsing by parser
   lookahead, which is costly.  This mechanism alone seemed to account
   for about 30% of the execution time of the parser.

   Furthermore, a token representing the start of an internal link can
   be further resolved into a "red link" or a "blue link" at the end
   of the lexical analysis.

6. Speculative execution.

   Lookahead is problematic for context-sensitive analysis.  Instead
   we use speculative execution.  Thus, there must be a mechanism to
   save and restore the full state of the analysis, as well as the
   state of the input and output streams.  (The usage example after
   the API listing below shows the intended shape of this mechanism.)

7. Simple action language

   Only a few operations need to be supported in the actions.  To
   simplify code generation for multiple languages, it is a good idea
   to define a simple action language that is independent of the
   target language, so that the actions do not have to be maintained
   separately for each target.

8. Multilingual code generation

   It is desirable to support at least PHP, Javascript and C.
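
As an illustration of points 3 and 4, here is a rough sketch of how a
generated match loop might combine dynamically toggled token
productions with the character-class fallback.  This is only a
hand-written approximation of what the generator could emit, not part
of the design itself; all class, field and method names are invented,
and the special-character set is just an example.

import java.util.BitSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One entry of the token table: a coarse pattern for candidate matching,
// plus an optional PCRE-style pattern for the final accept decision.
class Production {
    final String name;
    final Pattern candidate;   // simple regular pattern, anchored at the start
    final Pattern accept;      // may be null: accept whatever the candidate matched

    Production(String name, String candidate, String accept) {
        this.name = name;
        this.candidate = Pattern.compile("^(?:" + candidate + ")");
        this.accept = accept == null ? null : Pattern.compile(accept);
    }
}

class SketchLexer {
    enum CharClass { NEWLINE, SPECIAL, ORDINARY, IGNORED }

    private final Production[] productions;
    private final BitSet enabled;                     // toggled as the context changes
    private String specialChars = "[]{}<>'|=*#:;";    // runtime configurable

    SketchLexer(Production[] productions) {
        this.productions = productions;
        this.enabled = new BitSet(productions.length);
        this.enabled.set(0, productions.length);
    }

    // The "toggling" mechanism of point 4: enable/disable one production.
    void setEnabled(int production, boolean on) {
        enabled.set(production, on);
    }

    CharClass classify(char c) {
        if (c == '\n') return CharClass.NEWLINE;
        if (specialChars.indexOf(c) >= 0) return CharClass.SPECIAL;
        if (c == '\r') return CharClass.IGNORED;
        return CharClass.ORDINARY;
    }

    // Returns the text of the next token starting at 'pos' (assumes pos < input.length()).
    String next(String input, int pos) {
        String rest = input.substring(pos);
        for (int i = enabled.nextSetBit(0); i >= 0; i = enabled.nextSetBit(i + 1)) {
            Matcher m = productions[i].candidate.matcher(rest);
            if (m.find()) {
                String text = m.group();
                if (productions[i].accept == null
                        || productions[i].accept.matcher(text).matches()) {
                    return text;                      // token named productions[i].name
                }
            }
        }
        // Fallback of point 3: lump together the longest run of the same class.
        CharClass cls = classify(rest.charAt(0));
        int end = 1;
        while (end < rest.length() && classify(rest.charAt(end)) == cls) end++;
        return rest.substring(0, end);
    }
}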
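
And for point 5, a much-simplified sketch of resolving apostrophe
pseudo tokens at the end of a line.  MediaWiki's real quote rules are
considerably more involved (they inspect surrounding characters to
decide which bold run to demote), so treat this only as an illustration
of the pseudo-token idea; the class and the resolution rules here are
invented for the example.

import java.util.List;

// Pseudo token emitted by the lexer for a run of apostrophes; its meaning
// (italic toggle, bold toggle, or plain text) is only decided at end of line.
class ApostropheRun {
    final int length;          // number of consecutive apostrophes
    String resolved;           // filled in by resolveLine()

    ApostropheRun(int length) { this.length = length; }
}

class ApostropheResolver {
    // Resolve all apostrophe runs on one line once the whole line has been seen.
    static void resolveLine(List<ApostropheRun> runs) {
        for (ApostropheRun run : runs) {
            switch (run.length) {
                case 2:  run.resolved = "ITALIC_TOGGLE";      break;
                case 3:  run.resolved = "BOLD_TOGGLE";        break;
                case 5:  run.resolved = "BOLD_ITALIC_TOGGLE"; break;
                default: run.resolved = "TEXT";               break;  // simplified
            }
        }
        // A real implementation would also check that italics and bold are
        // balanced over the line and demote unmatched toggles to plain text.
    }
}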

The lexer generator will generate code that assumes that a runtime
environment implementing the following APIs exists in the target
language:

// Input

public interface InputStream {
    InputStreamMark mark();
    void rewind(InputStreamMark mark);
    void advance(int characters);
    String getText(InputStreamMark mark);
    int getLine();
    int getColumn();
}

// Token management

enum TokenType { ... }

public interface TokenFactory {
    <T extends Token> T newToken(TokenType tokenType);
}

public interface Token {
    void setText(String text);
    String getText();
    void setPosition(int line, int column);
}

public interface AttributeToken extends Token {
    void setAttributes(AttributeList attributeList);
}

public interface HeadingToken extends AttributeToken {
    void setLevel(int level);
}

public interface LinkToken extends Token {
    void setLinkTarget(String linkTarget);
}

public interface InternalLinkToken extends LinkToken {
    void setLinkType(LinkType linkType);
    void setTargetURL(String url);
}

public interface TagToken extends AttributeToken {
    void setTagName(String name);
}

// Output token stream

public interface TokenStream {
    void put(Token token);
    TokenStreamMark mark();
    Token takeBack();
    void discard(TokenStreamMark mark);
}

// Application interaction

public interface Application {
    boolean isValidLink(String text);
    boolean isValidURL(String text);
    boolean linkPrefixesEnabled();
    Set<String> getURLProtocols();
    Set<String> getBlockTags();
    Set<String> getInlineTags();
    void resolveLinks(Set<InternalLinkToken> linkTokens);
}
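
To make the intended use of these interfaces concrete, here is a rough
sketch of how generated code might speculatively lex an internal link
using the mark/rewind facilities above.  Only the interface methods
come from the listing; the class, the control flow and the
TokenType.INTERNAL_LINK_START constant are invented for illustration,
and the actual character consumption is elided.

// Hypothetical fragment of generated code: speculatively lex "[[...]]".
class GeneratedLexerFragment {
    private final InputStream in;
    private final TokenStream out;
    private final TokenFactory tokens;
    private final Application app;

    GeneratedLexerFragment(InputStream in, TokenStream out,
                           TokenFactory tokens, Application app) {
        this.in = in; this.out = out; this.tokens = tokens; this.app = app;
    }

    boolean tryInternalLink() {
        InputStreamMark inMark = in.mark();      // save input position
        TokenStreamMark outMark = out.mark();    // save output position

        // ... consume "[[", the link target, optional "|" and link text,
        // and "]]", emitting tokens into 'out' along the way ...
        String target = /* text collected between "[[" and "|" or "]]" */ "";

        if (app.isValidLink(target)) {
            LinkToken token = tokens.newToken(TokenType.INTERNAL_LINK_START);
            token.setLinkTarget(target);
            out.put(token);
            return true;                         // commit the speculation
        }

        in.rewind(inMark);                       // roll back the input...
        out.discard(outMark);                    // ...and the emitted tokens
        return false;                            // fall back to plain text
    }
}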

Best regards,

Andreas Jonsson
Platonides | 26 Jan 16:00 2011

Re: A specialized lexer generator for wikitext

Andreas Jonsson wrote:
> (Starting a paragraph with the words "From my" for some reason
> seems to be a problem for my email client.  Sorry about that.)

This is the fourth copy received here.  Perhaps the problem was not
with sending but with retrieving after sending?

Problems with lines beginning with "From" look like mbox storage issues.
