Richard Y. Kim | 18 May 2002 06:59

next language to be supported by semantic

Greetings,

I think the best way for me to gain active understanding of all
aspects of semantic is to go through the process of writing the parser
for a new language.  I plan to improve semantic documentation as I go
along.

Python seems like a good choice. It seems to be popular, and so I am
interested in learning it. Perl's hairy syntax is too much for me to
even consider it.

Before I jump in, I thought I should ask for your comments.
Any thoughts?

Richard Y. Kim | 26 May 2002 12:45

python's use of indentation for block structure

>>>>> "EL" == Eric M Ludlam <eric <at> siege-engine.com> writes:
    EL>   [ ... ]
    EL> 
    EL> One of the things in common with all the languages
    EL> supported by Semantic is that they use some sort of
    EL> list construct such as { } to surround function
    EL> bodies.  Make is an exception and uses a lex stage
    EL> hack to get around it.  This is used to allow the
    EL> parser to run fast enough that users don't notice.
    EL> 
    EL> Python does not use { } around its function body
    EL> which can only be detected via a parsing stage so a
    EL> lex hack cannot be used to reliably get around the
    EL> problem.

I'm not so sure that the lexer cannot provide the necessary
tokens, so that the parser for python is not much different
from the one for C or Java. Python does not use braces to
express block structure. Instead it uses indentation. The
Python language reference manual (see
<http://www.python.org/doc/current/ref/indentation.html>)
seems to assume that the lexer generates INDENT and DEDENT
tokens, which seem to be equivalent to the { and } tokens in
C. For example

1 def f(x):
2 	simple_statement
3 	compound_statement
4 		simple_statement
5 		simple_statement
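
For comparison, the INDENT and DEDENT tokens discussed above can be
observed with Python's own standard `tokenize' module.  This is only
an illustrative sketch: the sample function body is made up, and
nothing here is part of semantic.

```python
import io
import tokenize

# Made-up sample with two nested blocks, mirroring the outline above.
SOURCE = (
    "def f(x):\n"
    "    y = x\n"
    "    if y:\n"
    "        y = 1\n"
    "        y = 2\n"
)

def token_names(source):
    """Return the tokenizer's token type names for SOURCE, in order."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(readline)]

names = token_names(SOURCE)
# Keep only the tokens that carry block structure.
block_tokens = [n for n in names if n in ("INDENT", "DEDENT")]
print(block_tokens)
```

Each INDENT is eventually matched by a DEDENT, which is what makes
the pair usable like { and } in a grammar.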
[...]

Richard Y. Kim | 26 May 2002 12:48

python Grammar file

Eric,

I forgot to attach the grammar file as I said I would in my last
email.  So here it is.  Note the small size!

# Grammar for Python

# Note:  Changing the grammar specified in this file will most likely
#        require corresponding changes in the parser module
#        (../Modules/parsermodule.c).  If you can't make the changes to
#        that module yourself, please co-ordinate the required changes
#        with someone who can; ask around on python-dev for help.  Fred
#        Drake <fdrake <at> acm.org> will probably be listening there.

# Commands for Kees Blom's railroad program
#diagram:token NAME
#diagram:token NUMBER
#diagram:token STRING
#diagram:token NEWLINE
#diagram:token ENDMARKER
#diagram:token INDENT
#diagram:output\input python.bla
#diagram:token DEDENT
#diagram:output\textwidth 20.04cm\oddsidemargin  0.0cm\evensidemargin 0.0cm
#diagram:rules

# Start symbols for the grammar:
#	single_input is a single interactive statement;
#	file_input is a module or sequence of commands read from an input file;
#	eval_input is the input for the eval() and input() functions.
[...]

Richard Y. Kim | 26 May 2002 13:20

semantic lexer for python

Eric,

I would like to share with you some difficulties and possible
solutions that I've encountered as I attempt to write a semantic
lexer for python.  I include what I have so far for
semantic-python.el, as well as a diff for semantic.el, at the end
of this email.

Problem #1: No 'newline token is generated if a line containing code
also contains a trailing comment.  This is easily demonstrated using
the following makefile:

  0: #
  1: all : one
  2:
  3: one :        # one
  4:      echo one

The token list generated by semantic-flex is

  ((symbol        3 .  6)   # 1: all
   (whitespace    6 .  7)   # 1:
   (punctuation   7 .  8)   # 1: :
   (whitespace    8 .  9)   # 1:
   (symbol        9 . 12)   # 1: one
   (newline      12 . 13)   # 1:
   (newline      13 . 14)   # 2:
   (symbol       14 . 17)   # 3: one
   (whitespace   17 . 18)   # 3:
   (punctuation  18 . 19)   # 3: :
                            # 3: MISSING newline HERE
[...]

David Ponce | 28 May 2002 17:46

Re: semantic lexer for python

Hi Eric & Richard,

Following Richard's work and remarks about `semantic-flex' and python,
I am submitting the attached version of `semantic-flex', hacked to
catch indentation.  I have not tested it extensively, but it could be
a starting point to enhance `semantic-flex' ;-)

I defined a new buffer-local option `semantic-flex-enable-indents'.
When non-nil, `semantic-flex' catches indentation (at the beginning of
lines) and inserts corresponding pseudo-syntactic 'indent tokens in
the returned token stream.  I call such tokens pseudo-syntactic
because they don't actually match data in the input source
(`semantic-flex' doesn't move point when it catches one).

I used the form (indent . N), where N is the `current-indentation'
value, because I think it could be more useful than a token of the
form (indent START . END), particularly because the true indentation
value can differ from (- END START) when there are tab characters.
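
To make the tab point concrete, here is a small sketch of how an
indentation column can differ from the number of leading characters.
It is written in Python for illustration only; the function name and
the tab width of 8 are assumptions, not part of `semantic-flex'.

```python
def indentation_column(line, tab_width=8):
    """Return the column of the first non-blank character of LINE.

    A tab advances to the next multiple of TAB_WIDTH, so the result
    can be larger than the number of leading characters.
    """
    column = 0
    for ch in line:
        if ch == " ":
            column += 1
        elif ch == "\t":
            column += tab_width - (column % tab_width)
        else:
            break
    return column

line = "\tfoo"
leading_chars = len(line) - len(line.lstrip(" \t"))
print(leading_chars, indentation_column(line))
```

For a line starting with a single tab, only one character is
consumed, yet the indentation column is 8.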

Thus, after evaluating something like this:

(let ((semantic-flex-enable-indents t)
      (semantic-flex-enable-whitespace t))
  (semantic-flex-buffer))

it is possible to get the following stream:

((indent . 8) (whitespace 1 . 2) (symbol 2 . 5))

from a buffer containing:
[...]

Richard Y. Kim | 29 May 2002 08:50

Re: semantic lexer for python

David,

I prefer your `semantic-flex-enable-indents' based code in all
respects to my initial gross hack. Your code is simpler and
more general.

I don't yet understand what wisent-flex offers, but I assume
it can keep track of "stack of indentations" and properly
compute INDENT and DEDENT tokens for use by the parser. I'll
study wisent-flex so that I understand what you are talking
about.  After that, I'll see if I can use your modified
semantic-flex along with wisent-flex and see if I can finish
off the python lexer.
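
The "stack of indentations" idea mentioned above can be sketched as
follows.  This is a minimal Python illustration of the scheme
described in the Python reference manual, not wisent-flex code; the
function name and the ("LINE", N) markers are made up for the
example.

```python
def indent_dedent_tokens(widths):
    """Turn per-line indentation widths into INDENT/DEDENT tokens.

    WIDTHS holds one indentation column per logical line.  Each line
    is represented by a ("LINE", width) marker in the result.
    """
    stack = [0]      # enclosing indentation levels, innermost last
    tokens = []
    for width in widths:
        if width > stack[-1]:
            # Deeper than the current block: open exactly one block.
            stack.append(width)
            tokens.append("INDENT")
        else:
            # Close every block that is deeper than this line.
            while width < stack[-1]:
                stack.pop()
                tokens.append("DEDENT")
            if width != stack[-1]:
                raise IndentationError(
                    "dedent to column %d matches no outer level" % width)
        tokens.append(("LINE", width))
    # Close any blocks still open at end of input.
    tokens.extend("DEDENT" for _ in stack[1:])
    return tokens

print(indent_dedent_tokens([0, 4, 8, 0]))
```

Note that a single line can produce several DEDENT tokens, which is
why a stack is needed rather than just the previous line's width.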

Thanks for the good ideas.

>>>>> "DP" == David Ponce <david <at> dponce.com> writes:
    DP> 
    DP> Hi Eric & Richard,
    DP> Following Richard's work and remarks about `semantic-flex' and python,
    DP> I am submitting the attached version of `semantic-flex', hacked to
    DP> catch indentation.  I have not tested it extensively, but it could be
    DP> a starting point to enhance `semantic-flex' ;-)
    DP> 
    DP> I defined a new buffer-local option `semantic-flex-enable-indents'.
    DP> When non-nil, `semantic-flex' catches indentation (at the beginning of
    DP> lines) and inserts corresponding pseudo-syntactic 'indent tokens in
    DP> the returned token stream.  I call such tokens pseudo-syntactic
    DP> because they don't actually match data in the input source
    DP> (`semantic-flex' doesn't move point when it catches one).
[...]

Richard Y. Kim | 29 May 2002 09:26

Re[2]: semantic lexer for python

Eric and Dave,

>>>>> "EL" == Eric M Ludlam <eric <at> siege-engine.com> writes:
    EL> 
    EL> That is an interesting idea.  It seems interesting
    EL> that in your example the indent and whitespace
    EL> reference the same text?

Yes, in most cases.  In Dave's code, however, INDENT tokens
may be generated by the empty string at the beginning of
lines without leading whitespace!

Also, a key difference between INDENT and whitespace tokens is
that Dave's INDENT token does not consume any input
characters!  Dave stuck an entry in the middle of the `cond'
clauses that may generate INDENT tokens, but it does not
move the current point.  There is no infinite recursion,
because the cond clause Dave added always evaluates to `nil',
so that it *always* goes on to the next cond clause.  I had
to look at the code for a couple of minutes before I
understood what was going on.  Completely legal code, but
an unusual use of the `cond' form.  I have no problem with the
code so long as we add a comment in capital letters explaining
what is going on.
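
The zero-width trick is perhaps easier to see in a toy lexer.  The
Python sketch below is only an analogy, not Dave's code: all names
are made up, positions are 0-based rather than Emacs buffer
positions, and only spaces are counted as indentation.  The
indent step records a token but never advances the position, so the
ordinary token rules still run on the same text.

```python
import re

# One alternative per token class; "." is the catch-all for punctuation.
TOKEN_RE = re.compile(
    r"(?P<whitespace>[ \t]+)|(?P<newline>\n)|(?P<symbol>\w+)|(?P<punctuation>.)"
)

def flex(text, enable_indents=True, enable_whitespace=True):
    """Toy lexer returning tokens with 0-based [start, end) spans."""
    tokens = []
    pos = 0
    while pos < len(text):
        at_line_start = pos == 0 or text[pos - 1] == "\n"
        if enable_indents and at_line_start:
            # ZERO-WIDTH STEP: RECORD THE INDENT, DO NOT ADVANCE POS,
            # THEN FALL THROUGH TO THE ORDINARY TOKEN RULES BELOW.
            end = pos
            while end < len(text) and text[end] == " ":
                end += 1
            tokens.append(("indent", end - pos))
        match = TOKEN_RE.match(text, pos)
        kind = match.lastgroup
        if kind != "whitespace" or enable_whitespace:
            tokens.append((kind, match.start(), match.end()))
        pos = match.end()  # only real tokens advance the position
    return tokens

both = flex("a\n  b")
indents_only = flex("a\n  b", enable_whitespace=False)
print(both)
```

In this sketch the two options are independent as well: turning off
whitespace tokens leaves the indent tokens in place, and an indent of
0 is still reported for a line with no leading whitespace.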

Despite the fact that Dave turned on both
semantic-flex-enable-indents and
semantic-flex-enable-whitespace in his sample code, the two
are independent features, i.e., they can be turned on and off
independently.  I say this, because my first concern when I
[...]

David Ponce | 29 May 2002 10:48

Re: semantic lexer for python

Hi Richard & Eric,

[...]
> Yes, in most cases.  In Dave's code, however, INDENT tokens
> may be generated by the empty string at the beginning of
> lines without leading whitespace!

That is what I tried to achieve ;-)

> Also, a key difference between INDENT and whitespace tokens is
> that Dave's INDENT token does not consume any input
> characters!  Dave stuck an entry in the middle of the `cond'
> clauses that may generate INDENT tokens, but it does not
> move the current point.  There is no infinite recursion,
> because the cond clause Dave added always evaluates to `nil',
> so that it *always* goes on to the next cond clause.  I had
> to look at the code for a couple of minutes before I
> understood what was going on.  Completely legal code, but
> an unusual use of the `cond' form.  I have no problem with the
> code so long as we add a comment in capital letters explaining
> what is going on.

You're right! It is an unusual use of `cond' that should be
emphasized.  Unless you (or Eric) have a better idea on how to
implement that ;-)

> Despite the fact that Dave turned on both
> semantic-flex-enable-indents and
> semantic-flex-enable-whitespace in his sample code, the two
> are independent features, i.e., they can be turned on and off
[...]

David Ponce | 29 May 2002 13:16

Re: semantic lexer for python

To continue on this subject, and to illustrate how a wisent lexer could be written for python based on my
previous hack of `semantic-flex', I just wrote the following basic piece of code (untested) ;-)

(defvar wisent-python-last-indent nil
  "The last level of indentation encountered so far.
Should be reset before starting a new parse task.")

(defun wisent-python-lexer ()
  "Return the next Python lexical token available.
Filter any `semantic-flex' 'indent tokens available to produce (INDENT
N) or (DEDENT N) lexical tokens needed to parse python code.  Other
`semantic-flex' tokens are handled in a normal way by `wisent-flex'."
  (let (wlex curr-indent last-indent)
    ;; Digest `semantic-flex' 'indent tokens
    (while (and (not wlex) (eq (caar wisent-flex-istream) 'indent))
      (setq curr-indent (cdar wisent-flex-istream)
            last-indent (or wisent-python-last-indent 0)
            wisent-python-last-indent curr-indent
            wisent-flex-istream (cdr wisent-flex-istream))
      (cond
       ;; No indentation change
       ((= curr-indent last-indent)) ;; Just eat 'indent token
       ;; Indentation increased
       ((> curr-indent last-indent)
        ;; Return an INDENT lexical token
        (setq wlex (list 'INDENT (- curr-indent last-indent))))
       ;; Indentation decreased
       (t
        ;; Return a DEDENT lexical token.  No indentation stack is
        ;; kept here, so a drop of several levels yields one DEDENT.
        (setq wlex (list 'DEDENT (- last-indent curr-indent))))))
[...]

Eric M. Ludlam | 29 May 2002 14:23

Re[3]: semantic lexer for python

Thanks for explaining.

Eric

>>> "Richard Y. Kim" <ryk <at> dspwiz.com> seems to think that:
>Eric and Dave,
>
>>>>>> "EL" == Eric M Ludlam <eric <at> siege-engine.com> writes:
>    EL> 
>    EL> That is an interesting idea.  It seems interesting
>    EL> that in your example the indent and whitespace
>    EL> reference the same text?
>
>Yes, in most cases.  In Dave's code, however, INDENT tokens
>may be generated by the empty string at the beginning of
>lines without leading whitespace!
>
>Also, a key difference between INDENT and whitespace tokens is
>that Dave's INDENT token does not consume any input
>characters!  Dave stuck an entry in the middle of the `cond'
>clauses that may generate INDENT tokens, but it does not
>move the current point.  There is no infinite recursion,
>because the cond clause Dave added always evaluates to `nil',
>so that it *always* goes on to the next cond clause.  I had
>to look at the code for a couple of minutes before I
>understood what was going on.  Completely legal code, but
>an unusual use of the `cond' form.  I have no problem with the
>code so long as we add a comment in capital letters explaining
>what is going on.
>
[...]

