Romain Behar | 3 Jun 12:33 2004

On the fly XML parser


The attached example is an attempt at an "on the fly"
XML parser. XML files tend to get very big: the idea
is not to load the entire file into memory but to
parse it while it is being read.

The parser was modified to use prefix parsing, and the
rNode and rElement rules were merged into a new rXML
rule. We lose the element nesting information: a stack
needs to be set up to keep track of the nesting level
or to skip entire blocks.
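
A minimal sketch of such a stack, assuming the parser reports
start and end tags one at a time. NestingTracker and all of its
members are hypothetical names made up for illustration; none of
this is part of Hapy:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a Hapy class): remembers which elements
// are currently open so that nesting survives the flattened grammar.
class NestingTracker {
public:
    // Call when a start tag such as <name> has been parsed.
    void open(const std::string &name) { stack_.push_back(name); }

    // Call when an end tag such as </name> has been parsed.
    // Returns false if the tag does not match the innermost open element.
    bool close(const std::string &name) {
        if (stack_.empty() || stack_.back() != name)
            return false;
        stack_.pop_back();
        if (stack_.size() < skipDepth_)
            skipDepth_ = 0; // we have left the skipped block
        return true;
    }

    // Ignore everything inside the innermost open element.
    void skipCurrent() {
        if (skipDepth_ == 0)
            skipDepth_ = stack_.size();
    }

    bool skipping() const { return skipDepth_ != 0; }
    std::size_t depth() const { return stack_.size(); }

private:
    std::vector<std::string> stack_; // names of the open elements
    std::size_t skipDepth_ = 0;      // 0 means "not skipping"
};
```

The code that handles a parsed tag would call open() or close()
and check skipping() before doing any real work on the element.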

Could the prefix parser stop on defined rules? That
would avoid changing the grammar.

There is another tweak in the prefix parsing loop:

char c = file.get();
while(c != ' ' && c != '\n' && c != '\r' && c != '\t')
{
 ...
}

This keeps the parser from stopping in the middle of a
string (e.g. "<!--" in rComment).

Is there a better way to handle this trick?

 Regards,


Alex Rousskov | 3 Jun 17:30 2004

Re: On the fly XML parser

On Thu, 3 Jun 2004, Romain Behar wrote:

> The attached example is an attempt for an "on the fly" XML parser.
> XML files tend to get very big: the idea is not to load the entire
> file into memory but to parse it when loading the file.

That's a reasonable approach provided you need to extract something
small from the XML file as opposed to load and manipulate the entire
file.

> The parser was modified to use prefix parsing, and the rNode and
> rElement merged into a new rXML rule. We lose the element nesting
> information: a stack needs to be set up to keep track of nesting
> level or skip entire blocks.
>
> Could the prefix parser stop on defined rules, that would avoid
> changing the grammar?

Do you mean stop if any of the specified rules match, as opposed
to stopping when the top-most rule (the grammar) matches? It sounds
like you may want to use semantic actions attached to the rules you
are interested in:

	http://www.hapy.org/actions.html

We probably need more experience with this, but for now the following
rules of thumb seem accurate to me:

	- Use prefix parsing to handle a stream of "objects",
	  where each object has the same grammar

Alex Rousskov | 1 Jul 22:15 2004

Re: On the fly XML parser

On Thu, 3 Jun 2004, Romain Behar wrote:

> There is another tweak in the prefix parsing loop:
> ...
> helps the parser not to stop on string parsing (e.g.
> "<!--" in rComment);
> is there a better way to handle this trick?

Romain's use case uncovered a bug in the prefix parsing code. The
patch is attached and will be included in the 0.0.7 release. The bug
does not affect parsers that have access to the entire input.

Thank you,

Alex.


-- 
Protocol performance, functionality, and reliability testing.
Tools, services, and know-how.
http://www.measurement-factory.com/
Index: src/Algorithms.cc
===================================================================
RCS file: /usr/local/CVS/TmfBase/Hapy/src/Algorithms.cc,v
retrieving revision 1.34
diff -u -r1.34 Algorithms.cc
--- src/Algorithms.cc	3 Mar 2004 04:07:46 -0000	1.34
+++ src/Algorithms.cc	1 Jul 2004 05:15:40 -0000
@@ -654,18 +654,12 @@
 }