Stuart A. Yeates | 1 May 09:11 2011

eXist optimisation and large TEI collections

I have some questions about eXist optimisation.

Background: I have a project called He Kupu Tawhito (
https://github.com/stuartyeates/He-Kupu-Tawhito assumes ubuntu/make)
which builds multilingual concordances based on TEI/XML using eXist.
The example I'll be using here downloads two publicly available bibles
in the user's choice of languages and builds a multilingual concordance
which is then queried in eXist using XQuery. Most of the heavy lifting
is done with XSLT. I'm a TEI user (
http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index-toc.html ),
so both my input and output from eXist are pure TEI; I'm converting to
xhtml with XSLT in the browser for display. I'm using 64 bit Java on a
64 bit Ubuntu and giving Java 6GB of RAM (of 8GB physical RAM
available). My locale is en_NZ.UTF-8 and I'm using firefox for
display.

I have several GB of XML I'd like to put into He Kupu Tawhito, but I'm
having difficulty scaling above a couple of dozen MB of XML. Others in
the TEI community have reported similar issues, informally. Discussion
with Jens Østergaard Petersen convinced me that it may be worth having
another go at scaling this, and thus this message.

The best example I can give is this graph which shows query time vs
the size of the contents, both with and without collection.xconf.

https://spreadsheets.google.com/ccc?key=0AtkIjlDqC2H4dFBQM0V4SjVTdTdlQnFfWWpDaks5d0E&hl=en_GB&authkey=CJDPuY8F

The 50 chapter mark represents 5.9 MB of TEI/XML in a single file. The
problem I have with that graph is that as the size of the collection
(in chapters) increases, the query time appears to go up linearly (or
[...]

Stefan Majewski | 1 May 14:28 2011

Re: eXist optimisation and large TEI collections

Hi Stuart,

I just pulled your project from git, but to be honest I found it rather hard to get everything in place to run it. After changing everything so that I can work with less RAM (I'm on 32 bit on my home machine), I finally got it up and running, but the experiments have issues: here I see it starting the eXist instance again and again. As I'm a little limited in time right now, I stopped trying to get this working, as I assume it is due to some assumptions about the system in the makefile. Therefore, I tried to focus on the things I could comment on without actually testing it myself. If you could provide a makefile that takes just the URI and password of an eXist-db instance without all this setup stuff, I'd be happy to test it, though. However, I tried to assemble a few suggestions (just suggestions) based on my experience. I do not claim that they are fit to solve all your problems.

> xhtml with XSLT in the browser for display. I'm using 64 bit Java on a
> 64 bit Ubuntu and giving Java 6GB of RAM (of 8GB physical RAM
> available).

That is huge. We work with less than 2GB, sometimes with dozens of concurrent users and rather complex queries (filtering based on structural and indexed criteria, intersection of result sets etc.), and that is sufficient. I don't know if you hit the memory limit, but I doubt it. One difference may be that we put our results in a session variable on the server and serialize only the portion that fits in the browser. From there you can navigate from result page to result page. Some queries would yield a million hits, which would obviously take some time and memory to serialize. In our experience, this is an area where significant performance improvements may be gained.
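
A minimal sketch of that session-based paging, assuming the request and session extension modules are available. The parameter names "kupu" and "start", the session attribute name "hits" and the <results> wrapper are hypothetical, and I assume the data sits in the TEI namespace under /db/he_kupu_tawhito; adapt it to your own query rather than taking it as our actual code.

```xquery
xquery version "1.0";

declare namespace request = "http://exist-db.org/xquery/request";
declare namespace session = "http://exist-db.org/xquery/session";
(: assumption: the documents are in the TEI namespace :)
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: hypothetical request parameters: "kupu" is the lemma to search for,
   "start" is the position of the first hit on the requested page :)
let $kupu := request:get-parameter("kupu", ())
let $start := xs:integer(request:get-parameter("start", "1"))
let $page-size := 15

(: make sure there is a session to store the hits in :)
let $session := session:create()

(: run the query only when a search term arrives; a request that carries
   only "start" reuses the hits stored by the previous search :)
let $hits :=
    if ($kupu) then
        let $found := collection("/db/he_kupu_tawhito")//tei:w[@lemma = $kupu]
        let $stored := session:set-attribute("hits", $found)
        return $found
    else
        session:get-attribute("hits")

(: serialize only the slice for the current page :)
return
    <results total="{count($hits)}" start="{$start}">
    { subsequence($hits, $start, $page-size) }
    </results>
```

The point of the design is that the expensive selection runs once per search, while each page request only pays for serializing 15 items.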


> I have several GB of XML I'd like to put into He Kupu Tawhito, but I'm
> having difficulty scaling above a couple of dozen MB of XML. Others in
> the TEI community have reported similar issues, informally.

I would assume that you should certainly not be limited to a couple of dozen MB. There has to be something wrong, then.

> with Jens Østergaard Petersen convinced me that it may be worth having
> another go at scaling this, and thus this message.

I remember your harsh statement about the "failure to scale up" on TEI-L (without any hedging, indicating that your queries might have a share in this). Mhh. Let's see, I guess it is still possible to improve them, though.

> The best example I can give is this graph which shows query time vs
> the size of the contents, both with and without collection.xconf.
>
> https://spreadsheets.google.com/ccc?key=0AtkIjlDqC2H4dFBQM0V4SjVTdTdlQnFfWWpDaks5d0E&hl=en_GB&authkey=CJDPuY8F

As far as I have seen from your makefile, this is the overall time of the request to the query. While in practice this is what actually matters, it is interesting to ask what the reason might be. As I was not able to quickly test it with the makefile, I assume from the code that the resulting page is of moderate size (it only fetches the first 15 whatever-it-might-be, right?)? If the time is spent selecting the elements from the database it is probably an indicator that indexes are either not used or the query optimizer was not able to optimize your query.


> The 50 chapter mark represents 5.9 MB of TEI/XML in a single file.

That's small. The performance you see is exceptionally bad.

> (1) In my makefile I'm loading my files using:
>
> $(EXIST_HOME)/bin/client.sh
> uri=xmldb:exist://localhost:8081/exist/xmlrpc -m
> /db/system/config/db/he_kupu_tawhito/ -p collection.xconf --no-gui
> $(EXIST_HOME)/bin/client.sh
> uri=xmldb:exist://localhost:8081/exist/xmlrpc -m /db/he_kupu_tawhito/
> -p korero/www.biblegateway.com-sampler/import.words.xml --no-gui
>
> Will that result in that collection.xconf applying to import.words.xml ?

the procedure looks fine to me.

> (2) My collection.xconf and main xQuery are at:
>
> https://github.com/stuartyeates/He-Kupu-Tawhito/blob/master/collection.xconf

I think you could use the qname index instead of path index for all the indexes you defined. The query optimizer will do a much better job, then.

> https://github.com/stuartyeates/He-Kupu-Tawhito/blob/master/kupu.xql

It could be worthwhile using a fulltext index for some cases (e.g. multiple values in @corresp). Maybe an appropriate fulltext index with the whitespace analyzer, as I don't know what the standard analyzer would do with the "#", and ft:contains could help especially for the cases where @corresp holds several values. But, from your query I would assume that a qname or ngram index should be sufficient.

    <create qname="lemma" type="xs:string"/>

you don't use this anywhere, right?

    <create qname="@xml:id" type="xs:string"/>
    <create qname="@xml:lang" type="xs:string"/>
    <create qname="@lemma" type="xs:string"/>

You should really make these qname (not path) indexes. I changed them accordingly. An index for @corresp, which you use in your query, is missing. It could be useful to define a qname index and an ngram or fulltext index here, depending on the kind of queries you want to perform.
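
Put together, a collection.xconf along these lines would cover the attributes used in the query. Treat it as a sketch based on the indexing documentation rather than a tested file: the @corresp entries are my additions, and whether you want the ngram index depends on whether you need substring matches on @corresp.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <!-- range indexes addressed by qname rather than by path -->
        <create qname="@xml:id" type="xs:string"/>
        <create qname="@xml:lang" type="xs:string"/>
        <create qname="@lemma" type="xs:string"/>
        <!-- @corresp is compared in the query but had no index so far -->
        <create qname="@corresp" type="xs:string"/>
        <!-- optional: ngram index for substring matches on @corresp -->
        <ngram qname="@corresp"/>
    </index>
</collection>
```

Documents stored before the configuration changes need to be reindexed (or stored again) for the new indexes to take effect.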

> (3) I've tinkered a reasonable amount with my xquery, but I won't
> profess to being an expert:
>
> https://github.com/stuartyeates/He-Kupu-Tawhito/blob/master/kupu.xql
>
> Is there anything obvious I'm doing wrong?

I found that it sometimes helps to break multiple predicates into several let expressions.

let $words := $this//w[@lemma=$kupu][@xml:lang=$reo]/@xml:id

let $words := $this//w[@lemma eq $kupu]
let $words := $words[@xml:lang eq $reo]

try to do the most restrictive predicate first. I think it is a good habit to use eq when not intending to deal with sequences ($kupu and $reo are just string values, right?). Shouldn't affect performance, here, though.

let $thisid := $this/@xml:id
let $thishash := concat('#', $thisid)

You don't use these anywhere later. Why define them then?


let $others := //p[contains($this/@corresp,@xml:id)][(concat('#',@xml:id)=$this/@corresp) or (concat('#',$this/@xml:id)=@corresp)] |
//p[contains(@corresp,$this/@xml:id)][(concat('#',@xml:id)=$this/@corresp) or (concat('#',$this/@xml:id)=@corresp)]

I would suspect that this line is causing much trouble! You can split it as the one above and then:


- contains($this/@corresp, @xml:id).
This does not use indexes! Maybe something like "@xml:id = tokenize($this/@corresp, ' ')" (note that I do not use "eq" here), or do you use fn:contains because of a prepended "#"? Then maybe something along the lines of "@xml:id = (for $i in tokenize($this/@corresp, ' ') return substring-after($i, '#'))". I assumed that you do this for 15 $this and do not have a huge number of @corresp values. You can make use of the index on @xml:id here.


- concat('#',@xml:id)=$this/@corresp
This does not use indexes either. Even if you have an index defined for @xml:id, it would not be used because of the concat.
The way you defined it, I would suspect that eXist has, in the worst case, to do a concat on all tei:p elements you have (linear scaling, then). Try maybe "@xml:id eq substring-after($this/@corresp,'#')" to use the index on @xml:id instead.

- concat('#',$this/@xml:id)=@corresp
Similar here. Here it is the missing index on @corresp, though.


The second part of the union has the same issues. The second predicate is exactly the same; for the first, just swap @xml:id and @corresp in my first explanation.
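
Taken together, the suggestions on the $others line could look roughly like the sketch below. It is untested against your data and rests on a few assumptions: the documents are in the TEI namespace, range indexes exist on both @xml:id and @corresp, and local:others and the small driver at the end are names I made up for illustration.

```xquery
xquery version "1.0";

(: assumption: the documents are in the TEI namespace :)
declare namespace tei = "http://www.tei-c.org/ns/1.0";

declare variable $local:data := "/db/he_kupu_tawhito";

(: hypothetical helper: the paragraphs linked to $this in either direction :)
declare function local:others($this as element()) as element()* {
    (: ids that $this points at: drop the leading '#' from each token of @corresp :)
    let $targets :=
        for $ref in tokenize(normalize-space($this/@corresp), ' ')
        return substring-after($ref, '#')
    (: the pointer another paragraph would use to refer back to $this :)
    let $self-ref := concat('#', $this/@xml:id)
    (: both predicates compare a bare attribute with a precomputed value,
       so the range indexes on @xml:id and @corresp can be used :)
    let $pointed-at := collection($local:data)//tei:p[@xml:id = $targets]
    let $pointing-here := collection($local:data)//tei:p[@corresp = $self-ref]
    return $pointed-at | $pointing-here
};

(: illustrative driver: related paragraphs for the first paragraph in the collection :)
let $first := (collection($local:data)//tei:p)[1]
return local:others($first)
```

Note that the comparison on @corresp only matches paragraphs whose @corresp holds exactly one pointer, which is also what your original concat comparison does. If other paragraphs carry several pointers in @corresp, that is where the ngram or fulltext index would come in.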


> (4) [...] using an index to pull a <p/> out of a 500 MB single file slower
> than pulling the same <p/> out of a 50 KB file sitting in a collection
> of 10K files?

I would not expect the big file to be slower.

> (5) TEI uses the standard xml:id and xml:lang tags and I make
> extensive use of both of these. I run xmllint over my input files to
> check for duplicate xml:ids. Does eXist have any special support for
> these? are there any common traps?

You can define indexes just as on any other attribute. I don't know what you mean with special support, though.


> (6) Currently I don't take any steps to update the indexes. Does eXist
> build the indexes listed in the collection.xconf as documents are
> loaded?

Yes, eXist updates the indexes when you store documents. Hence, this shouldn't be a problem. I think the issue is your query.


> (7) Are there other common traps and pitfalls that I should be
> checking? I've read http://exist.sourceforge.net/tuning.html

read http://exist.sourceforge.net/indexing.html

I think you will find some very important things there. Especially on the kinds of indexes certain functions (e.g. fn:contains) benefit from and why qname indexes are preferable.


I hope I figured out correctly what you are trying to do. If I am wrong with some of my assumptions or solutions, please feel free to correct me. I hope you find something useful in this response.

cheers,
Stefan

------------------------------------------------------------------------------
WhatsUp Gold - Download Free Network Management Software
The most intuitive, comprehensive, and cost-effective network 
management toolset available today.  Delivers lowest initial 
acquisition cost and overall TCO of any competing solution.
http://p.sf.net/sfu/whatsupgold-sd
_______________________________________________
Exist-open mailing list
Exist-open <at> lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/exist-open
Wolfgang Meier | 1 May 14:41 2011

Re: eXist optimisation and large TEI collections

FYI: I managed to generate the test XML via the makefile and started
to tweak the query. There are a few things to improve, starting with
the collection.xconf as Stefan already pointed out and within the main
query itself. I will write down what I did once I'm happy with it.

Wolfgang

Andy Bunce | 1 May 20:41 2011

problems with 1.4.1dev-rev14275-20110426

Hi,
Testing this war on ubuntu tomcat6. I get problems with the java client as below.
I notice the guest password and the documentation search problems are still there.

/Andy


andy@ThinkPad-T42:~$ javaws -Xclearcache http://localhost:8080/exist/webstart/exist.jnlp
net.sourceforge.jnlp.LaunchException: Fatal: Initialization Error: Could not initialize application.
    at net.sourceforge.jnlp.Launcher.createApplication(Launcher.java:651)
    at net.sourceforge.jnlp.Launcher.launchApplication(Launcher.java:420)
    at net.sourceforge.jnlp.Launcher$TgThread.run(Launcher.java:732)
Caused by: net.sourceforge.jnlp.LaunchException: Fatal: Application Error: Cannot grant permissions to unsigned jars.
    at net.sourceforge.jnlp.runtime.JNLPClassLoader.setSecurity(JNLPClassLoader.java:216)
    at net.sourceforge.jnlp.runtime.JNLPClassLoader.<init>(JNLPClassLoader.java:170)
    at net.sourceforge.jnlp.runtime.JNLPClassLoader.getInstance(JNLPClassLoader.java:249)
    at net.sourceforge.jnlp.Launcher.createApplication(Launcher.java:641)
    ... 2 more
Caused by:
net.sourceforge.jnlp.LaunchException: Fatal: Application Error: Cannot grant permissions to unsigned jars.
    at net.sourceforge.jnlp.runtime.JNLPClassLoader.setSecurity(JNLPClassLoader.java:216)
    at net.sourceforge.jnlp.runtime.JNLPClassLoader.<init>(JNLPClassLoader.java:170)
    at net.sourceforge.jnlp.runtime.JNLPClassLoader.getInstance(JNLPClassLoader.java:249)
    at net.sourceforge.jnlp.Launcher.createApplication(Launcher.java:641)
    at net.sourceforge.jnlp.Launcher.launchApplication(Launcher.java:420)
    at net.sourceforge.jnlp.Launcher$TgThread.run(Launcher.java:732)
andy@ThinkPad-T42:~$

Dannes Wessels | 1 May 22:22 2011

Re: problems with 1.4.1dev-rev14275-20110426


On 1 May 2011, at 20:41 , Andy Bunce wrote:

> Testing this war on ubuntu tomcat6. I get problems with the java client as below.
>
> andy@ThinkPad-T42:~$ javaws -Xclearcache http://localhost:8080/exist/webstart/exist.jnlp
> net.sourceforge.jnlp.LaunchException: Fatal: Initialization Error: Could not initialize application.
>     at net.sourceforge.jnlp.Launcher.createApplication(Launcher.java:651)
>     at net.sourceforge.jnlp.Launcher.launchApplication(Launcher.java:420)
>     at net.sourceforge.jnlp.Launcher$TgThread.run(Launcher.java:732)
> Caused by: net.sourceforge.jnlp.LaunchException: Fatal: Application Error: Cannot grant permissions to unsigned jars.


Unfortunately you do not provide sufficient details for us to help you. Some suggestions for what to include:

- ubuntu version
- which exist version, did you build yourself?
- which tomcat version
- ....

anyway, looking at the trace... there are no exist-db classes here.  Did you get the war file from www.exist-db.nl/files ? Maybe I forgot to sign the jar files.... 

Please could you use the SUN jvm and retry?

Regards

Dannes

--
Dannes Wessels
eXist-db Open Source Native XML Database
e: dannes <at> exist-db.org
w: http://www.exist-db.org 

Dannes Wessels | 1 May 22:31 2011

Re: problems with 1.4.1dev-rev14275-20110426

Hi,

On 1 May 2011, at 22:22 , Dannes Wessels wrote:

> anyway, looking at the trace... there are no exist-db classes here.  Did you get the war file from www.exist-db.nl/files ? Maybe I forgot to sign the jar files....
>
> Please could you use the SUN jvm and retry?

I rechecked the war file I uploaded, and all jar files turn out to be signed correctly. I think the java version installed on ubuntu is not compatible (enough) with the Java5/6 version of Sun/Oracle.

I'd recommend installing Sun/Oracle Java6. A recent version of OpenJDK#7 might work as well, but we did not test this ourselves.

thnx

Dannes


--
Dannes Wessels
eXist-db Open Source Native XML Database
e: dannes <at> exist-db.org
w: http://www.exist-db.org 

Kenneth Reid Beesley | 1 May 22:46 2011

Re: Newbie: splitting a dictionary into individual <entry> files, almost working


Thanks Joe,

xmldb:encode-uri() solved my problem.

Ken

On 30Apr2011, at 10:06, Joe Wicentowski wrote:

> Hi Ken,
> 
>>> Question 1:  Can collection names in eXist contain Unicode characters beyond
>>> the ASCII range?
> 
> Yes.  A similar question came up in this thread about ampersands in
> collection names (http://markmail.org/message/5rtwgh6bxlqbkvty):
> 
>> Check out the xmldb:encode-uri() and xmldb:decode-uri() functions --
>> see http://demo.exist-db.org/exist/functions/xmldb/encode-uri and
>> http://demo.exist-db.org/exist/functions/xmldb/decode-uri.
> 
> To speak to your case,
> 
>  xmldb:create-collection('/db', xmldb:encode-uri("Ö"))
> 
> returns:
> 
>  /db/%C3%96
> 
> This may look odd, but it is the URI-encoded version of Ö, and it
> appears correctly when browsing your collection hierarchy in oXygen's
> Database Explorer view, or in the eXist admin Browse Collections
> panel.  These applications know how to encode and decode URIs
> properly.  Indeed, since your application may be handed a character
> you don't expect, it's good practice to encode and decode your
> database URIs.  I'd encourage you to browse the source of the eXist
> admin Browse Collections to see encoding and decoding in action
> according to this best practice:
> 
> http://exist.svn.sourceforge.net/viewvc/exist/trunk/eXist/webapp/admin/browse.xqm?revision=14352&view=markup
> 
> You'll see the app encoding a URI when it is making a request to the
> database -- and decoding a URI when it is displaying a collection (or
> resource) name to the user.  Special characters might seem like more
> trouble than they're worth, but if you consistently encode and decode
> the URI you can handle anything your app is presented with.
> 
> Cheers,
> Joe

******************************
Kenneth R. Beesley, D.Phil.
P.O. Box 540475
North Salt Lake, UT
84054  USA

Andy Bunce | 1 May 23:55 2011

Re: problems with 1.4.1dev-rev14275-20110426

Using the SUN jvm fixed it.
Thanks /Andy

On Sun, May 1, 2011 at 9:31 PM, Dannes Wessels <dannes <at> exist-db.org> wrote:
> Hi,
>
> On 1 May 2011, at 22:22 , Dannes Wessels wrote:
>
>> anyway, looking at the trace... there are no exist-db classes here.  Did you get the war file from www.exist-db.nl/files ? Maybe I forgot to sign the jar files....
>>
>> Please could you use the SUN jvm and retry?
>
> I rechecked the war file I uploaded, and all jar files turn out to be signed correctly. I think the java version installed on ubuntu is not compatible (enough) with the Java5/6 version of Sun/Oracle.
>
> I'd recommend installing Sun/Oracle Java6. A recent version of OpenJDK#7 might work as well, but we did not test this ourselves.
>
> thnx
>
> Dannes
>
> --
> Dannes Wessels
> eXist-db Open Source Native XML Database
> e: dannes <at> exist-db.org
> w: http://www.exist-db.org

Andrzej Jan Taramina | 2 May 09:16 2011

eXide feature request....

Wolfgang:

I noticed that you can edit xml files with eXide.  Very cool....makes it easy to modify xml-based
configuration files and/or data files on the fly for debugging or other purposes.

However, if you show a collection that has a lot of sub-collections, they seem to show in pretty
much random order.  It would be rather nice if the sub-collections could be sorted in ascending
alphabetic order in the file picker collection treeview.

Thanks!

-- 
Andrzej Taramina
Chaeron Corporation: Enterprise System Solutions
http://www.chaeron.com

Dannes Wessels | 2 May 09:37 2011

Re: problems with 1.4.1dev-rev14275-20110426

Hi

On Sun, May 1, 2011 at 11:55 PM, Andy Bunce <bunce.andy <at> gmail.com> wrote:
> Using the SUN jvm fixed it.
> Thanks /Andy

To learn from this issue, please could you post the ubuntu version and
the java version that was actually used?

thnx

Dannes

-- 
eXist-db Native XML Database - http://exist-db.org
Join us on linked-in: http://www.linkedin.com/groups?gid=35624
