Hi Stuart,
I just pulled your project from git, but to be honest I found it
rather hard to get everything in place to run it. After changing
the settings so that it works with less RAM (I'm on 32 bit on my
home machine), I finally got it up and running, but the experiments
have issues: I see it start the eXist instance again and again. As
I'm a little limited in time right now, I stopped trying to get this
working; I assume it is due to some assumptions about the system in
the makefile. Therefore, I tried to focus on the things I could
comment on without actually testing it myself. If you could provide
a makefile that takes just the URI and password of an eXist-db
instance, without all this setup stuff, I'd be happy to test it,
though. In the meantime, I have tried to assemble a few suggestions
(just suggestions) based on my experience. I do not claim that they
will solve all your problems.
> xhtml with XSLT in the browser for display. I'm using 64 bit Java on a
> 64 bit Ubuntu and giving Java 6GB of RAM (of 8GB physical RAM
> available).
That is huge. We work with less than 2GB, sometimes with dozens of
concurrent users and rather complex queries (filtering based on
structural and indexed criteria, intersection of result sets, etc.),
and that is sufficient. I don't know if you hit the memory limit,
but I doubt it. One difference may be that we put our results in a
session variable on the server and serialize only the portion that
fits in the browser. From there you can navigate from result page to
result page. Some queries would yield a million hits, which would
obviously take some time and memory to serialize. In our experience,
this is an area where significant performance improvements can be
gained.
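To illustrate the session approach, here is a rough, untested sketch.
The parameter names "kupu" and "start", the session attribute names,
the tei prefix and the page size of 15 are my assumptions; adapt them
to your kupu.xql.

```xquery
xquery version "1.0";

import module namespace request = "http://exist-db.org/xquery/request";
import module namespace session = "http://exist-db.org/xquery/session";

declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: assumed parameters: "kupu" = word looked up, "start" = first hit to show :)
let $kupu := request:get-parameter("kupu", "")
let $start := xs:integer(request:get-parameter("start", "1"))
let $page-size := 15

(: reuse the hits kept in the session if they belong to the same query :)
let $hits :=
    if (session:get-attribute("kupu") = $kupu) then
        session:get-attribute("hits")
    else
        let $fresh := collection("/db/he_kupu_tawhito")//tei:w[@lemma = $kupu]
        let $null := (
            session:set-attribute("kupu", $kupu),
            session:set-attribute("hits", $fresh)
        )
        return $fresh

(: serialize only the current page, not the whole result set :)
return
    <results total="{count($hits)}" start="{$start}">
    { subsequence($hits, $start, $page-size) }
    </results>
```

The point is only that the expensive selection runs once per query,
while each page request serializes at most 15 nodes.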
> I have several GB of XML I'd like to put into He Kupu Tawhito, but I'm
> having difficulty scaling above a couple of dozen MB of XML. Others in
> the TEI community have reported similar issues, informally.
I would assume that you should certainly not be limited to a couple
of dozen MB. There has to be something wrong, then.
> with Jens Østergaard Petersen convinced me that it may be worth having
> another go at scaling this, and thus this message.
I remember your harsh statement about the "failure to scale up" on
TEI-L (made without any hedge acknowledging that your queries might
have a share in this). Mhh. Let's see; I guess it is still possible
to improve them, though.
> The best example I can give is this graph which shows query time vs
> the size of the contents, both with and without collection.xconf.
> https://spreadsheets.google.com/ccc?key=0AtkIjlDqC2H4dFBQM0V4SjVTdTdlQnFfWWpDaks5d0E&hl=en_GB&authkey=CJDPuY8F
As far as I can see from your makefile, this is the overall time
of the request that runs the query. While in practice this is what
actually matters, it is interesting to ask what the reason might be.
As I was not able to quickly test it with the makefile, I assume from
the code that the resulting page is of moderate size (it only fetches
the first 15 whatever-they-might-be, right?). If the time is spent
selecting the elements from the database, that probably indicates
that indexes are either not used or that the query optimizer was not
able to optimize your query.
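One crude way to separate selection time from serialization time is
to time the selection inside the query itself. This is an untested
sketch; the lemma "aroha" and the tei prefix are just placeholders I
made up for illustration.

```xquery
xquery version "1.0";

import module namespace util = "http://exist-db.org/xquery/util";

declare namespace tei = "http://www.tei-c.org/ns/1.0";

let $t0 := util:system-time()
(: "aroha" is a hypothetical lemma, replace it with a real one :)
let $hits := collection("/db/he_kupu_tawhito")//tei:w[@lemma = "aroha"]
(: counting forces the selection to be evaluated before the second timestamp :)
let $n := count($hits)
let $t1 := util:system-time()
return
    <timing hits="{$n}" selection="{$t1 - $t0}"/>
```

If the reported duration is already close to the overall request
time, the selection (and hence index usage) is the place to look.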
> The 50 chapter mark represents 5.9 MB of TEI/XML in a single file.
That's small. The performance you see is exceptionally bad.
> (1) In my makefile I'm loading my files using:
>
> $(EXIST_HOME)/bin/client.sh uri=xmldb:exist://localhost:8081/exist/xmlrpc -m /db/system/config/db/he_kupu_tawhito/ -p collection.xconf --no-gui
> $(EXIST_HOME)/bin/client.sh uri=xmldb:exist://localhost:8081/exist/xmlrpc -m /db/he_kupu_tawhito/ -p korero/www.biblegateway.com-sampler/import.words.xml --no-gui
>
> Will that result in that collection.xconf applying to
> import.words.xml?
The procedure looks fine to me.
> (2) My collection.xconf and main xQuery are at:
> https://github.com/stuartyeates/He-Kupu-Tawhito/blob/master/collection.xconf
I think you could use the qname index instead of path index for all
the indexes you defined. The query optimizer will do a much better
job, then.
> https://github.com/stuartyeates/He-Kupu-Tawhito/blob/master/kupu.xql
It could be worthwhile using a fulltext index for some cases (e.g.
multiple values in @corresp). Maybe an appropriate fulltext index
with the whitespace analyzer (I don't know what the standard
analyzer would do with the "#") and ft:contains could help,
especially for the cases where @corresp holds several values. But
from your query I would assume that a qname or ngram index should be
sufficient.
> <create qname="lemma" type="xs:string"/>
You don't use this anywhere, right?
> <create qname="@xml:id" type="xs:string"/>
> <create qname="@xml:lang" type="xs:string"/>
> <create qname="@lemma" type="xs:string"/>
You should really make these qname (not path) indexes; I changed them
accordingly. An index for @corresp, which you use in your query, is
missing. It could be useful to define a qname index and an ngram or
fulltext index here, depending on the kind of queries you want to
perform.
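Roughly what I have in mind for the index section, as an untested
sketch. The two @corresp lines are the addition; whether you want the
ngram index as well depends on your queries. The surrounding
namespace declarations are my assumption about your setup.

```xml
<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <!-- range indexes keyed on the attribute qname -->
        <create qname="@xml:id" type="xs:string"/>
        <create qname="@xml:lang" type="xs:string"/>
        <create qname="@lemma" type="xs:string"/>
        <!-- added: @corresp is compared in kupu.xql but had no index -->
        <create qname="@corresp" type="xs:string"/>
        <!-- optional: ngram index for substring lookups on @corresp -->
        <ngram qname="@corresp"/>
    </index>
</collection>
```

Keep in mind that documents stored before a change to
collection.xconf have to be reindexed (or loaded again) before the new
indexes are used.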
> (3) I've tinkered a reasonable amount with my xquery, but I won't
> profess to being an expert:
> https://github.com/stuartyeates/He-Kupu-Tawhito/blob/master/kupu.xql
> Is there anything obvious I'm doing wrong?
I found that it sometimes helps to break multiple predicates up into
several let expressions.
> let $words := $this//w[@lemma=$kupu][@xml:lang=$reo]/@xml:id
let $words := $this//w[@lemma eq $kupu]
let $words := $words[@xml:lang eq $reo]
(Append /@xml:id to the last step if you need the ids, as in your
version.) Try to apply the most restrictive predicate first. I think
it is a good habit to use eq when not intending to deal with
sequences ($kupu and $reo are just string values, right?). It
shouldn't affect performance here, though.
> let $thisid := $this/@xml:id
> let $thishash := concat('#', $thisid)
You don't use these anywhere later. Why define them then?
> let $others :=
>   //p[contains($this/@corresp, @xml:id)][(concat('#', @xml:id)=$this/@corresp) or (concat('#',$this/@xml:id)=@corresp)] |
>   //p[contains(@corresp,$this/@xml:id)][(concat('#', @xml:id)=$this/@corresp) or (concat('#',$this/@xml:id)=@corresp)]
I would suspect that this line is causing much of the trouble! You
can split it like the one above, and then look at each part:
- contains($this/@corresp, @xml:id)
This does not use indexes! Maybe try something like "@xml:id =
tokenize($this/@corresp, ' ')" (note that I do not use "eq" here,
because the right-hand side is a sequence). Or do you use
fn:contains because of the prepended "#"? Then maybe something along
the lines of "@xml:id = (for $i in tokenize($this/@corresp, ' ')
return substring-after($i, '#'))". I assume that you do this for
15 $this and do not have a huge number of @corresp values. This way
you can make use of the index on @xml:id.
- concat('#', @xml:id)=$this/@corresp
This does not use indexes either. Even if you have an index defined
for @xml:id, it would not be used because of the concat. The way you
wrote it, I would suspect that eXist has, in the worst case, to do a
concat for every tei:p element you have (linear scaling, then). Maybe
try "@xml:id eq substring-after($this/@corresp, '#')" to use the
index on @xml:id instead.
- concat('#',$this/@xml:id)=@corresp
Similar here, except that the problem is the missing index on
@corresp.
The second part of the union has the same issues. The second
predicate is exactly the same; for the first, just swap @xml:id and
@corresp in my first explanation.
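Putting these pieces together, the lookup could look roughly like the
following. This is an untested sketch, not a drop-in replacement: the
function name local:others, the tei prefix and the usage line at the
end are my assumptions, and it presumes that $this is one of your
tei:p elements.

```xquery
xquery version "1.0";

declare namespace tei = "http://www.tei-c.org/ns/1.0";

declare function local:others($this as element()) as element()* {
    (: ids that $this points to: split @corresp and strip the leading '#' :)
    let $targets :=
        for $ref in tokenize(normalize-space($this/@corresp), ' ')
        return substring-after($ref, '#')
    (: plain comparison on @xml:id, so the index on @xml:id can be used :)
    let $forward := collection("/db/he_kupu_tawhito")//tei:p[@xml:id = $targets]
    (: the right-hand side is computed once, so an index on @corresp can be used :)
    let $backward :=
        collection("/db/he_kupu_tawhito")//tei:p[@corresp = concat('#', $this/@xml:id)]
    return $forward | $backward
};

(: hypothetical usage: related paragraphs for the first 15 paragraphs :)
for $p in subsequence(collection("/db/he_kupu_tawhito")//tei:p, 1, 15)
return local:others($p)
```

Like your original equality test, the backward lookup only matches
paragraphs whose @corresp holds exactly one value; for multi-valued
@corresp you would need the ngram or fulltext index mentioned above.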
> (4) Is using an index to pull a <p/> out of a 500 MB single file slower
> than pulling the same <p/> out of a 50 KB file sitting in a collection
> of 10K files?
I would not expect the big file to be slower.
> (5) TEI uses the standard xml:id and xml:lang tags and I make
> extensive use of both of these. I run xmllint over my input files to
> check for duplicate xml:ids. Does eXist have any special support for
> these? Are there any common traps?
You can define indexes on them just as on any other attribute. I
don't know what you mean by special support, though.
> (6) Currently I don't take any steps to update the indexes. Does eXist
> build the indexes listed in the collection.xconf as documents are
> loaded?
Yes, eXist updates the indexes when you store documents. Hence, this
shouldn't be a problem. I think the issue is your query.
> (7) Are there other common traps and pitfalls that I should be
> checking? I've read http://exist.sourceforge.net/tuning.html
Read http://exist.sourceforge.net/indexing.html, too.
I think you will find some very important things there, especially
on the kinds of indexes that certain functions (e.g. fn:contains)
benefit from and on why qname indexes are preferable.
I hope I understood correctly what you are trying to do. If I am
wrong in some of my assumptions or solutions, please feel free to
correct me. I hope you find something useful in this response.
cheers,
Stefan