Mandell, Laura C. Dr. | 1 Jul 2010 01:57

Re: Relative databases vs. XML technologies

Dear TEI-List:

Thank you all for this great info.

Best, Laura

On 6/30/10 11:50 PM, "Martin Holmes" <mholmes <at> uvic.ca> wrote:

> I would agree with this: if your data looks like records in a database,
> then a database is the right tool for the job. eXist is unlikely ever to
> compete with db speeds for this kind of operation, although it can come
> pretty close if you tune the indexes carefully enough.
> 
> The strength of an XML db is that it can handle arbitrarily complex tree
> structures with massive levels of nesting, and let you navigate around
> these sorts of structures in your query. You can frame queries such as:
> 
> Give me every <l> (line) element which is the third in its stanza, whose
> stanza is nested inside another stanza which is the last of its
> siblings, where the line contains text in Latin, where the containing
> poem was published after 1753... etc.
> 
> In other words, if structure and hierarchy are intrinsic parts of your
> data, XML databases are very useful indeed; if structure is an arbitrary
> organizing principle which is not inherently interesting, and especially
> if it's simple, then relational dbs are a better option.
> 
> Cheers,
> Martin
> 
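As a rough illustration of the kind of structural query Martin describes, here is a minimal sketch in plain JavaScript over a tiny in-memory tree. Everything in it is an assumption made for illustration: the `el` helper, the sample poem, and the simplified `lang` and `year` attributes (real TEI would use xml:lang and dating information held elsewhere). In an XML database the same question would be a single path expression, shown in the first comment.

```javascript
// Roughly equivalent XPath for this simplified data model (one reading of the query):
//   //div[@type='poem'][@year > 1753]//lg[not(following-sibling::lg)]/lg/l[3][@lang='la']

// Hypothetical helper: build a node and wire up parent pointers.
function el(name, attrs, children) {
  const node = { name: name, attrs: attrs || {}, children: children || [], parent: null };
  node.children.forEach(function (child) { child.parent = node; });
  return node;
}

// Hypothetical sample: two outer stanzas, each containing nested stanzas.
const poem = el('div', { type: 'poem', year: '1760' }, [
  el('lg', { n: 'outer1' }, [
    el('lg', { n: 'a' }, [el('l', { n: '1' }), el('l', { n: '2' }), el('l', { n: '3', lang: 'la' })])
  ]),
  el('lg', { n: 'outer2' }, [
    el('lg', { n: 'b' }, [el('l', { n: '4' }), el('l', { n: '5' }), el('l', { n: '6', lang: 'la' })]),
    el('lg', { n: 'c' }, [el('l', { n: '7' }), el('l', { n: '8' }), el('l', { n: '9' })])
  ])
]);

// All descendants of a node, in document order.
function descendants(node) {
  return node.children.reduce(function (acc, child) {
    return acc.concat([child], descendants(child));
  }, []);
}

// True if node is the last of its same-named siblings.
function isLastAmongSiblings(node, name) {
  const siblings = node.parent.children.filter(function (s) { return s.name === name; });
  return siblings[siblings.length - 1] === node;
}

function matchingLines(root) {
  return descendants(root).filter(function (node) {
    if (node.name !== 'l') { return false; }
    const stanza = node.parent;
    if (!stanza || stanza.name !== 'lg') { return false; }
    // Third line of its stanza.
    const lines = stanza.children.filter(function (c) { return c.name === 'l'; });
    if (lines[2] !== node) { return false; }
    // Stanza nested inside another stanza that is the last of its siblings.
    const outer = stanza.parent;
    if (!outer || outer.name !== 'lg' || !outer.parent) { return false; }
    if (!isLastAmongSiblings(outer, 'lg')) { return false; }
    // Line is in Latin.
    if (node.attrs.lang !== 'la') { return false; }
    // Containing poem published after 1753.
    let ancestor = outer.parent;
    while (ancestor && ancestor.attrs.type !== 'poem') { ancestor = ancestor.parent; }
    return ancestor !== null && Number(ancestor.attrs.year) > 1753;
  });
}

console.log(matchingLines(poem).map(function (l) { return l.attrs.n; })); // [ '6' ]
```

The hand-written traversal is far longer than the path expression, which is the point being made above: when hierarchy matters, a query language that navigates trees natively does most of the work.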

Brett Zamir | 1 Jul 2010 04:55

Re: Relative databases vs. XML technologies

> The big question here is whether this technology can be brought to a
> level at which non-programmers can learn it in a reasonable time. A
> "reasonable time" is more than five minutes, but it's almost closer
> to learning how to ride a bicycle than how to play the violin.

I believe so. In the course of arguing for XML databases as client-side HTML storage*, I've also proposed allowing a jQuery-like syntax for querying client-side databases. jQuery is a hugely popular and easy-to-learn JavaScript library, and the CSS Selectors it uses should, I believe, be fully convertible into XPath (John Resig has done the reverse for simple XPath), which is itself a subset of the yet more powerful XQuery. Even if such an interface dumbs things down compared to XQuery or even XPath, it may be a more comfortable way to get people used to the idea (and jQuery has its own XQuery-like functions for easily iterating over nodes, etc. anyway).
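To make the CSS-to-XPath claim concrete, here is a minimal sketch of such a conversion. The function name `cssToXPath` is hypothetical (it is not part of jQuery or any library), and it handles only a small subset of selectors: tag names, #id, .class, and the descendant and child combinators.

```javascript
// Hypothetical helper: translate a small subset of CSS selectors to XPath.
function cssToXPath(selector) {
  // Normalise whitespace around ">" so the selector splits into clean tokens.
  const tokens = selector.trim().replace(/\s*>\s*/g, ' > ').split(/\s+/);
  let xpath = '';
  let axis = '//'; // the first step may match anywhere in the document
  for (const token of tokens) {
    if (token === '>') {
      axis = '/'; // the next step must be a direct child
      continue;
    }
    const match = /^([a-zA-Z][\w-]*|\*)?((?:[#.][\w-]+)*)$/.exec(token);
    if (!match) {
      throw new Error('Unsupported selector part: ' + token);
    }
    const tag = match[1] || '*';
    const predicates = (match[2].match(/[#.][\w-]+/g) || []).map(function (part) {
      const name = part.slice(1);
      return part[0] === '#'
        ? '[@id="' + name + '"]'
        // Class names are space-separated tokens, hence the padding trick.
        : '[contains(concat(" ", normalize-space(@class), " "), " ' + name + ' ")]';
    });
    xpath += axis + tag + predicates.join('');
    axis = '//'; // a plain space means "any descendant"
  }
  return xpath;
}

console.log(cssToXPath('lg l'));         // //lg//l
console.log(cssToXPath('div#poem > l')); // //div[@id="poem"]/l
console.log(cssToXPath('.irony'));
```

A full converter would also need attribute selectors, pseudo-classes and sibling combinators, but the mapping stays mechanical, which is what makes a selector-style front end to an XPath or XQuery engine plausible.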

Client-side usage might look like this (where the client-side database "Classics" had been created earlier by some other web API function call):

    $('db("Classics", "Classic Literature Database") ' +
      'collection("ShakespeareWorks") .irony').each(function (index, ironicPassage) {
        // jQuery's each() passes (index, element); append each matching passage
        $('#TheBardBeingIronic').append(ironicPassage);
    });

The above finds every ironic passage of Shakespeare in the locally-stored collection and adds the passages to an HTML element.

Querying remote XML/HTML stores should be possible from client-side HTML too (as it is already for server-side languages), especially if websites were not forced to obtain explicit permission from the remote site but could instead make cross-domain Ajax requests, conditional on user permission: https://bugzilla.mozilla.org/show_bug.cgi?id=573886 . This would allow any website (including TEI ones) to be treated as a data store, with the burden shifted from the server to the user, who could query to their heart's content without slowing down an intermediate server, while still allowing that third party to offer them a user interface in HTML. Because incorporating content from remote sites raises security issues, this would have to require the user's permission.

For example, the remote querying might look as simple as this:
    var collec = {
        someShakespeareWorks: [
            'http://example.com/Hamlet.html',
            'http://example.com/Macbeth.html'
        ]
    }; // Reusable collection variable
    $('collection("someShakespeareWorks") .irony', function (ironicPassages) {
        $('#TheBardBeingIronic').append(ironicPassages);
    }, collec);

As above, this finds every ironic passage of Shakespeare in the specified works and adds the passages to an HTML element, but in this case the files have not been stored locally; they are fetched live from the remote site. (The best of both worlds would be possible if the client-side storage could be made to check periodically for updates.)

If you XML fans want to see this kind of functionality available in browsers (whether through jQuery or, ideally, XQuery itself), so that any web designer could build an interface that works against your TEI, whether it is stored on a remote server or installed via the web into a web-accessible client-side database, voice your support on the WHATWG HTML5 mailing list (http://www.whatwg.org/mailing-list)!

Brett

* In the discussion thread about IndexedDB, the proposed standard for allowing client-side database storage inside HTML5 at http://hacks.mozilla.org/2010/06/comparing-indexeddb-and-webdatabase/comment-page-1/#comment-96635

Brett Zamir | 1 Jul 2010 05:09

Re: Relative databases vs. XML technologies

> If time and computing resources are not an everyday problem for you, then
> putting your TEI corpora in a native XML database (eXist, Berkeley DB XML
> <http://www.oracle.com/technology/products/berkeley-db/xml/index.html>
> ...) may help a lot, provided XQuery can become second nature to you.
> If your tagging allows it, you can get fast answers to questions like:
> How many sentences are not in quotes? Who said "I love you" in this
> drama? I remember an idea in a note, but where? However, it's not a good
> idea to open that on the web. A classroom of 15 students can freeze an
> XML database on a decent server (depending on the queries).

Unless, as per my other post just now, it is client-side storage
accessible to websites, which can address both privacy and server-load
concerns while still allowing web applications to install and access
the data, according to user preference.

This would work like cookies (which are also stored locally) but allow
much richer client-side database features with far more storage,
whether that is IndexedDB as currently proposed for HTML5 or, my
preference, a native XML database supporting XQuery. I have proposed the
latter (and am currently working on a Firefox add-on that will, I hope,
demonstrate the concept and make it usable until such time as it can
be standardized). While eXist or BDBXML would be great for
those willing to make a fairly big download, I've recently discovered a
very small XML database, Sedna, which I think could be small enough to
include with an extension, or possibly as part of a browser like Firefox
itself. Ideally, though, the HTML database API would be generic (maybe
using XQJ, the Java API for XQuery?) so as to work with any database the
user has installed.

best wishes,
Brett

Brett Zamir | 1 Jul 2010 05:53

Re: Relative databases vs. XML technologies

On 7/1/2010 6:50 AM, Martin Holmes wrote:
> I would agree with this: if your data looks like records in a 
> database, then a database is the right tool for the job. eXist is 
> unlikely ever to compete with db speeds for this kind of operation, 
> although it can come pretty close if you tune the indexes carefully 
> enough.

I would think that this is merely an implementation issue rather than
an inherent problem that XML databases could not overcome. If an XML
file is known to have a strictly tabular structure (with no processing
instructions, comments, etc., or at least none relevant to a deliberately
"relational" query, for which a separate table structure could be created
internally), then whatever storage principles relational databases apply
can be applied to the XML data. I see no reason, at least in theory, why
relational databases would be needed as something separate from XML
databases. At most, there might be some benefit in an XML database
allowing "relational"-aware, optimizing queries inside XQuery (as some
XQuery extensions do by letting you issue SQL from within XQuery), and
even that would matter only if the database cannot be made to
pre-optimize on its own, as I would think should be possible.

Since XML can in concept mimic a relational database perfectly (while
the converse is not true), an XML database could even outsource this
work to a relational database component, I would imagine. I'm no
expert here, I'll admit, but I just don't see any reason it couldn't work.

The advantage of focusing on XML databases I think is that it provides a 
common query mechanism (XQuery) which can work with either hierarchical 
or tabular data. In particular, I hope the web will not be deprived of 
this single means of querying.

Brett

Richard Light | 1 Jul 2010 10:11

Re: Relative databases vs. XML technologies

In message <0C40532C-3E9C-45E9-A024-05EA895BF749 <at> oucs.ox.ac.uk>, 
Sebastian Rahtz <sebastian.rahtz <at> OUCS.OX.AC.UK> writes
>For a project here, we recently switched from a setup based on eXist 
>and XQuery to one using an SQL database; the speed of operation and the 
>speed of development rocketed overnight :-} This worked because our TEI 
>file consisted of 250,000 "records" (TEI <person>), which we stored 
>untouched in one column of the table, and added as many columns as we 
>needed to index the data. Then we used XSLT to format the <person> 
>records which came back from a query. It's not a new technique.
>
>Of course, this is not a traditional use of TEI, but it demonstrates a) 
>that there are applications which are TEI but look more like a 
>database, and b) you can combine XML tools with SQL databases.

Just to mention another hybrid approach: we also use a relational 
database engine and store XML fragments as BLOBs.  We then add an 
indexing plugin to the database.  This allows us to specify multiple 
indexes which use XPath expressions to index XML content within the 
BLOBs. However, as far as the database engine is concerned, these are 
"normal" indexes.

We use parent and child processing instructions to indicate the position 
of each XML fragment within the original document. This allows the 
re-creation of this document as part of a report generation "pipe".
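As a rough sketch of how stored chunks might be stitched back into one document, the example below assumes a hypothetical `<?child ID?>` processing instruction marking where each child chunk belongs. The instruction name, the chunk identifiers and the sample fragments are all invented for illustration and are not taken from the system described above.

```javascript
// Hypothetical chunk store: chunk id -> XML fragment. A <?child ID?> processing
// instruction marks the place where another chunk belongs.
const chunks = new Map([
  ['doc', '<text><body><?child ch1?><?child ch2?></body></text>'],
  ['ch1', '<div n="1"><p>First chunk.</p></div>'],
  ['ch2', '<div n="2"><p>Second chunk.</p><?child ch2a?></div>'],
  ['ch2a', '<note>Nested chunk.</note>']
]);

// Recursively replace each child instruction with the chunk it names.
// No cycle detection: this assumes the chunks form a tree.
function reassemble(store, id) {
  const fragment = store.get(id);
  if (fragment === undefined) {
    throw new Error('Missing chunk: ' + id);
  }
  return fragment.replace(/<\?child\s+([\w-]+)\s*\?>/g, function (whole, childId) {
    return reassemble(store, childId);
  });
}

console.log(reassemble(chunks, 'doc'));
```

Because every chunk is addressed by its identifier, an individual fragment can be locked and updated on its own, and the full document is rebuilt only when a report needs it.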

This approach gives us the benefit of holding our TEI as a shared 
updateable resource (with the usual relational record-locking and 
transaction support, and real-time indexing of content).

On the retrieval front, this approach doesn't help much with external 
querying, since SQL doesn't deal in indexes, but we have built custom 
search mechanisms which use the XPath indexes directly, and are happy 
enough with that.  Nor does it give you the ability to put ad hoc XPath 
queries to the entire document as a native XML database would.
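The hybrid idea running through this thread (keep the XML untouched in one column, and index values extracted from it) can be sketched without any database at all. In the example below, the sample rows, the index names and the helper functions are hypothetical, and the regular expressions are only a dependency-free stand-in for the XPath expressions a real indexing plugin would evaluate.

```javascript
// Rows as they might sit in a relational table: an id plus the XML stored
// untouched (hypothetical sample data).
const rows = [
  { id: 'p1', xml: '<person xml:id="p1"><persName>Ada Lovelace</persName><birth when="1815"/></person>' },
  { id: 'p2', xml: '<person xml:id="p2"><persName>Charles Babbage</persName><birth when="1791"/></person>' },
  { id: 'p3', xml: '<person xml:id="p3"><persName>Mary Somerville</persName><birth when="1780"/></person>' }
];

// Index definitions. Regular expressions stand in for XPath expressions here.
const indexDefinitions = {
  name: /<persName>([^<]*)<\/persName>/,
  birthYear: /<birth when="(\d{4})"/
};

// Build one Map per index: extracted value -> list of row ids.
function buildIndexes(table, definitions) {
  const indexes = {};
  Object.keys(definitions).forEach(function (indexName) {
    const index = new Map();
    table.forEach(function (row) {
      const match = definitions[indexName].exec(row.xml);
      if (match) {
        const ids = index.get(match[1]) || [];
        ids.push(row.id);
        index.set(match[1], ids);
      }
    });
    indexes[indexName] = index;
  });
  return indexes;
}

const indexes = buildIndexes(rows, indexDefinitions);

// Search through the index, then return the untouched XML, ready to be
// formatted (for example with XSLT).
function findXml(indexName, value) {
  const ids = indexes[indexName].get(value) || [];
  return rows
    .filter(function (row) { return ids.includes(row.id); })
    .map(function (row) { return row.xml; });
}

console.log(findXml('birthYear', '1791'));
```

The trade-off matches the one described above: lookups through a predefined index are cheap, but any question not anticipated by an index definition means scanning and parsing every stored fragment.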

One advantage of this approach is that it will support any type of TEI 
document, not just "record-like" ones.  The one requirement is that you 
have to decide on a "chunking" policy, and assign a unique identifier to 
each chunk/record.

Richard
-- 
Richard Light

Sebastian Rahtz | 1 Jul 2010 11:03

Re: Relative databases vs. XML technologies

I suspect we all agree that managing the source data as TEI XML is the important thing. From there we can choose
to use XSLT, Postgres, eXist or Jena triplestore as meets our needs today, pretty confident that we can switch
to something else next year if it seems appropriate.
--
Sebastian Rahtz      
(acting) Information and Support Group Manager
Oxford University Computing Services
13 Banbury Road, Oxford OX2 6NN. Phone +44 1865 283431

Sólo le pido a Dios
que el futuro no me sea indiferente

Graham, Wayne (wsg4w | 1 Jul 2010 16:02

Re: Relative databases vs. XML technologies

I'll throw one more in here. We've been using Solr a lot for the discovery
layer, splitting XML into different types of Solr documents as needed, then
using client XML/XSLT libraries to provide more granular search results on a
per-page basis as needed. You can see an example of the technique at
http://raven.scholarslab.org/ . If you're really interested, you can check
out some code we have on GitHub: http://github.com/mwmitchell/raven
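For readers unfamiliar with the idea of splitting XML into per-page search documents, here is a minimal sketch. It is not taken from the raven code: the function name, the field names (`id`, `work_id`, `page`, `text`) and the sample markup are assumptions, and the string splitting stands in for proper XML processing.

```javascript
// Hypothetical splitter: cut a TEI body at <pb n="..."/> page breaks and
// emit one flat, Solr-style document per page.
function splitIntoPageDocs(workId, teiBody) {
  // The capture group makes split() keep each page number in the result:
  // [text before first pb, n1, content1, n2, content2, ...]
  const parts = teiBody.split(/<pb\s+n="([^"]+)"\s*\/>/);
  const docs = [];
  for (let i = 1; i < parts.length; i += 2) {
    // Strip markup and collapse whitespace to get searchable plain text.
    const text = parts[i + 1].replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
    docs.push({ id: workId + '-p' + parts[i], work_id: workId, page: parts[i], text: text });
  }
  return docs;
}

const sample = '<pb n="1"/><p>To be, or <hi>not</hi> to be.</p>' +
               '<pb n="2"/><p>That is the question.</p>';

console.log(JSON.stringify(splitIntoPageDocs('hamlet', sample), null, 2));
```

An array of flat objects like this is the general shape a search index expects; the original XML can still be fetched by `work_id` and transformed with XSLT when a hit needs to be shown in context.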

HTH,
Wayne

Stuart Yeates | 1 Jul 2010 20:58

Re: Relative databases vs. XML technologies

> I'll throw one more in here. We've been using Solr a lot for the discovery
> layer, splitting XML into different types of Solr documents as needed, then
> using client XML/XSLT libraries to provide more granular search results on a
> per-page basis as needed. You can see an example of the technique at
> http://raven.scholarslab.org/ . If you're really interested, you can check
> out some code we have on GitHub: http://github.com/mwmitchell/raven

We do something completely different with Solr: http://www.nzetc.org/tm/scholarly/facets/search

See also http://www.nzetc.org/tm/scholarly/books.rss which is an alias to our complete list of works,
served through Solr as an RSS feed of downloadable ePubs.

cheers
stuart

James Cummings | 1 Jul 2010 23:25

Re: Relative databases vs. XML technologies

On 30/06/10 18:05, Martin Mueller wrote:
> Joseph Wicentowski at the State Department's Office of the Historian
> and the folks at syntactica.com are doing some very interesting work
> with a publishing solution that uses eXist and xquery to do just
> about everything. I'm marginally and not very competently involved in
> this enterprise, but conceptually it looks very promising, and the
> web site of the Office of the Historian is certainly a site of some
> scale.
>
> The big question here is whether this technology can be brought to a
> level at which non-programmers can learn it in a reasonable time. A
> "reasonable time" is more than five minutes, but it's almost closer
> to learning how to ride a bicycle than how to play the violin.

Just to comment on this. Joe is giving a 2.5 workshop as part of our TEI
Summer School introducing this kind of thing to people. I've taught
people enough XQuery and basic eXist to have them indexing and running
several types of query in three-quarters of a day. Obviously, though, it
takes much longer to feel comfortable implementing it for real. In a poorly
constructed analogy: I can teach anyone to juggle in 45 minutes, but it will
take at least a couple of weeks of regular practice before it feels normal.

I use eXist/XQuery for several websites, and proper indexing (with
Lucene full-text) certainly makes a substantial difference to retrieval
times. The XQuery URL rewriting in recent versions is also quite
interesting.

-James

Sebastian Rahtz | 8 Jul 2010 00:07

TEI P5 release 1.7.0

The new release of the TEI Guidelines is now in all the usual places:

   * as file releases on Sourceforge at https://sourceforge.net/projects/tei/files/
   * live at http://www.tei-c.org/Vault/P5/current/ and points south (not quite yet
       at http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html, but soon)
   * as Debian packages at http://tei.oucs.ox.ac.uk/teideb/

The release notes below cover most of the changes, but are not
all-encompassing.

***

This release introduces significant additional features to the way
in which the ODD system for TEI customization may be expressed. The
new features introduced allow a customization to be expressed by
inclusion (specifying only the elements it requires) rather than by
exclusion (specifying the elements which it does not require). They
also permit specification of a particular version of the Guidelines as
source for a schema. 

Support for this entailed a number of changes, including the definition
of new attributes for <moduleRef> and new elements <elementRef>,
<macroRef> and <classRef>, as well as revision of the prose of the
relevant Guidelines chapters. Some clarification of the way the
odd-to-odd transformation process works, for example when generating
pattern prefixes for RelaxNG, was also necessary. The new features
greatly simplify the process of generating user-specific customisations,
while retaining all the existing behaviours. In addition, the test suite
has been revised and extended to check that the new facilities work as
intended. Expanded tutorial material for the revised system will be
the subject of a Workshop to be taught at the 2010 General Meeting.

As usual, several minor clarifications and corrections were made to
the wording of the Guidelines in response to Sourceforge tickets (e.g.
3025031 3025032 3025017 3010481 2989088 2982439 2942469 2965680
2981703 2982056) and discussion on Council and TEI-L mailing
lists. 

The work of the AFNOR group providing and correcting French
translations and examples also continued; in particular during May and June
substantial work on a new suite of French language examples was undertaken.

Other specific changes are listed below in reverse date order:

 * 2010-07-02 : add constraint to prevent <relatedItem> supplying both a
@target and some content (2728061)

 * 2010-05-08 : add new source attribute and revise datatype of
existing version attributes for consistency

 * 2010-05-06 : add new att.docStatus attribute class (2812634)

 * 2010-05-06 : add <material> to att.canonical class (2811234)

 * 2010-05-01 : rationalise and make consistent datatype of the
@target attribute (2531384)

 * 2010-05-01 : revised class memberships for consistency amongst
elements which can appear within choice (2834505)

 * 2010-04-30 : new attribute @points added to att.coordinated to enable
definition of non-rectangular zones in facsimile (2971316)

 * 2010-04-30 : added idno to model.nameLike, thus permitting use of
standard identifiers of various kinds in person, place etc. (2949985)

 * 2010-04-30 : add <ref> wherever <title> is permitted in high-level
components of <biblStruct> (2976608)

 * 2010-04-06 : make @url mandatory and scaling attributes optional for
all graphic elements (2965521)

--
Sebastian Rahtz      
Information Manager, Oxford University Computing Services
13 Banbury Road, Oxford OX2 6NN. Phone +44 1865 283431

Sólo le pido a Dios
que el futuro no me sea indiferente

