Re: Re:Re:Xquery and ISO-8859-1
Michael Beddow <mbexlist-2 <at> mbeddow.net>
2005-01-03 11:40:23 GMT
> So, I tried requests (without and with accentuated
> characters) form the command-line client, and it
> works perfectly :
That's good news. It confirms that everything is working fine at the core as
far as iso-8859-1 encoded data is concerned. Your problem is arising
somewhere on the periphery and should be relatively easy to isolate and
solve.
> I also tried to encode all my data in UTF-8 as suggested
> Michael, replacing accentuated characters by code (for
> example "&eacute;" in place of "é"). But I got error
> messages when I tried to import the data in the eXist
> database.
Ah, there's a misunderstanding here, with, I think, two components:
1) You can't use character entity references (CER) like &eacute; in XML
documents, unless your document has a DTD (or a fragmentary internal subset
of such a DTD if you have no "real" DTD or are using a schema for
validation) in which those entities are declared and defined so that the
parser knows how to resolve them (which the parser in eXist won't do anyway
unless it is explicitly told to validate against a DTD). Without such
declarations in a DTD, a conformant XML parser can resolve only &amp; &lt;
&gt; &quot; and &apos;. People who use XHTML or the TEI-LITE DTD sometimes
don't realise this, because in both cases the relevant DTD defines a wide
range of character entity references. So, for what you are trying to do, you
would need to use a numeric character reference (NCR), in this case &#233; or
&#xE9;. These look like entity references, in that they begin with an
ampersand and end with a semi-colon, but they are not entity references and
shouldn't be called such. An XML parser handles NCRs at a much lower level
than entity references (character or otherwise). As soon as an NCR appears
in the parser's input stream, it is translated from a string to a binary
value corresponding to the number represented in ASCII between the & and the
; in the body of the NCR. The parser proper never sees the NCR at all, and
so doesn't report its presence or its processing. It just disappears,
leaving the binary value behind. Whereas the presence of a CER triggers the
parser's entity resolution mechanism, which may involve a callback to the
application the parser is servicing and so be visible to that application if
required.
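
A minimal sketch of the distinction drawn in point 1. It uses the expat-based parser in Python's standard library rather than the parser inside eXist, purely as an assumption for illustration; the helper name parse_text and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_text(xml_source: str) -> str:
    """Parse a tiny document and return the text of its root element."""
    return ET.fromstring(xml_source).text


# Numeric character references need no declarations at all.
decimal_ncr = parse_text("<p>&#233;</p>")
hex_ncr = parse_text("<p>&#xE9;</p>")

# A character entity reference with no DTD declaring it is a parse error.
try:
    parse_text("<p>&eacute;</p>")
    undeclared_failed = False
except ET.ParseError:
    undeclared_failed = True

# The same reference resolves once an internal subset declares the entity.
declared = parse_text(
    '<!DOCTYPE p [<!ENTITY eacute "&#233;">]><p>&eacute;</p>'
)
```

Both NCR forms and the declared entity come back as the single character e acute, while the undeclared &eacute; is rejected outright.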
2) Use of the NCR &#233; to represent an e acute hasn't anything to do with
using utf-8. We have to distinguish between the code-point assigned to a
character in a character set and the internal representation of that
code-point. The character Unicode names as LATIN SMALL LETTER E WITH ACUTE
happens to have the same assigned code-point in both ISO-8859-1 and in
Unicode, namely hex E9. And what the NCR &#xE9; tells the parser is "I want
to insert the character whose Unicode code-point is U+00E9 into the text
stream at this point". How that character is internally represented is a
different matter, and needn't concern us unless things go wrong and we have
to pick over the wreckage, but in an iso-8859-1 encoded document it is
represented as a single byte with value hex E9, whereas in a utf-8 encoded
document the same code-point is represented internally as a two byte
sequence, hex C3 A9. Properly-configured software should always be able to
hide this internal representation from us, but things don't always work out
that way.
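
To make the difference between code-point and internal representation concrete, here is a small Python sketch (the variable names are just for illustration). It takes the one character and shows its code-point alongside its byte representation in each encoding.

```python
E_ACUTE = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

code_point = ord(E_ACUTE)                    # the abstract code-point
latin1_bytes = E_ACUTE.encode("iso-8859-1")  # one byte
utf8_bytes = E_ACUTE.encode("utf-8")         # two bytes

# prints: U+00E9 e9 c3a9
print(f"U+{code_point:04X}", latin1_bytes.hex(), utf8_bytes.hex())
```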
So to follow my initial advice and convert your data to utf-8 you would need
to run all your iso-8859-1 documents through a transcoder. The one most
people rely on is iconv, which is in all Linux and most Unix distributions,
and for Windows can be obtained from
http://gnuwin32.sourceforge.net/packages/libiconv.htm
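
With iconv the conversion is typically along the lines of "iconv -f ISO-8859-1 -t UTF-8 in.xml > out.xml" (the file names are placeholders). If iconv isn't to hand, the same transcoding step can be sketched in a few lines of Python; the function name and the sample data below are hypothetical.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 without changing the characters."""
    return data.decode("iso-8859-1").encode("utf-8")


# Hypothetical ISO-8859-1 input: the e acute is the single byte E9.
sample = b"<titre>caf\xe9</titre>"
converted = latin1_to_utf8(sample)
```

Either way, only the bytes are changed. If a document carries an XML declaration naming iso-8859-1, that declaration has to be edited to match the new encoding as a separate step.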
> My request is written in a text editor (JEdit)
That in itself doesn't tell us what internal encoding the editor is using. I
am (by choice) very ignorant about the interaction of Java and Windows, but
I wouldn't be surprised if the underlying JVM didn't default to the system
locale, which in your case would be iso-8859-1, as its internal
representation for character data. That should mean that when you press your
key for eacute when composing your query, the iso-8859-1 internal
representation of that character goes into the data buffer and gets saved in
the file. And if the data you're querying is encoded in iso-8859-1, then
that's what you want to happen. Are you providing your XQuery with an
encoding declaration, though? Your example doesn't have one, but if it is
iso-8859-1 encoded it needs one. [Q to Wolfgang: I take it the eXist XQuery
parser recognises and handles encoding declarations ?]
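
One rough way to find out what the editor actually wrote to disk is to test whether the saved bytes are valid utf-8. The following Python sketch assumes nothing about jEdit itself; the function name and the sample query are hypothetical.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


query = '//titre[. = "caf\u00e9"]'  # hypothetical query text
saved_as_latin1 = query.encode("iso-8859-1")
saved_as_utf8 = query.encode("utf-8")

print(looks_like_utf8(saved_as_latin1), looks_like_utf8(saved_as_utf8))
```

A lone byte E9 followed by an ASCII character is not a valid utf-8 sequence, so a failure here is a strong hint that the file is in a single-byte encoding such as iso-8859-1. A pure-ASCII file passes the check in either case, so the test only tells you something when accented characters are present.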
This matter can be a bit confusing. Although XQuery documents are
emphatically not XML documents and so can't and don't have an XML
declaration, they can have an encoding declaration, and indeed must have one
if their encoding is not utf-8. See http://www.w3.org/TR/xquery sections
===========
H5: XQuery documents use the Unicode character set and, by default, the
UTF-8 encoding.
===========
and
===========
H3 An XQuery document may contain an encoding declaration as part of its
version declaration :
xquery version "1.0" encoding "utf-8";
===========
There is a hidden dependency between those two statements, which is hidden
all the more by the order in which they appear in the WD.
SO ... if you still have your collection in eXist encoded in iso-8859-1 and
correctly declared as such (which seems to be the case, because the XPath
test with the cl client succeeds), I suggest you try heading up your
XQueries with
xquery version "1.0" encoding "iso-8859-1";
and then submitting them, again via the command line client, but this time
using its -F argument (NB upper case in that switch) to pass in the name of
the file which contains your XQuery. If your editor is indeed encoding the
query using iso-8859-1 and if eXist correctly supports encoding
declarations on XQueries, this should then work. If it doesn't, we will have
to do some more head scratching.
> and I run my xql file from an html form in Internet explorer.
I'm not clear what "run from" exactly means here and how it relates to the
creation of the query using Jedit, but encoding issues with html form data
add a further layer of possible errors, which I'd prefer to leave aside at
the moment, which is why I suggest delivering the XQuery via the
command-line client's -F parameter.
Michael Beddow
, and I run my
xql file from an html form in Internet explorer. My
computer is running under windows XP.
Rémy
-------------------------------------------------------
The SF.Net email is sponsored by: Beat the post-holiday blues
Get a FREE limited edition SourceForge.net t-shirt from ThinkGeek.
It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt
_______________________________________________
Exist-open mailing list
Exist-open <at> lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/exist-open