Re: XML.StreamWrapper error on UTF-8 "byte-order mark"
Steven Kelly <stevek <at> metacase.com>
2008-12-01 12:04:16 GMT
Thanks Holger, I'll take a look. While of course I agree with the idea
of putting the code in the right place, I find it's much easier to get a
change into the base VW if it's a bug fix to a single base method, than
if it's an extension in the public repository. Once it's in the base VW,
it's maintained and supported, and of course it's there immediately for
both new and experienced users: no need to hit the bug, wonder about it,
ask on the mailing list, wait for the creator of the fix in the public
repository to reply, load the latest version, and update all build
scripts to include it (and add a note to self to check that newer
versions of it are still good, and that it is updated when there is a
new VW version) :-).
I added a smiley, but I have to admit that it's a slightly pained smile.
Eclipse users have to spend an average of 30 minutes a day just keeping
their IDE up to date, and I worry that VW is heading in the same
direction with all the contributed stuff. I really welcome the moves to
integrate the most used parts in the standard development image. For bug
fixes like this, I think it's even clearer that they should be in the
base.
Steve
> -----Original Message-----
> From: Holger Guhl [mailto:holger <at> heeg.de]
> Sent: 01 December 2008 12:30
> To: Steven Kelly
> Cc: VW NC
> Subject: Re: [vwnc] XML.StreamWrapper error on UTF-8 "byte-order mark"
>
> Please have a look at our recent version of GHCsvImportExport (1.11) in
> the Cincom Public Repository. We made some extensions to PeekableStream
> to carefully peek for a BOM (byte order mark). Method #nextBOM peeks for
> a byte order mark and leaves the stream pointer behind its occurrence
> (if any). Method #getEncodingFromBOM translates the result into an
> encoding symbol.
> Method Heeg.CsvReader>>onFileNamed: shows a possible application
> scenario.
> We have not yet extracted the reusable PeekableStream stuff to another
> package. I'd like to encourage you to go with this approach or adapt
> it to your needs. Having reusable code for Stream is better than
> inlining the stuff whenever you need it. BTW: The PeekableStream
> extension methods have some nice comments that explain some of the bits
> and bytes "magic".
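
The idea of translating a BOM into an encoding symbol can be sketched in a workspace as below. This is only an illustration under stated assumptions, not the GHCsvImportExport code: the block name encodingFromBOM and the symbols #'UTF-16BE' and #'UTF-16LE' are hypothetical, it works on a plain ByteArray rather than a stream, and it ignores UTF-32 (whose little-endian BOM also begins with FF FE).

```smalltalk
| encodingFromBOM |
"Hypothetical helper: answer an encoding symbol for the BOM at the
 start of a ByteArray, or nil when no known BOM is present."
encodingFromBOM := [:bytes | | startsWith |
    startsWith := [:bom |
        bytes size >= bom size
            and: [(bytes copyFrom: 1 to: bom size) = bom]].
    (startsWith value: #[16rEF 16rBB 16rBF])
        ifTrue: [#'UTF-8']
        ifFalse: [(startsWith value: #[16rFE 16rFF])
            ifTrue: [#'UTF-16BE']    "illustrative symbol name"
            ifFalse: [(startsWith value: #[16rFF 16rFE])
                ifTrue: [#'UTF-16LE']    "illustrative symbol name"
                ifFalse: [nil]]]].

"A UTF-8 BOM followed by '<' (16r3C)"
encodingFromBOM value: #[16rEF 16rBB 16rBF 16r3C]
```

A stream-based version would additionally have to decide where to leave the stream position when no BOM is found, which is the part the #nextBOM description above addresses.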
>
> Regards
>
> Holger Guhl
> --
> Senior Consultant * Certified Scrum Master * Holger.Guhl <at> heeg.de
> Tel: +49 231 9 75 99 21 * Fax: +49 231 9 75 99 20
> Georg Heeg eK Dortmund
> Handelsregister: Amtsgericht Dortmund A 12812
>
>
> Steven Kelly schrieb:
> > In 7.6, parsing a UTF-8 XML file starting with a BOM causes an error
> > "< expected, but not found". The code for
> > XML.StreamWrapper>>checkEncoding only takes account of a UTF-16 BOM
> > (somewhat odd, given it checks first that the encoding is UTF-8). Maybe
> > I'm missing something here. For my file to read, the following worked.
> > I couldn't resist changing the check for the UTF-16 BOM (FEFF / FFFE)
> > to be rather less cryptic than "c1 * c2 = 16rFD02" - to understand that
> > you need to know that multiplication is commutative, that FE * FF =
> > FD02, and that no other pair of bytes can multiply to the same value.
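
The uniqueness claim behind the "c1 * c2 = 16rFD02" test can be checked by brute force. The following is a workspace sketch, not part of the patch; the variable name pairs is just for illustration. It collects every unordered pair of byte values whose product is 16rFD02.

```smalltalk
| pairs |
pairs := OrderedCollection new.
"Try every unordered pair of byte values (a <= b)."
0 to: 255 do: [:a |
    a to: 255 do: [:b |
        a * b = 16rFD02
            ifTrue: [pairs add: (Array with: a with: b)]]].
pairs
```

The only pair collected is 254 and 255 (16rFE and 16rFF): 16rFD02 is 64770 = 2 * 3 * 5 * 17 * 127, and the prime factor 127 forces one byte to be 127 or 254, of which only 254 leaves a cofactor that fits in a byte.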
> >
> > The last ifTrue: block could just be "stream position: pos+3" if we can
> > be certain that will put us in the right place and state, even for
> > funkily encoded multi-byte per character streams. That sounds
> > reasonable, given that we've just decided that this really is a UTF-8
> > stream.
> >
> > Steve
> >
> > checkEncoding
> >
> >     | encoding |
> >     encoding := [stream encoding] on: Error do: [:ex | ex returnWith: #null].
> >     encoding = #'UTF-8'
> >         ifTrue:
> >             [| firstPair third pos |
> >             pos := stream position.
> >             "Read the first two bytes raw and peek at the third."
> >             stream setBinary: true.
> >             firstPair := stream nextAvailable: 2.
> >             third := stream peek.
> >             stream setBinary: false.
> >             "UTF-16 BOM (FEFF / FFFE): switch to a UTF-16 encoder."
> >             (#([16rFE 16rFF] [16rFF 16rFE]) includes: firstPair)
> >                 ifTrue:
> >                     [stream encoder: (UTF16StreamEncoder new
> >                         forByte1: firstPair first byte2: firstPair last)]
> >                 ifFalse:
> >                     ["UTF-8 BOM (EF BB BF): skip the third byte too;
> >                      otherwise rewind to where we started."
> >                     (firstPair = #[16rEF 16rBB] and: [third = 16rBF])
> >                         ifTrue: [stream setBinary: true; next; setBinary: false]
> >                         ifFalse: [stream position: pos]]]
> >