Ben Lippmeier | 1 Oct 2009 03:39

patch applied (packages/base): Strip any Byte Order Mark (BOM) from the front of decoded streams.

Wed Sep 30 01:42:29 PDT 2009  Ben.Lippmeier <at> anu.edu.au
  * Strip any Byte Order Mark (BOM) from the front of decoded streams.
  Ignore-this: d0d0c3ae87b31d71ef1627c8e1786445
  When decoding to UTF-32, Solaris iconv inserts a BOM at the front
  of the stream, but Linux iconv doesn't. 

    M ./GHC/IO/Handle/Internals.hs -6 +27

View patch online:
http://darcs.haskell.org/packages/base/_darcs/patches/20090930084229-43c66-fa2d613575d64ec62ad941dff46163a19354aa5d.gz
Simon Marlow | 1 Oct 2009 10:01

Re: patch applied (packages/base): Strip any Byte Order Mark (BOM) from the front of decoded streams.

On 01/10/2009 02:39, Ben Lippmeier wrote:
> Wed Sep 30 01:42:29 PDT 2009  Ben.Lippmeier <at> anu.edu.au
>    * Strip any Byte Order Mark (BOM) from the front of decoded streams.
>    Ignore-this: d0d0c3ae87b31d71ef1627c8e1786445
>    When decoding to UTF-32, Solaris iconv inserts a BOM at the front
>    of the stream, but Linux iconv doesn't.

Thanks for looking at this, but I think we should do it a different way.
It may be that Solaris thinks we want UTF-32 rather than UTF-32BE,
which is why it is adding the BOM: try changing haskellChar in
GHC.IO.Encoding.Iconv.  It currently uses UCS-4(LE), but it should
probably use UTF32{BE,LE}.

If that doesn't fix it, then I think we should apply any workarounds in 
GHC.IO.Encoding.Iconv, perhaps with a configure test to detect the 
erroneous behaviour.

Cheers,
	Simon
Ian Lynagh | 3 Oct 2009 13:50

patch applied (ghc-6.12/packages/base): Strip any Byte Order Mark (BOM) from the front of decoded streams.

Wed Sep 30 01:42:29 PDT 2009  Ben.Lippmeier <at> anu.edu.au
  * Strip any Byte Order Mark (BOM) from the front of decoded streams.
  Ignore-this: d0d0c3ae87b31d71ef1627c8e1786445
  When decoding to UTF-32, Solaris iconv inserts a BOM at the front
  of the stream, but Linux iconv doesn't. 

    M ./GHC/IO/Handle/Internals.hs -6 +27

View patch online:
http://darcs.haskell.org/ghc-6.12/packages/base/_darcs/patches/20090930084229-43c66-fa2d613575d64ec62ad941dff46163a19354aa5d.gz
Duncan Coutts | 3 Oct 2009 14:50

Re: patch applied (ghc-6.12/packages/base): Strip any Byte Order Mark (BOM) from the front of decoded streams.

On Sat, 2009-10-03 at 04:50 -0700, Ian Lynagh wrote:
> Wed Sep 30 01:42:29 PDT 2009  Ben.Lippmeier <at> anu.edu.au
>   * Strip any Byte Order Mark (BOM) from the front of decoded streams.
>   Ignore-this: d0d0c3ae87b31d71ef1627c8e1786445
>   When decoding to UTF-32, Solaris iconv inserts a BOM at the front
>   of the stream, but Linux iconv doesn't. 
> 
>     M ./GHC/IO/Handle/Internals.hs -6 +27

I agree with Simon that this is not the correct fix.

As Simon suspected, Solaris iconv does indeed insert a BOM if you ask to
convert into UTF-32. Arguably this is actually the correct behaviour.
Also, as Simon suspected, it does not insert a BOM if you ask to convert
into UTF-32BE or LE.

So the better solution is to ask iconv for UTF-32BE or UTF-32LE
depending on the host byte order. This should also work correctly on
Linux so there doesn't need to be Solaris #ifdeffery (just host order
#ifdeffery which is needed anyway).
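
A minimal sketch of that selection, for illustration only. It is not the
code in GHC.IO.Encoding.Iconv: it swaps the CPP host-order test for the
GHC.ByteOrder module, which assumes a much newer base (4.11 or later) than
the one discussed here, and the names utf32Name and hostUtf32 are
hypothetical.

```haskell
import GHC.ByteOrder (ByteOrder (..), targetByteOrder)

-- | Hypothetical helper: the iconv encoding name for UTF-32 in a given
-- byte order. Naming the byte order explicitly is what avoids the BOM
-- seen with plain "UTF-32" in this thread.
utf32Name :: ByteOrder -> String
utf32Name BigEndian    = "UTF-32BE"
utf32Name LittleEndian = "UTF-32LE"

-- | The name to request from iconv on the machine this was compiled for.
hostUtf32 :: String
hostUtf32 = utf32Name targetByteOrder

main :: IO ()
main = putStrLn hostUtf32
```

The same choice can be made at compile time with a preprocessor
conditional on the host byte order, which is the approach described above.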

Demo: (Solaris iconv on big endian CPU)

echo foo | iconv -f UTF-8 -t UTF-32 | hexdump -c

0000000  \0  \0 376 377  \0  \0  \0   f  \0  \0  \0   o  \0  \0  \0   o
0000010  \0  \0  \0  \n

echo foo | iconv -f UTF-8 -t UTF-32BE | hexdump -c
0000000  \0  \0  \0   f  \0  \0  \0   o  \0  \0  \0   o  \0  \0  \0  \n
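
In the first dump, the leading bytes \0 \0 376 377 (octal) are 00 00 FE FF,
which is U+FEFF in big-endian UTF-32, i.e. the BOM. The sketch below decodes
those bytes and drops a leading BOM from the decoded characters. It only
illustrates the idea of stripping after decoding; it is not the code from the
patch, and decodeUtf32BE, stripBOM and withBOM are hypothetical names.

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Decode big-endian UTF-32 bytes into Chars. No validation is done:
-- out-of-range code points make 'chr' fail, and trailing bytes that do
-- not fill a 4-byte unit are ignored.
decodeUtf32BE :: [Word8] -> String
decodeUtf32BE (a : b : c : d : rest) =
  chr (foldl (\acc w -> (acc `shiftL` 8) .|. fromIntegral w) 0 [a, b, c, d])
    : decodeUtf32BE rest
decodeUtf32BE _ = []

-- | Drop a single leading U+FEFF, if present.
stripBOM :: String -> String
stripBOM ('\xFEFF' : cs) = cs
stripBOM cs              = cs

-- | The bytes from the first hexdump above, written in hex.
withBOM :: [Word8]
withBOM =
  [ 0, 0, 0xFE, 0xFF
  , 0, 0, 0, 0x66
  , 0, 0, 0, 0x6F
  , 0, 0, 0, 0x6F
  , 0, 0, 0, 0x0A
  ]

main :: IO ()
main = do
  print (decodeUtf32BE withBOM)            -- decoded text, BOM included
  print (stripBOM (decodeUtf32BE withBOM)) -- same text without the BOM
```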

Duncan Coutts | 4 Oct 2009 18:24

patch applied (bytestring): TAG 0.9.1.5

Sun Oct  4 09:11:45 PDT 2009  Duncan Coutts <duncan <at> haskell.org>
  tagged 0.9.1.5
  Ignore-this: ee0008fcfbd5a7245d6a9a131a951c46

View patch online:
http://darcs.haskell.org/bytestring/_darcs/patches/20091004161145-adfee-fd441218788e0f6dde0d8e94ebbf397edaa6c76f.gz
Simon Marlow | 5 Oct 2009 12:12

Re: patch applied (ghc-6.12/packages/base): Strip any Byte Order Mark (BOM) from the front of decoded streams.

On 03/10/2009 13:50, Duncan Coutts wrote:
> On Sat, 2009-10-03 at 04:50 -0700, Ian Lynagh wrote:
>> Wed Sep 30 01:42:29 PDT 2009  Ben.Lippmeier <at> anu.edu.au
>>    * Strip any Byte Order Mark (BOM) from the front of decoded streams.
>>    Ignore-this: d0d0c3ae87b31d71ef1627c8e1786445
>>    When decoding to UTF-32, Solaris iconv inserts a BOM at the front
>>    of the stream, but Linux iconv doesn't.
>>
>>      M ./GHC/IO/Handle/Internals.hs -6 +27
>
> I agree with Simon that this is not the correct fix.
>
> As Simon suspected, Solaris iconv does indeed insert a BOM if you ask to
> convert into UTF-32. Arguably this is actually the correct behaviour.
> Also, as Simon suspected, it does not insert a BOM if you ask to convert
> into UTF-32BE or LE.
>
> So the better solution is to ask iconv for UTF-32BE or UTF-32LE
> depending on the host byte order. This should also work correctly on
> Linux so there doesn't need to be Solaris #ifdeffery (just host order
> #ifdeffery which is needed anyway).
>
> Demo: (Solaris iconv on big endian CPU)
>
> echo foo | iconv -f UTF-8 -t UTF-32 | hexdump -c
>
> 0000000  \0  \0 376 377  \0  \0  \0   f  \0  \0  \0   o  \0  \0  \0   o
> 0000010  \0  \0  \0  \n
>
> echo foo | iconv -f UTF-8 -t UTF-32BE | hexdump -c

Simon Marlow | 5 Oct 2009 12:59

patch applied (packages/base): use UTF32BE/UTF32LE instead of UCS-4/UCS-4LE

Mon Oct  5 03:15:54 PDT 2009  Simon Marlow <marlowsd <at> gmail.com>
  * use UTF32BE/UTF32LE instead of UCS-4/UCS-4LE
  Ignore-this: 2aef5e9bec421e714953b7aa1bdfc1b3

    M ./GHC/IO/Encoding/Iconv.hs -2 +2

View patch online:
http://darcs.haskell.org/packages/base/_darcs/patches/20091005101554-12142-840856a601bc2ffec2df91e01e9e3d571348a50c.gz
Ian Lynagh | 5 Oct 2009 22:46

patch applied (packages/base): Strip any Byte Order Mark (BOM) from the front of decoded streams.

Wed Sep 30 01:42:29 PDT 2009  Ben.Lippmeier <at> anu.edu.au
  UNDO: Strip any Byte Order Mark (BOM) from the front of decoded streams.
  Ignore-this: d0d0c3ae87b31d71ef1627c8e1786445
  When decoding to UTF-32, Solaris iconv inserts a BOM at the front
  of the stream, but Linux iconv doesn't. 

    M ./GHC/IO/Handle/Internals.hs -27 +6

View patch online:
http://darcs.haskell.org/packages/base/_darcs/patches/20090930084229-43c66-dcde1c417ef0ca19f4a0299f6cdea4772e079329.gz
Ian Lynagh | 5 Oct 2009 23:27

patch applied (ghc-6.12/packages/base): Strip any Byte Order Mark (BOM) from the front of decoded streams.

Wed Sep 30 01:42:29 PDT 2009  Ben.Lippmeier <at> anu.edu.au
  UNDO: Strip any Byte Order Mark (BOM) from the front of decoded streams.
  Ignore-this: d0d0c3ae87b31d71ef1627c8e1786445
  When decoding to UTF-32, Solaris iconv inserts a BOM at the front
  of the stream, but Linux iconv doesn't. 

    M ./GHC/IO/Handle/Internals.hs -27 +6

View patch online:
http://darcs.haskell.org/ghc-6.12/packages/base/_darcs/patches/20090930084229-43c66-dcde1c417ef0ca19f4a0299f6cdea4772e079329.gz
Ian Lynagh | 5 Oct 2009 23:27

patch applied (ghc-6.12/packages/base): use UTF32BE/UTF32LE instead of UCS-4/UCS-4LE

Mon Oct  5 03:15:54 PDT 2009  Simon Marlow <marlowsd <at> gmail.com>
  * use UTF32BE/UTF32LE instead of UCS-4/UCS-4LE
  Ignore-this: 2aef5e9bec421e714953b7aa1bdfc1b3

    M ./GHC/IO/Encoding/Iconv.hs -2 +2

View patch online:
http://darcs.haskell.org/ghc-6.12/packages/base/_darcs/patches/20091005101554-12142-840856a601bc2ffec2df91e01e9e3d571348a50c.gz
