Bayley, Alistair | 1 Sep 11:16 2004

Marshalling Haskell String <-> UTF-8

I want to call a foreign C function that takes a UTF-8 encoded string as one
of its arguments (and there's also a version of the function that receives
UTF-16). Can someone point me to documentation or examples of how this would
be done? AFAICT (reading the FFI spec) marshalling a String to a CString is
locale-dependent, whereas I know that I want UTF-8/16.

Also, if a C function returns a UTF-8 (or UTF-16) encoded string, how do I
marshall this reliably into a Haskell String?

Can I use the UTF-16 functions directly with CWStrings? (I'm not sure
exactly what wchar_t is, as it's apparently dependent on the locale at
compile-time, and could be 8, 16, or 32 bits).

Thanks,
Alistair.

Simon Marlow | 1 Sep 12:09 2004

RE: Marshalling Haskell String <-> UTF-8

On 01 September 2004 10:16, Bayley, Alistair wrote:

> I want to call a foreign C function that takes a UTF-8 encoded string
> as one of its arguments (and there's also a version of the function
> that receives UTF-16). Can someone point me to documentation or
> examples of how this would be done? AFAICT (reading the FFI spec)
> marshalling a String to a CString is locale-dependent, whereas I know
> that I want UTF-8/16. 
> 
> Also, if a C function returns a UTF-8 (or UTF-16) encoded string, how
> do I marshall this reliably into a Haskell String?
> 
> Can I use the UTF-16 functions directly with CWStrings? (I'm not sure
> exactly what wchar_t is, as it's apparently dependent on the locale at
> compile-time, and could be 8, 16, or 32 bits).

Your best bet is to marshal it yourself.  We're a bit behind in this
area: 6.2.x doesn't have CAString and CWString, and CString is just
char*.  The HEAD has CAString and CWString, and will hopefully follow
the FFI spec by the time we release 6.4 (we still have to do the locale
encoding/decoding between CString and String, IIRC).

In any case, none of this allows you to specify a UTF-8 conversion.

wchar_t varies from platform to platform: on Windows it is 16 bits, on
Linux with glibc it is 32 bits, for example.  CWString is only useful
for talking to C interfaces that are expressed in terms of wchar_t.
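
As an illustration of marshalling by hand, here is a minimal sketch; it is not code from this thread. The names encodeChar, utf8Encode and withUtf8 are invented for the example. It assumes the input contains no NUL characters, does not reject surrogate code points, and emits at most four bytes per character (the modern UTF-8 limit).

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)
import Foreign.C.String (CString)
import Foreign.Marshal.Array (withArray0)
import Foreign.Ptr (castPtr)

-- Encode a single code point as 1 to 4 UTF-8 bytes.
-- Assumption: no check for surrogates (U+D800..U+DFFF).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Payload bits of the lead byte.
    hi :: Int -> Word8
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Encode a whole String as a list of UTF-8 bytes.
utf8Encode :: String -> [Word8]
utf8Encode = concatMap encodeChar

-- Run an action with a temporary NUL-terminated UTF-8 copy of the String.
-- The buffer is freed when the action returns.
withUtf8 :: String -> (CString -> IO a) -> IO a
withUtf8 s act = withArray0 0 (utf8Encode s) (act . castPtr)
```

A foreign import that takes a const char * could then be called as withUtf8 str c_function, where c_function stands for whatever binding you have declared.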

Cheers,
	Simon

Ross Paterson | 1 Sep 12:13 2004

Re: Marshalling Haskell String <-> UTF-8

On Wed, Sep 01, 2004 at 10:16:23AM +0100, Bayley, Alistair wrote:
> I want to call a foreign C function that takes a UTF-8 encoded string as one
> of its arguments (and there's also a version of the function that receives
> UTF-16). Can someone point me to documentation or examples of how this would
> be done? AFAICT (reading the FFI spec) marshalling a String to a CString is
> locale-dependent, whereas I know that I want UTF-8/16.

The locale-dependent marshalling of CString described by the FFI spec
isn't yet implemented in the library.  There is some code by John Meacham
including UTF-8 conversion at

	http://www.haskell.org/pipermail/ffi/2003-August/001355.html

> Can I use the UTF-16 functions directly with CWStrings? (I'm not sure
> exactly what wchar_t is, as it's apparently dependent on the locale at
> compile-time, and could be 8, 16, or 32 bits).

Under Windows, CWString uses the UTF-16 encoding.  On systems that define
__STDC_ISO_10646__ (e.g. glibc as used under Linux) it uses UTF-32.
(This is in the CVS version that will become 6.4, not the current release.)
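
Since the width of wchar_t differs between platforms, a portable way to reach a UTF-16 interface is to encode explicitly rather than rely on CWString. The sketch below is only an illustration under that assumption: wcharBits, utf16Encode and withUtf16 are made-up names, the code units are written in native byte order, and the input is assumed to contain no NUL characters or lone surrogates.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)
import Foreign.C.Types (CWchar)
import Foreign.Marshal.Array (withArray0)
import Foreign.Ptr (Ptr)
import Foreign.Storable (sizeOf)

-- Width of wchar_t in bits on the platform this module is compiled for.
wcharBits :: Int
wcharBits = 8 * sizeOf (undefined :: CWchar)

-- Encode a String as UTF-16 code units, using a surrogate pair for
-- every code point above U+FFFF.
utf16Encode :: String -> [Word16]
utf16Encode = concatMap enc
  where
    enc c
      | n < 0x10000 = [fromIntegral n]
      | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                      , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
      where
        n = ord c
        m = n - 0x10000

-- Run an action with a temporary zero-terminated UTF-16 buffer.
withUtf16 :: String -> (Ptr Word16 -> IO a) -> IO a
withUtf16 s = withArray0 0 (utf16Encode s)
```

Checking wcharBits at run time is one way to decide whether CWString can be handed directly to a UTF-16 function on a given platform.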

David Roundy | 1 Sep 12:26 2004

Re: Marshalling Haskell String <-> UTF-8

On Wed, Sep 01, 2004 at 11:13:23AM +0100, Ross Paterson wrote:
> On Wed, Sep 01, 2004 at 10:16:23AM +0100, Bayley, Alistair wrote:
> > I want to call a foreign C function that takes a UTF-8 encoded string
> > as one of its arguments (and there's also a version of the function
> > that receives UTF-16). Can someone point me to documentation or
> > examples of how this would be done? AFAICT (reading the FFI spec)
> > marshalling a String to a CString is locale-dependent, whereas I know
> > that I want UTF-8/16.
> 
> The locale-dependent marshalling of CString described by the FFI spec
> isn't yet implemented in the library.  There is some code by John Meacham
> including UTF-8 conversion at
> 
> 	http://www.haskell.org/pipermail/ffi/2003-August/001355.html

You could also look at the darcs source code, as darcs uses UTF8 to store
file names.
-- 
David Roundy
http://www.abridgegame.org/darcs

George Russell | 1 Sep 12:42 2004

Re: Marshalling Haskell String <-> UTF-8

I have implemented code to do this which I think is better than
John Meacham's, because it (a) handles all UTF8 sequences
(up to 6 bytes); (b) checks for errors as UTF8 decoders are
supposed to do; (c) lets you determine if there is an error
without having to seq the entire list.  Here is a link:

    http://www.haskell.org//pipermail/glasgow-haskell-users/2004-April/006564.html
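
To make point (b) concrete, here is a rough sketch of a checking decoder. It is not the code behind the link above: decodeUtf8 is a name invented for this example, it accepts only the modern 1 to 4 byte forms rather than sequences of up to 6 bytes, and because it returns Either it must scan the whole input before reporting success, so it does not have property (c).

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Decode UTF-8 bytes, rejecting malformed input with an error message.
decodeUtf8 :: [Word8] -> Either String String
decodeUtf8 [] = Right []
decodeUtf8 (b : bs)
  | b < 0x80           = fmap (chr (fromIntegral b) :) (decodeUtf8 bs)
  | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F) 0x80
  | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F) 0x800
  | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07) 0x10000
  | otherwise          = Left "invalid lead byte"
  where
    -- n continuation bytes expected; minCp is the smallest code point
    -- that legitimately needs a sequence of this length.
    multi :: Int -> Word8 -> Int -> Either String String
    multi n lead minCp =
      case splitAt n bs of
        (conts, rest)
          | length conts /= n                    -> Left "truncated sequence"
          | any (\c -> c .&. 0xC0 /= 0x80) conts -> Left "bad continuation byte"
          | cp < minCp                           -> Left "overlong encoding"
          | cp > 0x10FFFF                        -> Left "code point out of range"
          | cp >= 0xD800 && cp <= 0xDFFF         -> Left "surrogate code point"
          | otherwise                            -> fmap (chr cp :) (decodeUtf8 rest)
          where
            -- Accumulate six payload bits from each continuation byte.
            cp = foldl (\acc c -> (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F))
                       (fromIntegral lead) conts
```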

Bayley, Alistair | 1 Sep 15:51 2004

RE: Marshalling Haskell String <-> UTF-8

> From: George Russell [mailto:ger <at> informatik.uni-bremen.de]
> 
> http://www.haskell.org//pipermail/glasgow-haskell-users/2004-April/006564.html

Thanks George, this looks useful.

There are some things I want to clarify...

module UTF8(
    toUTF8,
       -- :: String -> String
       -- Converts a String (whose characters must all have codes <2^31) into
       -- its UTF8 representation.
    fromUTF8WE,
       -- :: Monad m => String -> m String
       -- Converts a UTF8 representation of a String back into the String,
       -- catching all possible format errors.

Does toUTF8 return a String whose Chars are all code-points < 256, which,
when converted to bytes, will represent a UTF-8 string?

Likewise, does fromUTF8WE expect a String whose Chars are all code-points <
256 i.e. they are the result of saying "chr n" for each byte in the UTF-8
stream?
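
If the answer to both questions is yes (an assumption; the thread does not confirm it here), then moving between such a "one Char per byte" String and real bytes is mechanical. The sketch below shows one way to do it; charsToBytes, bytesToChars, withByteChars and peekByteChars are names invented for this example, while withCAString and peekCAString are the Foreign.C.String functions that treat each Char as a single 8-bit value.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)
import Foreign.C.String (CString, peekCAString, withCAString)

-- Turn a String whose Chars are all < 256 into bytes.
-- Returns Nothing if any Char does not fit in one byte.
charsToBytes :: String -> Maybe [Word8]
charsToBytes = mapM toByte
  where
    toByte c
      | ord c < 256 = Just (fromIntegral (ord c))
      | otherwise   = Nothing

-- Inverse direction: one Char (code point < 256) per byte.
bytesToChars :: [Word8] -> String
bytesToChars = map (chr . fromIntegral)

-- Pass a byte-per-Char String to C as a NUL-terminated char*.
-- Returns Nothing without calling the action if a Char is >= 256.
withByteChars :: String -> (CString -> IO a) -> IO (Maybe a)
withByteChars s act
  | all ((< 256) . ord) s = fmap Just (withCAString s act)
  | otherwise             = return Nothing

-- Read a NUL-terminated char* back as one Char per byte, ready to be
-- handed to a UTF-8 decoder that expects that representation.
peekByteChars :: CString -> IO String
peekByteChars = peekCAString
```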

> From: Simon Marlow [mailto:simonmar <at> microsoft.com]
> 
> In any case, none of this allows you to specify a UTF-8 conversion.

John Meacham | 2 Sep 01:39 2004

Re: Marshalling Haskell String <-> UTF-8

On Wed, Sep 01, 2004 at 11:13:23AM +0100, Ross Paterson wrote:
> On Wed, Sep 01, 2004 at 10:16:23AM +0100, Bayley, Alistair wrote:
> > I want to call a foreign C function that takes a UTF-8 encoded string as one
> > of its arguments (and there's also a version of the function that receives
> > UTF-16). Can someone point me to documentation or examples of how this would
> > be done? AFAICT (reading the FFI spec) marshalling a String to a CString is
> > locale-dependent, whereas I know that I want UTF-8/16.
> 
> The locale-dependent marshalling of CString described by the FFI spec
> isn't yet implemented in the library.  There is some code by John Meacham
> including UTF-8 conversion at

I should mention I have a new version of the CWString library in
development that conforms to the new FFI spec and works on all posixy
systems, not just those that have unicode wchar_t's like my first
posting.  

It is not quite ready for release, but if there is a strong need I can
package it up nicely. 
        John
-- 
John Meacham - ⑆repetae.net⑆john⑈ 

Ross Paterson | 2 Sep 12:36 2004

Re: Marshalling Haskell String <-> UTF-8

On Wed, Sep 01, 2004 at 04:39:30PM -0700, John Meacham wrote:
> I should mention I have a new version of the CWString library in
> development that conforms to the new FFI spec and works on all posixy
> systems, not just those that have unicode wchar_t's like my first
> posting.  
> 
> It is not quite ready for release, but if there is a strong need I can
> package it up nicely. 

The most useful packaging would be as a patch against the HEAD version
of fptools/libraries/base/Foreign/C/String.hs

There may also be a difficulty in that you may require hsc2hs but I
think Simon wants to keep it out of low-level modules (?).

Carsten Schultz | 2 Sep 15:37 2004

unfold deforestation

[If this should have gone to a different list, please tell me and I
 will subscribe to that.]

Hi!

There is no rule in standard libs related to unfoldr.  From short
googling I know that there may be several approaches to this, but as
long as nothing fancy is introduced, the following might be a simple
improvement:

unfoldr :: (b -> Maybe (a,b)) -> b -> [a]

unfoldr u x = build (g x)
    where
    g x c n = f x
        where
        f x = case u x of
                Just (h,t) -> h `c` f t
                Nothing -> n

{-# INLINE unfoldr #-}
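
For anyone who wants to try the definition outside the standard libraries, the following self-contained variant may help. It is a sketch under a few assumptions: build is taken from GHC.Exts, the function is renamed unfoldrB to avoid clashing with Data.List, and downFrom is a made-up consumer-side example. Whether fusion actually fires depends on the optimisation level and GHC version.

```haskell
import GHC.Exts (build)

-- unfoldr expressed through build, so that foldr/build fusion can
-- eliminate the intermediate list when the consumer is a good consumer.
unfoldrB :: (b -> Maybe (a, b)) -> b -> [a]
unfoldrB u x0 = build (\c n ->
  let go x = case u x of
        Just (h, t) -> h `c` go t
        Nothing     -> n
  in go x0)
{-# INLINE unfoldrB #-}

-- Hypothetical example producer: counts down from k to 1.
downFrom :: Int -> [Int]
downFrom = unfoldrB step
  where
    step 0 = Nothing
    step k = Just (k, k - 1)
```

With optimisation on, an expression such as sum (downFrom 10) is a candidate for fusion, since sum is a list consumer and unfoldrB a build-based producer.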

It makes unfoldr a good producer and works nicely with the following
example from `The Under-Appreciated Unfold' (Gibbons&Jones):

===
module Tree(Tree(..), bftf) where

import Unfold


Simon Marlow | 2 Sep 17:39 2004

RE: Marshalling Haskell String <-> UTF-8

On 02 September 2004 11:36, Ross Paterson wrote:

> On Wed, Sep 01, 2004 at 04:39:30PM -0700, John Meacham wrote:
>> I should mention I have a new version of the CWString library in
>> development that conforms to the new FFI spec and works on all posixy
>> systems, not just those that have unicode wchar_t's like my first
>> posting. 
>> 
>> It is not quite ready for release, but if there is a strong need I
>> can package it up nicely.
> 
> The most useful packaging would be as a patch against the HEAD version
> of fptools/libraries/base/Foreign/C/String.hs
> 
> There may also be a difficulty in that you may require hsc2hs but I
> think Simon wants to keep it out of low-level modules (?).

Yes, using hsc2hs in libraries/base is problematic for bootstrapping
reasons - it adds inconvenient extra steps to the process of
bootstrapping GHC on a new platform, so please avoid it.  Outside of the
libraries required for building GHC, hsc2hs is fair game.

Cheers,
	Simon
