Robert Bradshaw | 1 Dec 04:09 2009

Re: [Cython] Another string encoding idea

Just to clarify the discussion, here is what I'm proposing (which is still  
in flux, and simplified due to memory issues, which makes it less  
attractive as one does not get to choose the encoding used, but it  
would always be UTF-8 in Py3).

Without directive(s) (as it is now):

    char* <-> bytes

With the directive(s) (which can be applied locally or globally):

     char* <-> str
     unicode/bytes -> char* would also work (for Py2/Py3 respectively)

The encoding used would be the system default (in Py2) and UTF-8 (in  
Py3). This would use the defenc slot so the encoded char* would be  
valid as long as the unicode object is around, and the long term  
future of the defenc slot needs to be ensured before this could be  
used for non-argument conversions.

Also out there is the idea of a directive that would make char* become  
unicode in both Py2 and Py3.
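[Editorial note: a rough Python 3 model of the coercion semantics proposed above. This is purely illustrative; the directive did not exist at the time of writing, and the function names are made up.]

```python
# Hypothetical model of the proposed char* <-> str coercion under the
# directive: str -> char* encodes with UTF-8 in Py3 (the system default
# encoding in Py2), and char* -> str decodes the same way. The bytes
# object here stands in for the char* buffer held by the defenc slot,
# which stays valid as long as the unicode object is alive.

def str_to_char_p(s):
    """Model str -> char*: produce the encoded buffer."""
    return s.encode("utf-8")

def char_p_to_str(data):
    """Model char* -> str: decode with the fixed Py3 encoding."""
    return data.decode("utf-8")

assert char_p_to_str(str_to_char_p("héllo")) == "héllo"
```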

On Nov 29, 2009, at 8:47 AM, Stefan Behnel wrote:

> Robert Bradshaw, 28.11.2009 22:12:
>> My personal concern is the pain I see porting Sage to Py3. I'd have  
>> to
>> go through the codebase and throw in encodes() and decodes() and
>> change signatures of functions that take char* arguments
[...]

Robert Bradshaw | 1 Dec 04:23 2009

Re: [Cython] Another string encoding idea

On Nov 30, 2009, at 10:14 AM, Christopher Barker wrote:

> Robert Bradshaw wrote:
>> this is the kind
>> of thing that usually tells me there's a deficiency in the language
>> that should be fixed to ease the user's burden instead.
>
> sure -- but the deficiency is in C (and py2), and that's not something
> we can fix. As for the Cython language, it should really follow  
> Python:
> unicode for "text", bytes for arbitrary data.
>
> But we need to deal with C (and fortran) no matter how you slice it.
>
> I wrote a similar post on the numpy list: I think the key from a  
> user's
> perspective is that one is either working with "text": human readable
> stuff, or data. If text, then the natural python3 data type is a  
> unicode
> string. If data, then bytes -- we should really follow that as best  
> we can.

Exactly.

unicode = char* + length + encoding
bytes = char* + length

So what is the Python equivalent of char*? Neither, and what you want  
depends on the application and context.
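[Editorial note: the point above can be seen directly at the Python level. The same raw buffer, a char* plus a length, yields different text depending on which encoding is assumed:]

```python
raw = b"\xc3\xa9"  # two raw bytes: char* + length, no encoding attached

# Assuming UTF-8, the buffer is one character; assuming Latin-1, two.
assert raw.decode("utf-8") == "\u00e9"          # 'é'
assert raw.decode("latin-1") == "\u00c3\u00a9"  # 'Ã©'
```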

[...]

Lisandro Dalcin | 1 Dec 04:26 2009

Re: [Cython] Another string encoding idea

On Tue, Dec 1, 2009 at 12:09 AM, Robert Bradshaw
<robertwb@...> wrote:
>
>> BTW, I wouldn't mind extending the string input argument conversion
>> support
>> to everything that supports the buffer protocol.
>
> That might be interesting, though one difficulty is that buffers in
> general don't have an intrinsic notion of length. (Technically, nor do
> strings, but null-terminated strings encoded with null-free encodings
> are common enough to make char* usable.)
>

I see, then in the near future I'll be able to create a numpy array
with "unsigned char" dtype (let's say, for storing an 8-bit image). But
at some point, I'll mistakenly pass these arrays to something
accepting 'bytes'... No, -1, do not do that please. Explicit is better
than implicit. Errors should never pass silently.
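[Editorial note: a stdlib sketch of the objection, using array.array in place of a numpy uint8 array; the function is illustrative only.]

```python
from array import array

img = array("B", [0, 255, 128])  # unsigned-char buffer, like a uint8 image

def takes_bytes(b):
    # A function whose contract is bytes; implicitly accepting any
    # buffer-supporting object here would hide the caller's mistake.
    if not isinstance(b, bytes):
        raise TypeError("expected bytes, got %s" % type(b).__name__)
    return b

assert takes_bytes(bytes(img)) == b"\x00\xff\x80"  # explicit conversion: fine
try:
    takes_bytes(img)  # the mistake the implicit coercion would mask
except TypeError:
    pass
else:
    raise AssertionError("error passed silently")
```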

-- 
Lisandro Dalcín
---------------
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594
Robert Bradshaw | 1 Dec 04:36 2009

Re: [Cython] Another string encoding idea

On Nov 30, 2009, at 7:26 PM, Lisandro Dalcin wrote:

> On Tue, Dec 1, 2009 at 12:09 AM, Robert Bradshaw
> <robertwb@...> wrote:
>>
>>> BTW, I wouldn't mind extending the string input argument conversion
>>> support
>>> to everything that supports the buffer protocol.
>>
>> That might be interesting, though one difficulty is that buffers in
>> general don't have an intrinsic notion of length. (Technically, nor do
>> strings, but null-terminated strings encoded with null-free encodings
>> are common enough to make char* usable.)
>>
>
> I see, then in the near future I'll be able to create a numpy array
> with "unsigned char" dtype (let's say, for storing an 8-bit image). But
> at some point, I'll mistakenly pass these arrays to something
> accepting 'bytes'... No, -1, do not do that please. Explicit is better
> than implicit. Errors should never pass silently.

I agree, that would be bad.

What I was interpreting this as is that one could write

def foo(int* data):
    ...

and data would be "extracted" via the buffer interface. That could get  
messy with char*. I certainly don't support
[...]

Stefan Behnel | 1 Dec 07:41 2009

Re: [Cython] Another string encoding idea


Robert Bradshaw, 01.12.2009 04:09:
> Just to clarify discussion, here is what I'm proposing (which is still  
> in flux, and simplified due to memory issues, which does make it less  
> attractive as one does not get to choose the used encoding, but it  
> would always be UTF-8 in Py3).

... and the 'default encoding' in Py2, which may or may not be ASCII, but
would likely be at least something that's compatible with ASCII, as it
would break tons of code otherwise.

> Without directive(s) (as it is now):
> 
>     char* <-> bytes
>
> With the directive(s) (which can be applied locally or globally):
> 
>      char* <-> str
>      unicode/bytes -> char* would also work (for Py2/Py3 respectively)

'respectively' in the sense of 'for both'?

> The encoding used would be the system default (in Py2) and UTF-8 (in  
> Py3). This would use the defenc slot so the encoded char* would be  
> valid as long as the unicode object is around, and the long term  
> future of the defenc slot needs to be ensured before this could be  
> used for non-arguments conversion.

That's my main concern here. We are basing a major feature on a side-effect
of something that's declared "for internal use only".
[...]

Stefan Behnel | 1 Dec 07:48 2009

Re: [Cython] Another string encoding idea


Robert Bradshaw, 01.12.2009 04:36:
> one could write
> 
> def foo(int* data):
>     ...
> 
> and data would be "extracted" via the buffer interface. That could get  
> messy with char*.

Less messy than for int*, for sure. At least, char* has a somewhat well
defined termination character '\0'. How would you know how many elements an
int* parameter would refer to?

Stefan
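[Editorial note: for what it's worth, a Py_buffer handed over by the buffer protocol does carry the element count; the difficulty above arises once only a bare int* survives. A Python-level sketch of the `def foo(int* data)` extraction, with an illustrative function name:]

```python
from array import array

def foo(data):
    # Model of `def foo(int* data)` via the buffer protocol: a
    # memoryview exposes the element format and the length together,
    # unlike a raw pointer.
    mv = memoryview(data)
    if mv.format != "i":
        raise TypeError("expected a buffer of C ints")
    return sum(mv), len(mv)

assert foo(array("i", [1, 2, 3])) == (6, 3)
```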
Robert Bradshaw | 1 Dec 07:57 2009

Re: [Cython] Another string encoding idea

On Nov 30, 2009, at 10:48 PM, Stefan Behnel wrote:

>
> Robert Bradshaw, 01.12.2009 04:36:
>> one could write
>>
>> def foo(int* data):
>>    ...
>>
>> and data would be "extracted" via the buffer interface. That could  
>> get
>> messy with char*.
>
> Less messy than for int*, for sure. At least, char* has a somewhat  
> well
> defined termination character '\0'. How would you know how many  
> elements an
> int* parameter would refer to?

You wouldn't, which is the problem that I was worried about. I think I  
misinterpreted what you were trying to say here, so just ignore it.

- Robert

Robert Bradshaw | 1 Dec 08:41 2009

Re: [Cython] Another string encoding idea

On Nov 30, 2009, at 10:41 PM, Stefan Behnel wrote:

>
> Robert Bradshaw, 01.12.2009 04:09:
>> Just to clarify discussion, here is what I'm proposing (which is  
>> still
>> in flux, and simplified due to memory issues, which does make it less
>> attractive as one does not get to choose the used encoding, but it
>> would always be UTF-8 in Py3).
>
> ... and the 'default encoding' in Py2, which may or may not be  
> ASCII, but
> would likely be at least something that's compatible with ASCII, as it
> would break tons of code otherwise.

Yep.

>
>
>> Without directive(s) (as it is now):
>>
>>    char* <-> bytes
>>
>> With the directive(s) (which can be applied locally or globally):
>>
>>     char* <-> str
>>     unicode/bytes -> char* would also work (for Py2/Py3 respectively)
>
> 'respectively' in the sense of 'for both'?

[...]

Stefan Behnel | 1 Dec 09:56 2009

Re: [Cython] Another string encoding idea


Robert Bradshaw, 01.12.2009 04:23:
> On Nov 30, 2009, at 10:14 AM, Christopher Barker wrote:
>> I think the key from a user's
>> perspective is that one is either working with "text": human readable
>> stuff, or data. If text, then the natural python3 data type is a  
>> unicode string. If data, then bytes -- we should really follow that as
>> best we can.
> 
> unicode = char* + length + encoding
> bytes = char* + length
> 
> So what is the Python equivalent of char*? Neither, and what you want  
> depends on the application and context.

Ok, so we agree that there are various different use cases that require
different setups.

As I indicated before, CPython's argument unpacking functions support
various ways of dealing with unicode/bytes conversion to char* through
their "s#", "u#" and "es#" formats. These are actually helpful, but not
currently supported by Cython.

Maybe a buffer emulation might help here, where Cython would set up a
Py_buffer struct for a function argument and fill in the values from the
Python string that was passed. That might be a way to handle all use cases
in a uniform way, and we could easily extend this to an additional buffer
option 'encoding', which would override the platform specific default
encoding used to handle char* buffers.
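[Editorial note: a hypothetical Python sketch of that uniform unpacking, modelling the "s#"/"es#" behaviour plus the proposed per-argument 'encoding' option. The name and signature are illustrative, not a real API.]

```python
def unpack_char_buffer(obj, encoding="utf-8"):
    # Like "es#": encode a unicode string first, then expose its buffer;
    # like "s#": pass byte buffers through together with their length.
    if isinstance(obj, str):
        data = obj.encode(encoding)
    elif isinstance(obj, (bytes, bytearray)):
        data = bytes(obj)
    else:
        raise TypeError("expected str or a bytes-like object")
    return data, len(data)

assert unpack_char_buffer("héllo") == (b"h\xc3\xa9llo", 6)
assert unpack_char_buffer("héllo", encoding="latin-1") == (b"h\xe9llo", 5)
```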

[...]

Stefan Behnel | 1 Dec 10:43 2009

Re: [Cython] Another string encoding idea


Robert Bradshaw, 01.12.2009 08:41:
> On Nov 30, 2009, at 10:41 PM, Stefan Behnel wrote:
>> Robert Bradshaw, 01.12.2009 04:09:
>>> This is completely orthogonal to type inference.
>> It's not orthogonal, as type inference currently breaks C type to 
>> untyped Python name assignments, which is exactly the case you want to
>> influence with the directive. This means that the char* directive
>> would override the type inference directive for one special case.
> 
> I was just using assignment to an untyped variable as an implicit  
> coercion to object in my example. I should have been more explicit and  
> written
> 
>      cdef char* ss = ...
>      cdef object x = ss

In which case you could just as well type x as str, or use an explicit cast
to the type you want. When enabling type inference, the 'default behaviour'
no longer comes for free, except for exactly the function call boundary
cases that I keep stressing.

Stefan
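[Editorial note: a Python-level model of making the coercion explicit at the assignment site, as suggested above, instead of relying on a module-wide directive. Illustrative only.]

```python
# `cdef char* ss = ...` modelled by a bytes object; the programmer
# chooses the target type explicitly rather than via a default.
ss = b"abc"
x_as_bytes = bytes(ss)          # like typing x as bytes: keep raw data
x_as_text = ss.decode("utf-8")  # like typing x as str: decode explicitly

assert x_as_bytes == b"abc" and x_as_text == "abc"
```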

