Martin von Gagern | 1 May 2009 09:35

Re: Re: Broken unicode handling in unison 2.27.57

Benjamin Pierce wrote:
>> Martin von Gagern wrote: 
>> I hope that with this kind of optional implementation, the code can go
>> into some development tree (either trunk or a feature branch) soon, so
>> developers can work together to improve on it.
> 
> I'm glad to see that people are interested in improving the unicode
> situation, but I have to admit I'm a little reluctant to put a "small
> very ugly hack" into the Unison development tree... :-)

I can understand your concerns. But how do you propose to continue?

Neither the original author nor I myself are too fluent in OCaml, and I
haven't even had a close enough look to understand most parts of the
Unison code base. So I guess for us improving the patch on our own,
without further input, would be pretty hard indeed. If you were to
review the patch, see how ugly it really is, and point out things that
would require improvement, that might be a good first step.

Things that I can think of that might require improvement:
1. The position of the change. Is Case.normalize the correct place?
2. The dependency. Is using camomile acceptable, or do we require our own
   implementation of unicode normalization?
3. Use of findlib. While I guess the use of findlib for camomile makes
   the build more portable, it might be cleaner to switch the whole
   unison build to findlib. On the other hand, if you want to keep build
   time deps to a minimum, findlib shouldn't be used at all.
4. The handling of compilation alternatives. Is providing two files
   "unicode.ml" in two different directories an acceptable way to
   provide and link to optional code?

Benjamin Pierce | 1 May 2009 15:43

Re: Re: Broken unicode handling in unison 2.27.57

> Benjamin Pierce wrote:
>>> Martin von Gagern wrote:
>>> I hope that with this kind of optional implementation, the code can go
>>> into some development tree (either trunk or a feature branch) soon, so
>>> developers can work together to improve on it.
>>
>> I'm glad to see that people are interested in improving the unicode
>> situation, but I have to admit I'm a little reluctant to put a "small
>> very ugly hack" into the Unison development tree... :-)
>
> I can understand your concerns. But how do you propose to continue?
>
> Neither the original author nor I myself are too fluent in OCaml, and I
> haven't even had a close enough look to understand most parts of the
> Unison code base. So I guess for us improving the patch on our own,
> without further input, would be pretty hard indeed. If you were to
> review the patch, see how ugly it really is, and point out things that
> would require improvement, that might be a good first step.

I'm not an expert on unicode, character set, or internationalization  
issues, so I'm afraid I can't be much use here.  But, as I've heard  
from other people and as you comment below, a clean solution seems to  
require changes in many places, which to my mind means beginning with  
a paper design that specifies what behavior is intended in all cases.

> Things that I can think of might require improvement:
> 1. The position of the change. Is Case.normalize the correct place?

Russel Winder | 1 May 2009 16:40

Re: Re: Broken unicode handling in unison 2.27.57

Benjamin,

On Fri, 2009-05-01 at 09:43 -0400, Benjamin Pierce wrote:

> If someone (or a group of people) steps up and volunteers to design 
> and implement a clean solution, and if partial versions need to be 
> stored someplace while the project is underway, I'll be happy to 
> discuss finding a home for them either in a branch of the unison 
> repository or in a separate repository on U. Penn's svn server.

Or people could just use Bazaar to create a working branch of the
Subversion trunk, no need for extra Subversion logins then.  Launchpad
can host Bazaar branches.

Alternatively, you could use Git and GitHub, but I believe Bazaar to be
better.

The point here is that a DVCS is a better tool for creating a branch
pending merge than trying to work with Subversion, and Bazaar and Git
can both work directly with Subversion repositories both for read and
write.

-- 
Russel.
============================================================
Dr Russel Winder                 Partner

Concertant LLP          t: +44 20 7585 2200, +44 20 7193 9203
41 Buckmaster Road,     f: +44 8700 516 084    voip:  sip:russel.winder <at> ekiga.net
London SW11 1EN, UK.    m: +44 7770 465 077    xmpp: russel <at> russel.org.uk

Stefan Schwenkenbecher | 1 May 2009 23:43

Re: Looking for a recent Win32 binary of Unison

Hi Karl,
thanks for the fast response. I will try the profile option; I suppose
that is enough for my needs :-) And next time I promise to read the
manual about the possible options first :-)

Looking forward to the next release. 
Best regards, 
Stefan.

--- In unison-users <at> yahoogroups.com, Karl M <karlm30 <at> ...> wrote:
>
> 
> > Date: Wed, 29 Apr 2009 21:45:39 +0000
> > Subject: [unison-users] Looking for a recent Win32 binary of Unison
> >
> > Hello all,
> >
> > does anybody have a recent win32 binary of the current unison version
> > (2.31.4)? I'm not that familiar with compiling from source under
> > windows, so maybe there is somebody out there who could provide a newer
> > windows build...
> >
> > I know that Alan is hosting several pre-compiled windows binaries on
> > his site, but the last windows build (by Karl M) is version 2.30.4,
> > which doesn't match my server's version :-(
> >
> >
> >
> > Ok, thanks a lot for your help in advance, and best regards,
> >
> > Stefan.
> >
> Hi Stefan...
> 

Martin von Gagern | 1 May 2009 16:54

Re: Re: Broken unicode handling in unison 2.27.57

Russel Winder wrote:
> Or people could just use Bazaar to create a working branch of the
> Subversion trunk, no need for extra Subversion logins then.  Launchpad
> can host Bazaar branches.

I'd be happy with a bazaar branch hosted on launchpad. But I won't do
this alone, so that's only a solution if other people involved are happy
with that as well.

Greetings,
 Martin

Martin von Gagern | 1 May 2009 19:56

Re: Re: Broken unicode handling in unison 2.27.57

Benjamin Pierce wrote:
> I'm not an expert on unicode, character set, or internationalization
> issues, so I'm afraid I can't be much use here.

I guess I could provide the required input here. I mainly need a guide
around the Unison codebase and some OCaml features, so that I can find
out points of interest quickly without understanding the bulk of Unison.

Here is a crash course on the issues at hand, not only for you, but for
future reference as well. Skip this for now if you want.

There are a lot of character sets out there. Some of them interpret one
byte as one character, according to some table, while others use more
complicated encodings to express a larger number of characters.

In the past, different encodings have often caused trouble. Unicode is
supposed to be universal, making all other character sets obsolete in
the long run, as they are all ideally subsets of unicode.

In its current form, a unicode codepoint (roughly a character, but the
term "codepoint" is more precise) is an integer of at most 20 or so
bits. There are several possible encodings of Unicode using units of
fewer bits, most notably UTF-8 using units of 8 bits (and up to six bytes
for a codepoint) and UTF-16 using units of 16 bits (and up to two such
words for a codepoint). Any of these encodings can express an arbitrary
sequence of codepoints, so they are equally expressive.

Now to normalization. Unicode does provide things called "combining
characters", which don't usually produce a glyph by themselves, but are
instead combined with the preceding character to form a single glyph.
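
A small illustration of combining characters, written in Python rather than OCaml only because Python's standard library ships the `unicodedata` module. The sample string is an assumed example, not something taken from the patch under discussion:

```python
import unicodedata

# "é" as one precomposed codepoint (U+00E9) ...
composed = "\u00e9"
# ... and as "e" followed by the combining acute accent (U+0301).
decomposed = "e\u0301"

# Both render as the same glyph, but they are different codepoint
# sequences and therefore different byte sequences in UTF-8.
print(composed == decomposed)          # False
print(composed.encode("utf-8"))        # b'\xc3\xa9'
print(decomposed.encode("utf-8"))      # b'e\xcc\x81'

# Normalizing both to the same form (NFC composes, NFD decomposes)
# makes them compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed)                 # True
print(nfd == decomposed)               # True
```

Which form to normalize to matters less than applying the same form on both sides before comparing.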

Russel Winder | 2 May 2009 08:49

Re: Re: Broken unicode handling in unison 2.27.57

Martin,

On Fri, 2009-05-01 at 16:54 +0200, Martin von Gagern wrote: 
> Russel Winder wrote:
> > Or people could just use Bazaar to create a working branch of the
> > Subversion trunk, no need for extra Subversion logins then.  Launchpad
> > can host Bazaar branches.
> 
> I'd be happy with a bazaar branch hosted on launchpad. But I won't do
> this alone, so that's only a solution if other people involved are happy
> with that as well.

I definitely have Unicode codepoints (UTF-8 encoded) in my file names
and I synchronize between Ubuntu Intrepid, Ubuntu Jaunty, RHEL 5.x and
Mac OS X Leopard.  Synchronization between the Ubuntu machines is never
problematic as the UTF-8 encoding seems to take care of itself.  Ditto
the synchronization from Ubuntu to RHEL.  I am, though, having great
difficulty with synchronizing between Ubuntu and Mac OS X.  Not only is
the Unicode an issue, Mac OS X is not even case sensitive :-(

So assuming Unison is buildable on my Ubuntu and Mac OS X machines, I
can offer to do some trials and testing.  However, I have never actually
programmed in OCaml and I am not at all familiar with the Unison code,
so I can't volunteer to do any actual coding.

-- 
Russel.
============================================================
Dr Russel Winder                 Partner


Russel Winder | 2 May 2009 09:42

Re: Re: Broken unicode handling in unison 2.27.57

Martin,

On Fri, 2009-05-01 at 19:56 +0200, Martin von Gagern wrote:
[ . . . ]
> In its current form, a unicode codepoint (roughly a character, but the
> term "codepoint" is more precise) is an integer of at most 20 or so
> bits. There are several possible encodings of Unicode using units of
> less bits, most notably UTF-8 using units of 8 bits (and up to six bytes
> for a codepoint) and UTF-16 using units of 16 bits (and up to two such
> words for a codepoint). Any of these encodings can express an arbitrary
> sequence of codepoints, so they are equally expressive.

Is UTF-8 up to 4 or 6 bytes per codepoint?

And then there is the UTF-16 & BOM / UTF-16LE / UTF-16BE issue!

Of course there is also UTF-32/UCS-4 but I don't think anyone actually
uses that for anything?

[ . . . on a brief read I think the 9 points are basically not
controversial . . . ]

> I like Russel's suggestion about Launchpad. Before I start pushing my own
> branches there, I think it would be a good idea to register Unison as is
> with launchpad. Maybe it would be better if some core developer would do
> this, so that it doesn't look like it's my project.

Given that Unison is in Ubuntu there will already be some features
around it on Launchpad.  I would recommend creating a Unison Developers
group and then a Unison Project owned by Unison Developers.  I just

Gergely Imreh | 2 May 2009 14:33

Re: Re: Broken unicode handling in unison 2.27.57

Hi,

  I'm a relatively new user of Unison; I've been using it for a few
months. Being in Taiwan, I have loads of tricky filenames that broke
Unison before. I had to modify the source and take out some of the
file-name checking to get around it. So I, for one, would be really
glad to have proper Unicode support and not have to use such "hacks".
I will probably be able to provide plenty of test cases as well, since
Unison choked plenty of times before. ;)

  Silly question: how do other programs get around this filename issue?
As far as I know, git, for example, treats filenames as just more binary
data. Of course, this probably won't take care of the case-insensitive
case that simply.
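
A rough sketch of that difference, in Python. The helper name `comparison_key` and the sample file names are made up for illustration; this is not how Unison or git actually compare names, only one assumed way to build a key that ignores both normalization form and case:

```python
import unicodedata

def comparison_key(name: str) -> str:
    """Hypothetical helper: map a file name to a key that ignores
    normalization form and case. Not taken from Unison or git."""
    folded = unicodedata.normalize("NFC", name).casefold()
    # Case folding is not guaranteed to preserve the normalization form,
    # so normalize once more afterwards.
    return unicodedata.normalize("NFC", folded)

a = "R\u00e9sum\u00e9.txt"      # precomposed accents
b = "re\u0301sume\u0301.TXT"    # combining accents, different case

print(a.encode("utf-8") == b.encode("utf-8"))    # False: distinct as raw bytes
print(comparison_key(a) == comparison_key(b))    # True: same name under the key
```

Treating names as raw bytes keeps the two apart, which is fine between case-sensitive systems that never re-encode names, but not when one side folds case or changes the normalization form.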

  Also, if people are thinking of moving to a distributed development
model, is there any reason not to phase out the svn repo entirely? Of
course it would be quite tricky in the beginning, but I'd say it would
save a lot of headaches in the long run...  I am more of a git fan,
with github getting better (with issue tracking and such) and more
public hosting sites, but I can see the advantage of Launchpad with
its more comprehensive offering as well....
  Certainly, if there's a more accessible repo (than svn), I'm sure
more people would feel inclined to contribute.

  Cheers,
       Greg


Martin von Gagern | 2 May 2009 14:58

Re: Re: Broken unicode handling in unison 2.27.57

Russel Winder wrote:
> Is UTF-8 up to 4 or 6 bytes per codepoint?

4 bytes per codepoint, you are right. I mixed that up with CESU-8, which
is basically UTF-16 units encoded again in UTF-8, resulting in up to
2x3=6 bytes per codepoint. Java does that in some places.
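
To make the 4-versus-6 distinction concrete, here is a small Python sketch. Python has no built-in CESU-8 codec, so `cesu8_encode` below is a hand-written, assumed helper that emulates it by encoding each UTF-16 code unit separately:

```python
import struct

def cesu8_encode(text: str) -> bytes:
    """Sketch of CESU-8: encode each UTF-16 code unit as if it were
    a codepoint of its own. Hypothetical helper, for illustration only."""
    utf16 = text.encode("utf-16-be")
    units = struct.unpack(f">{len(utf16) // 2}H", utf16)
    # "surrogatepass" lets the UTF-8 codec emit lone surrogates,
    # which take 3 bytes each.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

for cp in (0x41, 0xE9, 0x20AC, 0x10FFFF):
    ch = chr(cp)
    print(f"U+{cp:04X}: UTF-8 {len(ch.encode('utf-8'))} bytes, "
          f"CESU-8 {len(cesu8_encode(ch))} bytes")
# The last line shows 4 bytes for UTF-8 and 6 bytes for CESU-8.
```

The two encodings agree for every codepoint up to U+FFFF and differ only for those that need a surrogate pair in UTF-16.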

> And then there is the UTF-16 & BOM / UTF-16LE / UTF-16BE issue!

Not much of an issue for us. Windows is the only system using UTF-16
for file names, and it has well defined endianness for it. Those
endianness issues are relevant for file content only.
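
For reference, a short Python sketch of what the endianness and BOM question looks like at the byte level. The sample name is an assumed example:

```python
name = "caf\u00e9"

# Explicit byte order, no BOM.
print(name.encode("utf-16-le"))    # b'c\x00a\x00f\x00\xe9\x00'
print(name.encode("utf-16-be"))    # b'\x00c\x00a\x00f\x00\xe9'

# The plain "utf-16" codec prepends a BOM and uses the machine's byte order.
with_bom = name.encode("utf-16")
print(with_bom[:2] in (b"\xff\xfe", b"\xfe\xff"))    # True

# Decoding with the wrong byte order does not fail for this input;
# it silently yields different characters.
print(name.encode("utf-16-le").decode("utf-16-be") == name)    # False
```

When the byte order is fixed by the platform, as described above for file names, no BOM is needed and none of this ambiguity arises.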

> Of course there is also UTF-32/UCS-4 but I don't think anyone actually
> uses that for anything?

Again, maybe in content, but not for file names.

> [ . . . on a brief read I think the 9 points are basically not
> controversial . . . ]

Glad to hear that. Leaves the question of how to achieve this behaviour.

> Given that Unison is in Ubuntu there will already be some features
> around it on Launchpad.  I would recommend creating a Unison Developers
> group and then a Unison Project owned by Unison Developers.  I just
> noticed you make this suggestion below -- moral of story, read whole of
> email before starting to write reply :-)

Agreed. If no one does so in the near future, I will.

