Re: Gvim for Windows doesn't handle non-BMP characters when interchanging data with Windows OS
Bram Moolenaar <Bram <at> moolenaar.net>
2008-11-02 13:56:20 GMT
Yanwei wrote:
> When exchanging data with Windows, for example through the clipboard, gvim
> converts the text to UCS-2. Unlike UTF-16, UCS-2 cannot encode non-BMP
> characters.
>
> For example, when pasting the non-BMP character U+248BB from the Windows
> clipboard, gvim inserts two separate characters <d852> <dcbb>. This is caused
> by the function ucs2_to_utf8() in src/os_mswin.c, which treats the two halves
> of a surrogate pair as separate Unicode characters and converts them into the
> invalid UTF-8 sequence 0xED 0xA1 0x92 0xED 0xB2 0xBB -- the correct UTF-8
> sequence is 0xF0 0xA4 0xA2 0xBB.
>
> Similarly, when copying the non-BMP character U+248BB to the Windows
> clipboard, the clipboard content becomes U+48BB, because the function
> utf8_to_ucs2() in src/os_mswin.c casts the integer 0x248BB to a short
> integer, which yields 0x48BB.
>
> A patch is attached. It adds surrogate pair handling to the two functions
> mentioned above, so that non-BMP characters are exchanged correctly with the
> Windows clipboard. My test results:
> Pasting non-BMP characters from / copying them to the Windows clipboard
> +----------+--------------------------------+------------------------+
> |          | WindowsXP with GB18030 support | Windows 98             |
> +----------+--------------------------------+------------------------+
> | editing  | before patch works bad         | before patch works bad |
> | UTF-* or | after patch works OK           | after patch works OK   |
> | UCS-4*   |                                |                        |
> | text     |                                |                        |
> +----------+--------------------------------+------------------------+
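
For illustration, the surrogate pair handling described in the report can be
sketched roughly as follows. This is a minimal standalone sketch, not the
attached patch and not the actual code in src/os_mswin.c. The helper names
utf16_decode(), utf16_encode() and utf8_encode() are hypothetical, and the
sketch assumes code points no larger than U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Vim's ucs2_to_utf8()): read one code point from
 * UTF-16 input, combining a valid surrogate pair into a single value.
 * Returns the number of 16-bit units consumed (1 or 2). */
static size_t utf16_decode(const uint16_t *in, size_t len, uint32_t *cp)
{
    uint16_t hi = in[0];

    if (hi >= 0xD800 && hi <= 0xDBFF && len >= 2
            && in[1] >= 0xDC00 && in[1] <= 0xDFFF) {
        *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                         | (uint32_t)(in[1] - 0xDC00));
        return 2;
    }
    *cp = hi;   /* BMP character, or an unpaired surrogate passed through */
    return 1;
}

/* Hypothetical helper (not Vim's utf8_to_ucs2()): write one code point as
 * UTF-16 instead of truncating it to 16 bits.
 * Assumes cp <= 0x10FFFF. Returns the number of units written (1 or 2). */
static size_t utf16_encode(uint32_t cp, uint16_t *out)
{
    if (cp >= 0x10000) {
        cp -= 0x10000;
        out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
        return 2;
    }
    out[0] = (uint16_t)cp;
    return 1;
}

/* Write one code point as UTF-8. Assumes cp <= 0x10FFFF.
 * Returns the number of bytes written (1 to 4). */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

With the example from the report, utf16_decode() on <d852> <dcbb> gives
0x248BB, which utf8_encode() turns into 0xF0 0xA4 0xA2 0xBB, and
utf16_encode(0x248BB) gives back the pair <d852> <dcbb>.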