Unicode byte order mark decoding
Evan Jones <ejones <at> uwaterloo.ca>
2005-04-01 19:36:07 GMT
I recently rediscovered this strange behaviour in Python's Unicode
handling. I *think* it is a bug, but before I go and try to hack
together a patch, I figure I should run it by the experts here on
Python-Dev. If you understand Unicode, please let me know if there are
problems with making these minor changes.
>>> import codecs
>>> codecs.BOM_UTF8.decode( "utf8" )
u'\ufeff'
>>> codecs.BOM_UTF16.decode( "utf16" )
u''
Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder
turns it into a character? The UTF-16 decoder contains logic to
correctly handle the BOM. It even handles byte swapping, if necessary.
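To make the asymmetry concrete, here is a minimal sketch written for Python 3 (the session above is Python 2, so the literals differ). The sample text and variable names are arbitrary choices for illustration, not part of the original report.

```python
import codecs

text = "hi"  # arbitrary sample text (assumption for illustration)

# The same text in both byte orders, each prefixed with the matching BOM.
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# The generic "utf-16" codec reads the BOM, picks the byte order, and drops it.
decoded_little = little.decode("utf-16")
decoded_big = big.decode("utf-16")

# The plain UTF-8 codec keeps the BOM as a leading U+FEFF character.
decoded_utf8 = (codecs.BOM_UTF8 + text.encode("utf-8")).decode("utf-8")
```

Both UTF-16 inputs decode to the same text with no BOM, while the UTF-8 result still starts with U+FEFF.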
I propose that the UTF-8 decoder should have the same logic: it should
remove the BOM if it is detected at the beginning of a string. This
will remove a bit of manual work for Python programs that deal with
UTF-8 files created on Windows, which frequently have the BOM at the
beginning. The Unicode standard is unclear about how a UTF-8 BOM should be
handled (version 4, section 15.9):
> Although there are never any questions of byte order with UTF-8 text,
> this sequence can serve as signature for UTF-8 encoded text where the
> character set is unmarked. [...] Systems that use the byte order mark
> must recognize when an initial U+FEFF signals the byte order. In those
> cases, it is not part of the textual content and should be removed
> before processing, because otherwise it may be mistaken for a
> legitimate zero width no-break space.
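The behaviour proposed above could be sketched roughly as follows, in Python 3. The helper name `decode_utf8_strip_bom` is hypothetical; this only illustrates the idea of dropping an initial BOM and is not the patch itself.

```python
import codecs

def decode_utf8_strip_bom(data):
    """Decode UTF-8 bytes, dropping one leading BOM if present.

    Hypothetical helper: only a BOM at the very start is treated as a
    signature; U+FEFF anywhere else is kept as ordinary text.
    """
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")
```

Called on `codecs.BOM_UTF8 + b"abc"`, this should return the same string as for plain `b"abc"`, which is the manual step that programs reading Windows-created UTF-8 files currently have to do themselves.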