Jay Soffian | 4 Jan 19:45 2007

dumpMetadata utf-8 question

In dealing with some i18n RPMs recently I noticed that dumpMetadata  
can generate XML which is unparseable on the receiving end by yum due  
to feeding libxml2 non-utf8 encoded strings in some cases. The reason  
for this is two-fold:

1) The RPM in question (constructed for QA purposes) was encoded  
using the euc_jp encoding.
2) dumpMetadata does not pass all the strings it extracts from an RPM  
through utf8String. (In particular, the name of the RPM as well as  
the name portion of each of the PRCO entries.)

I've modified dumpMetadata to: a) pass all strings through  
utf8String; and b) allow you to optionally specify the encoding that  
was in use when the RPM was constructed.

This causes problems downstream with yum when attempting to install  
the RPM (because yum compares the utf-8 encoded RPM name to the name  
of the RPM as represented by the raw bytes from the downloaded RPM  
header and these are not equal, thus yum rejects the header), but at  
least the XML is valid allowing other RPMs in the repo to be installed.
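
To make the mismatch concrete, here is a minimal Python 3 sketch. It is not taken from yum or from my patch: the package name and the names_match helper are made up for illustration, and it only assumes that the header carries euc_jp bytes while the metadata carries utf-8 bytes.

```python
# Hypothetical package name containing Japanese characters (illustration only).
name = "\u65e5\u672c\u8a9e-pkg"

# Bytes as they would sit in a header built under the euc_jp encoding.
header_name = name.encode("euc_jp")

# Bytes after coercing the same name to utf-8 for the XML metadata.
metadata_name = name.encode("utf-8")

# A byte-for-byte comparison fails even though both spell the same name.
naive_match = header_name == metadata_name


def names_match(header_bytes, metadata_bytes, encoding):
    """Hypothetical check: decode the header with the declared encoding first."""
    return header_bytes.decode(encoding) == metadata_bytes.decode("utf-8")


print(naive_match)                                        # False
print(names_match(header_name, metadata_name, "euc_jp"))  # True
```

The second comparison only works if the consumer is told which encoding the RPM was built with, which is why the encoding has to travel with the metadata somehow.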

Anyway, I'm wondering whether it was an intentional design decision  
to not pass all the bits handed to libxml2 through utf8String first.

Thanks,

j.
Jay Soffian | 4 Jan 21:39 2007

dumpMetadata utf-8 question


On Jan 4, 2007, at 1:49 PM, Jay Soffian wrote:

> I've modified dumpMetadata to: a) pass all strings through  
> utf8String; and b) allow you to optionally specify the encoding  
> that was in use when the RPM was constructed.

Comments on attached patch please. I've cc'd yum-devel since I plan  
to eventually patch yum to respect the encoding provided (if any) to  
createrepo. Original message (for yum-devel readers) here:

https://lists.dulug.duke.edu/pipermail/rpm-metadata/2007-January/000738.html

Questions:

- Is it insane to attempt to respect a non-utf8 encoding with yum  
since RPMs don't natively specify the encoding in which they were  
constructed?

- Is the proposed extension to the rpm-metadata xml schema acceptable?

Thanks,

j.
Attachment (dumpMetadata.encoding.patch): application/octet-stream, 6289 bytes
_______________________________________________
Rpm-metadata mailing list

seth vidal | 5 Jan 05:05 2007

Re: dumpMetadata utf-8 question

On Thu, 2007-01-04 at 13:45 -0500, Jay Soffian wrote:
> In dealing with some i18n RPMs recently I noticed that dumpMetadata  
> can generate XML which is unparseable on the receiving end by yum due  
> to feeding libxml2 non-utf8 encoded strings in some cases. The reason  
> for this is two-fold:
> 
> 1) The RPM in question (constructed for QA purposes) was encoded  
> using the euc_jp encoding.
> 2) dumpMetadata does not pass all the strings it extracts from an RPM  
> through utf8String. (In particular, the name of the RPM as well as  
> the name portion of each of the PRCO entries.)
> 
> I've modified dumpMetadata to: a) pass all strings through  
> utf8String; and b) allow you to optionally specify the encoding that  
> was in use when the RPM was constructed.
> 
> This causes problems downstream with yum when attempting to install  
> the RPM (because yum compares the utf-8 encoded RPM name to the name  
> of the RPM as represented by the raw bytes from the downloaded RPM  
> header and these are not equal, thus yum rejects the header), but at  
> least the XML is valid allowing other RPMs in the repo to be installed.
> 
> Anyway, I'm wondering whether it was an intentional design decision  
> to not pass all the bits handed to libxml2 through utf8String first.

Mostly I think it was that:
1. We never encountered non-utf8 strings in package names and I
_thought_ that at one point in time rpm used to bitch about them
2. file names used to get upset in package builds when non-utf8 strings
were in %files - I remember running into this error at one point.

Jay Soffian | 5 Jan 06:06 2007

Re: dumpMetadata utf-8 question


On Jan 4, 2007, at 11:06 PM, seth vidal wrote:

> Mostly I think it was that:
> 1. We never encountered non-utf8 strings in package names and I
> _thought_ that at one point in time rpm used to bitch about them
>
> 2. file names used to get upset in package builds when non-utf8  
> strings
> were in %files - I remember running into this error at one point.

At least on my RHEL4 system /bin/rpm appears happy to generate and  
consume files with euc_* encodings.

In any case, it seems that dumpMetadata should either:

1) coerce all strings to utf-8 (per the patch I sent previously), or
2) ensure that the strings which it doesn't coerce are already valid  
utf-8 with something like:

def isUtf8(string):
    # True only if the byte string decodes as utf-8 and re-encodes unchanged.
    try:
        x = unicode(string, 'utf-8')
    except UnicodeError:
        return False
    else:
        return x.encode('utf-8') == string
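
For comparison, option (1) could look roughly like the following Python 3 sketch. It is neither the patch I sent nor dumpMetadata's own utf8String: coerce_to_utf8 and its source_encoding parameter are hypothetical names, and the Latin-1 fallback is an assumption about what to do when nothing else decodes.

```python
def coerce_to_utf8(raw, source_encoding=None):
    """Return utf-8 bytes for `raw`, a byte string read from an RPM header.

    Hypothetical helper: tries the caller-supplied encoding (if any),
    then utf-8, then falls back to Latin-1.
    """
    candidates = [source_encoding] if source_encoding else []
    candidates.append("utf-8")
    for encoding in candidates:
        try:
            return raw.decode(encoding).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            continue
    # Latin-1 maps every byte to a code point, so this cannot fail.
    # The text may be wrong, but the result is always valid utf-8.
    return raw.decode("iso-8859-1").encode("utf-8")


# Hypothetical euc_jp-encoded package name (illustration only).
euc_name = "\u65e5\u672c\u8a9e".encode("euc_jp")
print(coerce_to_utf8(euc_name, "euc_jp") == "\u65e5\u672c\u8a9e".encode("utf-8"))
```

Without a declared encoding the euc_jp bytes fall through to the Latin-1 branch, so the XML stays parseable but the name is mangled.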

The problem with (1) is that yum compares the name of the RPM in the  
downloaded header to the coerced string and if they don't match it  

Christoph Thiel | 10 Jan 16:11 2007

[patch] fix to make _stringToVersion match rpm's parseEVR behavior

Hi there,

while looking into one of our bugs, I found that createrepo's
_stringToVersion didn't match rpm's parseEVR. Please find the patch
attached.

Best,
Christoph
Attachment (createrepo-EVR.patch): text/x-patch, 453 bytes
_______________________________________________
Rpm-metadata mailing list
Rpm-metadata <at> lists.dulug.duke.edu
https://lists.dulug.duke.edu/mailman/listinfo/rpm-metadata
seth vidal | 10 Jan 16:18 2007

Re: [patch] fix to make _stringToVersion match rpm's parseEVR behavior

On Wed, 2007-01-10 at 16:11 +0100, Christoph Thiel wrote:
> Hi there,
> 
> while looking into one of our bugs, I found that createrepo's
> _stringToVersion didn't match rpm's parseEVR. Please find the patch
> attached.
> 

What keeps the epoch from being an alpha?

-sv
Jeff Johnson | 10 Jan 16:19 2007

Re: [patch] fix to make _stringToVersion match rpm's parseEVR behavior


On Jan 10, 2007, at 10:18 AM, seth vidal wrote:

> On Wed, 2007-01-10 at 16:11 +0100, Christoph Thiel wrote:
>> Hi there,
>>
>> while looking into one of our bugs, I found that createrepo's
>> _stringToVersion didn't match rpm's parseEVR. Please find the patch
>> attached.
>>
>
> What keeps the epoch from being an alpha?
>

Padding with leading zeroes for representing numbers.

Consider "2" compared with "10".
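
A small Python 3 illustration of that point, as I read it: compared as plain strings, "2" sorts after "10", so a string-valued epoch would need zero padding to order correctly, whereas an integer epoch does not. The padding width of 4 is arbitrary and only for the example.

```python
epochs = ["2", "10"]

# Lexical comparison looks at the first character, so "10" sorts before "2".
as_strings = sorted(epochs)

# Numeric comparison gives the intended order.
as_integers = sorted(epochs, key=int)

# Zero padding makes lexical order agree with numeric order.
padded = sorted(e.zfill(4) for e in epochs)

print(as_strings)   # ['10', '2']
print(as_integers)  # ['2', '10']
print(padded)       # ['0002', '0010']
```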

73 de Jeff
seth vidal | 10 Jan 16:24 2007

Re: [patch] fix to make _stringToVersion match rpm's parseEVR behavior

On Wed, 2007-01-10 at 10:19 -0500, Jeff Johnson wrote:
> On Jan 10, 2007, at 10:18 AM, seth vidal wrote:
> 
> > On Wed, 2007-01-10 at 16:11 +0100, Christoph Thiel wrote:
> >> Hi there,
> >>
> >> while looking into one of our bugs, I found that createrepo's
> >> _stringToVersion didn't match rpm's parseEVR. Please find the patch
> >> attached.
> >>
> >
> > What keeps the epoch from being an alpha?
> >
> 
> Padding with leading zeroes for representing numbers.
> 
> Consider "2" compared with "10".
> 

hmm. I always thought epochs could be like any other value in the evr
fields. I thought that floats would work but it appears to be integers
only.

in that case the patch looks fine.

-sv
Christoph Thiel | 10 Jan 16:36 2007

Re: [patch] fix to make _stringToVersion match rpm's parseEVR behavior

On Wed, Jan 10, 2007 at 10:24:56AM -0500, seth vidal wrote:
> On Wed, 2007-01-10 at 10:19 -0500, Jeff Johnson wrote:
> > On Jan 10, 2007, at 10:18 AM, seth vidal wrote:
> > 
> > > On Wed, 2007-01-10 at 16:11 +0100, Christoph Thiel wrote:
> > >> Hi there,
> > >>
> > >> while looking into one of our bugs, I found that createrepo's
> > >> _stringToVersion didn't match rpm's parseEVR. Please find the patch
> > >> attached.
> > >>
> > >
> > > What keeps the epoch from being an alpha?
> > >
> > 
> > Padding with leading zeroes for representing numbers.
> > 
> > Consider "2" compared with "10".
> > 
> 
> hmm. I always thought epochs could be like any other value in the evr
> fields. I thought that floats would work but it appears to be integers
> only.
> 
> in that case the patch looks fine.

The actual problem only exists, if Epoch isn't set, but Version contains
something like "10a:1".
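
A rough Python 3 sketch of the rule in question, for readers without the patch at hand. It is not the attached createrepo-EVR.patch and not rpm's source: parse_evr is a hypothetical function modeled on my understanding of parseEVR, namely that an epoch is only recognized when a leading run of digits is immediately followed by ':'.

```python
def parse_evr(evr):
    """Split an EVR string into (epoch, version, release).

    Assumption: the epoch exists only if the leading digits are directly
    followed by ':'. Otherwise the whole string before the last '-' is
    the version, colon included.
    """
    i = 0
    while i < len(evr) and evr[i] in "0123456789":
        i += 1
    if i < len(evr) and evr[i] == ":":
        epoch = evr[:i] or "0"   # ":1.0" is treated as epoch 0
        rest = evr[i + 1:]
    else:
        epoch = None             # no epoch given
        rest = evr
    version, dash, release = rest.rpartition("-")
    if not dash:
        version, release = rest, None
    return epoch, version, release


print(parse_evr("2:1.0-3"))  # ('2', '1.0', '3')
print(parse_evr("10a:1"))    # (None, '10a:1', None)
```

A parser that simply splits on the first ':' would instead report epoch "10a" and version "1" for the second input.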

Best,

Jeff Johnson | 10 Jan 16:38 2007

Re: [patch] fix to make _stringToVersion match rpm's parseEVR behavior


On Jan 10, 2007, at 10:36 AM, Christoph Thiel wrote:

>
> The actual problem only exists, if Epoch isn't set, but Version  
> contains
> something like "10a:1".
>

Ick.

Do you actually have a package with a version containing a ':'?

If so, I will create a new exhibit in my Packaging Hall of Shame ...

73 de Jeff
