Re: [openssl.org #1447] [bug] 0.9.8d: rc4 cpuid test broken on dual core cpus
dean gaudet <dean <at> arctic.org>
2007-01-05 07:21:21 GMT
On Fri, 5 Jan 2007, Andy Polyakov wrote:
> > there is a cpuid test in rc4_skey.c which tests the hyperthreading cpuid bit
> > to distinguish between two implementations of rc4... unfortunately this
> > fails to properly distinguish the cpus. all dual core cpus (intel or amd)
> > report HT support even if they don't use symmetric-multithreading like some
> > p4 do.
>
> So HT flag is no longer HyperThreading, but something else... Will look into
> it... There is another place HTT flag is checked and it's AES...
yeah HT flag now basically means "multi-threading or multi-core
package"... because when amd/intel went dual core they didn't want silly
license managers to charge for every core.
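to make the bit in question concrete, here is a minimal sketch in C (not the actual rc4_skey.c code). it assumes GCC or Clang on x86 for the `<cpuid.h>` helper `__get_cpuid`; the function names `htt_flag_set`, `logical_per_package` and `read_cpuid_leaf1` are made up for illustration. the flag is CPUID leaf 1, EDX bit 28, and on its own it only says the package may expose more than one logical processor, not that the core is SMT:

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>              /* GCC/Clang helper header (assumption) */
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

/* CPUID leaf 1, EDX bit 28: the "HTT" feature flag. */
#define CPUID1_EDX_HTT (1u << 28)

/* Hypothetical helper: is the HTT bit set in a leaf-1 EDX value? */
static int htt_flag_set(uint32_t edx)
{
    return (edx & CPUID1_EDX_HTT) != 0;
}

/* Hypothetical helper: CPUID leaf 1, EBX bits 23..16 give the number of
 * addressable logical processors per package. A dual-core part without
 * SMT can report 2 here with HTT set, which is why the bit alone cannot
 * identify a hyperthreaded P4. */
static unsigned logical_per_package(uint32_t ebx)
{
    return (ebx >> 16) & 0xffu;
}

/* Hypothetical helper: read leaf 1. Returns 1 on success, 0 when CPUID
 * is not available (non-x86 build or unsupported leaf). */
static int read_cpuid_leaf1(uint32_t *ebx, uint32_t *edx)
{
#if HAVE_X86_CPUID
    unsigned int a, b, c, d;
    if (!__get_cpuid(1, &a, &b, &c, &d))
        return 0;
    *ebx = b;
    *edx = d;
    return 1;
#else
    (void)ebx;
    (void)edx;
    return 0;
#endif
}
```

a caller would run `read_cpuid_leaf1` once and treat `htt_flag_set(edx)` as "multi-threaded or multi-core package", nothing more specific.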
hmm i don't see any OPENSSL_ia32cap_P test for AES in 0.9.8d ... maybe
i should be looking at the cvs? i'm seeing 17.5 cycles per byte for
aes-128-cbc on core2, which is pretty good.
> > it seems somewhat fortunate that core2 CPUs track the p4 behaviour
> > w.r.t. these two rc4 implementations. here are the core2 results with the
> > stock code / HT test:
> >
> > type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> > rc4            166799.58k   180552.87k   182437.93k   183381.67k   183206.87k
> >
> > and with cpuid test disabled:
> >
> > type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> > rc4            123361.30k   128102.17k   129876.57k   128787.22k   129419.95k
> >
> > for the record, core2 64-bit code is seriously underperforming the 32-bit
> > code... here are the 32-bit results (with cpuid test enabled):
> >
> > type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> > rc4            254164.64k   279901.10k   279364.38k   283617.62k   276690.26k
>
> ... The key feature in 32-bit code with cpuid test is that corresponding loop
> is not unrolled. Can you test the following in a *64-bit* build on Core2
> hardware? Open rc4-x86_64.pl in a text editor and make the jump to .Lcloop1
> at line 154 unconditional, i.e. replace jz with jmp. Then run make, benchmark
> and report back.
> A.
small improvement...
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4            174197.47k   182564.34k   184536.23k   185292.63k   186258.77k
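to put a number on "small", here is a rough sketch in C that compares the 8192-byte figures quoted in this thread (openssl speed reports them in thousands of bytes per second). the constant names and the `speedup` helper are made up for illustration; the values are copied from the tables above:

```c
/* 8192-byte rc4 throughput from this thread, in 1000s of bytes/second. */
static const double RC4_64BIT_STOCK = 183206.87;  /* 64-bit, stock code      */
static const double RC4_64BIT_JMP   = 186258.77;  /* 64-bit, jz -> jmp patch */
static const double RC4_32BIT       = 276690.26;  /* 32-bit, cpuid test on   */

/* Hypothetical helper: ratio of two throughput figures in the same units.
 * A value above 1.0 means the first figure is faster. */
static double speedup(double new_rate, double old_rate)
{
    return new_rate / old_rate;
}
```

`speedup(RC4_64BIT_JMP, RC4_64BIT_STOCK)` comes out between 1.01 and 1.02, and `speedup(RC4_64BIT_JMP, RC4_32BIT)` between 0.67 and 0.68, so the patched 64-bit loop is still roughly a third slower than the 32-bit build.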
i think this hints that the problem with the unrolled code is the manual
load/store alias avoidance -- there's fancy new hardware in core2 for
dealing with this (obviously it's not fancy enough :)... and it seems
the 32-bit code pushes the alias problem onto the hardware.
oh and i tried using cmove with no luck either.
bizarre... i think i copied the 32-bit code into the 64-bit Lcloop1 case
and it's still not performing like it does in 32-bit... maybe i screwed up
though.
-dean
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-dev <at> openssl.org
Automated List Manager majordomo <at> openssl.org