18 May 03:10
[PATCH] sse2 version of greedy2frame deinterlacer
Roland Scheidegger <rscheidegger_lists <at> hispeed.ch>
2012-05-18 01:10:38 GMT
Hi, here's the updated greedy2frame sse2 patch. In contrast to the first try, I've now made it fall back to the mmxext path when the alignment restrictions aren't met (I've also renamed that old path from sse, as it's really mmxext, not sse). Thanks to Petri it also initializes the xmm variables in a much less crappy way... I couldn't quite reuse the old template. I'm sure there's some clever way to reuse the same assembly for mmx and sse2 (the ffmpeg guys do that, for instance), but that's the way it is now.

btw the performance results are interesting. On a c2d-class chip the performance increase was little more than statistical noise (maybe 2% overall, including h264 decode), even though this chip really can execute the arithmetic twice as fast thanks to its 128bit simd units (it did indeed clearly execute fewer instructions). On an Athlon64 X2 the performance increase was much more substantial (25% or so for the deinterlacer alone), even though this chip doesn't really gain anything from using sse2 over mmx (due to its 64bit simd units) - all the difference came from incorporating the separate line copy loop into the inline assembly and from the prefetch instructions (slightly more improvement from the former than the latter).

Of course that could be done for the mmx path too, but I'm not sure how some older cpus would react to that - apparently on a c2d it makes no difference anyway (I'm quite sure the prefetch instructions are just a total waste there; for these simple patterns the hw prefetcher is more than adequate on that chip). So on a c2d things are limited by memory bandwidth, it seems. It might be more of an improvement on chips which have both 128bit simd units AND a fast memory interface (like Nehalem, Sandy Bridge, possibly Barcelona, Bulldozer).
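For readers who haven't seen the pattern, the fallback on alignment described above might look roughly like the sketch below. It is not taken from the patch: the names (`pick_path`, `is_aligned16`, `PATH_SSE2`, `PATH_MMXEXT`) are made up for illustration, and the 16-byte requirement on pointers and stride is an assumption based on how aligned sse2 loads/stores normally behave, not on the patch's actual checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only (not from the patch). */
enum deint_path { PATH_MMXEXT, PATH_SSE2 };

/* True if p sits on a 16-byte boundary, as aligned 128bit
 * loads/stores would require (assumed restriction). */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Choose the sse2 path only if the cpu supports it and every buffer
 * and the line stride keep each line 16-byte aligned; otherwise fall
 * back to the mmxext path, which has no such restriction. */
static enum deint_path pick_path(const uint8_t *dst,
                                 const uint8_t *src_cur,
                                 const uint8_t *src_prev,
                                 size_t stride,
                                 int cpu_has_sse2)
{
    assert(dst && src_cur && src_prev);

    if (cpu_has_sse2 &&
        is_aligned16(dst) &&
        is_aligned16(src_cur) &&
        is_aligned16(src_prev) &&
        (stride % 16) == 0)
        return PATH_SSE2;

    return PATH_MMXEXT;
}
```

Checking the stride as well as the base pointers matters because an aligned first line with an odd stride would put the following lines off a 16-byte boundary.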