Experiment to speed up proj.4 by 2 or more
Even Rouault <even.rouault <at> spatialys.com>
2015-06-23 15:29:49 GMT
I've experimented with Intel SIMD intrinsics
(https://en.wikipedia.org/wiki/SIMD), and I think they could be beneficial for
proj when it is called to transform several coordinates at a time.
I've used the SSE2 instruction set (128-bit registers, so 2 doubles at a
time), and I managed to speed up the inverse Transverse Mercator ellipsoidal
transformation (i.e. from projected to geodetic) by a factor of ~2 (excluding
potential datum transformations).
One key to performance was finding an efficient way of computing the usual
transcendental functions (i.e. sin, cos, tan and their inverses, exp, ln,
etc.) on SIMD registers, since they are not included in the instruction
set. Otherwise you have to extract each component of the SIMD register,
evaluate it with the x87 coprocessor, and reassemble the SIMD register from
the computed components, which kills all the other performance gains. The
SLEEF library (http://freecode.com/projects/sleef) has such routines, is in
the public domain and works rather well with gcc/clang. It has some rough
edges with MSVC, but nothing that cannot be overcome.
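To make the cost of the fallback concrete, here is a minimal sketch of the
"extract, evaluate, reassemble" path described above, assuming an x86 compiler
that provides <emmintrin.h>. The function name sin_lane_by_lane is
hypothetical and is not taken from the proof of concept or from SLEEF.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cmath>

// Slow path: SSE2 has no sine instruction, so each lane is spilled to memory,
// evaluated with the scalar math library, and the register is rebuilt.
// (Hypothetical helper, for illustration only.)
inline __m128d sin_lane_by_lane(__m128d v) {
    alignas(16) double lanes[2];
    _mm_store_pd(lanes, v);          // extract both doubles
    lanes[0] = std::sin(lanes[0]);   // scalar call for lane 0
    lanes[1] = std::sin(lanes[1]);   // scalar call for lane 1
    return _mm_load_pd(lanes);       // reassemble the SIMD register
}
```

Every call pays for a store, two scalar function calls and a reload, which is
the overhead a vectorized math routine is meant to avoid.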
I've encapsulated the use of SSE2 intrinsics in a C++ class that overloads
the arithmetic operators, so the resulting code looks very similar to the
original C code. That is great for readability (although the original C code
isn't always very readable) and for confidence that the rewrite doesn't
introduce errors. Conditional branches are not so great for SIMD performance,
but there are tricks to rewrite some of them with a ternary-like operator.
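As an illustration of that idea, here is a minimal sketch of such a wrapper,
assuming SSE2 and <emmintrin.h>. The names Vec2d, Mask2d, blend and scaled_abs
are hypothetical and are not the ones used in the proof of concept; blend
plays the role of the ternary-like operator.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical wrapper: two doubles held in one 128-bit register.
struct Vec2d {
    __m128d v;
    Vec2d(__m128d x) : v(x) {}
    explicit Vec2d(double x) : v(_mm_set1_pd(x)) {}         // broadcast a scalar
    Vec2d(double lo, double hi) : v(_mm_set_pd(hi, lo)) {}  // lane 0 = lo, lane 1 = hi
    void store(double out[2]) const { _mm_storeu_pd(out, v); }
};

// Result of a lane-wise comparison: all bits set in a lane where it holds.
struct Mask2d { __m128d m; };

inline Vec2d operator+(Vec2d a, Vec2d b) { return _mm_add_pd(a.v, b.v); }
inline Vec2d operator-(Vec2d a, Vec2d b) { return _mm_sub_pd(a.v, b.v); }
inline Vec2d operator*(Vec2d a, Vec2d b) { return _mm_mul_pd(a.v, b.v); }
inline Mask2d operator<(Vec2d a, Vec2d b) { return Mask2d{_mm_cmplt_pd(a.v, b.v)}; }

// Ternary-like operator: per lane, cond ? a : b, without branching.
inline Vec2d blend(Mask2d cond, Vec2d a, Vec2d b) {
    return _mm_or_pd(_mm_and_pd(cond.m, a.v), _mm_andnot_pd(cond.m, b.v));
}

// Scalar original would read: return (x < 0 ? -x : x) * k;
inline Vec2d scaled_abs(Vec2d x, double k) {
    const Vec2d zero(0.0), kk(k);
    return blend(x < zero, (zero - x) * kk, x * kk);
}
```

Note that both sides of the blend are always computed, so this rewrite only
pays off when the two branches are cheap and free of side effects.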
SLEEF also supports the AVX and AVX2+FMA instruction sets (256-bit registers),
which could lead to a further ~2x gain over SSE2.
So I was wondering whether there is:
1) interest from the project in pursuing this approach (which involves
introducing C++ in the code base as an implementation detail, the interface
being unchanged). One could imagine having the same source files compiled
several times with different register sizes, with runtime selection of the
appropriate variant (note: SSE2 is guaranteed to be available on all x86_64
compatible processors; AVX/AVX2 are only available on more recent CPUs).
2) ... and sponsors interested in making that happen.
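The runtime selection mentioned in 1) could look roughly like the sketch
below. It assumes gcc or clang on x86, where the __builtin_cpu_supports
builtin is available; everything else (SimdLevel, scale_coords, select_kernel)
is hypothetical. A template parameter stands in for "the same source compiled
several times", and the kernel body is a placeholder, not a projection.

```cpp
#include <cstddef>

enum class SimdLevel { SSE2, AVX, AVX2 };

// Placeholder kernel: Lanes is the number of doubles per register
// (2 for SSE2, 4 for AVX/AVX2). The inner loop stands in for one SIMD op.
template <int Lanes>
void scale_coords(const double* in, double* out, std::size_t n, double k) {
    std::size_t i = 0;
    for (; i + Lanes <= n; i += Lanes)
        for (int l = 0; l < Lanes; ++l) out[i + l] = in[i + l] * k;
    for (; i < n; ++i) out[i] = in[i] * k;  // scalar tail
}

using KernelFn = void (*)(const double*, double*, std::size_t, double);

// Query the CPU once at startup; fall back to the SSE2 baseline otherwise.
inline SimdLevel detect_simd_level() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return SimdLevel::AVX2;
    if (__builtin_cpu_supports("avx"))  return SimdLevel::AVX;
#endif
    return SimdLevel::SSE2;
}

// Map the detected level to the matching compiled variant.
inline KernelFn select_kernel(SimdLevel level) {
    switch (level) {
        case SimdLevel::AVX2:
        case SimdLevel::AVX:  return &scale_coords<4>;
        case SimdLevel::SSE2: break;
    }
    return &scale_coords<2>;
}
```

A caller would do the selection once, e.g.
KernelFn f = select_kernel(detect_simd_level()), and reuse f for every batch,
so the public interface does not change.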
Finally, the proof of concept:
* regular code (runs in ~30s on Core i5 750 @ 2.67GHz):
* SSE2 code (~14s):
Spatialys - Geospatial professional services
Proj mailing list
Proj <at> lists.maptools.org