Dan Gohman <sunfish <at> mozilla.com>
2014-09-28 13:44:31 GMT
I agree with much of your assessment of the proposed SIMD.js API.
However, I don't believe its unsuitability for some problems
invalidates it for solving other very important problems, for which it
is well suited. Performance portability is actually one of SIMD.js's
biggest strengths. It's not the kind of performance portability that
aims for a consistent percentage of peak on every machine (which, as you
note, an explicit 128-bit SIMD API of course won't achieve); it's the
kind of performance portability that achieves predictable performance
and minimizes surprises across machines (though yes, there are some
unavoidable ones, but overall the picture is quite good).
On 09/26/2014 03:16 PM, Nadav Rotem wrote:
> So far, I’ve explained why I believe SIMD.js will not be
> performance-portable and why it will not utilize modern instruction
> sets, but I have not made a suggestion on how to use vector
> instruction scheduling and register allocation, is a code-generation
> problem. In order to solve these problems, it is necessary for the
> compiler to have intimate knowledge of the architecture. Forcing the
> compiler to use a specific instruction or a specific data-type is the
> wrong answer. We can learn a lesson from the design of compilers for
> data-parallel languages. GPU programs (shaders and compute languages,
> such as OpenCL and GLSL) are written using vector instructions because
> the domain of the problem requires vectors (colors and coordinates).
> One of the first things that data-parallel compilers do is to break
> vector instructions into scalars (this process is called
> scalarization). After getting rid of the vectors that resulted from
> the problem domain, the compiler may begin to analyze the program,
> calculate profitability, and make use of the available instruction set.
> I believe that it is the responsibility of JIT compilers to use vector
> instructions. In the implementation of WebKit’s FTL JIT compiler,
> we took one step in the direction of using vector instructions. LLVM
> already vectorizes some code sequences during instruction selection,
> and we started investigating the use of LLVM’s Loop and SLP
> vectorizers. We found that despite nice performance gains on a number
> of workloads, we experienced some performance regressions on Intel’s
> Sandybridge processors, which is currently a very popular desktop
> speculation). Unfortunately, branches on Sandybridge execute on Port5,
> which is also where many vector instructions are executed. So,
> pressure on Port5 prevented performance gains. The LLVM vectorizer
> currently does not model execution port pressure and we had to disable
> vectorization in FTL. In the future, we intend to enable more
> vectorization features in FTL.
This is an example of a weakness of depending on automatic vectorization
alone. High-level language features create complications which can lead
to surprising performance problems. Compiler transformations to target
specialized hardware features often have widely varying applicability.
Expensive analyses can sometimes enable more and better vectorization,
but when a compiler has to do an expensive complex analysis in order to
optimize, it's unlikely that a programmer can count on other compilers
doing the exact same analysis and optimizing in all the same cases. This
is a problem we already face in many areas of compilers, but it's more
pronounced with vectorization than with many other optimizations.
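
To make the analysis burden concrete, here is a minimal sketch of one
such case. The function and variable names are my own, for illustration
only, and the loop is ordinary scalar JavaScript. The loop can only be
vectorized if the compiler can show that the output does not overlap the
inputs, and with typed-array views on a shared buffer that is not always
true:

```javascript
// Illustrative only: a scalar loop that an automatic vectorizer could
// handle, provided it can show that dst does not overlap a or b.
function addArrays(dst, a, b) {
  for (let i = 0; i < dst.length; i++) {
    dst[i] = a[i] + b[i];
  }
}

// Non-overlapping case: every iteration is independent.
const a = Float32Array.of(1, 2, 3, 4);
const b = Float32Array.of(10, 10, 10, 10);
const sum = new Float32Array(4);
addArrays(sum, a, b);

// Overlapping case: shifted is a view one element into the same buffer
// as src, so each iteration reads the value the previous one wrote.
const buf = new Float32Array([1, 0, 0, 0, 0]);
const src = buf.subarray(0, 4);
const shifted = buf.subarray(1, 5);
const ones = Float32Array.of(1, 1, 1, 1);
addArrays(shifted, src, ones);
// buf is now [1, 2, 3, 4, 5]; loading all four src lanes up front
// would have produced [1, 2, 1, 1, 1] instead.
```

Whether a given JIT proves the first case safe, guards it with a runtime
check, or gives up is exactly the kind of thing a programmer can't
easily predict.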
In contrast, the proposed SIMD.js has the property that code using it
will not depend on expensive compiler analysis in the JIT, and is much
more likely to deliver predictable performance in practice between
different JIT implementations and across a very practical variety of
> To summarize, SIMD.js will not provide a portable performance solution
> because vector instruction sets are sparse and vary between
> architectures and generations. Emscripten should not generate vector
> instructions because it can’t model the target machine. SIMD.js will
> not make use of modern SIMD features such as predication or
> scatter/gather. Vectorization is a compiler code generation problem
> that should be solved by JIT compilers, and not by the language
> itself. JIT compilers should continue to evolve and to start
> vectorizing code like modern compilers.
As I mentioned above, performance portability is actually one of
SIMD.js's core strengths.
I have found it useful to think of the API proposed in SIMD.js as a
"short vector" API. It hits a sweet spot, being a convenient size for
many XYZW and RGB/RGBA and similar algorithms, being implementable on a
wide variety of very relevant hardware architectures, being long enough
to deliver worthwhile speedups for many tasks, and being short enough to
still be convenient to manipulate.
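
As a rough illustration of the "short vector" style, here is a small
sketch of an RGBA blend. Note that f32x4, add, mul, splat and blend are
hypothetical helpers written in plain JavaScript for this example; they
are not the SIMD.js API itself, only a stand-in with the same four-lane
shape:

```javascript
// Hypothetical stand-in for a 4-lane float vector; not the SIMD.js API.
// Each value holds exactly four 32-bit float lanes (x, y, z, w).
function f32x4(x, y, z, w) {
  return Float32Array.of(x, y, z, w);
}

// Lane-wise operations. A JIT with SIMD support could map each of these
// to a single 128-bit instruction; here they are plain scalar code.
function add(a, b) {
  return a.map((v, i) => v + b[i]);
}
function mul(a, b) {
  return a.map((v, i) => v * b[i]);
}
function splat(s) {
  return f32x4(s, s, s, s);
}

// Linear blend of two RGBA colors: src * t + dst * (1 - t).
function blend(src, dst, t) {
  return add(mul(src, splat(t)), mul(dst, splat(1 - t)));
}

const red = f32x4(1, 0, 0, 1);
const blue = f32x4(0, 0, 1, 1);
const mixed = blend(red, blue, 0.5); // lanes: 0.5, 0, 0.5, 1
```

The point is that the whole color is one value, and each operation in
blend corresponds to one obvious vector operation, with no loop for the
compiler to analyze.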
I agree that the "short vector" model doesn't address all use cases, so
I believe a "long vector" approach would be very desirable as well.
Such an approach could be based on automatic loop vectorization, a SPMD
programming model, or something else. I look forward to discussing ideas
for this. Such approaches have the potential to be much more scalable
and adaptable, and can be much better positioned to solve those problems
that the presently proposed SIMD.js API doesn't attempt to solve. I
believe there is room for both approaches to coexist, and to serve
distinct sets of needs.
In fact, a good example of short and long vector models coexisting is in
the popular GPU programming models that you mentioned, where short
vectors represent things in the problem domains like colors and
coordinates, and are then broken down by the compiler to participate in
the long vectors, as you described. It's very plausible that the
proposed SIMD.js could be adapted to combine with a future long-vector
approach in the same way.
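
To sketch what "broken down to participate in the long vectors" could
look like, here is a small example in plain JavaScript. The function
names and data layout are assumptions of mine for illustration, not part
of any proposal. It splits an array of four-component points into one
array per component, after which the work is a set of simple loops whose
iterations could be grouped into vectors of whatever width the hardware
offers:

```javascript
// Illustrative only: N short vectors (x, y, z, w) stored back to back
// are split into one long array per component.
function toComponentArrays(points) {
  const n = points.length / 4;
  const out = {
    x: new Float32Array(n),
    y: new Float32Array(n),
    z: new Float32Array(n),
    w: new Float32Array(n),
  };
  for (let i = 0; i < n; i++) {
    out.x[i] = points[4 * i];
    out.y[i] = points[4 * i + 1];
    out.z[i] = points[4 * i + 2];
    out.w[i] = points[4 * i + 3];
  }
  return out;
}

// After the split, a per-point operation becomes a loop over each
// component array, with independent iterations.
function scaleAll(components, s) {
  for (const key of ["x", "y", "z", "w"]) {
    const lane = components[key];
    for (let i = 0; i < lane.length; i++) {
      lane[i] *= s;
    }
  }
  return components;
}

const pts = Float32Array.of(1, 2, 3, 1, 4, 5, 6, 1);
const comps = scaleAll(toComponentArrays(pts), 2);
// comps.x is [2, 8], comps.y is [4, 10], comps.z is [6, 12], comps.w is [2, 2]
```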
webkit-dev mailing list
webkit-dev <at> lists.webkit.org