jason marshall | 2 Feb 01:46 2008

Re: jsr166y.forkjoin API comments

I thought there was a relaxation in one of the more recent JDKs that allowed you to say, "just use the hardware FP for this operation".

Extremetech's podcast this week talks with someone from the NVidia Cuda group.  I think this stuff is probably closer than I anticipated.  They can do single precision now, and double precision is a priority for them.  If you're going to bend the ParallelArray API enough to compete with other languages on numerical processing, it has to be 'bent' far enough to support this sort of hardware acceleration; otherwise JNI is the correct choice, not ParallelArray.  If you can't support hardware acceleration, then the implementation should merely be clear and concise, because getting a 4-fold or 10-fold improvement in throughput will mean nothing if JNI offers a 100-fold improvement.


In my JVM utopia, the JVM would be able to detect that far too much auto-boxing or auto-unboxing is going on in a particular call graph, and it would generate specializations of all of the methods, reversed from whatever the person implemented (switching to Integer from int, or vice versa).  The tricky bit would be that the new methods would have to be added without colliding with the namespace that already exists.  I suspect that could be quite a pain for debuggers, profilers, and stack traces.

Besides computational overhead, the only things that separate Integer from int or Float from float are:

1) Nullability (this one is tough, since Java can't easily determine that a reference can never be null)
2) Identity
3) Convenience methods

If you could tell that none of the three occur (and in numerical code, none of them had better be occurring), and that the computation would benefit from being transformed to use primitives, then ParallelArray wouldn't need specialization at the API level, because the VM could make the same optimizations.
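
A minimal sketch of those three differences in plain Java (the class name and
values are just for illustration):

    class BoxedVsPrimitive {
        public static void main(String[] args) {
            // 1) Nullability: a boxed Integer may be null; an int cannot be.
            Integer maybe = null;
            int x = 42;                       // no null state to check for
            System.out.println(maybe == null ? "absent" : maybe.toString());

            // 2) Identity: boxed values have object identity distinct from equality.
            Integer a = new Integer(1000);
            Integer b = new Integer(1000);
            System.out.println(a.equals(b));  // true  (same value)
            System.out.println(a == b);       // false (different objects)

            // 3) Convenience methods: the wrapper class carries helpers like these.
            System.out.println(Integer.toBinaryString(x));
            System.out.println(a.compareTo(b));
        }
    }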

-Jason

On Jan 26, 2008 2:49 AM, Mark Thornton <mthornton <at> optrak.co.uk> wrote:
jason marshall wrote:
>
> For floats, the prevailing winds suggest you're going to use GPGPU for
> SIMD.  IEEE incompatibilities notwithstanding.
The IEEE incompatibilities may make that a non-starter for Java.
Currently Java can't even use instructions like the fused multiply-accumulate
present on some processors. JSR-84 was withdrawn.

Mark Thornton




--
- Jason
Hanson Char | 3 Feb 07:26 2008

ParallelArray.binarySearch

Curious,

1) Why don't the implementations of the 2
AbstractParallelAnyArray.binarySearch methods simply invoke
the corresponding static methods in Arrays.binarySearch, which
essentially execute the same code, except with the seemingly
inconsistent behavior of returning a different value when the target
element is not found ?

AbstractParallelAnyArray.binarySearch returns -1 if the target element
is not present, whereas Arrays.binarySearch returns (-(insertion
point) - 1).

In other words, why not:

        // Inside AbstractParallelAnyArray
        public int binarySearch(T target) {
            return Arrays.binarySearch(array, origin, fence-1, target);
        }

        public int binarySearch(T target, Comparator<? super T> comparator) {
            return Arrays.binarySearch(array, origin, fence-1, target, comparator);
        }
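
For concreteness, a small made-up illustration of the two "not found"
conventions:

        // Data invented just for illustration.
        long[] data = {1, 3, 7};
        int idx = Arrays.binarySearch(data, 5);
        // idx == -3, i.e. -(insertion point) - 1, where the insertion point is 2.
        // The current AbstractParallelAnyArray.binarySearch would instead return -1,
        // so the caller loses the insertion-point information.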

2) Why do the 2 binarySearch methods need to be redefined in
ParallelArray, when they are already defined in the base class
AbstractParallelAnyArray, and all ParallelArray does is simply invoke
super.binarySearch ?

Regards,
Hanson
Doug Lea | 3 Feb 15:03 2008

Re: jsr166y.forkjoin API comments

jason marshall wrote:
> 
> Extremetech's podcast this week talks with someone from the NVidia Cuda 
> group.  I think this stuff is probably closer than I anticipated.  They 
> can do single precision now, and double precision is a priority for 
> them.  If you're going to bend the ParallelArray API enough to compete 
> with other languages on numerical processing, it has to be 'bent' far 
> enough to support this sort of hardware acceleration, otherwise JNI is 
> the correct choice, not ParallelArray. 

Consider what the API would look like for any such JNI facility.
It would almost surely provide a set of structured traversals,
but a more restricted set than Parallel*Array. This still implies
to me that it is preferable to capture this via Parallel*Array,
with the idea of internally using SIMD/GPUs as support becomes
available.

> 
> In my JVM utopia, the JVM would be able to detect that far too much 
> auto-boxing or auto-unboxing is going on in a particular call graph

You only need such utopian visions to handle cases where
programmers express computations using scalar primitives,
which compilers (like javac) then translate into boxed
mechanics, and then dare the JVM to rediscover the underlying
scalar primitive computation, which even at best it cannot
always do.

Without explicit specializations (like ParallelLongArray),
the current situation in Java is not quite that extreme but
can get pretty close to it. In a language with better integration
of objects and value types, you wouldn't get yourself into this
situation to begin with. It is not easy though. Very few languages
get this right. As mentioned in a previous posting, the only
reasonable way to cope is both to help instigate others to
pursue language improvements and, at the same time, to make
the best API you can for the language you have.

For Parallel*Array, the most important performance issues
surround arrays of boxed vs unboxed primitives, not just
single scalar variables.  Auto-morphing a Long[] to a long[]
is not something I expect to happen anytime soon. Although
I do still encourage you to try working on solutions for
this or related problems that would enable you to program
in a way you like better.
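
A minimal sketch of the array-level issue (nothing here depends on
Parallel*Array internals; the method names are just for illustration):

    class SumDemo {
        static long sumPrimitive(long[] a) {
            long sum = 0;
            for (int i = 0; i < a.length; i++)
                sum += a[i];     // plain scalar arithmetic over a flat array
            return sum;
        }

        static long sumBoxed(Long[] a) {
            long sum = 0;
            for (int i = 0; i < a.length; i++)
                sum += a[i];     // auto-unboxing: a reference chase (and a possible
                                 // NullPointerException) per element
            return sum;
        }
    }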

-Doug
Doug Lea | 3 Feb 15:13 2008

Re: ParallelArray.binarySearch

Hanson Char wrote:
> Curious,
> 
> 1) Why the implementation of the 2
> AbstractParallelAnyArray.binarySearch methods don't just simply invoke
> the corresponding static methods in Arrays.binarySearch, 

This is a transient fact. In some incarnations, the internal
representation hasn't been amenable to this, so it is easier
for now to keep a local version until internals start freezing
(which probably won't be for a while).

> 2) Why the 2 binarySearch methods need to be redefined in
> ParallelArray, when they are already defined in the base class
> AbstractParallelAnyArray, and all ParallelArray does is simply invoke
> the super.binarySearch ?
> 

This meshes better with common java.util JDK conventions,
which is in turn a minor usability issue. For the top-level
ParallelArray class, you'd like the javadoc to include full
descriptions of all available methods without people needing
to know/care that some of them are inherited from some of the
boring With* classes (which are very likely to change anyway).

To add such javadoc, you need to attach it to a method, so
here, it is just attached to a call-super method.
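
A rough sketch of the pattern (the javadoc text here is only illustrative):

    /**
     * Returns the index of some element equal to the target, or -1 if
     * not present.  (Full description repeated here so that it shows up
     * in ParallelArray's own javadoc.)
     */
    public int binarySearch(T target) {
        return super.binarySearch(target); // call-super; the javadoc is the point
    }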

-Doug
Hanson Char | 3 Feb 19:17 2008

Re: ParallelArray.binarySearch

> This is a transient fact.

How about the return value being different/inconsistent with that of
Arrays.binarySearch ?  Is that also transient ?

> For the top-level ParallelArray class, you'd like the javadoc to include full
> descriptions of all available methods without people needing
> to know/care that some of them are inherited from some of the

I see.  AFAIK, the javadoc generated for the subclass would
automatically/directly show the full text inherited from the base
class, if the javadoc is physically defined at the base class and is
left empty at the respective overriding method of the subclass.  One
advantage is that all overriding subclasses would get the default (yet
direct) javadoc for free.

Another bit of trivia - about 2 years ago, I proposed a JAXB fluent API using
the "with*" convention for method chaining:

  https://jaxb2-commons.dev.java.net/fluent-api

Now that I see such convention being extensively used in
ParallelArray, I can't help thinking there is a causal relationship
:)
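
For anyone unfamiliar with the convention, a tiny made-up example of with*
chaining:

    class Order {
        private String item;
        private int quantity;

        Order withItem(String item) { this.item = item; return this; }
        Order withQuantity(int q)   { this.quantity = q; return this; }

        public static void main(String[] args) {
            // Each with* call returns the receiver, so the calls chain fluently.
            Order o = new Order().withItem("widgets").withQuantity(3);
        }
    }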

Regards,
Hanson

Hanson Char | 4 Feb 07:01 2008

Matrix multiply with parallelized inner product

Hi Tim,

In the wiki example "Matrix multiply with parallelized inner product"

  http://artisans-serverintellect-com.si-eioswww6.com/default.asp?W42

"It is much, much slower than the version that just parallelizes the
outer loop."

Did you know this as a fact prior to benchmarking ?  Does this mean
too much parallelism via PA would result in slower performance ?  If
so, is there any guideline/recipe for how far one should go in using
PA without causing such a slowdown (besides trial-and-error) ?

Regards,
Hanson
Doug Lea | 4 Feb 13:30 2008

Re: ParallelArray.binarySearch

Hanson Char wrote:
>> This is a transient fact.
> 
> How about the return value being different/inconsistent with that of
> Arrays.binarySearch ?  
> 

Thanks! I'll change them to match.

> 
> Another trivia - about 2 years ago, I proposed a JAXB fluent api using
> the "with*" convention for method chaining:
> 
>   https://jaxb2-commons.dev.java.net/fluent-api
> 
> Now that I see such convention being extensively used in
> ParallelArray, I can't help thinking there is a causal relationship
> :)
> 

The idea of using fluent style here is from Tim Peierls;
he'll have to answer about what led to the suggestion.

-Doug
Tim Peierls | 4 Feb 14:11 2008

Re: ParallelArray.binarySearch

On Feb 4, 2008 7:30 AM, Doug Lea <dl <at> cs.oswego.edu> wrote:

The idea of using fluent style here is from Tim Peierls;
he'll have to answer about what led to the suggestion.

Combination of the Fowler article, the Guice API, and a few other things that I can't remember. I don't think JAXB was in there, since I don't use it, but I won't swear it wasn't. :-)

--tim
Tim Peierls | 4 Feb 14:56 2008

Re: Matrix multiply with parallelized inner product

On Feb 4, 2008 1:01 AM, Hanson Char <hanson.char <at> gmail.com> wrote:

In the wiki example "Matrix multiply with parallelized inner product"

 http://artisans-serverintellect-com.si-eioswww6.com/default.asp?W42

"It is much, much slower than the version that just parallelizes the outer loop."

Did you know this as a fact prior to benchmarking ?  

No, but I didn't expect my already fully-utilized 2 logical processors (1 physical) to be able to take much advantage of the additional granularity. :-)

 
Does this mean too much parallelism via PA would result in slower performance ?

A nested PA call when all the processors are busy with mostly independent work just adds overhead.

 
If so, is there any guideline/recipe for how far one should go in using
PA without causing such a slowdown (besides trial-and-error) ?

I'd think twice about nesting PA calls unless the outer call leaves you with many processors idle. See my last comment on that page:

"The only way I could see this approach being practical is when the number of processors greatly exceeds the number of columns in the result."

I'd use RecursiveTask/Action instead if tempted to use nested PA calls.
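
A rough sketch of that shape (a RecursiveAction that splits the outer loop over
rows; the import, threshold, and fork/join idiom are my assumptions and may not
match the current jsr166y snapshot exactly):

    import jsr166y.forkjoin.RecursiveAction;

    class RowMultiplyTask extends RecursiveAction {
        static final int THRESHOLD = 16;   // rows per leaf task; tune for the machine
        final double[][] a, b, c;          // computes c = a * b over rows [loRow, hiRow)
        final int loRow, hiRow;

        RowMultiplyTask(double[][] a, double[][] b, double[][] c, int loRow, int hiRow) {
            this.a = a; this.b = b; this.c = c;
            this.loRow = loRow; this.hiRow = hiRow;
        }

        protected void compute() {
            if (hiRow - loRow <= THRESHOLD) {
                for (int i = loRow; i < hiRow; i++)          // sequential inner products
                    for (int j = 0; j < b[0].length; j++) {
                        double sum = 0;
                        for (int k = 0; k < b.length; k++)
                            sum += a[i][k] * b[k][j];
                        c[i][j] = sum;
                    }
            } else {
                int mid = (loRow + hiRow) >>> 1;             // split the row range in two
                RowMultiplyTask left  = new RowMultiplyTask(a, b, c, loRow, mid);
                RowMultiplyTask right = new RowMultiplyTask(a, b, c, mid, hiRow);
                left.fork();              // run the two halves in parallel
                right.compute();
                left.join();
            }
        }
    }

    // Submitted to the pool as something like:
    //   pool.invoke(new RowMultiplyTask(a, b, c, 0, a.length));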

--tim
Hanson Char | 4 Feb 18:25 2008

Re: Matrix multiply with parallelized inner product

> "The only way I could see this approach being practical is when the number
> of processors greatly exceeds the number of columns in the result."
>
> I'd use RecursiveTask/Action instead if tempted to use nested PA calls.

If the number of processors greatly exceeds the number of columns,
would using nested PA calls be significantly faster than using
RecursiveTask/Action (in this case of matrix multiplication) ?

It would be nice to have a further code snippet on the wiki to
illustrate the combination of the (non-nested) PA with
RecursiveTask/Action for the inner product, if doing so would lead to
a solution that performs reasonably well regardless of the number
of processors.  Or maybe the "forkJoinMatrixMultiply" method is
already optimal in general ?

Hanson
