Andrew Shewmaker | 1 Oct 2006 08:05

Re: Has anyone actually seen/used a cell system?

On 9/19/06, J.A.Delcorso <at> larc.nasa.gov <J.A.Delcorso <at> larc.nasa.gov> wrote:
> If this has already been discussed on the list, a pointer
> to the thread would be appreciated!
>
> Looking over the list I found a couple references to the
> IBM cell processors being used in PS3.  Has anyone had a
> chance to actually use/test one of these systems yet?
> Considering that the PS3 is supposed to be competing with
> the XBOX 360 and the Wii, I assume that it's out, or that
> someone has actually used one of these systems.
>
> Can anyone point me to a url, or tell me what their
> experience is with this technology?  Is it as fast as
> it's purported to be?  Apparently RedHat is developing
> EL 4.3 to run on the system?
>
> Any input is appreciated,

People in Stanford's graphics lab are doing promising
work making the Cell easier to program.

They have been busy creating a programming language
called Sequoia that focuses on memory hierarchies.  A
programmer defines a task that recursively decomposes
the input data until it reaches the bottom of the memory
hierarchy, where it runs an optimized kernel.  The runtime
environment uses a specification of the memory hierarchy
for the generic PC cluster, Cell Blade Server, etc.  In the
case of a cluster, the runtime is built on top of MPI.  In the
case of a Cell Blade Server, the runtime manages task
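The decomposition idea described above can be sketched roughly as follows. This is plain Python, not Sequoia syntax; the names `run_task` and `leaf_kernel` and the capacity numbers are made up for illustration, and a real runtime would also move data between levels rather than just slicing a list.

```python
# Hypothetical sketch of hierarchy-aware recursive decomposition.
# Not Sequoia code: function names and capacities are illustrative only.

def leaf_kernel(block):
    """Stand-in for the optimized kernel that runs at the bottom level."""
    return sum(block)

def run_task(data, capacities, level=0):
    """Reduce `data`, splitting it until each piece fits the memory hierarchy.

    `capacities` lists, from the top level down, how many elements each
    level can hold. Every capacity is assumed to be at least 1.
    """
    if len(data) > capacities[level]:
        # Too big for this level: split in half and solve each half here.
        mid = len(data) // 2
        return (run_task(data[:mid], capacities, level)
                + run_task(data[mid:], capacities, level))
    if level + 1 < len(capacities):
        # Fits at this level: hand the block down to the next, smaller level.
        return run_task(data, capacities, level + 1)
    # Bottom of the hierarchy: run the kernel on a block that fits.
    return leaf_kernel(data)

result = run_task(list(range(1000)), [256, 64, 16])
```

Swapping the `capacities` list is the rough analogue of giving the runtime a different machine description (cluster node memory, Cell local store, and so on) while the task itself stays the same.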

John Hearns | 1 Oct 2006 10:30

Re: Internet cluster

Maxence Dunnewind wrote:
> OK,
> 
> I would like to set up a "packaging farm" because I know some people who
> package some big apps, and the build time is about 20 hours :/
> So, do you think there really is no solution for parallel work over the
> Internet?

If you install Sun Gridengine on a set of machines, there is a parallel 
make environment:

http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/doc/htmlman/htmlman1/qmake.html

Does this satisfy your requirements?

Mark Hahn | 1 Oct 2006 22:58

Re: Has anyone actually seen/used a cell system?

> They have a paper that explains it well and has some
> interesting benchmarks.
>
> http://sc06.supercomputing.org/schedule/pdf/pap225.pdf

this is quite interesting.  I wish they had done benchmarks with doubles,
especially since they alluded to, for instance, the n-body calculation 
really needing at least careful consideration of precision/resolution.
(now that I think of it, using 23 bits of mantissa on a 256^3 FFT 
sounds numerically dubious too.)
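A small sketch of why that size is uncomfortable in single precision: 256^3 is exactly 2^24, and an IEEE 754 single has a 24-bit significand (23 stored bits plus the implicit leading one). The helper `to_float32` is an assumed name for illustration; this only shows the representable-integer limit, it is not an error analysis of an actual FFT.

```python
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest IEEE 754 single."""
    return struct.unpack("f", struct.pack("f", x))[0]

n_points = 256 ** 3                 # number of points in a 256^3 FFT
total = to_float32(float(n_points)) # e.g. a sum of n_points unit-sized terms

# Add one more unit-sized term and round the result back to single precision.
bumped = to_float32(total + 1.0)

at_limit = (n_points == 2 ** 24)    # the point count equals 2**24
absorbed = (bumped == total)        # the extra 1.0 is lost to rounding
```

Both `at_limit` and `absorbed` come out true, so a running sum over that many unit-sized terms already sits at the edge of what a single can resolve.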

interesting that for a 2.4GHz Cell, they get at most 10 FP Gflops per SPE.
does anyone have SGEMM numbers for a 3GHz Intel Core2?  I'll guess that
efficiency of libgoto with 2 threads would be >= 80%, so flops would be 
.8*2*8*3 =~ 40 Gflops, or half a Cell chip. 
makes it hard to argue for wide use of Cell, I think...
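The back-of-envelope estimate above, written out as a sketch. The 8 flops/cycle, 2 threads, 3 GHz and 80% efficiency are the guesses from the message; the count of 8 SPEs per chip is an assumption taken from the "half a Cell chip" remark, and the function name is made up.

```python
def estimated_gflops(clock_ghz, flops_per_cycle, cores, efficiency):
    """Peak flops scaled by an assumed efficiency; a guess, not a measurement."""
    return clock_ghz * flops_per_cycle * cores * efficiency

# Core2 guess from the message: .8 * 2 * 8 * 3
core2 = estimated_gflops(clock_ghz=3.0, flops_per_cycle=8, cores=2, efficiency=0.8)

# Cell: "at most 10 FP Gflops per SPE", assuming 8 SPEs per chip.
cell_chip = 10.0 * 8

ratio = core2 / cell_chip
print(f"Core2 estimate: {core2:.1f} Gflops")    # prints 38.4
print(f"Fraction of a Cell chip: {ratio:.2f}")  # prints 0.48
```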

Geoff Jacobs | 1 Oct 2006 23:29

Re: Has anyone actually seen/used a cell system?

Mark Hahn wrote:
>> They have a paper that explains it well and has some
>> interesting benchmarks.
>>
>> http://sc06.supercomputing.org/schedule/pdf/pap225.pdf
> 
> this is quite interesting.  I wish they had done benchmarks with doubles,
> especially since they alluded to, for instance, the n-body calculation
> really needing at least careful consideration of precision/resolution.
> (now that I think of it, using 23 bits of mantissa on a 256^3 FFT sounds
> numerically dubious too.)
> 
> interesting that for a 2.4GHz Cell, they get at most 10 FP Gflops per SPE.
> does anyone have SGEMM numbers for a 3GHz Intel Core2?  I'll guess that
> efficiency of libgoto with 2 threads would be >= 80%, so flops would be
> .8*2*8*3 =~ 40 Gflops, or half a Cell chip. makes it hard to argue for
> wide use of Cell, I think...

Unfortunately, the reality is a little crappier. Sciencemark 2.0 SGEMM
sees 11 gflops on an E6700. DGEMM sees 5-6 gflops.
http://www.pcper.com/article.php?aid=265&type=expert&pid=3

This is an order of magnitude less performance than SGEMM predictions in
the LBL paper. Unfortunately, the LBL numbers are only predictions.
http://www.lbl.gov/Science-Articles/Archive/sabl/2006/Jul/CellProcessorPotential.pdf#search=%22sgemm%20cell%22

The linked article _is_ an evaluation of performance on an actual Cell
chip. Unfortunately, it's a lower clocked pre-production example running
an experimental pseudo-compiler. I'm interested in seeing SGEMM using
Cell-specific intrinsics. Such a benchmark should represent the maximum

Andrew Shewmaker | 2 Oct 2006 02:28

Re: Has anyone actually seen/used a cell system?

On 10/1/06, Geoff Jacobs <gdjacobs <at> gmail.com> wrote:

> > interesting that for a 2.4GHz Cell, they get at most 10 FP Gflops per SPE.
> > does anyone have SGEMM numbers for a 3GHz Intel Core2?  I'll guess that
> > efficiency of libgoto with 2 threads would be >= 80%, so flops would be
> > .8*2*8*3 =~ 40 Gflops, or half a Cell chip. makes it hard to argue for
> > wide use of Cell, I think...
>
> Unfortunately, the reality is a little crappier. Sciencemark 2.0 SGEMM
> sees 11 gflops on an E6700. DGEMM sees 5-6 gflops.
> http://www.pcper.com/article.php?aid=265&type=expert&pid=3

The same site reports that the X6800, a 2.93 GHz Core2, sees
almost 12.5 SP GFLOPS using ScienceMark 2.0 (6.2 DP GFLOPS).

http://www.pcper.com/article.php?aid=272&type=expert&pid=5

I don't know much about ScienceMark.  The website has been
replaced with advertisements.  From what I gathered from several
review sites, it is MS Windows only and single threaded.  My
guess is that Goto's implementation would perform significantly
better even with a single thread.  Unfortunately, I looked all over
and couldn't find Core2 benchmarks using Goto's BLAS.

> The linked article _is_ an evaluation of performance on an actual Cell
> chip. Unfortunately, it's a lower clocked pre-production example running
> an experimental pseudo-compiler. I'm interested in seeing SGEMM using
> Cell-specific intrinsics. Such a benchmark should represent the maximum
> practical performance peak.
>

Andrew Shewmaker | 2 Oct 2006 03:23

Re: Has anyone actually seen/used a cell system?

On 10/1/06, Andrew Shewmaker <agshew <at> gmail.com> wrote:

> It looks like a preproduction 2.4 GHz Cell is 2-6 times faster than a 2.93 GHz
> Core2 at SGEMM.  That's an awfully big range, so hopefully someone
> will be kind enough to benchmark libgoto on Core2 for us.  The history file
> indicates that libgoto is optimized for Core2, but I don't have one to test.

I apologize for replying to my own message, but the 2-6 times faster isn't a
good range since it assumes only one of the Core2 cores is used for the
upper bound (80/12.5).  Assuming that ScienceMark's BLAS scaled
perfectly across two cores, the upper bound would be about 3.

So, it looks like a preproduction 2.4 GHz Cell is about 2-3 times faster than a
2.93 GHz Core2 at SGEMM.

However, IBM intends to scale production Cells to 3.2 GHz (let's assume a
1.3x speedup).  And Intel intends to double their cores again, and we expect
them to lower the clock of those cores too.  Anandtech thinks 2.66GHz
is the fastest we'll see.

http://www.anandtech.com/mac/showdoc.aspx?i=2832&p=6

So, that might give us a 2.66/2.93*2 = 1.8x speedup for SGEMM on Intel's quad
core.  The Cell may only be 1.4-2.3 times faster at SGEMM than an Intel
solution by Q1 2007.  Most people I know would love to have that kind of
speedup if it didn't take too much effort.  Sequoia looks like it might make
the level of effort reasonable.
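For anyone who wants to check the arithmetic, here is the projection as a sketch. All inputs are the figures quoted in this thread (80 Gflops for the Cell, 12.5 Gflops per Core2 core from ScienceMark, the 2x lower bound, the 1.3x and 2.66 GHz guesses); the variable names are mine and perfect scaling across cores is assumed throughout.

```python
cell_sgemm = 80.0        # Gflops, preproduction 2.4 GHz Cell (figure from the thread)
core2_per_core = 12.5    # Gflops, ScienceMark SGEMM on one X6800 core

# Current range: lower bound taken as 2x from the message,
# upper bound assumes ScienceMark's BLAS scaled perfectly over 2 cores.
low_now = 2.0
high_now = cell_sgemm / (2 * core2_per_core)

# Projected scaling: production Cell at 3.2 GHz, Intel quad core at 2.66 GHz.
cell_scale = 1.3
intel_scale = 2.66 / 2.93 * 2

low_later = low_now * cell_scale / intel_scale
high_later = high_now * cell_scale / intel_scale

print(f"Intel quad-core speedup: {intel_scale:.1f}x")              # prints 1.8x
print(f"Projected Cell advantage: {low_later:.1f}-{high_later:.1f}x")  # prints 1.4-2.3x
```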

FYI, Charm++ is also working on the difficulty of Cell programming.


Andrew Shewmaker | 2 Oct 2006 03:40

Re: Has anyone actually seen/used a cell system?

On 10/1/06, Mark Hahn <hahn <at> physics.mcmaster.ca> wrote:
> > They have a paper that explains it well and has some
> > interesting benchmarks.
> >
> > http://sc06.supercomputing.org/schedule/pdf/pap225.pdf
>
> this is quite interesting.  I wish they had done benchmarks with doubles,
> especially since they alluded to, for instance, the n-body calculation
> really needing at least careful consideration of precision/resolution.
> (now that I think of it, using 23 bits of mantissa on a 256^3 FFT
> sounds numerically dubious too.)

Peak double precision performance of Ax=b is 15 GFLOPS on
a 3.2 GHz Cell.

http://icl.cs.utk.edu/iter-ref/custom/index.html?lid=104&slid=210

I can understand why they didn't use doubles in the Sequoia
paper.  DP performance of the architecture is out of their hands
and mentioning it would have distracted from their focus on
programming to the memory hierarchy.

-- 
Andrew Shewmaker

Geoff Jacobs | 2 Oct 2006 03:06

Re: Has anyone actually seen/used a cell system?

Andrew Shewmaker wrote:
> On 10/1/06, Geoff Jacobs <gdjacobs <at> gmail.com> wrote:
> 
>> > interesting that for a 2.4GHz Cell, they get at most 10 FP Gflops
>> per SPE.
>> > does anyone have SGEMM numbers for a 3GHz Intel Core2?  I'll guess that
>> > efficiency of libgoto with 2 threads would be >= 80%, so flops would be
>> > .8*2*8*3 =~ 40 Gflops, or half a Cell chip. makes it hard to argue for
>> > wide use of Cell, I think...
>>
>> Unfortunately, the reality is a little crappier. Sciencemark 2.0 SGEMM
>> sees 11 gflops on an E6700. DGEMM sees 5-6 gflops.
>> http://www.pcper.com/article.php?aid=265&type=expert&pid=3
> 
> The same site reports that the X6800, a 2.93 GHz Core2, sees
> almost 12.5 SP GFLOPS using ScienceMark 2.0 (6.2 DP GFLOPS).
> 
> http://www.pcper.com/article.php?aid=272&type=expert&pid=5
> 
> I don't know much about ScienceMark.  The website has been
> replaced with advertisements.  From what I gathered from several
> review sites, it is MS Windows only and single threaded.  My
> guess is that Goto's implementation would perform significantly
> better even with a single thread.  Unfortunately, I looked all over
> and couldn't find Core2 benchmarks using Goto's BLAS.

WRT ScienceMark sgemm being single threaded, I think you're right. The
main web site is German, and explicitly states that the BLAS tests are
not SMP. My bad :|
http://www.sciencemark.de/faq.html

Mark Hahn | 2 Oct 2006 05:24

Re: Has anyone actually seen/used a cell system?

>> The same site reports that the X6800, a 2.93 GHz Core2, sees
>> almost 12.5 SP GFLOPS using ScienceMark 2.0 (6.2 DP GFLOPS).

hmm, those numbers are pretty low - peak should be 2.93*4 (DP) or 2.93*8 (SP)
Gflops/core, and I'd expect 80% of SP peak, or 19 Gflops/core, for this
comparison.  (Opterons can do 90%, at least on my machine using HPL.)

so the paper shows 80.6 Gflops SGEMM for 8 SPE's; it's only 
fair to compare this to 2 or 4 Core2 cores (37.5 and 75 Gflops!)

> indicative of per core performance on Core 2. Is it safe to say that
> Core 2 achieves <15 gflops/core at 3ghz, assuming ~15% premium with Goto
> BLAS?

peak SGEMM/core would be 3*8=24, so 15 sounds quite low.

>> It looks like a preproduction 2.4 GHz Cell is 2-6 times faster than a

do you know of something crippled in the pre-production Cell chips?

it looks like 2x is about right to me, considering that full-production
Cell appears to ship about the same time as 4x Core2.  the main question
is whether that's good enough to make Cell more than a niche product.
I've talked with a number of my better users, and they all tend to want
>=10x speedup before considering non-GP approaches (cell, fpga, gpgpu).

> I guess my biggest objection to Mark's comment was the comparison of
> SGEMM implemented in an experimental language with unproven structure
> with a theoretical calculation of Core 2 peak performance. I'd simply


Mark Hahn | 2 Oct 2006 06:24

Re: Internet cluster

>> So, do you think there really is no solution for parallel work over the
>> Internet?

"over the internet" implies widely ramified machines in different admin
domains, no shared FS, poor interconnect, etc.  since compilation, on today's
machines, is IO-bound rather than compute-bound, this is going to be a problem.

it's no big deal to do parallel makes in a _cluster_, where you have some 
decent queueing system, shared filesystem, etc.  in fact, I did an experiment 
where I simply said
 	CC=sqrun cc
in a makefile, with gnu "make -j 2" for instance, and it worked as expected.
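For illustration only, the same idea outside of make could be sketched like this: prefix each compile command with a launcher and run a couple of them at a time. `sqrun` is the command from the message above and its behaviour is not specified here; the `cc -c` command line and both function names are assumptions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_commands(sources, launcher):
    """Prefix each compile command with a launcher, like CC="sqrun cc" does."""
    return [launcher + ["cc", "-c", src] for src in sources]

def run_parallel(commands, jobs=2):
    """Run commands with at most `jobs` in flight, loosely like "make -j 2".

    Returns the exit codes in the same order as `commands`.
    """
    def run_one(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(run_one, commands))

# Hypothetical usage; nothing is executed until run_parallel(commands) is called.
commands = build_commands(["a.c", "b.c"], launcher=["sqrun"])
```

Unlike make, this sketch knows nothing about dependencies between targets, so it only fits steps that are independent of each other.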
