Beat Rubischon | 1 Oct 08:32 2009

Re: recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?

Hi Mark!

Quoting <hahn <at> mcmaster.ca> (30.09.09 21:13):

>> I already saw a lot of Beowulfs where the management network is so rarely
>> used that nobody detects errors. When an emergency arrives, the operators
>> struggle over screwed-up switches and unplugged cables.
> I think that's insane (no offense intended). Same with the comment about
> crash carts being hard to maneuver.

You're absolutely right. And the fact that you regularly post to this list
is probably a sign that you keep your server rooms clean :-)

Over the last years I have seen more than 100 datacenters, and I remember
fewer than 10 that were a good place to work. Empty boxes, rails and cables
on the floor, keyboards with PS/2 plugs on USB-only servers, or even no
monitor at all. Broken raised-floor tiles, leaking cooling pipes,
daisy-chained power strips without a free outlet. You remember that cut
cables are always too short? Once installed, they are rarely replaced with a
longer one that doesn't need to be strung through the air. Ah yes, the clips
on RJ-45 plugs can break. Even on the uplinks of large servers.

As long as a single person or a motivated team handles a datacenter,
everything looks good. But when spare-time operators, or even worse several
groups, are using a spare room in the basement, bad things are common. A
coworker summed up the situation well: entropy is the natural state, and it
takes effort to impose order. And many people out there are not willing or
able to invest that effort.

Beat

Bill Broadley | 1 Oct 22:08 2009

Nvidia FERMI/gt300 GPU

Impressive:
* IEEE floating point, doubles 1/2 as fast as single precision (6 times or
  so faster than the gt200).
* ECC
* 512 cores (gt200 has 240)
* 384-bit GDDR5 bus (twice as fast per pin; the gt200 has 512 bits)
* 3 billion transistors
* 64KB of L1 cache per SM, 768KB L2, cache coherent across the chip.
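
Back-of-the-envelope, those figures translate into peak rates as follows;
the shader clock was not announced, so the 1.5 GHz in this sketch is purely
an assumed placeholder:

  # Rough peak-FLOPS arithmetic for the specs above. The shader clock is an
  # assumption (NVIDIA announced neither clocks nor prices at this point).
  cores = 512                          # CUDA cores on Fermi
  clock_ghz = 1.5                      # assumed shader clock, not announced
  sp_gflops = cores * 2 * clock_ghz    # 2 flops/core/clock via fused multiply-add
  dp_gflops = sp_gflops / 2            # doubles run at half the SP rate
  print("SP ~%.0f GFLOPS, DP ~%.0f GFLOPS" % (sp_gflops, dp_gflops))
  # -> SP ~1536 GFLOPS, DP ~768 GFLOPS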

No timeline or price, but it sounds like a rather interesting GPU for CUDA
and OpenCL.

More details:
http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932
Craig Tierney | 1 Oct 23:18 2009

Re: Nvidia FERMI/gt300 GPU

Bill Broadley wrote:
> Impressive:
> * IEEE floating point, doubles 1/2 as fast as single precision (6 times or
>   so faster than the gt200).
> * ECC

The GDDR5 spec says it supports ECC, but what is the card actually going to
do? Is the ECC just at the memory controller, or is it ECC all the way
through the chip? Is it 1-bit correction, 2-bit error detection?

Anyone?

Craig

> * 512 cores (gt200 has 240)
> * 384-bit GDDR5 bus (twice as fast per pin; the gt200 has 512 bits)
> * 3 billion transistors
> * 64KB of L1 cache per SM, 768KB L2, cache coherent across the chip.
> 
> No timeline or price, but it sounds like a rather interesting GPU for CUDA
> and OpenCL.
> 
> More details:
> http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932


Bill Broadley | 1 Oct 23:42 2009

Re: Nvidia FERMI/gt300 GPU

Craig Tierney wrote:
> Bill Broadley wrote:
>> Impressive:
>> * IEEE floating point, doubles 1/2 as fast as single precision (6 times or
>>   so faster than the gt200).
>> * ECC
> 
> The GDDR5 spec says it supports ECC, but what is the card actually going to
> do? Is the ECC just at the memory controller, or is it ECC all the way
> through the chip? Is it 1-bit correction, 2-bit error detection?

Nvidia is pleasingly specific in their white paper:
http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIAFermiComputeArchitectureWhitepaper.pdf

Specifically:
 Fermi supports Single-Error Correct Double-Error Detect (SECDED) ECC codes
 that correct any single bit error in hardware as the data is accessed.
 ...
 Fermi’s register files, shared memories, L1 caches, L2 cache, and DRAM memory
 are ECC protected
 ...
 All NVIDIA GPUs include support for the PCI Express standard for CRC check
 with retry at the data link layer. Fermi also supports the similar GDDR5
 standard for CRC check with retry (aka “EDC”) during transmission of data
 across the memory bus.

Kudos to Nvidia for being very clear.
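
To make the SECDED part concrete: below is a toy Python sketch of an
extended Hamming(8,4) code, the same correct-one/detect-two scheme the
whitepaper names. This is purely illustrative; real DRAM ECC protects
64-bit words with 8 check bits, but the mechanics are identical.

  def secded_encode(nibble):
      """Encode 4 data bits into an 8-bit SECDED codeword."""
      d = [(nibble >> i) & 1 for i in range(4)]       # data bits d1..d4
      p1 = d[0] ^ d[1] ^ d[3]                         # parity over positions 3,5,7
      p2 = d[0] ^ d[2] ^ d[3]                         # parity over positions 3,6,7
      p3 = d[1] ^ d[2] ^ d[3]                         # parity over positions 5,6,7
      bits = [p1, p2, d[0], p3, d[1], d[2], d[3]]     # codeword positions 1..7
      p0 = bits[0] ^ bits[1] ^ bits[2] ^ bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
      return sum(b << i for i, b in enumerate([p0] + bits))

  def secded_decode(word):
      """Return (nibble, status); status is 'ok', 'corrected' or 'double'."""
      bits = [(word >> i) & 1 for i in range(8)]      # [p0, positions 1..7]
      syndrome = 0
      for pos in range(1, 8):
          if bits[pos]:
              syndrome ^= pos                         # XOR of set-bit positions
      overall = 0
      for b in bits:
          overall ^= b                                # total parity, should be 0
      if syndrome == 0 and overall == 0:
          return _data(bits), 'ok'
      if overall == 1:                                # odd parity: single-bit error
          if syndrome:
              bits[syndrome] ^= 1                     # flip the faulty bit
          return _data(bits), 'corrected'             # syndrome 0: p0 itself flipped
      return None, 'double'                           # even parity + syndrome: 2 errors

  def _data(bits):
      return bits[3] | bits[5] << 1 | bits[6] << 2 | bits[7] << 3

  # e.g. flipping any single bit of secded_encode(0b1011) decodes back to
  # 0b1011 with status 'corrected'; flipping two bits reports 'double'.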

Kilian CAVALOTTI | 2 Oct 09:19 2009

Re: Nvidia FERMI/gt300 GPU

On Thu, Oct 1, 2009 at 10:08 PM, Bill Broadley <bill <at> cse.ucdavis.edu> wrote:
> No timeline or price, but it sounds like a rather interesting GPU for CUDA
> and OpenCL.

Not only CUDA and OpenCL, but also DirectX, DirectCompute, C++, and
Fortran. From a programmer's point of view, it could be a major
improvement, removing what may be the last thing that kept people from
using GPUs to run their code.

On a side note, it's funny to notice that this is probably the first
GPU in history to be introduced by its manufacturer as a
"supercomputer on a chip" rather than a graphics engine that will
allow gamers to play their favorite FPS at never-reached resolutions
and framerates. Reading some reviews, it seems the traditional
audience of such events (gamers) was quite disappointed not to see
what the announcements could mean to them.
After all, HPC is still a niche compared to the worldwide video game
market, and it's impressive that NVIDIA decided to focus on this tiny
fraction of its prospective buyers rather than go for the usual
my-vertex-pipeline-is-longer-than-yours. :)

Cheers,
-- 
Kilian
Igor Kozin | 2 Oct 11:01 2009

Re: Nvidia FERMI/gt300 GPU

> Not only CUDA and OpenCL, but also DirectX, DirectCompute, C++, and
> Fortran. From a programmer's point of view, it could be a major
> improvement, removing what may be the last thing that kept people from
> using GPUs to run their code.

All of the above except C++ can already be used on the current hardware.
CUDA Fortran is already available in PGI 9.0-4 (public beta).

> On a side note, it's funny to notice that this is probably the first
> GPU in history to be introduced by its manufacturer as a
> "supercomputer on a chip" rather than a graphics engine that will
> allow gamers to play their favorite FPS at never-reached resolutions
> and framerates. Reading some reviews, it seems the traditional
> audience of such events (gamers) was quite disappointed not to see
> what the announcements could mean to them.
> After all, HPC is still a niche compared to the worldwide video game
> market, and it's impressive that NVIDIA decided to focus on this tiny
> fraction of its prospective buyers rather than go for the usual
> my-vertex-pipeline-is-longer-than-yours. :)

Yes, indeed, and such a strong skew towards HPC worries me. There is
almost no word on how gamers are going to benefit from Fermi apart from
C++ support. In the meantime, the RV770 offers 2.7 TFlops SP and 1/4 of
that in DP, which positions it pretty close to Fermi in that respect.
On the RV770, quick integer operations are 24-bit, and I wonder if the
same still holds true for Fermi.
Rahul Nabar | 2 Oct 21:46 2009

Re: recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?

On Thu, Oct 1, 2009 at 1:32 AM, Beat Rubischon <beat <at> 0x1b.ch> wrote:

>
> As long as a single person or a motivated team handles a datacenter,
> everything looks good. But when spare-time operators, or even worse several
> groups, are using a spare room in the basement, bad things are common.

+1, coming from a university setting. I think this is very typical
amongst my peers. Not "good" or "desirable", but just a fact.

I also suspect this is a "size of the cluster" issue. Many of the
HPC clusters I see are "small": 30-50 servers clustered together. I
think these tend to be a lot more messy and hacked-together than the
nice, clean, efficient setups many of you on this list might have.
Mark probably deals with "huge" systems. Just a thought.

-- 
Rahul
Rahul Nabar | 2 Oct 21:51 2009

Re: recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?

On Wed, Sep 30, 2009 at 9:09 AM, Joe Landman
<landman <at> scalableinformatics.com> wrote:
> Of course I could also talk about the SOL (serial over lan) which didn't

Thanks, Joe! I've been reading about IPMI and also talking to my vendor
about it. Sure enough, our machines also have IPMI support!

Question: what's the difference between SOL and IPMI? Is one a subset
of the other?

Frankly, based on my past experience, the part of IPMI that I foresee
being most useful to us is the console (keyboard + monitor) redirection,
right from the BIOS stage of the boot cycle. All the other IPMI
capabilities are of course cool and nice, but they are icing on the cake
for me.

-- 
Rahul
Skylar Thompson | 3 Oct 05:13 2009

Re: recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?

Rahul Nabar wrote:
> On Wed, Sep 30, 2009 at 9:09 AM, Joe Landman
> <landman <at> scalableinformatics.com> wrote:
>   
>> Of course I could also talk about the SOL (serial over lan) which didn't
>>     
>
> Thanks, Joe! I've been reading about IPMI and also talking to my vendor
> about it. Sure enough, our machines also have IPMI support!
>
> Question: what's the difference between SOL and IPMI? Is one a subset
> of the other?
>   

SOL is provided by IPMI v1.5+, so it's a part of IPMI itself.

> Frankly, based on my past experience, the part of IPMI that I foresee
> being most useful to us is the console (keyboard + monitor) redirection,
> right from the BIOS stage of the boot cycle. All the other IPMI
> capabilities are of course cool and nice, but they are icing on the cake
> for me.
>   

In addition to the console, the other really useful feature of IPMI is
remote power cycling. That's useful when the console itself is totally
wedged.
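
For concreteness, both of these are one-liners with the stock ipmitool
client once the BMC is reachable over the LAN. A minimal Python wrapper;
the hostname and credentials below are placeholders:

  import subprocess

  def ipmi(host, user, password, *args):
      """Run an ipmitool command against a remote BMC over the LAN."""
      cmd = ["ipmitool", "-I", "lanplus", "-H", host,
             "-U", user, "-P", password] + list(args)
      return subprocess.run(cmd, check=True, capture_output=True,
                            text=True).stdout

  # Hard power-cycle a wedged node (no OS cooperation needed):
  #   ipmi("node42-bmc", "admin", "secret", "chassis", "power", "cycle")
  # Attach to the serial-over-LAN console (interactive, so no capture):
  #   subprocess.run(["ipmitool", "-I", "lanplus", "-H", "node42-bmc",
  #                   "-U", "admin", "-P", "secret", "sol", "activate"])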

-- 
-- Skylar Thompson (skylar <at> cs.earlham.edu)
-- http://www.cs.earlham.edu/~skylar/

Rahul Nabar | 3 Oct 18:33 2009

Re: recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?

On Fri, Oct 2, 2009 at 10:13 PM, Skylar Thompson <skylar <at> cs.earlham.edu> wrote:
> Rahul Nabar wrote:
>> On Wed, Sep 30, 2009 at 9:09 AM, Joe Landman
>> <landman <at> scalableinformatics.com> wrote:

>
> In addition to the console, the other really useful feature of IPMI is
> remote power cycling. That's useful when the console itself is totally
> wedged.
>

True, that's a useful feature. But that "could" be done by sending
Wake-on-LAN "magic packets" to an Ethernet card as well, right? I say
"could" because I don't have that running on all my servers, though I
had toyed with it on some. I guess there are just many ways of doing
the same thing.
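
For reference, the magic packet itself is trivial: 6 bytes of 0xFF
followed by the target MAC repeated 16 times, typically broadcast over
UDP. A minimal Python sketch (the MAC is a placeholder); note that WoL
can only power a machine *on*, so unlike IPMI it can't reset a host
that is wedged while running:

  import socket

  def send_magic_packet(mac, broadcast="255.255.255.255"):
      """Broadcast a Wake-on-LAN magic packet to UDP port 9."""
      mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
      if len(mac_bytes) != 6:
          raise ValueError("MAC address must be 6 bytes")
      payload = b"\xff" * 6 + mac_bytes * 16      # the entire WoL protocol
      with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
          sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
          sock.sendto(payload, (broadcast, 9))

  # e.g. send_magic_packet("00:1a:2b:3c:4d:5e")   # placeholder MAC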

-- 
Rahul
