Greg A. Woods | 3 Jun 09:24 2005

another (alpha 1.6.x) PMAP SMP simple_lock deadlock -- this time with networking?

For some annoying reason the full system lockups we've been suffering on
the 4-CPU ES40 AlphaServer I recently put into production with 1.6.x
went from once every 8-10 days (which was sufferable but painful) to
once per day, now three days in a row (which is damn near intolerable).

The only good thing about it has been that I've been able to directly
manage the reboot a couple of times, instead of the on-call guys just
hitting the reset button, and so I've finally got some info from DDB.
(It's also good to finally get reasonably strong evidence that it's
"just" a software bug, and not some hardware issue! :-)

There are three events here.  The first is a bit different, but even the
last two seem subtly different from each other (one being in the TCP
path, the other in UDP).

However, all three are quite different from the other PMAP-related
simple_lock bugs I've seen in the past (and which seem to have been
fixed by various patches that have been posted).

I suppose I should post this as a PR too, but I'm wondering if this is
enough information to give someone who's worked with these kinds of bugs
before a clue as to where to start looking, or whether there's anything
more I can/should collect from DDB.

(Note that a full core dump, while theoretically possible in some cases
now that I have an 18GB dump device attached and only 16GB of RAM, is
likely impossible in this case: all interrupts seem to be disabled at
the point of the hang, so even a "sync" or "reboot" from DDB fails to
get anywhere and the machine must be halted again and then booted from
scratch from SRM.  Maybe I can halt the other CPUs and then try to get
[...]

Jan Schaumann | 3 Jun 18:52 2005

Re: SMP implementation detail

Tiffany Snyder <tiffany.snyder <at> gmail.com> wrote:

>     is there any write-up on SMP or any references to published papers
> it's based on? I'm in the process of evaluating an SMP capable OS for
> our product and multi-arch is a prerequisite (MIPS, x86, and PowerPC).

http://www.netbsd.org/Changes/2003.html#merge-nathanw_sa
http://web.mit.edu/nathanw/www/usenix/

-Jan

-- 
chown -R us:enemy your_base

Perry E. Metzger | 3 Jun 19:02 2005

Re: SMP implementation detail


Jan Schaumann <jschauma <at> netmeister.org> writes:
> Tiffany Snyder <tiffany.snyder <at> gmail.com> wrote:
>
>>     is there any write-up on SMP or any references to published papers
>> it's based on? I'm in the process of evaluating an SMP capable OS for
>> our product and multi-arch is a prerequisite (MIPS, x86, and PowerPC).
>
> http://www.netbsd.org/Changes/2003.html#merge-nathanw_sa
> http://web.mit.edu/nathanw/www/usenix/

That's not our SMP stuff, that's the pthreads stuff.

Perry

Jan Schaumann | 3 Jun 19:11 2005

Re: SMP implementation detail

"Perry E. Metzger" <perry <at> piermont.com> wrote:
> 
> Jan Schaumann <jschauma <at> netmeister.org> writes:
> > Tiffany Snyder <tiffany.snyder <at> gmail.com> wrote:
> >
> >>     is there any write-up on SMP or any references to published papers
> >> it's based on? I'm in the process of evaluating an SMP capable OS for
> >> our product and multi-arch is a prerequisite (MIPS, x86, and PowerPC).
> >
> > http://www.netbsd.org/Changes/2003.html#merge-nathanw_sa
> > http://web.mit.edu/nathanw/www/usenix/
> 
> That's not our SMP stuff, that's the pthreads stuff.

D'oh.

-Jan

-- 
What do you mean, why has it got to be built? It's a 
bypass. Got to build bypasses.
