3 Jun 2005 09:24
another (alpha 1.6.x) PMAP SMP simple_lock deadlock -- this time with networking?
Greg A. Woods <woods <at> weird.com>
2005-06-03 07:24:26 GMT
For some annoying reason the full system lockups we've been suffering on
the 4-CPU ES40 AlphaServer I recently put into production with 1.6.x
went from once every 8-10 days (which was sufferable but painful) to
once per day, now three days in a row (which is damn near intolerable).
The only good thing about it has been that I've been able to directly
manage the reboot a couple of times, instead of the on-call guys just
hitting the reset button, and so I've finally got some info from DDB.
(It's also good to finally get reasonably strong evidence that it's
"just" a software bug, and not some hardware issue!)

There are three events here; the first is a bit different, but even the
last two seem subtly different (one being in the TCP path, the other in
UDP).
However all three are quite different than other PMAP-related
simple_lock bugs I've seen in the past (and which seem to have been
fixed by various patches that have been posted).
I suppose I should post this as a PR too, but I'm wondering if this is
enough information to give someone who's worked with these kinds of bugs
before a clue as to where to start looking, or whether there's anything
more I can/should collect from DDB.
(Note that a full core dump, while theoretically possible in some cases,
as I now have an 18GB dump device attached and only 16GB of RAM, is
likely impossible in this case as all interrupts seem to be disabled at
the point of the hang and so even a "sync" or "reboot" from DDB fails to
get anywhere and the machine must be halted again and then booted from
scratch from SRM. Maybe I can halt the other CPUs and then try to get