My long fight with missed interrupts...
Johnny Billquist <bqt <at> softjar.se>
2010-06-14 06:01:59 GMT
Finally! I've found the problem with my lost interrupts.
It turns out to be a vax-specific problem, and one related to spin mutexes.
More specifically, the mutex_spin_enter function is not truly reentrant.

What can happen (and sometimes does) is this: mutex_spin_enter is
called for a mutex. Between the check whether any other spin mutexes are
being held and the decrement of the counter, a clock interrupt arrives
and raises the IPL above what the spin mutex requires. The interrupt
handler also calls mutex_spin_enter, and since the counter has not yet
been decremented, it doesn't look like any other mutex is being held,
so the IPL of the clock interrupt is saved as the old IPL. The clock
interrupt runs through, restores the IPL to what it was before, and the
interrupted routine continues at the right IPL. However, at
mutex_spin_exit, the routine restores what it believes is the IPL from
before mutex_spin_enter, but the saved value is now the one from when
the interrupt occurred. In the end, mutex_spin_exit actually raises the
IPL instead of lowering it, and the system does not recover properly
from this situation for quite a while.

I don't know under which circumstances it eventually recovers, but it
does happen at some point. Until then, the system is somewhat catatonic.

Here is a fix for this bug. It should be incorporated into the source
tree asap, I'd say. If anyone wants to improve on it, feel free. I've
tested it for a while on an 8650 now, and it works fine there.

The fix is maybe crude. I raise the IPL to IPL_HIGH for maybe longer
than needed while acquiring a spin mutex, and raise it again when
releasing the mutex. Feel free to improve the code if anyone feels like it...
Johnny