13 Feb 2005 01:32
"pmap_unwire: wiring ... didn't change!"
Chuck Silvers <chuq <at> chuq.com>
2005-02-13 00:32:47 GMT
hi,

I looked into the "pmap_unwire: wiring ... didn't change!" problem.
what's happening is that sometimes we're going through uvm_fault() again
for one of the wired pages, and the pmap_enter() which results from that
replaces the wired pmap entry with a non-wired one.

so why would we go through uvm_fault() again? that's happening because
sometimes the TLB entry is recycled after the page mapped by one side of
the entry is wired and before the other page is wired. when
MachTLBUpdate() is then called for the second page, the entry it creates
only has the second side filled in. this causes us to get a "TLB invalid"
trap when we access the other half of the TLB entry instead of the
"TLB miss" trap that we would get if there were no TLB entry at all.
the handler for "TLB invalid" traps from usermode always calls trap()
instead of looking at the PTEs.

so the problem is that the PTEs and the TLB get out of sync, and the
trap handlers aren't expecting that. I see several possible fixes,
in approximate decreasing order of preference:

 - change MachTLBUpdate() to take both PTEs for the TLB entry for MIPS3.
 - change the MIPS3 MachTLBUpdate() to only update existing TLB entries,
   not create new ones.
 - have the MIPS3 user "TLB invalid" trap handler try to reload from
   the PTEs before calling trap().
 - use MIPS3_TBIS() instead of MachTLBUpdate() for user mappings on MIPS3.

the last one was the easiest, so I implemented that one and it appears
to make the problem go away.
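The half-filled entry described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not NetBSD code: the names (tlb_update_one, tlb_update_pair, tlb_drop, wire_pair_then_touch_even), the 4-entry TLB and the round-robin replacement are all hypothetical. It models only the fact that one MIPS3 TLB entry covers an even/odd pair of pages, so an update that knows about a single PTE leaves the other half invalid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NTLB 4                  /* toy TLB size (assumption) */

enum tlb_result { TLB_HIT, TLB_MISS, TLB_INVALID };

struct pte { uint32_t pfn; bool valid; };

/* One entry maps an even/odd pair of virtual pages. */
struct tlb_entry {
    bool used;
    uint32_t vpn2;              /* virtual page number with the low bit dropped */
    struct pte lo[2];           /* [0] = even page, [1] = odd page */
};

static struct tlb_entry tlb[NTLB];
static unsigned next_victim;    /* round-robin replacement (assumption) */

static struct tlb_entry *tlb_find(uint32_t vpn)
{
    for (size_t i = 0; i < NTLB; i++)
        if (tlb[i].used && tlb[i].vpn2 == vpn >> 1)
            return &tlb[i];
    return NULL;
}

/* Claim a slot for the pair containing vpn; both halves start invalid. */
static struct tlb_entry *tlb_alloc(uint32_t vpn)
{
    assert(next_victim < NTLB);
    struct tlb_entry *e = &tlb[next_victim];
    next_victim = (next_victim + 1) % NTLB;
    e->used = true;
    e->vpn2 = vpn >> 1;
    e->lo[0] = e->lo[1] = (struct pte){ 0, false };
    return e;
}

/* What the hardware would report for an access to vpn. */
static enum tlb_result tlb_access(uint32_t vpn)
{
    struct tlb_entry *e = tlb_find(vpn);
    if (e == NULL)
        return TLB_MISS;        /* no entry at all: refill path */
    return e->lo[vpn & 1].valid ? TLB_HIT : TLB_INVALID;
}

/* Single-PTE update: may create an entry with only one half filled in. */
static void tlb_update_one(uint32_t vpn, struct pte p)
{
    struct tlb_entry *e = tlb_find(vpn);
    if (e == NULL)
        e = tlb_alloc(vpn);
    e->lo[vpn & 1] = p;
}

/* Pair update, in the spirit of the first proposed fix. */
static void tlb_update_pair(uint32_t vpn, struct pte even, struct pte odd)
{
    struct tlb_entry *e = tlb_find(vpn);
    if (e == NULL)
        e = tlb_alloc(vpn);
    e->lo[0] = even;
    e->lo[1] = odd;
}

/* Model the entry being recycled for some other mapping. */
static void tlb_drop(uint32_t vpn)
{
    struct tlb_entry *e = tlb_find(vpn);
    if (e != NULL)
        e->used = false;
}

/* Wire the even page, lose the entry, wire the odd page, touch the even page. */
enum tlb_result wire_pair_then_touch_even(bool pair_update)
{
    const uint32_t even = 0x10, odd = 0x11;
    const struct pte pte_even = { 0x100, true }, pte_odd = { 0x101, true };

    for (size_t i = 0; i < NTLB; i++)
        tlb[i].used = false;
    next_victim = 0;

    tlb_update_one(even, pte_even);     /* first page wired */
    tlb_drop(even);                     /* entry recycled in between */
    if (pair_update)
        tlb_update_pair(odd, pte_even, pte_odd);
    else
        tlb_update_one(odd, pte_odd);   /* second page wired */
    return tlb_access(even);
}
```

In this model, wire_pair_then_touch_even(false) reports TLB_INVALID even though both PTEs are valid, which is the out-of-sync state described above; with the pair update the same access is a hit. The other proposed fixes are not modelled here.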
Chuck> the last one was the easiest, so I implemented that one and
Chuck> it appears to make the problem go away.
I've applied your patch and I can confirm that the
"pmap_unwire: ..." messages have vanished (so far, for 2 hours now).
But I still see (my?) data corruption problem:
>> But I still see (my?) data corruption problem:
Chuck> that sounds like a CPU cache problem to me too, probably in
Chuck> bus_dma or the cache-flushing code itself. if it's
Chuck> happening during writes to disk rather than reads from disk
Chuck> then it's probably in the cache write-back part rather than
Chuck> the cache invalidate part. I didn't see anything in a brief
Chuck> look at the code, though.
The tests I mentioned, with which I can reliably reproduce the data
corruption, involve disk access; just _reading_ large amounts of data
from disk is enough to trigger a corruption.
I once tested my qube2's RAM with pkgsrc/sysutils/memtester, and no
errors were reported.
I have not noticed any data corruption when using my qube2 to route
data, but I did no targeted stress testing of that.
Chuck> I don't see this problem on my R4400 pmax 5000/260, so it's
Chuck> likely specific to either the RM5200/R5000 or systems with
[...] To work around the latter I run 'nice pax -zvrpe ...'
over an ssh connection, so that pax's '-v' verbose output produces some
additional load, which prevents most file corruptions (not all: some
files, especially larger ones, might still get corrupted).
>> Hmm, if the problem occurs on quite different hardware that just
>> shares the same MIPS CPU type, the (common) r5k cache handling
>> really seems to be the most probable cause of the corruption
>> (correct?). Or is bus_dma still a candidate?
Chuck> could be either, we don't know yet. the various versions of
Chuck> the bus_dma code for all the MIPS3 platforms are pretty
Chuck> similar.
...despite the fact that the same bus_dma code works for/on your R4k4?