Nathan Scott | 1 Mar 2006 07:14
Picon
Favicon

TAKE 949659 - fix 2.4 builds

Fix module-related build issue for 2.4 kernels.

Date:  Wed Mar  1 17:14:28 AEDT 2006
Workarea:  chook.melbourne.sgi.com:/build/nathans/xfs-linux
Inspected by:  cattelan

The following file(s) were checked into:
  longdrop.melbourne.sgi.com:/isms/xfs-kern/xfs-linux-melb

Modid:  xfs-linux-melb:xfs-kern:25329a
linux-2.4/Makefile - 1.208 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.4/Makefile.diff?r1=text&tr1=1.208&r2=text&tr2=1.207&f=h

David Chinner | 1 Mar 2006 13:53
Picon
Favicon

TAKE 950027 - xfs_icsb_lock_all_counters fails with CONFIG_PREEMPT and >=256p

using a spinlock per cpu for superblock counter exclusion results in
a preempt counter overflow at 256p and above. Change the exclusion mechanism
to use atomic bit operations and busy wait loops to emulate the spin
lock exclusion mechanism but without the preempt count issues.

Date:  Wed Mar  1 23:52:17 AEDT 2006
Workarea:  chook.melbourne.sgi.com:/build/dgc/isms/2.6.x-xfs
Inspected by:  nathans

The following file(s) were checked into:
  longdrop.melbourne.sgi.com:/isms/linux/2.6.x-xfs-melb

Modid:  xfs-linux-melb:xfs-kern:25338a
fs/xfs/xfs_mount.h - 1.216 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_mount.h.diff?r1=text&tr1=1.216&r2=text&tr2=1.215&f=h
	- Use a flag bit for per-cpu counter exclusion rather than a spin lock
	  to prevent preempt count overflows.

fs/xfs/xfs_mount.c - 1.372 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/xfs_mount.c.diff?r1=text&tr1=1.372&r2=text&tr2=1.371&f=h
	- Use a flag bit for per-cpu counter exclusion rather than a spin lock
	  to prevent preempt count overflows.

fs/xfs/linux-2.6/xfs_linux.h - 1.142 - changed
http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-linux/linux-2.6/xfs_linux.h.diff?r1=text&tr1=1.142&r2=text&tr2=1.141&f=h
	- Per-cpu superblock counter locks need to busy wait.

Andi Kleen | 1 Mar 2006 22:36
Picon

Re: TAKE 950027 - xfs_icsb_lock_all_counters fails with CONFIG_PREEMPT and >=256p

dgc <at> sgi.com (David Chinner) writes:

> using a spinlock per cpu for superblock counter exclusion results in
> a preempt counter overflow at 256p and above. Change the exclusion mechanism
> to use atomic bit operations and busy wait loops to emulate the spin
> lock exclusion mechanism but without the preempt count issues.

That sounds like the totally wrong place to fix this. 
Wouldn't it be better to fix the spinlocks instead instead of hacking
around this? After all any other code on that big box could run
into the same issue.

-Andi

David Chinner | 2 Mar 2006 04:06
Picon
Favicon

Re: TAKE 950027 - xfs_icsb_lock_all_counters fails with CONFIG_PREEMPT and >=256p

On Wed, Mar 01, 2006 at 10:36:05PM +0100, Andi Kleen wrote:
> dgc <at> sgi.com (David Chinner) writes:
> 
> > using a spinlock per cpu for superblock counter exclusion results in
> > a preempt counter overflow at 256p and above. Change the exclusion mechanism
> > to use atomic bit operations and busy wait loops to emulate the spin
> > lock exclusion mechanism but without the preempt count issues.
> 
> That sounds like the totally wrong place to fix this. 
> Wouldn't it be better to fix the spinlocks instead instead of hacking
> around this? After all any other code on that big box could run
> into the same issue.

The issue is one cpu taking all the spin locks for a mount point
to exclude per-cpu access while we do something global to the counter
(e.g a rebalance).

I initially used spinlocks because it was simple to implement, I
knew it would prevent preemption in the critical sections and I knew
of nothing that would prevent a single cpu from taking up to NR_CPU
+ 1 nested spinlocks. Turns out I made an incorrect assumption.

However, when we are excluding the counters, we are already running
atomic due to holding the in-core superblock spinlock and the
per-cpu code uses get_cpu()/put_cpu() so neither can get preempted
in the critical sections the spinlocks used to protect.  So the only
considerations are having an atomic exclusion mechanism and a wait
method that does not sleep.

In hindsight, spinlocks were a bad implementation choice. If I knew
(Continue reading)

Andi Kleen | 2 Mar 2006 05:55
Picon

Re: TAKE 950027 - xfs_icsb_lock_all_counters fails with CONFIG_PREEMPT and >=256p

David Chinner <dgc <at> sgi.com> writes:
> 
> However, when we are excluding the counters, we are already running
> atomic due to holding the in-core superblock spinlock and the
> per-cpu code uses get_cpu()/put_cpu() so neither can get preempted
> in the critical sections the spinlocks used to protect.  So the only
> considerations are having an atomic exclusion mechanism and a wait
> method that does not sleep.

spinlocks are should be fine for this in theory. All CPUs spinning
on a spinlock shouldn't happen normally, but it is supposed to work 
correctly if it happens.

If preemptible spinlocks don't scale to this for large NR_CPUs they're
broken and need to be fixed.

You discovered a important limitation in them and now instead
of fixing them properly you're trying to work around it.

Please try to see the big picture a bit more instead of just XFS.

-Andi

David Chinner | 2 Mar 2006 07:58
Picon
Favicon

Re: TAKE 950027 - xfs_icsb_lock_all_counters fails with CONFIG_PREEMPT and >=256p

On Thu, Mar 02, 2006 at 05:55:43AM +0100, Andi Kleen wrote:
> David Chinner <dgc <at> sgi.com> writes:
> > 
> > However, when we are excluding the counters, we are already running
> > atomic due to holding the in-core superblock spinlock and the
> > per-cpu code uses get_cpu()/put_cpu() so neither can get preempted
> > in the critical sections the spinlocks used to protect.  So the only
> > considerations are having an atomic exclusion mechanism and a wait
> > method that does not sleep.
> 
> spinlocks are should be fine for this in theory. All CPUs spinning
> on a spinlock shouldn't happen normally, but it is supposed to work 
> correctly if it happens.

And that does work correctly. However, that's not the problem being
solved here.

The problem is one thread grabbing > 245 spinlocks and holding them
locked. The preempt count is per thread, so it has to be a single
thread that holds all the locks at the same time to cause the problem.

> If preemptible spinlocks don't scale to this for large NR_CPUs they're
> broken and need to be fixed.
> You discovered a important limitation in them and now instead
> of fixing them properly you're trying to work around it.

It's not NR_CPUs that they need to scale to - I could have 2 locks
per cpu that need to be held, and then it's 2*NR_CPUs that the
pre-empt count must scale to. Where do you draw the line?

(Continue reading)

Andi Kleen | 2 Mar 2006 13:09
Picon

Re: TAKE 950027 - xfs_icsb_lock_all_counters fails with CONFIG_PREEMPT and >=256p

On Thursday 02 March 2006 07:58, David Chinner wrote:
> On Thu, Mar 02, 2006 at 05:55:43AM +0100, Andi Kleen wrote:
> > David Chinner <dgc <at> sgi.com> writes:
> > > However, when we are excluding the counters, we are already running
> > > atomic due to holding the in-core superblock spinlock and the
> > > per-cpu code uses get_cpu()/put_cpu() so neither can get preempted
> > > in the critical sections the spinlocks used to protect.  So the only
> > > considerations are having an atomic exclusion mechanism and a wait
> > > method that does not sleep.
> >
> > spinlocks are should be fine for this in theory. All CPUs spinning
> > on a spinlock shouldn't happen normally, but it is supposed to work
> > correctly if it happens.
>
> And that does work correctly. However, that's not the problem being
> solved here.
>
> The problem is one thread grabbing > 245 spinlocks and holding them
> locked. The preempt count is per thread, so it has to be a single
> thread that holds all the locks at the same time to cause the problem.

> > If preemptible spinlocks don't scale to this for large NR_CPUs they're
> > broken and need to be fixed.
> > You discovered a important limitation in them and now instead
> > of fixing them properly you're trying to work around it.
>
> It's not NR_CPUs that they need to scale to - I could have 2 locks
> per cpu that need to be held, and then it's 2*NR_CPUs that the
> pre-empt count must scale to. Where do you draw the line?

(Continue reading)

Andi Kleen | 2 Mar 2006 18:52
Picon

Re: TAKE 950027 - xfs_icsb_lock_all_counters fails with CONFIG_PREEMPT and >=256p

On Thursday 02 March 2006 17:53, Linus Torvalds wrote:
>nything worse.
> 
> > I don't see anything that would rely on the count being positive
> > so using the sign bit is probably ok. Hardirq 11 might be a bit
> > tight though, so maybe it would be better to move 64bit machines
> > to 64bit here.

Yes agreed, it just would be a bigger patch probably affecting many
architectures.

> And yes, I think it may make sense to just use the full 64 bits on a 
> 64-bit machine. Eventually. Somebody should check what the larger 
> constants result in, though.
> 
> And not for now, but obviously the 11 bits will run out for even bigger 
> machines. But for a "let's fix it quickly", adding three bits should be 
> plenty good enough, no?

That would be spinlock nesting of 2 per CPU on a 1024 CPU machine. 
Not exactly plenty. Better 12-14 at least.

-Andi

Linus Torvalds | 2 Mar 2006 17:53

Re: TAKE 950027 - xfs_icsb_lock_all_counters fails with CONFIG_PREEMPT and >=256p


On Thu, 2 Mar 2006, Andi Kleen wrote:
> 
> I suspect 256 softirq nestings are not needed, so how about setting 
> PREEMPT_BITS to 16 and SOFTIRQ_BITS to 4 and hardirq bits
> to 11 and PREEMPT_ACTIVE to 31?

I'd rather be a bit more gentle: instead of taking away _half_ the SOFTIRQ 
bits, take away just one or two. The HARDIRQ bits are also pretty 
dangerous, I'd say. So I'm not happy with your choice of numbers.

Right now we actally have three bits unused, I think, so it would be much 
better to just make PREEMPT_ACTIVE be 31, and just make PREEMPT_BITS be 
11. Problem solved, without making anything worse.

> I don't see anything that would rely on the count being positive
> so using the sign bit is probably ok. Hardirq 11 might be a bit
> tight though, so maybe it would be better to move 64bit machines
> to 64bit here.

And yes, I think it may make sense to just use the full 64 bits on a 
64-bit machine. Eventually. Somebody should check what the larger 
constants result in, though.

And not for now, but obviously the 11 bits will run out for even bigger 
machines. But for a "let's fix it quickly", adding three bits should be 
plenty good enough, no?

Need to check the things that check PREEMPT_ACTIVE, but making i tbit #31 
migt actually _help_ (you can test it more easily).
(Continue reading)

Linus Torvalds | 2 Mar 2006 19:06

Re: TAKE 950027 - xfs_icsb_lock_all_counters fails with CONFIG_PREEMPT and >=256p


On Thu, 2 Mar 2006, Andi Kleen wrote:
>
> That would be spinlock nesting of 2 per CPU on a 1024 CPU machine. 
> Not exactly plenty. Better 12-14 at least.

That's PLENTY.

First off, name somebody who sells bigger systems than that.

Second, taking two spinlocks per CPU is crazy in the first place. Who does 
that? Talk about scaling badly.

Third, code that cares can avoid it entirely by just bumping the preempt 
counter _once_, and then using the raw spinlock code. That's how you 
should do it for big-rw-locks anyway, just because it's less insane, if 
that is what XFS is doing with its "two spinlocks per CPU" thing.

Fourth, you ignore the very real possibility that decreasing the size of 
the other fields can cause problems, and you're completely fixated on the 
crazy XFS usage.

		Linus


Gmane