David Howells | 1 Feb 12:44 2007

Re: page_mkwrite caller is racy?

Hugh Dickins <hugh@veritas.com> wrote:

> > Must it be able to sleep?
> 
> Not as David was using it

It absolutely *must* be able to sleep.  It has to wait for FS-Cache to finish
writing the page to the cache before letting the PTE be made writable.

David

Zan Lynx | 2 Feb 01:43 2007

2.6.20-rc6-mm3 possible circular lock on mmap_sem

I just noticed this in my logs, from the lock checker:

[  142.299455] 
[  142.299458] =======================================================
[  142.299473] [ INFO: possible circular locking dependency detected ]
[  142.299478] 2.6.20-rc6-mm3 #2
[  142.299482] -------------------------------------------------------
[  142.299547] evolution-data-/6909 is trying to acquire lock:
[  142.299551]  (&mm->mmap_sem){----}, at: [<ffffffff80266096>] do_page_fault+0x3f6/0x8a0
[  142.299742] 
[  142.299743] but task is already holding lock:
[  142.299749]  (&data->latch){----}, at: [<ffffffff8034f221>] get_exclusive_access+0x11/0x20
[  142.299880] 
[  142.299880] which lock already depends on the new lock.
[  142.299882] 
[  142.299947] 
[  142.299948] the existing dependency chain (in reverse order) is:
[  142.300010] 
[  142.300011] -> #1 (&data->latch){----}:
[  142.300138]        [<ffffffff8029e968>] __lock_acquire+0xd48/0xff0
[  142.300390]        [<ffffffff8029ec5b>] lock_acquire+0x4b/0x70
[  142.300583]        [<ffffffff8034f221>] get_exclusive_access+0x11/0x20
[  142.300833]        [<ffffffff80298db4>] down_write+0x34/0x40
[  142.301084]        [<ffffffff8034f221>] get_exclusive_access+0x11/0x20
[  142.301277]        [<ffffffff8034e58e>] mmap_unix_file+0x8e/0x1a0
[  142.301526]        [<ffffffff8020d80b>] do_mmap_pgoff+0x4eb/0x820
[  142.301777]        [<ffffffff8029d496>] mark_held_locks+0x76/0xa0
[  142.301969]        [<ffffffff80263bf4>] _spin_unlock_irq+0x24/0x30
[  142.302219]        [<ffffffff80263bf4>] _spin_unlock_irq+0x24/0x30
[  142.302468]        [<ffffffff80233e73>] elf_map+0xa3/0x100
[...]

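The report above comes down to two acquisition orders: the mmap path takes
&data->latch while mmap_sem is held (chain entry #1), and the faulting task
now wants mmap_sem while it holds &data->latch. The following is a minimal
user-space sketch of that kind of lock-order check. It is not lockdep's
implementation; lock_graph, record_acquire() and the enum names are
hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_LOCKS 8

/* Hypothetical lock ids standing in for the two locks named in the report. */
enum { LOCK_MMAP_SEM, LOCK_LATCH };

/* dep[a][b] is true when lock b was once acquired while lock a was held. */
struct lock_graph {
	bool dep[MAX_LOCKS][MAX_LOCKS];
};

void graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Depth-first search: is 'to' reachable from 'from' along recorded edges? */
bool reaches(const struct lock_graph *g, int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < MAX_LOCKS; next++)
		if (g->dep[from][next] && !seen[next] &&
		    reaches(g, next, to, seen))
			return true;
	return false;
}

/*
 * Record "acquire new_lock while holding held".
 * Returns false, and records nothing, if the new edge would close a cycle.
 */
bool record_acquire(struct lock_graph *g, int held, int new_lock)
{
	bool seen[MAX_LOCKS] = { false };

	assert(held >= 0 && held < MAX_LOCKS);
	assert(new_lock >= 0 && new_lock < MAX_LOCKS);

	/* If held is already reachable from new_lock, the order is inverted. */
	if (reaches(g, new_lock, held, seen))
		return false;
	g->dep[held][new_lock] = true;
	return true;
}
```

Fed the two orders from the report, record_acquire(&g, LOCK_MMAP_SEM,
LOCK_LATCH) is accepted and a later record_acquire(&g, LOCK_LATCH,
LOCK_MMAP_SEM) is refused, which is the inversion the checker warns about.
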
Ethan Solomita | 2 Feb 02:38 2007

Re: [RFC 0/8] Cpuset aware writeback

    Hi Christoph -- has anything come of resolving the NFS / OOM 
concerns that Andrew Morton expressed concerning the patch? I'd be happy 
to see some progress on getting this patch (i.e. the one you posted on 
1/23) through.

    Thanks,
    -- Ethan

Christoph Lameter | 2 Feb 03:16 2007

Re: [RFC 0/8] Cpuset aware writeback

On Thu, 1 Feb 2007, Ethan Solomita wrote:

>    Hi Christoph -- has anything come of resolving the NFS / OOM concerns that
> Andrew Morton expressed concerning the patch? I'd be happy to see some
> progress on getting this patch (i.e. the one you posted on 1/23) through.

Peter Zijlstra addressed the NFS issue. I will submit the patch again as
soon as the writeback code stabilizes a bit.

Andrew Morton | 2 Feb 05:03 2007

Re: [RFC 0/8] Cpuset aware writeback

On Thu, 1 Feb 2007 18:16:05 -0800 (PST) Christoph Lameter <clameter@sgi.com> wrote:

> On Thu, 1 Feb 2007, Ethan Solomita wrote:
> 
> >    Hi Christoph -- has anything come of resolving the NFS / OOM concerns that
> > Andrew Morton expressed concerning the patch? I'd be happy to see some
> > progress on getting this patch (i.e. the one you posted on 1/23) through.
> 
> Peter Zijlstra addressed the NFS issue.

Did he?  Are you yet in a position to confirm that?

Christoph Lameter | 2 Feb 06:22 2007

Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Mon, 29 Jan 2007, Andrew Morton wrote:

> On Mon, 29 Jan 2007 15:37:29 -0800 (PST)
> Christoph Lameter <clameter@sgi.com> wrote:
> 
> > With an alloc_pages_range() one would be able to specify upper and lower
> > boundaries.
> 
> Is there a proposal anywhere regarding how this would be implemented?

Yes it was discussed a while back in August. Look for alloc_pages_range. 
Sadly I have not been able to do work on it since there are too many 
other issues.

Christoph Lameter | 2 Feb 06:27 2007

Re: [PATCH 0/8] Create ZONE_MOVABLE to partition memory between movable and non-movable pages

On Tue, 30 Jan 2007, Peter Zijlstra wrote:

> I'm guessing this will involve page migration.

Not necessarily. The approach also works without page migration. Depends 
on an intelligent allocation scheme that stays off the areas of interest 
to those restricted to low area allocations as much as possible and then 
is able to reclaim from a section of a zone if necessary. The 
implementation of alloc_pages_range() that I did way back did not rely on
page migration.

Christoph Lameter | 2 Feb 06:29 2007

Re: [RFC 0/8] Cpuset aware writeback

On Thu, 1 Feb 2007, Andrew Morton wrote:

> > Peter Zijlstra addressed the NFS issue.
> 
> Did he?  Are you yet in a position to confirm that?

He provided a solution to fix the congestion issue in NFS. I thought 
that is what you were looking for? That should make NFS behave more
like a block device right?

As I said before I think NFS is inherently unfixable given the layering of 
a block device on top of the network stack (which consists of an unknown 
number of additional intermediate layers). Cpuset writeback needs to work 
in the same way as in a machine without cpusets. If it fails then at least
let the cpuset behave as if we had a machine all on our own and fail in 
both cases in the same way. Right now we create dangerous low memory 
conditions due to high dirty ratios in a cpuset created by not having a 
throttling method. The NFS problems also exist for non cpuset scenarios 
and we have by and large been able to live with it so I think they are 
lower priority. It seems that the basic problem is created by the dirty 
ratios in a cpuset.

BTW the block layer also may be layered with raid and stuff and then we 
have similar issues. There is no general way so far of handling these 
situations except by twiddling around with min_free_kbytes praying 5 Hail 
Mary's and trying again. Maybe we are able to allocate all needed memory from
PF_MEMALLOC processes during reclaim and hopefully there is now enough 
memory for these allocations and those that happen to occur during an 
interrupt while we reclaim.
[...]

Nick Piggin | 2 Feb 06:51 2007

[rfc][patch] mm: half-fix page tail zeroing on write problem

Hi,

For no important reason, I've again looked at those zeroing patches that
Neil did a while back. I've always thought that a simple
`write(fd, NULL, size)` would cause the same sorts of problems.

Turns out it does. If you first write all 1s into a page, then do the
`write(fd, NULL, size)` at the same position, you end up with all 0s in
the page (test-case available on request).  Incredible; surely this
violates the spec?

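The original test case is not included in this message. What follows is a
minimal user-space sketch of the scenario as described, not the author's
test: null_write_damage() and the temporary file path are assumptions. It
fills a file with 0xff bytes, issues write(fd, NULL, size) over the same
range, and counts the bytes that no longer read back as 0xff.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TEST_SIZE 4096

/*
 * Returns the number of bytes changed by the NULL write, or -1 if the
 * set-up failed. *write_ret and *write_errno report what the NULL write
 * itself returned.
 */
long null_write_damage(ssize_t *write_ret, int *write_errno)
{
	char path[] = "/tmp/null-write-XXXXXX";	/* hypothetical location */
	unsigned char buf[TEST_SIZE];
	/* volatile keeps the compiler from diagnosing the NULL argument */
	const void *volatile null_buf = NULL;
	long damaged = 0;
	int fd;

	assert(write_ret != NULL && write_errno != NULL);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* the file disappears once fd is closed */

	/* Step 1: write all 1s into the page. */
	memset(buf, 0xff, sizeof(buf));
	if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
		goto fail;

	/* Step 2: write from a NULL buffer at the same position. */
	if (lseek(fd, 0, SEEK_SET) < 0)
		goto fail;
	errno = 0;
	*write_ret = write(fd, null_buf, TEST_SIZE);
	*write_errno = errno;

	/* Step 3: read the range back and count the bytes that changed. */
	memset(buf, 0, sizeof(buf));
	if (lseek(fd, 0, SEEK_SET) < 0)
		goto fail;
	if (read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
		goto fail;
	for (size_t i = 0; i < sizeof(buf); i++)
		if (buf[i] != 0xff)
			damaged++;

	close(fd);
	return damaged;
fail:
	close(fd);
	return -1;
}
```

On a kernel showing the behaviour described here the count would be the whole
range. A result of 0, with the write failing and errno set to EFAULT, would
indicate the data was left intact.
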
The buffered-write fixes I've got actually fix this properly, but they
don't look like getting merged any time soon. We could do this simple
patch which just reduces the chance of corruption from a certainty down
to a small race.

Any thoughts?

--

-- 
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h	2007-02-02 13:41:21.000000000 +1100
+++ linux-2.6/include/linux/pagemap.h	2007-02-02 13:42:09.000000000 +1100
@@ -198,6 +198,9 @@ static inline int fault_in_pages_writeab
 {
 	int ret;

+	if (unlikely(size == 0))
+		return 0;
+
[...]

Neil Brown | 2 Feb 07:02 2007

Re: [RFC 0/8] Cpuset aware writeback

On Thursday February 1, clameter@sgi.com wrote:
>                    The NFS problems also exist for non cpuset scenarios 
> and we have by and large been able to live with it so I think they are 
> lower priority. It seems that the basic problem is created by the dirty 
> ratios in a cpuset.

Some of our customers haven't been able to live with it.  I'm really
glad this will soon be fixed in mainline as it means our somewhat less
elegant fix in SLES can go away :-)

> 
> BTW the block layer also may be layered with raid and stuff and then we 
> have similar issues. There is no general way so far of handling these 
> situations except by twiddling around with min_free_kbytes praying 5 Hail 
> Mary's and trying again.

md/raid doesn't cause any problems here.  It preallocates enough to be
sure that it can always make forward progress.  In general the entire
block layer from generic_make_request down can always successfully
write a block out in a reasonable amount of time without requiring
kmalloc to succeed (with obvious exceptions like loop and nbd which go
back up to a higher layer).

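The preallocation idea in the paragraph above can be sketched in user space
as a small reserve pool, loosely in the spirit of the kernel's mempool. This
is an illustration under assumptions, not md/raid code: reserve_pool, the
pool_*() helpers and the allocator_ok flag (which simulates the normal
allocator failing) are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define RESERVE_NR 4	/* elements set aside up front */

struct reserve_pool {
	void *slot[RESERVE_NR];	/* reserved elements not handed out */
	int nr_free;		/* how many slots are currently filled */
	size_t elem_size;
};

/* Fill the reserve at set-up time, when failure is still harmless. */
bool pool_init(struct reserve_pool *p, size_t elem_size)
{
	p->nr_free = 0;
	p->elem_size = elem_size;
	for (int i = 0; i < RESERVE_NR; i++) {
		void *e = malloc(elem_size);

		if (!e) {
			while (p->nr_free > 0)
				free(p->slot[--p->nr_free]);
			return false;
		}
		p->slot[p->nr_free++] = e;
	}
	return true;
}

/*
 * Try the normal allocator first and fall back to the reserve.
 * allocator_ok == false simulates memory pressure.
 */
void *pool_alloc(struct reserve_pool *p, bool allocator_ok)
{
	if (allocator_ok) {
		void *e = malloc(p->elem_size);

		if (e)
			return e;
	}
	if (p->nr_free > 0)
		return p->slot[--p->nr_free];
	return NULL;	/* caller waits for an in-flight request to finish */
}

/* Refill the reserve before giving memory back to the allocator. */
void pool_free(struct reserve_pool *p, void *e)
{
	assert(e != NULL);
	if (p->nr_free < RESERVE_NR)
		p->slot[p->nr_free++] = e;
	else
		free(e);
}

void pool_destroy(struct reserve_pool *p)
{
	while (p->nr_free > 0)
		free(p->slot[--p->nr_free]);
}
```

As long as every element taken from the reserve is returned when its request
completes, a writer that can proceed with at most RESERVE_NR elements in
flight never depends on the normal allocator succeeding.
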
The network stack is of course a different (much harder) problem.

NeilBrown
