Nick Piggin | 1 Aug 03:22 2008
Picon

Re: [patch v3] splice: fix race with page invalidation

On Friday 01 August 2008 04:13, Miklos Szeredi wrote:
> On Thu, 31 Jul 2008, Linus Torvalds wrote:
> > On Thu, 31 Jul 2008, Nick Piggin wrote:
> > > It seems like the right way to fix this would be to allow the splicing
> > > process to be notified of a short read, in which case it could try to
> > > refill the pipe with the unread bytes...
> >
> > Hmm. That should certainly work with the splice model. The users of the
> > data wouldn't eat (or ignore) the invalid data, they'd just say "invalid
> > data", and stop. And it would be up to the other side to handle it (it
> > can see the state of the pipe, we can make it both wake up POLL_ERR _and_
> > return an error if somebody tries to write to a "blocked" pipe).
> >
> > So yes, that's very possible, but it obviously requires splice() users to
> > be able to handle more cases. I'm not sure it's realistic to expect users
> > to be that advanced.
>
> Worse, it's not guaranteed to make any progress.  E.g. on NFS server
> with data being continually updated, cache on the client will
> basically always be invalid, so the above scheme might just repeat the
> splices forever without success.

Well, a) it probably makes sense in that case to provide another mode
of operation which fills the data synchronously from the sender and
copys it to the pipe (although the sender might just use read/write)
And b) we could *also* look at clearing PG_uptodate as an optimisation
iff that is found to help.

But I think it is kind of needed. The data comes from the sender, and
so only the sender may really know what to do in case of failure. I
(Continue reading)

Yasunori Goto | 1 Aug 11:42 2008

Re: [RFC:Patch: 000/008](memory hotplug) rough idea of pgdat removing

> Yasunori Goto wrote:
> 
> > Current my idea is using RCU feature for waiting them.
> > Because it is the least impact against reader's performance,
> > and pgdat remover can wait finish of reader's access to pgdat
> > which is removing by synchronize_sched().
> 
> The use of RCU disables preemption which has implications as to
> what can be done in a loop over nodes or zones.

Yeap. It's the one of (big) cons.

> This would also potentially add more overhead to the page allocator hotpaths.

Agree.

To tell the truth, I tried hackbench with 3rd patch which add rcu_read_lock
in hot-path before this post to make rough estimate its impact.

%hackbench 100 process 2000

without patch.
  39.93

with patch
  39.99
(Both is 10 times avarage)

I guess this result has effect of disable preemption.
So, throughput looks not so bad, but probably, latency would be worse
(Continue reading)

Yasunori Goto | 1 Aug 13:16 2008

Re: memory hotplug: hot-add to ZONE_MOVABLE vs. min_free_kbytes


> Sorry for mixing things up in this thread, the min_free_kbytes issue is
> not related to memory hot-remove, but rather to hot-add and the things that
> happen in setup_per_zone_pages_min(), which is called from online_pages().
> It may well be that my assumptions are wrong, but I'd like to explain my
> concerns again:
> 
> If we have a system with 1 GB of memory, min_free_kbytes will be calculated
> to 4 MB for ZONE_NORMAL, for example. Now, if we add 3 GB of hotplug memory
> to ZONE_MOVABLE, the total min_free_kbytes will still remain 4 MB but it
> will be distributed differently: ZONE_NORMAL will now have only 1 MB of
> MIGRATE_RESERVE memory left, while ZONE_MOVABLE will have 3 MB, e.g.
> 

Right.

> My assumption is now, that the reserved 3 MB in ZONE_MOVABLE won't be
> usable by the kernel anymore, e.g. for PF_MEMALLOC, because it is in
> ZONE_MOVABLE now.

I don't make sense here. I suppose there is no relationship between
ZONE_MOVABLE, PF_MEMALLOC and MIGRATE_RESERVE pages.
Could you tell me more?

> This is what I mean with "effectively reducing the
> available min_free_kbytes". The system would now behave in the same way
> as a system which only had 1 MB of min_free_kbytes, although
> /proc/sys/vm/min_free_kbytes would still say 4 MB. After all, this tunable
> can have a rather negative impact on a system, especially if it is too
> low, hence my concerns.
(Continue reading)

Hugh Dickins | 1 Aug 14:11 2008

Re: GRU driver feedback

On Thu, 31 Jul 2008, Jack Steiner wrote:
> 
> I'm collecting the fixes & additional comments to be added & will send
> them upstream later.

One small thing to remove if you've not already noticed:
EXPORT_SYMBOL_GPL(follow_page) got added to mm/memory.c,
despite our realization that GRU cannot and now does not use it.

Hugh
Christoph Lameter | 1 Aug 15:51 2008

Re: [RFC:Patch: 000/008](memory hotplug) rough idea of pgdat removing

Yasunori Goto wrote:

> I thought it at first, but are there the following worst case?
> 
> 
>    CPU 0                                    CPU 1
> -------------------------------------------------------
> __alloc_pages()
>     
>     parsing_zonelist()
>         :
>     enter page_reclarim()
>     sleep (and remember zone)                 :
>                                               :
>                                         update zonelist and node_online_map
>                                           with stop_machine_run()
>                                         free pgdat().
>                                         remove the Node electrically.
> 
>     wake up and touch remembered
>        zone,  but it is removed
>     (Oops!!!)
> 
> 
> 
> Anyway, I'm happy if there is better way than my poor idea. :-)
> 
> Thanks for your comment.

Duh. Then the use of RCU would also mean that all of reclaim must be in a rcu period. So  reclaim cannot sleep anymore.
(Continue reading)

Gerald Schaefer | 1 Aug 18:04 2008
Picon

Re: memory hotplug: hot-add to ZONE_MOVABLE vs. min_free_kbytes

On Fri, 2008-08-01 at 20:16 +0900, Yasunori Goto wrote:
> > My assumption is now, that the reserved 3 MB in ZONE_MOVABLE won't be
> > usable by the kernel anymore, e.g. for PF_MEMALLOC, because it is in
> > ZONE_MOVABLE now.
> 
> I don't make sense here. I suppose there is no relationship between
> ZONE_MOVABLE, PF_MEMALLOC and MIGRATE_RESERVE pages.
> Could you tell me more?

Ok, I thought that PF_MEMALLOC allocations work on the MIGRATE_RESERVE
pageblocks, and that only kernel allocations can use PF_MEMALLOC. I also
thought that kernel allocations cannot use ZONE_MOVABLE, e.g. for page
cache memory, because such pages would not be migratable. So I assumed
that MIGRATE_RESERVE pageblocks in ZONE_MOVABLE would not be available
for PF_MEMALLOC allocations.

With this assumption, which can be totally wrong, the redistribution
of MIGRATE_RESERVE pageblocks in setup_per_zone_pages_min() looks like
it will take away reserved pageblocks that should be available to the
kernel in emergency situations.

Maybe I should have explained this assumption earlier, because my whole
min_free_kbytes issue depends on it. If I'm wrong, I apologize for
confusing you all with this "issue", and I will go back to the original
problem with removing the lowest memory chunk in ZONE_MOVABLE...

Thanks,
Gerald

(Continue reading)

Mel Gorman | 1 Aug 18:26 2008
Picon

Re: memory hotplug: hot-add to ZONE_MOVABLE vs. min_free_kbytes

On (31/07/08 19:45), Gerald Schaefer didst pronounce:
> On Thu, 2008-07-31 at 14:22 +0100, Mel Gorman wrote:
> > > The more memory we add to ZONE_MOVABLE, the less reserved pages will
> > > remain to the other zones. In setup_per_zone_pages_min(), min_free_kbytes
> > > will be redistributed to a zone where the kernel cannot make any use of
> > > it, effectively reducing the available min_free_kbytes. 
> > 
> > I'm not sure what you mean by "available min_free_kbytes". The overall value
> > for min_free_kbytes should be approximately the same whether the zone exists
> > or not. However, you're right in that the distribution of minimum free pages
> > changes with ZONE_MOVABLE because the zones are different sizes now. This
> > affects reclaim, not memory hot-remove.
> 
> Sorry for mixing things up in this thread, the min_free_kbytes issue is
> not related to memory hot-remove, but rather to hot-add and the things that
> happen in setup_per_zone_pages_min(), which is called from online_pages().
> It may well be that my assumptions are wrong, but I'd like to explain my
> concerns again:
> 
> If we have a system with 1 GB of memory, min_free_kbytes will be calculated
> to 4 MB for ZONE_NORMAL, for example. Now, if we add 3 GB of hotplug memory
> to ZONE_MOVABLE, the total min_free_kbytes will still remain 4 MB but it
> will be distributed differently: ZONE_NORMAL will now have only 1 MB of
> MIGRATE_RESERVE memory left, while ZONE_MOVABLE will have 3 MB, e.g.
> 

Ok, I haven't double checked your figures but lets go with the assumption -
adding memory means min_free_kbytes will be distributed differently.

> My assumption is now, that the reserved 3 MB in ZONE_MOVABLE won't be
(Continue reading)

Miklos Szeredi | 1 Aug 20:28 2008
Picon

Re: [patch v3] splice: fix race with page invalidation

On Fri, 1 Aug 2008, Nick Piggin wrote:
> Well, a) it probably makes sense in that case to provide another mode
> of operation which fills the data synchronously from the sender and
> copys it to the pipe (although the sender might just use read/write)
> And b) we could *also* look at clearing PG_uptodate as an optimisation
> iff that is found to help.

IMO it's not worth it to complicate the API just for the sake of
correctness in the so-very-rare read error case.  Users of the splice
API will simply ignore this requirement, because things will work fine
on ext3 and friends, and will break only rarely on NFS and FUSE.

So I think it's much better to make the API simple: invalid pages are
OK, and for I/O errors we return -EIO on the pipe.  It's not 100%
correct, but all in all it will result in less buggy programs.

Thanks,
Miklos
----

Subject: mm: dont clear PG_uptodate on truncate/invalidate

From: Miklos Szeredi <mszeredi <at> suse.cz>

Brian Wang reported that a FUSE filesystem exported through NFS could return
I/O errors on read.  This was traced to splice_direct_to_actor() returning a
short or zero count when racing with page invalidation.

However this is not FUSE or NFSD specific, other filesystems (notably NFS)
also call invalidate_inode_pages2() to purge stale data from the cache.
(Continue reading)

Linus Torvalds | 1 Aug 20:32 2008

Re: [patch v3] splice: fix race with page invalidation


On Fri, 1 Aug 2008, Miklos Szeredi wrote:
>
> Subject: mm: dont clear PG_uptodate on truncate/invalidate 
> From: Miklos Szeredi <mszeredi <at> suse.cz>

Ok, this one I have no problems with what-so-ever. I'd like Ack's for this 
kind of change (and obviously hope that it's tested), but it looks clean 
and I think the new rules are better (not just for this case).

		Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jack Steiner | 1 Aug 22:09 2008
Picon

Re: GRU driver feedback

On Fri, Aug 01, 2008 at 01:11:09PM +0100, Hugh Dickins wrote:
> On Thu, 31 Jul 2008, Jack Steiner wrote:
> > 
> > I'm collecting the fixes & additional comments to be added & will send
> > them upstream later.
> 
> One small thing to remove if you've not already noticed:
> EXPORT_SYMBOL_GPL(follow_page) got added to mm/memory.c,
> despite our realization that GRU cannot and now does not use it.
> 

Thanks for catching this. I sent a patch upstream a few minutes ago to
remove the export.

--- jack

Gmane