Stephen C. Tweedie | 1 Apr 2004 01:20
Picon
Favicon

Re: msync() behaviour broken for MS_ASYNC, revert patch?

Hi,

On Wed, 2004-03-31 at 23:53, Andrew Morton wrote:
> "Stephen C. Tweedie" <sct <at> redhat.com> wrote:

> > Unfortunately, this seems to contradict SingleUnix requirements, which
> > state:
> >         When MS_ASYNC is specified, msync() shall return immediately
> >         once all the write operations are initiated or queued for
> >         servicing
> > although I can't find an unambiguous definition of "queued for service"
> > in the online standard.  I'm reading it as requiring that the I/O has
> > reached the block device layer

> I don't think I agree with that.  If "queued for service" means we've
> started the I/O, then what does "initiated" mean, and why did they specify
> "initiated" separately?

I'd interpret "initiated" as having reached hardware.  "Queued for
service" is much more open to interpretation: Uli came up with "the data
must be actively put in a stage where I/O is initiated", which still
doesn't really address what sort of queueing is allowed.

> What triggered all this was a dinky little test app which Linus wrote to
> time some aspect of P4 tlb writeback latency.  It sits in a loop dirtying a
> page then msyncing it with MS_ASYNC.  It ran very poorly, because MS_ASYNC
> ended up waiting on the previously-submitted I/O before starting new I/O.

Sure.  There are lots of ways an interface can be misused, though: you
only know if one use is valid or not once you've determined what the
(Continue reading)

Stephen C. Tweedie | 1 Apr 2004 01:41
Picon
Favicon

Re: msync() behaviour broken for MS_ASYNC, revert patch?

Hi,

On Wed, 2004-03-31 at 23:37, Linus Torvalds wrote:
> On Wed, 31 Mar 2004, Stephen C. Tweedie wrote:
> >         
> > although I can't find an unambiguous definition of "queued for service"
> > in the online standard.  I'm reading it as requiring that the I/O has
> > reached the block device layer, not simply that it has been marked dirty
> > for some future writeback pass to catch; Uli agrees with that
> > interpretation.
> 
> That interpretation makes pretty much zero sense.
> 
> If you care about the data hitting the disk, you have to use fsync() or 
> similar _anyway_, and pretending anything else is just bogus.

You can make the same argument for either implementation of MS_ASYNC. 
And there's at least one way in which the "submit IO now" version can be
used meaningfully --- if you've got several specific areas of data in
one or more mappings that need flushed to disk, you'd be able to
initiate IO with multiple MS_ASYNC calls and then wait for completion
with either MS_SYNC or fsync().  That gives you an interface that
corresponds somewhat with the region-based filemap_sync();
filemap_fdatawrite(); filemap_datawait() that the kernel itself uses.  

> Having the requirement that it is on some sw-only request queue is
> nonsensical, since such a queue is totally invisible from a user
> perspective.

It's very much visible, just from a performance perspective, if you want
(Continue reading)

Linus Torvalds | 1 Apr 2004 02:08

Re: msync() behaviour broken for MS_ASYNC, revert patch?


On Wed, 1 Apr 2004, Stephen C. Tweedie wrote:
> 
> On Wed, 2004-03-31 at 23:37, Linus Torvalds wrote:
>
> > If you care about the data hitting the disk, you have to use fsync() or 
> > similar _anyway_, and pretending anything else is just bogus.
> 
> You can make the same argument for either implementation of MS_ASYNC. 

Exactly.

Which is why I say that the implementation cannot matter, because user 
space would be _buggy_ if it depended on some timing issue.

> And there's at least one way in which the "submit IO now" version can be
> used meaningfully --- if you've got several specific areas of data in
> one or more mappings that need flushed to disk, you'd be able to
> initiate IO with multiple MS_ASYNC calls and then wait for completion
> with either MS_SYNC or fsync().

Why wouldn't you be able to do that with the current one?

Tha advantage of the current MS_ASYNC is absolutely astoundingly HUGE: 
because we don't wait for in-progress IO, it can be used to efficiently 
synchronize multiple different areas, and then after that waiting for them 
with _one_ single fsync().

In contrast, the "wait for queued IO" approach can't sanely do that, 
exactly because it will wait in the middle, depending on other activity at 
(Continue reading)

Andrew Morton | 1 Apr 2004 02:30

Re: msync() behaviour broken for MS_ASYNC, revert patch?

Linus Torvalds <torvalds <at> osdl.org> wrote:
>
> I'd be perfectly happy with a set of file cache control operations, 
> including
> 
>  - start writeback in [a,b]
>  - wait for [a,b] stable
>  - and maybe "punch hole in [a,b]"

Yup, there are a number of linux-specific fadvise() extensions we
can/should be adding, including "start writeback on this byte range for
flush" and "start writeback on this byte range for data integrity" and
"wait on writeback of this byte range".

Some of these are needed internally for the fs-AIO implementation, and also
for an O_SYNC which only writes the pages which the writer wrote.  It's
pretty simple, and it'll be happening.

One wrinkle is that we'd need to add the start/end loff_t pair to the
a_ops->writepages() prototype.  But instead I intend to put the start/end
info into struct writeback_control and pass it that way.  It seems sleazy
at first but when you think about it, it isn't.  It provides forward and
backward compatability, it recognises that it's just a hint and that
filesystems can legitimately sync the whole file and it produces
smaller+faster code.

We might need a wait_on_page_writeback_range() a_op though.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
(Continue reading)

Andrea Arcangeli | 1 Apr 2004 02:45
Picon

Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Wed, Mar 31, 2004 at 07:28:51PM +0200, Andrea Arcangeli wrote:
> if they run into trouble I'll return to the pagecache API adding the
> GFP_KERNEL and check for oom failure.

there were troubles with the header indeed. So I went back to the
pagecache version (now fixed with GFP_KERNEL and oom retval checking).

the oops I've got with the header trouble was weird (but at least the
previous radix_tree_delete crash is gone), so it's not completely clear
if this will be enough to make it work as well as it was working before
the -mm writeback changes. I tried to reproduce but apparently acpi is
doing nothing here for a echo 4 > sleep :/.

diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids x-ref/mm/page_io.c x/mm/page_io.c
--- x-ref/mm/page_io.c	2004-04-01 02:09:53.846664248 +0200
+++ x/mm/page_io.c	2004-04-01 02:11:41.526294456 +0200
 <at>  <at>  -139,7 +139,7  <at>  <at>  struct address_space_operations swap_aop

 /*
  * A scruffy utility function to read or write an arbitrary swap page
- * and wait on the I/O.
+ * and wait on the I/O.  The caller must have a ref on the page.
  */
 int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page)
 {
 <at>  <at>  -151,8 +151,11  <at>  <at>  int rw_swap_page_sync(int rw, swp_entry_
 	lock_page(page);

 	BUG_ON(page->mapping);
-	page->mapping = &swapper_space;
(Continue reading)

Andrew Morton | 1 Apr 2004 03:22

Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Andrea Arcangeli <andrea <at> suse.de> wrote:
>
>  <at>  <at>  -151,8 +151,11  <at>  <at>  int rw_swap_page_sync(int rw, swp_entry_
>  	lock_page(page);
>  
>  	BUG_ON(page->mapping);
> -	page->mapping = &swapper_space;
> -	page->index = entry.val;
> +	ret = add_to_page_cache(page, &swapper_space, entry.val, GFP_KERNEL);

Doing a __GFP_FS allocation while holding lock_page() is worrisome.  It's
OK if that page is private, but how do we know that the caller didn't pass
us some page which is on the LRU?

The only place where I think we can deadlock is if that GFP_KERNEL
allocation tries to write out the page we hold a lock on, and we hit an
error running swap_writepage() and then enter handle_write_error().

This actually cannot happen because swap_writepage() can only fail if
bio_alloc() fails, and that uses a mempool.  But ick.

Your patch seems reasonable to run with for now, but to be totally anal
about it, I'll run with the below monstrosity.

diff -puN mm/page_io.c~rw_swap_page_sync-fix mm/page_io.c
--- 25/mm/page_io.c~rw_swap_page_sync-fix	Wed Mar 31 16:55:44 2004
+++ 25-akpm/mm/page_io.c	Wed Mar 31 17:15:31 2004
 <at>  <at>  -19,6 +19,7  <at>  <at> 
 #include <linux/buffer_head.h>	/* for block_sync_page() */
 #include <linux/mpage.h>
(Continue reading)

Andrea Arcangeli | 1 Apr 2004 03:26
Picon

Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Wed, Mar 31, 2004 at 05:22:16PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea <at> suse.de> wrote:
> >
> >  <at>  <at>  -151,8 +151,11  <at>  <at>  int rw_swap_page_sync(int rw, swp_entry_
> >  	lock_page(page);
> >  
> >  	BUG_ON(page->mapping);
> > -	page->mapping = &swapper_space;
> > -	page->index = entry.val;
> > +	ret = add_to_page_cache(page, &swapper_space, entry.val, GFP_KERNEL);
> 
> Doing a __GFP_FS allocation while holding lock_page() is worrisome.  It's
> OK if that page is private, but how do we know that the caller didn't pass
> us some page which is on the LRU?

it _has_ to be private if it's using rw_swap_page_sync. How can a page
be in a lru if we're going to execute add_to_page_cache on it? That
would be pretty broken in the first place.

> Your patch seems reasonable to run with for now, but to be totally anal
> about it, I'll run with the below monstrosity.

It's not needed IMO. We also already bugcheck on page->mapping, if
you're scared about the page being in a lru, you can add further
bugchecks on PageLru etc.. calling add_to_page_cache on anything that is
already visible to the VM in some lru is broken by design and should be
forbidden. All the users of swap suspend must work with freshly
allocated pages, the page_mapped bugcheck already covers most of the
cases.
--
(Continue reading)

Andrew Morton | 1 Apr 2004 03:51

Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Andrea Arcangeli <andrea <at> suse.de> wrote:
>
>  > Doing a __GFP_FS allocation while holding lock_page() is worrisome.  It's
>  > OK if that page is private, but how do we know that the caller didn't pass
>  > us some page which is on the LRU?
> 
>  it _has_ to be private if it's using rw_swap_page_sync. How can a page
>  be in a lru if we're going to execute add_to_page_cache on it? That
>  would be pretty broken in the first place.

An anonymous user page meets these requirements.  A did say "anal", but
rw_swap_page_sync() is a general-purpose library function and we shouldn't
be making assumptions about the type of page which the caller happens to be
feeding us.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart <at> kvack.org"> aart <at> kvack.org </a>

Andrea Arcangeli | 1 Apr 2004 04:01
Picon

Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Wed, Mar 31, 2004 at 05:51:13PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <andrea <at> suse.de> wrote:
> >
> >  > Doing a __GFP_FS allocation while holding lock_page() is worrisome.  It's
> >  > OK if that page is private, but how do we know that the caller didn't pass
> >  > us some page which is on the LRU?
> > 
> >  it _has_ to be private if it's using rw_swap_page_sync. How can a page
> >  be in a lru if we're going to execute add_to_page_cache on it? That
> >  would be pretty broken in the first place.
> 
> An anonymous user page meets these requirements.  A did say "anal", but
> rw_swap_page_sync() is a general-purpose library function and we shouldn't
> be making assumptions about the type of page which the caller happens to be
> feeding us.

that is a specialized backdoor to do I/O on _private_ pages, it's not a
general-purpose library function for doing anonymous pages
swapin/swapout, infact the only user is swap susped and we'd better
forbid swap suspend to pass anonymous pages through that interface and
be sure that nobody will ever attempt anything like that.

that interface is useful only to reach the swap device, for doing I/O on
private pages outside the VM, in the old days that was used to
read/write the swap header (again on a private page), swap suspend is
using it for similar reasons on _private_ pages.

the idea of allowing people to do I/O on anonymous pages using that
interface sounds broken to me. Your code sounds overkill complicated
for allowing something that we definitely must forbid.
(Continue reading)

Hugh Dickins | 1 Apr 2004 07:05

Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Thu, 1 Apr 2004, Andrea Arcangeli wrote:
> On Wed, Mar 31, 2004 at 05:51:13PM -0800, Andrew Morton wrote:
> > rw_swap_page_sync() is a general-purpose library function and we shouldn't
> > be making assumptions about the type of page which the caller happens to be
> > feeding us.
> 
> that is a specialized backdoor to do I/O on _private_ pages, it's not a
> general-purpose library function for doing anonymous pages

I'm not against anal checks (except personally :), but I'm very much
with Andrea on this: rw_swap_page_sync is horrid, but does manage to
do a particular job.  The header page is great fun: sys_swapon and
mkswap read and write it by a totally different route, I shudder
(especially when it's a swapfile with blocksize less than pagesize).
It would be nice to make it more general and correct, but that's
not something you should get stuck on right now.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart <at> kvack.org"> aart <at> kvack.org </a>


Gmane