Ethan Solomita | 1 Jul 2007 04:57
Picon
Favicon

Re: [RFC 1/7] cpuset write dirty map

Christoph Lameter wrote:
> On Wed, 27 Jun 2007, Ethan Solomita wrote:
> 
>> 	I looked over it at one point. Most of the code doesn't conflict, but I
>> believe that the code path which calculates the dirty limits will need
>> some merging. Doable but non-trivial.
>> 	-- Ethan
> 
> I hope you will keep on updating the patchset and posting it against 
> current mm?
> 

	I have no new changes, but I can update it against the current mm. Or
did the per-bdi throttling change get taken by Andrew?
	-- Ethan
Martin Schwidefsky | 1 Jul 2007 09:15
Picon
Favicon

Re: [patch 5/5] Optimize page_mkclean_one

On Sat, 2007-06-30 at 15:04 +0100, Hugh Dickins wrote:
> > Oh yes, the dirty handling is tricky. I had to fix a really nasty bug
> > with it lately. As for page_mkclean_one the difference is that it
> > doesn't claim a page is dirty if only the write protect bit has not been
> > set. If we manage to lose dirty bits from ptes and have to rely on the
> > write protect bit to take over the job, then we have a different problem
> > altogether, no ?
> 
> [Moving that over from 1/5 discussion].
> 
> Expect you're right, but I _really_ don't want to comment, when I don't
> understand that "|| pte_write" in the first place, and don't know the
> consequence of pte_dirty && !pte_write or !pte_dirty && pte_write there.

The pte_write() part is for the shared dirty page tracking. If you want
to make sure that a max of x% of your pages are dirty then you cannot
allow to have more than x% to be writable. Thats why page_mkclean_one
clears the dirty bit and makes the page read-only.

> My suspicion is that the "|| pte_write" is precisely to cover your
> s390 case where pte is never dirty (it may even have been me who got
> Peter to put it in for that reason).  In which case your patch would
> be fine - though I think it'd be improved a lot by a comment or
> rearrangement or new macro in place of the pte_dirty || pte_write
> line (perhaps adjust my pte_maybe_dirty in asm-generic/pgtable.h,
> and use that - its former use in msync has gone away now).

No, s390 is covered by the page_test_dirty / page_clear_dirty pair in
page_mkclean. 

(Continue reading)

Hugh Dickins | 1 Jul 2007 10:54

Re: [patch 5/5] Optimize page_mkclean_one

On Sun, 1 Jul 2007, Martin Schwidefsky wrote:
> > 
> > Expect you're right, but I _really_ don't want to comment, when I don't
> > understand that "|| pte_write" in the first place, and don't know the
> > consequence of pte_dirty && !pte_write or !pte_dirty && pte_write there.
> 
> The pte_write() part is for the shared dirty page tracking. If you want
> to make sure that a max of x% of your pages are dirty then you cannot
> allow to have more than x% to be writable. Thats why page_mkclean_one
> clears the dirty bit and makes the page read-only.

The whole of page_mkclean_one is for the dirty page tracking: so it's
obvious why it tests pte_dirty, but not obvious why it tests pte_write.

> 
> > My suspicion is that the "|| pte_write" is precisely to cover your
> > s390 case where pte is never dirty (it may even have been me who got
> > Peter to put it in for that reason).  In which case your patch would
> > be fine - though I think it'd be improved a lot by a comment or
> > rearrangement or new macro in place of the pte_dirty || pte_write
> > line (perhaps adjust my pte_maybe_dirty in asm-generic/pgtable.h,
> > and use that - its former use in msync has gone away now).
> 
> No, s390 is covered by the page_test_dirty / page_clear_dirty pair in
> page_mkclean. 

That's where its dirty page count comes from, yes: but since the s390
pte_dirty just says no, if page_mkclean_one tested only pte_dirty,
then it wouldn't do anything on s390, and in particular wouldn't
write protect the ptes to re-enforce dirty counting from then on.
(Continue reading)

Miklos Szeredi | 1 Jul 2007 12:29
Picon

Re: [patch 5/5] Optimize page_mkclean_one

> page_mkclean_one is used to clear the dirty bit and to set the write
> protect bit of a pte. In additions it returns true if the pte either
> has been dirty or if it has been writable. As far as I can see the
> function should return true only if the pte has been dirty, or page
> writeback will needlessly write a clean page.

There are some weird cases, like for example get_user_pages(), when
the pte takes a write fault and the page is modified, but the pte
doesn't become dirty, because the page is written through the kernel
mapping.

In the get_user_pages() case the page itself is dirtied, so your patch
probably doesn't break that.  But I'm not sure if there aren't similar
cases like that that the pte_write() check is taking care of.

And anyway if the dirty page tracking works correctly, your patch
won't optimize anything, since the pte will _only_ become writable if
the page was dirtied.

So in fact normally pte_dirty() and pte_write() should be equivalent,
except for some weird cases.

Miklos
Peter Zijlstra | 1 Jul 2007 15:27
Favicon

Re: [patch 5/5] Optimize page_mkclean_one

On Sun, 2007-07-01 at 09:54 +0100, Hugh Dickins wrote:

> But I could easily be overlooking something: Peter will recall.

/me tries to get his brain up to speed after the OLS closing party :-)

I did both pte_dirty and pte_write because I was extra careful. One
_should_ imply the other, but since we'll be clearing both, I thought it
prudent to also check both.

I will have to think on this a little more, but I'm currently of the
opinion that the optimisation is not correct. But I'll have a thorough
look at s390 again when I get home.

Martin Schwidefsky | 1 Jul 2007 21:50
Picon
Favicon

Re: [patch 5/5] Optimize page_mkclean_one

On Sun, 2007-07-01 at 09:54 +0100, Hugh Dickins wrote:
> On Sun, 1 Jul 2007, Martin Schwidefsky wrote:
> > > 
> > > Expect you're right, but I _really_ don't want to comment, when I don't
> > > understand that "|| pte_write" in the first place, and don't know the
> > > consequence of pte_dirty && !pte_write or !pte_dirty && pte_write there.
> > 
> > The pte_write() part is for the shared dirty page tracking. If you want
> > to make sure that a max of x% of your pages are dirty then you cannot
> > allow to have more than x% to be writable. Thats why page_mkclean_one
> > clears the dirty bit and makes the page read-only.
> 
> The whole of page_mkclean_one is for the dirty page tracking: so it's
> obvious why it tests pte_dirty, but not obvious why it tests pte_write.

Yes, the pte_write call is needed for shared dirty page tracking. In
particular its needed for s390 but for corner cases where a page is
writable but not dirty it might be needed for other architectures as
well.

> > > My suspicion is that the "|| pte_write" is precisely to cover your
> > > s390 case where pte is never dirty (it may even have been me who got
> > > Peter to put it in for that reason).  In which case your patch would
> > > be fine - though I think it'd be improved a lot by a comment or
> > > rearrangement or new macro in place of the pte_dirty || pte_write
> > > line (perhaps adjust my pte_maybe_dirty in asm-generic/pgtable.h,
> > > and use that - its former use in msync has gone away now).
> > 
> > No, s390 is covered by the page_test_dirty / page_clear_dirty pair in
> > page_mkclean. 
(Continue reading)

Nick Piggin | 2 Jul 2007 08:19
Picon

Re: vm/fs meetup in september?

On Sat, Jun 30, 2007 at 05:35:04AM -0700, Martin J. Bligh wrote:
> Christoph Hellwig wrote:
> >On Tue, Jun 26, 2007 at 12:35:09PM +1000, Nick Piggin wrote:
> >  
> >>I'd like to see you there, so I hope we can find a date that most
> >>people are happy with. I'll try to start working that out after we
> >>have a rough idea of who's interested.
> >>    
> >
> >Do we have any data preferences yet?
> >  
> 
> You mean date?
> 
> 
> VM is arranged for the 3rd, IIRC Kernel summit doesn't
> start until the 5th, so there's a gap on the 4th if you want
> to sort out the fs stuff then? Not 100% sure on the dates.

Well it says kernel summit is 4-6th, but also that it is a two
day event, so I think you're right that it starts on 5th.

I'd like to see what people's preferences are, whether we try to
do this on the 4th, or after KS (which could be 7th or 8th)? I
know some can't make it before KS and some can't make it after,
but if you absolutely won't come on any of these dates can you let
me know? And let me know preferences or other ideas too.

Regarding numbers, there are about a dozen so far which is good
but not as many filesystem maintainers as I had hoped (do they
(Continue reading)

Martin Schwidefsky | 2 Jul 2007 09:07
Picon
Favicon

Re: [patch 5/5] Optimize page_mkclean_one

On Sun, 2007-07-01 at 15:27 +0200, Peter Zijlstra wrote:
> > But I could easily be overlooking something: Peter will recall.
> 
> /me tries to get his brain up to speed after the OLS closing party :-)

Oh-oh, the Black Thorn party :-)

> I did both pte_dirty and pte_write because I was extra careful. One
> _should_ imply the other, but since we'll be clearing both, I thought it
> prudent to also check both.

Just ran a little experiment: I've added a simple WARN_ON(ret == 0) to
page_mkclean after the page_test_dirty() check to see if there are cases
where the page is dirty and all ptes are read-only. A little stress run
including massive swap did not print a single warning.

> I will have to think on this a little more, but I'm currently of the
> opinion that the optimisation is not correct. But I'll have a thorough
> look at s390 again when I get home.

I think the patch is correct, although I beginning to doubt that is has
any effect.

--

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.

(Continue reading)

Jared Hulbert | 2 Jul 2007 19:26
Picon

Re: vm/fs meetup in september?

> > Christoph> So what you mean is "swap on flash" ?  Defintively sounds
> > Christoph> like an interesting topic, although I'm not too sure it's
> > Christoph> all that filesystem-related.
>
> I wouldn't want to call it swap, as this carries with it block-io
> connotations.  It's really mmap on flash.

Yes it is really mmap on flash.  But you are "swapping" pages from RAM
to be mmap'ed on flash.  Also the flash-io complexities are similar to
the block-io layer.  I think "swap on flash" is fair.  Though that
might be confused with making swap work on a NAND flash, which is very
much like the current block-io approach.  "Mmappable swap on flash" is
more exact, I suppose.

> > You need either a block translation layer,
>
> Are you suggesting to go through the block layer to reach the flash?

Well the obvious route would be to have this management layer use the
MTD, I can't see anything wrong with that.

> > or a (swap) filesystem that
> > understands flash peculiarities in order to make such a thing work.
> > The standard Linux swap format will not work.
>
> Correct.
>
> BTW, you may want to have a look at my "[RFC] VM: I have a dream..." thread.

Interesting.  This idea does allow for swap to be access directly.
(Continue reading)

Jared Hulbert | 2 Jul 2007 19:44
Picon

Re: vm/fs meetup in september?

> So what you mean is "swap on flash" ?  Defintively sounds like an
> interesting topic, although I'm not too sure it's all that
> filesystem-related.

Maybe not. Yet, it would be a very useful place to store data from a
file as a non-volatile page cache.

Also it is something that I believe would benefit from a VFS-like API.
 I mean there is a consistent interface a management layer like this
could use, yet the algorithms used to order the data and the interface
to the physical media may vary.  There is no single right way to do
the management layer, much like filesystems.

Given the page orientation of the current VFS seems to me like there
might be a nice way to use it for this purpose.

Or maybe the real experts on this stuff can tell me how wrong that is
and where it should go :)
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo <at> vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Gmane