Benjamin Herrenschmidt | 1 Aug 2010 02:27

Re: [PATCH 0/8] v3 De-couple sysfs memory directories from memory sections

On Sat, 2010-07-31 at 12:55 -0700, Greg KH wrote:
> On Sat, Jul 31, 2010 at 03:36:24PM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2010-07-19 at 22:45 -0500, Nathan Fontenot wrote:
> > > This set of patches de-couples the idea that there is a single
> > > directory in sysfs for each memory section.  The intent of the
> > > patches is to reduce the number of sysfs directories created to
> > > resolve a boot-time performance issue.  On very large systems
> > > boot time are getting very long (as seen on powerpc hardware)
> > > due to the enormous number of sysfs directories being created.
> > > On a system with 1 TB of memory we create ~63,000 directories.
> > > For even larger systems boot times are being measured in hours.
> > 
> > Greg, Kame, how do we proceed with these ? I'm happy to put them in
> > powerpc.git with appropriate acks or will you take them ?
> 
> I thought there would be at least one more round of these patches based
> on the review comments, right?

Yes, but I was nonetheless inquiring whether I should pick them up after
said repost :-)

> I'll be glad to take them when everyone agrees with them.

Ok, good, one less thing to worry about in powerpc patchwork :-)

Cheers,
Ben.

> thanks,
> 

Wu Fengguang | 1 Aug 2010 07:27

Re: [PATCH] vmscan: remove wait_on_page_writeback() from pageout()

On Wed, Jul 28, 2010 at 05:59:55PM +0800, KOSAKI Motohiro wrote:
> > On Wed, Jul 28, 2010 at 06:43:41PM +0900, KOSAKI Motohiro wrote:
> > > > On Wed, Jul 28, 2010 at 04:46:54PM +0800, Wu Fengguang wrote:
> > > > > The wait_on_page_writeback() call inside pageout() is virtually dead code.
> > > > > 
> > > > >         shrink_inactive_list()
> > > > >           shrink_page_list(PAGEOUT_IO_ASYNC)
> > > > >             pageout(PAGEOUT_IO_ASYNC)
> > > > >           shrink_page_list(PAGEOUT_IO_SYNC)
> > > > >             pageout(PAGEOUT_IO_SYNC)
> > > > > 
> > > > > Because shrink_page_list/pageout(PAGEOUT_IO_SYNC) is always called after
> > > > > a preceding shrink_page_list/pageout(PAGEOUT_IO_ASYNC), the first
> > > > > pageout(ASYNC) converts dirty pages into writeback pages, the second
> > > > > shrink_page_list(SYNC) waits on the clean of writeback pages before
> > > > > calling pageout(SYNC). The second shrink_page_list(SYNC) can hardly run
> > > > > into dirty pages for pageout(SYNC) unless in some race conditions.
> > > > > 
> > > > 
> > > > It's possible for the second call to run into dirty pages as there is a
> > > > congestion_wait() call between the first shrink_page_list() call and the
> > > > second. That's a big window.
> > > > 
> > > > > And the wait page-by-page behavior of pageout(SYNC) will lead to very
> > > > > long stall time if running into some range of dirty pages.
> > > > 
> > > > True, but this is also lumpy reclaim which is depending on a contiguous
> > > > range of pages. It's better for it to wait on the selected range of pages
> > > > which is known to contain at least one old page than excessively scan and
> > > > reclaim newer pages.

Wu Fengguang | 1 Aug 2010 07:49

Re: [PATCH] vmscan: remove wait_on_page_writeback() from pageout()

> > Well, "vmscan: raise the bar to PAGEOUT_IO_SYNC stalls" is completely bandaid and
> 
> No joking. The (DEF_PRIORITY-2) is obviously too permissive and shall be fixed.
> 
> > much IO under slow USB flash memory device still cause such problem even if the patch is applied.
> 
> As for this patch, raising the bar to PAGEOUT_IO_SYNC reduces both
> calls to congestion_wait() and wait_on_page_writeback(). So it
> absolutely helps by itself.

To base the discussion on more objective numbers, I ran the attached
debug patch on my desktop. It has 4GB of memory and no swap. I started
two processes:

        usemem 3g --sleep 300
        cp /dev/zero /tmp

I didn't get any stall reports with DEF_PRIORITY/3, so I lowered the
threshold to DEF_PRIORITY-2 and got the reports below. The test is good
evidence for "vmscan: raise the bar to PAGEOUT_IO_SYNC stalls": it works
as expected and does avoid the stalls. When the priority goes down to 4,
your patch to remove congestion_wait() will come into play. So they
are both good patches to have.
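
For illustration only, here is a minimal userspace sketch of how the two
thresholds compare. It assumes DEF_PRIORITY is 12 and that the check has the
form (sc->order && priority < threshold); the helper names are hypothetical
and are not taken from the patch or from mm/vmscan.c.

```c
#include <stdbool.h>

/* Assumption: DEF_PRIORITY is 12; reclaim starts at this priority and
 * counts down toward 0 as memory pressure rises. */
#define DEF_PRIORITY 12

/* Hypothetical helper: would a lumpy (order > 0) reclaim pass at
 * `priority` escalate to PAGEOUT_IO_SYNC when the bar is `threshold`? */
static bool sync_reclaim_allowed(int order, int priority, int threshold)
{
	return order > 0 && priority < threshold;
}

/* Hypothetical helper: count how many of the priority levels
 * DEF_PRIORITY..0 are allowed to take the synchronous path. */
static int count_sync_levels(int order, int threshold)
{
	int levels = 0;

	for (int priority = DEF_PRIORITY; priority >= 0; priority--)
		if (sync_reclaim_allowed(order, priority, threshold))
			levels++;
	return levels;
}
```

Under these assumptions, count_sync_levels(1, DEF_PRIORITY - 2) covers ten
priority levels while count_sync_levels(1, DEF_PRIORITY / 3) covers only
four, which is one way to see why the stricter bar stalls far less often.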

[Sorry, I cannot afford to add more stress to test your patch.
 My remote test box was stressed to death yesterday and needs a
 physical reset tomorrow. My desktop has already corrupted the
 .zsh_history file twice and I can't risk losing more data..]

Thanks,

KOSAKI Motohiro | 1 Aug 2010 10:19

Re: [PATCH 6/6] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages

Hi Trond,

> There is that, and then there are issues with the VM simply lying to the
> filesystems.
> 
> See https://bugzilla.kernel.org/show_bug.cgi?id=16056
> 
> Which basically boils down to the following: kswapd tells the filesystem
> that it is quite safe to do GFP_KERNEL allocations in pageouts and as
> part of try_to_release_page().
> 
> In the case of pageouts, it does set the 'WB_SYNC_NONE', 'nonblocking'
> and 'for_reclaim' flags in the writeback_control struct, and so the
> filesystem has at least some hint that it should do non-blocking i/o.
> 
> However if you trust the GFP_KERNEL flag in try_to_release_page() then
> the kernel can and will deadlock, and so I had to add in a hack
> specifically to tell the NFS client not to trust that flag if it comes
> from kswapd.

Could you please elaborate on your issue? Briefly, the vmscan logic is as follows:

	if (PageDirty(page))
		pageout(page);
	if (page_has_private(page))
		try_to_release_page(page, sc->gfp_mask);

So, I'm interested in why NFS needs to write back again at ->release_page even
though pageout() called ->writepage and it was successful.


KOSAKI Motohiro | 1 Aug 2010 10:32

Re: [PATCH] vmscan: remove wait_on_page_writeback() from pageout()

> On Wed, Jul 28, 2010 at 05:59:55PM +0800, KOSAKI Motohiro wrote:
> > > On Wed, Jul 28, 2010 at 06:43:41PM +0900, KOSAKI Motohiro wrote:
> > > > > On Wed, Jul 28, 2010 at 04:46:54PM +0800, Wu Fengguang wrote:
> > > > > > The wait_on_page_writeback() call inside pageout() is virtually dead code.
> > > > > > 
> > > > > >         shrink_inactive_list()
> > > > > >           shrink_page_list(PAGEOUT_IO_ASYNC)
> > > > > >             pageout(PAGEOUT_IO_ASYNC)
> > > > > >           shrink_page_list(PAGEOUT_IO_SYNC)
> > > > > >             pageout(PAGEOUT_IO_SYNC)
> > > > > > 
> > > > > > Because shrink_page_list/pageout(PAGEOUT_IO_SYNC) is always called after
> > > > > > a preceding shrink_page_list/pageout(PAGEOUT_IO_ASYNC), the first
> > > > > > pageout(ASYNC) converts dirty pages into writeback pages, the second
> > > > > > shrink_page_list(SYNC) waits on the clean of writeback pages before
> > > > > > calling pageout(SYNC). The second shrink_page_list(SYNC) can hardly run
> > > > > > into dirty pages for pageout(SYNC) unless in some race conditions.
> > > > > > 
> > > > > 
> > > > > It's possible for the second call to run into dirty pages as there is a
> > > > > congestion_wait() call between the first shrink_page_list() call and the
> > > > > second. That's a big window.
> > > > > 
> > > > > > And the wait page-by-page behavior of pageout(SYNC) will lead to very
> > > > > > long stall time if running into some range of dirty pages.
> > > > > 
> > > > > True, but this is also lumpy reclaim which is depending on a contiguous
> > > > > range of pages. It's better for it to wait on the selected range of pages
> > > > > which is known to contain at least one old page than excessively scan and
> > > > > reclaim newer pages.

Wu Fengguang | 1 Aug 2010 10:35

Re: [PATCH] vmscan: remove wait_on_page_writeback() from pageout()

On Sun, Aug 01, 2010 at 04:32:01PM +0800, KOSAKI Motohiro wrote:
> > On Wed, Jul 28, 2010 at 05:59:55PM +0800, KOSAKI Motohiro wrote:
> > > > On Wed, Jul 28, 2010 at 06:43:41PM +0900, KOSAKI Motohiro wrote:
> > > > > > On Wed, Jul 28, 2010 at 04:46:54PM +0800, Wu Fengguang wrote:
> > > > > > > The wait_on_page_writeback() call inside pageout() is virtually dead code.
> > > > > > > 
> > > > > > >         shrink_inactive_list()
> > > > > > >           shrink_page_list(PAGEOUT_IO_ASYNC)
> > > > > > >             pageout(PAGEOUT_IO_ASYNC)
> > > > > > >           shrink_page_list(PAGEOUT_IO_SYNC)
> > > > > > >             pageout(PAGEOUT_IO_SYNC)
> > > > > > > 
> > > > > > > Because shrink_page_list/pageout(PAGEOUT_IO_SYNC) is always called after
> > > > > > > a preceding shrink_page_list/pageout(PAGEOUT_IO_ASYNC), the first
> > > > > > > pageout(ASYNC) converts dirty pages into writeback pages, the second
> > > > > > > shrink_page_list(SYNC) waits on the clean of writeback pages before
> > > > > > > calling pageout(SYNC). The second shrink_page_list(SYNC) can hardly run
> > > > > > > into dirty pages for pageout(SYNC) unless in some race conditions.
> > > > > > > 
> > > > > > 
> > > > > > It's possible for the second call to run into dirty pages as there is a
> > > > > > congestion_wait() call between the first shrink_page_list() call and the
> > > > > > second. That's a big window.
> > > > > > 
> > > > > > > And the wait page-by-page behavior of pageout(SYNC) will lead to very
> > > > > > > long stall time if running into some range of dirty pages.
> > > > > > 
> > > > > > True, but this is also lumpy reclaim which is depending on a contiguous
> > > > > > range of pages. It's better for it to wait on the selected range of pages
> > > > > > which is known to contain at least one old page than excessively scan and

KOSAKI Motohiro | 1 Aug 2010 10:40

Re: [PATCH] vmscan: remove wait_on_page_writeback() from pageout()

> On Sun, Aug 01, 2010 at 04:32:01PM +0800, KOSAKI Motohiro wrote:
> > > On Wed, Jul 28, 2010 at 05:59:55PM +0800, KOSAKI Motohiro wrote:
> > > > > On Wed, Jul 28, 2010 at 06:43:41PM +0900, KOSAKI Motohiro wrote:
> > > > > > > On Wed, Jul 28, 2010 at 04:46:54PM +0800, Wu Fengguang wrote:
> > > > > > > > The wait_on_page_writeback() call inside pageout() is virtually dead code.
> > > > > > > > 
> > > > > > > >         shrink_inactive_list()
> > > > > > > >           shrink_page_list(PAGEOUT_IO_ASYNC)
> > > > > > > >             pageout(PAGEOUT_IO_ASYNC)
> > > > > > > >           shrink_page_list(PAGEOUT_IO_SYNC)
> > > > > > > >             pageout(PAGEOUT_IO_SYNC)
> > > > > > > > 
> > > > > > > > Because shrink_page_list/pageout(PAGEOUT_IO_SYNC) is always called after
> > > > > > > > a preceding shrink_page_list/pageout(PAGEOUT_IO_ASYNC), the first
> > > > > > > > pageout(ASYNC) converts dirty pages into writeback pages, the second
> > > > > > > > shrink_page_list(SYNC) waits on the clean of writeback pages before
> > > > > > > > calling pageout(SYNC). The second shrink_page_list(SYNC) can hardly run
> > > > > > > > into dirty pages for pageout(SYNC) unless in some race conditions.
> > > > > > > > 
> > > > > > > 
> > > > > > > It's possible for the second call to run into dirty pages as there is a
> > > > > > > congestion_wait() call between the first shrink_page_list() call and the
> > > > > > > second. That's a big window.
> > > > > > > 
> > > > > > > > And the wait page-by-page behavior of pageout(SYNC) will lead to very
> > > > > > > > long stall time if running into some range of dirty pages.
> > > > > > > 
> > > > > > > True, but this is also lumpy reclaim which is depending on a contiguous
> > > > > > > range of pages. It's better for it to wait on the selected range of pages
> > > > > > > which is known to contain at least one old page than excessively scan and

Wu Fengguang | 1 Aug 2010 10:51

[PATCH mmotm] vmscan: raise the bar to PAGEOUT_IO_SYNC stalls

Fix "system goes unresponsive under memory pressure and lots of
dirty/writeback pages" bug.

	http://lkml.org/lkml/2010/4/4/86

In the above thread, Andreas Mohr reported:

	Invoking any command locked up for minutes (note that I'm
	talking about attempted additional I/O to the _other_,
	_unaffected_ main system HDD - such as loading some shell
	binaries -, NOT the external SSD18M!!).

This happens when both of the following conditions are met:
- under memory pressure
- writing heavily to a slow device

OOM also happens in Andreas' system. The OOM trace shows that 3
processes are stuck in wait_on_page_writeback() in the direct reclaim
path. One in do_fork() and the other two in unix_stream_sendmsg(). They
are blocked on this condition:

	(sc->order && priority < DEF_PRIORITY - 2)

which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim
also should use PAGEOUT_IO_SYNC) one year ago. That condition may be too
permissive. In Andreas' case, 512MB/1024 = 512KB. If the direct reclaim
for the order-1 fork() allocation runs into a range of 512KB
hard-to-reclaim LRU pages, it will be stalled.
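
The 512MB/1024 figure can be reproduced with a minimal userspace sketch. It
assumes that the window covered at a given priority is the LRU size shifted
right by that priority, that DEF_PRIORITY is 12, and that all 512MB sits on
the LRU; the helper name is hypothetical.

```c
#include <stdint.h>

/* Assumption: DEF_PRIORITY is 12, so DEF_PRIORITY - 2 is 10. */
#define DEF_PRIORITY 12

/* Hypothetical helper: approximate amount of LRU memory covered by one
 * reclaim pass at `priority`, modelled as lru_bytes >> priority. */
static uint64_t scan_window_bytes(uint64_t lru_bytes, int priority)
{
	return lru_bytes >> priority;
}
```

With lru_bytes set to 512MB, scan_window_bytes() gives 512KB at priority
DEF_PRIORITY - 2, so under this model the permissive condition first becomes
true once reclaim is already working on windows of up to 512KB.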

It's a severe problem in three ways.

KOSAKI Motohiro | 1 Aug 2010 10:55

Re: [PATCH mmotm] vmscan: raise the bar to PAGEOUT_IO_SYNC stalls

> Reported-by: Andreas Mohr <andi <at> lisas.de>
> Reviewed-by: Minchan Kim <minchan.kim <at> gmail.com>
> Signed-off-by: Mel Gorman <mel <at> csn.ul.ie>
> Signed-off-by: Wu Fengguang <fengguang.wu <at> intel.com>

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com>


KOSAKI Motohiro | 1 Aug 2010 11:12

[PATCH] vmscan: synchronous lumpy reclaim don't call congestion_wait()

rebased onto Wu's patch

----------------------------------------------
From 35772ad03e202c1c9a2252de3a9d3715e30d180f Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com>
Date: Sun, 1 Aug 2010 17:23:41 +0900
Subject: [PATCH] vmscan: synchronous lumpy reclaim don't call congestion_wait()

congestion_wait() means "wait until the number of requests in the IO queue
drops below the congestion threshold".
That said, if the system has plenty of dirty pages, the flusher thread pushes
new requests to the IO queue continuously, so the IO queue does not clear its
congestion status for a long time. Thus, congestion_wait(HZ/10) is
almost equivalent to schedule_timeout(HZ/10).

If the system has 512MB of memory, DEF_PRIORITY means a 128kB scan, and it takes 4096
shrink_page_list() calls to scan 128kB (i.e. 128kB/32=4096) of memory.
4096 stalls of 0.1sec each add up to an insanely long stall. That shouldn't happen.
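
As a rough illustration of that total, here is a minimal userspace sketch. It
takes the figure of 4096 calls above at face value and assumes every
congestion_wait(HZ/10) sleeps its full 100ms timeout; the helper name is
hypothetical.

```c
/* Assumption: congestion_wait(HZ/10) times out after a full 100ms when
 * the queue never leaves the congested state. */
#define CONGESTION_WAIT_MS 100UL

/* Hypothetical helper: upper bound on the accumulated stall, in
 * milliseconds, if each of `calls` passes sleeps for `wait_ms`. */
static unsigned long worst_case_stall_ms(unsigned long calls,
					 unsigned long wait_ms)
{
	return calls * wait_ms;
}
```

worst_case_stall_ms(4096, CONGESTION_WAIT_MS) is 409600ms, i.e. roughly 410
seconds or close to seven minutes. This is an upper bound: congestion_wait()
may return earlier whenever the queue does leave the congested state.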

On the other hand, this synchronous lumpy reclaim doesn't need this
congestion_wait() at all. shrink_page_list(PAGEOUT_IO_SYNC) calls
wait_on_page_writeback(), and that provides sufficient waiting.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com>
Reviewed-by: Wu Fengguang <fengguang.wu <at> intel.com>

---
 mm/vmscan.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)
