Andrew Morton | 1 Sep 2011 01:37

Re: [patch 1/2] mm: vmscan: fix force-scanning small targets without swap

On Mon, 29 Aug 2011 18:08:39 +0200
Johannes Weiner <jweiner <at> redhat.com> wrote:

> Andrew,
> 
> On Thu, Aug 11, 2011 at 10:31:54PM +0200, Johannes Weiner wrote:
> > Without swap, anonymous pages are not scanned.  As such, they should
> > not count when considering force-scanning a small target if there is
> > no swap.
> > 
> > Otherwise, targets are not force-scanned even when their effective
> > scan number is zero and the other conditions--kswapd/memcg--apply.
> 
> I forgot to mention, this patch is a fix for '246e87a memcg: fix
> get_scan_count() for small targets', which went upstream this merge
> window.
> 
> Probably makes sense to merge this one too before the release..?
> 

Ah, I didn't realise that.  Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo <at> kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont <at> kvack.org"> email <at> kvack.org </a>
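The small model below shows the condition the patch changes. It is a hypothetical sketch, not the kernel's get_scan_count(). The struct, the function names and the force-scan amount (the smaller of the list size and SWAP_CLUSTER_MAX) are assumptions made for illustration. It shows how counting anon pages when there is no swap can hide the fact that a target's effective scan count, which covers file pages only, is zero.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Reclaim batch size in pages (SWAP_CLUSTER_MAX is 32 in the kernel). */
#define SWAP_CLUSTER_MAX 32UL

/* Hypothetical per-target LRU sizes, in pages. */
struct lru_sizes {
	unsigned long anon;
	unsigned long file;
};

static void check_priority(int priority)
{
	/* Shifting by the type width or more is undefined in C. */
	assert(priority >= 0 && priority < (int)(sizeof(unsigned long) * CHAR_BIT));
}

/* Before the fix: anon counts toward the "too small" test even without swap. */
bool needs_force_scan_buggy(struct lru_sizes lru, int priority)
{
	check_priority(priority);
	return (lru.anon >> priority) + (lru.file >> priority) == 0;
}

/* After the fix: only lists that will actually be scanned are counted. */
bool needs_force_scan_fixed(struct lru_sizes lru, int priority, bool have_swap)
{
	check_priority(priority);
	unsigned long anon = have_swap ? lru.anon >> priority : 0;
	return anon + (lru.file >> priority) == 0;
}

/*
 * File pages to scan for one target. force_eligible stands for the
 * kswapd/memcg conditions mentioned in the patch description.
 */
unsigned long file_scan_target(struct lru_sizes lru, int priority,
			       bool have_swap, bool force_eligible)
{
	check_priority(priority);
	unsigned long scan = lru.file >> priority;

	if (force_eligible && needs_force_scan_fixed(lru, priority, have_swap))
		scan = lru.file < SWAP_CLUSTER_MAX ? lru.file : SWAP_CLUSTER_MAX;
	return scan;
}
```

For example, with 4096 anon pages, 100 file pages, no swap and priority 12 (the kernel's starting DEF_PRIORITY), the old check sees a nonzero total because 4096 >> 12 == 1, so nothing is force-scanned. The fixed check sees zero, and an eligible reclaimer scans 32 file pages instead.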


KAMEZAWA Hiroyuki | 1 Sep 2011 02:02

Re: [PATCH] Enable OOM when moving processes between cgroups?

On Wed, 31 Aug 2011 19:54:22 +0200
Johannes Weiner <jweiner <at> redhat.com> wrote:

> On Wed, Aug 31, 2011 at 08:32:21PM +0300, Viktor Rosendahl wrote:
> > Hello,
> > 
> > I wonder if there is a specific reason why the  OOM killer hasn't been enabled
> > in the mem_cgroup_do_precharge() function in mm/memcontrol.c ?
> > 
> > In my testing (2.6.32 kernel with some backported cgroups patches), it improves
> > the case when there isn't room for the task in the target cgroup.
> 
> Tasks are moved directly on behalf of a request from userspace.  We
> would much prefer denying that single request than invoking the
> oom-killer on the whole group.
> 
Yes, I agree.

> Quite a lot changed in the trycharge-reclaim-retry path since 2009.
> Nowadays, charging is retried as long as reclaim is making any
> progress at all, so I don't see that it would give up moving a task
> too lightly, even without the extra OOM looping.
> 
> Is there any chance you could retry with a more recent kernel?
> 

It's a curious topic.

Thanks,
-Kame
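The behaviour Johannes and Daisuke describe can be pictured with a toy model. Charging is retried while reclaim makes progress, and when reclaim stops making progress the single move request fails with -ENOMEM instead of triggering the OOM killer. Everything here (memcg_model, try_charge, reclaim, precharge_for_move and the 32-page batch) is hypothetical and is not mem_cgroup_do_precharge() itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical memcg model; all counts are in pages. */
struct memcg_model {
	unsigned long usage;
	unsigned long limit;
	unsigned long reclaimable; /* pages reclaim could still free */
};

/* Charge nr pages only if they fit under the limit. */
static bool try_charge(struct memcg_model *m, unsigned long nr)
{
	if (m->usage + nr > m->limit)
		return false;
	m->usage += nr;
	return true;
}

/* Free up to `batch` pages; a return of 0 means reclaim made no progress. */
static unsigned long reclaim(struct memcg_model *m, unsigned long batch)
{
	assert(m->reclaimable <= m->usage); /* model invariant */
	unsigned long freed = m->reclaimable < batch ? m->reclaimable : batch;

	m->reclaimable -= freed;
	m->usage -= freed;
	return freed;
}

/* Precharge for a task move: retry while reclaim helps, never OOM-kill. */
int precharge_for_move(struct memcg_model *m, unsigned long nr)
{
	while (!try_charge(m, nr)) {
		if (reclaim(m, 32) == 0)
			return -ENOMEM; /* deny this one request */
	}
	return 0;
}
```

The design choice this illustrates is that the caller is a userspace request, so failing it is cheap and visible, whereas an OOM kill would hit the whole target group.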

KAMEZAWA Hiroyuki | 1 Sep 2011 02:09

Re: [patch] memcg: skip scanning active lists based on individual size

On Wed, 31 Aug 2011 19:13:34 +0900
Minchan Kim <minchan.kim <at> gmail.com> wrote:

> On Wed, Aug 31, 2011 at 6:08 PM, Johannes Weiner <jweiner <at> redhat.com> wrote:
> > Reclaim decides to skip scanning an active list when the corresponding
> > inactive list is above a certain size in comparison to leave the
> > assumed working set alone while there are still enough reclaim
> > candidates around.
> >
> > The memcg implementation of comparing those lists instead reports
> > whether the whole memcg is low on the requested type of inactive
> > pages, considering all nodes and zones.
> >
> > This can lead to an oversized active list not being scanned because of
> > the state of the other lists in the memcg, as well as an active list
> > being scanned while its corresponding inactive list has enough pages.
> >
> > Not only is this wrong, it's also a scalability hazard, because the
> > global memory state over all nodes and zones has to be gathered for
> > each memcg and zone scanned.
> >
> > Make these calculations purely based on the size of the two LRU lists
> > that are actually affected by the outcome of the decision.
> >
> > Signed-off-by: Johannes Weiner <jweiner <at> redhat.com>
> > Cc: Rik van Riel <riel <at> redhat.com>
> > Cc: KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com>
> > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu <at> jp.fujitsu.com>
> > Cc: Daisuke Nishimura <nishimura <at> mxp.nes.nec.co.jp>
> > Cc: Balbir Singh <bsingharora <at> gmail.com>
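A minimal sketch of the per-list decision the patch describes follows. It is modelled on the global reclaim heuristics of the time as I understand them: for anon, the inactive list counts as low when inactive * inactive_ratio < active; for file, it counts as low when active > inactive. The struct and function names are hypothetical. What matters is that each call only sees the two lists of one memcg in one zone.

```c
#include <assert.h>
#include <stdbool.h>

/* Sizes of one active/inactive LRU pair for one memcg in one zone. */
struct lru_pair {
	unsigned long active;
	unsigned long inactive;
};

/* Anon: low when active outweighs inactive by more than inactive_ratio. */
bool inactive_anon_is_low(struct lru_pair anon, unsigned long inactive_ratio)
{
	assert(inactive_ratio >= 1);
	return anon.inactive * inactive_ratio < anon.active;
}

/* File: keep the inactive list at least as large as the active list. */
bool inactive_file_is_low(struct lru_pair file)
{
	return file.active > file.inactive;
}

/* Scan the active list only if its own inactive partner is low. */
bool should_scan_active(struct lru_pair pair, bool is_file,
			unsigned long inactive_ratio)
{
	return is_file ? inactive_file_is_low(pair)
		       : inactive_anon_is_low(pair, inactive_ratio);
}
```

Because the inputs are limited to one (memcg, zone) pair, an oversized active list is scanned no matter how balanced the memcg's other zones are, and no summing over all nodes and zones is needed.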

Daisuke Nishimura | 1 Sep 2011 02:13

Re: [PATCH] Enable OOM when moving processes between cgroups?

On Wed, 31 Aug 2011 19:54:22 +0200
Johannes Weiner <jweiner <at> redhat.com> wrote:

> On Wed, Aug 31, 2011 at 08:32:21PM +0300, Viktor Rosendahl wrote:
> > Hello,
> > 
> > I wonder if there is a specific reason why the  OOM killer hasn't been enabled
> > in the mem_cgroup_do_precharge() function in mm/memcontrol.c ?
> > 
> > In my testing (2.6.32 kernel with some backported cgroups patches), it improves
> > the case when there isn't room for the task in the target cgroup.
> 
> Tasks are moved directly on behalf of a request from userspace.  We
> would much prefer denying that single request than invoking the
> oom-killer on the whole group.
> 
I agree. OOM is intentionally disabled in that path.

Thanks,
Daisuke Nishimura.

Wu Fengguang | 1 Sep 2011 06:14

Re: slow performance on disk/network i/o full speed after drop_caches

Hi Stefan,

On Wed, Aug 31, 2011 at 03:11:02PM +0800, Stefan Priebe - Profihost AG wrote:
> Hi Fengguang,
> Hi Yanhai,
> 
> > you're absolutely correct, zone_reclaim_mode is on - but why?
> > There must be some Linux software which switches it on.
> >
> > ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
> > ~#
> >
> > also
> > ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
> > ~#
> >
> > tells us nothing.
> >
> > I've then read this:
> >
> > "zone_reclaim_mode is set during bootup to 1 if it is determined that
> > pages from remote zones will cause a measurable performance reduction.
> > The page allocator will then reclaim easily reusable pages (those page
> > cache pages that are currently not used) before allocating off node pages."
> >
> > Why does the kernel do that on these machines in our case?
> 
> Can nobody explain why the kernel set it to 1 in this case?

It's determined by RECLAIM_DISTANCE.
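The boot-time rule can be sketched as: if any node is farther from the local node than RECLAIM_DISTANCE, zone reclaim is switched on. The function below is a hypothetical model, not kernel code. The value 20 is assumed to be the generic default of kernels from this period, and some architectures override it. Node distances follow the ACPI SLIT convention in which the local distance is 10.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed generic default for kernels of this era; arches may override. */
#define RECLAIM_DISTANCE 20

/*
 * dist: n x n row-major node distance table (local distance is 10).
 * Returns true if any node is farther from `node` than RECLAIM_DISTANCE,
 * which is the case in which boot code sets zone_reclaim_mode to 1.
 */
bool zone_reclaim_wanted(const int *dist, size_t n, size_t node)
{
	assert(dist != NULL && node < n);
	for (size_t other = 0; other < n; other++) {
		if (dist[node * n + other] > RECLAIM_DISTANCE)
			return true;
	}
	return false;
}
```

On a running system the table can be read from /sys/devices/system/node/node*/distance. Since the value is chosen by the kernel at boot rather than by any file under /etc/sysctl.*, grepping those files finds nothing; it can still be overridden after boot with `sysctl vm.zone_reclaim_mode=0`.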

David Gibson | 1 Sep 2011 07:28

Re: [PATCH v2 1/1] hugepages: Fix race between hugetlbfs umount and quota update.

On Fri, Aug 19, 2011 at 02:51:09PM -0700, Andrew Morton wrote:
> On Fri, 19 Aug 2011 14:14:11 -0500
> Andrew Barry <abarry <at> cray.com> wrote:
> 
> > This patch fixes a use-after-free problem in free_huge_page, with a quota update
> > happening after hugetlbfs umount. The problem results when a device driver,
> > which has mapped a hugepage, does a put_page. put_page calls free_huge_page,
> > which does a hugetlb_put_quota. As written, hugetlb_put_quota takes an
> > address_space struct pointer "mapping" as an argument. If the put_page occurs
> > after the hugetlbfs filesystem is unmounted, mapping points to freed memory.
> 
> OK.  This sounds screwed up.  If a device driver is currently using a
> page from a hugetlbfs file then the unmount shouldn't have succeeded in
> the first place!
> 
> Or is it the case that the device driver got a reference to the page by
> other means, bypassing hugetlbfs?  And there's undesirable/incorrect
> interaction between the non-hugetlbfs operation and hugetlbfs?
> 
> Or something else?
> 
> <starts reading the mailing list>
> 
> OK, important missing information from the above is that the driver got
> at this page via get_user_pages() and happened to stumble across a
> hugetlbfs page.  So it's indeed an incorrect interaction between a
> non-hugetlbfs operation and hugetlbfs.
> 
> What's different about hugetlbfs?  Why don't other filesystems hit this?
> 
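The use-after-free happens because the quota update on the late put_page goes through a pointer whose lifetime ends at umount. A common way to remove that dependency is reference counting: each outstanding page pins the object that holds the quota, and umount drops only the mount's own reference. The single-threaded model below is hypothetical (quota_pool and its helpers are not kernel APIs, and this is not the patch under review). A real implementation would also need locking or atomic counts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical quota object shared by a mount and the pages it hands out. */
struct quota_pool {
	long refs;       /* one for the mount plus one per outstanding page */
	long used_pages;
};

struct quota_pool *pool_create(void)
{
	struct quota_pool *p = calloc(1, sizeof(*p));

	if (p)
		p->refs = 1; /* the mount's reference */
	return p;
}

/* Drop one reference; returns true if this freed the pool. */
bool pool_put(struct quota_pool *p)
{
	assert(p->refs > 0);
	if (--p->refs == 0) {
		free(p);
		return true;
	}
	return false;
}

/* Allocating a page charges quota and pins the pool. */
void page_alloc_charge(struct quota_pool *p)
{
	p->used_pages++;
	p->refs++;
}

/* Freeing a page, even after umount, uncharges and unpins safely. */
bool page_free_uncharge(struct quota_pool *p)
{
	assert(p->used_pages > 0);
	p->used_pages--;
	return pool_put(p);
}

/* umount drops only the mount's reference. */
bool umount_pool(struct quota_pool *p)
{
	return pool_put(p);
}
```

In this model the pool survives umount while a driver still holds a page obtained through get_user_pages(), and the final put frees it.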

Re: slow performance on disk/network i/o full speed after drop_caches

Thanks!

On 01.09.2011 06:14, Wu Fengguang wrote:
> Hi Stefan,
>
> On Wed, Aug 31, 2011 at 03:11:02PM +0800, Stefan Priebe - Profihost AG wrote:
>> Hi Fengguang,
>> Hi Yanhai,
>>
>>> you're absolutely correct, zone_reclaim_mode is on - but why?
>>> There must be some Linux software which switches it on.
>>>
>>> ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
>>> ~#
>>>
>>> also
>>> ~# grep 'zone_reclaim_mode' /etc/sysctl.* -r -i
>>> ~#
>>>
>>> tells us nothing.
>>>
>>> I've then read this:
>>>
>>> "zone_reclaim_mode is set during bootup to 1 if it is determined that
>>> pages from remote zones will cause a measurable performance reduction.
>>> The page allocator will then reclaim easily reusable pages (those page
>>> cache pages that are currently not used) before allocating off node pages."
>>>
>>> Why does the kernel do that on these machines in our case?
>>

Ying Han | 1 Sep 2011 08:05

Re: [patch] Revert "memcg: add memory.vmscan_stat"

On Tue, Aug 30, 2011 at 1:42 AM, Johannes Weiner <jweiner <at> redhat.com> wrote:
> On Tue, Aug 30, 2011 at 04:20:50PM +0900, KAMEZAWA Hiroyuki wrote:
>> On Tue, 30 Aug 2011 09:04:24 +0200
>> Johannes Weiner <jweiner <at> redhat.com> wrote:
>>
>> > On Tue, Aug 30, 2011 at 10:12:33AM +0900, KAMEZAWA Hiroyuki wrote:
>> > > @@ -1710,11 +1711,18 @@ static void mem_cgroup_record_scanstat(s
>> > >   spin_lock(&memcg->scanstat.lock);
>> > >   __mem_cgroup_record_scanstat(memcg->scanstat.stats[context], rec);
>> > >   spin_unlock(&memcg->scanstat.lock);
>> > > -
>> > > - memcg = rec->root;
>> > > - spin_lock(&memcg->scanstat.lock);
>> > > - __mem_cgroup_record_scanstat(memcg->scanstat.rootstats[context], rec);
>> > > - spin_unlock(&memcg->scanstat.lock);
>> > > + cgroup = memcg->css.cgroup;
>> > > + do {
>> > > +         spin_lock(&memcg->scanstat.lock);
>> > > +         __mem_cgroup_record_scanstat(
>> > > +                 memcg->scanstat.hierarchy_stats[context], rec);
>> > > +         spin_unlock(&memcg->scanstat.lock);
>> > > +         if (!cgroup->parent)
>> > > +                 break;
>> > > +         cgroup = cgroup->parent;
>> > > +         memcg = mem_cgroup_from_cont(cgroup);
>> > > + } while (memcg->use_hierarchy && memcg != rec->root);
>> >
>> > Okay, so this looks correct, but it sums up all parents after each
>> > memcg scanned, which could have a performance impact.  Usually,
>> > hierarchy statistics are only summed up when a user reads them.
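The alternative Johannes mentions is to update only local counters in the reclaim path and sum the hierarchy when a user reads the statistics. It can be sketched as follows. The memcg_node type, with its child and sibling links, is a hypothetical stand-in for the cgroup tree and not the memcg data structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical memcg node; reclaim updates only its local counter. */
struct memcg_node {
	unsigned long scanned;
	struct memcg_node *first_child;
	struct memcg_node *next_sibling;
};

/* Hot path: O(1), touches only the memcg that was scanned. */
void record_scan(struct memcg_node *m, unsigned long nr)
{
	m->scanned += nr;
}

/* Read path: walk the subtree only when a user asks for the numbers. */
unsigned long hierarchy_scanned(const struct memcg_node *m)
{
	assert(m != NULL);
	unsigned long total = m->scanned;

	for (const struct memcg_node *c = m->first_child; c; c = c->next_sibling)
		total += hierarchy_scanned(c);
	return total;
}
```

This moves the cost from every reclaim pass, where the proposed loop walks all parents, to the comparatively rare read of the statistics file.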

Johannes Weiner | 1 Sep 2011 08:15

Re: [patch] memcg: skip scanning active lists based on individual size

On Thu, Sep 01, 2011 at 09:09:31AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 31 Aug 2011 19:13:34 +0900
> Minchan Kim <minchan.kim <at> gmail.com> wrote:
> 
> > On Wed, Aug 31, 2011 at 6:08 PM, Johannes Weiner <jweiner <at> redhat.com> wrote:
> > > Reclaim decides to skip scanning an active list when the corresponding
> > > inactive list is above a certain size in comparison to leave the
> > > assumed working set alone while there are still enough reclaim
> > > candidates around.
> > >
> > > The memcg implementation of comparing those lists instead reports
> > > whether the whole memcg is low on the requested type of inactive
> > > pages, considering all nodes and zones.
> > >
> > > This can lead to an oversized active list not being scanned because of
> > > the state of the other lists in the memcg, as well as an active list
> > > being scanned while its corresponding inactive list has enough pages.
> > >
> > > Not only is this wrong, it's also a scalability hazard, because the
> > > global memory state over all nodes and zones has to be gathered for
> > > each memcg and zone scanned.
> > >
> > > Make these calculations purely based on the size of the two LRU lists
> > > that are actually affected by the outcome of the decision.
> > >
> > > Signed-off-by: Johannes Weiner <jweiner <at> redhat.com>
> > > Cc: Rik van Riel <riel <at> redhat.com>
> > > Cc: KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com>
> > > Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu <at> jp.fujitsu.com>
> > > Cc: Daisuke Nishimura <nishimura <at> mxp.nes.nec.co.jp>

Johannes Weiner | 1 Sep 2011 08:40

Re: [patch] Revert "memcg: add memory.vmscan_stat"

On Wed, Aug 31, 2011 at 11:05:51PM -0700, Ying Han wrote:
> On Tue, Aug 30, 2011 at 1:42 AM, Johannes Weiner <jweiner <at> redhat.com> wrote:
> > You want to look at A and see whether its limit was responsible for
> > reclaim scans in any children.  IMO, that is asking the question
> > backwards.  Instead, there is a cgroup under reclaim and one wants to
> > find out the cause for that.  Not the other way round.
> >
> > In my original proposal I suggested differentiating reclaim caused by
> > internal pressure (due to own limit) and reclaim caused by
> > external/hierarchical pressure (due to limits from parents).
> >
> > If you want to find out why C is under reclaim, look at its reclaim
> > statistics.  If the _limit numbers are high, C's limit is the problem.
> > If the _hierarchical numbers are high, the problem is B, A, or
> > physical memory, so you check B for _limit and _hierarchical as well,
> > then move on to A.
> >
> > Implementing this would be as easy as passing not only the memcg to
> > scan (victim) to the reclaim code, but also the memcg /causing/ the
> > reclaim (root_mem):
> >
> >        root_mem == victim -> account to victim as _limit
> >        root_mem != victim -> account to victim as _hierarchical
> >
> > This would make things much simpler and more natural, both the code
> > and the way of tracking down a problem, IMO.
> 
> This is pretty much the stats I am currently using for debugging the
> reclaim patches. For example:
> 
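The accounting rule quoted above (root_mem == victim counts as _limit, anything else as _hierarchical) is small enough to sketch directly. The types and the function name are hypothetical. Treating global reclaim as a NULL root_mem is an assumption of this sketch, so global pressure also lands in _hierarchical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-memcg reclaim counters, split by cause. */
struct reclaim_stats {
	unsigned long scanned_limit;        /* own limit caused the reclaim */
	unsigned long scanned_hierarchical; /* a parent's limit or global pressure */
};

struct memcg {
	struct reclaim_stats stats;
};

/*
 * victim: the memcg being scanned.
 * root_mem: the memcg whose limit triggered reclaim (NULL = global).
 */
void account_scan(struct memcg *victim, const struct memcg *root_mem,
		  unsigned long nr)
{
	assert(victim != NULL);
	if (root_mem == victim)
		victim->stats.scanned_limit += nr;
	else
		victim->stats.scanned_hierarchical += nr;
}
```

To find out why C is under reclaim, one reads C's two counters first and only moves on to B and A when the _hierarchical number dominates.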