KAMEZAWA Hiroyuki | 1 Feb 2010 01:01

Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling

On Fri, 29 Jan 2010 13:07:01 -0800 (PST)
David Rientjes <rientjes <at> google.com> wrote:

> On Sat, 30 Jan 2010, KAMEZAWA Hiroyuki wrote:
> 
> > okay...I guess the cause of the problem Vedran met came from
> > this calculation.
> > ==
> >         /*
> >          * Processes which fork a lot of child processes are likely
> >          * a good choice. We add half the vmsize of the children if they
> >          * have an own mm. This prevents forking servers to flood the
> >          * machine with an endless amount of children. In case a single
> >          * child is eating the vast majority of memory, adding only half
> >          * to the parents will make the child our kill candidate of choice.
> >          */
> >         list_for_each_entry(child, &p->children, sibling) {
> >                 task_lock(child);
> >                 if (child->mm != mm && child->mm)
> >                         points += child->mm->total_vm/2 + 1;
> >                 task_unlock(child);
> >         }
> >
> > ==
> > This makes the task launcher (the first child of some daemon) the first victim.
> 
> That "victim", p, is passed to oom_kill_process() which does this:
> 
> 	/* Try to kill a child first */
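The effect of the quoted loop can be reproduced with a small userspace model. This is only a sketch: `mm_model`, `task_model` and `badness_model` are hypothetical stand-ins rather than kernel names, and starting the score at the task's own total_vm is an assumption, since the quoted excerpt shows only the child loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct mm_model {
	unsigned long total_vm;		/* pages mapped by this address space */
};

struct task_model {
	struct mm_model *mm;		/* NULL for a task without an mm */
	struct task_model **children;	/* direct children only */
	size_t nr_children;
};

/*
 * Base score plus the child contribution from the quoted loop:
 * half of each child's total_vm, plus one, for every child that has
 * its own (non-NULL, different) mm.
 */
unsigned long badness_model(const struct task_model *p)
{
	unsigned long points;
	size_t i;

	assert(p != NULL);
	if (!p->mm)
		return 0;

	points = p->mm->total_vm;	/* assumed starting point */
	for (i = 0; i < p->nr_children; i++) {
		const struct task_model *child = p->children[i];

		if (child->mm != p->mm && child->mm)
			points += child->mm->total_vm / 2 + 1;
	}
	return points;
}
```

With a launcher mapping 100 pages and three children mapping 1000 pages each in their own address spaces, the model scores the launcher at 100 + 3 * 501 = 1603 against 1000 for each child, so the small launcher ranks first.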

Chris Frost | 1 Feb 2010 03:06

Re: [PATCH] mm/readahead.c: update the LRU positions of in-core pages, too

On Sun, Jan 31, 2010 at 10:31:42PM +0800, Wu Fengguang wrote:
> On Tue, Jan 26, 2010 at 09:32:17PM +0800, Wu Fengguang wrote:
> > On Mon, Jan 25, 2010 at 03:36:35PM -0700, Chris Frost wrote:
> > > I changed Wu's patch to add a PageLRU() guard that I believe is required
> > > and optimized zone lock acquisition to only unlock and lock at zone changes.
> > > This optimization seems to provide a 10-20% system time improvement for
> > > some of my GIMP benchmarks and no improvement for other benchmarks.
> 
> I feel very uncomfortable about this put_page() inside zone->lru_lock. 
> (might deadlock: put_page() conditionally takes zone->lru_lock again)
> 
> If you really want the optimization, can we do it like this?

Sorry that I was slow to respond. (I was out of town.)

Thanks for catching __page_cache_release() locking the zone.
I think staying simple for now sounds good. The below locks
and unlocks the zone for each page. Look good?

---
readahead: retain inactive lru pages to be accessed soon
From: Chris Frost <frost <at> cs.ucla.edu>

Ensure that cached pages in the inactive list are not prematurely evicted;
move such pages to lru head when they are covered by
- in-kernel heuristic readahead
- a posix_fadvise(POSIX_FADV_WILLNEED) hint from an application

Before this patch, pages already in core may be evicted before the
pages covered by the same prefetch scan but that were not yet in core.
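The trade-off between the two locking strategies discussed in this thread can be sketched by counting lock/unlock pairs over a run of pages. This is a userspace model only: zones are represented by plain integer ids, and both function names are hypothetical. It says nothing about the put_page() deadlock concern beyond showing what the simpler per-page variant costs.

```c
#include <assert.h>
#include <stddef.h>

/* Per-page locking: one lock/unlock pair for every page visited. */
size_t lock_pairs_per_page(const int *zone_of_page, size_t nr_pages)
{
	assert(zone_of_page != NULL || nr_pages == 0);
	return nr_pages;
}

/* Batched locking: drop and retake the lock only when the zone changes. */
size_t lock_pairs_batched(const int *zone_of_page, size_t nr_pages)
{
	size_t pairs = 0;
	size_t i;

	assert(zone_of_page != NULL || nr_pages == 0);
	for (i = 0; i < nr_pages; i++) {
		if (i == 0 || zone_of_page[i] != zone_of_page[i - 1])
			pairs++;
	}
	return pairs;
}
```

For six pages in zones 0, 0, 0, 1, 1, 0 the per-page variant takes the lock six times and the batched variant three times. The per-page variant gives up that saving in exchange for never holding the lock across a call that might take it again.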

Wu Fengguang | 1 Feb 2010 03:17

Re: [PATCH] mm/readahead.c: update the LRU positions of in-core pages, too

On Sun, Jan 31, 2010 at 07:06:39PM -0700, Chris Frost wrote:
> On Sun, Jan 31, 2010 at 10:31:42PM +0800, Wu Fengguang wrote:
> > On Tue, Jan 26, 2010 at 09:32:17PM +0800, Wu Fengguang wrote:
> > > On Mon, Jan 25, 2010 at 03:36:35PM -0700, Chris Frost wrote:
> > > > I changed Wu's patch to add a PageLRU() guard that I believe is required
> > > > and optimized zone lock acquisition to only unlock and lock at zone changes.
> > > > This optimization seems to provide a 10-20% system time improvement for
> > > > some of my GIMP benchmarks and no improvement for other benchmarks.
> > 
> > I feel very uncomfortable about this put_page() inside zone->lru_lock. 
> > (might deadlock: put_page() conditionally takes zone->lru_lock again)
> > 
> > If you really want the optimization, can we do it like this?
> 
> Sorry that I was slow to respond. (I was out of town.)
> 
> Thanks for catching __page_cache_release() locking the zone.
> I think staying simple for now sounds good. The below locks
> and unlocks the zone for each page. Look good?

OK :)

Thanks,
Fengguang

> ---
> readahead: retain inactive lru pages to be accessed soon
> From: Chris Frost <frost <at> cs.ucla.edu>
> 
> Ensure that cached pages in the inactive list are not prematurely evicted;

Shaohui Zheng | 1 Feb 2010 05:12

[Patch - Resend v4] Memory-Hotplug: Fix the bug on interface /dev/mem for 64-bit kernel

Send the v4 patch to mailing list.

changelog:
v1: add memory range to e820 table, and update the variables max_pfn/max_low_pfn/high_memory
	Peter[hpa]: memory hotplug makes sense on 32-bit kernel for virtual environment.
	Andi[ak]: No VM supports it currently, and 64bit is the important part for hotplug
	Fengguang[wfg]: some review comments on function naming
v2: rename function update_pfn to update_end_of_memory_vars
	KAMEZAWA[kame]: e820map is considered to be stable, read-only after boot.
	Fengguang[wfg]: rewrite function page_is_ram, make it friendly for hotplug.
v3: keep the old e820map, update the variables max_pfn, high_memory only
	KAMEZAWA[kame]: suggest to use memory hotplug notifier. If it's allowed, it will
					be cleaner.
	Fengguang[wfg]: notifier is for _outsider_ subsystems. It smells a bit overkill to do 
					notifier _inside_ the hotplug code.
	Fengguang[wfg]: suggest to update max_pfn/max_low_pfn/high_memory in arch/x86/mm/init_64.c:
					arch_add_memory() now, for X86_64.  
					Later on we can add code to arch/x86/mm/init_32.c:arch_add_memory() for X86_32.
v4: update max_pfn/max_low_pfn/high_memory in arch/x86/mm/init_64.c:arch_add_memory() for x86_64,
	and resend it to the mailing list, since Fengguang's page_is_ram patch was accepted.

Memory-Hotplug: Fix the bug on interface /dev/mem for 64-bit kernel

The newly added memory cannot be accessed through the /dev/mem interface, because
we do not update the variables high_memory, max_pfn and max_low_pfn.

Add a function update_end_of_memory_vars to update these variables for the 64-bit
kernel.

CC: Andi Kleen <ak <at> linux.intel.com>

Wu Fengguang | 1 Feb 2010 05:41

Re: [Patch - Resend v4] Memory-Hotplug: Fix the bug on interface /dev/mem for 64-bit kernel

Shaohui,

Some style nitpicks..

>  #ifdef CONFIG_MEMORY_HOTPLUG
> +/**

Should use /* here. 

> + * After memory hotplug, the variable max_pfn, max_low_pfn and high_memory will
> + * be affected, it will be updated in this function.
> + */
> +static inline void __meminit update_end_of_memory_vars(u64 start,

The "inline" and "__meminit" are both redundant here.

> +		max_low_pfn = max_pfn = end_pfn;

One assignment per line is preferred.

Thanks,
Fengguang


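A rough userspace model of the helper with the three review comments applied (plain /* comment, no "inline" or "__meminit", one assignment per line) might look like the following. Several things here are assumptions rather than the patch itself: the 4 KiB page size, the `end_pfn > max_pfn` guard, and the globals, which merely stand in for the kernel variables of the same names. high_memory is left out because it depends on the kernel's virtual address mapping.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12			/* assumed 4 KiB pages */
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)
#define PFN_UP(x)  (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

/* Stand-ins for the kernel globals of the same names. */
unsigned long max_pfn;
unsigned long max_low_pfn;

/*
 * After memory hotplug, raise max_pfn and max_low_pfn if the new
 * range ends above the current end of memory.
 */
void update_end_of_memory_vars(uint64_t start, uint64_t size)
{
	unsigned long end_pfn;

	assert(size > 0);
	end_pfn = (unsigned long)PFN_UP(start + size);

	if (end_pfn > max_pfn) {	/* assumed guard, not shown in the excerpt */
		max_pfn = end_pfn;
		max_low_pfn = end_pfn;
	}
}
```

Starting from max_pfn = 0x100000 (4 GiB), adding 1 GiB at physical address 0x100000000 moves both variables to 0x140000, while a range that ends below the current maximum leaves them unchanged.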

Andrea Arcangeli | 1 Feb 2010 08:45

Re: [PATCH 04 of 31] update futex compound knowledge

There was a little problem in the futex code, basically we can't take
PG_lock on a page that isn't pinned. This should fix it. I never had a
problem in practice. Untested still.. just wanted to update you on
this one.

diff --git a/kernel/futex.c b/kernel/futex.c
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -258,6 +258,18 @@ again:
 		local_irq_disable();
 		if (likely(__get_user_pages_fast(address, 1, 1, &page) == 1)) {
 			page_head = compound_head(page);
+			/*
+			 * page_head is valid pointer but we must pin
+			 * it before taking the PG_lock and/or
+			 * PG_compound_lock. The moment we re-enable
+			 * irqs __split_huge_page_splitting() can
+			 * return and the head page can be freed from
+			 * under us. We can't take the PG_lock and/or
+			 * PG_compound_lock on a page that could be
+			 * freed from under us.
+			 */
+			if (page != page_head)
+				get_page(page_head);
 			local_irq_enable();
 		} else {
 			local_irq_enable();
@@ -266,6 +278,8 @@ again:
 	}
 #else

Tom Zanussi | 1 Feb 2010 09:17

Re: [RFC PATCH -tip 2/2 v2] add a scripts for pagecache usage per process

Hi,

On Mon, 2010-01-25 at 17:16 -0500, Keiichi KII wrote:
> (2010-01-23 03:21), Tom Zanussi wrote:
> > Hi,
> > 
> > On Fri, 2010-01-22 at 19:08 -0500, Keiichi KII wrote:
> >> The scripts are implemented based on the trace stream scripting support.
> >> And the scripts implement the following.
> >>  - how many pagecaches each process has per each file
> >>  - how many pages are cached per each file
> >>  - how many pagecaches each process shares
> >>
> > 
> > Nice, it looks like a very useful script - I gave it a quick try and it
> > seems to work well...
> > 
> > The only problem I see, nothing to do with your script and nothing you
> > can do anything about at the moment, is that the record step generates a
> > huge amount of data, which of course makes the event processing take
> > awhile.  A lot of it appears to be due to perf itself - being able to
> > filter out the perf-generated events in the kernel would make a big
> > difference, I think; you normally don't want to see those anyway...
> 
> Yes, right. I don't want to process the data of perf itself.
> I will try to find any way to solve this problem. 
> 

Here's one way, using the tracepoint filters - it does make a big
difference in this case.

Nick Piggin | 1 Feb 2010 07:15

Re: [PATCH -mm] remove VM_LOCK_RMAP code

On Fri, Jan 29, 2010 at 07:34:10PM -0500, Rik van Riel wrote:
> When a VMA is in an inconsistent state during setup or teardown, the
> worst that can happen is that the rmap code will not be able to find
> the page.

OK, but you missed the interesting thing, which is to explain why
that worst case is not a problem.

rmap of course is not just used for reclaim but also invalidations
from mappings, and those guys definitely need to know that all
page table entries have been handled by the time they return.

> 
> It is also impossible for the rmap code to follow a pointer to an
> already freed VMA, because the rmap code holds the anon_vma->lock,
> which the VMA teardown code needs to take before the VMA is removed
> from the anon_vma chain.
> 
> Hence, we should not need the VM_LOCK_RMAP locking at all.
> 
> Sent as a separate patch because I would appreciate it if others
> could verify my logic :)
> 
> Signed-off-by: Rik van Riel <riel <at> redhat.com>
> ---
>  include/linux/mm.h |    4 ----
>  mm/mmap.c          |   15 ---------------
>  mm/rmap.c          |   12 ------------
>  3 files changed, 0 insertions(+), 31 deletions(-)
> 

Mel Gorman | 1 Feb 2010 11:29

Re: PROBLEM: kernel BUG at mm/page_alloc.c:775

On Fri, Jan 29, 2010 at 11:01:57PM +0100, Michail Bachmann wrote:
> > > On Tue, Jan 12, 2010 at 03:25:23PM -0600, Christoph Lameter wrote:
> > > > On Sat, 9 Jan 2010, Michail Bachmann wrote:
> > > > > [   48.505381] kernel BUG at mm/page_alloc.c:775!
> > > >
> > > > Somehow nodes got mixed up or the lookup tables for pages / zones are
> > > > not giving the right node numbers.
> > >
> > > Agreed. On this type of machine, I'm not sure how that could happen
> > > short of struct page information being corrupted. The range should
> > > always be aligned to a pageblock boundary and I cannot see how that
> > > would cross a zone boundary on this machine.
> > >
> > > Does this machine pass memtest?
> > 
> > I ran one pass with memtest86 without errors before posting this bug, but I
> > can let it run "all tests" for a while just to be sure it is not caused by
> > broken hw.
> 
> Please disregard this bug report. After running memtest for more than 10 hours 
> it found a memory error.

I'm sorry to hear it but at least the source of the bug is known.

> The funny thing is, linux found it much faster...
> 

It could be that your power supply is slightly too inefficient and the
errors only occur when all cores are active or all disks - something
Linux might do easily whereas memtest does not necessarily stress the

David Rientjes | 1 Feb 2010 11:33

Re: [PATCH v3] oom-kill: add lowmem usage aware oom kill handling

On Sun, 31 Jan 2010, Vedran Furac wrote:

> > You snipped the code segment where I demonstrated that the selected task 
> > for oom kill is not necessarily the one chosen to die: if there is a child 
> > with disjoint memory that is killable, it will be selected instead.  If 
> > Xorg or sshd is being chosen for kill, then you should investigate why 
> > that is, but there is nothing random about how the oom killer chooses 
> > tasks to kill.
> 
> I know that it isn't random, but it sure looks like that to the end user
> and I use it to emphasize the problem. And about me investigating, that's
> simply not possible as I am not a kernel hacker who understands the code
> beyond the syntax level. I can only point to the problem in hope that
> someone will fix it.
> 

Disregarding the opportunity that userspace has to influence the oom 
killer's selection for a moment, it really tends to favor killing tasks 
that are the largest in size.  Tasks that typically get the highest 
badness score are those that have the highest mm->total_vm, it's that 
simple.  There are definitely corner cases where the first generation 
children have a strong influence, but they are often killed either as a 
result of themselves being a thread group leader with separate memory from 
the parent or as the result of the oom killer killing a task with separate 
memory before the selected task.  It's completely natural for the oom 
killer to select bash, for example, when in actuality it will kill a 
memory leaker that has a high badness score as a result of the logic in 
oom_kill_process().

If you have specific logs that you'd like to show, please enable 
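The distinction drawn in this thread between the task that is selected and the task that is actually killed can be sketched as follows. This is a simplified userspace model of the "try to kill a child first" step, not the kernel function: `victim_model`, its integer `mm_id`, the `killable` flag and `oom_victim_model` are all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct victim_model {
	int mm_id;			/* 0 means the task has no mm */
	int killable;			/* hypothetical: 0 if exempt from killing */
	struct victim_model **children;
	size_t nr_children;
};

/*
 * Given the task with the highest badness score, return the task that
 * would actually be killed: the first killable child with its own
 * address space, otherwise the selected task itself.
 */
const struct victim_model *oom_victim_model(const struct victim_model *p)
{
	size_t i;

	assert(p != NULL);
	for (i = 0; i < p->nr_children; i++) {
		const struct victim_model *c = p->children[i];

		if (c->mm_id == p->mm_id)
			continue;	/* shares memory with the parent */
		if (c->mm_id != 0 && c->killable)
			return c;
	}
	return p;
}
```

In this model a shell selected because of a leaking child with a separate address space is spared and the child is returned instead; a parent whose children all share its mm is returned itself.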

