Andi Kleen | 2 Jun 2009 22:13

Re: alloc_pages_node(... GFP_KERNEL | GFP_THISNODE ...) fails

> I tried to build a test module to recreate the problem, but the test
> module doesn't exhibit the bug.  I assume that is because there is
> something different with the context from which the allocation is
> happening.  I tried to follow the code path of alloc_pages_node() to find
> out what that difference might be, but didn't succeed.  Hence I attached a
> boiled down version of my changes that still demonstrate the problem. 
> It's really a one line change in kvm.

Yes it looks harmless.
> 
> I was hoping that someone with real numa hardware and kvm capabilities
> would be able to test this tiny patch in order to find out if the problem
> lies with the fake numa or indeed the context of the allocation so I could
> focus my efforts on the right target.

One way you could test that yourself would be to fake real NUMA:
hardcode a very simple SRAT into drivers/acpi/numa.c that splits
your memory and cores in half. That should be about equivalent to the
real thing at the Linux level. Originally fake NUMA worked like this too,
but that got changed later.

> Alternatively, I'd also be grateful if anybody had any idea why
> alloc_pages_node() would behave differently in different context and what
> that difference would be.

Maybe someone set a task memory policy? I thought NUMA support for that got
added to KVM qemu at some point.  You can check by stracing qemu
or by dumping the policies of the kvm task.
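
For the second check, the per-mapping policies of a task are exposed in
/proc/<pid>/numa_maps. A minimal sketch of dumping them, assuming each line
starts with the mapping's start address followed by the policy name; the
helper names numa_maps_policy() and dump_task_policies() are made up for
illustration:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the memory policy field from one line of /proc/<pid>/numa_maps.
 * Assumed line layout: "<start-address> <policy> <key=value> ...".
 * Returns 0 on success, -1 if the line does not match that layout.
 */
int numa_maps_policy(const char *line, char *policy, size_t len)
{
	unsigned long start;
	char field[64];

	assert(line != NULL && policy != NULL && len > 0);
	if (sscanf(line, "%lx %63s", &start, field) != 2)
		return -1;
	snprintf(policy, len, "%s", field);
	return 0;
}

/* Print every mapping of a task whose policy is not "default". */
int dump_task_policies(int pid)
{
	char path[64], line[4096], policy[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/numa_maps", pid);
	f = fopen(path, "r");
	if (f == NULL)
		return -1;	/* no such task, no permission, or no NUMA support */
	while (fgets(line, sizeof(line), f) != NULL) {
		if (numa_maps_policy(line, policy, sizeof(policy)) == 0 &&
		    strcmp(policy, "default") != 0)
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}
```

Calling dump_task_policies() with the pid of the qemu/kvm process prints only
the mappings that carry a non-default policy, which should be enough to see
whether a policy is in play at all.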

Actually you specified GFP_THISNODE which should avoid this, but maybe

David Rientjes | 3 Jun 2009 00:59

Re: alloc_pages_node(... GFP_KERNEL | GFP_THISNODE ...) fails

On Fri, 29 May 2009, Max Laier wrote:

> I'm trying to replace a call to alloc_page() with alloc_pages_node() in
> order to membind the allocation to a specific node set.  This seemed to
> work - as page_to_nid() returns the selected node.  However, the free
> memory in that node isn't decreasing.
> 

I'm assuming that pgalloc_{zone} is being incremented correctly in 
/proc/vmstat if you have CONFIG_VM_EVENT_COUNTERS enabled (and that you're 
using a recent kernel, you never mentioned your version).

Keep in mind the vm stat threshold which will purposefully neglect to 
update ZVCs if it is not reached.

nr_free_pages in /proc/zoneinfo won't decrease if the allocated page was 
from the pcp and is order 0; check the difference in `count' for the 
allocating cpu's pageset on the target node in the same file.
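
The pcp check can be scripted. A rough sketch, assuming the /proc/zoneinfo
layout of kernels from this period, where each zone starts with a
"Node N, zone NAME" line and each pageset entry has a "cpu: N" line followed
by a "count: N" line; pcp_count() is a hypothetical helper, not an existing
API:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the per-cpu pageset `count' for (node, zone, cpu) from a stream in
 * /proc/zoneinfo format, or -1 if no matching entry is found.
 */
long pcp_count(FILE *f, int node, const char *zone, int cpu)
{
	char line[256], name[32];
	int in_zone = 0, cur_cpu = -1, n;
	long count;

	assert(f != NULL && zone != NULL);
	while (fgets(line, sizeof(line), f) != NULL) {
		if (sscanf(line, "Node %d, zone %31s", &n, name) == 2) {
			/* a new zone section starts here */
			in_zone = (n == node && strcmp(name, zone) == 0);
			cur_cpu = -1;
		} else if (in_zone && sscanf(line, " cpu: %d", &n) == 1) {
			cur_cpu = n;
		} else if (in_zone && cur_cpu == cpu &&
			   sscanf(line, " count: %ld", &count) == 1) {
			return count;
		}
	}
	return -1;
}
```

Open /proc/zoneinfo with fopen(), call pcp_count() for the allocating cpu and
the target node before and after the allocation, and compare the two values.
If `count' drops while nr_free_pages stays put, that would suggest the page
came from the pcp list.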
--
To unsubscribe from this list: send the line "unsubscribe linux-numa" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Christoph Lameter | 3 Jun 2009 16:04

Re: alloc_pages_node(... GFP_KERNEL | GFP_THISNODE ...) fails

On Tue, 2 Jun 2009, David Rientjes wrote:

> I'm assuming that pgalloc_{zone} is being incremented correctly in
> /proc/vmstat if you have CONFIG_VM_EVENT_COUNTERS enabled (and that you're
> using a recent kernel, you never mentioned your version).
>
> Keep in mind the vm stat threshold which will purposefully neglect to
> update ZVCs if it is not reached.

ZVCs will be updated after a second or so. The threshold is to bound
deviations within the ZVC update interval.


David Rientjes | 3 Jun 2009 20:24

Re: alloc_pages_node(... GFP_KERNEL | GFP_THISNODE ...) fails

On Wed, 3 Jun 2009, Christoph Lameter wrote:

> > I'm assuming that pgalloc_{zone} is being incremented correctly in
> > /proc/vmstat if you have CONFIG_VM_EVENT_COUNTERS enabled (and that you're
> > using a recent kernel, you never mentioned your version).
> >
> > Keep in mind the vm stat threshold which will purposefully neglect to
> > update ZVCs if it is not reached.
> 
> ZVCs will be updated after a second or so. The threshold is to bound
> deviations within the ZVC update interval.
> 

The point is that you can see disagreement between NUMA_HIT and 
NR_FREE_PAGES depending on when you collect your statistics if the 
threshold has been reached for one ZVC and not another inside the 
interval.
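
To make the lag concrete, here is a simplified userspace model of a ZVC, not
the kernel code itself: each cpu accumulates a small delta and folds it into
the global value only once the delta exceeds the threshold, and a periodic
refresh folds whatever remains. The struct, the function names and NR_CPUS
are made up for illustration:

```c
#include <assert.h>

#define NR_CPUS 2	/* hypothetical cpu count for this model */

/* Simplified userspace model of one zoned VM counter (ZVC). */
struct zvc_model {
	long global;		/* value readers see, e.g. in /proc/zoneinfo */
	int diff[NR_CPUS];	/* per-cpu deltas not yet folded in */
	int threshold;		/* stands in for the vm stat threshold */
};

/* Apply delta on one cpu; fold into the global value only past the threshold. */
void zvc_mod(struct zvc_model *z, int cpu, int delta)
{
	int x;

	assert(cpu >= 0 && cpu < NR_CPUS);
	x = z->diff[cpu] + delta;
	if (x > z->threshold || x < -z->threshold) {
		z->global += x;
		x = 0;
	}
	z->diff[cpu] = x;
}

/* Periodic fold, standing in for the update that happens every second or so. */
void zvc_refresh(struct zvc_model *z)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		z->global += z->diff[cpu];
		z->diff[cpu] = 0;
	}
}
```

With a threshold of 4, two single-page decrements on one cpu leave `global'
unchanged until either a further change pushes the delta past the threshold
or zvc_refresh() runs. That is the window in which two counters can disagree.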

This likely isn't the case for Max since it's easily reproducible; I would 
suggest checking the `count' for the allocating cpu's pcp from 
/proc/zoneinfo as I earlier described.

Andi Kleen | 5 Jun 2009 11:33

[PATCH] numactl: Support dumping nodes of pages in shared memory segments


I got a request to be able to dump the node of each individual page of a
shared memory segment. This can be very verbose for interleaving on large
segments, but is still useful for some analysis.

This patch implements that with a new --dump-nodes option.
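
For reference, one way to find the node backing a page from userspace is
move_pages(2) with a NULL nodes array, which only queries and does not
migrate. This is a sketch for the calling process, not necessarily how the
patch does it (the shm.c hunk is not shown here); node_of_address() is a
hypothetical helper, and the raw syscall is used to avoid depending on
libnuma headers:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Return the NUMA node backing the page that contains addr in the calling
 * process, or a negative errno value if the kernel cannot report it.
 */
int node_of_address(void *addr)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	void *page;
	int status = -ENOSYS;
	long ret;

	assert(addr != NULL && pagesize > 0);
	/* round down to the start of the page */
	page = (void *)((uintptr_t)addr & ~((uintptr_t)pagesize - 1));

	/* nodes == NULL: query only, nothing is migrated */
	ret = syscall(SYS_move_pages, 0, 1UL, &page, (void *)NULL, &status, 0);
	if (ret < 0)
		return -errno;
	return status;	/* node number, or -errno for this page */
}
```

The page has to be touched first; an untouched page is typically reported as
-ENOENT. On kernels without NUMA support the call fails and the helper
returns a negative errno.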

Against numactl 2.0.3-rc3

-Andi

---
 numactl.8 |    5 +++++
 numactl.c |   13 ++++++++++---
 shm.c     |   31 +++++++++++++++++++++++++++++++
 shm.h     |    1 +
 4 files changed, 47 insertions(+), 3 deletions(-)

Index: numactl-2.0.3-rc3/numactl.c
===================================================================
--- numactl-2.0.3-rc3.orig/numactl.c
+++ numactl-2.0.3-rc3/numactl.c
@@ -49,6 +49,7 @@ struct option opts[] = {
 	{"strict", 0, 0, 't'},
 	{"shmmode", 1, 0, 'M'},
 	{"dump", 0, 0, 'd'},
+	{"dump-nodes", 0, 0, 'D'},
 	{"shmid", 1, 0, 'I'},
 	{"huge", 0, 0, 'u'},
 	{"touch", 0, 0, 'T'},

Cliff Wickman | 5 Jun 2009 15:54

Re: [PATCH] numactl: Support dumping nodes of pages in shared memory segments


Thanks Andi.  I tested and applied your patch.
It's part of numactl-2.0.3-rc4.tar.gz
(ftp://oss.sgi.com/www/projects/libnuma/download/)

It's been a year(!) since the 2.0.2 release.

I think it's about time to release the current libnuma/numactl as 2.0.3.
Does anyone have other fixes or enhancements forthcoming?

-Cliff

On Fri, Jun 05, 2009 at 11:33:24AM +0200, Andi Kleen wrote:
> 
> I got a request to be able to dump the node of each individual page of a
> shared memory segment. This can be very verbose for interleaving on large
> segments, but is still useful for some analysis.
> 
> This patch implements that with a new --dump-nodes option 
> 
> Against numactl 2.0.3-rc3
> 
> -Andi
> 
> ---
>  numactl.8 |    5 +++++
>  numactl.c |   13 ++++++++++---
>  shm.c     |   31 +++++++++++++++++++++++++++++++
>  shm.h     |    1 +
>  4 files changed, 47 insertions(+), 3 deletions(-)

Andi Kleen | 5 Jun 2009 20:37

Re: [PATCH] numactl: Support dumping nodes of pages in shared memory segments

On Fri, Jun 05, 2009 at 08:54:25AM -0500, Cliff Wickman wrote:
> 
> Thanks Andi.  I tested and applied your patch.
> It's part of numactl-2.0.3-rc4.tar.gz
> (ftp://oss.sgi.com/www/projects/libnuma/download/)
> 
> 
> 
> It's been a year(!) since the 2.0.2 release.
> 
> I think it's about time to release the current libnuma/numactl as 2.0.3.
> Does anyone have other fixes or enhancements forthcoming?

I got a request to display large pages in --hardware. I haven't had time
to code that up yet, and I'll also be gone for the next two weeks.

You probably don't want to delay a release for that long; I can submit it
after I'm back.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

Cliff Wickman | 10 Jun 2009 15:09

2.0.3 release of libnuma/numactl


The current version of libnuma and numactl is now released as 2.0.3:

  ftp://oss.sgi.com/www/projects/libnuma/download/numactl-2.0.3.tar.gz

Thanks to the many contributors of fixes and enhancements to this release.
At risk of forgetting some, they include:
Amit Arora
Arnd Bergmann
Bill Buros
Mike Frysinger
Brice Goglin
Daniel Gollub
Frederik Himpe
Andi Kleen
Kornilios Kourtis
Lee Schermerhorn
Bernhard Walle

-Cliff
-- 
Cliff Wickman
SGI
cpw@sgi.com
(651) 683-3824

Stefan Lankes | 11 Jun 2009 20:45

RE: [RFC PATCH 0/4]: affinity-on-next-touch

> Your patches seem to have a lot of overlap with
> Lee Schermerhorn's old migrate memory on cpu migration patches.
> I don't know the status of those.

I analyzed Lee Schermerhorn's migrate-memory-on-cpu-migration patches
(http://free.linux.hp.com/~lts/Patches/PageMigration/). I think Lee
Schermerhorn adds similar functionality to the kernel. He calls the
"affinity-on-next-touch" functionality "migrate_on_fault" and uses the
normal NUMA memory policies in his patches. Therefore, his solution fits
the Linux kernel better. I tested his patches with our test applications and
got nearly the same performance results.

I found patches only for kernel 2.6.25-rc2-mm1. Is anyone developing
these patches further?

Stefan


Andi Kleen | 12 Jun 2009 12:32

Re: [RFC PATCH 0/4]: affinity-on-next-touch

On Thu, Jun 11, 2009 at 08:45:29PM +0200, Stefan Lankes wrote:
> > Your patches seem to have a lot of overlap with
> > Lee Schermerhorn's old migrate memory on cpu migration patches.
> > I don't know the status of those.
> 
> I analyzed Lee Schermerhorn's migrate-memory-on-cpu-migration patches
> (http://free.linux.hp.com/~lts/Patches/PageMigration/). I think Lee
> Schermerhorn adds similar functionality to the kernel. He calls the
> "affinity-on-next-touch" functionality "migrate_on_fault" and uses the
> normal NUMA memory policies in his patches. Therefore, his solution fits
> the Linux kernel better. I tested his patches with our test applications and
> got nearly the same performance results.

That's great to know.

I didn't think he had a per process setting though, did he?

> I found patches only for kernel 2.6.25-rc2-mm1. Is anyone developing
> these patches further?

Not to my knowledge. Maybe Lee will pick them up again now that there
are more use cases.

If he doesn't have time maybe you could update them?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.