Andrew Morton | 1 Aug 06:45 2004

Re: [PATCH] Create cpu_sibling_map for PPC64

Matthew Dobson <colpatch@us.ibm.com> wrote:
>
> In light of some proposed changes in the sched_domains code, I coded up
>  this little ditty that simply creates and populates a cpu_sibling_map
>  for PPC64 machines.

err, did you compile it?

--- 25-power4/arch/ppc64/kernel/smp.c~create-cpu_sibling_map-for-ppc64-fix	2004-07-31 21:43:18.620845384 -0700
+++ 25-power4-akpm/arch/ppc64/kernel/smp.c	2004-07-31 21:43:21.258444408 -0700
@@ -64,7 +64,7 @@ cpumask_t cpu_possible_map = CPU_MASK_NO
 cpumask_t cpu_online_map = CPU_MASK_NONE;
 cpumask_t cpu_available_map = CPU_MASK_NONE;
 cpumask_t cpu_present_at_boot = CPU_MASK_NONE;
-cpumask_t cpu_sibling_map[NR_CPUS] = { [0 .. NR_CPUS-1] = CPU_MASK_NONE };
+cpumask_t cpu_sibling_map[NR_CPUS] = { [0 ... NR_CPUS-1] = CPU_MASK_NONE };

 EXPORT_SYMBOL(cpu_online_map);
 EXPORT_SYMBOL(cpu_possible_map);
_
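
For readers unfamiliar with the construct being fixed: the initializer uses
GCC's range-designator extension, where "[first ... last] = value" fills
every element in the inclusive range, and the two-dot ".." in the original
patch is simply a syntax error.  A minimal standalone sketch (GNU C,
hypothetical NR_CPUS value, not from the thread):

#include <stdio.h>

#define NR_CPUS 4	/* hypothetical value, just for this demo */

/* GCC's range designator: "[first ... last] = value" initialises every
 * element in the inclusive range.  Note the three dots -- the two-dot
 * "[0 .. NR_CPUS-1]" in the original patch does not compile. */
static int mask[NR_CPUS] = { [0 ... NR_CPUS - 1] = -1 };

int main(void)
{
	int i;

	for (i = 0; i < NR_CPUS; i++)
		printf("mask[%d] = %d\n", i, mask[i]);
	return 0;
}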

Matthew Dobson | 2 Aug 23:16 2004

Re: [PATCH] Create cpu_sibling_map for PPC64

On Sat, 2004-07-31 at 21:45, Andrew Morton wrote:
> Matthew Dobson <colpatch@us.ibm.com> wrote:
> >
> > In light of some proposed changes in the sched_domains code, I coded up
> >  this little ditty that simply creates and populates a cpu_sibling_map
> >  for PPC64 machines.
> 
> err, did you compile it?

Oh, you wanted the version that compiles!?!  ;)  Besides me being really
bad at counting dots, our lab was down for a bit and I didn't test or
compile that before I sent it.  I was sending it more as an RFC than a
patch destined for acceptance (yet).  I still don't have access to a
machine to test it on, but I hope to snag some time soon.  Glad to see
*someone* is reading my code, though, since it obviously isn't me! ;)

-Matt

Matthew Dobson | 2 Aug 23:45 2004

Re: [PATCH] Create cpu_sibling_map for PPC64

On Sat, 2004-07-31 at 21:45, Andrew Morton wrote:
> Matthew Dobson <colpatch@us.ibm.com> wrote:
> >
> > In light of some proposed changes in the sched_domains code, I coded up
> >  this little ditty that simply creates and populates a cpu_sibling_map
> >  for PPC64 machines.
> 
> err, did you compile it?

Ok...  beyond not compiling it, I must have had my brain turned off when
I pushed the patch forward from 2.6.7-mm? to 2.6.8-rc2-mm1, because
cpu_sibling_map[] is already there, although only conditionally defined
for CONFIG_SCHED_SMT.  Talking to PPC developers at OLS, I think
cpu_sibling_map[] has (or more properly, will have) uses outside
sched_domains and thus deserves an unconditional definition.  Here's a
patch to change that.

[mcd@arrakis source]$ diffstat ~/linux/patches/ppc64-cpu_sibling_map.patch
 arch/ppc64/kernel/smp.c |    7 +------
 include/asm-ppc64/smp.h |    4 +---
 2 files changed, 2 insertions(+), 9 deletions(-)

-Matt

diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.8-rc2-mm1/arch/ppc64/kernel/smp.c linux-2.6.8-rc2-mm1+ppc64-cpu_sibling_map/arch/ppc64/kernel/smp.c
--- linux-2.6.8-rc2-mm1/arch/ppc64/kernel/smp.c	2004-07-28 10:50:29.000000000 -0700
+++ linux-2.6.8-rc2-mm1+ppc64-cpu_sibling_map/arch/ppc64/kernel/smp.c	2004-08-02 14:27:18.000000000 -0700
@@ -55,15 +55,13 @@
[...]

Paul Jackson | 3 Aug 01:35 2004

[PATCH] subset zonelists and big numa friendly mempolicy MPOL_MBIND

I hope that this patch will end up in *-mm soon.

But not yet -- first it must be reviewed by Andi Kleen, the mempolicy
author, and whomever else would like to comment.

This patch adds support for subset_zonelists.  These are zonelists
suitable for passing to __alloc_pages() for allocations that are
restricted to some subset of the system's memory nodes.

This patch also modifies the MPOL_BIND code in mm/mempolicy.c to
use these subset_zonelists.  This replaces an earlier, simpler,
implementation of such restricted zonelists that was in mm/mempolicy.c.

The current implementation provides a single list, sorted in node
order, for all nodes in the subset.  This is numa unfriendly when
used on sets of several memory nodes.  It loads up the first node in
the list, not trying the cpu local node first, and pays no regard to
distance between nodes (not trying near nodes before far nodes).
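
To make the structure being discussed concrete, here is a toy user-space
sketch (invented types and names, not Paul's kernel code) of such a single,
node-ordered, NULL-terminated subset zonelist built from a node bitmask:

#include <stdio.h>
#include <stdlib.h>

#define MAX_NODES 8			/* toy system size */

struct zone { int node_id; };		/* stand-in for the kernel's struct zone */

static struct zone zones[MAX_NODES];	/* pretend there is one zone per node */

/* Build a NULL-terminated subset zonelist: the zones whose node is set in
 * the allowed mask, in ascending node order (the ordering criticised above). */
static struct zone **build_subset_zonelist(unsigned long allowed)
{
	struct zone **list = calloc(MAX_NODES + 1, sizeof(*list));
	int i, n = 0;

	for (i = 0; i < MAX_NODES; i++)
		if (allowed & (1UL << i))
			list[n++] = &zones[i];
	return list;			/* calloc() supplied the trailing NULL */
}

int main(void)
{
	struct zone **zl;
	int i;

	for (i = 0; i < MAX_NODES; i++)
		zones[i].node_id = i;

	zl = build_subset_zonelist(0x36);	/* allow nodes 1, 2, 4 and 5 */
	for (i = 0; zl[i] != NULL; i++)
		printf("try zone on node %d\n", zl[i]->node_id);
	free(zl);
	return 0;
}

In the real kernel the list holds per-zone pointers (several zones per node)
rather than one pointer per node, but the shape -- a flat, NULL-terminated
array that __alloc_pages() walks in order -- is the same.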

This new implementation preserves the "first-touch" preference
(allocate on the node currently executing on, if that node is
in the allowed subset).  It also preserves the zonelist ordering
that favors nearer nodes over further nodes on large NUMA systems.
And it removes the mempolicy restriction that only the highest zone
in the zone hierarchy is policied.

For smaller (a few nodes) dedicated application servers, this is
probably not essential.  But for larger (hundreds of nodes) NUMA
systems, running several simultaneous jobs in disjoint node subsets
of dozens or hundreds of nodes each, the numa friendly placement of
[...]

Andi Kleen | 3 Aug 02:08 2004

Re: [PATCH] subset zonelists and big numa friendly mempolicy MPOL_MBIND

On Mon, 2 Aug 2004 16:35:06 -0700 (PDT)
Paul Jackson <pj@sgi.com> wrote:

> I hope that this patch will end up in *-mm soon.

[...]

That's a *lot* of complexity and overhead you add. I hope it is really 
worth it. I suppose you need this for your cpu memset stuff, right? 
On a large machine it will be quite cache blowing too.

What's the worst case memory usage of all this?
And do you have any benchmarks that show that it is worth it?

My first reaction is that if you really want to do that, you could just pass
the policy node bitmap to alloc_pages and try_to_free_pages
and use the normal per node zone list with the bitmap as filter. 
This way you would get the same effect with a lot less complexity
and only slightly more overhead in the fallback case.
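
Roughly, that alternative amounts to a filter applied while walking the
ordinary zonelist.  A toy sketch (invented types and names, not Andi's code
and not the kernel's):

#include <stddef.h>

struct zone { int node_id; };	/* toy stand-in for the kernel's struct zone */

/* Walk a node's ordinary, distance-sorted zonelist and return the first
 * zone whose node is allowed by the policy bitmap.  A real allocator would
 * keep walking (and eventually reclaim) on failure instead of stopping. */
struct zone *first_allowed_zone(struct zone **zonelist,
				unsigned long allowed_nodes)
{
	int i;

	for (i = 0; zonelist[i] != NULL; i++)
		if (allowed_nodes & (1UL << zonelist[i]->node_id))
			return zonelist[i];
	return NULL;
}

The trade-off is the one Andi names: no per-subset tables to build or
cache, at the cost of testing the bitmap on every zone visited during
fallback.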

For the simple libnuma it was still good enough to construct the 
zones without core changes, but for such complex things it is better
to attack the core of things.

-Andi


Paul Jackson | 3 Aug 04:14 2004

Re: [PATCH] subset zonelists and big numa friendly mempolicy MPOL_MBIND

Thanks for your quick reply, Andi.

> I suppose you need this for your cpu memset stuff, right? 
> On a large machine it will be quite cache blowing too.

Yes - subset zonelists will also be used in Simon Derr's and my cpuset
stuff.  As to cache blowing, actually I'd hope not.  Once a subset
zonelist is set up, it shouldn't hit any more cache lines during normal
use than the system NODE_DATA zonelists.  Normal __alloc_pages() use
succeeds with the first zone pointer in the list, in any case.

> What's the worst case memory usage of all this?

They are N-squared, in the number of zones.  On a system with only
GFP_DMA memory, for a job bound to say 8 nodes, that's 72 (8 * 8, plus 8
NULLs) zone pointers, plus a linear array of 3 integer gfp indices per
CPU.  If someone tries to bind something to 254 out of 256 nodes on a
big system, then, yes, it eats about 0.5 Mbytes of space.  They are
shared and kref'd.  Of course, on such a system, the NODE_DATA zone
lists are consuming perhaps 6 Mbytes, if my math is right.  This worst
case 0.5 Mbytes could be trimmed by an order of magnitude, by cutting
off the tails of each zonelist (give up the search after 16 nodes, say)
-- I had that coded, but removed it because I didn't figure it was worth
it.

Basically, on these big numa boxes, the jobs don't overlap - they get
dedicated resources.  So the worst case is one job using _almost_ the
entire machine, needing a single 0.5 Mbyte subset zonelist (on a 256
node, GFP_DMA only, system).  Next worst is two jobs, half the machine
each, needing two subset zonelists of 0.13 Mbytes each (0.25 Mb total). 
[...]
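
For reference, the arithmetic above works out as follows under my own
assumptions (64-bit zone pointers, one zone per node, and one
NULL-terminated list per node in the subset):

#include <stdio.h>

/* Worst-case size of an N-node subset zonelist set: one list per node in
 * the subset, each holding N zone pointers plus a NULL terminator. */
static unsigned long subset_zonelist_bytes(unsigned long nodes,
					   unsigned long ptr_size)
{
	return nodes * (nodes + 1) * ptr_size;
}

int main(void)
{
	/* 8-node job: 8 lists of 8 pointers plus a NULL each = 576 bytes */
	printf("8 nodes:   %lu bytes\n", subset_zonelist_bytes(8, 8));
	/* half of a 256-node machine: roughly 0.13 Mbytes */
	printf("128 nodes: %lu bytes\n", subset_zonelist_bytes(128, 8));
	/* nearly the whole 256-node machine: roughly 0.5 Mbytes */
	printf("254 nodes: %lu bytes\n", subset_zonelist_bytes(254, 8));
	return 0;
}

This matches the figures quoted: about 0.13 Mbytes per half-machine job
and about 0.5 Mbytes for a job spanning nearly all 256 nodes.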

Paul Jackson | 3 Aug 09:58 2004

Re: Re: [PATCH] subset zonelists and big numa friendly mempolicy MPOL_MBIND

Earlier, I (pj) wrote:
> It has poor cache performance on big iron.  For a modest job on a big
> system, the allocator has to walk down an average of 128 out of 256 zone
> pointers in the list, dereferencing each one into the zone struct, then
> into the struct pglist_data, before it finds one that matches an allowed
> node id. That's a nasty memory footprint for a hot code path.

This paragraph is B.S.  Most tasks are running on CPUs that are on nodes
whose memory they are allowed to use.  That node is at the front of the
local zonelist, and they get their memory from the first node they look at.

Damn ... hate it when that happens ;).

Still, either MPOL_BIND needs a more numa friendly set of zonelists,
with a differently sorted list for each node in the set, or its
usefulness for binding to more than one or a few very close nodes, if
you care about memory performance, falls off quickly as the number of
nodes increases.  As you well know, any such numa-friendly set of sorted
zonelists will require space on the Order of N**2, for N the node count,
given the NULL-terminated linear list form in which they must be handed
to __alloc_pages.

I suspect that the English phrase you are searching for now to tell me
is "if it hurts, don't use it ;)."  That is, you are clearly advising me
not to use MPOL_BIND if I need a fancy zonelist sort.

The place I ran into the most complexity doing this in the 2.4 kernel
was in the per-memory region binding.  You're dealing with this in the
2.6 kernels, and when you get to stuff like shared memory and huge
pages, it's not easy.  At least the vma splitting code is better in
[...]

Monica Correa | 4 Aug 20:36 2004

NUMA inter-node load balance

Hi, I'm a Brazilian student and my research area is scalable operating
systems.  Currently, I'm studying the kernel 2.6.7 scheduler (especially
for NUMA systems) and I didn't find where inter-node load balancing is
done.  There was a routine called balance_node in the sched.c file in
kernel 2.6.5, but in version 2.6.7 the scheduler works with scheduler
domains and CPU groups.
I read the sched.c code and I think the load balancing is done among a
specific scheduler domain's CPU groups, but in the
arch_init_sched_domains routine the highest-level scheduler domain
created is the node scheduler domain.  So, I can't understand how and
when the inter-node load balancing is performed.  Is there a scheduler
domain for the entire system?  Where is it set up?

Sorry if this is a stupid question; I'm just a beginner and the Linux
scheduler doesn't have good documentation, so I decided to ask you.

Sorry also for my terrible English, and thank you very much if someone
reads all this message! :)

I hope to receive an answer...

Thanks again.

Monica


Martin J. Bligh | 4 Aug 21:21 2004

Re: NUMA inter-node load balance

> Hi, I'm a Brazilian student and my research area is scalable operating
> systems.  Currently, I'm studying the kernel 2.6.7 scheduler (especially
> for NUMA systems) and I didn't find where inter-node load balancing is
> done.  There was a routine called balance_node in the sched.c file in
> kernel 2.6.5, but in version 2.6.7 the scheduler works with scheduler
> domains and CPU groups.
> I read the sched.c code and I think the load balancing is done among a
> specific scheduler domain's CPU groups, but in the
> arch_init_sched_domains routine the highest-level scheduler domain
> created is the node scheduler domain.  So, I can't understand how and
> when the inter-node load balancing is performed.  Is there a scheduler
> domain for the entire system?  Where is it set up?

Yes, arch_init_sched_domains would create a domain for each node, then one
for the whole system.  Our OLS paper might help you, it's here:

http://ftp.kernel.org/pub/linux/kernel/people/mbligh/presentations/

(might take a couple of minutes to replicate across before it appears).

The presentation slides didn't convert the pictures very well from mgp 
... sorry.

M.
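
A toy picture of what Martin describes (plain C with invented names, not
the kernel's real structures): each CPU sees a chain of domains, balancing
is attempted at each level in turn, and the widest, whole-system level is
where inter-node moves happen.

#include <stdio.h>

/* Invented miniature of a scheduler-domain chain: a node-level domain
 * whose parent is a system-wide domain, which has no parent. */
struct toy_domain {
	const char *span;		/* which CPUs this level covers */
	struct toy_domain *parent;	/* next, wider level */
};

int main(void)
{
	struct toy_domain sys  = { "all CPUs (inter-node balancing)", NULL };
	struct toy_domain node = { "CPUs on this node", &sys };
	struct toy_domain *sd;

	/* Balancing starts at the narrowest domain for a CPU and walks up,
	 * so cross-node moves are handled by the widest level. */
	for (sd = &node; sd != NULL; sd = sd->parent)
		printf("balance within: %s\n", sd->span);
	return 0;
}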

Rick Lindsley | 4 Aug 21:29 2004

Re: NUMA inter-node load balance

You are right; the scheduler (and much of the code) is not documented
as well as it could be.

Have you read Documentation/sched-domains.txt?  It is brief, but may
help you understand how the domains are set up.  If it still leaves you
with questions, write back and I can provide a more clear explanation.

Rick

