Re: NUMA page allocation from next Node

Dear All,

On Sat, Oct 30, 2010 at 2:00 AM, Lee Schermerhorn
<Lee.Schermerhorn <at> hp.com> wrote:
>
> Also, if you could add "mminit_loglevel=2" to the boot command line, and
> grep for 'zonelist general'.  The general zonelists for the Normal zones
> will show the order of allocation for the two nodes.  On a 2 node [AMD]
> platform, I see:
>

Output of "mminit_loglevel=2" ...

mminit::zonelist general 0:DMA = 0:DMA
mminit::zonelist general 0:DMA32 = 0:DMA32 0:DMA
mminit::zonelist general 0:Normal = 0:Normal 1:Normal 0:DMA32 0:DMA
mminit::zonelist general 1:Normal = 1:Normal 0:Normal 0:DMA32 0:DMA

>
> And, just to be sure, you could suspend your dd job [^Z] and take a look
> at its mempolicy and such via /proc/<pid>/status [Mems_allowed*] and
> its /proc/<pid>/numa_maps.   If you haven't changed anything you should
> see  both nodes in Mems_allowed[_list] and all of the policies in the
> numa_maps should show 'default'.
>

/proc/<PID>/numa_maps shows "default".

Both nodes are shown in "Mems_allowed*".
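
For anyone repeating this check from a program rather than by eye, here is a minimal sketch in C. It assumes the contents of /proc/<pid>/status have already been read into a string, and it only extracts the value of the "Mems_allowed_list:" line. The function name mems_allowed_list() is made up for this example; it is not part of libnuma or any kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the value of the "Mems_allowed_list:" line
 * from the text of /proc/<pid>/status into out.
 * Returns 0 on success, -1 if the line is missing or out is too small.
 */
int mems_allowed_list(const char *status, char *out, size_t outlen)
{
    static const char key[] = "Mems_allowed_list:";
    const char *p = status;

    assert(status != NULL && out != NULL && outlen > 0);

    /* p always points at the start of a line. */
    while (p != NULL && *p != '\0') {
        if (strncmp(p, key, sizeof(key) - 1) == 0) {
            p += sizeof(key) - 1;
            while (*p == ' ' || *p == '\t')   /* skip separator */
                p++;
            size_t len = strcspn(p, "\n");    /* value runs to end of line */
            if (len >= outlen)
                return -1;
            memcpy(out, p, len);
            out[len] = '\0';
            return 0;
        }
        p = strchr(p, '\n');                  /* advance to the next line */
        if (p != NULL)
            p++;
    }
    return -1;
}
```

On a two-node machine with nothing restricted, the extracted value would be expected to read "0-1", which matches "both nodes shown" above.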


Re: NUMA page allocation from next Node

Tim,

I found that the default value of "zone_reclaim_mode" is 0 on the HP
machine, but 1 on the IBM machine.
Why is it set to 1 or 0 depending on the hardware?

On Sat, Oct 30, 2010 at 1:22 AM, Tim Pepper <lnxninja <at> linux.vnet.ibm.com> wrote:
>
> It would be interesting to see the output of "numactl --hardware" for each
> of these scenarios.
>

1. SLES11 + IBM HW
After consuming all memory in node 1, it shows the following:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 24564 MB
node 0 free: 23025 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 24576 MB
node 1 free: 16 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

2. SLES 11 + HP

available: 2 nodes (0-1)

Lee Schermerhorn | 1 Nov 2010 15:59

Re: NUMA page allocation from next Node

On Mon, 2010-11-01 at 19:48 +0530, Tharindu Rukshan Bamunuarachchi
wrote:
> Tim,
> 
> I found that default value for "zone_reclaim_mode" is zero in HP
> machine. But It is one in IBM.
> Why does it set 1 or 0 in different hardware

Because the SLIT on the IBM platform has distances > 20.  Looks like IBM
is populating the SLIT on those platforms with "real" values.  The HP
BIOS is not supplying a SLIT, letting the remote distances default to
20.  That is the threshold for setting zone_reclaim_mode.  A patch was
submitted recently to bump the threshold to ~30.  Now that vendors are
starting to populate the SLIT with values > 20, we've begun to see the
behavior that you experienced.

Regards,
Lee
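
The rule described above can be sketched in a few lines of C. This illustrates the explanation in this message and is not the kernel's code: the function name and the macro are made up for the example, and the matrix is assumed to be a flat n x n array of SLIT-style distances, like the "node distances" table printed by "numactl --hardware".

```c
#include <assert.h>
#include <stddef.h>

/* Threshold from the thread: a distance above 20 turns zone reclaim on.
 * The macro name is illustrative only. */
#define SKETCH_RECLAIM_DISTANCE 20

/*
 * Hypothetical helper: return 1 if any entry of the n x n distance
 * matrix exceeds the threshold (zone_reclaim_mode would default to 1),
 * otherwise 0.
 */
int zone_reclaim_default(const int *dist, size_t n)
{
    assert(dist != NULL && n > 0);

    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (dist[i * n + j] > SKETCH_RECLAIM_DISTANCE)
                return 1;
    return 0;
}
```

With the IBM distances quoted in this thread (10 21 / 21 10) the function returns 1. With remote distances defaulting to 20, as on the HP box (assuming the usual local distance of 10), it returns 0, which matches the two zone_reclaim_mode defaults reported.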

> 
> On Sat, Oct 30, 2010 at 1:22 AM, Tim Pepper <lnxninja <at> linux.vnet.ibm.com> wrote:
> >
> > It would be interesting to see the output of "numactl --hardware" for each
> > of these scenarios.
> >
> 
> 1. SLES11 + IBM HW
> After consuming all memory in node1, it shows following ...
> 
> available: 2 nodes (0-1)

Andi Kleen | 1 Nov 2010 18:59

Re: NUMA page allocation from next Node

On Mon, Nov 01, 2010 at 07:48:08PM +0530, Tharindu Rukshan Bamunuarachchi wrote:
> Tim,
> 
> I found that default value for "zone_reclaim_mode" is zero in HP
> machine. But It is one in IBM.
> Why does it set 1 or 0 in different hardware ?

See my earlier mail: it depends on the SLIT.

-Andi
Andi Kleen | 1 Nov 2010 19:00

Re: NUMA page allocation from next Node

> Because the SLIT on the IBM platform has distances > 20.  Looks like IBM
> is populating the SLIT on those platforms with "real" values.  The HP
> bios is not supplying a slit, letting the remote distances default to
> 20.  That is the threshold for setting zone_reclaim_mode.  A patch was
> submitted recently to bump the threshold to ~30.  Now that vendors are
> starting to populate the SLIT with values > 20, we've begun to see the
> behavior that you experienced. 

I think it's intentional by the vendors: they use it as a way to make
Linux behave like they want.

-Andi
Michael Spiegel | 1 Nov 2010 20:52

segmentation fault in numa_node_to_cpus_v1

Hi,

I'm trying to run the HotSpot Java VM on an SGI UV 1000 with 4096
cores.  When I enable the NUMA-aware garbage collection algorithm, I
get a segmentation fault as the virtual machine is initializing.  The
sigsegv is occurring at one of the memcpy's in numa_node_to_cpus_v1,
although I'm afraid I can't determine whether libnuma is being called
correctly or incorrectly.  I am testing on a system that has numactl
2.0.5.

Thanks,
--Michael

#6  <signal handler called>
#7  0x00007f4066fb9ad0 in memcpy () from /lib64/libc.so.6
#8  0x00007f40658d4c6a in numa_node_to_cpus_v1 (node=132, buffer=0x40112d40,
   bufferlen=<value optimized out>) at libnuma.c:1203
#9  0x00007f4066a85255 in os::Linux::rebuild_cpu_to_node_map() ()
  from /usr/ue/0/mspiegel/jdk1.6.0_22/jre/lib/amd64/server/libjvm.so
#10 0x00007f4066a8502f in os::Linux::libnuma_init() ()
  from /usr/ue/0/mspiegel/jdk1.6.0_22/jre/lib/amd64/server/libjvm.so
#11 0x00007f4066a86c38 in os::init_2() ()
  from /usr/ue/0/mspiegel/jdk1.6.0_22/jre/lib/amd64/server/libjvm.so
#12 0x00007f4066b81c4d in Threads::create_vm(JavaVMInitArgs*, bool*) ()
  from /usr/ue/0/mspiegel/jdk1.6.0_22/jre/lib/amd64/server/libjvm.so
Scott Lurndal | 1 Nov 2010 21:39

Re: segmentation fault in numa_node_to_cpus_v1

On Mon, November 1, 2010 11:52 am, Michael Spiegel wrote:
> Hi,
>
> I'm trying to run the HotSpot Java VM on an SGI UV 1000 with 4096
> cores.  When I enable the NUMA-aware garbage collection algorithm, I
> get a segmentation fault as the virtual machine is initializing.  The
> sigsegv is occurring at one of the memcpy's in numa_node_to_cpus_v1,
> although I'm afraid I can't determine whether libnuma is being called
> correctly or incorrectly.  I am testing on a system that has numactl
> 2.0.5.

I ran into this issue with Oracle 11i.  It was linked against a library
with an older API.   Because I couldn't change Oracle, I modified the
libnuma.so we were using as follows:

libnuma.c:

@@ -1240,7 +1246,7 @@
        }
        return err;
 }
-__asm__(".symver numa_node_to_cpus_v1,numa_node_to_cpus@libnuma_1.1");
+__asm__(".symver numa_node_to_cpus_v2,numa_node_to_cpus@libnuma_1.1");

 /*
  * test whether a node has cpus
@@ -1316,7 +1322,7 @@
        }
        return err;
 }

Cliff Wickman | 1 Nov 2010 23:59

Re: segmentation fault in numa_node_to_cpus_v1

On Mon, Nov 01, 2010 at 03:52:59PM -0400, Michael Spiegel wrote:
> Hi,
> 
> I'm trying to run the HotSpot Java VM on an SGI UV 1000 with 4096
> cores.  When I enable the NUMA-aware garbage collection algorithm, I
> get a segmentation fault as the virtual machine is initializing.  The
> sigsegv is occurring at one of the memcpy's in numa_node_to_cpus_v1,
> although I'm afraid I can't determine whether libnuma is being called
> correctly or incorrectly.  I am testing on a system that has numactl
> 2.0.5.
> 
> Thanks,
> --Michael

Hi Michael,

  I see that Scott Lurndal gave you a possible fix.
  There were some important corrections added to the latest version, so
  if you could try building numactl/libnuma from numactl-2.0.6-rc3.tar.gz 
  that would be an interesting test.
  (ftp://oss.sgi.com/www/projects/libnuma/download/)
> 
> #6  <signal handler called>
> #7  0x00007f4066fb9ad0 in memcpy () from /lib64/libc.so.6
> #8  0x00007f40658d4c6a in numa_node_to_cpus_v1 (node=132, buffer=0x40112d40,
>    bufferlen=<value optimized out>) at libnuma.c:1203
> #9  0x00007f4066a85255 in os::Linux::rebuild_cpu_to_node_map() ()
>   from /usr/ue/0/mspiegel/jdk1.6.0_22/jre/lib/amd64/server/libjvm.so
> #10 0x00007f4066a8502f in os::Linux::libnuma_init() ()
>   from /usr/ue/0/mspiegel/jdk1.6.0_22/jre/lib/amd64/server/libjvm.so

Re: NUMA page allocation from next Node

Andi/Lee/Tim/Scott/Cliff/Jiahua,

Thanks a lot for your valuable input and advice.

__
Tharindu R Bamunuarachchi.

On Mon, Nov 1, 2010 at 11:30 PM, Andi Kleen <andi <at> firstfloor.org> wrote:
>> Because the SLIT on the IBM platform has distances > 20.  Looks like IBM
>> is populating the SLIT on those platforms with "real" values.  The HP
>> bios is not supplying a slit, letting the remote distances default to
>> 20.  That is the threshold for setting zone_reclaim_mode.  A patch was
>> submitted recently to bump the threshold to ~30.  Now that vendors are
>> starting to populate the SLIT with values > 20, we've begun to see the
>> behavior that you experienced.
>
> I think it's intentional by the vendors: they use it a way to make
> Linux behave like they want.
>
> -Andi
>
Michael Spiegel | 2 Nov 2010 02:10

Re: segmentation fault in numa_node_to_cpus_v1

Hi everyone,

I tried the suggestions from Cliff and Scott, with no change in
behavior.  With some primitive debugging I noticed that
NUMA_NUM_NODES was 128 while the node argument to numa_node_to_cpus_v1
was 132.  Changing the definition of NUMA_NUM_NODES in numa.h to
2048 eliminates the segmentation fault, but now I'm getting "mbind:
invalid argument" errors.

Thanks,
--Michael
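
The thread does not show the code at libnuma.c:1203, so the following is only a hypothetical C illustration of the kind of failure described: a table sized by a compile-time node limit (128 here, the NUMA_NUM_NODES value reported above) indexed with a larger node number (132). Every name in it is invented for the sketch and none of it is taken from libnuma. It shows a lookup that rejects an out-of-range node with ERANGE instead of copying from past the end of the table.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_NODES  128   /* compile-time node limit, as reported */
#define SKETCH_MASK_WORDS 4     /* arbitrary per-node mask size for the sketch */

/* Hypothetical per-node CPU mask table, sized by the compile-time limit. */
static unsigned long node_cpu_masks[SKETCH_MAX_NODES][SKETCH_MASK_WORDS];

/*
 * Hypothetical lookup: copy the CPU mask of a node into buffer.
 * Returns 0 on success; returns -1 with errno set to ERANGE when the
 * node number is outside the table or the buffer is too small.
 */
int sketch_node_to_cpus(int node, unsigned long *buffer, size_t bufferlen)
{
    assert(buffer != NULL);

    if (node < 0 || node >= SKETCH_MAX_NODES ||
        bufferlen < sizeof(node_cpu_masks[0])) {
        errno = ERANGE;
        return -1;
    }
    memcpy(buffer, node_cpu_masks[node], sizeof(node_cpu_masks[0]));
    return 0;
}
```

In this sketch a call with node 132 fails cleanly rather than reading out of bounds. Raising the limit, as was done with NUMA_NUM_NODES above, makes the same call succeed, but anything else sized by the old limit would still need to agree with it.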

On Mon, Nov 1, 2010 at 6:59 PM, Cliff Wickman <cpw <at> sgi.com> wrote:
> On Mon, Nov 01, 2010 at 03:52:59PM -0400, Michael Spiegel wrote:
>> Hi,
>>
>> I'm trying to run the HotSpot Java VM on an SGI UV 1000 with 4096
>> cores.  When I enable the NUMA-aware garbage collection algorithm, I
>> get a segmentation fault as the virtual machine is initializing.  The
>> sigsegv is occurring at one of the memcpy's in numa_node_to_cpus_v1,
>> although I'm afraid I can't determine whether libnuma is being called
>> correctly or incorrectly.  I am testing on a system that has numactl
>> 2.0.5.
>>
>> Thanks,
>> --Michael
>
> Hi Michael,
>
>  I see that Scott Lurndal gave you a possible fix.