Ray Bryant | 8 Jun 2004 17:12

Re: why swap at all?

Buddy Lumpkin wrote:
>  <snip> One method would be to keep the
> pagecache on its own list, and move pages to the head of the list any time
> they are modified or referenced, and reclaim from the tail. 
> 
> All pages on this list can be considered as "free memory", because any new
> memory requests would just cause pages to be evicted from the tail of the
> list.
> 

We have code running on Altix that does exactly this.  (Please note,
however, that this is for our version of Linux 2.4.21 -- Yeah, it's
old, but that is what the product runs at the moment -- we are in
the process of switching over to Linux 2.6, when all of this will
have to be re-evaluated.)  The changes are in three parts:

(1)  We added a new page list, the reclaim list.  Pages are put
onto the reclaim list when they are inserted into the page cache.
They are removed from the list when they are marked dirty (buffers
from the page go on to the LRU dirty list) or when the pages are
mmap'd into an address space, since in either of these situations,
the pages are not reclaimable.  (This list is per node in our
NUMA system.)

(2)  We added code in __alloc_pages() so that if the local node
allocation is going to fail (remember that Altix is a NUMA machine),
we call out to a routine to scan the reclaim list on that node and
to release enough clean buffer cache pages to make the local
allocation succeed (plus a few pages, for efficiency).  If this
doesn't work, we most likely end up spilling the allocation over
[...]
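
For readers following along, parts (1) and (2) of the description above can be modelled in user space roughly as below. This is only a sketch, not the Altix patch: `struct page` here is a toy type, every function name (`reclaim_add`, `reclaim_del`, `reclaim_touch`, `reclaim_pages`) is made up for illustration, and the move-to-head-on-reference step follows Buddy's proposal rather than anything stated about the SGI code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a page; NOT the kernel's struct page. */
struct page {
    struct page *prev, *next;
    int on_reclaim_list;    /* 1 while linked on a reclaim list */
    int freed;              /* set once the model "releases" the page */
};

/* One list per NUMA node; the sentinel keeps the list circular. */
struct reclaim_list {
    struct page sentinel;
    size_t count;
};

void reclaim_list_init(struct reclaim_list *rl)
{
    rl->sentinel.prev = rl->sentinel.next = &rl->sentinel;
    rl->count = 0;
}

/* Page enters the page cache: link it at the head (most recent). */
void reclaim_add(struct reclaim_list *rl, struct page *pg)
{
    assert(!pg->on_reclaim_list);
    pg->next = rl->sentinel.next;
    pg->prev = &rl->sentinel;
    rl->sentinel.next->prev = pg;
    rl->sentinel.next = pg;
    pg->on_reclaim_list = 1;
    rl->count++;
}

/* Page was dirtied or mmap'd: it is no longer cheaply reclaimable. */
void reclaim_del(struct reclaim_list *rl, struct page *pg)
{
    if (!pg->on_reclaim_list)
        return;
    pg->prev->next = pg->next;
    pg->next->prev = pg->prev;
    pg->prev = pg->next = NULL;
    pg->on_reclaim_list = 0;
    rl->count--;
}

/* Page was referenced: move it back to the head (Buddy's proposal). */
void reclaim_touch(struct reclaim_list *rl, struct page *pg)
{
    if (pg->on_reclaim_list) {
        reclaim_del(rl, pg);
        reclaim_add(rl, pg);
    }
}

/*
 * Local allocation is about to fail: release up to `want` pages,
 * oldest first (from the tail). Returns how many were released.
 */
size_t reclaim_pages(struct reclaim_list *rl, size_t want)
{
    size_t released = 0;

    while (released < want && rl->count > 0) {
        struct page *victim = rl->sentinel.prev;
        reclaim_del(rl, victim);
        victim->freed = 1;
        released++;
    }
    return released;
}
```

A caller modelling the __alloc_pages() hook would ask `reclaim_pages()` for the shortfall plus a few extra pages, and fall through to the off-node path if it gets back fewer than it needs.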

Bill Davidsen | 9 Jun 2004 21:24

Re: why swap at all?

Ray Bryant wrote:
> 
> Buddy Lumpkin wrote:
> 
>>  <snip> One method would be to keep the
>> pagecache on its own list, and move pages to the head of the list any time
>> they are modified or referenced, and reclaim from the tail.
>> All pages on this list can be considered as "free memory", because any new
>> memory requests would just cause pages to be evicted from the tail of the
>> list.
>>
> 
> We have code running on Altix that does exactly this.  (Please note,
> however, that this is for our version of Linux 2.4.21 -- Yeah, it's
> old, but that is what the product runs at the moment -- we are in
> the process of switching over to Linux 2.6, when all of this will
> have to be re-evaluated.)  The changes are in three parts:
> 
> (1)  We added a new page list, the reclaim list.  Pages are put
> onto the reclaim list when they are inserted into the page cache.
> They are removed from the list when they are marked dirty (buffers
> from the page go on to the LRU dirty list) or when the pages are
> mmap'd into an address space, since in either of these situations,
> the pages are not reclaimable.  (This list is per node in our
> NUMA system.)
> 
> (2)  We added code in __alloc_pages() so that if the local node
> allocation is going to fail (remember that Altix is a NUMA machine),
[...]

Andi Kleen | 14 Jun 2004 18:17

Re: NUMA API observations

[right list for numactl discussions is lse-tech, not linux-kernel.
Adding cc]

On Tue, Jun 15, 2004 at 01:36:38AM +1000, Anton Blanchard wrote:
> Now if I try and interleave across all, the task gets OOM killed:
> 
> # numactl --interleave=all /bin/sh
> Killed
> 
> in dmesg: VM: killing process sh

interleave should always fall back to other nodes. Very weird.

Needs to be investigated. What were the actual arguments passed
to the syscalls? 

> 
> It works if I specify the nodes with memory:

That's probably a user space bug. Can you check what it passes
for "all" to the system calls? Maybe another bug in the sysfs
parser.

(you're using the latest version right? Previous ones were buggy)

> 
> # numactl --interleave=0,1 /bin/sh
> 
> Is this expected or do we want it to fall back when there is lots of
> memory on other nodes?
[...]

Paul Jackson | 14 Jun 2004 23:21

Re: NUMA API observations

Andi wrote:
> How should a user space application sanely discover the cpumask_t
> size needed by the kernel?  Whoever designed that was on crack.
> 
> I will probably make it loop and double the buffer until EINVAL
> ends or it passes a page and add a nasty comment.

I agree that a loop is needed.  And yes someone didn't do a very
good job of designing this interface.

I posted a piece of code that gets a usable upper bound on cpumask_t
size, suitable for application code to size mask buffers to be used
in these system calls.

See the lkml article:

  http://groups.google.com/groups?selm=fa.hp225re.1v68ei0%40ifi.uio.no

Or search in google groups for "cpumasksz".

This article was posted:

    Date: 2004-06-04 09:20:13 PST 

in a long thread under the Subject of:

    [PATCH] cpumask 5/10 rewrite cpumask.h - single bitmap based implementation

Feel free to steal it, or to ignore it, if you find it easier to
write your version than to read mine.

Anton Blanchard | 14 Jun 2004 23:40

Re: NUMA API observations


> interleave should always fall back to other nodes. Very weird.
> Needs to be investigated. What were the actual arguments passed
> to the syscalls? 

This one looks like a bug in my code. I wasn't setting numnodes high
enough, so the node fallback lists weren't being initialised for some
nodes.

> > My kernel is compiled with NR_CPUS=128, the setaffinity syscall must be
> > called with a bitmap at least as big as the kernel's cpumask_t. I will
> > submit a patch for this shortly.
> 
> Umm, what a misfeature. We size the buffer up to the biggest
> running CPU. That should be enough. 
> 
> IMHO that's just a kernel bug. How should a user space
> application sanely discover the cpumask_t size needed by the kernel?
> Whoever designed that was on crack.

glibc now uses a select-style interface. Unfortunately the interface has
changed about three times by now.

> I will probably make it loop and double the buffer until EINVAL ends or it 
> passes a page and add a nasty comment.

Perhaps we could use the new glibc interface and fall back to the loop
on older glibcs.

Anton

Andi Kleen | 15 Jun 2004 01:44

Re: NUMA API observations

On Mon, Jun 14, 2004 at 02:21:28PM -0700, Paul Jackson wrote:
> Andi wrote:
> > How should a user space application sanely discover the cpumask_t
> > size needed by the kernel?  Whoever designed that was on crack.
> > 
> > I will probably make it loop and double the buffer until EINVAL
> > ends or it passes a page and add a nasty comment.
> 
> I agree that a loop is needed.  And yes someone didn't do a very
> good job of designing this interface.

I added some code to go up to a page now.

This adds a hardcoded limit of 32768 CPUs to libnuma. That's not
nice, but we have to stop somewhere in case the EINVAL is returned
for other reasons.
(I really dislike this error code btw, it is nearly always far too
ambiguous...)

> 
> I posted a piece of code that gets a usable upper bound on cpumask_t
> size, suitable for application code to size mask buffers to be used
> in these system calls.

My code works basically the same, but thanks for the pointer.

-Andi
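
The doubling loop discussed above might look roughly like the following. It is a sketch under stated assumptions, not the libnuma or cpumasksz code: `probe_cpumask_bytes`, `raw_getaffinity`, and `fake_getaffinity_128` are names invented here, the 4096-byte cap assumes a 4 KiB page (4096 bytes * 8 bits = 32768 CPUs, matching the limit mentioned), and the stand-in function only imitates a kernel built with NR_CPUS=128 so the loop can be exercised on any machine.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed cap: one 4 KiB page, i.e. 4096 * 8 = 32768 CPUs. */
#define MASK_LIMIT_BYTES 4096

/* Lets the loop run against the real syscall or a stand-in. */
typedef long (*getaffinity_fn)(size_t len, void *mask);

/* Real kernel: raw syscall, bypassing whatever wrapper glibc has. */
long raw_getaffinity(size_t len, void *mask)
{
    return syscall(SYS_sched_getaffinity, 0, len, mask);
}

/*
 * Hypothetical stand-in for a kernel built with NR_CPUS=128
 * (16-byte cpumask_t): any shorter buffer is rejected with EINVAL.
 */
long fake_getaffinity_128(size_t len, void *mask)
{
    (void)mask;
    if (len < 128 / 8) {
        errno = EINVAL;
        return -1;
    }
    return 128 / 8;
}

/*
 * Double the buffer until the call stops failing with EINVAL or the
 * cap is passed. Returns the first accepted length in bytes, or -1
 * with errno set (EINVAL if the cap was reached).
 */
long probe_cpumask_bytes(getaffinity_fn getaff)
{
    assert(getaff != NULL);

    for (size_t len = sizeof(unsigned long);
         len <= MASK_LIMIT_BYTES; len *= 2) {
        void *mask = calloc(1, len);
        if (mask == NULL)
            return -1;

        long ret = getaff(len, mask);
        int saved_errno = errno;
        free(mask);

        if (ret >= 0)
            return (long)len;
        if (saved_errno != EINVAL) {
            /* Some other failure: growing the buffer will not help. */
            errno = saved_errno;
            return -1;
        }
    }
    errno = EINVAL;    /* EINVAL persisted all the way to the cap */
    return -1;
}
```

Calling `probe_cpumask_bytes(raw_getaffinity)` probes the running kernel; with the NR_CPUS=128 stand-in the loop settles on 16 bytes. The cap is what keeps the loop finite when EINVAL is being returned for an unrelated reason.
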

Andi Kleen | 15 Jun 2004 01:49

Re: NUMA API observations

On Tue, Jun 15, 2004 at 07:40:04AM +1000, Anton Blanchard wrote:
>  
> > interleave should always fall back to other nodes. Very weird.
> > Needs to be investigated. What were the actual arguments passed
> > to the syscalls? 
> 
> This one looks like a bug in my code. I wasn't setting numnodes high
> enough, so the node fallback lists weren't being initialised for some
> nodes.

Ok. Good to know.

That's a bad generic bug, right?

interleaving isn't really doing much different from an ordinary allocation,
except that the numa_node_id() index to the zone table is replaced with a 
different number.

> > > My kernel is compiled with NR_CPUS=128, the setaffinity syscall must be
> > > called with a bitmap at least as big as the kernel's cpumask_t. I will
> > > submit a patch for this shortly.
> > 
> > Umm, what a misfeature. We size the buffer up to the biggest
> > running CPU. That should be enough. 
> > 
> > IMHO that's just a kernel bug. How should a user space
> > application sanely discover the cpumask_t size needed by the kernel?
> > Whoever designed that was on crack.
> 
> glibc now uses a select-style interface. Unfortunately the interface has
[...]
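
The remark above, that interleaving is an ordinary allocation with a different starting node, and the fallback-list bug Anton describes can both be shown with a small user-space model. Everything in it is hypothetical (`struct node_model`, `build_fallback`, `alloc_page_from`, `alloc_page_interleave`); it is not kernel code and only assumes that each node has an ordered list of nodes to try.

```c
#include <assert.h>

#define MAX_NODES 4

/* Toy model of a NUMA node; not a kernel structure. */
struct node_model {
    int free_pages;
    int fallback[MAX_NODES];  /* nodes to try, in order */
    int nr_fallback;          /* 0 models an uninitialised list */
};

/*
 * Build fallback lists for the first `numnodes` nodes only: the node
 * itself first, then the others in wrap-around order. Nodes at or
 * beyond `numnodes` keep an empty list, as in the reported bug.
 */
void build_fallback(struct node_model *nodes, int numnodes)
{
    assert(numnodes <= MAX_NODES);
    for (int n = 0; n < numnodes; n++) {
        nodes[n].nr_fallback = 0;
        for (int i = 0; i < numnodes; i++)
            nodes[n].fallback[nodes[n].nr_fallback++] = (n + i) % numnodes;
    }
}

/* Ordinary allocation: walk the start node's fallback list. */
int alloc_page_from(struct node_model *nodes, int start)
{
    for (int i = 0; i < nodes[start].nr_fallback; i++) {
        int n = nodes[start].fallback[i];
        if (nodes[n].free_pages > 0) {
            nodes[n].free_pages--;
            return n;       /* node that supplied the page */
        }
    }
    return -1;              /* nothing found: the OOM case */
}

/* Interleave: the same path, but the start node rotates per call. */
int alloc_page_interleave(struct node_model *nodes, int nr_nodes, int *next)
{
    int start = *next;

    *next = (*next + 1) % nr_nodes;
    return alloc_page_from(nodes, start);
}
```

With memory only on nodes 0 and 1 and fallback lists built for all four nodes, interleaved allocations that start on the empty nodes 2 and 3 are satisfied from nodes 0 and 1. If the lists are built for only two nodes, an allocation starting on node 2 has nowhere to go and fails, which in this model corresponds to the OOM kill seen with --interleave=all.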

Paul Jackson | 15 Jun 2004 02:06

Re: NUMA API observations

Andi wrote:
> I added some code to go up to a page now.

good.

> This adds a hardcoded limit of 32768 CPUs to libnuma. 

Ok - SGI has no issues with a 32768 CPU limit ... for now ;).

> I really dislike this error code [EINVAL] btw ...

Then use others ??

The way I learned Unix, decades ago, the tradition was to use a variety
of errno values, even if they were a slightly strange fit, in order to
provide more detailed error feedback.  Look for example at the rename(2)
or acct(2) system calls.

So long as the man page shows them, it can be helpful.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj <at> sgi.com> 1.650.933.1373
