Keith Mannthey | 1 Sep 2006 05:08

Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes

On 8/31/06, Mel Gorman <mel <at> csn.ul.ie> wrote:
> On Thu, 31 Aug 2006, Keith Mannthey wrote:
> > On 8/31/06, Mel Gorman <mel <at> skynet.ie> wrote:
> >> On (30/08/06 13:57), Keith Mannthey didst pronounce:
> >> > On 8/21/06, Mel Gorman <mel <at> csn.ul.ie> wrote:
> >> > >

> Can you confirm that happens by applying the patch I sent to you and
> checking the output? When the reserve fails, it should print out what
> range it actually checked. I want to be sure it's not checking the
> addresses 0->0x1070000000

See below

> >> > > @@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
> >> > >
> >> > >        printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
> >> > >               nd->start, nd->end);
> >> > >+       e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
> >> > >+                                               nd->end >> PAGE_SHIFT);
> >> >
> >> > A node chunk in this section of code may be a hot-pluggable zone. With
> >> > MEMORY_HOTPLUG_SPARSE we don't want to register these regions.
> >> >
> >>
> >> The ranges should not get registered as active memory by
> >> e820_register_active_regions() unless they are marked E820_RAM. My
> >> understanding is that the regions for hotadd would be marked "reserved"
> >> in the e820 map. Is that wrong?
> >
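Mel's point is that e820_register_active_regions() only registers the parts of a node range that the e820 map marks as E820_RAM. The small user-space sketch below illustrates that. The structure, the inward page-rounding rule and the helper count_active_pages() are assumptions for illustration. The patch's function presumably registers each overlapping RAM region through add_active_range() rather than counting pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4 KiB pages */

/* Hypothetical, simplified e820 entry; only the type values matter here. */
enum e820_type { E820_RAM = 1, E820_RESERVED = 2 };

struct e820_entry {
	uint64_t addr;        /* start address in bytes */
	uint64_t size;        /* length in bytes */
	enum e820_type type;
};

/*
 * Count the pages of the node range [start_pfn, end_pfn) that overlap
 * E820_RAM entries. Reserved entries (e.g. hot-add areas) contribute
 * nothing, so they never become "active" memory.
 */
static uint64_t count_active_pages(const struct e820_entry *map, size_t n,
				   uint64_t start_pfn, uint64_t end_pfn)
{
	uint64_t total = 0;

	assert(start_pfn <= end_pfn);
	for (size_t i = 0; i < n; i++) {
		uint64_t s, e;

		if (map[i].type != E820_RAM)
			continue;
		/* Assumption: round RAM inward so only whole pages count. */
		s = (map[i].addr + (1ULL << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
		e = (map[i].addr + map[i].size) >> PAGE_SHIFT;
		/* Clip the RAM region to the node range. */
		if (s < start_pfn)
			s = start_pfn;
		if (e > end_pfn)
			e = end_pfn;
		if (s < e)
			total += e - s;
	}
	return total;
}
```

A node range that covers only a reserved hot-add area yields zero active pages, which matches Mel's reading of the e820 map.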

Mel Gorman | 1 Sep 2006 10:33

Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes

On Thu, 31 Aug 2006, Keith Mannthey wrote:

> On 8/31/06, Mel Gorman <mel <at> csn.ul.ie> wrote:
>> On Thu, 31 Aug 2006, Keith Mannthey wrote:
>> > On 8/31/06, Mel Gorman <mel <at> skynet.ie> wrote:
>> >> On (30/08/06 13:57), Keith Mannthey didst pronounce:
>> >> > On 8/21/06, Mel Gorman <mel <at> csn.ul.ie> wrote:
>> >> > >
>
>> Can you confirm that happens by applying the patch I sent to you and
>> checking the output? When the reserve fails, it should print out what
>> range it actually checked. I want to be sure it's not checking the
>> addresses 0->0x1070000000
>
> See below
>

Perfect, thanks a lot. I should have enough to reproduce what is going on
without a test machine and to develop the required patches.

>> >> > > @@ -329,6 +330,8 @@ acpi_numa_memory_affinity_init(struct ac
>> >> > >
>> >> > >        printk(KERN_INFO "SRAT: Node %u PXM %u %Lx-%Lx\n", node, pxm,
>> >> > >               nd->start, nd->end);
>> >> > >+       e820_register_active_regions(node, nd->start >> PAGE_SHIFT,
>> >> > >+                                               nd->end >> PAGE_SHIFT);
>> >> >
>> >> > A node chunk in this section of code may be a hot-pluggable zone. With
>> >> > MEMORY_HOTPLUG_SPARSE we don't want to register these regions.

Mika Penttilä | 1 Sep 2006 10:46

Re: [PATCH 4/6] Have x86_64 use add_active_range() and free_area_init_nodes


>
> Right, it's all very clear now. At some point in the future, I'd like 
> to visit why SPARSEMEM-based hot-add is not always used but it's a 
> separate issue.
>
>> The add areas
>> are marked as RESERVED during boot and then later onlined during add.
>
> That explains the reserve_bootmem_node()
>
But pages are marked reserved by default. You still have to allocate the
bootmem map for the whole node range, including reserved hot-add
areas and areas beyond the e820 end of RAM. So all the areas are already
reserved until freed.
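Mika's argument depends on bootmem starting with every page reserved. The sketch below models that behaviour, assuming a one-bit-per-page bitmap. struct bootmem_map, bootmem_init() and the other names are hypothetical, not the kernel's. The map spans the whole node and every bit starts set. Only ranges that are explicitly freed become usable, so hot-add holes stay reserved without any extra reserve call.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_PAGES 1024   /* hypothetical node size for the sketch */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical bootmem-style map: one bit per page, 1 = reserved. */
struct bootmem_map {
	size_t npages;
	unsigned long bits[(MAX_PAGES + BITS_PER_WORD - 1) / BITS_PER_WORD];
};

/* The map covers the whole node span; every page starts out reserved. */
static void bootmem_init(struct bootmem_map *m, size_t npages)
{
	assert(npages <= MAX_PAGES);
	m->npages = npages;
	memset(m->bits, 0xff, sizeof(m->bits));
}

/* Only ranges freed explicitly (usable RAM) become available. */
static void bootmem_free(struct bootmem_map *m, size_t start, size_t end)
{
	assert(start <= end && end <= m->npages);
	for (size_t pfn = start; pfn < end; pfn++)
		m->bits[pfn / BITS_PER_WORD] &= ~(1UL << (pfn % BITS_PER_WORD));
}

static bool bootmem_reserved(const struct bootmem_map *m, size_t pfn)
{
	assert(pfn < m->npages);
	return (m->bits[pfn / BITS_PER_WORD] >> (pfn % BITS_PER_WORD)) & 1UL;
}
```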

--Mika

Christoph Lameter | 2 Sep 2006 00:33

[MODSLAB 0/5] Modular slab allocator V3

Modular Slab Allocator:

Why would one use this?

1. Reduced memory requirements.

  Savings range from a few hundred kilobytes on i386 to 5GB on a 1024-processor,
  4TB Altix NUMA system.

  The slabifier has no caches in the sense of the slab allocator. No storage
  is allocated for per cpu, shared or alien caches. A slab in itself functions
  as the cache. Objects are served directly from a per cpu slab (an "active"
  slab). The management overhead for caches is gone.

  Slabs do not contain metadata, only the payload. Metadata is kept
  in the associated page struct. This means that objects can begin at the
  start of a slab and are always properly aligned.

2. No cache reaper

  The current slab allocator needs to periodically check its slab caches and
  move objects back into the slabs. Every 2 seconds, on every cpu, all slab caches
  are scanned and objects move around the system. The system cannot really
  enter a quiescent state.

  The slabifier needs no such mechanism in the single processor case. In the
  SMP case we have a per slab flusher that is active as long as processors
  have active slabs. After a timeout it flushes the active slabs back into
  the slab lists. If no active slabs exist then the flusher is deactivated.
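Points 1 and 2 describe objects served straight from an active slab, with the metadata kept outside the slab. That design can be sketched in user space roughly as follows. struct slab_desc stands in for the fields the slabifier keeps in struct page. The names, the 4 KiB slab size and the layout where free objects hold the freelist links are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SLAB_SIZE 4096   /* assumption: one 4 KiB page per slab */

/*
 * Hypothetical per-slab descriptor standing in for the page struct.
 * The slab memory holds only objects, so object 0 starts at offset 0.
 */
struct slab_desc {
	char *base;        /* start of the slab's memory */
	void *freelist;    /* first free object; links live in free objects */
	unsigned inuse;    /* number of allocated objects */
};

/* Thread every object of the slab onto the freelist, lowest address first. */
static void slab_init(struct slab_desc *d, char *mem, size_t objsize)
{
	size_t n = SLAB_SIZE / objsize;

	assert(objsize >= sizeof(void *) && objsize % sizeof(void *) == 0);
	assert(n > 0);
	d->base = mem;
	d->inuse = 0;
	d->freelist = NULL;
	for (size_t i = n; i-- > 0; ) {
		void **obj = (void **)(mem + i * objsize);

		*obj = d->freelist;
		d->freelist = obj;
	}
}

/* Serve from the active slab; NULL means it is full and a new one is needed. */
static void *slab_alloc_obj(struct slab_desc *d)
{
	void **obj = d->freelist;

	if (!obj)
		return NULL;
	d->freelist = *obj;
	d->inuse++;
	return obj;
}

/* Push the object back onto the slab's own freelist. */
static void slab_free_obj(struct slab_desc *d, void *p)
{
	*(void **)p = d->freelist;
	d->freelist = p;
	d->inuse--;
}
```

No per-cpu object arrays exist here. The only cached state is which slab is active, so there is nothing for a periodic reaper to drain.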


Christoph Lameter | 2 Sep 2006 00:34

[MODSLAB 4/5] Kmalloc subsystem

A generic kmalloc layer for the modular slab

Regular kmalloc allocations are optimized. DMA kmalloc slabs are
created on demand.

It also re-exports the kmalloc array as a new slab_allocator that
can be used to tie into the kmalloc array (the slabulator
uses that to avoid creating new slabs that are compatible
with generic kmalloc caches).
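A generic kmalloc layer commonly picks a cache by rounding the request up to the next power of two and indexing the kmalloc array by its log2. The sketch below assumes a pure power-of-two scheme with classes from 8 bytes to 128 KiB. KMALLOC_SHIFT_LOW, KMALLOC_SHIFT_HIGH and kmalloc_index() are illustrative names and may not match the patch.

```c
#include <assert.h>
#include <stddef.h>

#define KMALLOC_SHIFT_LOW  3    /* assumption: smallest class is 8 bytes */
#define KMALLOC_SHIFT_HIGH 17   /* assumption: largest class is 128 KiB */

/*
 * Map a request size to the index of the smallest power-of-two cache
 * that fits it, i.e. ceil(log2(size)) clamped at the low end.
 * Returns -1 if the request is too large for the kmalloc array.
 */
static int kmalloc_index(size_t size)
{
	int shift = KMALLOC_SHIFT_LOW;

	assert(size > 0);
	while (((size_t)1 << shift) < size) {
		if (++shift > KMALLOC_SHIFT_HIGH)
			return -1;
	}
	return shift;
}
```

With this scheme, a 65-byte request lands in the 128-byte cache (index 7).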

Signed-off-by: Christoph Lameter <clameter <at> sgi.com>

Index: linux-2.6.18-rc5-mm1/include/linux/kmalloc.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc5-mm1/include/linux/kmalloc.h	2006-09-01 11:54:43.745232343 -0700
@@ -0,0 +1,136 @@
+#ifndef _LINUX_KMALLOC_H
+#define _LINUX_KMALLOC_H
+/*
+ * In kernel dynamic memory allocator.
+ *
+ * (C) 2006 Silicon Graphics, Inc,
+ * 		Christoph Lameter <clameter <at> sgi.com>
+ */
+
+#include <linux/allocator.h>
+#include <linux/config.h>
+#include <linux/types.h>
+

Christoph Lameter | 2 Sep 2006 00:34

[MODSLAB 3/5] /proc/slabinfo display

Generic Slab statistics module

A statistics module for the generic slab allocator framework.

The creator of a cache must register the slab cache with

register_slab()

in order for something to show up in slabinfo.

Here is a sample of slabinfo output:

slabinfo - version: 3.0
# name            <objects> <objsize> <num_slabs> <partial_slabs> <active_slabs> <order> <allocator>
nfs_direct_cache           0     136       0       0       0  0 reclaimable:page_allocator
nfs_write_data            36     896       2       0       0  0 unreclaimable:page_allocator
nfs_read_data             21     768       2       1       0  0 unreclaimable:page_allocator
nfs_inode_cache           17    1032       3       3       0  0 ctor_dtor:reclaimable:page_allocator
rpc_tasks                  0     384       1       1       0  0 unreclaimable:page_allocator
rpc_inode_cache            0     896       1       1       0  0 ctor_dtor:reclaimable:page_allocator
ip6_dst_cache              0     384       1       1       0  0 unreclaimable:page_allocator
TCPv6                      1    1792       2       2       0  0 unreclaimable:page_allocator
UNIX                     112     768      11       7       0  0 unreclaimable:page_allocator
dm_tio                     0      24       0       0       0  0 unreclaimable:page_allocator
dm_io                      0      40       0       0       0  0 unreclaimable:page_allocator
kmalloc                    0      64       0       0       0  0 dma:unreclaimable:page_allocator
cfq_ioc_pool               0     160       0       0       0  0 unreclaimable:page_allocator
cfq_pool                   0     160       0       0       0  0 unreclaimable:page_allocator
mqueue_inode_cache         0     896       1       1       0  0 ctor_dtor:unreclaimable:page_allocator
xfs_chashlist            822      40       9       9       0  0 unreclaimable:page_allocator
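The following is a rough user-space sketch of the registration step and of one output row. It assumes a singly linked registry. struct slab_stats, register_slab_sketch() and format_slabinfo_row() are hypothetical stand-ins for whatever the patch's register_slab() and /proc code actually do. The column order follows the header of the sample output.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical cache record holding the columns of the sample output. */
struct slab_stats {
	const char *name;
	unsigned long objects, objsize, slabs, partial, active;
	int order;
	const char *allocator;
	struct slab_stats *next;   /* registry linkage */
};

static struct slab_stats *registered;   /* head of the registry */

/* Only caches added here would show up in the listing. */
static void register_slab_sketch(struct slab_stats *s)
{
	assert(s && s->name);
	s->next = registered;
	registered = s;
}

/* Format one slabinfo row; returns the snprintf() result. */
static int format_slabinfo_row(char *buf, size_t len,
			       const struct slab_stats *s)
{
	return snprintf(buf, len, "%-20s %8lu %7lu %7lu %7lu %7lu %2d %s",
			s->name, s->objects, s->objsize, s->slabs,
			s->partial, s->active, s->order, s->allocator);
}
```

A /proc handler would then walk the list from registered and emit one row per cache.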

Christoph Lameter | 2 Sep 2006 00:34

[MODSLAB] Bypass indirections [for performance testing only]

Bypass indirections.

This is a patch to bypass indirections so that one can measure the
impact of the indirect calls.

Only use this for testing.
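The effect this patch is meant to expose can be reproduced in miniature with two loops that do identical work. One calls through a function pointer held in an ops table, and the other calls the function directly. The names here (struct alloc_ops, run_indirect(), run_direct()) are illustrative only.

```c
#include <assert.h>

/* Cheap stand-in for the work done inside an allocation call. */
static unsigned long step(unsigned long x)
{
	return x * 2654435761UL + 1;
}

/* Hypothetical ops table, standing in for the modular allocator's indirections. */
struct alloc_ops {
	unsigned long (*fn)(unsigned long);
};

/* Every iteration loads ops->fn and calls through the pointer. */
static unsigned long run_indirect(const struct alloc_ops *ops, unsigned long n)
{
	unsigned long x = 0;

	assert(ops && ops->fn);
	for (unsigned long i = 0; i < n; i++)
		x = ops->fn(x);
	return x;
}

/* The "bypassed" variant: a direct call the compiler is free to inline. */
static unsigned long run_direct(unsigned long n)
{
	unsigned long x = 0;

	for (unsigned long i = 0; i < n; i++)
		x = step(x);
	return x;
}
```

To compare the two, time each call with clock() from <time.h>. Keep the ops table out of the compiler's sight, for example in another translation unit. Otherwise an optimizing compiler may turn the indirect call back into a direct one and hide the difference.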

Index: linux-2.6.18-rc5-mm1/mm/slabifier.c
===================================================================
--- linux-2.6.18-rc5-mm1.orig/mm/slabifier.c	2006-09-01 14:25:55.907938735 -0700
+++ linux-2.6.18-rc5-mm1/mm/slabifier.c	2006-09-01 15:20:51.915297648 -0700
@@ -498,12 +498,13 @@ gotpage:
 	goto redo;
 }

-static void *slab_alloc(struct slab_cache *sc, gfp_t gfpflags)
+void *slab_alloc(struct slab_cache *sc, gfp_t gfpflags)
 {
 	return __slab_alloc(sc, gfpflags, -1);
 }
+EXPORT_SYMBOL(slab_alloc);

-static void *slab_alloc_node(struct slab_cache *sc, gfp_t gfpflags,
+void *slab_alloc_node(struct slab_cache *sc, gfp_t gfpflags,
 							int node)
 {
 #ifdef CONFIG_NUMA
@@ -512,8 +513,9 @@ static void *slab_alloc_node(struct slab
 	return slab_alloc(sc, gfpflags);
 #endif

Christoph Lameter | 2 Sep 2006 00:34

[MODSLAB 2/5] Slabifier

Slabifier: A slab allocator with minimal meta information

V2->V3:
- Overload struct page
- Add new PageSlabsingle flag

Lately I have started tinkering with the slab, in particular after
Matt Mackall mentioned at the KS that the slab should be more modular.
One particular design issue with the current slab is that it is built on the
basic notion of shifting object references from list to list. Without NUMA this
is wild enough with the per cpu caches and the shared cache, but with NUMA we now
have per node shared arrays, per node lists and per node per node alien caches.
Somehow this all works, but one wonders whether it has to be that way. On very
large systems the number of these entities grows to unbelievable numbers.
On our 1k cpu/node system each slab needs 128M for alien caches alone.

So I thought it may be best to try to develop another basic slab layer
that does not have all the object queues and that does not have to carry
so much state information. I have also had concerns about the way locking
has been handled for a while. We could increase parallelism with finer grained
locking. This in turn may avoid the need for object queues.

One of the problems of the NUMA slab allocator is that per node partial
slab lists are used. Partial slabs cannot be filled up from other nodes.
So what I have tried to do here is to have minimal metainformation combined
with one centralized list of partially allocated slabs. The list_lock
is only taken if list modifications become necessary. The need for those
has been drastically reduced with a few measures. See below.
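A single partial list with a rarely taken list_lock might look roughly like this in user space. The sketch assumes that access to each slab's own counters is already serialized by some per-slab lock (not shown), and it uses a C11 atomic_flag spinlock for the list. struct slab, slab_free_one() and lock_taken are hypothetical. The list lock is needed only when a slab changes state, not on every free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct slab {
	unsigned inuse;               /* allocated objects */
	unsigned capacity;            /* objects per slab */
	struct slab *prev, *next;     /* partial-list linkage */
};

static struct slab *partial_head;                 /* the one central partial list */
static atomic_flag list_lock = ATOMIC_FLAG_INIT;
static unsigned long lock_taken;                  /* how often list_lock was acquired */

static void lock_list(void)
{
	while (atomic_flag_test_and_set_explicit(&list_lock, memory_order_acquire))
		;   /* spin */
	lock_taken++;
}

static void unlock_list(void)
{
	atomic_flag_clear_explicit(&list_lock, memory_order_release);
}

/* Caller must hold list_lock. */
static void partial_add(struct slab *s)
{
	s->prev = NULL;
	s->next = partial_head;
	if (partial_head)
		partial_head->prev = s;
	partial_head = s;
}

/* Caller must hold list_lock. */
static void partial_del(struct slab *s)
{
	if (s->prev)
		s->prev->next = s->next;
	else
		partial_head = s->next;
	if (s->next)
		s->next->prev = s->prev;
	s->prev = s->next = NULL;
}

/* Free one object back to s; list_lock only on full->partial or partial->empty. */
static void slab_free_one(struct slab *s)
{
	assert(s->inuse > 0);
	bool was_full = (s->inuse == s->capacity);

	s->inuse--;
	if (was_full && s->inuse > 0) {              /* full -> partial */
		lock_list();
		partial_add(s);
		unlock_list();
	} else if (!was_full && s->inuse == 0) {     /* partial -> empty */
		lock_list();
		partial_del(s);
		unlock_list();
		/* the empty slab's pages could now go back to the page allocator */
	}
}
```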

After toying around for a while I came to the realization that the page struct

Christoph Lameter | 2 Sep 2006 00:34

[MODSLAB 5/5] Slabulator: Emulate the existing Slab Layer

The slab emulation layer.

This provides a layer that implements the existing slab API.
We try to keep the definitions that we copy from slab.h
to an absolute minimum. If things break then more
(useless) definitions from slab.h may be needed.

We put a hook into slab.h to redirect includes for slab.h to
slabulator.h.

The slabulator also contains the remnants of the slab reaper since it is
used by the page allocator in the CONFIG_NUMA case. The slabifier does not
need this anymore since it is not object cache based.
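The emulation idea, where old-style API calls are forwarded to the new allocator, can be sketched as thin wrappers. Everything here is hypothetical. emu_cache_create() and the related functions only mimic the general shape of the existing slab API, and malloc() stands in for the slabifier.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical "new" allocator interface the emulation layer sits on. */
struct new_cache {
	const char *name;
	size_t objsize;   /* size after alignment */
};

/* Stand-in for the slabifier; plain malloc() does not honour
 * alignments larger than max_align_t. */
static void *new_cache_alloc(struct new_cache *c)
{
	return malloc(c->objsize);
}

static void new_cache_free(struct new_cache *c, void *p)
{
	(void)c;
	free(p);
}

/* Old-style entry points, emulated on top of the new interface. */
typedef struct new_cache emu_cache_t;

static emu_cache_t *emu_cache_create(const char *name, size_t size, size_t align)
{
	emu_cache_t *c;

	/* align must be 0 or a power of two */
	assert(size > 0 && (align & (align - 1)) == 0);
	c = malloc(sizeof(*c));
	if (!c)
		return NULL;
	if (align < sizeof(void *))
		align = sizeof(void *);
	c->name = name;
	c->objsize = (size + align - 1) & ~(align - 1);
	return c;
}

static void *emu_cache_alloc(emu_cache_t *c) { return new_cache_alloc(c); }
static void emu_cache_free(emu_cache_t *c, void *p) { new_cache_free(c, p); }
static void emu_cache_destroy(emu_cache_t *c) { free(c); }
```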

Signed-off-by: Christoph Lameter <clameter <at> sgi.com>

Index: linux-2.6.18-rc5-mm1/mm/slabulator.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.18-rc5-mm1/mm/slabulator.c	2006-09-01 14:10:50.808570950 -0700
@@ -0,0 +1,299 @@
+/*
+ * Slabulator = Emulate the Slab API.
+ *
+ * (C) 2006 Silicon Graphics, Inc. Christoph Lameter <clameter <at> sgi.com>
+ *
+ */
+#include <linux/mm.h>
+#include <linux/kmalloc.h>
+#include <linux/module.h>

Christoph Lameter | 2 Sep 2006 00:34

[MODSLAB 1/5] Generic Allocator Framework

Add allocator abstraction

The allocator abstraction layer provides sources of pages for the slabifier
and ways to customize the slabifier to one's needs (one can put
dmaification, rcuification of slab frees and so on on top of the
standard page allocator).

The allocator framework also provides a means for deriving new slab
allocators from old ones. That way features can be added in a generic way.
It would be possible to add rcu for slab objects or debugging in that
fashion.

The object-oriented style of deconstructing the allocators has the
advantage that we can deal with small pieces of code that add special
functionality. The overall framework makes it easy to replace pieces
and to evolve the whole allocator system faster.

It also provides a generic way to operate on different allocators.

It is no problem to define a new allocator that allocates from
memory pools and then use the slab allocator on that memory pool.

The code in mm/allocators.c provides some examples of what could be
done with derived allocators.
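The derived-allocator idea can be sketched with a table of function pointers and a wrapper that adds a feature on top of any base allocator. This is an illustration under assumptions, not the API in mm/allocators.c. struct allocator, the counting allocator and the use of malloc() as the page source are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator "class": operations that take the allocator itself. */
struct allocator {
	void *(*alloc)(struct allocator *a, size_t size);
	void (*release)(struct allocator *a, void *p);
};

/* Base allocator: malloc()/free() stand in for the page allocator. */
static void *base_alloc(struct allocator *a, size_t size)
{
	(void)a;
	return malloc(size);
}

static void base_release(struct allocator *a, void *p)
{
	(void)a;
	free(p);
}

static struct allocator base_allocator = { base_alloc, base_release };

/* Derived allocator: adds accounting on top of any other allocator. */
struct counting_allocator {
	struct allocator a;        /* first member, so the casts below are valid */
	struct allocator *base;    /* allocator being extended */
	long outstanding;          /* allocations not yet released */
};

static void *counting_alloc(struct allocator *a, size_t size)
{
	struct counting_allocator *c = (struct counting_allocator *)a;
	void *p = c->base->alloc(c->base, size);

	if (p)
		c->outstanding++;
	return p;
}

static void counting_release(struct allocator *a, void *p)
{
	struct counting_allocator *c = (struct counting_allocator *)a;

	if (p)
		c->outstanding--;
	c->base->release(c->base, p);
}

static void counting_init(struct counting_allocator *c, struct allocator *base)
{
	assert(base);
	c->a.alloc = counting_alloc;
	c->a.release = counting_release;
	c->base = base;
	c->outstanding = 0;
}
```

A counting allocator is itself a struct allocator, so derived allocators can be stacked. For example, one counting allocator can wrap another derived allocator in the same way it wraps the base one.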

Signed-off-by: Christoph Lameter <clameter>.

Index: linux-2.6.18-rc5-mm1/mm/Makefile
===================================================================
--- linux-2.6.18-rc5-mm1.orig/mm/Makefile	2006-09-01 10:13:42.824597049 -0700

