Alasdair G Kergon | 1 Nov 2007 01:00

Re: Re: dm: bounce_pfn limit added

On Wed, Oct 31, 2007 at 05:00:16PM -0500, Kiyoshi Ueda wrote:
> How about the case that other dm device is stacked on the dm device?
> (e.g. dm-linear over dm-multipath over i2o with bounce_pfn=64GB, and
>       the multipath table is changed to i2o with bounce_pfn=1GB.)

Let's not broaden the problem out in that direction yet - that's a
known flaw in the way all these device restrictions are handled.
(Which would, it happens, also be resolved by the dm architectural
changes I'm contemplating.)

Yes, we could certainly take this patch - it won't do much harm (just
hit performance in some configurations).  But I am not yet convinced
that there isn't some further underlying problem with the way the
responsibility for this bouncing is divided up between the various
layers: I still don't feel I completely understand this problem yet.

- How does that bio_alloc() in blk_queue_bounce() guarantee never to
lead to a deadlock (in the device-mapper context)?
- Are some functions failing to take account of the hw_segments
(and perhaps other) restrictions?
- Are things actually simpler if the bouncing is dealt with just once 
prior to entering the device stack (even though that may involve
bouncing some data that does not need it) or is it better to endeavour
to keep the bouncing as close to the final layer as possible?

Alasdair
-- 
agk@redhat.com


Christoph Lameter | 1 Nov 2007 01:02

[patch 3/7] Allocpercpu: Do __percpu_disguise() only if CONFIG_DEBUG_VM is set

Disguising costs a few cycles in the hot paths. So switch it off if
we are not debugging.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/percpu.h |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2007-10-31 16:40:14.892121256 -0700
+++ linux-2.6/include/linux/percpu.h	2007-10-31 16:41:00.907621059 -0700
@@ -33,7 +33,11 @@

 #ifdef CONFIG_SMP

+#ifdef CONFIG_DEBUG_VM
 #define __percpu_disguise(pdata) ((void *)~(unsigned long)(pdata))
+#else
+#define __percpu_disguise(pdata) ((void *)(pdata))
+#endif

 /* 
  * Use this to get to a cpu's version of the per-cpu object dynamically

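As a rough userspace illustration of what this patch toggles (a sketch, not the kernel code): with debugging on, the per cpu handle is stored bit-inverted so that an accidental direct dereference faults; with debugging off, the handle is simply the pointer. `DEBUG_VM_SKETCH` and the `sketch_*` names are made up for this example and stand in for CONFIG_DEBUG_VM and __percpu_disguise().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for CONFIG_DEBUG_VM; build with
 * -DDEBUG_VM_SKETCH=0 to get the no-op variant. */
#ifndef DEBUG_VM_SKETCH
#define DEBUG_VM_SKETCH 1
#endif

#if DEBUG_VM_SKETCH
/* Invert every bit so the handle cannot be dereferenced by accident. */
static inline void *sketch_disguise(void *pdata)
{
	return (void *)~(uintptr_t)pdata;
}
#else
/* No debugging: the handle is the pointer itself, no extra instruction. */
static inline void *sketch_disguise(void *pdata)
{
	return pdata;
}
#endif

/* In both variants, applying the operation twice gives back the pointer. */
static inline void *sketch_undisguise(void *handle)
{
	return sketch_disguise(handle);
}
```

Either way a round trip such as `assert(sketch_undisguise(sketch_disguise(&x)) == &x)` holds; only the cost of the intermediate step differs.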
Christoph Lameter | 1 Nov 2007 01:02

[patch 5/7] SLUB: Use allocpercpu to allocate per cpu data instead of running our own per cpu allocator

Using allocpercpu removes the need for the per cpu arrays in the kmem_cache struct.
These could get quite big if we have to support systems of up to thousands of cpus.
The use of alloc_percpu means that:

1. The size of kmem_cache for SMP configuration shrinks since we will only
need 1 pointer instead of NR_CPUS. The same pointer can be used by all
processors. Reduces cache footprint of the allocator.

2. We can dynamically size kmem_cache according to the actual nodes in the
system meaning less memory overhead for configurations that may potentially
support up to 1k NUMA nodes.

3. We can remove the diddle widdle with allocating and releasing kmem_cache_cpu
   structures when bringing up and shutting down cpus. The allocpercpu
   logic will do it all for us.

4. Fastpath performance increases by another 20% vs. the earlier improvements.
   Instead of having fastpath with 40-50 cycles we are now in the 30-40 range.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slub_def.h |   11 ++--
 mm/slub.c                |  125 ++++-------------------------------------------
 2 files changed, 18 insertions(+), 118 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2007-10-30 16:34:41.000000000 -0700
+++ linux-2.6/include/linux/slub_def.h	2007-10-31 09:23:26.000000000 -0700

Christoph Lameter | 1 Nov 2007 01:02

[patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

This patch increases the speed of the SLUB fastpath by
improving the per cpu allocator and makes it usable for SLUB.

Currently allocpercpu manages arrays of pointers to per cpu objects.
This means that it has to allocate the arrays and then populate them
as needed with objects. Although these objects are called per cpu
objects they cannot be handled in the same way as per cpu objects
by adding the per cpu offset of the respective cpu.

The patch here changes that. We create a small memory pool in the
percpu area and allocate from there if alloc per cpu is called.
As a result we do not need the per cpu pointer arrays for each
object. This reduces memory usage and also the cache footprint
of allocpercpu users. Also the per cpu objects for a single processor
are tightly packed next to each other decreasing cache footprint
even further and making it possible to access multiple objects
in the same cacheline.
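
The difference between the two addressing schemes can be sketched in plain userspace C. This is only an illustration under assumptions: the `sketch_*` names, the CPU count and the area size are invented here, and the real kernel structures are not shown in this mail.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_CPUS   4    /* hypothetical CPU count */
#define SKETCH_AREA_SIZE 64   /* hypothetical per cpu area size in bytes */

/* Old scheme: every allocation carries its own array of per cpu pointers. */
struct sketch_old_percpu {
	void *ptrs[SKETCH_NR_CPUS];
};

static inline void *sketch_old_ptr(struct sketch_old_percpu *p, int cpu)
{
	return p->ptrs[cpu];   /* extra load from a per-object array */
}

/* New scheme: one area per CPU; an allocation is just an offset into it. */
static unsigned char sketch_areas[SKETCH_NR_CPUS][SKETCH_AREA_SIZE];

static inline void *sketch_new_ptr(size_t offset, int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS && offset < SKETCH_AREA_SIZE);
	return &sketch_areas[cpu][offset];   /* base of that CPU's area + offset */
}
```

In the second scheme, two objects allocated at neighbouring offsets sit next to each other within one CPU's area, which is the packing effect described above.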

SLUB has the same mechanism implemented. After fixing up the
allocpercpu stuff we throw the SLUB method out and use the new
allocpercpu handling. Then we optimize allocpercpu addressing
by adding a new function

	this_cpu_ptr()

that allows the determination of the per cpu pointer for the
current processor in a more efficient way on many platforms.

This increases the speed of SLUB (and likely other kernel subsystems
that benefit from the allocpercpu enhancements):

Christoph Lameter | 1 Nov 2007 01:02

[patch 2/7] allocpercpu: Remove functions that are rarely used.

Population and depopulation are no longer needed since newly created
per cpu areas will have all the fields needed. Teardown of per cpu
areas will remove objects no longer needed.

This basically reverts the API to the way it was before the population and
depopulation went in. There is only a single user in the kernel that uses these
functions in net/iucv/iucv.c which is S/390 specific.

Remove the useless population and depopulation functions there. In that driver
we have the single occurrence of a per cpu allocation that uses GFP flags.
The allocation from the DMA zone is required in order to have memory below 2G. But
it seems that the per cpu areas are also under 2G so we are fine there.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/percpu.h |   42 +------------------
 mm/allocpercpu.c       |  104 +++++--------------------------------------------
 net/iucv/iucv.c        |   31 +++-----------
 3 files changed, 21 insertions(+), 156 deletions(-)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2007-10-31 16:40:04.831371052 -0700
+++ linux-2.6/include/linux/percpu.h	2007-10-31 16:40:14.892121256 -0700
@@ -47,41 +47,16 @@
     	(__typeof__(ptr))(p + q);			\
 })

-extern void *percpu_populate(void *__pdata, size_t size, gfp_t gfp, int cpu);

Christoph Lameter | 1 Nov 2007 01:02

[patch 4/7] Percpu: Add support for this_cpu_offset() to be able to create this_cpu_ptr()

Support for this_cpu_ptr() is important for those arches that allow a faster
way to get to the per cpu area of the local processor.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/asm-generic/percpu.h |    4 ++++
 include/asm-ia64/percpu.h    |    3 +++
 include/asm-powerpc/percpu.h |    3 +++
 include/asm-s390/percpu.h    |    4 ++++
 include/asm-sparc64/percpu.h |    2 ++
 include/asm-x86/percpu_32.h  |    2 ++
 include/asm-x86/percpu_64.h  |    4 ++++
 include/linux/percpu.h       |    7 +++++++
 8 files changed, 29 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2007-10-31 16:41:00.907621059 -0700
+++ linux-2.6/include/linux/percpu.h	2007-10-31 16:42:45.748121446 -0700
@@ -51,6 +51,13 @@
     	(__typeof__(ptr))(p + q);			\
 })

+#define this_cpu_ptr(ptr)           			\
+({							\
+	void *p = ptr;					\
+    	(__typeof__(ptr))(p + this_cpu_offset());	\
+})
+
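
A userspace sketch of how the this_cpu_ptr() macro above behaves, for readers who want to try it outside the kernel. Everything prefixed `sketch_` is hypothetical: the "current CPU" is a plain variable and the offset is computed from it, whereas the per-arch this_cpu_offset() implementations from this patch are not shown in this excerpt. The macro needs GNU C (statement expression, __typeof__).

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_CPUS   2    /* hypothetical CPU count */
#define SKETCH_AREA_SIZE 32   /* hypothetical per cpu area size in bytes */

static unsigned char sketch_area[SKETCH_NR_CPUS][SKETCH_AREA_SIZE];
static int sketch_current_cpu;   /* stand-in for "the CPU we are running on" */

/* Hypothetical analogue of this_cpu_offset(): distance from CPU 0's area
 * to the area of the current CPU. */
static inline ptrdiff_t sketch_this_cpu_offset(void)
{
	assert(sketch_current_cpu >= 0 && sketch_current_cpu < SKETCH_NR_CPUS);
	return (ptrdiff_t)sketch_current_cpu * SKETCH_AREA_SIZE;
}

/* Same shape as the macro in the patch. The cast to char * avoids relying
 * on arithmetic on void *, which is a GNU extension of its own. */
#define sketch_this_cpu_ptr(ptr)					\
({									\
	char *p_ = (char *)(ptr);					\
	(__typeof__(ptr))(p_ + sketch_this_cpu_offset());		\
})
```

With `sketch_current_cpu` set to 1, a pointer into CPU 0's area is shifted by exactly one area size; with it set to 0 the pointer comes back unchanged.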

Christoph Lameter | 1 Nov 2007 01:02

[patch 7/7] SLUB: Optimize per cpu access on the local cpu using this_cpu_ptr()

Use this_cpu_ptr to optimize access to the per cpu area in the fastpaths.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/slub.c |   27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2007-10-31 14:00:28.635673087 -0700
+++ linux-2.6/mm/slub.c	2007-10-31 14:01:29.803422492 -0700
@@ -248,6 +248,15 @@ static inline struct kmem_cache_cpu *get
 #endif
 }

+static inline struct kmem_cache_cpu *this_cpu_slab(struct kmem_cache *s)
+{
+#ifdef CONFIG_SMP
+	return this_cpu_ptr(s->cpu_slab);
+#else
+	return &s->cpu_slab;
+#endif
+}
+
 /*
  * The end pointer in a slab is special. It points to the first object in the
  * slab but has bit 0 set to mark it.
@@ -1521,7 +1530,7 @@ static noinline unsigned long get_new_sl
 	if (!page)

Christoph Lameter | 1 Nov 2007 01:02

[patch 6/7] SLUB: No need to cache kmem_cache data in kmem_cache_cpu anymore

Remove the fields in kmem_cache_cpu that were used to cache data from
kmem_cache when they were in different cachelines. The cacheline that holds
the per cpu array pointer now also holds these values. We can cut down the
kmem_cache_cpu size to almost half.

The get_freepointer() and set_freepointer() functions that used to be only
intended for the slow path now are also useful for the hot path since access
to the field does not require an additional cacheline anymore. This results
in consistent use of setting the freepointer for objects throughout SLUB.
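
For readers unfamiliar with the free pointer helpers, here is a minimal userspace sketch of the idea: a free object stores the link to the next free object inside itself, at a per-cache offset. The struct and the `sketch_*` names are simplified stand-ins invented for this example, not the definitions from mm/slub.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down cache descriptor: only the field used here. */
struct sketch_cache {
	size_t offset;   /* where the free pointer lives inside a free object */
};

/* Read the "next free object" link stored inside a free object. */
static inline void *sketch_get_freepointer(const struct sketch_cache *s,
					   void *object)
{
	assert(object != NULL);
	return *(void **)((char *)object + s->offset);
}

/* Store the "next free object" link inside a free object. */
static inline void sketch_set_freepointer(const struct sketch_cache *s,
					  void *object, void *fp)
{
	assert(object != NULL);
	*(void **)((char *)object + s->offset) = fp;
}
```

Both helpers only need the offset from the cache descriptor, which is why it matters for the hot path whether reading that field costs an extra cacheline.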

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/slub_def.h |    3 --
 mm/slub.c                |   50 +++++++++++++++--------------------------------
 2 files changed, 16 insertions(+), 37 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2007-10-31 12:57:00.131421982 -0700
+++ linux-2.6/include/linux/slub_def.h	2007-10-31 12:57:43.446922264 -0700
@@ -15,9 +15,6 @@ struct kmem_cache_cpu {
 	void **freelist;
 	struct page *page;
 	int node;
-	unsigned int offset;
-	unsigned int objsize;
-	unsigned int objects;
 };


Christoph Lameter | 1 Nov 2007 01:02

[patch 1/7] allocpercpu: Make it a true per cpu allocator by allocating from a per cpu array

Currently each call to alloc_percpu allocates an array of pointers to objects.
For each operation on a percpu structure we need to follow a pointer from that
map. Usually a processor uses only the entry for its own processor id in that
array. The rest of the bytes in the cacheline are not needed. This repeats
itself for each and every per cpu array in use.

Moreover the result of alloc_percpu is not a variable that can be handled like
a regular per cpu variable.

The approach here changes the way allocpercpu is done. Objects are placed
in preallocated per cpu areas that are indexed via the existing per cpu array
of pointers. So we have a single array of pointers to per cpu areas that
is used by all per cpu operations. The data is placed tightly next to each
other for each processor so that the likelihood of a single cache line covering
data for multiple needs is increased. The cache footprint of the allocpercpu
operations shrinks dramatically. Some processors have the ability to map the
per cpu area of the current processor in a special way so that variables in
that area can be reached very efficiently. It is rather typical that a processor
only uses its own per processor area. On many architectures the indexing via
the per cpu array can then be completely bypassed.

The size of the per cpu alloc area is defined to be 32k per processor for now.
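
To make the fixed-size area concrete, here is a small sketch of handing out offsets from a 32k area. The allocation strategy in the actual patch is not visible in this excerpt; the code below assumes a simple bump allocator with no freeing, and all `sketch_*` names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_AREA_SIZE (32 * 1024)   /* 32k per processor, as stated above */

static size_t sketch_used;   /* bytes handed out so far, same for every CPU */

/*
 * Reserve size bytes at a power-of-two alignment. Returns the offset that
 * is valid in every CPU's area, or (size_t)-1 when the area is exhausted.
 */
static size_t sketch_area_alloc(size_t size, size_t align)
{
	size_t start;

	assert(align != 0 && (align & (align - 1)) == 0);
	start = (sketch_used + align - 1) & ~(align - 1);
	if (size > SKETCH_AREA_SIZE || start > SKETCH_AREA_SIZE - size)
		return (size_t)-1;
	sketch_used = start + size;
	return start;
}
```

Because one offset describes the object on all processors, a single successful call reserves space in every per cpu area at once, and a full area means the allocation fails for all processors together.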

Another advantage of this approach is that the onlining and offlining
of the per cpu items is handled in a global way. On onlining a cpu all
objects become present without callbacks. Similarly on offlining a cpu
all per cpu objects vanish without the need of callbacks. Callbacks
may still be needed to do preparation and cleanup of the data areas
but the freeing and allocation of the per cpu areas no longer needs to
be done by the subsystems.

Pavel Machek | 1 Nov 2007 01:12

Re: iwl3945 in 2.6.24-rc1 dies under load

Hi!

> > tcpspray -n 1 -b 10000000 10.0.0.2
> > 
> > Oct 31 01:42:56 amd kernel: iwl3945: Microcode SW error detected.
> > Restarting 0x82000008.
> > Oct 31 01:42:56 amd kernel: iwl3945: Error Reply type 0x00000005 cmd
> > REPLY_TX (0x1C) seq 0x02C7 ser 0x0000004B
> > Oct 31 01:42:58 amd kernel: iwl3945: Can't stop Rx DMA.
> 
> Please attach the dmesg with module param debug=0x43fff (make sure
> CONFIG_IWLWIFI_DEBUG is selected). This will dump the firmware event log
> for debug.

I could reproduce it with reduced debugging... after 2 hours or so,
and I did 2x tcpspray + ping -f.

Is it useful, or is full dump needed?
								Pavel

iwl3945: I iwl_rx_handle r = 124, i = 123, REPLY_3945_RX, 0x1b
iwl3945: I iwl_rx_handle r = 125, i = 124, STATISTICS_NOTIFICATION, 0x9d
iwl3945: U iwl_pci_remove *** UNLOAD DRIVER ***
iwl3945: U __iwl_down iwl3945 is going down
iwl3945: U iwl_hw_nic_stop_master stop master
iwl3945: U iwl_clear_free_frames 0 frames on pre-allocated heap on clear.
iwl3945: U iwl_mac_remove_interface enter
iwl3945: U iwl_mac_remove_interface leave
iwl3945: U iwl_mac_stop enter
iwl3945: U iwl_mac_stop leave

