Nick Piggin | 1 May 2008 02:29

Re: [rfc] data race in page table setup/walking?

On Wed, Apr 30, 2008 at 08:53:44AM -0700, Linus Torvalds wrote:
> 
> 
> On Wed, 30 Apr 2008, Nick Piggin wrote:
> > 
> > Actually, aside, all those smp_wmb() things in pgtable-3level.h can
> > probably go away if we cared: because we could be sneaky and leverage
> > the assumption that top and bottom will always be in the same cacheline
> > and thus should be shielded from memory consistency problems :)
> 
> Umm.
> 
> Why would we care, since smp_wmb() is a no-op? (Yea, it's a compiler 
> barrier, big deal, it's not going to cost us anything).

Oh there needs to be a compiler barrier there. I was just saying...
I don't actually think we care (whether or not I'm right).
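
As a rough illustration of the barrier under discussion, here is a minimal userspace sketch, assuming GCC or Clang inline asm, of a 64-bit entry stored as two 32-bit halves with a compiler-only barrier between them. The names pte64, compiler_barrier, set_pte_sketch and pte_value are hypothetical and are not the kernel's definitions; the high-then-low store order is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a PAE-style entry: two 32-bit halves. */
struct pte64 {
	uint32_t low;	/* assumed to carry the "present" bit */
	uint32_t high;
};

static_assert(sizeof(struct pte64) == 8, "both halves must fit in 8 bytes");

/* Compiler-only barrier: emits no instruction, but the compiler may not
 * move or merge memory accesses across it. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Store the high half first, then the low half, so that a reader who
 * sees the new low half can only see it after the high half was written
 * (on hardware that keeps stores in program order). */
static inline void set_pte_sketch(volatile struct pte64 *ptep, uint64_t val)
{
	ptep->high = (uint32_t)(val >> 32);
	compiler_barrier();
	ptep->low = (uint32_t)val;
}

/* Reassemble the 64-bit value from the two halves. */
static inline uint64_t pte_value(const volatile struct pte64 *ptep)
{
	return ((uint64_t)ptep->high << 32) | ptep->low;
}
```

The barrier here only constrains the compiler. Whether the hardware also keeps the two stores in order is a property of the architecture, which is the point of the reply quoted below.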

> Also, write barriers are not about cacheline access order, they tend to be 
> more about the write *buffer*, ie before the write even hits the cache 
> line. And a write could easily pass another write in the write buffer if 
> there is (for example) a dependency on the address.
> 
> So even if they are in the same cacheline, if the first write needs an 
> offset addition, and the second one does not, it could easily be that the 
> second one hits the write buffer first (together with some alias 
> detection that re-does the things if they alias).
> 
> Of course, on x86, the write ordering is strictly defined, and even if the 
> CPU reorders writes they are guaranteed to never show up re-ordered, so 

Nick Piggin | 1 May 2008 02:35

Re: [rfc] data race in page table setup/walking?

On Wed, Apr 30, 2008 at 12:14:51PM +0100, Hugh Dickins wrote:
> On Wed, 30 Apr 2008, Nick Piggin wrote:
> > 
> > Actually, aside, all those smp_wmb() things in pgtable-3level.h can
> > probably go away if we cared: because we could be sneaky and leverage
> > the assumption that top and bottom will always be in the same cacheline
> > and thus should be shielded from memory consistency problems :)
> 
> I've sometimes wondered along those lines.  But it would need
> interrupts disabled, wouldn't it?  And could SMM mess it up?
> And what about another CPU taking the cacheline to modify it
> in between our two accesses?

Nothing more than could not already happen with the smp_wmb in there,
AFAIKS.

 
> I don't think we do care in that x86 PAE case, but as a general
> principle, if it can be safely assumed on all architectures (or
> more messily, just on some) under certain conditions, then shouldn't
> we be looking to use that technique (relying on a consistent view of
> separate variables clustered into the same cacheline) in critical
> places, rather than regarding it as sneaky?
> 
> But I suspect this is a chimaera, that there's actually no
> safe use to be made of it.  I'd be glad to be shown wrong.

Well Linus put a dampener on it... but if it actually did work, then
yeah I guess there are some places it could be used. I suspect that
on some implementations, being in the same cacheline would actually

Nick Piggin | 1 May 2008 03:54

Re: [patch] SLQB v2

Hi, sorry I missed this message initially..

On Tue, Apr 15, 2008 at 05:44:07AM +0200, Ahmed S. Darwish wrote:
> Hi!,
> 
> On Thu, Apr 10, 2008 at 09:31:38PM +0200, Nick Piggin wrote:
> ...
> > +
> > +/*
> > + * We use struct slqb_page fields to manage some slob allocation aspects,
> > + * however to avoid the horrible mess in include/linux/mm_types.h, we'll
> > + * just define our own struct slqb_page type variant here.
> > + */
> > +struct slqb_page {
> > +	union {
> > +		struct {
> > +			unsigned long flags;	/* mandatory */
> > +			atomic_t _count;	/* mandatory */
> > +			unsigned int inuse;	/* Nr of objects */
> > +		   	struct kmem_cache_list *list; /* Pointer to list */
> > +			void **freelist;	/* freelist req. slab lock */
> > +			union {
> > +				struct list_head lru; /* misc. list */
> > +				struct rcu_head rcu_head; /* for rcu freeing */
> > +			};
> > +		};
> > +		struct page page;
> > +	};
> > +};
> 
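
The overlay technique described in the quoted comment can be sketched in plain C11. This is only an illustration under stated assumptions: generic_page stands in for a large shared descriptor such as struct page, and slab_view and to_slab_view are hypothetical names, not part of the patch.

```c
#include <assert.h>

/* Stand-in for a large shared descriptor such as struct page. */
struct generic_page {
	unsigned long flags;
	int count;
	void *other[6];		/* fields other subsystems care about */
};

/* Private view: named fields overlaid on the same storage. */
struct slab_view {
	union {
		struct {
			unsigned long flags;	/* must mirror generic_page */
			int count;		/* must mirror generic_page */
			unsigned int inuse;	/* private: reuses storage the owner
						 * does not need meanwhile */
			void **freelist;	/* private, same idea */
		};
		struct generic_page page;
	};
};

/* The view must never be larger than the descriptor it overlays. */
static_assert(sizeof(struct slab_view) == sizeof(struct generic_page),
	      "view must not outgrow the descriptor");

/* Reinterpret a descriptor as the private view. */
static inline struct slab_view *to_slab_view(struct generic_page *p)
{
	return (struct slab_view *)p;
}
```

The size check is the main safety net in this sketch: if the private fields ever grew past the shared descriptor, the build would fail instead of silently corrupting neighbouring memory.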

Tom May | 1 May 2008 04:07

Re: [PATCH 0/8][for -mm] mem_notify v6

On Wed, Apr 23, 2008 at 1:27 AM, Daniel Spång <daniel.spang <at> gmail.com> wrote:
> Hi Tom
>
>
>  On 4/17/08, Tom May <tom <at> tommay.com> wrote:
>  >
>  >  Here is the start and end of the output from the test program.  At
>  >  each /dev/mem_notify notification Cached decreases, then eventually
>  >  Mapped decreases as well, which means the amount of time the program
>  >  has to free memory gets smaller and smaller.  Finally the oom killer
>  >  is invoked because the program can't react quickly enough to free
>  >  memory, even though it can free at a faster rate than it can use
>  >  memory.  My test is slow to free because it calls nanosleep, but this
>  >  is just a simulation of my actual program that has to perform garbage
>  >  collection before it can free memory.
>
>  I have also seen this behaviour in my static tests with low mem
>  notification on swapless systems. It is a problem with small programs
>  (typically static test programs) where the text segment is only a few
>  pages. I have not seen this behaviour in larger programs which use a
>  larger working set. As long as the system working set is bigger than
>  the amount of memory that needs to be allocated, between every
>  notification reaction opportunity, it seems to be ok.

Hi Daniel,

You're saying the program's in-core text pages serve as a reserve that
the kernel can discard when it needs some memory, correct?  And that
even if the kernel discards them, it will page them back in as a
matter of course as the program runs, to maintain the reserve?  That

Linus Torvalds | 1 May 2008 05:24

Re: [rfc] data race in page table setup/walking?


On Thu, 1 May 2008, Nick Piggin wrote:
> > 
> > Of course, on x86, the write ordering is strictly defined, and even if the 
> > CPU reorders writes they are guaranteed to never show up re-ordered, so 
> > this is not an issue. But I wanted to point out that memory ordering is 
> > *not* just about cachelines, and being in the same cacheline is no 
> > guarantee of anything, even if it can have *some* effects.
> 
> Well it is a guarantee about cache coherency presumably, but I guess
> you're taking that for granted.

Yes, I'm taking cache coherency for granted, I don't think it's worth even 
worrying about non-coherent cases.

> But I'm surprised that two writes to the same cacheline (different
> words) can be reordered. Of course write buffers are technically outside
> the coherency domain, but I would have thought any implementation will
> actually treat writes to the same line as aliasing. Is there a counter
> example?

I don't know if anybody does it, but no, normally I would *not* expect any 
alias logic to have anything to do with cachelines. Aliasing within a 
cacheline is so common (spills to the stack, if nothing else) that if the 
CPU has some write buffer alias logic, I'd expect it to be byte or perhaps 
word-granular.

So I think that at least in theory it is quite possible that a later write 
hits the same cacheline first, just because the write data or address got 
resolved first and the architecture allows out-of-order memory accesses. 
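
One way to avoid depending on cacheline placement at all is to state the ordering explicitly. The following is a minimal sketch in portable C11, in which the release fence roughly plays the role smp_wmb() plays in the kernel (it is stronger than a pure store-store barrier). The names struct pair, publish and consume are hypothetical, and the 64-byte line size is an assumption.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

/* Two words deliberately placed in the same (assumed 64-byte) line. */
struct pair {
	alignas(64) _Atomic uint32_t data;
	_Atomic uint32_t ready;
};

static_assert(sizeof(struct pair) == 64, "both words share one aligned line");

/* Writer: the fence orders the data store before the ready store,
 * regardless of how the two stores leave the write buffer. */
static void publish(struct pair *p, uint32_t v)
{
	atomic_store_explicit(&p->data, v, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&p->ready, 1, memory_order_relaxed);
}

/* Reader: returns 1 and fills *out once the data is published, else 0. */
static int consume(struct pair *p, uint32_t *out)
{
	if (!atomic_load_explicit(&p->ready, memory_order_relaxed))
		return 0;
	atomic_thread_fence(memory_order_acquire);
	*out = atomic_load_explicit(&p->data, memory_order_relaxed);
	return 1;
}
```

Sharing a line changes nothing about correctness in this sketch; the fences alone carry the ordering guarantee.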

kamezawa.hiroyu | 1 May 2008 10:39

Re: Re: [PATCH] more ZERO_PAGE handling ( was 2.6.24 regression: deadlock on coredump of big process)

>This patch fixes the deadlock.  Tested on 2.6.24.5.  Thanks!
>
>Tested-by: Tony Battersby <tonyb <at> cybernetics.com>
>
Thank you for testing. I'll post this again when I'm back.

Regards,
-Kame

Johannes Weiner | 1 May 2008 12:22

Re: [patch 0/4] Bootmem cleanups

Hi Ingo,

Ingo Molnar <mingo <at> elte.hu> writes:

> * Johannes Weiner <hannes <at> saeurebad.de> wrote:
>
>> Hi Ingo,
>> 
>> I now dropped the node-crossing patches from my bootmem series and 
>> here is what is left over.
>> 
>> They apply to Linus' current git 
>> (0ff5ce7f30b45cc2014cec465c0e96c16877116e).
>> 
>> Please note that all parts affecting !X86_32_BORING_UMA_BOX are 
>> untested!
>> 
>>  arch/alpha/mm/numa.c             |    8 ++--
>>  arch/arm/mm/discontig.c          |   34 ++++++++++-----------
>>  arch/ia64/mm/discontig.c         |   11 +++----
>>  arch/m32r/mm/discontig.c         |    4 +--
>>  arch/m68k/mm/init.c              |    4 +--
>>  arch/mips/sgi-ip27/ip27-memory.c |    3 +-
>>  arch/parisc/mm/init.c            |    3 +-
>>  arch/powerpc/mm/numa.c           |    3 +-
>>  arch/sh/mm/numa.c                |    5 +--
>>  arch/sparc64/mm/init.c           |    3 +-
>>  arch/x86/mm/discontig_32.c       |    3 +-
>>  arch/x86/mm/numa_64.c            |    6 +---
>>  include/linux/bootmem.h          |    7 +---

Hugh Dickins | 1 May 2008 14:45

Re: [rfc] data race in page table setup/walking?

On Thu, 1 May 2008, Nick Piggin wrote:
> On Wed, Apr 30, 2008 at 12:14:51PM +0100, Hugh Dickins wrote:
> > On Wed, 30 Apr 2008, Nick Piggin wrote:
> > > 
> > > Actually, aside, all those smp_wmb() things in pgtable-3level.h can
> > > probably go away if we cared: because we could be sneaky and leverage
> > > the assumption that top and bottom will always be in the same cacheline
> > > and thus should be shielded from memory consistency problems :)
> > 
> > I've sometimes wondered along those lines.  But it would need
> > interrupts disabled, wouldn't it?  And could SMM mess it up?
> > And what about another CPU taking the cacheline to modify it
> > in between our two accesses?
> 
> Nothing more than could not already happen with the smp_wmb in there,
> AFAIKS.

Yes, one does wonder just what I was wondering ;)

Hugh

KOSAKI Motohiro | 1 May 2008 17:06

Re: [PATCH 0/8][for -mm] mem_notify v6

Hi Tom,

> In my case of a Java virtual machine, where I originally saw the
> problem, most of the code is interpreted byte codes or jit-compiled
> native code, all of which resides not in the text segment but in
> anonymous pages that aren't backed by a file, and there is no swap
> space.  The actual text segment working set can be very small (memory
> allocation, garbage collection, synchronization, other random native
> code).  And, as KOSAKI Motohiro pointed out, it may be wise to mlock
> these areas.  So the text working set doesn't make an adequate
> reserve.

Is your mem_notify check routine written in native code or in Java?
If native, my suggestion applies.
If Java, it does not.

My point is: on a swapless system, the routine that checks /dev/mem_notify should be mlocked.
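
A minimal userspace sketch of pinning a region with POSIX mlock(), so that the pages it covers cannot be discarded under memory pressure. The helper names page_span and lock_region are hypothetical, and how to find the address range of the notification handler is left open here; mlockall(MCL_CURRENT | MCL_FUTURE) would be a blunter alternative.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round [addr, addr + len) outward to whole pages. */
static void page_span(uintptr_t addr, size_t len, size_t page,
		      uintptr_t *start, size_t *span)
{
	assert(page != 0 && (page & (page - 1)) == 0);	/* power of two */
	uintptr_t end = (addr + len + page - 1) & ~(uintptr_t)(page - 1);

	*start = addr & ~(uintptr_t)(page - 1);
	*span = (size_t)(end - *start);
}

/* Pin the pages backing [addr, addr + len).
 * Returns mlock()'s result: 0 on success, -1 with errno set on failure
 * (for example when RLIMIT_MEMLOCK is too small). */
static int lock_region(const void *addr, size_t len)
{
	uintptr_t start;
	size_t span;

	page_span((uintptr_t)addr, len, (size_t)sysconf(_SC_PAGESIZE),
		  &start, &span);
	return mlock((const void *)start, span);
}
```

Callers should check the return value, since an unprivileged process can usually lock only a limited amount of memory.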

> However, I can maintain a reserve of cached and/or mapped memory by
> touching pages in the text segment (or any mapped file) as the final
> step of low memory notification handling, if the cached page count is
> getting low.  For my purposes, this is nearly the same as having an
> additional threshold-based notification, since it forces notifications
> to occur while the kernel still has some memory to satisfy allocations
> while userspace code works to free memory.  And it's simple.
> 
> Unfortunately, this is more expensive than it could be since the pages
> need to be read in from some device (mapping /dev/zero doesn't cause
> pages to be allocated). What I'm looking for now is a cheap way to
> populate the cache with pages that the kernel can throw away when it
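
The idea in the quoted text above of maintaining a reserve by touching pages of a mapped file could be sketched as follows. This is only an illustration under stated assumptions: touch_file_pages is a hypothetical helper, and which file to map and when to call it are left open.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only and read one byte from every page, so that the
 * pages are brought into the page cache.
 * Returns the number of pages touched, or -1 on error. */
static long touch_file_pages(const char *path)
{
	long page = sysconf(_SC_PAGESIZE);
	assert(page > 0);

	int fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	struct stat st;
	if (fstat(fd, &st) < 0 || st.st_size <= 0) {
		close(fd);
		return -1;
	}

	size_t len = (size_t)st.st_size;
	const volatile unsigned char *p =
		mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);		/* the mapping stays valid after close */
	if (p == MAP_FAILED)
		return -1;

	long touched = 0;
	for (size_t off = 0; off < len; off += (size_t)page) {
		(void)p[off];	/* volatile read forces the page in */
		touched++;
	}

	munmap((void *)p, len);
	return touched;
}
```

The pages read this way are clean and file-backed, so the kernel can drop them again without writing anything out, which is the property the quoted text relies on.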

Andrea Arcangeli | 1 May 2008 20:12

[ofa-general] mmu notifier-core v14->v15 diff for review

Hello everyone,

this is the v14 to v15 difference to the mmu-notifier-core patch. This
is just for review of the difference, I'll post full v15 soon, please
review the diff in the meantime. Lots of those cleanups are thanks to
Andrew's review of mmu-notifier-core in v14. He also spotted the
GFP_KERNEL allocation under spin_lock where DEBUG_SPINLOCK_SLEEP
failed to catch it until I enabled PREEMPT (GFP_KERNEL there was
perfectly safe with the whole patchset applied but not ok if only
mmu-notifier-core was applied). As usual that bug couldn't hurt
anybody unless the mmu notifiers were armed.

I also wrote a proper changelog to the mmu-notifier-core patch that I
will append before the v14->v15 diff:

Subject: mmu-notifier-core

With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to
pages. There are secondary MMUs (with secondary sptes and secondary
tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
spte in mmu-notifier context, I mean "secondary pte". In GRU case
there's no actual secondary pte and there's only a secondary tlb
because the GRU secondary MMU has no knowledge about sptes and every
secondary tlb miss event in the MMU always generates a page fault that
has to be resolved by the CPU (this is not the case of KVM where a
secondary tlb miss will walk sptes in hardware and it will refill the
secondary tlb transparently to software if the corresponding spte is
present). The same way zap_page_range has to invalidate the pte before
freeing the page, the spte (and secondary tlb) must also be
invalidated before any page is freed and reused.