Andrew Morton | 1 Dec 02:33 2004

Re: [PATCH]: 1/4 batch mark_page_accessed()

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> Because the ordering of LRU pages should be enhanced with respect to locality:
>  with the mark_page_accessed batching you group together a task's accessed pages
>  and move them to the active list at once.
> 
>  You maintain better locality ordering, while decreasing the precision of aging/
>  temporal locality.
> 
>  Which should enhance disk writeout performance.

I'll buy that explanation.  Although I'm a bit sceptical that it is
measurable.

Was that particular workload actually performing significant amounts of
writeout in vmscan.c?  (We should have direct+kswapd counters for that, but
we don't.  /proc/vmstat:pgrotated will give us an idea).

>  On the other hand, without batching you mix the locality up in the LRU - the LRU
>  becomes more precise in terms of "LRU aging", but less ordered in terms of
>  sequential access patterns.
> 
>  The disk-IO-intensive reaim shows a very significant gain from the batching; it's
>  probably due to the enhanced LRU ordering (what Nikita says).
> 
>  The slowdown is probably due to the additional atomic_inc done by page_cache_get().
> 
>  Is there no way to avoid that page_cache_get() there (and in lru_cache_add() as well)?

Not really.  The page is only in the pagevec at that time - if someone does
(Continue reading)
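
For readers who have not seen the patch, the idea being debated is roughly this:
instead of moving each referenced page to the active list one at a time (taking the
zone lru_lock every time), accessed pages are collected in a per-CPU pagevec and
activated in one go. Below is a minimal sketch assuming the 2.6-era pagevec and
page-flag APIs; __pagevec_activate() is a hypothetical helper that takes the
lru_lock once for the whole batch, and none of this is the actual
mark_page_accessed-batching.patch.

/* Sketch only, not the code from mark_page_accessed-batching.patch.
 * __pagevec_activate() is assumed to take zone->lru_lock once and
 * move every page in the batch to the active list. */
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(struct pagevec, activate_pvecs) = { 0, };

void mark_page_accessed(struct page *page)
{
	if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
		struct pagevec *pvec = &get_cpu_var(activate_pvecs);

		page_cache_get(page);	/* the extra atomic_inc Marcelo mentions:
					 * pin the page while it sits in the pagevec */
		if (!pagevec_add(pvec, page))
			__pagevec_activate(pvec);	/* batch full: one lru_lock round trip */
		put_cpu_var(activate_pvecs);
		ClearPageReferenced(page);
	} else if (!PageReferenced(page)) {
		SetPageReferenced(page);
	}
}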

Nikita Danilov | 1 Dec 13:23 2004

Re: [PATCH]: 1/4 batch mark_page_accessed()

Andrew Morton writes:
 > Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
 > >
 > > Because the ordering of LRU pages should be enhanced with respect to locality:
 > >  with the mark_page_accessed batching you group together a task's accessed pages
 > >  and move them to the active list at once.
 > > 
 > >  You maintain better locality ordering, while decreasing the precision of aging/
 > >  temporal locality.
 > > 
 > >  Which should enhance disk writeout performance.
 > 
 > I'll buy that explanation.  Although I'm a bit sceptical that it is
 > measurable.

cluster-pageout.patch, which was sent together with
mark_page_accessed-batching.patch, has a roughly similar effect: page-out
is done in file order, ignoring local LRU order at the end of the inactive
list. It did improve performance in a page-out-intensive micro-benchmark:

Average number of microseconds it takes to dirty 1GB of a
16-times-larger-than-RAM ext3 file mmapped in 1GB chunks:

without-patch:   average:    74188417.156250
               deviation:    10538258.613280

   with-patch:   average:    69449001.583333
               deviation:    12621756.615280

 > 
(Continue reading)
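
Nikita's benchmark itself is not included in the mail; a self-contained user-space
sketch of that kind of measurement might look like the following. The file name
"bigfile" is an assumption and the real test presumably controls RAM size and chunk
placement more carefully; the point is simply dirtying a much-larger-than-RAM
mmapped file one 1GB chunk at a time and averaging the per-chunk times.

/* Sketch only: time how long it takes to dirty each 1GB mmapped chunk
 * of a large file, then report the average. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

#define CHUNK (1UL << 30)			/* dirty the file 1GB at a time */

int main(void)
{
	int fd = open("bigfile", O_RDWR);	/* assumed: ext3 file ~16x RAM */
	long page = sysconf(_SC_PAGESIZE);
	struct stat st;
	double total_us = 0.0;
	long chunks = 0;
	off_t off;

	if (fd < 0 || fstat(fd, &st) < 0) {
		perror("bigfile");
		return 1;
	}

	for (off = 0; off + (off_t)CHUNK <= st.st_size; off += CHUNK) {
		struct timeval t0, t1;
		size_t i;
		char *p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, off);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		gettimeofday(&t0, NULL);
		for (i = 0; i < CHUNK; i += page)
			p[i] = 1;		/* touch one byte per page */
		gettimeofday(&t1, NULL);
		munmap(p, CHUNK);
		total_us += (t1.tv_sec - t0.tv_sec) * 1e6 +
			    (t1.tv_usec - t0.tv_usec);
		chunks++;
	}
	printf("average usec to dirty 1GB: %f\n",
	       chunks ? total_us / chunks : 0.0);
	return 0;
}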

Cliff White | 1 Dec 19:28 2004

Re: Automated performance testing system was Re: Text form for STP tests

> Linux-MM fellows,
> 
> I've been talking to Cliff about the need for a set of benchmarks,
> covering as many different workloads as possible, for developers to have a
> better notion of the performance impact of changes.
> 
> Usually when one makes a change which affects performance, he/she runs one
> or two benchmarks on a limited number of hardware configurations.
> This is a very painful, boring and time-consuming process, which can
> result in misinterpretation and/or limited understanding of the results
> of such changes.
> 
> It is important to automate this process, with a set of benchmarks
> covering as wide a range of workloads as possible, running on common
> and widely used hardware variations.
> 
> OSDL's STP provides the base framework for this.
> 
[ snip ]
> bonnie++
> reaim (default, new_fserver, shared)
> dbench_long
> kernbench
> tiobench
> 
> Each of these running on the following combinations:
> 
> 1CPU, 2CPU, 4CPU, 8CPU (4 variants).
> 
> total memory, half memory, a quarter of total memory (3 variants).
(Continue reading)

Marcelo Tosatti | 1 Dec 14:16 2004

Re: Automated performance testing system was Re: Text form for STP tests

On Wed, Dec 01, 2004 at 10:28:24AM -0800, Cliff White wrote:
> > Linux-MM fellows,
> > 
> > I've been talking to Cliff about the need for a set of benchmarks,
> > covering as many different workloads as possible, for developers to have a
> > better notion of the performance impact of changes.
> > 
> > Usually when one makes a change which affects performance, he/she runs one
> > or two benchmarks on a limited number of hardware configurations.
> > This is a very painful, boring and time-consuming process, which can
> > result in misinterpretation and/or limited understanding of the results
> > of such changes.
> > 
> > It is important to automate this process, with a set of benchmarks
> > covering as wide a range of workloads as possible, running on common
> > and widely used hardware variations.
> > 
> > OSDL's STP provides the base framework for this.
> > 
> [ snip ]
> > bonnie++
> > reaim (default, new_fserver, shared)
> > dbench_long
> > kernbench
> > tiobench
> > 
> > Each of these running on the following combinations:
> > 
> > 1CPU, 2CPU, 4CPU, 8CPU (4 variants).
> > 
(Continue reading)

Cliff White | 1 Dec 21:04 2004

Re: Automated performance testing system was Re: Text form for STP tests

> On Wed, Dec 01, 2004 at 10:28:24AM -0800, Cliff White wrote:
> > > Linux-MM fellows,
> > > 
> > > I've been talking to Cliff about the need for a set of benchmarks,
> > > covering as many different workloads as possible, for developers to have a
> > > better notion of the performance impact of changes.
> > > 
[snip]
> > robots are running this test series against linux-2.6.7 (for historical data).
> > There will need to be some adjustments - some of these tests will no doubt
> > fail for reasons of script error or configuration (I can see already that kernbench
> > will have to be reduced for 1-CPU systems, as it runs > 13.5 hours :( )
> > 
> > And, the second part of the automation is already done, but needs input.
> > I can aim this test battery at any kernel patch, where 'any kernel patch'
> > is identified by a regexp. What kernels do you want this against? 
> 
> The most recent 2.6.10-rc2 and 2.6.10-rc2-mm in STP.
> 
> Will this be available through the web interface? 

Yes, the results should be visible. If something looks wrong, email.
The 'advanced search' bit needs some test-specific fixes, and may not work
for all tests - some of the kits still need some patching.
cliffw
> 
> Thanks again Cliff
> 
> > I've heard mention of 'baseline' - we call this baseline:
> > 
(Continue reading)

Christoph Lameter | 2 Dec 00:41 2004

page fault scalability patch V12 [0/7]: Overview and performance tests

Changes from V11->V12 of this patch:
- dump sloppy_rss in favor of list_rss (Linus' proposal)
- keep up with the current Linus tree (patch is based on 2.6.10-rc2-bk14)

This is a series of patches that increases the scalability of
the page fault handler for SMP. Here are some performance results
on a machine with 512 processors, allocating 32 GB with an increasing
number of threads (each assigned its own processor).

Without the patches:
Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 32   3    1    1.416s    138.165s 139.050s 45073.831  45097.498
 32   3    2    1.397s    148.523s  78.044s 41965.149  80201.646
 32   3    4    1.390s    152.618s  44.044s 40851.258 141545.239
 32   3    8    1.500s    374.008s  53.001s 16754.519 118671.950
 32   3   16    1.415s   1051.759s  73.094s  5973.803  85087.358
 32   3   32    1.867s   3400.417s 117.003s  1849.186  53754.928
 32   3   64    5.361s  11633.040s 197.034s   540.577  31881.112
 32   3  128   23.387s  39386.390s 332.055s   159.642  18918.599
 32   3  256   15.409s  20031.450s 168.095s   313.837  37237.918
 32   3  512   18.720s  10338.511s  86.047s   607.446  72752.686

With the patches:
 Gb Rep Threads   User      System     Wall flt/cpu/s fault/wsec
 32   3    1    1.451s    140.151s 141.060s 44430.367  44428.115
 32   3    2    1.399s    136.349s  73.041s 45673.303  85699.793
 32   3    4    1.321s    129.760s  39.027s 47996.303 160197.217
 32   3    8    1.279s    100.648s  20.039s 61724.641 308454.557
 32   3   16    1.414s    153.975s  15.090s 40488.236 395681.716
 32   3   32    2.534s    337.021s  17.016s 18528.487 366445.400
(Continue reading)
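
The table columns are easier to read with the shape of the test in mind: N threads
each first-touch an equal slice of one large anonymous mapping, so every touch is a
page fault in the same mm, and fault/wsec is total faults divided by wall time. The
following is only a rough user-space sketch under those assumptions; Christoph's
actual test program and its CPU pinning are not shown in this thread.

/* Sketch only: N threads fault in equal slices of a shared anonymous
 * mapping.  The 32 GB size and lack of explicit CPU pinning are
 * simplifications of the test described above. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

#define SIZE	(32UL << 30)		/* 32 GB, as in the table */

static char *region;
static long page_size;
static long nthreads;

static void *toucher(void *arg)
{
	long id = (long)arg;
	size_t slice = SIZE / nthreads;
	char *p = region + (size_t)id * slice;
	size_t i;

	for (i = 0; i < slice; i += page_size)
		p[i] = 1;		/* first touch => anonymous page fault */
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[512];
	struct timeval t0, t1;
	double wall;
	long i;

	nthreads = argc > 1 ? atol(argv[1]) : 1;
	if (nthreads < 1)
		nthreads = 1;
	if (nthreads > 512)
		nthreads = 512;
	page_size = sysconf(_SC_PAGESIZE);
	region = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	gettimeofday(&t0, NULL);
	for (i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, toucher, (void *)i);
	for (i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	gettimeofday(&t1, NULL);

	wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%ld threads: %.0f faults/sec of wall time\n", nthreads,
	       (SIZE / page_size) / wall);
	return 0;
}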

Christoph Lameter | 2 Dec 00:42 2004

page fault scalability patch V12 [1/7]: Reduce use of the page_table_lock

Changelog
        * Increase parallelism in SMP configurations by deferring
          the acquisition of page_table_lock in handle_mm_fault
        * Anonymous memory page faults bypass the page_table_lock
          through the use of atomic page table operations
        * Swapper does not set pte to empty in transition to swap
        * Simulate atomic page table operations using the
          page_table_lock if an arch does not define
          __HAVE_ARCH_ATOMIC_TABLE_OPS. This still provides
          a performance benefit since the page_table_lock
          is held for shorter periods of time.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9/mm/memory.c
===================================================================
--- linux-2.6.9.orig/mm/memory.c	2004-11-23 10:06:03.000000000 -0800
+++ linux-2.6.9/mm/memory.c	2004-11-23 10:07:55.000000000 -0800
@@ -1330,8 +1330,7 @@
 }

 /*
- * We hold the mm semaphore and the page_table_lock on entry and
- * should release the pagetable lock on exit..
+ * We hold the mm semaphore
  */
 static int do_swap_page(struct mm_struct * mm,
 	struct vm_area_struct * vma, unsigned long address,
@@ -1343,15 +1342,13 @@
 	int ret = VM_FAULT_MINOR;
(Continue reading)
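
The last changelog point deserves a concrete illustration: when an architecture
does not provide real atomic page table operations, the same interface can be
emulated by taking the page_table_lock around a compare-and-set, which is still a
win because the lock is held only for that short window. The sketch below is a
hedged guess at such a generic fallback; the name ptep_cmpxchg and its signature
are assumed from the rest of the series, not copied from the patch.

/* Sketch of a generic fallback, not the code from the patch: emulate an
 * atomic pte compare-and-exchange by holding page_table_lock briefly. */
#ifndef __HAVE_ARCH_ATOMIC_TABLE_OPS
static inline int ptep_cmpxchg(struct vm_area_struct *vma,
			       unsigned long address, pte_t *ptep,
			       pte_t oldval, pte_t newval)
{
	struct mm_struct *mm = vma->vm_mm;
	int ret = 0;

	spin_lock(&mm->page_table_lock);
	if (pte_same(*ptep, oldval)) {	/* nobody changed the pte under us */
		set_pte(ptep, newval);
		ret = 1;
	}
	spin_unlock(&mm->page_table_lock);
	return ret;			/* 1 = exchange done, 0 = lost the race */
}
#endif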

Christoph Lameter | 2 Dec 00:42 2004

page fault scalability patch V12 [2/7]: atomic pte operations for ia64

Changelog
        * Provide atomic pte operations for ia64
        * Enhanced parallelism in page fault handler if applied together
          with the generic patch

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9/include/asm-ia64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-ia64/pgalloc.h	2004-10-18 14:53:06.000000000 -0700
+++ linux-2.6.9/include/asm-ia64/pgalloc.h	2004-11-19 07:54:19.000000000 -0800
@@ -34,6 +34,10 @@
 #define pmd_quicklist		(local_cpu_data->pmd_quick)
 #define pgtable_cache_size	(local_cpu_data->pgtable_cache_sz)

+/* Empty entries of PMD and PGD */
+#define PMD_NONE       0
+#define PGD_NONE       0
+
 static inline pgd_t*
 pgd_alloc_one_fast (struct mm_struct *mm)
 {
@@ -78,12 +82,19 @@
 	preempt_enable();
 }

+
 static inline void
 pgd_populate (struct mm_struct *mm, pgd_t *pgd_entry, pmd_t *pmd)
 {
(Continue reading)
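
The PMD_NONE/PGD_NONE values and the *_test_and_populate operations this patch adds
(the x86_64 version later in the series shows them spelled out with cmpxchg) exist
so that the fault path can install a missing page table level without holding
page_table_lock: allocate, try to install the entry atomically, and free the
allocation if another CPU got there first. A hypothetical sketch of that usage
pattern follows; it is not code from the series.

/* Hypothetical caller of pgd_test_and_populate(): install a pmd page
 * without taking page_table_lock, tolerating a lost race. */
static pmd_t *pmd_alloc_lockless(struct mm_struct *mm, pgd_t *pgd,
				 unsigned long address)
{
	if (pgd_none(*pgd)) {
		pmd_t *new = pmd_alloc_one(mm, address);

		if (!new)
			return NULL;
		/* Atomically replace the empty entry; if this fails,
		 * another CPU populated the pgd first. */
		if (!pgd_test_and_populate(mm, pgd, new))
			pmd_free(new);
	}
	return pmd_offset(pgd, address);
}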

Christoph Lameter | 2 Dec 00:43 2004

page fault scalability patch V12 [3/7]: universal cmpxchg for i386

Changelog
        * Make cmpxchg and cmpxchg8b generally available on the i386
	  platform.
        * Provide emulation of cmpxchg suitable for uniprocessor if
          built and run on a 386.
        * Provide emulation of cmpxchg8b suitable for uniprocessor
          systems if built and run on a 386 or 486.
	* Provide an inline function to atomically get a 64 bit value
	  via cmpxchg8b in an SMP system (courtesy of Nick Piggin)
	  (important for i386 PAE mode and other places where atomic
	  64 bit operations are useful)

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9/arch/i386/Kconfig
===================================================================
--- linux-2.6.9.orig/arch/i386/Kconfig	2004-11-15 11:13:34.000000000 -0800
+++ linux-2.6.9/arch/i386/Kconfig	2004-11-19 10:02:54.000000000 -0800
@@ -351,6 +351,11 @@
 	depends on !M386
 	default y

+config X86_CMPXCHG8B
+	bool
+	depends on !M386 && !M486
+	default y
+
 config X86_XADD
 	bool
 	depends on !M386
(Continue reading)
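
The reasoning behind the uniprocessor emulation mentioned above: a 386 (or a 486,
for cmpxchg8b) has no suitable instruction, but on a UP kernel nothing can
interleave with the operation once interrupts are disabled, so a plain
load-compare-store inside local_irq_save/restore has the same semantics. A sketch
of that pattern, not the actual code added by the patch:

/* Sketch, not the patch: UP-only cmpxchg emulation.  With interrupts
 * disabled, no other context on a uniprocessor can touch *ptr between
 * the load and the conditional store. */
static inline unsigned long emulated_cmpxchg(volatile unsigned long *ptr,
					     unsigned long old,
					     unsigned long new)
{
	unsigned long prev, flags;

	local_irq_save(flags);
	prev = *ptr;
	if (prev == old)
		*ptr = new;
	local_irq_restore(flags);
	return prev;		/* success iff the return value equals 'old' */
}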

Christoph Lameter | 2 Dec 00:44 2004

page fault scalability patch V12 [5/7]: atomic pte operations for x86_64

Changelog
        * Provide atomic pte operations for x86_64

Signed-off-by: Christoph Lameter <clameter@sgi.com>

Index: linux-2.6.9/include/asm-x86_64/pgalloc.h
===================================================================
--- linux-2.6.9.orig/include/asm-x86_64/pgalloc.h	2004-10-18 14:54:30.000000000 -0700
+++ linux-2.6.9/include/asm-x86_64/pgalloc.h	2004-11-23 10:59:01.000000000 -0800
@@ -7,16 +7,26 @@
 #include <linux/threads.h>
 #include <linux/mm.h>

+#define PMD_NONE 0
+#define PGD_NONE 0
+
 #define pmd_populate_kernel(mm, pmd, pte) \
 		set_pmd(pmd, __pmd(_PAGE_TABLE | __pa(pte)))
 #define pgd_populate(mm, pgd, pmd) \
 		set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(pmd)))
+#define pgd_test_and_populate(mm, pgd, pmd) \
+		(cmpxchg((int *)pgd, PGD_NONE, _PAGE_TABLE | __pa(pmd)) == PGD_NONE)

 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
 {
 	set_pmd(pmd, __pmd(_PAGE_TABLE | (page_to_pfn(pte) << PAGE_SHIFT)));
 }

+static inline int pmd_test_and_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
(Continue reading)

