Nick Piggin | 1 May 2012 05:13

Re: [RFC] vmalloc: add warning in __vmalloc

On 27 April 2012 20:36, David Rientjes <rientjes <at> google.com> wrote:
> On Fri, 27 Apr 2012, Minchan Kim wrote:
>
>> There are now several places that call __vmalloc with GFP_ATOMIC,
>> GFP_NOIO or GFP_NOFS, but unfortunately __vmalloc calls map_vm_area,
>> which calls alloc_pages with GFP_KERNEL to allocate page tables.
>> That means a deadlock is possible.
>> I don't know why it hasn't been reported until now.
>>
>> First, I tried passing gfp_t down to the lower functions to support
>> __vmalloc with such flags, but other mm guys didn't want that and
>> decided that all of the callers should be fixed instead.
>>
>> http://marc.info/?l=linux-kernel&m=133517143616544&w=2
>>
>> To begin with, let's hear from others whether they can fix it
>> with another approach that avoids calling __vmalloc with such flags.
>>
>> So this patch adds a warning to detect such callers so that, hopefully,
>> they get fixed. I Cc'ed the related maintainers.
>> If I missed someone, please Cc them.
>>
>> side-note:
>>   I used WARN_ON instead of WARN_ONCE to detect all of the callers,
>>   with a separate WARN_ON per flag so it is easy to see which flag is used.
>>   After we fix all of the callers, or reduce their number, we can merge
>>   a warning that uses WARN_ONCE.
>>
>
> I disagree with this approach since it's going to violently spam an
[...]

Nick Piggin | 1 May 2012 05:34

Re: [RFC PATCH] do_try_to_free_pages() might enter infinite loop

On 25 April 2012 04:37, Ying Han <yinghan <at> google.com> wrote:
> On Mon, Apr 23, 2012 at 10:36 PM, Nick Piggin <npiggin <at> gmail.com> wrote:
>> On 24 April 2012 06:56, Ying Han <yinghan <at> google.com> wrote:
>>> This patch is not meant to be merged at all; it is an attempt to understand
>>> some logic in global direct reclaim.
>>>
>>> In global direct reclaim, when reclaim fails at priority 0 and
>>> zone->all_unreclaimable is not set, direct reclaim starts over from
>>> DEF_PRIORITY. In some extreme cases we've seen the system hang, which is
>>> very likely caused by direct reclaim entering an infinite loop.
>>
>> Very likely, or definitely? Can you reproduce it? What workload?
>
> No, we don't have a workload that reproduces it yet. Everything is based
> on the watchdog dump file :(
>
>>
>>>
>>> There have been serious patches trying to fix similar issues, and the latest
>>> patch has a good summary of all the efforts:
>>>
>>> commit 929bea7c714220fc76ce3f75bef9056477c28e74
>>> Author: KOSAKI Motohiro <kosaki.motohiro <at> jp.fujitsu.com>
>>> Date:   Thu Apr 14 15:22:12 2011 -0700
>>>
>>>    vmscan: all_unreclaimable() use zone->all_unreclaimable as a name
>>>
>>> Kosaki explained the problem triggered by async zone->all_unreclaimable and
>>> zone->pages_scanned, where the latter was being checked by direct reclaim.
>>> However, after the patch, the problem remains where the setting of
[...]

Michael Kerrisk (man-pages) | 1 May 2012 07:50

Re: [PATCH] Describe race of direct read and fork for unaligned buffers

Jan,

On Mon, Apr 30, 2012 at 9:30 PM, Jan Kara <jack <at> suse.cz> wrote:
> This is a long standing problem (or a surprising feature) in our implementation
> of get_user_pages() (used by direct IO). Since several attempts to fix it
> failed (e.g.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg06542.html, or
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html refused in
> http://comments.gmane.org/gmane.linux.kernel.mm/31569) and it's not completely
> clear whether we really want to fix it given the costs, let's at least document
> it.
>
> CC: mgorman <at> suse.de
> CC: Jeff Moyer <jmoyer <at> redhat.com>
> Signed-off-by: Jan Kara <jack <at> suse.cz>
> ---
>
> --- a/man2/open.2       2012-04-27 00:07:51.736883092 +0200
> +++ b/man2/open.2       2012-04-27 00:29:59.489892980 +0200
> @@ -769,7 +769,12 @@
>  and the file offset must all be multiples of the logical block size
>  of the file system.
>  Under Linux 2.6, alignment to 512-byte boundaries
> -suffices.
> +suffices. However, if the user buffer is not page aligned and direct read
> +runs in parallel with a
> +.BR fork (2)
> +of the reader process, it may happen that the read data is split between
> +pages owned by the original process and its child. Thus effectively read
> +data is corrupted.
[...]

Nick Piggin | 1 May 2012 08:49

Re: [PATCH] Describe race of direct read and fork for unaligned buffers

On 30 April 2012 19:30, Jan Kara <jack <at> suse.cz> wrote:
> This is a long standing problem (or a surprising feature) in our implementation
> of get_user_pages() (used by direct IO). Since several attempts to fix it
> failed (e.g.
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg06542.html, or
> http://lkml.indiana.edu/hypermail/linux/kernel/0903.1/01498.html refused in
> http://comments.gmane.org/gmane.linux.kernel.mm/31569) and it's not completely
> clear whether we really want to fix it given the costs, let's at least document
> it.

In any case, it should be documented even if it is ever fixed in newer
kernels. Thanks!
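
For illustration, a minimal user-space sketch of the workaround implied by the
man page text quoted above: give the direct-IO buffer pages of its own. The
helper name alloc_direct_io_buffer is hypothetical, and rounding the length up
to whole pages is an added assumption (so that the buffer's last page is not
shared with unrelated data); neither comes from the patch itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Allocate a buffer that starts on a page boundary and spans a whole
 * number of pages.  On success, returns the buffer and stores the
 * rounded-up size in *alloc_len; returns NULL on allocation failure.
 * Hypothetical helper, not part of any library.
 */
void *alloc_direct_io_buffer(size_t len, size_t *alloc_len)
{
	long page = sysconf(_SC_PAGESIZE);
	void *buf = NULL;
	size_t page_size, rounded;

	assert(page > 0 && len > 0 && alloc_len != NULL);
	page_size = (size_t)page;

	/* Round the requested length up to a multiple of the page size. */
	rounded = (len + page_size - 1) / page_size * page_size;

	/* posix_memalign() returns 0 on success, an error number otherwise. */
	if (posix_memalign(&buf, page_size, rounded) != 0)
		return NULL;

	*alloc_len = rounded;
	return buf;
}
```

The returned buffer would then be passed to read(2) on a descriptor opened
with O_DIRECT and released with free(). This only addresses the alignment
condition named in the quoted text; it does not make concurrent fork(2) safe
in general.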


Nick Piggin | 1 May 2012 09:20

Re: [RFC] propagate gfp_t to page table alloc functions

On 25 April 2012 07:30, Andrew Morton <akpm <at> linux-foundation.org> wrote:
> On Tue, 24 Apr 2012 17:48:29 +1000
> Nick Piggin <npiggin <at> gmail.com> wrote:
>
>> > Hmm, there are several places that use GFP_NOIO and GFP_NOFS, and even GFP_ATOMIC.
>> > I believe it's not trivial now.
>>
>> They're all buggy then. Unfortunately not through any real fault of their own.
>
> There are gruesome problems in block/blk-throttle.c (thread "mempool,
> percpu, blkcg: fix percpu stat allocation and remove stats_lock").  It
> wants to do an alloc_percpu()->vmalloc() from the IO submission path,
> under GFP_NOIO.

Yeah, that sucks. CFQ has something similar.

Should just allocate it up front when creating a throttled group.
Allocate-and-init-on-first-use schemes are usually pretty
problematic. Is it *really* warranted to do it lazily like this?

> Changing vmalloc() to take a gfp_t does make lots of sense, although I
> worry a bit about making vmalloc() easier to use!
>
> I do wonder whether the whole scheme of explicitly passing a gfp_t was
> a mistake and that the allocation context should be part of the task
> context.  ie: pass the allocation mode via *current.  As a handy
> side-effect that would probably save quite some code where functions
> are receiving a gfp_t arg then simply passing it on to the next
> callee.

[...]
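
A user-space sketch of the idea Andrew raises in the quoted text above:
keeping the allocation mode in per-task state instead of passing a gfp_t down
every call chain. Everything here is a hypothetical stand-in (the type
alloc_flags_t, the ALLOW_* bits, and the alloc_ctx_* functions are invented
for illustration), with a thread-local variable playing the role of *current.

```c
#include <assert.h>

typedef unsigned int alloc_flags_t;

/* Hypothetical permission bits, not the kernel's real GFP values. */
#define ALLOW_IO	0x1u
#define ALLOW_FS	0x2u
#define ALLOW_ALL	(ALLOW_IO | ALLOW_FS)

/* Per-thread set of permissions that are currently denied. */
static _Thread_local alloc_flags_t denied_flags;

/* Deny extra permissions for this thread; returns the previous state. */
alloc_flags_t alloc_ctx_save(alloc_flags_t deny)
{
	alloc_flags_t old = denied_flags;

	denied_flags |= deny;
	return old;
}

/* Restore the state returned by a matching alloc_ctx_save(). */
void alloc_ctx_restore(alloc_flags_t old)
{
	denied_flags = old;
}

/* What an allocator deep in the call chain would actually use. */
alloc_flags_t effective_flags(alloc_flags_t requested)
{
	return requested & ~denied_flags;
}
```

With this shape, a caller on an IO submission path would bracket its work with
alloc_ctx_save(ALLOW_IO) and alloc_ctx_restore(), and a callee that asks for
ALLOW_ALL would be quietly restricted without any signature changes in
between. Saving and restoring the old value keeps nested sections correct.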

Johannes Weiner | 1 May 2012 10:41

[patch 1/5] mm: readahead: move radix tree hole searching here

The readahead code searches the page cache for non-present pages, or
holes, to get a picture of the area surrounding a fault, for example.

For this it sufficed to rely on the radix tree definition of holes,
which is "empty tree slot".  This is about to change, though, because
shadow page descriptors will be stored in the page cache when the real
pages get evicted from memory.

Fortunately, nobody outside the readahead code uses these functions
and they have no internal knowledge of the radix tree structures, so
just move them over to mm/readahead.c where they can later be adapted
to handle the new definition of "page cache hole".

Signed-off-by: Johannes Weiner <hannes <at> cmpxchg.org>
---
 include/linux/radix-tree.h |    4 --
 lib/radix-tree.c           |   75 --------------------------------------------
 mm/readahead.c             |   36 ++++++++++++++++++++-
 3 files changed, 34 insertions(+), 81 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 07e360b..73e49c4 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -224,10 +224,6 @@ radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
 			void ***results, unsigned long *indices,
 			unsigned long first_index, unsigned int max_items);
-unsigned long radix_tree_next_hole(struct radix_tree_root *root,
-				unsigned long index, unsigned long max_scan);
[...]
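
To make the commit message above concrete, here is a simplified sketch of a
forward hole search. It assumes a flat array in place of the radix tree, and
the names next_hole and slot_is_hole are made up for this example; the return
convention (an index at least max_scan away when no hole is found) is also an
assumption of the sketch. The point is that the definition of "hole" lives in
one predicate that could later be changed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * The current definition of a hole: an empty slot.  This is the single
 * place that would change if non-page entries also had to count as holes.
 */
int slot_is_hole(const void *entry)
{
	return entry == NULL;
}

/*
 * Scan at most max_scan slots starting at index and return the index of
 * the first hole.  If no hole is found in that range, return
 * index + max_scan, so the caller can detect the miss by checking
 * (result - index >= max_scan).
 */
unsigned long next_hole(void *const *slots, unsigned long nr_slots,
			unsigned long index, unsigned long max_scan)
{
	unsigned long i;

	assert(slots != NULL);
	for (i = 0; i < max_scan; i++) {
		unsigned long pos = index + i;

		/* Anything past the end of the array counts as empty. */
		if (pos >= nr_slots || slot_is_hole(slots[pos]))
			return pos;
	}
	return index + max_scan;
}
```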

Johannes Weiner | 1 May 2012 10:41

[patch 0/5] refault distance-based file cache sizing

Hi,

our file data caching implementation is done by having an inactive
list of pages that have yet to prove worth keeping around and an
active list of pages that already did.  The question is how to balance
those two lists against each other.

On one hand, the space for inactive pages needs to be big enough so
that they have the necessary time in memory to gather the references
required for an activation.  On the other hand, we want an active list
big enough to hold all data that is frequently used, if possible, to
protect it from streams of less frequently used/once used pages.

Our current balancing ("active can't grow larger than inactive") does
not really work too well.  We have people complaining that the working
set is not well protected from used-once file cache, and other people
complaining that we don't adapt to changes in the working set and
protect stale pages in other cases.

This series stores file cache eviction information in the vacated page
cache radix tree slots and uses it on refault to see if the pages
currently on the active list need to have their status challenged.

A fully activated file set that occupies 85% of memory is successfully
detected as stale when another file set of equal size is accessed
a few times (4-5).  The old kernel would never adapt to the second
one.  If the new set is bigger than memory, the old set is left
untouched, where the old kernel would shrink the old set to half of
memory and leave it at that.  Tested on a multi-zone single-node
machine.
[...]
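
A rough sketch of what "eviction information used on refault" could look
like, under stated assumptions. The excerpt above does not spell out the
bookkeeping, so the following is a guess for illustration only: an eviction
counter is snapshotted into the vacated slot, the refault distance is the
number of evictions since that snapshot, and active pages are challenged when
that distance does not exceed the active list size. The struct and function
names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct file_lru {
	unsigned long evictions;	/* running count of evicted pages */
	unsigned long nr_active;	/* pages currently on the active list */
};

/* On eviction: return the value to remember in the vacated slot. */
unsigned long record_eviction(struct file_lru *lru)
{
	return lru->evictions++;
}

/* On refault: how many evictions happened since this page was evicted. */
unsigned long refault_distance(const struct file_lru *lru,
			       unsigned long shadow)
{
	/* Unsigned subtraction stays correct across counter wraparound. */
	return lru->evictions - shadow;
}

/*
 * Assumed policy: if the page would have survived had the inactive list
 * been larger by the size of the active list, the active pages are worth
 * challenging.
 */
bool should_challenge_active(const struct file_lru *lru,
			     unsigned long shadow)
{
	return refault_distance(lru, shadow) <= lru->nr_active;
}
```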

Johannes Weiner | 1 May 2012 10:41

[patch 3/5] mm + fs: store shadow pages in page cache

Reclaim will be leaving shadow entries in the page cache radix tree
upon evicting the real page.  As those pages are found from the LRU,
an iput() can lead to the inode being freed concurrently.  At this
point, reclaim must no longer install shadow pages because the inode
freeing code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code
sets under the tree lock before doing the final truncate.  Reclaim
will check for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner <hannes <at> cmpxchg.org>
---
 fs/inode.c              |    4 ++++
 include/linux/pagemap.h |   13 ++++++++++++-
 mm/filemap.c            |   14 ++++++++++----
 mm/truncate.c           |    2 +-
 mm/vmscan.c             |    2 +-
 5 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 645731f..9be6bac 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -541,6 +541,10 @@ static void evict(struct inode *inode)

 	inode_sb_list_del(inode);

+	spin_lock_irq(&inode->i_data.tree_lock);
+	mapping_set_exiting(&inode->i_data);
+	spin_unlock_irq(&inode->i_data.tree_lock);
[...]
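
The ordering described in the commit message above (set an "exiting" flag
under the tree lock, then truncate; reclaim checks the flag under the same
lock before installing a shadow entry) can be modelled in user space roughly
as follows. This is a sketch only: a pthread mutex stands in for the tree
lock, a single slot stands in for the page tree, and struct mapping,
evict_mapping and reclaim_page are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct mapping {
	pthread_mutex_t tree_lock;
	bool exiting;		/* plays the role of AS_EXITING */
	void *slot;		/* one slot standing in for the page tree */
};

void mapping_init(struct mapping *m, void *page)
{
	pthread_mutex_init(&m->tree_lock, NULL);
	m->exiting = false;
	m->slot = page;
}

/* Inode teardown: mark the mapping as exiting, then empty the tree. */
void evict_mapping(struct mapping *m)
{
	pthread_mutex_lock(&m->tree_lock);
	m->exiting = true;
	pthread_mutex_unlock(&m->tree_lock);

	/* Final truncate: after this the tree must really be empty. */
	pthread_mutex_lock(&m->tree_lock);
	m->slot = NULL;
	pthread_mutex_unlock(&m->tree_lock);
}

/*
 * Reclaim: replace the page with a shadow entry unless the mapping is
 * exiting, in which case the slot is simply cleared.  Returns true if
 * the shadow entry was installed.
 */
bool reclaim_page(struct mapping *m, void *shadow)
{
	bool installed;

	assert(shadow != NULL);
	pthread_mutex_lock(&m->tree_lock);
	installed = !m->exiting;
	m->slot = installed ? shadow : NULL;
	pthread_mutex_unlock(&m->tree_lock);
	return installed;
}
```

Because the flag is both set and tested under the same lock, a reclaimer
either installs its shadow entry before the flag is set (and the final
truncate removes it) or sees the flag and installs nothing.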

Johannes Weiner | 1 May 2012 10:41

[patch 2/5] mm + fs: prepare for non-page entries in page cache

shmem mappings already contain exceptional entries where swap slot
information is remembered.

To be able to store eviction information for regular page cache,
prepare every site dealing with the radix trees directly to handle
non-page entries.

The common lookup functions will filter out non-page entries and
return NULL for page cache holes, just as before.  But provide a raw
version of the API which returns non-page entries as well, and switch
shmem over to use it.

Signed-off-by: Johannes Weiner <hannes <at> cmpxchg.org>
---
 fs/btrfs/compression.c   |    2 +-
 fs/btrfs/extent_io.c     |    3 +-
 fs/inode.c               |    3 +-
 fs/nilfs2/inode.c        |    6 +--
 include/linux/mm.h       |    8 +++
 include/linux/pagemap.h  |   13 +++--
 include/linux/pagevec.h  |    3 +
 include/linux/shmem_fs.h |    1 +
 mm/filemap.c             |  124 +++++++++++++++++++++++++++++++++++++++-------
 mm/mincore.c             |   20 +++++--
 mm/readahead.c           |   12 +++-
 mm/shmem.c               |   89 +++++----------------------------
 mm/swap.c                |   21 ++++++++
 mm/truncate.c            |   71 +++++++++++++++++++++------
 14 files changed, 244 insertions(+), 132 deletions(-)

[...]
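
As an illustration of the split described in the commit message above (a
common lookup that hides non-page entries, plus a raw variant that returns
them), here is a simplified sketch. It assumes a flat slot array and a
low-bit pointer tag to mark non-page entries; the tag value and the names
find_entry, find_page, make_shadow and shadow_info are all hypothetical and
not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag: low pointer bit set means "not a page". */
#define EXCEPTIONAL_TAG	((uintptr_t)1)

struct page {
	unsigned long index;
};

/* Pack some information into a tagged, non-page entry. */
void *make_shadow(unsigned long info)
{
	return (void *)(((uintptr_t)info << 1) | EXCEPTIONAL_TAG);
}

int entry_is_exceptional(const void *entry)
{
	return ((uintptr_t)entry & EXCEPTIONAL_TAG) != 0;
}

unsigned long shadow_info(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 1);
}

/* Raw lookup: returns whatever is in the slot, page or not. */
void *find_entry(void *const *slots, unsigned long nr_slots,
		 unsigned long index)
{
	return index < nr_slots ? slots[index] : NULL;
}

/* Common lookup: non-page entries look like holes, as before. */
struct page *find_page(void *const *slots, unsigned long nr_slots,
		       unsigned long index)
{
	void *entry = find_entry(slots, nr_slots, index);

	return entry_is_exceptional(entry) ? NULL : entry;
}
```

The tag relies on real page structures being at least 2-byte aligned, so the
low bit of a genuine pointer is always clear.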

Johannes Weiner | 1 May 2012 10:41

[patch 4/5] mm + fs: provide refault distance to page cache instantiations

The page allocator needs to be given the non-residency information
stored in the page cache at the time the page is faulted back in.

Every site that does a find_or_create()-style allocation is converted
to pass this refault information to the page_cache_alloc() family of
functions, which in turn passes it down to the page allocator via
current->refault_distance.

XXX: Pages are charged to memory cgroups only when being added to the
page cache, and, in case of multi-page reads, allocation and addition
happen in separate batches.  To communicate the individual refault
distance to the memory controller, the refault distance is stored in
page->private between allocation and add_to_page_cache().  memcg does
not do anything with it yet, but it will use it when charging
requires reclaiming (hard limit reclaim).

Signed-off-by: Johannes Weiner <hannes <at> cmpxchg.org>
---
 fs/btrfs/compression.c  |    8 +++--
 fs/cachefiles/rdwr.c    |   26 +++++++++-----
 fs/ceph/xattr.c         |    2 +-
 fs/logfs/readwrite.c    |    9 +++--
 fs/ntfs/file.c          |   11 ++++--
 fs/splice.c             |   10 +++--
 include/linux/pagemap.h |   28 ++++++++++-----
 include/linux/sched.h   |    1 +
 include/linux/swap.h    |    6 +++
 mm/filemap.c            |   84 ++++++++++++++++++++++++++++++++---------------
 mm/readahead.c          |    7 +++-
 net/ceph/messenger.c    |    2 +-
[...]

