1 Aug 2012 01:13
Re: [PATCH v2] list corruption by gather_surp
Cliff Wickman <cpw <at> sgi.com>
2012-07-31 23:13:06 GMT
On Mon, Jul 30, 2012 at 02:22:24PM +0200, Michal Hocko wrote:
> On Fri 27-07-12 17:32:15, Cliff Wickman wrote:
> > From: Cliff Wickman <cpw <at> sgi.com>
> >
> > v2: diff'd against linux-next
> >
> > I am seeing list corruption occurring from within gather_surplus_pages()
> > (mm/hugetlb.c). The problem occurs in a RHEL6 kernel under a heavy load,
> > and seems to be because this function drops the hugetlb_lock.
> > The list_add() in gather_surplus_pages() seems to need to be protected by
> > the lock.
> > (I don't have a similar test for a linux-next kernel)
>
> Because you cannot reproduce or you just didn't test it with linux-next?
>
> > I have CONFIG_DEBUG_LIST=y, and am running an MPI application with 64 threads
> > and a library that creates a large heap of hugetlbfs pages for it.
> >
> > The below patch fixes the problem.
> > The gist of this patch is that gather_surplus_pages() does not have to drop
>
> But you cannot hold spinlock while allocating memory because the
> allocation is not atomic and you could deadlock easily.
>
> > the lock if alloc_buddy_huge_page() is told whether the lock is already held.
>
> The changelog doesn't actually explain how does the list gets corrupted.
> alloc_buddy_huge_page doesn't provide the freshly allocated page to use
> so nobody could get and free it. enqueue_huge_page happens under hugetlb_lock.
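
The two constraints raised above (list updates need the lock, but a blocking allocation must not run under a spinlock) can be illustrated with a small userspace analogy. This is only a sketch under stated assumptions: it is not the kernel code and not the patch under discussion. `struct page_node`, `pool_lock`, `gather_pages()` and `pool_size()` are hypothetical names, a pthread mutex stands in for hugetlb_lock, and malloc() stands in for the page allocator. Allocation happens with the lock dropped and touches only a private list; the shared list is modified only while the lock is held.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a huge page; not the kernel's struct page. */
struct page_node {
    struct page_node *next;
};

/* Shared pool, protected by pool_lock (a stand-in for hugetlb_lock). */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static struct page_node *pool_head;
static int pool_count;

/*
 * Add `needed` nodes to the shared pool.
 * Returns 0 on success, -1 if an allocation failed (pool left unchanged).
 */
int gather_pages(int needed)
{
    struct page_node *private_head = NULL;
    struct page_node *node;
    int i;

    assert(needed >= 0);

    /* Phase 1: allocate with the lock dropped. malloc() may block,
     * so only the caller's private list is touched here. */
    for (i = 0; i < needed; i++) {
        node = malloc(sizeof(*node));
        if (!node)
            goto fail;
        node->next = private_head;
        private_head = node;
    }

    /* Phase 2: every update to the shared list happens under the lock. */
    pthread_mutex_lock(&pool_lock);
    while (private_head) {
        node = private_head;
        private_head = node->next;
        node->next = pool_head;
        pool_head = node;
        pool_count++;
    }
    pthread_mutex_unlock(&pool_lock);
    return 0;

fail:
    /* Nothing was published yet, so the private list can be freed unlocked. */
    while (private_head) {
        node = private_head;
        private_head = node->next;
        free(node);
    }
    return -1;
}

/* Read the pool size under the same lock that protects the list. */
int pool_size(void)
{
    int n;

    pthread_mutex_lock(&pool_lock);
    n = pool_count;
    pthread_mutex_unlock(&pool_lock);
    return n;
}
```

With this shape the allocator never needs to be told whether the lock is held, since it is never called with the lock held. Whether the reported corruption actually comes from an unlocked update of a shared list is exactly the question the reply leaves open.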