Konstantin Belousov | 3 Feb 20:37 2012

Prefaulting for i/o buffers

FreeBSD's I/O infrastructure has a well-known deadlock issue, caused by a
vnode lock order reversal when the buffers supplied to the read(2) or
write(2) syscalls are backed by an mmapped file.

I previously published patches to convert the i/o path to use VMIO,
based on a proposal by Jeff Roberson; see
http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
deadlock. Since that work is very intrusive and did not get any
follow-up, it stalled.

Below is a very lightweight patch whose only goal is to fix the deadlock
in the least intrusive way. This became possible after FreeBSD gained the
vm_fault_quick_hold_pages(9) and vm_fault_disable_pagefaults(9) KPIs:
http://people.freebsd.org/~kib/misc/vm1.3.patch

The theory of operation is described in the patched sys/kern/vfs_vnops.c;
see the preamble comment for vn_io_fault(). The patch borrows the
rangelocks implementation from VM6, which was discussed and improved
together with Attilio Rao.

I was not able to reproduce the deadlock in the targeted test running
for several hours, while stock HEAD deadlocks in the first iteration.

Below is a benchmark of the worst-case situation for the patched system:
reading 1 byte from a file in a loop. The value is the time in seconds to
execute a read(2) of a single byte and an lseek(2) back to the start of
the file; the loop is executed 100,000,000 times. The machine has a
3.4 GHz Core i7 2600K and ran HEAD@230866 with debugging options
turned off.

[...]

Konstantin Belousov | 3 Feb 20:47 2012

Re: Prefaulting for i/o buffers

On Fri, Feb 03, 2012 at 07:40:37PM +0000, Attilio Rao wrote:
> Do you have an ETA for reviews? When do you plan to commit this?
Obviously, I cannot give an ETA for reviews, and I am not sure that
I will get any review at all (as usual). I think it is reasonable to
commit 2-3 weeks after the first post.

> it would be valuable to get a grasp on the benchmark and refine the
> performance difference as much as possible.
The benchmark is trivial and available at 
http://people.freebsd.org/~kib/misc/bench_read.c

There is also a targeted test for uio functionality
http://people.freebsd.org/~kib/misc/uio.c
which I used together with an intensive swapping workload to verify
the correctness of the patch.

Regarding the performance hit, I consider the < 10% in the worst case a
reasonable cost for finally fixing this issue. If anybody can provide
benchmark results for, e.g., the postgresql tests, I would be very grateful.

Attilio Rao | 3 Feb 20:40 2012

Re: Prefaulting for i/o buffers

2012/2/3 Konstantin Belousov <kostikbel@gmail.com>:
> FreeBSD's I/O infrastructure has a well-known deadlock issue, caused by a
> vnode lock order reversal when the buffers supplied to the read(2) or
> write(2) syscalls are backed by an mmapped file.
> [...]
Do you have an ETA for reviews? When do you plan to commit this?

It would be valuable to get a grasp on the benchmark and refine the
performance difference as much as possible.

Dag-Erling Smørgrav | 4 Feb 16:29 2012

Re: Prefaulting for i/o buffers

Konstantin Belousov <kostikbel@gmail.com> writes:
> I previously published patches to convert the i/o path to use VMIO,
> based on a proposal by Jeff Roberson; see
> http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
> deadlock. Since that work is very intrusive and did not get any
> follow-up, it stalled.

If it works, and you commit it now, we have at least a year to shake it
down before the 10.0 release cycle...

DES
-- 
Dag-Erling Smørgrav - des@des.no
_______________________________________________
freebsd-arch@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-arch
To unsubscribe, send any mail to "freebsd-arch-unsubscribe@freebsd.org"

Konstantin Belousov | 4 Feb 17:19 2012

Re: Prefaulting for i/o buffers

On Sat, Feb 04, 2012 at 04:29:45PM +0100, Dag-Erling Smørgrav wrote:
> Konstantin Belousov <kostikbel@gmail.com> writes:
> > I previously published patches to convert the i/o path to use VMIO,
> > based on a proposal by Jeff Roberson; see
> > http://wiki.freebsd.org/VM6. As a side effect, VM6 fixed the
> > deadlock. Since that work is very intrusive and did not get any
> > follow-up, it stalled.
> 
> If it works, and you commit it now, we have at least a year to shake it
> down before the 10.0 release cycle...
If this were KibBSD, the suggested course of action might indeed be
reasonable. But since our VM and VFS are community efforts, I cannot and
do not want to force other developers working in the same area to suffer
from my mistakes. In other words, this cannot go in without community
review and acceptance.

Pawel Jakub Dawidek | 5 Feb 12:43 2012

Re: Prefaulting for i/o buffers

On Fri, Feb 03, 2012 at 09:37:19PM +0200, Konstantin Belousov wrote:
> [...]
> 
> Below is a very lightweight patch whose only goal is to fix the deadlock
> in the least intrusive way. This became possible after FreeBSD gained the
> vm_fault_quick_hold_pages(9) and vm_fault_disable_pagefaults(9) KPIs:
> http://people.freebsd.org/~kib/misc/vm1.3.patch

+struct rl_q_entry *
+rlqentry_alloc()

Missing 'void'.

+static int
+rangelock_incompatible(const struct rl_q_entry *e1,
+    const struct rl_q_entry *e2)
+{
+
+	if ((e1->rl_q_flags & RL_LOCK_TYPE_MASK) == RL_LOCK_READ &&
+	    (e2->rl_q_flags & RL_LOCK_TYPE_MASK) == RL_LOCK_READ)
+		return (0);
+#define	IN_RANGE(a, e) (a >= e->rl_q_start && a < e->rl_q_end)
[...]

Konstantin Belousov | 5 Feb 17:47 2012

Re: Prefaulting for i/o buffers

On Sun, Feb 05, 2012 at 12:43:46PM +0100, Pawel Jakub Dawidek wrote:
> On Fri, Feb 03, 2012 at 09:37:19PM +0200, Konstantin Belousov wrote:
> > [...]
> 
> +struct rl_q_entry *
> +rlqentry_alloc()
> 
> Missing 'void'.
Yes, it is a style inconsistency. Fixed.

[...]

Pawel Jakub Dawidek | 5 Feb 20:20 2012

Re: Prefaulting for i/o buffers

On Sun, Feb 05, 2012 at 06:47:29PM +0200, Konstantin Belousov wrote:
> On Sun, Feb 05, 2012 at 12:43:46PM +0100, Pawel Jakub Dawidek wrote:
> > @@ -199,6 +200,7 @@ thread_init(void *mem, int size, int flags)
> >  
> >  	td->td_sleepqueue = sleepq_alloc();
> >  	td->td_turnstile = turnstile_alloc();
> > +	td->td_rlqe = rlqentry_alloc();
> > 
> > What is the purpose of td_rlqe field? From what I see thread can lock
> > more than one range. Could you elaborate on that as well?
> td_rlqe is a cached rangelock entry used for the typical case of a
> thread not doing recursive i/o. In that case, it is possible to avoid
> memory allocation on the i/o hot path by using an entry allocated at
> thread creation. Since thread creation already allocates many chunks
> of memory, and a typical thread performs far more i/o operations than
> thread creations, this shortens the i/o path.

I see. I'd be in favour of dropping the entry allocation on thread
creation. That would make the allocation lazy, done on the first I/O
only. What do you think? Thread creation should be fast, so if we can
avoid adding to it, I would go that route.

> > This is a bit hard to understand immediately. Maybe something like the
> > following is a bit more readable (I assume this goto could happen only
> > once):
> > 
> > 		len = MIN(uio->uio_iov->iov_len, io_hold_cnt * PAGE_SIZE);
> > 		addr = (vm_offset_t)uio->uio_iov->iov_base;
> > 		end = round_page(addr + len);
> > 		if (howmany(end - trunc_page(addr), PAGE_SIZE) > io_hold_cnt) {
[...]

Konstantin Belousov | 5 Feb 23:12 2012

Re: Prefaulting for i/o buffers

On Sun, Feb 05, 2012 at 08:20:45PM +0100, Pawel Jakub Dawidek wrote:
> [...]
> 
> I see. I'd be in favour of dropping the entry allocation on thread
> creation. That would make the allocation lazy, done on the first I/O
> only. What do you think? Thread creation should be fast, so if we can
> avoid adding to it, I would go that route.
I think that is a negligible change in the speed of a much rarer
operation (thread creation) compared with a hit on the first i/o, but I
have done it the way you suggested.

[...]

Pawel Jakub Dawidek | 6 Feb 09:58 2012

Re: Prefaulting for i/o buffers

On Mon, Feb 06, 2012 at 12:12:59AM +0200, Konstantin Belousov wrote:
> http://people.freebsd.org/~kib/misc/vm1.9.patch

--- a/sys/fs/nfsclient/nfs_clbio.c
+++ b/sys/fs/nfsclient/nfs_clbio.c
@@ -820,7 +820,10 @@ do_sync:
 			t_uio->uio_segflg = UIO_SYSSPACE;
 			t_uio->uio_rw = UIO_WRITE;
 			t_uio->uio_td = td;
-			bcopy(uiop->uio_iov->iov_base, t_iov->iov_base, size);
+			error = copyin(uiop->uio_iov->iov_base,
+			    t_iov->iov_base, size);
+			if (error != 0)
+				goto err_free;

Good catch. It makes me wonder, though, whether uiop will always point
at a userland buffer. What if, e.g., the AUDIT subsystem is writing from
within the kernel to a file stored on an NFS file system?

+	if (lock->rl_currdep == TAILQ_FIRST(&lock->rl_waiters) &&
+	    lock->rl_currdep != NULL)
+		lock->rl_currdep = TAILQ_NEXT(lock->rl_currdep, rl_q_link);
+	for (entry = lock->rl_currdep; entry;

Minor style inconsistency. Two lines earlier you compare the pointer
with NULL, which is nice, but here you treat the pointer as a boolean.

+void
+rangelock_unlock(struct rangelock *lock, void *cookie)
+{
[...]

