David Holland | 1 Jun 2010 03:54

Re: Layered fs, vnode locking and v_vnlock removal

On Sun, May 23, 2010 at 04:11:25PM +0200, Juergen Hannken-Illjes wrote:
 > With our current vnode lock implementation VOP_LOCK() and VOP_UNLOCK()
 > are not symmetric.  A vnode may be locked from one file system and
 > unlocked from another one.  Is there any reason left to have layered
 > file systems share the vnode lock with lower file systems via v_vnlock?
 > 
 > The attached patch will remove v_vnlock and change layered file systems
 > to always pass the locking VOP's down to the leaf file system.

I'm not sure. This is fine for things like umapfs and nullfs where
there's a 1-1 mapping between upper and lower vnodes; but what if
there isn't? Consider for example a versioning FS layer that stores
versions as diffs, or a layer that turns /var/mail into maildirs. It's
not immediately clear how either of these ought to work, so I'm
concerned that making the infrastructure less general will lead to
problems.

In the long run I intend to make all the vnode ops symmetric with
respect to locking, which should make a lot of this less toxic, but at
the rate I've been able to work on this stuff we won't be there
anytime soon.

-- 
David A. Holland
dholland <at> netbsd.org

Juergen Hannken-Illjes | 1 Jun 2010 11:44

Re: Layered fs, vnode locking and v_vnlock removal

On Tue, Jun 01, 2010 at 01:54:53AM +0000, David Holland wrote:
> On Sun, May 23, 2010 at 04:11:25PM +0200, Juergen Hannken-Illjes wrote:
>  > With our current vnode lock implementation VOP_LOCK() and VOP_UNLOCK()
>  > are not symmetric.  A vnode may be locked from one file system and
>  > unlocked from another one.  Is there any reason left to have layered
>  > file systems share the vnode lock with lower file systems via v_vnlock?
>  > 
>  > The attached patch will remove v_vnlock and change layered file systems
>  > to always pass the locking VOP's down to the leaf file system.
> 
> I'm not sure. This is fine for things like umapfs and nullfs where
> there's a 1-1 mapping between upper and lower vnodes; but what if
> there isn't? Consider for example a versioning FS layer that stores
> versions as diffs, or a layer that turns /var/mail into maildirs. It's
> not immediately clear how either of these ought to work, so I'm
> concerned that making the infrastructure less general will lead to
> problems.

1) One upper to many lower vnodes
   This is a file system like unionfs.  It has to lock either one or
   many lower vnodes and gains nothing from shared locks.

2) Many upper to one lower vnode
   Such a layered file system could use a lock shared between ALL
   upper and the lower vnode.  Always taking the lower vnode's lock
   will do the same.  I see no need for shared locks here.
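
As an illustration of case 2 (not code from the patch), here is a
minimal toy model in C of a layer that keeps no lock of its own and
always passes lock requests down to the leaf vnode.  The names
toy_vnode, toy_lock() and friends are hypothetical, and the integer
lock_depth merely stands in for the real vnode lock.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these are NOT the NetBSD structures or VOP signatures. */
struct toy_vnode {
	struct toy_vnode *lower;	/* NULL for a leaf file system's vnode */
	int lock_depth;			/* stands in for the real vnode lock */
};

/* Walk down the layer stack to the vnode that actually owns the lock. */
static struct toy_vnode *
toy_leaf(struct toy_vnode *vp)
{
	while (vp->lower != NULL)
		vp = vp->lower;
	return vp;
}

/* Layered "lock" op: no lock of its own, just pass the request down. */
static void
toy_lock(struct toy_vnode *vp)
{
	toy_leaf(vp)->lock_depth++;
}

/* Layered "unlock" op: releases the same leaf lock, whoever took it. */
static void
toy_unlock(struct toy_vnode *vp)
{
	struct toy_vnode *leaf = toy_leaf(vp);

	assert(leaf->lock_depth > 0);
	leaf->lock_depth--;
}

/* True if the leaf under this vnode is currently locked. */
static int
toy_islocked(struct toy_vnode *vp)
{
	return toy_leaf(vp)->lock_depth > 0;
}
```

In this model, locking through any upper vnode and unlocking through
the lower one (or the reverse) operate on the same lock, so several
upper vnodes over one lower vnode are serialized without a separate
shared lock pointer.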

> In the long run I intend to make all the vnode ops symmetric with
> respect to locking, which should make a lot of this less toxic, but at
> the rate I've been able to work on this stuff we won't be there

der Mouse | 1 Jun 2010 14:31

Re: WAPBL and IDE mac68k

> It happens even when I try to boot to single user mode because I see
> the message saying "/: replaying log to memory" right before it
> panics.  Not sure why the journaling stuff happens when booting in
> single user mode without mounting any filesystems, but that's what it
> is.

You can't boot without mounting any filesystems at all; / must be
mounted, at a minimum.

> When I moved the drive to the same machine's SCSI bus, it works fine
> with any kernel, so this is specific to the IDE bus of the Quadra
> 630-type machines.

What kind of disk is it that can sit on either SCSI or IDE?  Or do you
have some kind of adapter in front of it in one case?  Could the
adapter be related?

> (1) How does one start up in single user mode WITHOUT filesystems
> getting read?

One doesn't.  You need to get / from _somewhere_.  Perhaps for your
purposes it might be a good idea to boot with / on NFS?  I don't recall
enough of mac68k to say how, but most ports have _some_ way to specify
"prompt me for root &c" when booting.

> (2) Who knows enough about WAPBL and IDE busses to guess where to
> look for a possible solution to this problem?

Not me.


Antti Kantee | 1 Jun 2010 15:03

uvm & percpu

While reading the uvm page allocator code, I noticed it tries to allocate
from percpu storage before falling back to global storage.  However, even
if allocation from local storage was possible, a global stats counter is
incremented (e.g. "uvmexp.cpuhit++").  In my measurements I've observed
this type of "cheap" statcounting has a huge impact on percpu algorithms,
as you still need to load&store a globally contended memory address.
Furthermore, uvmexp cache lines are probably more contended than the page
queue, so theoretically you get less than half of the possible benefit.

I don't expect anyone to remember what the benchmark used to justify
the original percpu commit was, but if someone is going to work on it
further, I'm curious as to how much gain the percpu allocator produced
and how much more it would squeeze out if the global counter was left out.

The above example of course applies more generally.  When you're going
all out with the bag of tricks, "i++" can be very expensive ...

Sad Clouds | 1 Jun 2010 16:29

Re: uvm & percpu

On Tue, 1 Jun 2010 16:03:19 +0300
Antti Kantee <pooka <at> cs.hut.fi> wrote:

> The above example of course applies more generally.  When you're going
> all out with the bag of tricks, "i++" can be very expensive ...

A better way would be for each CPU to have its own counter, which is
incremented until it reaches a certain threshold. At the threshold, the
global counter is incremented by the local counter.

It all depends on the logic of what the counter is supposed to do. But
it doesn't have to be very expensive, i.e. if it's just a statistics
counter, you can do the above trick to reduce contention.
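
The threshold trick described above might look roughly like the
following C sketch.  It is only an illustration under assumptions:
FLUSH_THRESHOLD, struct local_counter and the function names are
hypothetical, C11 atomics stand in for whatever the kernel would use,
and each local counter is assumed to be touched by its owning CPU only.

```c
#include <assert.h>
#include <stdatomic.h>

#define FLUSH_THRESHOLD 64	/* hypothetical batch size */

static atomic_ulong global_hits;	/* shared, contended counter */

/* One per CPU (or per thread); only its owner ever touches it. */
struct local_counter {
	unsigned long pending;
};

/* Count one event; touch the shared counter only once per batch. */
static void
counter_inc(struct local_counter *lc)
{
	/* Invariant: a flush always leaves pending below the threshold. */
	assert(lc->pending < FLUSH_THRESHOLD);

	if (++lc->pending >= FLUSH_THRESHOLD) {
		atomic_fetch_add_explicit(&global_hits, lc->pending,
		    memory_order_relaxed);
		lc->pending = 0;
	}
}

/* Push any remainder to the global counter, e.g. before reading stats. */
static void
counter_flush(struct local_counter *lc)
{
	atomic_fetch_add_explicit(&global_hits, lc->pending,
	    memory_order_relaxed);
	lc->pending = 0;
}

/* Read the global value; it may lag by up to one batch per CPU. */
static unsigned long
counter_read_global(void)
{
	return atomic_load_explicit(&global_hits, memory_order_relaxed);
}
```

The trade-off is accuracy for contention: the global value can be
behind by up to FLUSH_THRESHOLD - 1 events per CPU until a flush, which
is usually acceptable for a pure statistics counter.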

Andrew Doran | 1 Jun 2010 17:16

Re: uvm & percpu

On Tue, Jun 01, 2010 at 04:03:19PM +0300, Antti Kantee wrote:
> While reading the uvm page allocator code, I noticed it tries to allocate
> from percpu storage before falling back to global storage.  However, even
> if allocation from local storage was possible, a global stats counter is
> incremented (e.g. "uvmexp.cpuhit++").  In my measurements I've observed
> this type of "cheap" statcounting has a huge impact on percpu algorithms,
> as you still need to load&store a globally contended memory address.
> Furthermore, uvmexp cache lines are probably more contended than the page
> queue, so theoretically you get less than half of the possible benefit.
> 
> I don't expect anyone to remember what the benchmark used to justify
> the original percpu commit was, but if someone is going to work on it
> further, I'm curious as to how much gain the percpu allocator produced
> and how much more it would squeeze out if the global counter was left out.
> 
> The above example of course applies more generally.  When you're going
> all out with the bag of tricks, "i++" can be very expensive ...

I ran into the same issue with timecounters and there it was a huge
overhead - disabling the counter showed great performance improvements
where the timecounter hardware was largely parallel and light on
memory access (CPU timestamp counter).

With the UVM allocator it's less of an overhead since the allocator is
shielded by uvm_fpageqlock and often uvm_pageqlock (both globals),
and the data structures are not intentionally organised/optimized 
with MP cache behaviour in mind.  This is an area that definitely needs
improvement, and there's good potential there.  Consider NUMA,
or as a middle ground multiple sets of pagedaemon/allocator state that
correspond to cache units (cores, chips, whatever) or execution units

Matthew Mondor | 1 Jun 2010 17:26

Re: uvm & percpu

On Tue, 1 Jun 2010 15:29:05 +0100
Sad Clouds <cryintothebluesky <at> googlemail.com> wrote:

> A better way would be for cpu to have its own counter, which is
> incremented until it reaches a certain threshold. At the threshold, the
> global counter is incremented by the local counter.
> 
> It all depends on the logic of what the counter is supposed to do. But
> it doesn't have to be very expensive, i.e. if it's just a statistics
> counter, you can do the above trick to reduce contention.

Another way would be for the cpu+global values to only be assembled
when statistics are requested, unless there's active code taking that
global counter into consideration frequently.
-- 
Matt
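
A minimal C sketch of the "assemble only when requested" idea from the
message above, for illustration only.  TOY_NCPU, TOY_CACHE_LINE and the
function names are assumptions, not NetBSD interfaces, and the reader
is assumed to tolerate a slightly stale total.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NCPU	4	/* hypothetical CPU count */
#define TOY_CACHE_LINE	64	/* assumed cache line size in bytes */

/* Align each slot to its own cache line so CPUs do not false-share. */
struct percpu_slot {
	_Alignas(TOY_CACHE_LINE) unsigned long count;
};

static struct percpu_slot hits[TOY_NCPU];

/* Hot path: a plain increment of memory only this CPU writes. */
static void
hit_inc(unsigned cpu)
{
	assert(cpu < TOY_NCPU);
	hits[cpu].count++;
}

/* Cold path: assemble the total only when statistics are requested. */
static unsigned long
hit_total(void)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < TOY_NCPU; i++)
		sum += hits[i].count;
	return sum;
}
```

Here the hot path never writes shared memory at all; the cost moves to
the reader, which walks every slot.  Real code would also need a
defined way to read another CPU's slot safely (for example a relaxed
atomic load), which this sketch leaves out.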

Andrew Doran | 1 Jun 2010 17:47

Re: uvm & percpu

On Tue, Jun 01, 2010 at 11:26:40AM -0400, Matthew Mondor wrote:
> On Tue, 1 Jun 2010 15:29:05 +0100
> Sad Clouds <cryintothebluesky <at> googlemail.com> wrote:
> 
> > A better way would be for cpu to have its own counter, which is
> > incremented until it reaches a certain threshold. At the threshold, the
> > global counter is incremented by the local counter.
> > 
> > It all depends on the logic of what the counter is supposed to do. But
> > it doesn't have to be very expensive, i.e. if it's just a statistics
> > counter, you can do the above trick to reduce contention.
> 
> Another way would be for the cpu+global values to only be assembled
> when statistics are requested, unless there's active code taking that
> global counter into consideration frequently.

One general solution would be to enhance event counters (evcnt) to use
the percpu allocator.  A simple * way to access this from assembly language,
and a way for it to work when percpu isn't available (bootstrap) would
be necessary.

* when I say simple here, I mean "just increment the global counter"
because it's too difficult to do fancy stuff in assembly sometimes
and the performance impact may not be of concern. It really depends on
the application.
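
To make the two requirements above concrete, here is a hedged toy model
in C of an event counter that uses per-CPU slots once they exist and
falls back to a single global field before that (bootstrap) or for
callers that just want the simple path.  Everything here (toy_evcnt,
TOY_NCPU, the function names) is hypothetical and is not the evcnt(9)
or percpu(9) interface.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NCPU 4	/* hypothetical CPU count */

struct toy_evcnt {
	unsigned long boot_count;	/* global fallback counter */
	unsigned long *percpu;		/* TOY_NCPU slots, NULL until attached */
};

/* Normal path: per-CPU slot if available, else the global fallback. */
static void
toy_evcnt_incr(struct toy_evcnt *ev, unsigned cpu)
{
	assert(cpu < TOY_NCPU);
	if (ev->percpu != NULL)
		ev->percpu[cpu]++;
	else
		ev->boot_count++;
}

/* "Simple" path, e.g. for assembly: just bump the global counter. */
static void
toy_evcnt_incr_simple(struct toy_evcnt *ev)
{
	ev->boot_count++;
}

/* Called once per-CPU storage becomes available after bootstrap. */
static void
toy_evcnt_attach_percpu(struct toy_evcnt *ev, unsigned long *slots)
{
	ev->percpu = slots;
}

/* Reader sums the global fallback and all per-CPU slots. */
static unsigned long
toy_evcnt_read(const struct toy_evcnt *ev)
{
	unsigned long sum = ev->boot_count;

	if (ev->percpu != NULL)
		for (size_t i = 0; i < TOY_NCPU; i++)
			sum += ev->percpu[i];
	return sum;
}
```

Because the reader always adds the global field to the per-CPU slots,
counts taken during bootstrap or through the simple path are not lost
when per-CPU storage is attached later.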

Adam Hamsik | 2 Jun 2010 00:54

Re: Layered fs, vnode locking and v_vnlock removal


> 
>> In the long run I intend to make all the vnode ops symmetric with
>> respect to locking, which should make a lot of this less toxic, but at
>> the rate I've been able to work on this stuff we won't be there
>> anytime soon.
> 
> The asymmetry comes from functions like null_mount() where a vnode gets
> locked by the lower layer and unlocked by the upper layer.  A lower
> layer expecting its VOP_LOCK() to be matched by a VOP_UNLOCK() will
> fail badly.
> 
> In the long term VOP_xxxLOCK() should become part of the file systems.

AFAIK there is a consensus between yamt@, ad@ and thorpej@ that locking
should be moved down to the filesystems. There was some discussion about
it here some time ago.

Regards

Adam.

Hiroyuki Bessho | 2 Jun 2010 16:59

patch for ataraid (Re: ataraid(4) problems)

Hi,

  I put a patch for the ataraid driver at
     ftp://ftp.netbsd.org/pub/NetBSD/misc/bsh/ataraid.patch

  I'd like to get this patch reviewed.  I have tested this patch only
on my motherboard with Intel Matrix RAID.  I'd appreciate it if
someone could test the patch for other ataraid flavors.

  I'm going to commit this patch along with the patch provided by Juan
in kern/38273.
 (http://mail-index.netbsd.org/tech-kern/2008/09/17/msg002734.html)

  Two more problems I found since my last mail were:

  4) Matrix RAID supports multiple arrays (sets of disks), and each
     array can have multiple volumes in it.  ataraid(4) assumes only
     one array and can't handle multiple arrays correctly.

  5) bioctl(8) expects the interleave value in KiB, while ld_ataraid
     reports it in block counts.
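
For problem 5, the unit mismatch comes down to a conversion like the
following C sketch.  It is an illustration, not code from the patch:
the function name is hypothetical, and the 512-byte block size is an
assumption that the real driver would have to take from the disk.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SECTOR_SIZE 512u	/* assumed block size in bytes */

/* The conversion below relies on whole blocks per KiB boundary. */
static_assert(1024u % TOY_SECTOR_SIZE == 0,
    "block size must divide 1024 evenly");

/* Convert an interleave given in blocks to KiB (rounds down). */
static uint32_t
interleave_blocks_to_kib(uint32_t blocks)
{
	return (uint32_t)(((uint64_t)blocks * TOY_SECTOR_SIZE) / 1024u);
}
```

With 512-byte blocks, an interleave of 128 blocks corresponds to 64 KiB,
so reporting the raw block count would overstate the value by a factor
of two.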

  And there is one issue with my patch. It changes how the interleave
value is handled for Matrix RAID, so existing NetBSD file systems on
RAID0 volumes can no longer be accessed. The change was necessary to
make it possible to mount file systems made by other OSes (such as
Windows) on RAID0 volumes.

  As ataraid(4) has been broken for years, I don't think there are
many users who have NetBSD partitions on RAID0 volumes on Intel Matrix

