David Holland | 1 Mar 2008 08:01

Re: Replace lockmgr for vnodes

On Thu, Feb 28, 2008 at 10:44:14PM -0800, Jason Thorpe wrote:
 > On Feb 28, 2008, at 10:15 PM, David Holland wrote:
 > >I am still not convinced of this, for two reasons: first, I don't
 > >think layers are going to work unless locks are exported and shared;
 > >and second, if every fs does its own locking, it gives every fs the
 > >opportunity to do it wrong, and there'll furthermore tend to be a lot
 > >of cut&paste code with all the attendant problems.
 > 
 > The reason file systems can get it so wrong is because of the nutty  
 > rules that we currently have.

Yes and no. What you say is perfectly true, and the mess that
currently exists should not be allowed to continue.

That said, by "get it wrong" I mean things like ufs_rename(), or
worse, msdosfs_rename(), rename being particularly difficult to get
right - things where the per-fs code gets locks in the wrong order, or
gets the wrong locks, or doesn't bother getting them at all, or drops
things on the floor when it's halfway done, or whatever other creative
lossage someone manages to invent.

If the locking is provided in fs-independent code, then it only needs
to be debugged once.
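
To make the "wrong order" failure concrete, here is a minimal sketch of one
common way shared code can acquire two locks safely: always take them in a
fixed global order. This is an illustration only, not code from the NetBSD
tree. struct node, node_lock_pair() and node_unlock_pair() are hypothetical
names, pthread mutexes stand in for vnode locks, and ordering by address is
an assumed technique; a real rename also has to respect directory ancestry.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-file object; not a NetBSD structure. */
struct node {
	pthread_mutex_t lock;
};

/*
 * Lock two nodes in a fixed global order (lowest address first), so two
 * threads locking the same pair can never deadlock against each other.
 * If both pointers name the same node, it is locked only once.
 */
static void
node_lock_pair(struct node *a, struct node *b)
{
	if (a == b) {
		pthread_mutex_lock(&a->lock);
		return;
	}
	if ((uintptr_t)a > (uintptr_t)b) {
		struct node *tmp = a;
		a = b;
		b = tmp;
	}
	pthread_mutex_lock(&a->lock);
	pthread_mutex_lock(&b->lock);
}

/* Release both locks; unlock order does not matter for correctness. */
static void
node_unlock_pair(struct node *a, struct node *b)
{
	pthread_mutex_unlock(&a->lock);
	if (a != b)
		pthread_mutex_unlock(&b->lock);
}
```

Callers pass the two nodes in whatever order they have them; the helper
decides the acquisition order, so that decision lives in exactly one place.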

 > If vnodes are never locked when descending into a vnode op, and vnodes  
 > are never locked when returning from a vnode op, and vnodes are  
 > manipulated by accessors / mutators from within vnode ops, then there  
 > is no need to have a vnode locking protocol at all as part of the
 > VFS<->file system interface.  It simply becomes the responsibility of
 > underlying file systems to lock their own data structures as necessary  

Antti Kantee | 1 Mar 2008 13:15

Re: Replace lockmgr for vnodes

On Sat Mar 01 2008 at 07:01:53 +0000, David Holland wrote:
> On Thu, Feb 28, 2008 at 10:44:14PM -0800, Jason Thorpe wrote:
>  > On Feb 28, 2008, at 10:15 PM, David Holland wrote:
>  > >I am still not convinced of this, for two reasons: first, I don't
>  > >think layers are going to work unless locks are exported and shared;
>  > >and second, if every fs does its own locking, it gives every fs the
>  > >opportunity to do it wrong, and there'll furthermore tend to be a lot
>  > >of cut&paste code with all the attendant problems.
>  > 
>  > The reason file systems can get it so wrong is because of the nutty  
>  > rules that we currently have.

I don't buy the "nutty rules" argument.  The only really troublesome
part was lookup, and that was whacked into submission by chs.  You can
generally divide the ops into a few categories: lookup, creators,
removers, rename (and link), symmetric locking, inactive.  And of those
inactive could be made symmetric now that it doesn't try to recycle itself
(and I even have a diff for that somewhere).

The only real fun part left is namei flags and . insanity.

> Yes and no. What you say is perfectly true, and the mess that
> currently exists should not be allowed to continue.
> 
> That said, by "get it wrong" I mean things like ufs_rename(), or
> worse, msdosfs_rename(), rename being particularly difficult to get
> right - things where the per-fs code gets locks in the wrong order, or
> gets the wrong locks, or doesn't bother getting them at all, or drops
> things on the floor when it's halfway done, or whatever other creative
> lossage someone manages to invent.

Joerg Sonnenberger | 1 Mar 2008 15:27

Re: Unifying /dev/{mem,kmem,zero,null} implementations

On Sun, Feb 24, 2008 at 11:56:52PM +0000, Chris Gilbert wrote:
> I'd have expected pmap_update to happen soon after the pmap_* call, 
> generally within the calling function.  The same as all other uses of 
> pmap_update.

Based on this and the related discussion, I've updated the patch to use
pmap_enter instead of pmap_kenter_pa. Any other comments? Any test
results?

Joerg

Martin Husemann | 1 Mar 2008 19:40

Re: Unifying /dev/{mem,kmem,zero,null} implementations

On Sat, Mar 01, 2008 at 03:27:46PM +0100, Joerg Sonnenberger wrote:
> Based on this and the related discussion, I've updated the patch to use
> pmap_enter instead of pmap_kenter_pa. Any other comments? Any test
> results?

With one minor nit (sparc64 needs #ifndef SUN4U around __HAVE_MM_MD_READWRITE)
this compiled and I tested a few archs, by running a very limited test:

  vmstat -N /netbsd

sparc, sparc64 and mac68k seem to work.

Not working:

alpha:		vmstat: kptr fffffc00005bbe00: boottime: kvm_read: Bad address
prep:		vmstat: kptr 2bc9b0: boottime: kvm_read: Bad address
sgimips:	vmstat: kptr 8031888c: boottime: kvm_read: Bad address

If you have any other tests, let me know and I'll run them.

Martin

Simon Burge | 2 Mar 2008 13:05

Patches for journalling support

Hi folks,

Wasabi Systems Inc is pleased to make our journalling code available
to the NetBSD community.  This code is known as WAPBL - Write Ahead
Physical Block Logging, and has been used to provide meta-data
journalling in production environments for over 4 years.  WAPBL
journals meta-data only - not file data, and has been used on
filesystems ranging from 16MB to multiple terabytes in size.

WAPBL was originally written by Darrin B. Jewell while at Wasabi.
Many thanks to Antti Kantee, Andy Doran and Greg Oster for their work
in adapting our journalling code to NetBSD -current (our contribution
was based on the netbsd-4 branch) and testing.

WAPBL is enabled in the kernel with:

	options WAPBL

Currently WAPBL uses space after the filesystem (but before the
partition's end) for its journal.  The suggested method is to use:

	newfs -s -64m <other options> /dev/<foo>

to leave 64MB of space for the journal.  Obviously this will destroy
any existing data in the filesystem on /dev/<foo> and you should, of
course, back up any data on this filesystem first.  The journal size
itself is still the subject of some research.  Solaris uses "10% of
filesystem size to a maximum of 64MB", and at Wasabi we've always used
a 64MB journal on larger filesystems.
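
As a worked example of the Solaris rule quoted above, the following sketch
computes "10% of filesystem size to a maximum of 64MB". It is not taken
from WAPBL, newfs or Solaris; journal_size_bytes() is a hypothetical name,
and it assumes 1MB means 2^20 bytes.

```c
#include <stdint.h>

#define MiB (UINT64_C(1024) * 1024)

/* Rule quoted above for Solaris: 10% of the filesystem size, at most 64MB. */
static uint64_t
journal_size_bytes(uint64_t fs_size_bytes)
{
	uint64_t size = fs_size_bytes / 10;	/* 10%, rounded down */
	uint64_t cap = 64 * MiB;

	return size < cap ? size : cap;
}
```

Under this rule the cap is reached at a 640MB filesystem, so every larger
filesystem gets a 64MB journal.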


Simon Burge | 2 Mar 2008 13:43

Re: Patches for journalling support

Simon Burge wrote:

> There are three known issues:
> 
>  - "mount -u -o log" doesn't work yet, so you can't have logging enabled
>    on the root filesystem. "mount -u -o log" is disabled in this patch
>    (see around line 222 of sys/kern/vfs_syscalls.c).

Related to this is also a problem when the device node for a
logged filesystem isn't on a ffs filesystem (for example if
you have a tmpfs /dev).  PR kern/38057 has more details.  For
now, you need to use an ffs /dev to use journalling.

Cheers,
Simon.

Simon Burge | 2 Mar 2008 15:04

Re: Unifying /dev/{mem,kmem,zero,null} implementations

Martin Husemann wrote:
> On Sat, Mar 01, 2008 at 03:27:46PM +0100, Joerg Sonnenberger wrote:
> > Based on this and the related discussion, I've updated the patch to use
> > pmap_enter instead of pmap_kenter_pa. Any other comments? Any test
> > results?
> 
> [ ... ]
> 
> Not working:
> 
> [ ... ]
> sgimips:	vmstat: kptr 8031888c: boottime: kvm_read: Bad address

This works on sbmips if I add __HAVE_MM_MD_KERNACC to <mips/types.h>.

Cheers,
Simon.

Michael Lorenz | 2 Mar 2008 15:23

Re: Patches for journalling support


Hello,

On Mar 2, 2008, at 07:05, Simon Burge wrote:

> Currently WAPBL uses space after the filesystem (but before the
> partition's end) for its journal.  The suggested method is to use:
>
> 	newfs -s -64m <other options> /dev/<foo>
>
> to leave 64MB of space for the journal.  Obviously this will destroy
> any existing data in the filesystem on /dev/<foo> and you should, of
> course, back up any data on this filesystem first.  The journal size
> itself is still the subject of some research.  Solaris uses "10% of
> filesystem size to a maximum of 64MB", and at Wasabi we've always used
> a 64MB journal on larger filesystems.

Hmm, I remember AIX using a 16MB partition for journalling data from  
a whole volume group - several filesystems spread out on several  
disks, in my case about 80GB total. And it was a 16MB partition  
because that's the smallest unit the volume manager could assign to a  
logical volume. So, is the 64MB figure based on experience, design or  
just a random number that happens to work well?

have fun
Michael

Michael Lorenz | 2 Mar 2008 15:26

Re: Patches for journalling support


Hello,

On Mar 2, 2008, at 07:43, Simon Burge wrote:

> Simon Burge wrote:
>
>> There are three known issues:
>>
>>  - "mount -u -o log" doesn't work yet, so you can't have logging  
>> enabled
>>    on the root filesystem. "mount -u -o log" is disabled in this  
>> patch
>>    (see around line 222 of sys/kern/vfs_syscalls.c).
>
> Related to this is also a problem when the device node for a
> logged filesystem isn't on a ffs filesystem (for example if
> you have a tmpfs /dev).  PR kern/38057 has more details.  For
> now, you need to use an ffs /dev to use journalling.

Any idea if APPLE_UFS would do? We treat it as ffs from outer space
after all.
( I mean holding the device nodes - I don't think I can reasonably  
expect journalling to work on APPLE_UFS filesystems at all )

have fun
Michael

Tobias Nygren | 2 Mar 2008 21:08

Re: Patches for journalling support

Here are some results from the ubiquitous "untar pkgsrc" benchmark.
Beats softdep already, nice!

softdep, noatime:

root@x40:work> time sh -c "tar -xzf /pkgsrc.tar.gz; sync"
  132.06s real     3.48s user    12.48s system
root@x40:work> time sh -c "rm -rf pkgsrc; sync"
   28.11s real     0.21s user     9.33s system

log, noatime:

root@x40:work> time sh -c "tar -xzf /pkgsrc.tar.gz; sync"
  120.51s real     3.33s user    13.12s system
root@x40:work> time sh -c "rm -rf pkgsrc; sync"
   21.41s real     0.19s user     3.16s system
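
For reference, the wall-clock differences above work out to roughly 8.7%
less time for the untar and 23.8% less for the rm. The helper below shows
the arithmetic; percent_reduction() is a hypothetical name, and these are
single runs on one machine, so treat the percentages as indicative only.

```c
/* Percentage by which `after` is lower than `before`. */
static double
percent_reduction(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

Usage: percent_reduction(132.06, 120.51) for the untar and
percent_reduction(28.11, 21.41) for the rm, with softdep as "before" and
log as "after".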

